linux-kernel.vger.kernel.org archive mirror
* [patch 00/15] Tracepoints v3 for linux-next
@ 2008-07-09 14:59 Mathieu Desnoyers
  2008-07-09 14:59 ` [patch 01/15] Kernel Tracepoints Mathieu Desnoyers
                   ` (15 more replies)
  0 siblings, 16 replies; 58+ messages in thread
From: Mathieu Desnoyers @ 2008-07-09 14:59 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel

Hi,

This is the 3rd round of the tracepoints patch submission. The instrumentation
sites most likely to reach quick agreement have been selected for this first
step. The tracepoint infrastructure, heavily inspired by the kernel
markers, seems pretty solid and went through a thorough review by Masami.

This patchset applies over patch-v2.6.26-rc9-next-20080709.

Mathieu

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 58+ messages in thread

* [patch 01/15] Kernel Tracepoints
  2008-07-09 14:59 [patch 00/15] Tracepoints v3 for linux-next Mathieu Desnoyers
@ 2008-07-09 14:59 ` Mathieu Desnoyers
  2008-07-15  7:50   ` Peter Zijlstra
  2008-07-09 14:59 ` [patch 02/15] Tracepoints Documentation Mathieu Desnoyers
                   ` (14 subsequent siblings)
  15 siblings, 1 reply; 58+ messages in thread
From: Mathieu Desnoyers @ 2008-07-09 14:59 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel
  Cc: Mathieu Desnoyers, Masami Hiramatsu, Peter Zijlstra,
	Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie, Steven Rostedt,
	Alexander Viro, Eduard - Gabriel Munteanu

[-- Attachment #1: tracepoints.patch --]
[-- Type: text/plain, Size: 31800 bytes --]

Implementation of kernel tracepoints, inspired by the Linux Kernel Markers.
Tracepoints allow complete type verification and require no format string. See
the tracepoint Documentation and Samples patches for usage examples.

Changelog :
- Use #name ":" #proto as the string identifying the tracepoint in the
  tracepoint table. This ensures that no type mismatch can occur through
  connection of a probe with the wrong type to a tracepoint declared with
  the same name in a different header.
- Add tracepoint_entry_free_old.

Masami Hiramatsu <mhiramat@redhat.com> :
Tested on x86-64.

Performance impact of a tracepoint: the same as markers, except that it adds
about 70 bytes of instructions in an unlikely branch of each instrumented
function (the for loop, the stack setup and the function call). It currently
adds a memory read, a test and a conditional branch at the instrumentation site
(in the hot path). Immediate values will eventually change this into a load
immediate, test and branch, which removes the memory read and makes the i-cache
impact smaller (replacing the memory read with a load immediate saves 3-4 bytes
per site on x86_32, depending on mov prefixes, or 7-8 bytes on x86_64; it also
saves the d-cache hit).

Regarding the performance impact of tracepoints (which is comparable to that of
markers), even without the immediate values optimization, tests done by Hideo
Aoki on ia64 show no regression. His test case used hackbench on a kernel where
scheduler instrumentation (about 5 events in the scheduler code) was added.


Quoting Hideo Aoki about Markers :

I evaluated overhead of kernel marker using linux-2.6-sched-fixes
git tree, which includes several markers for LTTng, using an ia64
server.

While the immediate trace mark feature isn't implemented on ia64, there is no
major performance regression. So, I think we have no performance-impact
issues blocking a proposal to merge the marker point patches into Linus's
tree.

I prepared two kernels to evaluate. The first one was compiled without
CONFIG_MARKERS; the second one had CONFIG_MARKERS enabled.

I downloaded the original hackbench from the following URL:
http://devresources.linux-foundation.org/craiger/hackbench/src/hackbench.c

I ran hackbench 5 times in each condition and calculated the
average and difference between the kernels.  

    The parameter of hackbench: every 50 from 50 to 800
    The number of CPUs of the server: 2, 4, and 8
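The run matrix above amounts to roughly the following driver loop (a hedged
reconstruction — the actual invocation and averaging script were not posted, so
this only prints the commands that would be run):

```shell
#!/bin/sh
# Group counts used by the sweep: every 50 from 50 to 800 (16 points),
# with 5 runs per point for averaging.
params=$(seq 50 50 800)
nparams=$(echo "$params" | wc -l)

for n in $params; do
    for run in 1 2 3 4 5; do
        echo "./hackbench $n    # run $run of 5"
    done
done
```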

Below are the results. As you can see, no major performance regression was
found in any case. Even as the number of processes increases, the difference
between the marker-enabled and marker-disabled kernels does not grow.
Moreover, the differences do not grow as the number of CPUs increases either.

Curiously, the marker-enabled kernel is faster than the marker-disabled kernel
in more than half of the cases; I guess this comes from differences in memory
access patterns.


* 2 CPUs 

Number of | without      | with         | diff     | diff    |
processes | Marker [Sec] | Marker [Sec] |   [Sec]  |   [%]   |
--------------------------------------------------------------
       50 |      4.811   |       4.872  |  +0.061  |  +1.27  |
      100 |      9.854   |      10.309  |  +0.454  |  +4.61  |
      150 |     15.602   |      15.040  |  -0.562  |  -3.6   |
      200 |     20.489   |      20.380  |  -0.109  |  -0.53  |
      250 |     25.798   |      25.652  |  -0.146  |  -0.56  |
      300 |     31.260   |      30.797  |  -0.463  |  -1.48  |
      350 |     36.121   |      35.770  |  -0.351  |  -0.97  |
      400 |     42.288   |      42.102  |  -0.186  |  -0.44  |
      450 |     47.778   |      47.253  |  -0.526  |  -1.1   |
      500 |     51.953   |      52.278  |  +0.325  |  +0.63  |
      550 |     58.401   |      57.700  |  -0.701  |  -1.2   | 
      600 |     63.334   |      63.222  |  -0.112  |  -0.18  |
      650 |     68.816   |      68.511  |  -0.306  |  -0.44  |
      700 |     74.667   |      74.088  |  -0.579  |  -0.78  |
      750 |     78.612   |      79.582  |  +0.970  |  +1.23  |
      800 |     85.431   |      85.263  |  -0.168  |  -0.2   |
--------------------------------------------------------------

* 4 CPUs 

Number of | without      | with         | diff     | diff    |
processes | Marker [Sec] | Marker [Sec] |   [Sec]  |   [%]   |
--------------------------------------------------------------
       50 |      2.586   |       2.584  |  -0.003  |  -0.1   |
      100 |      5.254   |       5.283  |  +0.030  |  +0.56  |
      150 |      8.012   |       8.074  |  +0.061  |  +0.76  |
      200 |     11.172   |      11.000  |  -0.172  |  -1.54  |
      250 |     13.917   |      14.036  |  +0.119  |  +0.86  |
      300 |     16.905   |      16.543  |  -0.362  |  -2.14  |
      350 |     19.901   |      20.036  |  +0.135  |  +0.68  |
      400 |     22.908   |      23.094  |  +0.186  |  +0.81  |
      450 |     26.273   |      26.101  |  -0.172  |  -0.66  |
      500 |     29.554   |      29.092  |  -0.461  |  -1.56  |
      550 |     32.377   |      32.274  |  -0.103  |  -0.32  |
      600 |     35.855   |      35.322  |  -0.533  |  -1.49  |
      650 |     39.192   |      38.388  |  -0.804  |  -2.05  |
      700 |     41.744   |      41.719  |  -0.025  |  -0.06  |
      750 |     45.016   |      44.496  |  -0.520  |  -1.16  |
      800 |     48.212   |      47.603  |  -0.609  |  -1.26  |
--------------------------------------------------------------

* 8 CPUs 

Number of | without      | with         | diff     | diff    |
processes | Marker [Sec] | Marker [Sec] |   [Sec]  |   [%]   |
--------------------------------------------------------------
       50 |      2.094   |       2.072  |  -0.022  |  -1.07  |
      100 |      4.162   |       4.273  |  +0.111  |  +2.66  |
      150 |      6.485   |       6.540  |  +0.055  |  +0.84  |
      200 |      8.556   |       8.478  |  -0.078  |  -0.91  |
      250 |     10.458   |      10.258  |  -0.200  |  -1.91  |
      300 |     12.425   |      12.750  |  +0.325  |  +2.62  |
      350 |     14.807   |      14.839  |  +0.032  |  +0.22  |
      400 |     16.801   |      16.959  |  +0.158  |  +0.94  |
      450 |     19.478   |      19.009  |  -0.470  |  -2.41  |
      500 |     21.296   |      21.504  |  +0.208  |  +0.98  |
      550 |     23.842   |      23.979  |  +0.137  |  +0.57  |
      600 |     26.309   |      26.111  |  -0.198  |  -0.75  |
      650 |     28.705   |      28.446  |  -0.259  |  -0.9   |
      700 |     31.233   |      31.394  |  +0.161  |  +0.52  |
      750 |     34.064   |      33.720  |  -0.344  |  -1.01  |
      800 |     36.320   |      36.114  |  -0.206  |  -0.57  |
--------------------------------------------------------------

Best regards,
Hideo


P.S. When I compiled the linux-2.6-sched-fixes tree on ia64, I
had to revert the following git commit since pteval_t is defined
on x86 only.

commit 8686f2b37e7394b51dd6593678cbfd85ecd28c65
Date:   Tue May 6 15:42:40 2008 -0700

    generic, x86, PAT: fix mprotect


Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
CC: 'Peter Zijlstra' <peterz@infradead.org>
CC: "Frank Ch. Eigler" <fche@redhat.com>
CC: 'Ingo Molnar' <mingo@elte.hu>
CC: 'Hideo AOKI' <haoki@redhat.com>
CC: Takashi Nishiie <t-nishiie@np.css.fujitsu.com>
CC: 'Steven Rostedt' <rostedt@goodmis.org>
CC: Alexander Viro <viro@zeniv.linux.org.uk>
CC: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
---
 include/asm-generic/vmlinux.lds.h |    6 
 include/linux/module.h            |   17 +
 include/linux/tracepoint.h        |  123 +++++++++
 init/Kconfig                      |    7 
 kernel/Makefile                   |    1 
 kernel/module.c                   |   66 +++++
 kernel/tracepoint.c               |  474 ++++++++++++++++++++++++++++++++++++++
 7 files changed, 692 insertions(+), 2 deletions(-)

Index: linux-2.6-lttng/init/Kconfig
===================================================================
--- linux-2.6-lttng.orig/init/Kconfig	2008-07-09 10:55:46.000000000 -0400
+++ linux-2.6-lttng/init/Kconfig	2008-07-09 10:55:58.000000000 -0400
@@ -782,6 +782,13 @@ config PROFILING
 	  Say Y here to enable the extended profiling support mechanisms used
 	  by profilers such as OProfile.
 
+config TRACEPOINTS
+	bool "Activate tracepoints"
+	default y
+	help
+	  Place an empty function call at each tracepoint site. The call
+	  can be dynamically changed to call a probe function.
+
 config MARKERS
 	bool "Activate markers"
 	help
Index: linux-2.6-lttng/kernel/Makefile
===================================================================
--- linux-2.6-lttng.orig/kernel/Makefile	2008-07-09 10:55:46.000000000 -0400
+++ linux-2.6-lttng/kernel/Makefile	2008-07-09 10:55:58.000000000 -0400
@@ -77,6 +77,7 @@ obj-$(CONFIG_SYSCTL) += utsname_sysctl.o
 obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
 obj-$(CONFIG_TASKSTATS) += taskstats.o tsacct.o
 obj-$(CONFIG_MARKERS) += marker.o
+obj-$(CONFIG_TRACEPOINTS) += tracepoint.o
 obj-$(CONFIG_LATENCYTOP) += latencytop.o
 obj-$(CONFIG_FTRACE) += trace/
 obj-$(CONFIG_TRACING) += trace/
Index: linux-2.6-lttng/include/linux/tracepoint.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/include/linux/tracepoint.h	2008-07-09 10:55:58.000000000 -0400
@@ -0,0 +1,123 @@
+#ifndef _LINUX_TRACEPOINT_H
+#define _LINUX_TRACEPOINT_H
+
+/*
+ * Kernel Tracepoint API.
+ *
+ * See Documentation/tracepoint.txt.
+ *
+ * (C) Copyright 2008 Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
+ *
+ * Heavily inspired from the Linux Kernel Markers.
+ *
+ * This file is released under the GPLv2.
+ * See the file COPYING for more details.
+ */
+
+#include <linux/types.h>
+
+struct module;
+struct tracepoint;
+
+struct tracepoint {
+	const char *name;		/* Tracepoint name */
+	int state;			/* State. */
+	void **funcs;
+} __attribute__((aligned(8)));
+
+
+#define TPPROTO(args...)	args
+#define TPARGS(args...)		args
+
+#ifdef CONFIG_TRACEPOINTS
+
+#define __DO_TRACE(tp, proto, args)					\
+	do {								\
+		int i;							\
+		void **funcs;						\
+		preempt_disable();					\
+		funcs = (tp)->funcs;					\
+		smp_read_barrier_depends();				\
+		if (funcs) {						\
+			for (i = 0; funcs[i]; i++) {			\
+				((void(*)(proto))(funcs[i]))(args);	\
+			}						\
+		}							\
+		preempt_enable();					\
+	} while (0)
+
+/*
+ * Make sure the alignment of the structure in the __tracepoints section will
+ * not add unwanted padding between the beginning of the section and the
+ * structure. Force alignment to the same alignment as the section start.
+ */
+#define DEFINE_TRACE(name, proto, args)					\
+	static inline void trace_##name(proto)				\
+	{								\
+		static const char __tpstrtab_##name[]			\
+		__attribute__((section("__tracepoints_strings")))	\
+		= #name ":" #proto;					\
+		static struct tracepoint __tracepoint_##name		\
+		__attribute__((section("__tracepoints"), aligned(8))) =	\
+		{ __tpstrtab_##name, 0, NULL };				\
+		if (unlikely(__tracepoint_##name.state))		\
+			__DO_TRACE(&__tracepoint_##name,		\
+				TPPROTO(proto), TPARGS(args));		\
+	}								\
+	static inline int register_trace_##name(void (*probe)(proto))	\
+	{								\
+		return tracepoint_probe_register(#name ":" #proto,	\
+			(void *)probe);					\
+	}								\
+	static inline void unregister_trace_##name(void (*probe)(proto))\
+	{								\
+		tracepoint_probe_unregister(#name ":" #proto,		\
+			(void *)probe);					\
+	}
+
+extern void tracepoint_update_probe_range(struct tracepoint *begin,
+	struct tracepoint *end);
+
+#else /* !CONFIG_TRACEPOINTS */
+#define DEFINE_TRACE(name, proto, args)			\
+	static inline void _do_trace_##name(struct tracepoint *tp, proto) \
+	{ }								\
+	static inline void trace_##name(proto)				\
+	{ }								\
+	static inline int register_trace_##name(void (*probe)(proto))	\
+	{								\
+		return -ENOSYS;						\
+	}								\
+	static inline void unregister_trace_##name(void (*probe)(proto))\
+	{ }
+
+static inline void tracepoint_update_probe_range(struct tracepoint *begin,
+	struct tracepoint *end)
+{ }
+#endif /* CONFIG_TRACEPOINTS */
+
+/*
+ * Connect a probe to a tracepoint.
+ * Internal API, should not be used directly.
+ */
+extern int tracepoint_probe_register(const char *name, void *probe);
+
+/*
+ * Disconnect a probe from a tracepoint.
+ * Internal API, should not be used directly.
+ */
+extern int tracepoint_probe_unregister(const char *name, void *probe);
+
+struct tracepoint_iter {
+	struct module *module;
+	struct tracepoint *tracepoint;
+};
+
+extern void tracepoint_iter_start(struct tracepoint_iter *iter);
+extern void tracepoint_iter_next(struct tracepoint_iter *iter);
+extern void tracepoint_iter_stop(struct tracepoint_iter *iter);
+extern void tracepoint_iter_reset(struct tracepoint_iter *iter);
+extern int tracepoint_get_iter_range(struct tracepoint **tracepoint,
+	struct tracepoint *begin, struct tracepoint *end);
+
+#endif
Index: linux-2.6-lttng/include/asm-generic/vmlinux.lds.h
===================================================================
--- linux-2.6-lttng.orig/include/asm-generic/vmlinux.lds.h	2008-07-09 10:55:46.000000000 -0400
+++ linux-2.6-lttng/include/asm-generic/vmlinux.lds.h	2008-07-09 10:55:58.000000000 -0400
@@ -52,7 +52,10 @@
 	. = ALIGN(8);							\
 	VMLINUX_SYMBOL(__start___markers) = .;				\
 	*(__markers)							\
-	VMLINUX_SYMBOL(__stop___markers) = .;
+	VMLINUX_SYMBOL(__stop___markers) = .;				\
+	VMLINUX_SYMBOL(__start___tracepoints) = .;			\
+	*(__tracepoints)						\
+	VMLINUX_SYMBOL(__stop___tracepoints) = .;
 
 #define RO_DATA(align)							\
 	. = ALIGN((align));						\
@@ -61,6 +64,7 @@
 		*(.rodata) *(.rodata.*)					\
 		*(__vermagic)		/* Kernel version magic */	\
 		*(__markers_strings)	/* Markers: strings */		\
+		*(__tracepoints_strings)/* Tracepoints: strings */	\
 	}								\
 									\
 	.rodata1          : AT(ADDR(.rodata1) - LOAD_OFFSET) {		\
Index: linux-2.6-lttng/kernel/tracepoint.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/kernel/tracepoint.c	2008-07-09 10:55:58.000000000 -0400
@@ -0,0 +1,474 @@
+/*
+ * Copyright (C) 2008 Mathieu Desnoyers
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ */
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/types.h>
+#include <linux/jhash.h>
+#include <linux/list.h>
+#include <linux/rcupdate.h>
+#include <linux/tracepoint.h>
+#include <linux/err.h>
+#include <linux/slab.h>
+
+extern struct tracepoint __start___tracepoints[];
+extern struct tracepoint __stop___tracepoints[];
+
+/* Set to 1 to enable tracepoint debug output */
+static const int tracepoint_debug;
+
+/*
+ * tracepoints_mutex nests inside module_mutex. Tracepoints mutex protects the
+ * builtin and module tracepoints and the hash table.
+ */
+static DEFINE_MUTEX(tracepoints_mutex);
+
+/*
+ * Tracepoint hash table, containing the active tracepoints.
+ * Protected by tracepoints_mutex.
+ */
+#define TRACEPOINT_HASH_BITS 6
+#define TRACEPOINT_TABLE_SIZE (1 << TRACEPOINT_HASH_BITS)
+
+/*
+ * Note about RCU :
+ * It is used to delay the freeing of old probe arrays until a quiescent
+ * state is reached.
+ * Tracepoint entries modifications are protected by the tracepoints_mutex.
+ */
+struct tracepoint_entry {
+	struct hlist_node hlist;
+	void **funcs;
+	int refcount;	/* Number of times armed. 0 if disarmed. */
+	struct rcu_head rcu;
+	void *oldptr;
+	unsigned char rcu_pending:1;
+	char name[0];
+};
+
+static struct hlist_head tracepoint_table[TRACEPOINT_TABLE_SIZE];
+
+static void free_old_closure(struct rcu_head *head)
+{
+	struct tracepoint_entry *entry = container_of(head,
+		struct tracepoint_entry, rcu);
+	kfree(entry->oldptr);
+	/* Make sure we free the data before setting the pending flag to 0 */
+	smp_wmb();
+	entry->rcu_pending = 0;
+}
+
+static void tracepoint_entry_free_old(struct tracepoint_entry *entry, void *old)
+{
+	if (!old)
+		return;
+	entry->oldptr = old;
+	entry->rcu_pending = 1;
+	/* write rcu_pending before calling the RCU callback */
+	smp_wmb();
+#ifdef CONFIG_PREEMPT_RCU
+	synchronize_sched();	/* Until we have the call_rcu_sched() */
+#endif
+	call_rcu(&entry->rcu, free_old_closure);
+}
+
+static void debug_print_probes(struct tracepoint_entry *entry)
+{
+	int i;
+
+	if (!tracepoint_debug)
+		return;
+
+	for (i = 0; entry->funcs[i]; i++)
+		printk(KERN_DEBUG "Probe %d : %p\n", i, entry->funcs[i]);
+}
+
+static void *
+tracepoint_entry_add_probe(struct tracepoint_entry *entry, void *probe)
+{
+	int nr_probes = 0;
+	void **old, **new;
+
+	WARN_ON(!probe);
+
+	debug_print_probes(entry);
+	old = entry->funcs;
+	if (old) {
+		/* (N -> N+1), (N != 0, 1) probes */
+		for (nr_probes = 0; old[nr_probes]; nr_probes++)
+			if (old[nr_probes] == probe)
+				return ERR_PTR(-EBUSY);
+	}
+	/* + 2 : one for new probe, one for NULL func */
+	new = kzalloc((nr_probes + 2) * sizeof(void *), GFP_KERNEL);
+	if (new == NULL)
+		return ERR_PTR(-ENOMEM);
+	if (old)
+		memcpy(new, old, nr_probes * sizeof(void *));
+	new[nr_probes] = probe;
+	entry->refcount = nr_probes + 1;
+	entry->funcs = new;
+	debug_print_probes(entry);
+	return old;
+}
+
+static void *
+tracepoint_entry_remove_probe(struct tracepoint_entry *entry, void *probe)
+{
+	int nr_probes = 0, nr_del = 0, i;
+	void **old, **new;
+
+	old = entry->funcs;
+
+	debug_print_probes(entry);
+	/* (N -> M), (N > 1, M >= 0) probes */
+	for (nr_probes = 0; old[nr_probes]; nr_probes++) {
+		if ((!probe || old[nr_probes] == probe))
+			nr_del++;
+	}
+
+	if (nr_probes - nr_del == 0) {
+		/* N -> 0, (N > 1) */
+		entry->funcs = NULL;
+		entry->refcount = 0;
+		debug_print_probes(entry);
+		return old;
+	} else {
+		int j = 0;
+		/* N -> M, (N > 1, M > 0) */
+		/* + 1 for NULL */
+		new = kzalloc((nr_probes - nr_del + 1)
+			* sizeof(void *), GFP_KERNEL);
+		if (new == NULL)
+			return ERR_PTR(-ENOMEM);
+		for (i = 0; old[i]; i++)
+			if ((probe && old[i] != probe))
+				new[j++] = old[i];
+		entry->refcount = nr_probes - nr_del;
+		entry->funcs = new;
+	}
+	debug_print_probes(entry);
+	return old;
+}
+
+/*
+ * Get tracepoint if the tracepoint is present in the tracepoint hash table.
+ * Must be called with tracepoints_mutex held.
+ * Returns NULL if not present.
+ */
+static struct tracepoint_entry *get_tracepoint(const char *name)
+{
+	struct hlist_head *head;
+	struct hlist_node *node;
+	struct tracepoint_entry *e;
+	u32 hash = jhash(name, strlen(name), 0);
+
+	head = &tracepoint_table[hash & ((1 << TRACEPOINT_HASH_BITS)-1)];
+	hlist_for_each_entry(e, node, head, hlist) {
+		if (!strcmp(name, e->name))
+			return e;
+	}
+	return NULL;
+}
+
+/*
+ * Add the tracepoint to the tracepoint hash table. Must be called with
+ * tracepoints_mutex held.
+ */
+static struct tracepoint_entry *add_tracepoint(const char *name)
+{
+	struct hlist_head *head;
+	struct hlist_node *node;
+	struct tracepoint_entry *e;
+	size_t name_len = strlen(name) + 1;
+	u32 hash = jhash(name, name_len-1, 0);
+
+	head = &tracepoint_table[hash & ((1 << TRACEPOINT_HASH_BITS)-1)];
+	hlist_for_each_entry(e, node, head, hlist) {
+		if (!strcmp(name, e->name)) {
+			printk(KERN_NOTICE
+				"tracepoint %s busy\n", name);
+			return ERR_PTR(-EBUSY);	/* Already there */
+		}
+	}
+	/*
+	 * Using kmalloc here to allocate a variable length element. Could
+	 * cause some memory fragmentation if overused.
+	 */
+	e = kmalloc(sizeof(struct tracepoint_entry) + name_len, GFP_KERNEL);
+	if (!e)
+		return ERR_PTR(-ENOMEM);
+	memcpy(&e->name[0], name, name_len);
+	e->funcs = NULL;
+	e->refcount = 0;
+	e->rcu_pending = 0;
+	hlist_add_head(&e->hlist, head);
+	return e;
+}
+
+/*
+ * Remove the tracepoint from the tracepoint hash table. Must be called with
+ * mutex_lock held.
+ */
+static int remove_tracepoint(const char *name)
+{
+	struct hlist_head *head;
+	struct hlist_node *node;
+	struct tracepoint_entry *e;
+	int found = 0;
+	size_t len = strlen(name) + 1;
+	u32 hash = jhash(name, len-1, 0);
+
+	head = &tracepoint_table[hash & ((1 << TRACEPOINT_HASH_BITS)-1)];
+	hlist_for_each_entry(e, node, head, hlist) {
+		if (!strcmp(name, e->name)) {
+			found = 1;
+			break;
+		}
+	}
+	if (!found)
+		return -ENOENT;
+	if (e->refcount)
+		return -EBUSY;
+	hlist_del(&e->hlist);
+	/* Make sure the call_rcu has been executed */
+	if (e->rcu_pending)
+		rcu_barrier();
+	kfree(e);
+	return 0;
+}
+
+/*
+ * Sets the probe callback corresponding to one tracepoint.
+ */
+static void set_tracepoint(struct tracepoint_entry **entry,
+	struct tracepoint *elem, int active)
+{
+	WARN_ON(strcmp((*entry)->name, elem->name) != 0);
+
+	smp_wmb();
+	/*
+	 * We also make sure that the new probe callbacks array is consistent
+	 * before setting a pointer to it.
+	 */
+	rcu_assign_pointer(elem->funcs, (*entry)->funcs);
+	elem->state = active;
+}
+
+/*
+ * Disable a tracepoint and its probe callback.
+ * Note: waiting only one RCU grace period after clearing elem->state is
+ * enough to ensure that the original callback is not used anymore. This is
+ * guaranteed by the preempt_disable around the call site.
+ */
+static void disable_tracepoint(struct tracepoint *elem)
+{
+	elem->state = 0;
+}
+
+/**
+ * tracepoint_update_probe_range - Update a probe range
+ * @begin: beginning of the range
+ * @end: end of the range
+ *
+ * Updates the probe callback corresponding to a range of tracepoints.
+ */
+void tracepoint_update_probe_range(struct tracepoint *begin,
+	struct tracepoint *end)
+{
+	struct tracepoint *iter;
+	struct tracepoint_entry *mark_entry;
+
+	mutex_lock(&tracepoints_mutex);
+	for (iter = begin; iter < end; iter++) {
+		mark_entry = get_tracepoint(iter->name);
+		if (mark_entry) {
+			set_tracepoint(&mark_entry, iter,
+					!!mark_entry->refcount);
+		} else {
+			disable_tracepoint(iter);
+		}
+	}
+	mutex_unlock(&tracepoints_mutex);
+}
+
+/*
+ * Update probes, removing the faulty probes.
+ */
+static void tracepoint_update_probes(void)
+{
+	/* Core kernel tracepoints */
+	tracepoint_update_probe_range(__start___tracepoints,
+		__stop___tracepoints);
+	/* tracepoints in modules. */
+	module_update_tracepoints();
+}
+
+/**
+ * tracepoint_probe_register -  Connect a probe to a tracepoint
+ * @name: tracepoint name
+ * @probe: probe handler
+ *
+ * Returns 0 if ok, error value on error.
+ * The probe address must at least be aligned on the architecture pointer size.
+ */
+int tracepoint_probe_register(const char *name, void *probe)
+{
+	struct tracepoint_entry *entry;
+	int ret = 0;
+	void *old;
+
+	mutex_lock(&tracepoints_mutex);
+	entry = get_tracepoint(name);
+	if (!entry) {
+		entry = add_tracepoint(name);
+		if (IS_ERR(entry)) {
+			ret = PTR_ERR(entry);
+			goto end;
+		}
+	}
+	/*
+	 * If we detect that a call_rcu is pending for this tracepoint,
+	 * make sure it's executed now.
+	 */
+	if (entry->rcu_pending)
+		rcu_barrier();
+	old = tracepoint_entry_add_probe(entry, probe);
+	if (IS_ERR(old)) {
+		ret = PTR_ERR(old);
+		goto end;
+	}
+	mutex_unlock(&tracepoints_mutex);
+	tracepoint_update_probes();		/* may update entry */
+	mutex_lock(&tracepoints_mutex);
+	entry = get_tracepoint(name);
+	WARN_ON(!entry);
+	tracepoint_entry_free_old(entry, old);
+end:
+	mutex_unlock(&tracepoints_mutex);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(tracepoint_probe_register);
+
+/**
+ * tracepoint_probe_unregister -  Disconnect a probe from a tracepoint
+ * @name: tracepoint name
+ * @probe: probe function pointer
+ *
+ * We do not need to call a synchronize_sched to make sure the probes have
+ * finished running before doing a module unload, because the module unload
+ * itself uses stop_machine(), which ensures that every preempt-disabled
+ * section has finished.
+ */
+int tracepoint_probe_unregister(const char *name, void *probe)
+{
+	struct tracepoint_entry *entry;
+	void *old;
+	int ret = -ENOENT;
+
+	mutex_lock(&tracepoints_mutex);
+	entry = get_tracepoint(name);
+	if (!entry)
+		goto end;
+	if (entry->rcu_pending)
+		rcu_barrier();
+	old = tracepoint_entry_remove_probe(entry, probe);
+	mutex_unlock(&tracepoints_mutex);
+	tracepoint_update_probes();		/* may update entry */
+	mutex_lock(&tracepoints_mutex);
+	entry = get_tracepoint(name);
+	if (!entry)
+		goto end;
+	tracepoint_entry_free_old(entry, old);
+	remove_tracepoint(name);	/* Ignore busy error message */
+	ret = 0;
+end:
+	mutex_unlock(&tracepoints_mutex);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(tracepoint_probe_unregister);
+
+/**
+ * tracepoint_get_iter_range - Get the next tracepoint given a range
+ * @tracepoint: current tracepoint (in), next tracepoint (out)
+ * @begin: beginning of the range
+ * @end: end of the range
+ *
+ * Returns whether a next tracepoint has been found (1) or not (0).
+ * Will return the first tracepoint in the range if the input tracepoint is
+ * NULL.
+ */
+int tracepoint_get_iter_range(struct tracepoint **tracepoint,
+	struct tracepoint *begin, struct tracepoint *end)
+{
+	if (!*tracepoint && begin != end) {
+		*tracepoint = begin;
+		return 1;
+	}
+	if (*tracepoint >= begin && *tracepoint < end)
+		return 1;
+	return 0;
+}
+EXPORT_SYMBOL_GPL(tracepoint_get_iter_range);
+
+static void tracepoint_get_iter(struct tracepoint_iter *iter)
+{
+	int found = 0;
+
+	/* Core kernel tracepoints */
+	if (!iter->module) {
+		found = tracepoint_get_iter_range(&iter->tracepoint,
+				__start___tracepoints, __stop___tracepoints);
+		if (found)
+			goto end;
+	}
+	/* tracepoints in modules. */
+	found = module_get_iter_tracepoints(iter);
+end:
+	if (!found)
+		tracepoint_iter_reset(iter);
+}
+
+void tracepoint_iter_start(struct tracepoint_iter *iter)
+{
+	tracepoint_get_iter(iter);
+}
+EXPORT_SYMBOL_GPL(tracepoint_iter_start);
+
+void tracepoint_iter_next(struct tracepoint_iter *iter)
+{
+	iter->tracepoint++;
+	/*
+	 * iter->tracepoint may be invalid because we blindly incremented it.
+	 * Make sure it is valid by walking the remaining tracepoints, moving
+	 * on to the following modules if necessary.
+	 */
+	tracepoint_get_iter(iter);
+}
+EXPORT_SYMBOL_GPL(tracepoint_iter_next);
+
+void tracepoint_iter_stop(struct tracepoint_iter *iter)
+{
+}
+EXPORT_SYMBOL_GPL(tracepoint_iter_stop);
+
+void tracepoint_iter_reset(struct tracepoint_iter *iter)
+{
+	iter->module = NULL;
+	iter->tracepoint = NULL;
+}
+EXPORT_SYMBOL_GPL(tracepoint_iter_reset);
Index: linux-2.6-lttng/kernel/module.c
===================================================================
--- linux-2.6-lttng.orig/kernel/module.c	2008-07-09 10:55:46.000000000 -0400
+++ linux-2.6-lttng/kernel/module.c	2008-07-09 10:55:58.000000000 -0400
@@ -47,6 +47,7 @@
 #include <asm/sections.h>
 #include <linux/license.h>
 #include <asm/sections.h>
+#include <linux/tracepoint.h>
 
 #if 0
 #define DEBUGP printk
@@ -1824,6 +1825,8 @@ static struct module *load_module(void _
 #endif
 	unsigned int markersindex;
 	unsigned int markersstringsindex;
+	unsigned int tracepointsindex;
+	unsigned int tracepointsstringsindex;
 	struct module *mod;
 	long err = 0;
 	void *percpu = NULL, *ptr = NULL; /* Stops spurious gcc warning */
@@ -2110,6 +2113,9 @@ static struct module *load_module(void _
 	markersindex = find_sec(hdr, sechdrs, secstrings, "__markers");
  	markersstringsindex = find_sec(hdr, sechdrs, secstrings,
 					"__markers_strings");
+	tracepointsindex = find_sec(hdr, sechdrs, secstrings, "__tracepoints");
+	tracepointsstringsindex = find_sec(hdr, sechdrs, secstrings,
+					"__tracepoints_strings");
 
 	/* Now do relocations. */
 	for (i = 1; i < hdr->e_shnum; i++) {
@@ -2137,6 +2143,12 @@ static struct module *load_module(void _
 	mod->num_markers =
 		sechdrs[markersindex].sh_size / sizeof(*mod->markers);
 #endif
+#ifdef CONFIG_TRACEPOINTS
+	mod->tracepoints = (void *)sechdrs[tracepointsindex].sh_addr;
+	mod->num_tracepoints =
+		sechdrs[tracepointsindex].sh_size / sizeof(*mod->tracepoints);
+#endif
+
 
         /* Find duplicate symbols */
 	err = verify_export_symbols(mod);
@@ -2155,11 +2167,16 @@ static struct module *load_module(void _
 
 	add_kallsyms(mod, sechdrs, symindex, strindex, secstrings);
 
+	if (!mod->taints) {
 #ifdef CONFIG_MARKERS
-	if (!mod->taints)
 		marker_update_probe_range(mod->markers,
 			mod->markers + mod->num_markers);
 #endif
+#ifdef CONFIG_TRACEPOINTS
+		tracepoint_update_probe_range(mod->tracepoints,
+			mod->tracepoints + mod->num_tracepoints);
+#endif
+	}
 	err = module_finalize(hdr, sechdrs, mod);
 	if (err < 0)
 		goto cleanup;
@@ -2710,3 +2727,50 @@ void module_update_markers(void)
 	mutex_unlock(&module_mutex);
 }
 #endif
+
+#ifdef CONFIG_TRACEPOINTS
+void module_update_tracepoints(void)
+{
+	struct module *mod;
+
+	mutex_lock(&module_mutex);
+	list_for_each_entry(mod, &modules, list)
+		if (!mod->taints)
+			tracepoint_update_probe_range(mod->tracepoints,
+				mod->tracepoints + mod->num_tracepoints);
+	mutex_unlock(&module_mutex);
+}
+
+/*
+ * Returns 0 if current not found.
+ * Returns 1 if current found.
+ */
+int module_get_iter_tracepoints(struct tracepoint_iter *iter)
+{
+	struct module *iter_mod;
+	int found = 0;
+
+	mutex_lock(&module_mutex);
+	list_for_each_entry(iter_mod, &modules, list) {
+		if (!iter_mod->taints) {
+			/*
+			 * Sorted module list
+			 */
+			if (iter_mod < iter->module)
+				continue;
+			else if (iter_mod > iter->module)
+				iter->tracepoint = NULL;
+			found = tracepoint_get_iter_range(&iter->tracepoint,
+				iter_mod->tracepoints,
+				iter_mod->tracepoints
+					+ iter_mod->num_tracepoints);
+			if (found) {
+				iter->module = iter_mod;
+				break;
+			}
+		}
+	}
+	mutex_unlock(&module_mutex);
+	return found;
+}
+#endif
Index: linux-2.6-lttng/include/linux/module.h
===================================================================
--- linux-2.6-lttng.orig/include/linux/module.h	2008-07-09 10:55:46.000000000 -0400
+++ linux-2.6-lttng/include/linux/module.h	2008-07-09 10:57:22.000000000 -0400
@@ -16,6 +16,7 @@
 #include <linux/kobject.h>
 #include <linux/moduleparam.h>
 #include <linux/marker.h>
+#include <linux/tracepoint.h>
 #include <asm/local.h>
 
 #include <asm/module.h>
@@ -331,6 +332,10 @@ struct module
 	struct marker *markers;
 	unsigned int num_markers;
 #endif
+#ifdef CONFIG_TRACEPOINTS
+	struct tracepoint *tracepoints;
+	unsigned int num_tracepoints;
+#endif
 
 #ifdef CONFIG_MODULE_UNLOAD
 	/* What modules depend on me? */
@@ -454,6 +459,9 @@ extern void print_modules(void);
 
 extern void module_update_markers(void);
 
+extern void module_update_tracepoints(void);
+extern int module_get_iter_tracepoints(struct tracepoint_iter *iter);
+
 #else /* !CONFIG_MODULES... */
 #define EXPORT_SYMBOL(sym)
 #define EXPORT_SYMBOL_GPL(sym)
@@ -558,6 +566,15 @@ static inline void module_update_markers
 {
 }
 
+static inline void module_update_tracepoints(void)
+{
+}
+
+static inline int module_get_iter_tracepoints(struct tracepoint_iter *iter)
+{
+	return 0;
+}
+
 #endif /* CONFIG_MODULES */
 
 struct device_driver;

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* [patch 02/15] Tracepoints Documentation
  2008-07-09 14:59 [patch 00/15] Tracepoints v3 for linux-next Mathieu Desnoyers
  2008-07-09 14:59 ` [patch 01/15] Kernel Tracepoints Mathieu Desnoyers
@ 2008-07-09 14:59 ` Mathieu Desnoyers
  2008-07-09 14:59 ` [patch 03/15] Tracepoints Samples Mathieu Desnoyers
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 58+ messages in thread
From: Mathieu Desnoyers @ 2008-07-09 14:59 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel
  Cc: Mathieu Desnoyers, Masami Hiramatsu, Peter Zijlstra,
	Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie, Steven Rostedt,
	Eduard - Gabriel Munteanu

[-- Attachment #1: tracepoints-documentation.patch --]
[-- Type: text/plain, Size: 4877 bytes --]

Documentation of tracepoint usage.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Masami Hiramatsu <mhiramat@redhat.com>
CC: 'Peter Zijlstra' <peterz@infradead.org>
CC: "Frank Ch. Eigler" <fche@redhat.com>
CC: 'Ingo Molnar' <mingo@elte.hu>
CC: 'Hideo AOKI' <haoki@redhat.com>
CC: Takashi Nishiie <t-nishiie@np.css.fujitsu.com>
CC: 'Steven Rostedt' <rostedt@goodmis.org>
CC: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
---
 Documentation/tracepoints.txt |  101 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 101 insertions(+)

Index: linux-2.6-lttng/Documentation/tracepoints.txt
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/Documentation/tracepoints.txt	2008-07-07 09:55:13.000000000 -0400
@@ -0,0 +1,101 @@
+ 	             Using the Linux Kernel Tracepoints
+
+			    Mathieu Desnoyers
+
+
+This document introduces Linux Kernel Tracepoints and their use. It shows how
+to insert tracepoints in the kernel, how to connect probe functions to them,
+and gives some examples of probe functions.
+
+
+* Purpose of tracepoints
+
+A tracepoint placed in code provides a hook to call a function (probe) that you
+can provide at runtime. A tracepoint can be "on" (a probe is connected to it) or
+"off" (no probe is attached). When a tracepoint is "off" it has no effect,
+except for adding a tiny time penalty (checking a condition for a branch) and a
+small space penalty (a few bytes for the function call at the end of the
+instrumented function, plus a data structure in a separate section).  When a
+tracepoint is "on", the function you provide is called each time the tracepoint
+is executed, in the execution context of the caller. When the provided function
+returns, execution resumes in the caller (continuing from the tracepoint
+site).
+
+You can put tracepoints at important locations in the code. They are
+lightweight hooks that can pass an arbitrary number of parameters,
+whose prototypes are described in a tracepoint declaration placed in a header
+file.
+
+They can be used for tracing and performance accounting.
+
+
+* Usage
+
+Two elements are required for tracepoints :
+
+- A tracepoint definition, placed in a header file.
+- The tracepoint statement, in C code.
+
+In order to use tracepoints, you should include linux/tracepoint.h.
+
+In subsys/subsys-trace.h :
+
+#include <linux/tracepoint.h>
+
+DEFINE_TRACE(subsys_eventname,
+	TPPROTO(int firstarg, struct task_struct *p),
+	TPARGS(firstarg, p));
+
+In subsys/file.c (where the tracing statement must be added) :
+
+#include "subsys-trace.h"
+
+void somefct(void)
+{
+	...
+	trace_subsys_eventname(arg, task);
+	...
+}
+
+Where :
+- subsys_eventname is an identifier unique to your event
+    - subsys is the name of your subsystem.
+    - eventname is the name of the event to trace.
+- TPPROTO(int firstarg, struct task_struct *p) is the prototype of the function
+  called by this tracepoint.
+- TPARGS(firstarg, p) are the parameter names, as found in the prototype.
+
+Connecting a function (probe) to a tracepoint is done by providing a probe
+(function to call) for the specific tracepoint through
+register_trace_subsys_eventname().  Removing a probe is done through
+unregister_trace_subsys_eventname(); it will remove the probe and make sure no
+caller is left using the probe when it returns. Probe removal is preempt-safe
+because preemption is disabled around the probe call. See the "Probe example"
+section below for a sample probe module.
+
+The tracepoint mechanism supports inserting multiple instances of the same
+tracepoint, but a given tracepoint name must have a single definition across
+the whole kernel so that no type conflict can occur. Name mangling of the
+tracepoints is done using the prototypes to make sure typing is correct.
+Verification of probe type correctness is done at the registration site by the
+compiler. Tracepoints can be put in inline functions, inlined static functions,
+and unrolled loops as well as regular functions.
+
+The naming scheme "subsys_event" is suggested here as a convention intended
+to limit collisions. Tracepoint names are global to the kernel: they are
+considered the same whether they are in the core kernel image or in
+modules.
+
+
+* Probe / tracepoint example
+
+See the examples provided in samples/tracepoints
+
+Compile them with your kernel.
+
+Run, as root :
+modprobe tracepoint-example (insmod order is not important)
+modprobe tracepoint-probe-example
+cat /proc/tracepoint-example (returns an expected error)
+rmmod tracepoint-example tracepoint-probe-example
+dmesg



* [patch 03/15] Tracepoints Samples
  2008-07-09 14:59 [patch 00/15] Tracepoints v3 for linux-next Mathieu Desnoyers
  2008-07-09 14:59 ` [patch 01/15] Kernel Tracepoints Mathieu Desnoyers
  2008-07-09 14:59 ` [patch 02/15] Tracepoints Documentation Mathieu Desnoyers
@ 2008-07-09 14:59 ` Mathieu Desnoyers
  2008-07-09 14:59 ` [patch 04/15] LTTng instrumentation - irq Mathieu Desnoyers
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 58+ messages in thread
From: Mathieu Desnoyers @ 2008-07-09 14:59 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel
  Cc: Mathieu Desnoyers, Masami Hiramatsu, Peter Zijlstra,
	Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie, Steven Rostedt,
	Eduard - Gabriel Munteanu

[-- Attachment #1: tracepoints-samples.patch --]
[-- Type: text/plain, Size: 7644 bytes --]

Tracepoint example code under samples/.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Masami Hiramatsu <mhiramat@redhat.com>
CC: 'Peter Zijlstra' <peterz@infradead.org>
CC: "Frank Ch. Eigler" <fche@redhat.com>
CC: 'Ingo Molnar' <mingo@elte.hu>
CC: 'Hideo AOKI' <haoki@redhat.com>
CC: Takashi Nishiie <t-nishiie@np.css.fujitsu.com>
CC: 'Steven Rostedt' <rostedt@goodmis.org>
CC: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
---
 samples/Kconfig                                |    6 ++
 samples/Makefile                               |    2 
 samples/tracepoints/Makefile                   |    6 ++
 samples/tracepoints/tp-samples-trace.h         |   13 +++++
 samples/tracepoints/tracepoint-probe-sample.c  |   55 +++++++++++++++++++++++++
 samples/tracepoints/tracepoint-probe-sample2.c |   42 +++++++++++++++++++
 samples/tracepoints/tracepoint-sample.c        |   53 ++++++++++++++++++++++++
 7 files changed, 176 insertions(+), 1 deletion(-)

Index: linux-2.6-lttng/samples/Kconfig
===================================================================
--- linux-2.6-lttng.orig/samples/Kconfig	2008-07-09 10:46:33.000000000 -0400
+++ linux-2.6-lttng/samples/Kconfig	2008-07-09 10:57:28.000000000 -0400
@@ -13,6 +13,12 @@ config SAMPLE_MARKERS
 	help
 	  This build markers example modules.
 
+config SAMPLE_TRACEPOINTS
+	tristate "Build tracepoints examples -- loadable modules only"
+	depends on TRACEPOINTS && m
+	help
+	  This builds the tracepoints example modules.
+
 config SAMPLE_KOBJECT
 	tristate "Build kobject examples"
 	help
Index: linux-2.6-lttng/samples/tracepoints/Makefile
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/samples/tracepoints/Makefile	2008-07-09 10:57:28.000000000 -0400
@@ -0,0 +1,6 @@
+# builds the tracepoint example kernel modules;
+# then to use one (as root):  insmod <module_name.ko>
+
+obj-$(CONFIG_SAMPLE_TRACEPOINTS) += tracepoint-sample.o
+obj-$(CONFIG_SAMPLE_TRACEPOINTS) += tracepoint-probe-sample.o
+obj-$(CONFIG_SAMPLE_TRACEPOINTS) += tracepoint-probe-sample2.o
Index: linux-2.6-lttng/samples/tracepoints/tracepoint-probe-sample.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/samples/tracepoints/tracepoint-probe-sample.c	2008-07-09 10:57:28.000000000 -0400
@@ -0,0 +1,55 @@
+/*
+ * tracepoint-probe-sample.c
+ *
+ * sample tracepoint probes.
+ */
+
+#include <linux/module.h>
+#include <linux/file.h>
+#include <linux/dcache.h>
+#include "tp-samples-trace.h"
+
+/*
+ * Here the caller only guarantees locking for struct file and struct inode.
+ * Locking must therefore be done in the probe to use the dentry.
+ */
+static void probe_subsys_event(struct inode *inode, struct file *file)
+{
+	path_get(&file->f_path);
+	dget(file->f_path.dentry);
+	printk(KERN_INFO "Event is encountered with filename %s\n",
+		file->f_path.dentry->d_name.name);
+	dput(file->f_path.dentry);
+	path_put(&file->f_path);
+}
+
+static void probe_subsys_eventb(void)
+{
+	printk(KERN_INFO "Event B is encountered\n");
+}
+
+int __init tp_sample_trace_init(void)
+{
+	int ret;
+
+	ret = register_trace_subsys_event(probe_subsys_event);
+	WARN_ON(ret);
+	ret = register_trace_subsys_eventb(probe_subsys_eventb);
+	WARN_ON(ret);
+
+	return 0;
+}
+
+module_init(tp_sample_trace_init);
+
+void __exit tp_sample_trace_exit(void)
+{
+	unregister_trace_subsys_eventb(probe_subsys_eventb);
+	unregister_trace_subsys_event(probe_subsys_event);
+}
+
+module_exit(tp_sample_trace_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Mathieu Desnoyers");
+MODULE_DESCRIPTION("Tracepoint Probes Samples");
Index: linux-2.6-lttng/samples/tracepoints/tracepoint-sample.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/samples/tracepoints/tracepoint-sample.c	2008-07-09 10:57:28.000000000 -0400
@@ -0,0 +1,53 @@
+/* tracepoint-sample.c
+ *
+ * Executes a tracepoint when /proc/tracepoint-example is opened.
+ *
+ * (C) Copyright 2007 Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
+ *
+ * This file is released under the GPLv2.
+ * See the file COPYING for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/proc_fs.h>
+#include "tp-samples-trace.h"
+
+struct proc_dir_entry *pentry_example;
+
+static int my_open(struct inode *inode, struct file *file)
+{
+	int i;
+
+	trace_subsys_event(inode, file);
+	for (i = 0; i < 10; i++)
+		trace_subsys_eventb();
+	return -EPERM;
+}
+
+static struct file_operations mark_ops = {
+	.open = my_open,
+};
+
+static int example_init(void)
+{
+	printk(KERN_ALERT "example init\n");
+	pentry_example = proc_create("tracepoint-example", 0444, NULL,
+		&mark_ops);
+	if (!pentry_example)
+		return -EPERM;
+	return 0;
+}
+
+static void example_exit(void)
+{
+	printk(KERN_ALERT "example exit\n");
+	remove_proc_entry("tracepoint-example", NULL);
+}
+
+module_init(example_init)
+module_exit(example_exit)
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Mathieu Desnoyers");
+MODULE_DESCRIPTION("Tracepoint example");
Index: linux-2.6-lttng/samples/tracepoints/tp-samples-trace.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/samples/tracepoints/tp-samples-trace.h	2008-07-09 10:57:28.000000000 -0400
@@ -0,0 +1,13 @@
+#ifndef _TP_SAMPLES_TRACE_H
+#define _TP_SAMPLES_TRACE_H
+
+#include <linux/proc_fs.h>	/* for struct inode and struct file */
+#include <linux/tracepoint.h>
+
+DEFINE_TRACE(subsys_event,
+	TPPROTO(struct inode *inode, struct file *file),
+	TPARGS(inode, file));
+DEFINE_TRACE(subsys_eventb,
+	TPPROTO(void),
+	TPARGS());
+#endif
Index: linux-2.6-lttng/samples/Makefile
===================================================================
--- linux-2.6-lttng.orig/samples/Makefile	2008-07-09 10:46:33.000000000 -0400
+++ linux-2.6-lttng/samples/Makefile	2008-07-09 10:57:28.000000000 -0400
@@ -1,3 +1,3 @@
 # Makefile for Linux samples code
 
-obj-$(CONFIG_SAMPLES)	+= markers/ kobject/ kprobes/
+obj-$(CONFIG_SAMPLES)	+= markers/ kobject/ kprobes/ tracepoints/
Index: linux-2.6-lttng/samples/tracepoints/tracepoint-probe-sample2.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/samples/tracepoints/tracepoint-probe-sample2.c	2008-07-09 10:57:28.000000000 -0400
@@ -0,0 +1,42 @@
+/*
+ * tracepoint-probe-sample2.c
+ *
+ * 2nd sample tracepoint probes.
+ */
+
+#include <linux/module.h>
+#include <linux/fs.h>
+#include "tp-samples-trace.h"
+
+/*
+ * Here the caller only guarantees locking for struct file and struct inode.
+ * Locking must therefore be done in the probe to use the dentry.
+ */
+static void probe_subsys_event(struct inode *inode, struct file *file)
+{
+	printk(KERN_INFO "Event is encountered with inode number %lu\n",
+		inode->i_ino);
+}
+
+int __init tp_sample_trace_init(void)
+{
+	int ret;
+
+	ret = register_trace_subsys_event(probe_subsys_event);
+	WARN_ON(ret);
+
+	return 0;
+}
+
+module_init(tp_sample_trace_init);
+
+void __exit tp_sample_trace_exit(void)
+{
+	unregister_trace_subsys_event(probe_subsys_event);
+}
+
+module_exit(tp_sample_trace_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Mathieu Desnoyers");
+MODULE_DESCRIPTION("Tracepoint Probes Samples");



* [patch 04/15] LTTng instrumentation - irq
  2008-07-09 14:59 [patch 00/15] Tracepoints v3 for linux-next Mathieu Desnoyers
                   ` (2 preceding siblings ...)
  2008-07-09 14:59 ` [patch 03/15] Tracepoints Samples Mathieu Desnoyers
@ 2008-07-09 14:59 ` Mathieu Desnoyers
  2008-07-09 16:39   ` Masami Hiramatsu
  2008-07-09 14:59 ` [patch 05/15] LTTng instrumentation - scheduler Mathieu Desnoyers
                   ` (11 subsequent siblings)
  15 siblings, 1 reply; 58+ messages in thread
From: Mathieu Desnoyers @ 2008-07-09 14:59 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel
  Cc: Mathieu Desnoyers, Thomas Gleixner, Russell King,
	Masami Hiramatsu, Peter Zijlstra, Frank Ch. Eigler, Hideo AOKI,
	Takashi Nishiie, Steven Rostedt, Eduard - Gabriel Munteanu

[-- Attachment #1: lttng-instrumentation-irq.patch --]
[-- Type: text/plain, Size: 5170 bytes --]

Instrumentation of IRQ-related events: irq, softirq and tasklet entry and exit,
and softirq "raise" events.

It allows tracers to perform latency analysis on those various types of
interrupts and to compute the max/min/avg duration of interrupts. It helps
detect driver or hardware problems which cause an ISR to take ages to
execute. This has been shown to happen with bogus hardware causing an MMIO
read to take a few milliseconds.

Those tracepoints are used by LTTng.

About the performance impact of tracepoints (which is comparable to markers),
even without immediate values optimizations, tests done by Hideo Aoki on ia64
show no regression. His test case was using hackbench on a kernel where
scheduler instrumentation (about 5 events in the scheduler code) was added.
See the "Tracepoints" patch header for detailed performance results.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Russell King <rmk+lkml@arm.linux.org.uk>
CC: Masami Hiramatsu <mhiramat@redhat.com>
CC: 'Peter Zijlstra' <peterz@infradead.org>
CC: "Frank Ch. Eigler" <fche@redhat.com>
CC: 'Ingo Molnar' <mingo@elte.hu>
CC: 'Hideo AOKI' <haoki@redhat.com>
CC: Takashi Nishiie <t-nishiie@np.css.fujitsu.com>
CC: 'Steven Rostedt' <rostedt@goodmis.org>
CC: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
---
 kernel/irq-trace.h  |   36 ++++++++++++++++++++++++++++++++++++
 kernel/irq/handle.c |    6 ++++++
 kernel/softirq.c    |    8 ++++++++
 3 files changed, 50 insertions(+)

Index: linux-2.6-lttng/kernel/irq/handle.c
===================================================================
--- linux-2.6-lttng.orig/kernel/irq/handle.c	2008-07-09 10:57:33.000000000 -0400
+++ linux-2.6-lttng/kernel/irq/handle.c	2008-07-09 10:57:35.000000000 -0400
@@ -15,6 +15,7 @@
 #include <linux/random.h>
 #include <linux/interrupt.h>
 #include <linux/kernel_stat.h>
+#include "../irq-trace.h"
 
 #include "internals.h"
 
@@ -130,6 +131,9 @@ irqreturn_t handle_IRQ_event(unsigned in
 {
 	irqreturn_t ret, retval = IRQ_NONE;
 	unsigned int status = 0;
+	struct pt_regs *regs = get_irq_regs();
+
+	trace_irq_entry(irq, regs);
 
 	handle_dynamic_tick(action);
 
@@ -148,6 +152,8 @@ irqreturn_t handle_IRQ_event(unsigned in
 		add_interrupt_randomness(irq);
 	local_irq_disable();
 
+	trace_irq_exit();
+
 	return retval;
 }
 
Index: linux-2.6-lttng/kernel/softirq.c
===================================================================
--- linux-2.6-lttng.orig/kernel/softirq.c	2008-07-09 10:57:33.000000000 -0400
+++ linux-2.6-lttng/kernel/softirq.c	2008-07-09 10:57:35.000000000 -0400
@@ -21,6 +21,7 @@
 #include <linux/rcupdate.h>
 #include <linux/smp.h>
 #include <linux/tick.h>
+#include "irq-trace.h"
 
 #include <asm/irq.h>
 /*
@@ -205,7 +206,9 @@ restart:
 
 	do {
 		if (pending & 1) {
+			trace_irq_softirq_entry(h, softirq_vec);
 			h->action(h);
+			trace_irq_softirq_exit(h, softirq_vec);
 			rcu_bh_qsctr_inc(cpu);
 		}
 		h++;
@@ -297,6 +300,7 @@ void irq_exit(void)
  */
 inline void raise_softirq_irqoff(unsigned int nr)
 {
+	trace_irq_softirq_raise(nr);
 	__raise_softirq_irqoff(nr);
 
 	/*
@@ -394,7 +398,9 @@ static void tasklet_action(struct softir
 			if (!atomic_read(&t->count)) {
 				if (!test_and_clear_bit(TASKLET_STATE_SCHED, &t->state))
 					BUG();
+				trace_irq_tasklet_low_entry(t);
 				t->func(t->data);
+				trace_irq_tasklet_low_exit(t);
 				tasklet_unlock(t);
 				continue;
 			}
@@ -429,7 +435,9 @@ static void tasklet_hi_action(struct sof
 			if (!atomic_read(&t->count)) {
 				if (!test_and_clear_bit(TASKLET_STATE_SCHED, &t->state))
 					BUG();
+				trace_irq_tasklet_high_entry(t);
 				t->func(t->data);
+				trace_irq_tasklet_high_exit(t);
 				tasklet_unlock(t);
 				continue;
 			}
Index: linux-2.6-lttng/kernel/irq-trace.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/kernel/irq-trace.h	2008-07-09 10:57:35.000000000 -0400
@@ -0,0 +1,36 @@
+#ifndef _IRQ_TRACE_H
+#define _IRQ_TRACE_H
+
+#include <linux/kdebug.h>
+#include <linux/interrupt.h>
+#include <linux/tracepoint.h>
+
+DEFINE_TRACE(irq_entry,
+	TPPROTO(unsigned int id, struct pt_regs *regs),
+	TPARGS(id, regs));
+DEFINE_TRACE(irq_exit,
+	TPPROTO(void),
+	TPARGS());
+DEFINE_TRACE(irq_softirq_entry,
+	TPPROTO(struct softirq_action *h, struct softirq_action *softirq_vec),
+	TPARGS(h, softirq_vec));
+DEFINE_TRACE(irq_softirq_exit,
+	TPPROTO(struct softirq_action *h, struct softirq_action *softirq_vec),
+	TPARGS(h, softirq_vec));
+DEFINE_TRACE(irq_softirq_raise,
+	TPPROTO(unsigned int nr),
+	TPARGS(nr));
+DEFINE_TRACE(irq_tasklet_low_entry,
+	TPPROTO(struct tasklet_struct *t),
+	TPARGS(t));
+DEFINE_TRACE(irq_tasklet_low_exit,
+	TPPROTO(struct tasklet_struct *t),
+	TPARGS(t));
+DEFINE_TRACE(irq_tasklet_high_entry,
+	TPPROTO(struct tasklet_struct *t),
+	TPARGS(t));
+DEFINE_TRACE(irq_tasklet_high_exit,
+	TPPROTO(struct tasklet_struct *t),
+	TPARGS(t));
+
+#endif



* [patch 05/15] LTTng instrumentation - scheduler
  2008-07-09 14:59 [patch 00/15] Tracepoints v3 for linux-next Mathieu Desnoyers
                   ` (3 preceding siblings ...)
  2008-07-09 14:59 ` [patch 04/15] LTTng instrumentation - irq Mathieu Desnoyers
@ 2008-07-09 14:59 ` Mathieu Desnoyers
  2008-07-09 15:34   ` [patch 05/15] LTTng instrumentation - scheduler (repost) Mathieu Desnoyers
  2008-07-09 14:59 ` [patch 06/15] LTTng instrumentation - timer Mathieu Desnoyers
                   ` (10 subsequent siblings)
  15 siblings, 1 reply; 58+ messages in thread
From: Mathieu Desnoyers @ 2008-07-09 14:59 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel
  Cc: Mathieu Desnoyers, Peter Zijlstra, Steven Rostedt,
	Thomas Gleixner, Masami Hiramatsu, Frank Ch. Eigler, Hideo AOKI,
	Takashi Nishiie, Eduard - Gabriel Munteanu

[-- Attachment #1: lttng-instrumentation-scheduler.patch --]
[-- Type: text/plain, Size: 7968 bytes --]

Instrument the scheduler activity (sched_switch, migration, wakeups, wait for a
task, signal delivery) and process/thread creation/destruction (fork, exit,
kthread stop). Kthread creation is not instrumented in this patch because it
is architecture dependent. This instrumentation allows connecting tracers such
as ftrace, which detect scheduling latencies and good/bad scheduler decisions.
Tools like LTTng can export this scheduler information along with
instrumentation of the rest of the kernel activity to perform post-mortem
analysis of the scheduler activity.

About the performance impact of tracepoints (which is comparable to markers),
even without immediate values optimizations, tests done by Hideo Aoki on ia64
show no regression. His test case was using hackbench on a kernel where
scheduler instrumentation (about 5 events in the scheduler code) was added.
See the "Tracepoints" patch header for detailed performance results.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: 'Peter Zijlstra' <peterz@infradead.org>
CC: 'Steven Rostedt' <rostedt@goodmis.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Masami Hiramatsu <mhiramat@redhat.com>
CC: "Frank Ch. Eigler" <fche@redhat.com>
CC: 'Ingo Molnar' <mingo@elte.hu>
CC: 'Hideo AOKI' <haoki@redhat.com>
CC: Takashi Nishiie <t-nishiie@np.css.fujitsu.com>
CC: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
---
 kernel/exit.c        |    6 ++++++
 kernel/fork.c        |    3 +++
 kernel/kthread.c     |    5 +++++
 kernel/sched-trace.h |   43 +++++++++++++++++++++++++++++++++++++++++++
 kernel/sched.c       |    4 ++++
 kernel/signal.c      |    3 +++
 6 files changed, 64 insertions(+)

Index: linux-2.6-lttng/kernel/kthread.c
===================================================================
--- linux-2.6-lttng.orig/kernel/kthread.c	2008-07-09 10:55:46.000000000 -0400
+++ linux-2.6-lttng/kernel/kthread.c	2008-07-09 10:57:43.000000000 -0400
@@ -13,6 +13,7 @@
 #include <linux/file.h>
 #include <linux/module.h>
 #include <linux/mutex.h>
+#include "sched-trace.h"
 
 #define KTHREAD_NICE_LEVEL (-5)
 
@@ -187,6 +188,8 @@ int kthread_stop(struct task_struct *k)
 	/* It could exit after stop_info.k set, but before wake_up_process. */
 	get_task_struct(k);
 
+	trace_sched_kthread_stop(k);
+
 	/* Must init completion *before* thread sees kthread_stop_info.k */
 	init_completion(&kthread_stop_info.done);
 	smp_wmb();
@@ -202,6 +205,8 @@ int kthread_stop(struct task_struct *k)
 	ret = kthread_stop_info.err;
 	mutex_unlock(&kthread_stop_lock);
 
+	trace_sched_kthread_stop_ret(ret);
+
 	return ret;
 }
 EXPORT_SYMBOL(kthread_stop);
Index: linux-2.6-lttng/kernel/sched.c
===================================================================
--- linux-2.6-lttng.orig/kernel/sched.c	2008-07-09 10:55:46.000000000 -0400
+++ linux-2.6-lttng/kernel/sched.c	2008-07-09 10:57:43.000000000 -0400
@@ -1987,6 +1987,7 @@ void wait_task_inactive(struct task_stru
 		 * just go back and repeat.
 		 */
 		rq = task_rq_lock(p, &flags);
+		trace_sched_wait_task(p);
 		running = task_running(rq, p);
 		on_rq = p->se.on_rq;
 		task_rq_unlock(rq, &flags);
@@ -2275,6 +2276,7 @@ static int try_to_wake_up(struct task_st
 
 	smp_wmb();
 	rq = task_rq_lock(p, &flags);
+	trace_sched_try_wakeup(p);
 	old_state = p->state;
 	if (!(old_state & state))
 		goto out;
@@ -2457,6 +2459,7 @@ void wake_up_new_task(struct task_struct
 	struct rq *rq;
 
 	rq = task_rq_lock(p, &flags);
+	trace_sched_wakeup_new_task(p);
 	BUG_ON(p->state != TASK_RUNNING);
 	update_rq_clock(rq);
 
@@ -2884,6 +2887,7 @@ static void sched_migrate_task(struct ta
 	    || unlikely(cpu_is_offline(dest_cpu)))
 		goto out;
 
+	trace_sched_migrate_task(p, dest_cpu);
 	/* force the process onto the specified CPU */
 	if (migrate_task(p, dest_cpu, &req)) {
 		/* Need to wait for migration thread (might exit: take ref). */
Index: linux-2.6-lttng/kernel/exit.c
===================================================================
--- linux-2.6-lttng.orig/kernel/exit.c	2008-07-09 10:55:46.000000000 -0400
+++ linux-2.6-lttng/kernel/exit.c	2008-07-09 10:57:43.000000000 -0400
@@ -46,6 +46,7 @@
 #include <linux/resource.h>
 #include <linux/blkdev.h>
 #include <linux/task_io_accounting_ops.h>
+#include "sched-trace.h"
 
 #include <asm/uaccess.h>
 #include <asm/unistd.h>
@@ -149,6 +150,7 @@ static void __exit_signal(struct task_st
 
 static void delayed_put_task_struct(struct rcu_head *rhp)
 {
+	trace_sched_process_free(container_of(rhp, struct task_struct, rcu));
 	put_task_struct(container_of(rhp, struct task_struct, rcu));
 }
 
@@ -1040,6 +1042,8 @@ NORET_TYPE void do_exit(long code)
 
 	if (group_dead)
 		acct_process();
+	trace_sched_process_exit(tsk);
+
 	exit_sem(tsk);
 	exit_files(tsk);
 	exit_fs(tsk);
@@ -1524,6 +1528,8 @@ static long do_wait(enum pid_type type, 
 	struct task_struct *tsk;
 	int flag, retval;
 
+	trace_sched_process_wait(pid);
+
 	add_wait_queue(&current->signal->wait_chldexit,&wait);
 repeat:
 	/* If there is nothing that can match our critier just get out */
Index: linux-2.6-lttng/kernel/fork.c
===================================================================
--- linux-2.6-lttng.orig/kernel/fork.c	2008-07-09 10:55:46.000000000 -0400
+++ linux-2.6-lttng/kernel/fork.c	2008-07-09 10:58:05.000000000 -0400
@@ -56,6 +56,7 @@
 #include <linux/proc_fs.h>
 #include <linux/blkdev.h>
 #include <linux/magic.h>
+#include "sched-trace.h"
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -1362,6 +1363,8 @@ long do_fork(unsigned long clone_flags,
 	if (!IS_ERR(p)) {
 		struct completion vfork;
 
+		trace_sched_process_fork(current, p);
+
 		nr = task_pid_vnr(p);
 
 		if (clone_flags & CLONE_PARENT_SETTID)
Index: linux-2.6-lttng/kernel/signal.c
===================================================================
--- linux-2.6-lttng.orig/kernel/signal.c	2008-07-09 10:46:33.000000000 -0400
+++ linux-2.6-lttng/kernel/signal.c	2008-07-09 10:57:43.000000000 -0400
@@ -26,6 +26,7 @@
 #include <linux/freezer.h>
 #include <linux/pid_namespace.h>
 #include <linux/nsproxy.h>
+#include "sched-trace.h"
 
 #include <asm/param.h>
 #include <asm/uaccess.h>
@@ -807,6 +808,8 @@ static int send_signal(int sig, struct s
 	struct sigpending *pending;
 	struct sigqueue *q;
 
+	trace_sched_signal_send(sig, t);
+
 	assert_spin_locked(&t->sighand->siglock);
 	if (!prepare_signal(sig, t))
 		return 0;
Index: linux-2.6-lttng/kernel/sched-trace.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/kernel/sched-trace.h	2008-07-09 10:57:43.000000000 -0400
@@ -0,0 +1,43 @@
+#ifndef _SCHED_TRACE_H
+#define _SCHED_TRACE_H
+
+#include <linux/tracepoint.h>
+
+DEFINE_TRACE(sched_kthread_stop,
+	TPPROTO(struct task_struct *t),
+	TPARGS(t));
+DEFINE_TRACE(sched_kthread_stop_ret,
+	TPPROTO(int ret),
+	TPARGS(ret));
+DEFINE_TRACE(sched_wait_task,
+	TPPROTO(struct task_struct *p),
+	TPARGS(p));
+DEFINE_TRACE(sched_try_wakeup,
+	TPPROTO(struct task_struct *p),
+	TPARGS(p));
+DEFINE_TRACE(sched_wakeup_new_task,
+	TPPROTO(struct task_struct *p),
+	TPARGS(p));
+DEFINE_TRACE(sched_switch,
+	TPPROTO(struct task_struct *prev, struct task_struct *next),
+	TPARGS(prev, next));
+DEFINE_TRACE(sched_migrate_task,
+	TPPROTO(struct task_struct *p, int dest_cpu),
+	TPARGS(p, dest_cpu));
+DEFINE_TRACE(sched_process_free,
+	TPPROTO(struct task_struct *p),
+	TPARGS(p));
+DEFINE_TRACE(sched_process_exit,
+	TPPROTO(struct task_struct *p),
+	TPARGS(p));
+DEFINE_TRACE(sched_process_wait,
+	TPPROTO(struct pid *pid),
+	TPARGS(pid));
+DEFINE_TRACE(sched_process_fork,
+	TPPROTO(struct task_struct *parent, struct task_struct *child),
+	TPARGS(parent, child));
+DEFINE_TRACE(sched_signal_send,
+	TPPROTO(int sig, struct task_struct *p),
+	TPARGS(sig, p));
+
+#endif



* [patch 06/15] LTTng instrumentation - timer
  2008-07-09 14:59 [patch 00/15] Tracepoints v3 for linux-next Mathieu Desnoyers
                   ` (4 preceding siblings ...)
  2008-07-09 14:59 ` [patch 05/15] LTTng instrumentation - scheduler Mathieu Desnoyers
@ 2008-07-09 14:59 ` Mathieu Desnoyers
  2008-07-09 14:59 ` [patch 07/15] LTTng instrumentation - kernel Mathieu Desnoyers
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 58+ messages in thread
From: Mathieu Desnoyers @ 2008-07-09 14:59 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel
  Cc: Mathieu Desnoyers, David S. Miller, Masami Hiramatsu,
	Peter Zijlstra, Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie,
	Steven Rostedt, Eduard - Gabriel Munteanu

[-- Attachment #1: lttng-instrumentation-timer.patch --]
[-- Type: text/plain, Size: 4609 bytes --]

Instrument timer activity (timer set, timer expiry, current time updates) to keep
information about the "real time" flow within the kernel. It can be used by a
trace analysis tool to synchronize information coming from various sources, e.g.
to merge traces with system logs.

Those tracepoints are used by LTTng.

About the performance impact of tracepoints (which is comparable to that of
markers), even without immediate values optimizations, tests done by Hideo Aoki
on ia64 show no regression. His test case used hackbench on a kernel where
scheduler instrumentation (about 5 events in core scheduler code) was added.
See the "Tracepoints" patch header for detailed performance results.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: 'Ingo Molnar' <mingo@elte.hu>
CC: "David S. Miller" <davem@davemloft.net>
CC: Masami Hiramatsu <mhiramat@redhat.com>
CC: 'Peter Zijlstra' <peterz@infradead.org>
CC: "Frank Ch. Eigler" <fche@redhat.com>
CC: 'Hideo AOKI' <haoki@redhat.com>
CC: Takashi Nishiie <t-nishiie@np.css.fujitsu.com>
CC: 'Steven Rostedt' <rostedt@goodmis.org>
CC: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
---
 kernel/itimer.c      |    5 +++++
 kernel/timer-trace.h |   24 ++++++++++++++++++++++++
 kernel/timer.c       |    8 +++++++-
 3 files changed, 36 insertions(+), 1 deletion(-)

Index: linux-2.6-lttng/kernel/itimer.c
===================================================================
--- linux-2.6-lttng.orig/kernel/itimer.c	2008-07-09 10:46:33.000000000 -0400
+++ linux-2.6-lttng/kernel/itimer.c	2008-07-09 10:58:07.000000000 -0400
@@ -12,6 +12,7 @@
 #include <linux/time.h>
 #include <linux/posix-timers.h>
 #include <linux/hrtimer.h>
+#include "timer-trace.h"
 
 #include <asm/uaccess.h>
 
@@ -132,6 +133,8 @@ enum hrtimer_restart it_real_fn(struct h
 	struct signal_struct *sig =
 		container_of(timer, struct signal_struct, real_timer);
 
+	trace_timer_itimer_expired(sig);
+
 	kill_pid_info(SIGALRM, SEND_SIG_PRIV, sig->leader_pid);
 
 	return HRTIMER_NORESTART;
@@ -157,6 +160,8 @@ int do_setitimer(int which, struct itime
 	    !timeval_valid(&value->it_interval))
 		return -EINVAL;
 
+	trace_timer_itimer_set(which, value);
+
 	switch (which) {
 	case ITIMER_REAL:
 again:
Index: linux-2.6-lttng/kernel/timer.c
===================================================================
--- linux-2.6-lttng.orig/kernel/timer.c	2008-07-09 10:55:46.000000000 -0400
+++ linux-2.6-lttng/kernel/timer.c	2008-07-09 10:58:07.000000000 -0400
@@ -37,12 +37,14 @@
 #include <linux/delay.h>
 #include <linux/tick.h>
 #include <linux/kallsyms.h>
+#include "timer-trace.h"
 
 #include <asm/uaccess.h>
 #include <asm/unistd.h>
 #include <asm/div64.h>
 #include <asm/timex.h>
 #include <asm/io.h>
+#include <asm/irq_regs.h>
 
 u64 jiffies_64 __cacheline_aligned_in_smp = INITIAL_JIFFIES;
 
@@ -288,6 +290,7 @@ static void internal_add_timer(struct tv
 		i = (expires >> (TVR_BITS + 3 * TVN_BITS)) & TVN_MASK;
 		vec = base->tv5.vec + i;
 	}
+	trace_timer_set(timer);
 	/*
 	 * Timers are FIFO:
 	 */
@@ -1066,6 +1069,7 @@ void do_timer(unsigned long ticks)
 {
 	jiffies_64 += ticks;
 	update_times(ticks);
+	trace_timer_update_time(&xtime, &wall_to_monotonic);
 }
 
 #ifdef __ARCH_WANT_SYS_ALARM
@@ -1147,7 +1151,9 @@ asmlinkage long sys_getegid(void)
 
 static void process_timeout(unsigned long __data)
 {
-	wake_up_process((struct task_struct *)__data);
+	struct task_struct *task = (struct task_struct *)__data;
+	trace_timer_timeout(task);
+	wake_up_process(task);
 }
 
 /**
Index: linux-2.6-lttng/kernel/timer-trace.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/kernel/timer-trace.h	2008-07-09 10:58:07.000000000 -0400
@@ -0,0 +1,24 @@
+#ifndef _TIMER_TRACE_H
+#define _TIMER_TRACE_H
+
+#include <linux/tracepoint.h>
+
+DEFINE_TRACE(timer_itimer_expired,
+	TPPROTO(struct signal_struct *sig),
+	TPARGS(sig));
+DEFINE_TRACE(timer_itimer_set,
+	TPPROTO(int which, struct itimerval *value),
+	TPARGS(which, value));
+DEFINE_TRACE(timer_set,
+	TPPROTO(struct timer_list *timer),
+	TPARGS(timer));
+/*
+ * xtime_lock is taken when kernel_timer_update_time tracepoint is reached.
+ */
+DEFINE_TRACE(timer_update_time,
+	TPPROTO(struct timespec *_xtime, struct timespec *_wall_to_monotonic),
+	TPARGS(_xtime, _wall_to_monotonic));
+DEFINE_TRACE(timer_timeout,
+	TPPROTO(struct task_struct *p),
+	TPARGS(p));
+#endif

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 58+ messages in thread

* [patch 07/15] LTTng instrumentation - kernel
  2008-07-09 14:59 [patch 00/15] Tracepoints v3 for linux-next Mathieu Desnoyers
                   ` (5 preceding siblings ...)
  2008-07-09 14:59 ` [patch 06/15] LTTng instrumentation - timer Mathieu Desnoyers
@ 2008-07-09 14:59 ` Mathieu Desnoyers
  2008-07-09 14:59 ` [patch 08/15] LTTng instrumentation - filemap Mathieu Desnoyers
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 58+ messages in thread
From: Mathieu Desnoyers @ 2008-07-09 14:59 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel
  Cc: Mathieu Desnoyers, Masami Hiramatsu, Peter Zijlstra,
	Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie, Steven Rostedt,
	Eduard - Gabriel Munteanu

[-- Attachment #1: lttng-instrumentation-kernel.patch --]
[-- Type: text/plain, Size: 4146 bytes --]

Instrument the core kernel: module load/free and printk events. This helps the
tracer keep track of module-related events and export valuable printk
information into the traces.

Those tracepoints are used by LTTng.

About the performance impact of tracepoints (which is comparable to that of
markers), even without immediate values optimizations, tests done by Hideo Aoki
on ia64 show no regression. His test case used hackbench on a kernel where
scheduler instrumentation (about 5 events in core scheduler code) was added.
See the "Tracepoints" patch header for detailed performance results.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Masami Hiramatsu <mhiramat@redhat.com>
CC: 'Peter Zijlstra' <peterz@infradead.org>
CC: "Frank Ch. Eigler" <fche@redhat.com>
CC: 'Ingo Molnar' <mingo@elte.hu>
CC: 'Hideo AOKI' <haoki@redhat.com>
CC: Takashi Nishiie <t-nishiie@np.css.fujitsu.com>
CC: 'Steven Rostedt' <rostedt@goodmis.org>
CC: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
---
 kernel/kernel-trace.h |   19 +++++++++++++++++++
 kernel/module.c       |    5 +++++
 kernel/printk.c       |    6 ++++++
 3 files changed, 30 insertions(+)

Index: linux-2.6-lttng/kernel/printk.c
===================================================================
--- linux-2.6-lttng.orig/kernel/printk.c	2008-07-09 10:55:46.000000000 -0400
+++ linux-2.6-lttng/kernel/printk.c	2008-07-09 10:58:11.000000000 -0400
@@ -32,6 +32,7 @@
 #include <linux/security.h>
 #include <linux/bootmem.h>
 #include <linux/syscalls.h>
+#include "kernel-trace.h"
 
 #include <asm/uaccess.h>
 
@@ -59,6 +60,7 @@ int console_printk[4] = {
 	MINIMUM_CONSOLE_LOGLEVEL,	/* minimum_console_loglevel */
 	DEFAULT_CONSOLE_LOGLEVEL,	/* default_console_loglevel */
 };
+EXPORT_SYMBOL_GPL(console_printk);
 
 /*
  * Low level drivers may need that to know if they can schedule in
@@ -601,6 +603,7 @@ asmlinkage int printk(const char *fmt, .
 	int r;
 
 	va_start(args, fmt);
+	trace_kernel_printk(__builtin_return_address(0));
 	r = vprintk(fmt, args);
 	va_end(args);
 
@@ -677,6 +680,9 @@ asmlinkage int vprintk(const char *fmt, 
 	raw_local_irq_save(flags);
 	this_cpu = smp_processor_id();
 
+	trace_kernel_vprintk(__builtin_return_address(0),
+		printk_buf, printed_len);
+
 	/*
 	 * Ouch, printk recursed into itself!
 	 */
Index: linux-2.6-lttng/kernel/module.c
===================================================================
--- linux-2.6-lttng.orig/kernel/module.c	2008-07-09 10:55:58.000000000 -0400
+++ linux-2.6-lttng/kernel/module.c	2008-07-09 10:58:11.000000000 -0400
@@ -48,6 +48,7 @@
 #include <linux/license.h>
 #include <asm/sections.h>
 #include <linux/tracepoint.h>
+#include "kernel-trace.h"
 
 #if 0
 #define DEBUGP printk
@@ -1422,6 +1423,8 @@ static int __unlink_module(void *_mod)
 /* Free a module, remove from lists, etc (must hold module_mutex). */
 static void free_module(struct module *mod)
 {
+	trace_kernel_module_free(mod);
+
 	/* Delete from various lists */
 	stop_machine_run(__unlink_module, mod, NULL);
 	remove_notes_attrs(mod);
@@ -2237,6 +2240,8 @@ static struct module *load_module(void _
 	/* Get rid of temporary copy */
 	vfree(hdr);
 
+	trace_kernel_module_load(mod);
+
 	/* Done! */
 	return mod;
 
Index: linux-2.6-lttng/kernel/kernel-trace.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/kernel/kernel-trace.h	2008-07-09 10:58:11.000000000 -0400
@@ -0,0 +1,19 @@
+#ifndef _KERNEL_TRACE_H
+#define _KERNEL_TRACE_H
+
+#include <linux/tracepoint.h>
+
+DEFINE_TRACE(kernel_printk,
+	TPPROTO(void *retaddr),
+	TPARGS(retaddr));
+DEFINE_TRACE(kernel_vprintk,
+	TPPROTO(void *retaddr, char *buf, int len),
+	TPARGS(retaddr, buf, len));
+DEFINE_TRACE(kernel_module_free,
+	TPPROTO(struct module *mod),
+	TPARGS(mod));
+DEFINE_TRACE(kernel_module_load,
+	TPPROTO(struct module *mod),
+	TPARGS(mod));
+
+#endif

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 58+ messages in thread

* [patch 08/15] LTTng instrumentation - filemap
  2008-07-09 14:59 [patch 00/15] Tracepoints v3 for linux-next Mathieu Desnoyers
                   ` (6 preceding siblings ...)
  2008-07-09 14:59 ` [patch 07/15] LTTng instrumentation - kernel Mathieu Desnoyers
@ 2008-07-09 14:59 ` Mathieu Desnoyers
  2008-07-09 14:59 ` [patch 09/15] LTTng instrumentation - swap Mathieu Desnoyers
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 58+ messages in thread
From: Mathieu Desnoyers @ 2008-07-09 14:59 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel
  Cc: Mathieu Desnoyers, linux-mm, Dave Hansen, Masami Hiramatsu,
	Peter Zijlstra, Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie,
	Steven Rostedt, Eduard - Gabriel Munteanu

[-- Attachment #1: lttng-instrumentation-filemap.patch --]
[-- Type: text/plain, Size: 2589 bytes --]

Instrumentation of waits caused by memory accesses on mmap regions.

Those tracepoints are used by LTTng.

About the performance impact of tracepoints (which is comparable to that of
markers), even without immediate values optimizations, tests done by Hideo Aoki
on ia64 show no regression. His test case used hackbench on a kernel where
scheduler instrumentation (about 5 events in core scheduler code) was added.
See the "Tracepoints" patch header for detailed performance results.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: linux-mm@kvack.org
CC: Dave Hansen <haveblue@us.ibm.com>
CC: Masami Hiramatsu <mhiramat@redhat.com>
CC: 'Peter Zijlstra' <peterz@infradead.org>
CC: "Frank Ch. Eigler" <fche@redhat.com>
CC: 'Ingo Molnar' <mingo@elte.hu>
CC: 'Hideo AOKI' <haoki@redhat.com>
CC: Takashi Nishiie <t-nishiie@np.css.fujitsu.com>
CC: 'Steven Rostedt' <rostedt@goodmis.org>
CC: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
---
 mm/filemap-trace.h |   13 +++++++++++++
 mm/filemap.c       |    3 +++
 2 files changed, 16 insertions(+)

Index: linux-2.6-lttng/mm/filemap.c
===================================================================
--- linux-2.6-lttng.orig/mm/filemap.c	2008-07-09 10:55:46.000000000 -0400
+++ linux-2.6-lttng/mm/filemap.c	2008-07-09 10:58:27.000000000 -0400
@@ -33,6 +33,7 @@
 #include <linux/cpuset.h>
 #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
 #include <linux/memcontrol.h>
+#include "filemap-trace.h"
 #include "internal.h"
 
 /*
@@ -541,9 +542,11 @@ void wait_on_page_bit(struct page *page,
 {
 	DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);
 
+	trace_filemap_wait_start(page, bit_nr);
 	if (test_bit(bit_nr, &page->flags))
 		__wait_on_bit(page_waitqueue(page), &wait, sync_page,
 							TASK_UNINTERRUPTIBLE);
+	trace_filemap_wait_end(page, bit_nr);
 }
 EXPORT_SYMBOL(wait_on_page_bit);
 
Index: linux-2.6-lttng/mm/filemap-trace.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/mm/filemap-trace.h	2008-07-09 10:58:27.000000000 -0400
@@ -0,0 +1,13 @@
+#ifndef _FILEMAP_TRACE_H
+#define _FILEMAP_TRACE_H
+
+#include <linux/tracepoint.h>
+
+DEFINE_TRACE(filemap_wait_start,
+	TPPROTO(struct page *page, int bit_nr),
+	TPARGS(page, bit_nr));
+DEFINE_TRACE(filemap_wait_end,
+	TPPROTO(struct page *page, int bit_nr),
+	TPARGS(page, bit_nr));
+
+#endif

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 58+ messages in thread

* [patch 09/15] LTTng instrumentation - swap
  2008-07-09 14:59 [patch 00/15] Tracepoints v3 for linux-next Mathieu Desnoyers
                   ` (7 preceding siblings ...)
  2008-07-09 14:59 ` [patch 08/15] LTTng instrumentation - filemap Mathieu Desnoyers
@ 2008-07-09 14:59 ` Mathieu Desnoyers
  2008-07-09 14:59 ` [patch 10/15] LTTng instrumentation - memory page faults Mathieu Desnoyers
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 58+ messages in thread
From: Mathieu Desnoyers @ 2008-07-09 14:59 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel
  Cc: Mathieu Desnoyers, linux-mm, Dave Hansen, Masami Hiramatsu,
	Peter Zijlstra, Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie,
	Steven Rostedt, Eduard - Gabriel Munteanu

[-- Attachment #1: lttng-instrumentation-swap.patch --]
[-- Type: text/plain, Size: 4549 bytes --]

Instrumentation of waits caused by swap activity. Also instruments
swapon/swapoff events to keep track of active swap partitions.

Those tracepoints are used by LTTng.

About the performance impact of tracepoints (which is comparable to that of
markers), even without immediate values optimizations, tests done by Hideo Aoki
on ia64 show no regression. His test case used hackbench on a kernel where
scheduler instrumentation (about 5 events in core scheduler code) was added.
See the "Tracepoints" patch header for detailed performance results.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: linux-mm@kvack.org
CC: Dave Hansen <haveblue@us.ibm.com>
CC: Masami Hiramatsu <mhiramat@redhat.com>
CC: 'Peter Zijlstra' <peterz@infradead.org>
CC: "Frank Ch. Eigler" <fche@redhat.com>
CC: 'Ingo Molnar' <mingo@elte.hu>
CC: 'Hideo AOKI' <haoki@redhat.com>
CC: Takashi Nishiie <t-nishiie@np.css.fujitsu.com>
CC: 'Steven Rostedt' <rostedt@goodmis.org>
CC: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
---
 mm/memory.c     |    2 ++
 mm/page_io.c    |    2 ++
 mm/swap-trace.h |   20 ++++++++++++++++++++
 mm/swapfile.c   |    4 ++++
 4 files changed, 28 insertions(+)

Index: linux-2.6-lttng/mm/memory.c
===================================================================
--- linux-2.6-lttng.orig/mm/memory.c	2008-07-09 10:46:33.000000000 -0400
+++ linux-2.6-lttng/mm/memory.c	2008-07-09 10:58:31.000000000 -0400
@@ -51,6 +51,7 @@
 #include <linux/init.h>
 #include <linux/writeback.h>
 #include <linux/memcontrol.h>
+#include "swap-trace.h"
 
 #include <asm/pgalloc.h>
 #include <asm/uaccess.h>
@@ -2213,6 +2214,7 @@ static int do_swap_page(struct mm_struct
 		/* Had to read the page from swap area: Major fault */
 		ret = VM_FAULT_MAJOR;
 		count_vm_event(PGMAJFAULT);
+		trace_swap_in(page, entry);
 	}
 
 	if (mem_cgroup_charge(page, mm, GFP_KERNEL)) {
Index: linux-2.6-lttng/mm/page_io.c
===================================================================
--- linux-2.6-lttng.orig/mm/page_io.c	2008-07-09 10:46:33.000000000 -0400
+++ linux-2.6-lttng/mm/page_io.c	2008-07-09 10:58:31.000000000 -0400
@@ -17,6 +17,7 @@
 #include <linux/bio.h>
 #include <linux/swapops.h>
 #include <linux/writeback.h>
+#include "swap-trace.h"
 #include <asm/pgtable.h>
 
 static struct bio *get_swap_bio(gfp_t gfp_flags, pgoff_t index,
@@ -114,6 +115,7 @@ int swap_writepage(struct page *page, st
 		rw |= (1 << BIO_RW_SYNC);
 	count_vm_event(PSWPOUT);
 	set_page_writeback(page);
+	trace_swap_out(page);
 	unlock_page(page);
 	submit_bio(rw, bio);
 out:
Index: linux-2.6-lttng/mm/swapfile.c
===================================================================
--- linux-2.6-lttng.orig/mm/swapfile.c	2008-07-09 10:46:33.000000000 -0400
+++ linux-2.6-lttng/mm/swapfile.c	2008-07-09 10:58:31.000000000 -0400
@@ -32,6 +32,7 @@
 #include <asm/pgtable.h>
 #include <asm/tlbflush.h>
 #include <linux/swapops.h>
+#include "swap-trace.h"
 
 DEFINE_SPINLOCK(swap_lock);
 unsigned int nr_swapfiles;
@@ -1310,6 +1311,7 @@ asmlinkage long sys_swapoff(const char _
 	swap_map = p->swap_map;
 	p->swap_map = NULL;
 	p->flags = 0;
+	trace_swap_file_close(swap_file);
 	spin_unlock(&swap_lock);
 	mutex_unlock(&swapon_mutex);
 	vfree(swap_map);
@@ -1695,6 +1697,7 @@ asmlinkage long sys_swapon(const char __
 	} else {
 		swap_info[prev].next = p - swap_info;
 	}
+	trace_swap_file_open(swap_file, name);
 	spin_unlock(&swap_lock);
 	mutex_unlock(&swapon_mutex);
 	error = 0;
@@ -1796,6 +1799,7 @@ get_swap_info_struct(unsigned type)
 {
 	return &swap_info[type];
 }
+EXPORT_SYMBOL_GPL(get_swap_info_struct);
 
 /*
  * swap_lock prevents swap_map being freed. Don't grab an extra
Index: linux-2.6-lttng/mm/swap-trace.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/mm/swap-trace.h	2008-07-09 10:58:31.000000000 -0400
@@ -0,0 +1,20 @@
+#ifndef _SWAP_TRACE_H
+#define _SWAP_TRACE_H
+
+#include <linux/swap.h>
+#include <linux/tracepoint.h>
+
+DEFINE_TRACE(swap_in,
+	TPPROTO(struct page *page, swp_entry_t entry),
+	TPARGS(page, entry));
+DEFINE_TRACE(swap_out,
+	TPPROTO(struct page *page),
+	TPARGS(page));
+DEFINE_TRACE(swap_file_open,
+	TPPROTO(struct file *file, char *filename),
+	TPARGS(file, filename));
+DEFINE_TRACE(swap_file_close,
+	TPPROTO(struct file *file),
+	TPARGS(file));
+
+#endif

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 58+ messages in thread

* [patch 10/15] LTTng instrumentation - memory page faults
  2008-07-09 14:59 [patch 00/15] Tracepoints v3 for linux-next Mathieu Desnoyers
                   ` (8 preceding siblings ...)
  2008-07-09 14:59 ` [patch 09/15] LTTng instrumentation - swap Mathieu Desnoyers
@ 2008-07-09 14:59 ` Mathieu Desnoyers
  2008-07-09 14:59 ` [patch 11/15] LTTng instrumentation - page Mathieu Desnoyers
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 58+ messages in thread
From: Mathieu Desnoyers @ 2008-07-09 14:59 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel
  Cc: Mathieu Desnoyers, Andi Kleen, linux-mm, Dave Hansen,
	Masami Hiramatsu, Peter Zijlstra, Frank Ch. Eigler, Hideo AOKI,
	Takashi Nishiie, Steven Rostedt, Eduard - Gabriel Munteanu

[-- Attachment #1: lttng-instrumentation-memory.patch --]
[-- Type: text/plain, Size: 3639 bytes --]

Instrument page fault entry and exit. Useful for detecting delays caused by
page faults and bad memory usage patterns.

Those tracepoints are used by LTTng.

About the performance impact of tracepoints (which is comparable to that of
markers), even without immediate values optimizations, tests done by Hideo Aoki
on ia64 show no regression. His test case used hackbench on a kernel where
scheduler instrumentation (about 5 events in core scheduler code) was added.
See the "Tracepoints" patch header for detailed performance results.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Andi Kleen <ak@suse.de>
CC: linux-mm@kvack.org
CC: Dave Hansen <haveblue@us.ibm.com>
CC: Masami Hiramatsu <mhiramat@redhat.com>
CC: 'Peter Zijlstra' <peterz@infradead.org>
CC: "Frank Ch. Eigler" <fche@redhat.com>
CC: 'Ingo Molnar' <mingo@elte.hu>
CC: 'Hideo AOKI' <haoki@redhat.com>
CC: Takashi Nishiie <t-nishiie@np.css.fujitsu.com>
CC: 'Steven Rostedt' <rostedt@goodmis.org>
CC: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
---
 mm/memory-trace.h |   14 ++++++++++++++
 mm/memory.c       |   33 ++++++++++++++++++++++++---------
 2 files changed, 38 insertions(+), 9 deletions(-)

Index: linux-2.6-lttng/mm/memory.c
===================================================================
--- linux-2.6-lttng.orig/mm/memory.c	2008-07-09 10:58:31.000000000 -0400
+++ linux-2.6-lttng/mm/memory.c	2008-07-09 10:58:34.000000000 -0400
@@ -61,6 +61,7 @@
 
 #include <linux/swapops.h>
 #include <linux/elf.h>
+#include "memory-trace.h"
 
 #ifndef CONFIG_NEED_MULTIPLE_NODES
 /* use the per-pgdat data instead for discontigmem - mbligh */
@@ -2664,30 +2665,44 @@ unlock:
 int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		unsigned long address, int write_access)
 {
+	int res;
 	pgd_t *pgd;
 	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
 
+	trace_memory_handle_fault_entry(mm, vma, address, write_access);
+
 	__set_current_state(TASK_RUNNING);
 
 	count_vm_event(PGFAULT);
 
-	if (unlikely(is_vm_hugetlb_page(vma)))
-		return hugetlb_fault(mm, vma, address, write_access);
+	if (unlikely(is_vm_hugetlb_page(vma))) {
+		res = hugetlb_fault(mm, vma, address, write_access);
+		goto end;
+	}
 
 	pgd = pgd_offset(mm, address);
 	pud = pud_alloc(mm, pgd, address);
-	if (!pud)
-		return VM_FAULT_OOM;
+	if (!pud) {
+		res = VM_FAULT_OOM;
+		goto end;
+	}
 	pmd = pmd_alloc(mm, pud, address);
-	if (!pmd)
-		return VM_FAULT_OOM;
+	if (!pmd) {
+		res = VM_FAULT_OOM;
+		goto end;
+	}
 	pte = pte_alloc_map(mm, pmd, address);
-	if (!pte)
-		return VM_FAULT_OOM;
+	if (!pte) {
+		res = VM_FAULT_OOM;
+		goto end;
+	}
 
-	return handle_pte_fault(mm, vma, address, pte, pmd, write_access);
+	res = handle_pte_fault(mm, vma, address, pte, pmd, write_access);
+end:
+	trace_memory_handle_fault_exit(res);
+	return res;
 }
 
 #ifndef __PAGETABLE_PUD_FOLDED
Index: linux-2.6-lttng/mm/memory-trace.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/mm/memory-trace.h	2008-07-09 10:58:34.000000000 -0400
@@ -0,0 +1,14 @@
+#ifndef _MEMORY_TRACE_H
+#define _MEMORY_TRACE_H
+
+#include <linux/tracepoint.h>
+
+DEFINE_TRACE(memory_handle_fault_entry,
+	TPPROTO(struct mm_struct *mm, struct vm_area_struct *vma,
+		unsigned long address, int write_access),
+	TPARGS(mm, vma, address, write_access));
+DEFINE_TRACE(memory_handle_fault_exit,
+	TPPROTO(int res),
+	TPARGS(res));
+
+#endif

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 58+ messages in thread

* [patch 11/15] LTTng instrumentation - page
  2008-07-09 14:59 [patch 00/15] Tracepoints v3 for linux-next Mathieu Desnoyers
                   ` (9 preceding siblings ...)
  2008-07-09 14:59 ` [patch 10/15] LTTng instrumentation - memory page faults Mathieu Desnoyers
@ 2008-07-09 14:59 ` Mathieu Desnoyers
  2008-07-09 14:59 ` [patch 12/15] LTTng instrumentation - hugetlb Mathieu Desnoyers
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 58+ messages in thread
From: Mathieu Desnoyers @ 2008-07-09 14:59 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel
  Cc: Mathieu Desnoyers, Martin Bligh, Masami Hiramatsu,
	Peter Zijlstra, Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie,
	Steven Rostedt, Eduard - Gabriel Munteanu

[-- Attachment #1: lttng-instrumentation-page.patch --]
[-- Type: text/plain, Size: 2910 bytes --]

Paging activity instrumentation. Instruments page allocation and freeing to
keep track of page usage. This does not cover hugetlb activity, which is
covered by a separate patch.

Those tracepoints are used by LTTng.

About the performance impact of tracepoints (which is comparable to that of
markers), even without immediate values optimizations, tests done by Hideo Aoki
on ia64 show no regression. His test case used hackbench on a kernel where
scheduler instrumentation (about 5 events in core scheduler code) was added.
See the "Tracepoints" patch header for detailed performance results.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Martin Bligh <mbligh@google.com>
CC: Masami Hiramatsu <mhiramat@redhat.com>
CC: 'Peter Zijlstra' <peterz@infradead.org>
CC: "Frank Ch. Eigler" <fche@redhat.com>
CC: 'Ingo Molnar' <mingo@elte.hu>
CC: 'Hideo AOKI' <haoki@redhat.com>
CC: Takashi Nishiie <t-nishiie@np.css.fujitsu.com>
CC: 'Steven Rostedt' <rostedt@goodmis.org>
CC: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
---
 mm/page-trace.h |   16 ++++++++++++++++
 mm/page_alloc.c |    6 ++++++
 2 files changed, 22 insertions(+)

Index: linux-2.6-lttng/mm/page_alloc.c
===================================================================
--- linux-2.6-lttng.orig/mm/page_alloc.c	2008-07-07 15:56:53.000000000 -0400
+++ linux-2.6-lttng/mm/page_alloc.c	2008-07-07 15:57:46.000000000 -0400
@@ -46,6 +46,7 @@
 #include <linux/page-isolation.h>
 #include <linux/memcontrol.h>
 #include <linux/debugobjects.h>
+#include "page-trace.h"
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -510,6 +511,8 @@ static void __free_pages_ok(struct page 
 	int i;
 	int reserved = 0;
 
+	trace_page_free(page, order);
+
 	for (i = 0 ; i < (1 << order) ; ++i)
 		reserved += free_pages_check(page + i);
 	if (reserved)
@@ -966,6 +969,8 @@ static void free_hot_cold_page(struct pa
 	struct per_cpu_pages *pcp;
 	unsigned long flags;
 
+	trace_page_free(page, 0);
+
 	if (PageAnon(page))
 		page->mapping = NULL;
 	if (free_pages_check(page))
@@ -1630,6 +1635,7 @@ nopage:
 		show_mem();
 	}
 got_pg:
+	trace_page_alloc(page, order);
 	return page;
 }
 
Index: linux-2.6-lttng/mm/page-trace.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/mm/page-trace.h	2008-07-07 15:58:11.000000000 -0400
@@ -0,0 +1,16 @@
+#ifndef _PAGE_TRACE_H
+#define _PAGE_TRACE_H
+
+#include <linux/tracepoint.h>
+
+/*
+ * mm_page_alloc : page can be NULL.
+ */
+DEFINE_TRACE(page_alloc,
+	TPPROTO(struct page *page, unsigned int order),
+	TPARGS(page, order));
+DEFINE_TRACE(page_free,
+	TPPROTO(struct page *page, unsigned int order),
+	TPARGS(page, order));
+
+#endif

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 58+ messages in thread

* [patch 12/15] LTTng instrumentation - hugetlb
  2008-07-09 14:59 [patch 00/15] Tracepoints v3 for linux-next Mathieu Desnoyers
                   ` (10 preceding siblings ...)
  2008-07-09 14:59 ` [patch 11/15] LTTng instrumentation - page Mathieu Desnoyers
@ 2008-07-09 14:59 ` Mathieu Desnoyers
  2008-07-11 14:30   ` [patch 12/15] LTTng instrumentation - hugetlb (update) Mathieu Desnoyers
  2008-07-09 14:59 ` [patch 13/15] LTTng instrumentation - net Mathieu Desnoyers
                   ` (3 subsequent siblings)
  15 siblings, 1 reply; 58+ messages in thread
From: Mathieu Desnoyers @ 2008-07-09 14:59 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel
  Cc: Mathieu Desnoyers, William Lee Irwin III, Masami Hiramatsu,
	Peter Zijlstra, Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie,
	Steven Rostedt, Eduard - Gabriel Munteanu

[-- Attachment #1: lttng-instrumentation-hugetlb.patch --]
[-- Type: text/plain, Size: 3518 bytes --]

Instrumentation of hugetlb activity (alloc/free/reserve).

Those tracepoints are used by LTTng.

About the performance impact of tracepoints (which is comparable to that of
markers), even without immediate values optimizations, tests done by Hideo Aoki
on ia64 show no regression. His test case used hackbench on a kernel where
scheduler instrumentation (about 5 events in core scheduler code) was added.
See the "Tracepoints" patch header for detailed performance results.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: William Lee Irwin III <wli@holomorphy.com>
CC: Masami Hiramatsu <mhiramat@redhat.com>
CC: 'Peter Zijlstra' <peterz@infradead.org>
CC: "Frank Ch. Eigler" <fche@redhat.com>
CC: 'Ingo Molnar' <mingo@elte.hu>
CC: 'Hideo AOKI' <haoki@redhat.com>
CC: Takashi Nishiie <t-nishiie@np.css.fujitsu.com>
CC: 'Steven Rostedt' <rostedt@goodmis.org>
CC: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
---
 mm/hugetlb-trace.h |   19 +++++++++++++++++++
 mm/hugetlb.c       |    8 +++++++-
 2 files changed, 26 insertions(+), 1 deletion(-)

Index: linux-2.6-lttng/mm/hugetlb.c
===================================================================
--- linux-2.6-lttng.orig/mm/hugetlb.c	2008-07-07 16:31:44.000000000 -0400
+++ linux-2.6-lttng/mm/hugetlb.c	2008-07-08 10:51:17.000000000 -0400
@@ -14,6 +14,7 @@
 #include <linux/mempolicy.h>
 #include <linux/cpuset.h>
 #include <linux/mutex.h>
+#include "hugetlb-trace.h"
 
 #include <asm/page.h>
 #include <asm/pgtable.h>
@@ -141,6 +142,7 @@ static void free_huge_page(struct page *
 	int nid = page_to_nid(page);
 	struct address_space *mapping;
 
+	trace_hugetlb_page_free(page);
 	mapping = (struct address_space *) page_private(page);
 	set_page_private(page, 0);
 	BUG_ON(page_count(page));
@@ -509,6 +511,7 @@ static struct page *alloc_huge_page(stru
 	if (!IS_ERR(page)) {
 		set_page_refcounted(page);
 		set_page_private(page, (unsigned long) mapping);
+		trace_hugetlb_page_alloc(page);
 	}
 	return page;
 }
@@ -1306,13 +1309,16 @@ int hugetlb_reserve_pages(struct inode *
 		return ret;
 	}
 	region_add(&inode->i_mapping->private_list, from, to);
+	trace_hugetlb_pages_reserve(inode, from, to);
 	return 0;
 }
 
 void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
 {
-	long chg = region_truncate(&inode->i_mapping->private_list, offset);
+	long chg;
 
+	trace_hugetlb_pages_unreserve(inode, offset, freed);
+	chg = region_truncate(&inode->i_mapping->private_list, offset);
 	spin_lock(&inode->i_lock);
 	inode->i_blocks -= BLOCKS_PER_HUGEPAGE * freed;
 	spin_unlock(&inode->i_lock);
Index: linux-2.6-lttng/mm/hugetlb-trace.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/mm/hugetlb-trace.h	2008-07-08 10:45:15.000000000 -0400
@@ -0,0 +1,19 @@
+#ifndef _HUGETLB_TRACE_H
+#define _HUGETLB_TRACE_H
+
+#include <linux/tracepoint.h>
+
+DEFINE_TRACE(hugetlb_page_alloc,
+	TPPROTO(struct page *page),
+	TPARGS(page));
+DEFINE_TRACE(hugetlb_page_free,
+	TPPROTO(struct page *page),
+	TPARGS(page));
+DEFINE_TRACE(hugetlb_pages_reserve,
+	TPPROTO(struct inode *inode, long from, long to),
+	TPARGS(inode, from, to));
+DEFINE_TRACE(hugetlb_pages_unreserve,
+	TPPROTO(struct inode *inode, long offset, long freed),
+	TPARGS(inode, offset, freed));
+
+#endif

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 58+ messages in thread

* [patch 13/15] LTTng instrumentation - net
  2008-07-09 14:59 [patch 00/15] Tracepoints v3 for linux-next Mathieu Desnoyers
                   ` (11 preceding siblings ...)
  2008-07-09 14:59 ` [patch 12/15] LTTng instrumentation - hugetlb Mathieu Desnoyers
@ 2008-07-09 14:59 ` Mathieu Desnoyers
  2008-07-09 14:59 ` [patch 14/15] LTTng instrumentation - ipv4 Mathieu Desnoyers
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 58+ messages in thread
From: Mathieu Desnoyers @ 2008-07-09 14:59 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel
  Cc: Mathieu Desnoyers, netdev, Jeff Garzik, Masami Hiramatsu,
	Peter Zijlstra, Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie,
	Steven Rostedt, Eduard - Gabriel Munteanu

[-- Attachment #1: lttng-instrumentation-net.patch --]
[-- Type: text/plain, Size: 3027 bytes --]

Network device activity instrumentation (xmit/receive). Allows detecting
when a packet arrives on the network card and when one is about to be
sent. This is the instrumentation point outside of the drivers that is
closest to the hardware. It allows measuring the time a packet takes to
go through the kernel between the system call and the actual delivery to
the network card (given that system calls are instrumented).

Those tracepoints are used by LTTng.

About the performance impact of tracepoints (which is comparable to markers),
even without immediate values optimizations, tests done by Hideo Aoki on ia64
show no regression. His test case was using hackbench on a kernel where
scheduler instrumentation (about 5 events in core scheduler code) was added.
See the "Tracepoints" patch header for performance result details.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: netdev@vger.kernel.org
CC: Jeff Garzik <jgarzik@pobox.com>
CC: Masami Hiramatsu <mhiramat@redhat.com>
CC: 'Peter Zijlstra' <peterz@infradead.org>
CC: "Frank Ch. Eigler" <fche@redhat.com>
CC: 'Ingo Molnar' <mingo@elte.hu>
CC: 'Hideo AOKI' <haoki@redhat.com>
CC: Takashi Nishiie <t-nishiie@np.css.fujitsu.com>
CC: 'Steven Rostedt' <rostedt@goodmis.org>
CC: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
---
 net/core/dev.c       |    4 ++++
 net/core/net-trace.h |   14 ++++++++++++++
 2 files changed, 18 insertions(+)

Index: linux-2.6-lttng/net/core/dev.c
===================================================================
--- linux-2.6-lttng.orig/net/core/dev.c	2008-07-09 10:55:46.000000000 -0400
+++ linux-2.6-lttng/net/core/dev.c	2008-07-09 10:58:38.000000000 -0400
@@ -122,6 +122,7 @@
 #include <linux/if_arp.h>
 #include <linux/if_vlan.h>
 
+#include "net-trace.h"
 #include "net-sysfs.h"
 
 /*
@@ -1699,6 +1700,8 @@ int dev_queue_xmit(struct sk_buff *skb)
 	}
 
 gso:
+	trace_net_dev_xmit(skb);
+
 	spin_lock_prefetch(&dev->queue_lock);
 
 	/* Disable soft irqs for various locks below. Also
@@ -2099,6 +2102,7 @@ int netif_receive_skb(struct sk_buff *sk
 
 	__get_cpu_var(netdev_rx_stat).total++;
 
+	trace_net_dev_receive(skb);
 	skb_reset_network_header(skb);
 	skb_reset_transport_header(skb);
 	skb->mac_len = skb->network_header - skb->mac_header;
Index: linux-2.6-lttng/net/core/net-trace.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/net/core/net-trace.h	2008-07-09 10:58:38.000000000 -0400
@@ -0,0 +1,14 @@
+#ifndef _NET_TRACE_H
+#define _NET_TRACE_H
+
+#include <net/sock.h>
+#include <linux/tracepoint.h>
+
+DEFINE_TRACE(net_dev_xmit,
+	TPPROTO(struct sk_buff *skb),
+	TPARGS(skb));
+DEFINE_TRACE(net_dev_receive,
+	TPPROTO(struct sk_buff *skb),
+	TPARGS(skb));
+
+#endif

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 58+ messages in thread

* [patch 14/15] LTTng instrumentation - ipv4
  2008-07-09 14:59 [patch 00/15] Tracepoints v3 for linux-next Mathieu Desnoyers
                   ` (12 preceding siblings ...)
  2008-07-09 14:59 ` [patch 13/15] LTTng instrumentation - net Mathieu Desnoyers
@ 2008-07-09 14:59 ` Mathieu Desnoyers
  2008-07-09 14:59 ` Mathieu Desnoyers
  2008-07-09 17:01 ` [patch 00/15] Tracepoints v3 for linux-next Masami Hiramatsu
  15 siblings, 0 replies; 58+ messages in thread
From: Mathieu Desnoyers @ 2008-07-09 14:59 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel
  Cc: Mathieu Desnoyers, netdev, David S. Miller, Alexey Kuznetsov,
	Masami Hiramatsu, Peter Zijlstra, Frank Ch. Eigler, Hideo AOKI,
	Takashi Nishiie, Steven Rostedt, Eduard - Gabriel Munteanu

[-- Attachment #1: lttng-instrumentation-ipv4.patch --]
[-- Type: text/plain, Size: 2804 bytes --]

Keep track of IPv4 interface address additions and removals. Allows
tracking interface address changes in a trace.

Those tracepoints are used by LTTng.

About the performance impact of tracepoints (which is comparable to markers),
even without immediate values optimizations, tests done by Hideo Aoki on ia64
show no regression. His test case was using hackbench on a kernel where
scheduler instrumentation (about 5 events in core scheduler code) was added.
See the "Tracepoints" patch header for performance result details.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: netdev@vger.kernel.org
CC: David S. Miller <davem@davemloft.net>
CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
CC: Masami Hiramatsu <mhiramat@redhat.com>
CC: 'Peter Zijlstra' <peterz@infradead.org>
CC: "Frank Ch. Eigler" <fche@redhat.com>
CC: 'Ingo Molnar' <mingo@elte.hu>
CC: 'Hideo AOKI' <haoki@redhat.com>
CC: Takashi Nishiie <t-nishiie@np.css.fujitsu.com>
CC: 'Steven Rostedt' <rostedt@goodmis.org>
CC: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
---
 net/ipv4/devinet.c    |    3 +++
 net/ipv4/ipv4-trace.h |   14 ++++++++++++++
 2 files changed, 17 insertions(+)

Index: linux-2.6-lttng/net/ipv4/devinet.c
===================================================================
--- linux-2.6-lttng.orig/net/ipv4/devinet.c	2008-07-09 10:55:46.000000000 -0400
+++ linux-2.6-lttng/net/ipv4/devinet.c	2008-07-09 10:58:41.000000000 -0400
@@ -61,6 +61,7 @@
 #include <net/ip_fib.h>
 #include <net/rtnetlink.h>
 #include <net/net_namespace.h>
+#include "ipv4-trace.h"
 
 static struct ipv4_devconf ipv4_devconf = {
 	.data = {
@@ -257,6 +258,7 @@ static void __inet_del_ifa(struct in_dev
 		struct in_ifaddr **ifap1 = &ifa1->ifa_next;
 
 		while ((ifa = *ifap1) != NULL) {
+			trace_ipv4_addr_del(ifa);
 			if (!(ifa->ifa_flags & IFA_F_SECONDARY) &&
 			    ifa1->ifa_scope <= ifa->ifa_scope)
 				last_prim = ifa;
@@ -363,6 +365,7 @@ static int __inet_insert_ifa(struct in_i
 			}
 			ifa->ifa_flags |= IFA_F_SECONDARY;
 		}
+		trace_ipv4_addr_add(ifa);
 	}
 
 	if (!(ifa->ifa_flags & IFA_F_SECONDARY)) {
Index: linux-2.6-lttng/net/ipv4/ipv4-trace.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/net/ipv4/ipv4-trace.h	2008-07-09 10:58:41.000000000 -0400
@@ -0,0 +1,14 @@
+#ifndef _IPV4_TRACE_H
+#define _IPV4_TRACE_H
+
+#include <linux/inetdevice.h>
+#include <linux/tracepoint.h>
+
+DEFINE_TRACE(ipv4_addr_add,
+	TPPROTO(struct in_ifaddr *ifa),
+	TPARGS(ifa));
+DEFINE_TRACE(ipv4_addr_del,
+	TPPROTO(struct in_ifaddr *ifa),
+	TPARGS(ifa));
+
+#endif

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 58+ messages in thread

* (no subject)
  2008-07-09 14:59 [patch 00/15] Tracepoints v3 for linux-next Mathieu Desnoyers
                   ` (13 preceding siblings ...)
  2008-07-09 14:59 ` [patch 14/15] LTTng instrumentation - ipv4 Mathieu Desnoyers
@ 2008-07-09 14:59 ` Mathieu Desnoyers
  2008-07-09 17:01 ` [patch 00/15] Tracepoints v3 for linux-next Masami Hiramatsu
  15 siblings, 0 replies; 58+ messages in thread
From: Mathieu Desnoyers @ 2008-07-09 14:59 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch 05/15] LTTng instrumentation - scheduler (repost)
  2008-07-09 14:59 ` [patch 05/15] LTTng instrumentation - scheduler Mathieu Desnoyers
@ 2008-07-09 15:34   ` Mathieu Desnoyers
  2008-07-09 15:39     ` Ingo Molnar
  2008-07-09 16:21     ` [patch 05/15] LTTng instrumentation - scheduler (merge ftrace markers) Mathieu Desnoyers
  0 siblings, 2 replies; 58+ messages in thread
From: Mathieu Desnoyers @ 2008-07-09 15:34 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel
  Cc: Peter Zijlstra, Steven Rostedt, Thomas Gleixner,
	Masami Hiramatsu, Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie,
	Eduard - Gabriel Munteanu

There were 2 rejects when I ported the patch to linux-next. Sorry. Here
is a repost.


Instrument the scheduler activity (sched_switch, migration, wakeups, wait
for a task, signal delivery) and process/thread creation/destruction (fork,
exit, kthread stop). Kthread creation is not instrumented in this patch
because it is architecture dependent. This allows connecting tracers such
as ftrace, which detect scheduling latencies and good/bad scheduler
decisions. Tools like LTTng can export this scheduler information along
with instrumentation of the rest of the kernel activity to perform
post-mortem analysis on the scheduler activity.

About the performance impact of tracepoints (which is comparable to markers),
even without immediate values optimizations, tests done by Hideo Aoki on ia64
show no regression. His test case was using hackbench on a kernel where
scheduler instrumentation (about 5 events in core scheduler code) was added.
See the "Tracepoints" patch header for performance result details.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: 'Peter Zijlstra' <peterz@infradead.org>
CC: 'Steven Rostedt' <rostedt@goodmis.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Masami Hiramatsu <mhiramat@redhat.com>
CC: "Frank Ch. Eigler" <fche@redhat.com>
CC: 'Ingo Molnar' <mingo@elte.hu>
CC: 'Hideo AOKI' <haoki@redhat.com>
CC: Takashi Nishiie <t-nishiie@np.css.fujitsu.com>
CC: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
---
 kernel/exit.c        |    6 ++++++
 kernel/fork.c        |    3 +++
 kernel/kthread.c     |    5 +++++
 kernel/sched-trace.h |   43 +++++++++++++++++++++++++++++++++++++++++++
 kernel/sched.c       |   11 ++++++-----
 kernel/signal.c      |    3 +++
 6 files changed, 66 insertions(+), 5 deletions(-)

Index: linux-2.6-lttng/kernel/kthread.c
===================================================================
--- linux-2.6-lttng.orig/kernel/kthread.c	2008-07-09 11:27:01.000000000 -0400
+++ linux-2.6-lttng/kernel/kthread.c	2008-07-09 11:27:08.000000000 -0400
@@ -13,6 +13,7 @@
 #include <linux/file.h>
 #include <linux/module.h>
 #include <linux/mutex.h>
+#include "sched-trace.h"
 
 #define KTHREAD_NICE_LEVEL (-5)
 
@@ -187,6 +188,8 @@ int kthread_stop(struct task_struct *k)
 	/* It could exit after stop_info.k set, but before wake_up_process. */
 	get_task_struct(k);
 
+	trace_sched_kthread_stop(k);
+
 	/* Must init completion *before* thread sees kthread_stop_info.k */
 	init_completion(&kthread_stop_info.done);
 	smp_wmb();
@@ -202,6 +205,8 @@ int kthread_stop(struct task_struct *k)
 	ret = kthread_stop_info.err;
 	mutex_unlock(&kthread_stop_lock);
 
+	trace_sched_kthread_stop_ret(ret);
+
 	return ret;
 }
 EXPORT_SYMBOL(kthread_stop);
Index: linux-2.6-lttng/kernel/sched.c
===================================================================
--- linux-2.6-lttng.orig/kernel/sched.c	2008-07-09 11:27:01.000000000 -0400
+++ linux-2.6-lttng/kernel/sched.c	2008-07-09 11:27:56.000000000 -0400
@@ -71,6 +71,7 @@
 #include <linux/debugfs.h>
 #include <linux/ctype.h>
 #include <linux/ftrace.h>
+#include "sched-trace.h"
 
 #include <asm/tlb.h>
 #include <asm/irq_regs.h>
@@ -1987,6 +1988,7 @@ void wait_task_inactive(struct task_stru
 		 * just go back and repeat.
 		 */
 		rq = task_rq_lock(p, &flags);
+		trace_sched_wait_task(p);
 		running = task_running(rq, p);
 		on_rq = p->se.on_rq;
 		task_rq_unlock(rq, &flags);
@@ -2275,6 +2277,7 @@ static int try_to_wake_up(struct task_st
 
 	smp_wmb();
 	rq = task_rq_lock(p, &flags);
+	trace_sched_try_wakeup(p);
 	old_state = p->state;
 	if (!(old_state & state))
 		goto out;
@@ -2457,6 +2460,7 @@ void wake_up_new_task(struct task_struct
 	struct rq *rq;
 
 	rq = task_rq_lock(p, &flags);
+	trace_sched_wakeup_new_task(p);
 	BUG_ON(p->state != TASK_RUNNING);
 	update_rq_clock(rq);
 
@@ -2647,11 +2651,7 @@ context_switch(struct rq *rq, struct tas
 	struct mm_struct *mm, *oldmm;
 
 	prepare_task_switch(rq, prev, next);
-	trace_mark(kernel_sched_schedule,
-		"prev_pid %d next_pid %d prev_state %ld "
-		"## rq %p prev %p next %p",
-		prev->pid, next->pid, prev->state,
-		rq, prev, next);
+	trace_sched_switch(prev, next);
 	mm = next->mm;
 	oldmm = prev->active_mm;
 	/*
@@ -2884,6 +2884,7 @@ static void sched_migrate_task(struct ta
 	    || unlikely(cpu_is_offline(dest_cpu)))
 		goto out;
 
+	trace_sched_migrate_task(p, dest_cpu);
 	/* force the process onto the specified CPU */
 	if (migrate_task(p, dest_cpu, &req)) {
 		/* Need to wait for migration thread (might exit: take ref). */
Index: linux-2.6-lttng/kernel/exit.c
===================================================================
--- linux-2.6-lttng.orig/kernel/exit.c	2008-07-09 11:27:01.000000000 -0400
+++ linux-2.6-lttng/kernel/exit.c	2008-07-09 11:27:08.000000000 -0400
@@ -46,6 +46,7 @@
 #include <linux/resource.h>
 #include <linux/blkdev.h>
 #include <linux/task_io_accounting_ops.h>
+#include "sched-trace.h"
 
 #include <asm/uaccess.h>
 #include <asm/unistd.h>
@@ -149,6 +150,7 @@ static void __exit_signal(struct task_st
 
 static void delayed_put_task_struct(struct rcu_head *rhp)
 {
+	trace_sched_process_free(container_of(rhp, struct task_struct, rcu));
 	put_task_struct(container_of(rhp, struct task_struct, rcu));
 }
 
@@ -1040,6 +1042,8 @@ NORET_TYPE void do_exit(long code)
 
 	if (group_dead)
 		acct_process();
+	trace_sched_process_exit(tsk);
+
 	exit_sem(tsk);
 	exit_files(tsk);
 	exit_fs(tsk);
@@ -1524,6 +1528,8 @@ static long do_wait(enum pid_type type, 
 	struct task_struct *tsk;
 	int flag, retval;
 
+	trace_sched_process_wait(pid);
+
 	add_wait_queue(&current->signal->wait_chldexit,&wait);
 repeat:
 	/* If there is nothing that can match our critier just get out */
Index: linux-2.6-lttng/kernel/fork.c
===================================================================
--- linux-2.6-lttng.orig/kernel/fork.c	2008-07-09 11:27:01.000000000 -0400
+++ linux-2.6-lttng/kernel/fork.c	2008-07-09 11:27:08.000000000 -0400
@@ -56,6 +56,7 @@
 #include <linux/proc_fs.h>
 #include <linux/blkdev.h>
 #include <linux/magic.h>
+#include "sched-trace.h"
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -1362,6 +1363,8 @@ long do_fork(unsigned long clone_flags,
 	if (!IS_ERR(p)) {
 		struct completion vfork;
 
+		trace_sched_process_fork(current, p);
+
 		nr = task_pid_vnr(p);
 
 		if (clone_flags & CLONE_PARENT_SETTID)
Index: linux-2.6-lttng/kernel/signal.c
===================================================================
--- linux-2.6-lttng.orig/kernel/signal.c	2008-07-09 11:25:24.000000000 -0400
+++ linux-2.6-lttng/kernel/signal.c	2008-07-09 11:27:08.000000000 -0400
@@ -26,6 +26,7 @@
 #include <linux/freezer.h>
 #include <linux/pid_namespace.h>
 #include <linux/nsproxy.h>
+#include "sched-trace.h"
 
 #include <asm/param.h>
 #include <asm/uaccess.h>
@@ -807,6 +808,8 @@ static int send_signal(int sig, struct s
 	struct sigpending *pending;
 	struct sigqueue *q;
 
+	trace_sched_signal_send(sig, t);
+
 	assert_spin_locked(&t->sighand->siglock);
 	if (!prepare_signal(sig, t))
 		return 0;
Index: linux-2.6-lttng/kernel/sched-trace.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/kernel/sched-trace.h	2008-07-09 11:27:08.000000000 -0400
@@ -0,0 +1,43 @@
+#ifndef _SCHED_TRACE_H
+#define _SCHED_TRACE_H
+
+#include <linux/tracepoint.h>
+
+DEFINE_TRACE(sched_kthread_stop,
+	TPPROTO(struct task_struct *t),
+	TPARGS(t));
+DEFINE_TRACE(sched_kthread_stop_ret,
+	TPPROTO(int ret),
+	TPARGS(ret));
+DEFINE_TRACE(sched_wait_task,
+	TPPROTO(struct task_struct *p),
+	TPARGS(p));
+DEFINE_TRACE(sched_try_wakeup,
+	TPPROTO(struct task_struct *p),
+	TPARGS(p));
+DEFINE_TRACE(sched_wakeup_new_task,
+	TPPROTO(struct task_struct *p),
+	TPARGS(p));
+DEFINE_TRACE(sched_switch,
+	TPPROTO(struct task_struct *prev, struct task_struct *next),
+	TPARGS(prev, next));
+DEFINE_TRACE(sched_migrate_task,
+	TPPROTO(struct task_struct *p, int dest_cpu),
+	TPARGS(p, dest_cpu));
+DEFINE_TRACE(sched_process_free,
+	TPPROTO(struct task_struct *p),
+	TPARGS(p));
+DEFINE_TRACE(sched_process_exit,
+	TPPROTO(struct task_struct *p),
+	TPARGS(p));
+DEFINE_TRACE(sched_process_wait,
+	TPPROTO(struct pid *pid),
+	TPARGS(pid));
+DEFINE_TRACE(sched_process_fork,
+	TPPROTO(struct task_struct *parent, struct task_struct *child),
+	TPARGS(parent, child));
+DEFINE_TRACE(sched_signal_send,
+	TPPROTO(int sig, struct task_struct *p),
+	TPARGS(sig, p));
+
+#endif
-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch 05/15] LTTng instrumentation - scheduler (repost)
  2008-07-09 15:34   ` [patch 05/15] LTTng instrumentation - scheduler (repost) Mathieu Desnoyers
@ 2008-07-09 15:39     ` Ingo Molnar
  2008-07-09 16:00       ` Mathieu Desnoyers
  2008-07-09 16:21     ` [patch 05/15] LTTng instrumentation - scheduler (merge ftrace markers) Mathieu Desnoyers
  1 sibling, 1 reply; 58+ messages in thread
From: Ingo Molnar @ 2008-07-09 15:39 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: akpm, linux-kernel, Peter Zijlstra, Steven Rostedt,
	Thomas Gleixner, Masami Hiramatsu, Frank Ch. Eigler, Hideo AOKI,
	Takashi Nishiie, Eduard - Gabriel Munteanu


* Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:

> There were 2 rejects when I ported the patch to linux-next. Sorry. 
> Here is a repost.

there's still a standing objection (NAK) from Peter Zijlstra so none of 
this can go into linux-next yet i'm afraid.

	Ingo

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch 05/15] LTTng instrumentation - scheduler (repost)
  2008-07-09 15:39     ` Ingo Molnar
@ 2008-07-09 16:00       ` Mathieu Desnoyers
  0 siblings, 0 replies; 58+ messages in thread
From: Mathieu Desnoyers @ 2008-07-09 16:00 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: akpm, linux-kernel, Peter Zijlstra, Steven Rostedt,
	Thomas Gleixner, Masami Hiramatsu, Frank Ch. Eigler, Hideo AOKI,
	Takashi Nishiie, Eduard - Gabriel Munteanu

* Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> 
> > There were 2 rejects when I ported the patch to linux-next. Sorry. 
> > Here is a repost.
> 
> there's still a standing objection (NAK) from Peter Zijlstra so none of 
> this can go into linux-next yet i'm afraid.
> 
> 	Ingo

Hi Ingo,

This "tracepoint" infrastructure has been created to address Peter's
concerns about trace_mark ugliness. The last comment I had from him when
we discussed the tracepoint idea was "Looking forward to your
proposal..", so I guess it's up to him to decide if tracepoints are
clean enough to live.

I'll submit a revised scheduler instrumentation which turns all
scheduler trace_mark() already in linux-next into tracepoints, with the
same arguments. ftrace will have to be updated accordingly.

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch 05/15] LTTng instrumentation - scheduler (merge ftrace markers)
  2008-07-09 15:34   ` [patch 05/15] LTTng instrumentation - scheduler (repost) Mathieu Desnoyers
  2008-07-09 15:39     ` Ingo Molnar
@ 2008-07-09 16:21     ` Mathieu Desnoyers
  2008-07-09 19:09       ` [PATCH] ftrace port to tracepoints (linux-next) Mathieu Desnoyers
  1 sibling, 1 reply; 58+ messages in thread
From: Mathieu Desnoyers @ 2008-07-09 16:21 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel, Peter Zijlstra
  Cc: Steven Rostedt, Thomas Gleixner, Masami Hiramatsu,
	Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie,
	Eduard - Gabriel Munteanu

Peter: this is the instrumentation from ftrace (exact same locations as
the trace_mark statements) turned into tracepoints. How do you like it?
If we merge this, ftrace will have to be adapted accordingly.


LTTng instrumentation - scheduler

Instrument the scheduler activity (sched_switch, migration, wakeups, wait
for a task, signal delivery) and process/thread creation/destruction (fork,
exit, kthread stop). Kthread creation is not instrumented in this patch
because it is architecture dependent. This allows connecting tracers such
as ftrace, which detect scheduling latencies and good/bad scheduler
decisions. Tools like LTTng can export this scheduler information along
with instrumentation of the rest of the kernel activity to perform
post-mortem analysis on the scheduler activity.

About the performance impact of tracepoints (which is comparable to markers),
even without immediate values optimizations, tests done by Hideo Aoki on ia64
show no regression. His test case was using hackbench on a kernel where
scheduler instrumentation (about 5 events in core scheduler code) was added.
See the "Tracepoints" patch header for performance result details.

Changelog :
- Change instrumentation location and parameter to match ftrace instrumentation,
  previously done with kernel markers.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: 'Peter Zijlstra' <peterz@infradead.org>
CC: 'Steven Rostedt' <rostedt@goodmis.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Masami Hiramatsu <mhiramat@redhat.com>
CC: "Frank Ch. Eigler" <fche@redhat.com>
CC: 'Ingo Molnar' <mingo@elte.hu>
CC: 'Hideo AOKI' <haoki@redhat.com>
CC: Takashi Nishiie <t-nishiie@np.css.fujitsu.com>
CC: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
---
 kernel/exit.c        |    6 ++++++
 kernel/fork.c        |    3 +++
 kernel/kthread.c     |    5 +++++
 kernel/sched-trace.h |   45 +++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched.c       |   17 ++++++-----------
 kernel/signal.c      |    3 +++
 6 files changed, 68 insertions(+), 11 deletions(-)

Index: linux-2.6-lttng/kernel/kthread.c
===================================================================
--- linux-2.6-lttng.orig/kernel/kthread.c	2008-07-09 12:11:59.000000000 -0400
+++ linux-2.6-lttng/kernel/kthread.c	2008-07-09 12:12:07.000000000 -0400
@@ -13,6 +13,7 @@
 #include <linux/file.h>
 #include <linux/module.h>
 #include <linux/mutex.h>
+#include "sched-trace.h"
 
 #define KTHREAD_NICE_LEVEL (-5)
 
@@ -187,6 +188,8 @@ int kthread_stop(struct task_struct *k)
 	/* It could exit after stop_info.k set, but before wake_up_process. */
 	get_task_struct(k);
 
+	trace_sched_kthread_stop(k);
+
 	/* Must init completion *before* thread sees kthread_stop_info.k */
 	init_completion(&kthread_stop_info.done);
 	smp_wmb();
@@ -202,6 +205,8 @@ int kthread_stop(struct task_struct *k)
 	ret = kthread_stop_info.err;
 	mutex_unlock(&kthread_stop_lock);
 
+	trace_sched_kthread_stop_ret(ret);
+
 	return ret;
 }
 EXPORT_SYMBOL(kthread_stop);
Index: linux-2.6-lttng/kernel/sched.c
===================================================================
--- linux-2.6-lttng.orig/kernel/sched.c	2008-07-09 12:11:59.000000000 -0400
+++ linux-2.6-lttng/kernel/sched.c	2008-07-09 12:14:28.000000000 -0400
@@ -71,6 +71,7 @@
 #include <linux/debugfs.h>
 #include <linux/ctype.h>
 #include <linux/ftrace.h>
+#include "sched-trace.h"
 
 #include <asm/tlb.h>
 #include <asm/irq_regs.h>
@@ -1987,6 +1988,7 @@ void wait_task_inactive(struct task_stru
 		 * just go back and repeat.
 		 */
 		rq = task_rq_lock(p, &flags);
+		trace_sched_wait_task(rq, p);
 		running = task_running(rq, p);
 		on_rq = p->se.on_rq;
 		task_rq_unlock(rq, &flags);
@@ -2337,9 +2339,7 @@ out_activate:
 	success = 1;
 
 out_running:
-	trace_mark(kernel_sched_wakeup,
-		"pid %d state %ld ## rq %p task %p rq->curr %p",
-		p->pid, p->state, rq, p, rq->curr);
+	trace_sched_wakeup(rq, p);
 	check_preempt_curr(rq, p);
 
 	p->state = TASK_RUNNING;
@@ -2472,9 +2472,7 @@ void wake_up_new_task(struct task_struct
 		p->sched_class->task_new(rq, p);
 		inc_nr_running(rq);
 	}
-	trace_mark(kernel_sched_wakeup_new,
-		"pid %d state %ld ## rq %p task %p rq->curr %p",
-		p->pid, p->state, rq, p, rq->curr);
+	trace_sched_wakeup_new(rq, p);
 	check_preempt_curr(rq, p);
 #ifdef CONFIG_SMP
 	if (p->sched_class->task_wake_up)
@@ -2647,11 +2645,7 @@ context_switch(struct rq *rq, struct tas
 	struct mm_struct *mm, *oldmm;
 
 	prepare_task_switch(rq, prev, next);
-	trace_mark(kernel_sched_schedule,
-		"prev_pid %d next_pid %d prev_state %ld "
-		"## rq %p prev %p next %p",
-		prev->pid, next->pid, prev->state,
-		rq, prev, next);
+	trace_sched_switch(rq, prev, next);
 	mm = next->mm;
 	oldmm = prev->active_mm;
 	/*
@@ -2884,6 +2878,7 @@ static void sched_migrate_task(struct ta
 	    || unlikely(cpu_is_offline(dest_cpu)))
 		goto out;
 
+	trace_sched_migrate_task(rq, p, dest_cpu);
 	/* force the process onto the specified CPU */
 	if (migrate_task(p, dest_cpu, &req)) {
 		/* Need to wait for migration thread (might exit: take ref). */
Index: linux-2.6-lttng/kernel/exit.c
===================================================================
--- linux-2.6-lttng.orig/kernel/exit.c	2008-07-09 12:11:59.000000000 -0400
+++ linux-2.6-lttng/kernel/exit.c	2008-07-09 12:12:07.000000000 -0400
@@ -46,6 +46,7 @@
 #include <linux/resource.h>
 #include <linux/blkdev.h>
 #include <linux/task_io_accounting_ops.h>
+#include "sched-trace.h"
 
 #include <asm/uaccess.h>
 #include <asm/unistd.h>
@@ -149,6 +150,7 @@ static void __exit_signal(struct task_st
 
 static void delayed_put_task_struct(struct rcu_head *rhp)
 {
+	trace_sched_process_free(container_of(rhp, struct task_struct, rcu));
 	put_task_struct(container_of(rhp, struct task_struct, rcu));
 }
 
@@ -1040,6 +1042,8 @@ NORET_TYPE void do_exit(long code)
 
 	if (group_dead)
 		acct_process();
+	trace_sched_process_exit(tsk);
+
 	exit_sem(tsk);
 	exit_files(tsk);
 	exit_fs(tsk);
@@ -1524,6 +1528,8 @@ static long do_wait(enum pid_type type, 
 	struct task_struct *tsk;
 	int flag, retval;
 
+	trace_sched_process_wait(pid);
+
 	add_wait_queue(&current->signal->wait_chldexit,&wait);
 repeat:
 	/* If there is nothing that can match our critier just get out */
Index: linux-2.6-lttng/kernel/fork.c
===================================================================
--- linux-2.6-lttng.orig/kernel/fork.c	2008-07-09 12:11:59.000000000 -0400
+++ linux-2.6-lttng/kernel/fork.c	2008-07-09 12:12:27.000000000 -0400
@@ -56,6 +56,7 @@
 #include <linux/proc_fs.h>
 #include <linux/blkdev.h>
 #include <linux/magic.h>
+#include "sched-trace.h"
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -1362,6 +1363,8 @@ long do_fork(unsigned long clone_flags,
 	if (!IS_ERR(p)) {
 		struct completion vfork;
 
+		trace_sched_process_fork(current, p);
+
 		nr = task_pid_vnr(p);
 
 		if (clone_flags & CLONE_PARENT_SETTID)
Index: linux-2.6-lttng/kernel/signal.c
===================================================================
--- linux-2.6-lttng.orig/kernel/signal.c	2008-07-09 12:11:45.000000000 -0400
+++ linux-2.6-lttng/kernel/signal.c	2008-07-09 12:12:07.000000000 -0400
@@ -26,6 +26,7 @@
 #include <linux/freezer.h>
 #include <linux/pid_namespace.h>
 #include <linux/nsproxy.h>
+#include "sched-trace.h"
 
 #include <asm/param.h>
 #include <asm/uaccess.h>
@@ -807,6 +808,8 @@ static int send_signal(int sig, struct s
 	struct sigpending *pending;
 	struct sigqueue *q;
 
+	trace_sched_signal_send(sig, t);
+
 	assert_spin_locked(&t->sighand->siglock);
 	if (!prepare_signal(sig, t))
 		return 0;
Index: linux-2.6-lttng/kernel/sched-trace.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/kernel/sched-trace.h	2008-07-09 12:14:59.000000000 -0400
@@ -0,0 +1,45 @@
+#ifndef _SCHED_TRACE_H
+#define _SCHED_TRACE_H
+
+#include <linux/sched.h>
+#include <linux/tracepoint.h>
+
+DEFINE_TRACE(sched_kthread_stop,
+	TPPROTO(struct task_struct *t),
+	TPARGS(t));
+DEFINE_TRACE(sched_kthread_stop_ret,
+	TPPROTO(int ret),
+	TPARGS(ret));
+DEFINE_TRACE(sched_wait_task,
+	TPPROTO(struct rq *rq, struct task_struct *p),
+	TPARGS(rq, p));
+DEFINE_TRACE(sched_wakeup,
+	TPPROTO(struct rq *rq, struct task_struct *p),
+	TPARGS(rq, p));
+DEFINE_TRACE(sched_wakeup_new,
+	TPPROTO(struct rq *rq, struct task_struct *p),
+	TPARGS(rq, p));
+DEFINE_TRACE(sched_switch,
+	TPPROTO(struct rq *rq, struct task_struct *prev,
+		struct task_struct *next),
+	TPARGS(rq, prev, next));
+DEFINE_TRACE(sched_migrate_task,
+	TPPROTO(struct rq *rq, struct task_struct *p, int dest_cpu),
+	TPARGS(rq, p, dest_cpu));
+DEFINE_TRACE(sched_process_free,
+	TPPROTO(struct task_struct *p),
+	TPARGS(p));
+DEFINE_TRACE(sched_process_exit,
+	TPPROTO(struct task_struct *p),
+	TPARGS(p));
+DEFINE_TRACE(sched_process_wait,
+	TPPROTO(struct pid *pid),
+	TPARGS(pid));
+DEFINE_TRACE(sched_process_fork,
+	TPPROTO(struct task_struct *parent, struct task_struct *child),
+	TPARGS(parent, child));
+DEFINE_TRACE(sched_signal_send,
+	TPPROTO(int sig, struct task_struct *p),
+	TPARGS(sig, p));
+
+#endif
-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch 04/15] LTTng instrumentation - irq
  2008-07-09 14:59 ` [patch 04/15] LTTng instrumentation - irq Mathieu Desnoyers
@ 2008-07-09 16:39   ` Masami Hiramatsu
  2008-07-09 17:05     ` [patch 04/15] LTTng instrumentation - irq (update) Mathieu Desnoyers
  0 siblings, 1 reply; 58+ messages in thread
From: Masami Hiramatsu @ 2008-07-09 16:39 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: akpm, Ingo Molnar, linux-kernel, Thomas Gleixner, Russell King,
	Peter Zijlstra, Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie,
	Steven Rostedt, Eduard - Gabriel Munteanu

Mathieu Desnoyers wrote:
> Instrumentation of IRQ related events : irq, softirq, tasklet entry and exit and
> softirq "raise" events.
> 
> It allows tracers to perform latency analysis on those various types of
> interrupts and to detect interrupts with max/min/avg duration. It helps
> detecting driver or hardware problems which cause an ISR to take ages to
> execute. It has been shown to be the case with bogus hardware causing an mmio
> read to take a few milliseconds.
> 
> Those tracepoints are used by LTTng.
> 
> About the performance impact of tracepoints (which is comparable to markers),
> even without immediate values optimizations, tests done by Hideo Aoki on ia64
> show no regression. His test case was using hackbench on a kernel where
> scheduler instrumentation (about 5 events in core scheduler code) was added.
> See the "Tracepoints" patch header for performance result details.
> 
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> CC: Thomas Gleixner <tglx@linutronix.de>
> CC: Russell King <rmk+lkml@arm.linux.org.uk>
> CC: Masami Hiramatsu <mhiramat@redhat.com>
> CC: 'Peter Zijlstra' <peterz@infradead.org>
> CC: "Frank Ch. Eigler" <fche@redhat.com>
> CC: 'Ingo Molnar' <mingo@elte.hu>
> CC: 'Hideo AOKI' <haoki@redhat.com>
> CC: Takashi Nishiie <t-nishiie@np.css.fujitsu.com>
> CC: 'Steven Rostedt' <rostedt@goodmis.org>
> CC: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
> ---
>  kernel/irq-trace.h  |   36 ++++++++++++++++++++++++++++++++++++
>  kernel/irq/handle.c |    6 ++++++
>  kernel/softirq.c    |    8 ++++++++
>  3 files changed, 50 insertions(+)
> 
> Index: linux-2.6-lttng/kernel/irq/handle.c
> ===================================================================
> --- linux-2.6-lttng.orig/kernel/irq/handle.c	2008-07-09 10:57:33.000000000 -0400
> +++ linux-2.6-lttng/kernel/irq/handle.c	2008-07-09 10:57:35.000000000 -0400
> @@ -15,6 +15,7 @@
>  #include <linux/random.h>
>  #include <linux/interrupt.h>
>  #include <linux/kernel_stat.h>
> +#include "../irq-trace.h"
>  
>  #include "internals.h"
>  
> @@ -130,6 +131,9 @@ irqreturn_t handle_IRQ_event(unsigned in
>  {
>  	irqreturn_t ret, retval = IRQ_NONE;
>  	unsigned int status = 0;
> +	struct pt_regs *regs = get_irq_regs();
> +
> +	trace_irq_entry(irq, regs);
>  
>  	handle_dynamic_tick(action);
>  
> @@ -148,6 +152,8 @@ irqreturn_t handle_IRQ_event(unsigned in
>  		add_interrupt_randomness(irq);
>  	local_irq_disable();
>  
> +	trace_irq_exit();
> +

Hi Mathieu,
What would you think of tracing the return value of irq handlers here?
like:
 trace_irq_exit(retval);

So, we can check the irq was handled correctly or not.

Thank you,


-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch 00/15] Tracepoints v3 for linux-next
  2008-07-09 14:59 [patch 00/15] Tracepoints v3 for linux-next Mathieu Desnoyers
                   ` (14 preceding siblings ...)
  2008-07-09 14:59 ` Mathieu Desnoyers
@ 2008-07-09 17:01 ` Masami Hiramatsu
  2008-07-09 17:11   ` [patch 15/15] LTTng instrumentation - ipv6 Mathieu Desnoyers
  15 siblings, 1 reply; 58+ messages in thread
From: Masami Hiramatsu @ 2008-07-09 17:01 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: akpm, Ingo Molnar, linux-kernel

Hi Mathieu,

I couldn't find your 15th patch in my mailbox, nor in the lkml archive.
Could you resend it?

Thank you,

Mathieu Desnoyers wrote:
> Hi,
> 
> This is the 3rd round of tracepoints patch submission. The instrumentation
> sites most likely to come to a quick agreement have been selected for this
> first step. The tracepoint infrastructure, heavily inspired by the kernel
> markers, seems pretty solid and went through a thorough review by Masami.
> 
> This patchset applies over patch-v2.6.26-rc9-next-20080709.
> 
> Mathieu
> 

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com



* Re: [patch 04/15] LTTng instrumentation - irq (update)
  2008-07-09 16:39   ` Masami Hiramatsu
@ 2008-07-09 17:05     ` Mathieu Desnoyers
  0 siblings, 0 replies; 58+ messages in thread
From: Mathieu Desnoyers @ 2008-07-09 17:05 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: akpm, Ingo Molnar, linux-kernel, Thomas Gleixner, Russell King,
	Peter Zijlstra, Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie,
	Steven Rostedt, Eduard - Gabriel Munteanu

* Masami Hiramatsu (mhiramat@redhat.com) wrote:
> 
> Hi Mathieu,
> What would you think of tracing the return value of irq handlers here?
> like:
>  trace_irq_exit(retval);
> 
> So, we can check the irq was handled correctly or not.
> 
> Thank you,
> 
> 

Sure, here is the updated patch.

Thanks,

Mathieu


LTTng instrumentation - irq

Instrumentation of IRQ-related events: irq, softirq and tasklet entry and
exit, and softirq "raise" events.

It allows tracers to perform latency analysis on those various types of
interrupts and to detect interrupts with max/min/avg duration. It helps
detect driver or hardware problems that cause an ISR to take ages to
execute. This has been seen with bogus hardware causing an mmio
read to take a few milliseconds.

Those tracepoints are used by LTTng.

About the performance impact of tracepoints (which is comparable to markers),
even without immediate values optimizations, tests done by Hideo Aoki on ia64
show no regression. His test case used hackbench on a kernel where
scheduler instrumentation (about 5 events in the scheduler code) was added.
See the "Tracepoints" patch header for performance result details.

Changelog:
- Add retval as irq_exit argument.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Russell King <rmk+lkml@arm.linux.org.uk>
CC: Masami Hiramatsu <mhiramat@redhat.com>
CC: 'Peter Zijlstra' <peterz@infradead.org>
CC: "Frank Ch. Eigler" <fche@redhat.com>
CC: 'Ingo Molnar' <mingo@elte.hu>
CC: 'Hideo AOKI' <haoki@redhat.com>
CC: Takashi Nishiie <t-nishiie@np.css.fujitsu.com>
CC: 'Steven Rostedt' <rostedt@goodmis.org>
CC: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
---
 kernel/irq-trace.h  |   36 ++++++++++++++++++++++++++++++++++++
 kernel/irq/handle.c |    6 ++++++
 kernel/softirq.c    |    8 ++++++++
 3 files changed, 50 insertions(+)

Index: linux-2.6-lttng/kernel/irq/handle.c
===================================================================
--- linux-2.6-lttng.orig/kernel/irq/handle.c	2008-07-09 12:36:08.000000000 -0400
+++ linux-2.6-lttng/kernel/irq/handle.c	2008-07-09 12:55:33.000000000 -0400
@@ -15,6 +15,7 @@
 #include <linux/random.h>
 #include <linux/interrupt.h>
 #include <linux/kernel_stat.h>
+#include "../irq-trace.h"
 
 #include "internals.h"
 
@@ -130,6 +131,9 @@ irqreturn_t handle_IRQ_event(unsigned in
 {
 	irqreturn_t ret, retval = IRQ_NONE;
 	unsigned int status = 0;
+	struct pt_regs *regs = get_irq_regs();
+
+	trace_irq_entry(irq, regs);
 
 	handle_dynamic_tick(action);
 
@@ -148,6 +152,8 @@ irqreturn_t handle_IRQ_event(unsigned in
 		add_interrupt_randomness(irq);
 	local_irq_disable();
 
+	trace_irq_exit(retval);
+
 	return retval;
 }
 
Index: linux-2.6-lttng/kernel/softirq.c
===================================================================
--- linux-2.6-lttng.orig/kernel/softirq.c	2008-07-09 12:37:15.000000000 -0400
+++ linux-2.6-lttng/kernel/softirq.c	2008-07-09 12:54:58.000000000 -0400
@@ -21,6 +21,7 @@
 #include <linux/rcupdate.h>
 #include <linux/smp.h>
 #include <linux/tick.h>
+#include "irq-trace.h"
 
 #include <asm/irq.h>
 /*
@@ -231,7 +232,9 @@ restart:
 
 	do {
 		if (pending & 1) {
+			trace_irq_softirq_entry(h, softirq_vec);
 			h->action(h);
+			trace_irq_softirq_exit(h, softirq_vec);
 			rcu_bh_qsctr_inc(cpu);
 		}
 		h++;
@@ -323,6 +326,7 @@ void irq_exit(void)
  */
 inline void raise_softirq_irqoff(unsigned int nr)
 {
+	trace_irq_softirq_raise(nr);
 	__raise_softirq_irqoff(nr);
 
 	/*
@@ -412,7 +416,9 @@ static void tasklet_action(struct softir
 			if (!atomic_read(&t->count)) {
 				if (!test_and_clear_bit(TASKLET_STATE_SCHED, &t->state))
 					BUG();
+				trace_irq_tasklet_low_entry(t);
 				t->func(t->data);
+				trace_irq_tasklet_low_exit(t);
 				tasklet_unlock(t);
 				continue;
 			}
@@ -447,7 +453,9 @@ static void tasklet_hi_action(struct sof
 			if (!atomic_read(&t->count)) {
 				if (!test_and_clear_bit(TASKLET_STATE_SCHED, &t->state))
 					BUG();
+				trace_irq_tasklet_high_entry(t);
 				t->func(t->data);
+				trace_irq_tasklet_high_exit(t);
 				tasklet_unlock(t);
 				continue;
 			}
Index: linux-2.6-lttng/kernel/irq-trace.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/kernel/irq-trace.h	2008-07-09 12:56:11.000000000 -0400
@@ -0,0 +1,36 @@
+#ifndef _IRQ_TRACE_H
+#define _IRQ_TRACE_H
+
+#include <linux/kdebug.h>
+#include <linux/interrupt.h>
+#include <linux/tracepoint.h>
+
+DEFINE_TRACE(irq_entry,
+	TPPROTO(unsigned int id, struct pt_regs *regs),
+	TPARGS(id, regs));
+DEFINE_TRACE(irq_exit,
+	TPPROTO(irqreturn_t retval),
+	TPARGS(retval));
+DEFINE_TRACE(irq_softirq_entry,
+	TPPROTO(struct softirq_action *h, struct softirq_action *softirq_vec),
+	TPARGS(h, softirq_vec));
+DEFINE_TRACE(irq_softirq_exit,
+	TPPROTO(struct softirq_action *h, struct softirq_action *softirq_vec),
+	TPARGS(h, softirq_vec));
+DEFINE_TRACE(irq_softirq_raise,
+	TPPROTO(unsigned int nr),
+	TPARGS(nr));
+DEFINE_TRACE(irq_tasklet_low_entry,
+	TPPROTO(struct tasklet_struct *t),
+	TPARGS(t));
+DEFINE_TRACE(irq_tasklet_low_exit,
+	TPPROTO(struct tasklet_struct *t),
+	TPARGS(t));
+DEFINE_TRACE(irq_tasklet_high_entry,
+	TPPROTO(struct tasklet_struct *t),
+	TPARGS(t));
+DEFINE_TRACE(irq_tasklet_high_exit,
+	TPPROTO(struct tasklet_struct *t),
+	TPARGS(t));
+
+#endif

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* [patch 15/15] LTTng instrumentation - ipv6
  2008-07-09 17:01 ` [patch 00/15] Tracepoints v3 for linux-next Masami Hiramatsu
@ 2008-07-09 17:11   ` Mathieu Desnoyers
  0 siblings, 0 replies; 58+ messages in thread
From: Mathieu Desnoyers @ 2008-07-09 17:11 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: akpm, Ingo Molnar, linux-kernel, Pekka Savola, netdev,
	David S. Miller, Alexey Kuznetsov, Masami Hiramatsu,
	'Peter Zijlstra', Frank Ch. Eigler, 'Hideo AOKI',
	Takashi Nishiie, 'Steven Rostedt',
	Eduard - Gabriel Munteanu

* Masami Hiramatsu (mhiramat@redhat.com) wrote:
> Hi Mathieu,
> 
> I couldn't find your 15th patch in my mailbox, nor in the lkml archive.
> Could you resend it?
> 
> Thank you,
> 

Sure, I think quilt had some problem with it because of an ill-formatted
email. Here it is.



Instrument addr_add and addr_del of network interfaces. Lets a tracer know
when the interface address changes.

Those tracepoints are used by LTTng.

About the performance impact of tracepoints (which is comparable to markers),
even without immediate values optimizations, tests done by Hideo Aoki on ia64
show no regression. His test case used hackbench on a kernel where
scheduler instrumentation (about 5 events in the scheduler code) was added.
See the "Tracepoints" patch header for performance result details.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Pekka Savola <pekkas@netcore.fi>
CC: netdev@vger.kernel.org
CC: David S. Miller <davem@davemloft.net>
CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
CC: Masami Hiramatsu <mhiramat@redhat.com>
CC: 'Peter Zijlstra' <peterz@infradead.org>
CC: "Frank Ch. Eigler" <fche@redhat.com>
CC: 'Ingo Molnar' <mingo@elte.hu>
CC: 'Hideo AOKI' <haoki@redhat.com>
CC: Takashi Nishiie <t-nishiie@np.css.fujitsu.com>
CC: 'Steven Rostedt' <rostedt@goodmis.org>
CC: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
---
 net/ipv6/addrconf.c   |    4 ++++
 net/ipv6/ipv6-trace.h |   14 ++++++++++++++
 2 files changed, 18 insertions(+)

Index: linux-2.6-lttng/net/ipv6/addrconf.c
===================================================================
--- linux-2.6-lttng.orig/net/ipv6/addrconf.c	2008-07-09 10:55:46.000000000 -0400
+++ linux-2.6-lttng/net/ipv6/addrconf.c	2008-07-09 10:58:43.000000000 -0400
@@ -85,6 +85,7 @@
 
 #include <linux/proc_fs.h>
 #include <linux/seq_file.h>
+#include "ipv6-trace.h"
 
 /* Set to 3 to get tracing... */
 #define ACONF_DEBUG 2
@@ -650,6 +651,8 @@ ipv6_add_addr(struct inet6_dev *idev, co
 	/* For caller */
 	in6_ifa_hold(ifa);
 
+	trace_ipv6_addr_add(ifa);
+
 	/* Add to big hash table */
 	hash = ipv6_addr_hash(addr);
 
@@ -2163,6 +2166,7 @@ static int inet6_addr_del(struct net *ne
 			in6_ifa_hold(ifp);
 			read_unlock_bh(&idev->lock);
 
+			trace_ipv6_addr_del(ifp);
 			ipv6_del_addr(ifp);
 
 			/* If the last address is deleted administratively,
Index: linux-2.6-lttng/net/ipv6/ipv6-trace.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/net/ipv6/ipv6-trace.h	2008-07-09 10:58:43.000000000 -0400
@@ -0,0 +1,14 @@
+#ifndef _IPV6_TRACE_H
+#define _IPV6_TRACE_H
+
+#include <net/if_inet6.h>
+#include <linux/tracepoint.h>
+
+DEFINE_TRACE(ipv6_addr_add,
+	TPPROTO(struct inet6_ifaddr *ifa),
+	TPARGS(ifa));
+DEFINE_TRACE(ipv6_addr_del,
+	TPPROTO(struct inet6_ifaddr *ifa),
+	TPARGS(ifa));
+
+#endif


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* [PATCH] ftrace port to tracepoints (linux-next)
  2008-07-09 16:21     ` [patch 05/15] LTTng instrumentation - scheduler (merge ftrace markers) Mathieu Desnoyers
@ 2008-07-09 19:09       ` Mathieu Desnoyers
  2008-07-10  3:14         ` Takashi Nishiie
       [not found]         ` <20080711143709.GB11500@Krystal>
  0 siblings, 2 replies; 58+ messages in thread
From: Mathieu Desnoyers @ 2008-07-09 19:09 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel, Peter Zijlstra
  Cc: Steven Rostedt, Thomas Gleixner, Masami Hiramatsu,
	Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie,
	Eduard - Gabriel Munteanu

Porting the trace_mark() used by ftrace to tracepoints. (cleanup)

This patch applies after the "Tracepoints v3" patchset for linux-next.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Masami Hiramatsu <mhiramat@redhat.com>
CC: 'Peter Zijlstra' <peterz@infradead.org>
CC: "Frank Ch. Eigler" <fche@redhat.com>
CC: 'Ingo Molnar' <mingo@elte.hu>
CC: 'Hideo AOKI' <haoki@redhat.com>
CC: Takashi Nishiie <t-nishiie@np.css.fujitsu.com>
CC: 'Steven Rostedt' <rostedt@goodmis.org>
CC: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
---
 kernel/trace/trace_sched_switch.c |  114 ++++++---------------------------
 kernel/trace/trace_sched_wakeup.c |  129 +++++++++-----------------------------
 2 files changed, 52 insertions(+), 191 deletions(-)

Index: linux-2.6-lttng/kernel/trace/trace_sched_switch.c
===================================================================
--- linux-2.6-lttng.orig/kernel/trace/trace_sched_switch.c	2008-07-09 14:33:28.000000000 -0400
+++ linux-2.6-lttng/kernel/trace/trace_sched_switch.c	2008-07-09 15:03:07.000000000 -0400
@@ -9,9 +9,9 @@
 #include <linux/debugfs.h>
 #include <linux/kallsyms.h>
 #include <linux/uaccess.h>
-#include <linux/marker.h>
 #include <linux/ftrace.h>
 
+#include "../sched-trace.h"
 #include "trace.h"
 
 static struct trace_array	*ctx_trace;
@@ -19,16 +19,17 @@ static int __read_mostly	tracer_enabled;
 static atomic_t			sched_ref;
 
 static void
-sched_switch_func(void *private, void *__rq, struct task_struct *prev,
+probe_sched_switch(struct rq *__rq, struct task_struct *prev,
 			struct task_struct *next)
 {
-	struct trace_array **ptr = private;
-	struct trace_array *tr = *ptr;
 	struct trace_array_cpu *data;
 	unsigned long flags;
 	long disabled;
 	int cpu;
 
+	if (!atomic_read(&sched_ref))
+		return;
+
 	tracing_record_cmdline(prev);
 	tracing_record_cmdline(next);
 
@@ -37,95 +38,42 @@ sched_switch_func(void *private, void *_
 
 	local_irq_save(flags);
 	cpu = raw_smp_processor_id();
-	data = tr->data[cpu];
+	data = ctx_trace->data[cpu];
 	disabled = atomic_inc_return(&data->disabled);
 
 	if (likely(disabled == 1))
-		tracing_sched_switch_trace(tr, data, prev, next, flags);
+		tracing_sched_switch_trace(ctx_trace, data, prev, next, flags);
 
 	atomic_dec(&data->disabled);
 	local_irq_restore(flags);
 }
 
-static notrace void
-sched_switch_callback(void *probe_data, void *call_data,
-		      const char *format, va_list *args)
-{
-	struct task_struct *prev;
-	struct task_struct *next;
-	struct rq *__rq;
-
-	if (!atomic_read(&sched_ref))
-		return;
-
-	/* skip prev_pid %d next_pid %d prev_state %ld */
-	(void)va_arg(*args, int);
-	(void)va_arg(*args, int);
-	(void)va_arg(*args, long);
-	__rq = va_arg(*args, typeof(__rq));
-	prev = va_arg(*args, typeof(prev));
-	next = va_arg(*args, typeof(next));
-
-	/*
-	 * If tracer_switch_func only points to the local
-	 * switch func, it still needs the ptr passed to it.
-	 */
-	sched_switch_func(probe_data, __rq, prev, next);
-}
-
 static void
-wakeup_func(void *private, void *__rq, struct task_struct *wakee, struct
-			task_struct *curr)
+probe_sched_wakeup(struct rq *__rq, struct task_struct *wakee)
 {
-	struct trace_array **ptr = private;
-	struct trace_array *tr = *ptr;
 	struct trace_array_cpu *data;
 	unsigned long flags;
 	long disabled;
 	int cpu;
 
-	if (!tracer_enabled)
+	if (!likely(tracer_enabled))
 		return;
 
-	tracing_record_cmdline(curr);
+	tracing_record_cmdline(current);
 
 	local_irq_save(flags);
 	cpu = raw_smp_processor_id();
-	data = tr->data[cpu];
+	data = ctx_trace->data[cpu];
 	disabled = atomic_inc_return(&data->disabled);
 
 	if (likely(disabled == 1))
-		tracing_sched_wakeup_trace(tr, data, wakee, curr, flags);
+		tracing_sched_wakeup_trace(ctx_trace, data, wakee, current,
+			flags);
 
 	atomic_dec(&data->disabled);
 	local_irq_restore(flags);
 }
 
-static notrace void
-wake_up_callback(void *probe_data, void *call_data,
-		 const char *format, va_list *args)
-{
-	struct task_struct *curr;
-	struct task_struct *task;
-	struct rq *__rq;
-
-	if (likely(!tracer_enabled))
-		return;
-
-	/* Skip pid %d state %ld */
-	(void)va_arg(*args, int);
-	(void)va_arg(*args, long);
-	/* now get the meat: "rq %p task %p rq->curr %p" */
-	__rq = va_arg(*args, typeof(__rq));
-	task = va_arg(*args, typeof(task));
-	curr = va_arg(*args, typeof(curr));
-
-	tracing_record_cmdline(task);
-	tracing_record_cmdline(curr);
-
-	wakeup_func(probe_data, __rq, task, curr);
-}
-
 static void sched_switch_reset(struct trace_array *tr)
 {
 	int cpu;
@@ -140,31 +88,21 @@ static int tracing_sched_register(void)
 {
 	int ret;
 
-	ret = marker_probe_register("kernel_sched_wakeup",
-			"pid %d state %ld ## rq %p task %p rq->curr %p",
-			wake_up_callback,
-			&ctx_trace);
+	ret = register_trace_sched_wakeup(probe_sched_wakeup);
 	if (ret) {
 		pr_info("wakeup trace: Couldn't add marker"
 			" probe to kernel_sched_wakeup\n");
 		return ret;
 	}
 
-	ret = marker_probe_register("kernel_sched_wakeup_new",
-			"pid %d state %ld ## rq %p task %p rq->curr %p",
-			wake_up_callback,
-			&ctx_trace);
+	ret = register_trace_sched_wakeup_new(probe_sched_wakeup);
 	if (ret) {
 		pr_info("wakeup trace: Couldn't add marker"
 			" probe to kernel_sched_wakeup_new\n");
 		goto fail_deprobe;
 	}
 
-	ret = marker_probe_register("kernel_sched_schedule",
-		"prev_pid %d next_pid %d prev_state %ld "
-		"## rq %p prev %p next %p",
-		sched_switch_callback,
-		&ctx_trace);
+	ret = register_trace_sched_switch(probe_sched_switch);
 	if (ret) {
 		pr_info("sched trace: Couldn't add marker"
 			" probe to kernel_sched_schedule\n");
@@ -173,27 +111,17 @@ static int tracing_sched_register(void)
 
 	return ret;
 fail_deprobe_wake_new:
-	marker_probe_unregister("kernel_sched_wakeup_new",
-				wake_up_callback,
-				&ctx_trace);
+	unregister_trace_sched_wakeup_new(probe_sched_wakeup);
 fail_deprobe:
-	marker_probe_unregister("kernel_sched_wakeup",
-				wake_up_callback,
-				&ctx_trace);
+	unregister_trace_sched_wakeup(probe_sched_wakeup);
 	return ret;
 }
 
 static void tracing_sched_unregister(void)
 {
-	marker_probe_unregister("kernel_sched_schedule",
-				sched_switch_callback,
-				&ctx_trace);
-	marker_probe_unregister("kernel_sched_wakeup_new",
-				wake_up_callback,
-				&ctx_trace);
-	marker_probe_unregister("kernel_sched_wakeup",
-				wake_up_callback,
-				&ctx_trace);
+	unregister_trace_sched_switch(probe_sched_switch);
+	unregister_trace_sched_wakeup_new(probe_sched_wakeup);
+	unregister_trace_sched_wakeup(probe_sched_wakeup);
 }
 
 static void tracing_start_sched_switch(void)
Index: linux-2.6-lttng/kernel/trace/trace_sched_wakeup.c
===================================================================
--- linux-2.6-lttng.orig/kernel/trace/trace_sched_wakeup.c	2008-07-09 14:33:34.000000000 -0400
+++ linux-2.6-lttng/kernel/trace/trace_sched_wakeup.c	2008-07-09 15:05:23.000000000 -0400
@@ -15,8 +15,8 @@
 #include <linux/kallsyms.h>
 #include <linux/uaccess.h>
 #include <linux/ftrace.h>
-#include <linux/marker.h>
 
+#include "../sched-trace.h"
 #include "trace.h"
 
 static struct trace_array	*wakeup_trace;
@@ -109,18 +109,18 @@ static int report_latency(cycle_t delta)
 }
 
 static void notrace
-wakeup_sched_switch(void *private, void *rq, struct task_struct *prev,
+probe_wakeup_sched_switch(struct rq *rq, struct task_struct *prev,
 	struct task_struct *next)
 {
 	unsigned long latency = 0, t0 = 0, t1 = 0;
-	struct trace_array **ptr = private;
-	struct trace_array *tr = *ptr;
 	struct trace_array_cpu *data;
 	cycle_t T0, T1, delta;
 	unsigned long flags;
 	long disabled;
 	int cpu;
 
+	tracing_record_cmdline(prev);
+
 	if (unlikely(!tracer_enabled))
 		return;
 
@@ -137,11 +137,11 @@ wakeup_sched_switch(void *private, void 
 		return;
 
 	/* The task we are waiting for is waking up */
-	data = tr->data[wakeup_cpu];
+	data = wakeup_trace->data[wakeup_cpu];
 
 	/* disable local data, not wakeup_cpu data */
 	cpu = raw_smp_processor_id();
-	disabled = atomic_inc_return(&tr->data[cpu]->disabled);
+	disabled = atomic_inc_return(&wakeup_trace->data[cpu]->disabled);
 	if (likely(disabled != 1))
 		goto out;
 
@@ -151,7 +151,7 @@ wakeup_sched_switch(void *private, void 
 	if (unlikely(!tracer_enabled || next != wakeup_task))
 		goto out_unlock;
 
-	trace_function(tr, data, CALLER_ADDR1, CALLER_ADDR2, flags);
+	trace_function(wakeup_trace, data, CALLER_ADDR1, CALLER_ADDR2, flags);
 
 	/*
 	 * usecs conversion is slow so we try to delay the conversion
@@ -170,38 +170,13 @@ wakeup_sched_switch(void *private, void 
 	t0 = nsecs_to_usecs(T0);
 	t1 = nsecs_to_usecs(T1);
 
-	update_max_tr(tr, wakeup_task, wakeup_cpu);
+	update_max_tr(wakeup_trace, wakeup_task, wakeup_cpu);
 
 out_unlock:
-	__wakeup_reset(tr);
+	__wakeup_reset(wakeup_trace);
 	spin_unlock_irqrestore(&wakeup_lock, flags);
 out:
-	atomic_dec(&tr->data[cpu]->disabled);
-}
-
-static notrace void
-sched_switch_callback(void *probe_data, void *call_data,
-		      const char *format, va_list *args)
-{
-	struct task_struct *prev;
-	struct task_struct *next;
-	struct rq *__rq;
-
-	/* skip prev_pid %d next_pid %d prev_state %ld */
-	(void)va_arg(*args, int);
-	(void)va_arg(*args, int);
-	(void)va_arg(*args, long);
-	__rq = va_arg(*args, typeof(__rq));
-	prev = va_arg(*args, typeof(prev));
-	next = va_arg(*args, typeof(next));
-
-	tracing_record_cmdline(prev);
-
-	/*
-	 * If tracer_switch_func only points to the local
-	 * switch func, it still needs the ptr passed to it.
-	 */
-	wakeup_sched_switch(probe_data, __rq, prev, next);
+	atomic_dec(&wakeup_trace->data[cpu]->disabled);
 }
 
 static void __wakeup_reset(struct trace_array *tr)
@@ -235,19 +210,24 @@ static void wakeup_reset(struct trace_ar
 }
 
 static void
-wakeup_check_start(struct trace_array *tr, struct task_struct *p,
-		   struct task_struct *curr)
+probe_wakeup(struct rq *rq, struct task_struct *p)
 {
 	int cpu = smp_processor_id();
 	unsigned long flags;
 	long disabled;
 
+	if (likely(!tracer_enabled))
+		return;
+
+	tracing_record_cmdline(p);
+	tracing_record_cmdline(current);
+
 	if (likely(!rt_task(p)) ||
 			p->prio >= wakeup_prio ||
-			p->prio >= curr->prio)
+			p->prio >= current->prio)
 		return;
 
-	disabled = atomic_inc_return(&tr->data[cpu]->disabled);
+	disabled = atomic_inc_return(&wakeup_trace->data[cpu]->disabled);
 	if (unlikely(disabled != 1))
 		goto out;
 
@@ -259,7 +239,7 @@ wakeup_check_start(struct trace_array *t
 		goto out_locked;
 
 	/* reset the trace */
-	__wakeup_reset(tr);
+	__wakeup_reset(wakeup_trace);
 
 	wakeup_cpu = task_cpu(p);
 	wakeup_prio = p->prio;
@@ -269,72 +249,35 @@ wakeup_check_start(struct trace_array *t
 
 	local_save_flags(flags);
 
-	tr->data[wakeup_cpu]->preempt_timestamp = ftrace_now(cpu);
-	trace_function(tr, tr->data[wakeup_cpu],
+	wakeup_trace->data[wakeup_cpu]->preempt_timestamp = ftrace_now(cpu);
+	trace_function(wakeup_trace, wakeup_trace->data[wakeup_cpu],
 		       CALLER_ADDR1, CALLER_ADDR2, flags);
 
 out_locked:
 	spin_unlock(&wakeup_lock);
 out:
-	atomic_dec(&tr->data[cpu]->disabled);
-}
-
-static notrace void
-wake_up_callback(void *probe_data, void *call_data,
-		 const char *format, va_list *args)
-{
-	struct trace_array **ptr = probe_data;
-	struct trace_array *tr = *ptr;
-	struct task_struct *curr;
-	struct task_struct *task;
-	struct rq *__rq;
-
-	if (likely(!tracer_enabled))
-		return;
-
-	/* Skip pid %d state %ld */
-	(void)va_arg(*args, int);
-	(void)va_arg(*args, long);
-	/* now get the meat: "rq %p task %p rq->curr %p" */
-	__rq = va_arg(*args, typeof(__rq));
-	task = va_arg(*args, typeof(task));
-	curr = va_arg(*args, typeof(curr));
-
-	tracing_record_cmdline(task);
-	tracing_record_cmdline(curr);
-
-	wakeup_check_start(tr, task, curr);
+	atomic_dec(&wakeup_trace->data[cpu]->disabled);
 }
 
 static void start_wakeup_tracer(struct trace_array *tr)
 {
 	int ret;
 
-	ret = marker_probe_register("kernel_sched_wakeup",
-			"pid %d state %ld ## rq %p task %p rq->curr %p",
-			wake_up_callback,
-			&wakeup_trace);
+	ret = register_trace_sched_wakeup(probe_wakeup);
 	if (ret) {
 		pr_info("wakeup trace: Couldn't add marker"
 			" probe to kernel_sched_wakeup\n");
 		return;
 	}
 
-	ret = marker_probe_register("kernel_sched_wakeup_new",
-			"pid %d state %ld ## rq %p task %p rq->curr %p",
-			wake_up_callback,
-			&wakeup_trace);
+	ret = register_trace_sched_wakeup_new(probe_wakeup);
 	if (ret) {
 		pr_info("wakeup trace: Couldn't add marker"
 			" probe to kernel_sched_wakeup_new\n");
 		goto fail_deprobe;
 	}
 
-	ret = marker_probe_register("kernel_sched_schedule",
-		"prev_pid %d next_pid %d prev_state %ld "
-		"## rq %p prev %p next %p",
-		sched_switch_callback,
-		&wakeup_trace);
+	ret = register_trace_sched_switch(probe_wakeup_sched_switch);
 	if (ret) {
 		pr_info("sched trace: Couldn't add marker"
 			" probe to kernel_sched_schedule\n");
@@ -357,28 +300,18 @@ static void start_wakeup_tracer(struct t
 
 	return;
 fail_deprobe_wake_new:
-	marker_probe_unregister("kernel_sched_wakeup_new",
-				wake_up_callback,
-				&wakeup_trace);
+	unregister_trace_sched_wakeup_new(probe_wakeup);
 fail_deprobe:
-	marker_probe_unregister("kernel_sched_wakeup",
-				wake_up_callback,
-				&wakeup_trace);
+	unregister_trace_sched_wakeup(probe_wakeup);
 }
 
 static void stop_wakeup_tracer(struct trace_array *tr)
 {
 	tracer_enabled = 0;
 	unregister_ftrace_function(&trace_ops);
-	marker_probe_unregister("kernel_sched_schedule",
-				sched_switch_callback,
-				&wakeup_trace);
-	marker_probe_unregister("kernel_sched_wakeup_new",
-				wake_up_callback,
-				&wakeup_trace);
-	marker_probe_unregister("kernel_sched_wakeup",
-				wake_up_callback,
-				&wakeup_trace);
+	unregister_trace_sched_switch(probe_wakeup_sched_switch);
+	unregister_trace_sched_wakeup_new(probe_wakeup);
+	unregister_trace_sched_wakeup(probe_wakeup);
 }
 
 static void wakeup_tracer_init(struct trace_array *tr)
-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* RE: [PATCH] ftrace port to tracepoints (linux-next)
  2008-07-09 19:09       ` [PATCH] ftrace port to tracepoints (linux-next) Mathieu Desnoyers
@ 2008-07-10  3:14         ` Takashi Nishiie
  2008-07-10  3:57           ` [PATCH] ftrace port to tracepoints (linux-next) (nitpick update) Mathieu Desnoyers
       [not found]         ` <20080711143709.GB11500@Krystal>
  1 sibling, 1 reply; 58+ messages in thread
From: Takashi Nishiie @ 2008-07-10  3:14 UTC (permalink / raw)
  To: 'Mathieu Desnoyers', akpm, 'Ingo Molnar',
	linux-kernel, 'Peter Zijlstra'
  Cc: 'Steven Rostedt', 'Thomas Gleixner',
	'Masami Hiramatsu', 'Frank Ch. Eigler',
	'Hideo AOKI', 'Eduard - Gabriel Munteanu'

Hi, Mathieu

I think it is wonderful that the source code becomes simpler by
changing kernel markers to tracepoints.

However, it seems you forgot to correct the error messages.

For example:
Mathieu wrote:
>-	ret = marker_probe_register("kernel_sched_wakeup",
>-			"pid %d state %ld ## rq %p task %p rq->curr %p",
>-			wake_up_callback,
>-			&ctx_trace);
>+	ret = register_trace_sched_wakeup(probe_sched_wakeup);
> 	if (ret) {
> 		pr_info("wakeup trace: Couldn't add marker"
> 			" probe to kernel_sched_wakeup\n");
> 		return ret;
> 	}

 	if (ret) {
- 		pr_info("wakeup trace: Couldn't add marker"
+		pr_info("wakeup trace: Couldn't activate tracepoint"
 			" probe to kernel_sched_wakeup\n");
 		return ret;
 	}

Thank you,
--
Takashi Nishiie





* Re: [PATCH] ftrace port to tracepoints (linux-next) (nitpick update)
  2008-07-10  3:14         ` Takashi Nishiie
@ 2008-07-10  3:57           ` Mathieu Desnoyers
  0 siblings, 0 replies; 58+ messages in thread
From: Mathieu Desnoyers @ 2008-07-10  3:57 UTC (permalink / raw)
  To: Takashi Nishiie
  Cc: akpm, 'Ingo Molnar',
	linux-kernel, 'Peter Zijlstra', 'Steven Rostedt',
	'Thomas Gleixner', 'Masami Hiramatsu',
	'Frank Ch. Eigler', 'Hideo AOKI',
	'Eduard - Gabriel Munteanu'

* Takashi Nishiie (t-nishiie@np.css.fujitsu.com) wrote:
> Hi,Mathieu
> 
> I think it is wonderful that the source code becomes simpler by
> changing kernel markers to tracepoints.
> 
> However, it seems you forgot to correct the error messages.
> 

:)

Good catch. Here is the revised version. Thanks!

Mathieu


ftrace port to tracepoints

Porting the trace_mark() used by ftrace to tracepoints. (cleanup)

Changelog :
- Change error messages : marker -> tracepoint

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Masami Hiramatsu <mhiramat@redhat.com>
CC: 'Peter Zijlstra' <peterz@infradead.org>
CC: "Frank Ch. Eigler" <fche@redhat.com>
CC: 'Ingo Molnar' <mingo@elte.hu>
CC: 'Hideo AOKI' <haoki@redhat.com>
CC: Takashi Nishiie <t-nishiie@np.css.fujitsu.com>
CC: 'Steven Rostedt' <rostedt@goodmis.org>
CC: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
---
 kernel/trace/trace_sched_switch.c |  120 ++++++---------------------------
 kernel/trace/trace_sched_wakeup.c |  135 +++++++++-----------------------------
 2 files changed, 58 insertions(+), 197 deletions(-)

Index: linux-2.6-lttng/kernel/trace/trace_sched_switch.c
===================================================================
--- linux-2.6-lttng.orig/kernel/trace/trace_sched_switch.c	2008-07-09 23:50:26.000000000 -0400
+++ linux-2.6-lttng/kernel/trace/trace_sched_switch.c	2008-07-09 23:53:06.000000000 -0400
@@ -9,9 +9,9 @@
 #include <linux/debugfs.h>
 #include <linux/kallsyms.h>
 #include <linux/uaccess.h>
-#include <linux/marker.h>
 #include <linux/ftrace.h>
 
+#include "../sched-trace.h"
 #include "trace.h"
 
 static struct trace_array	*ctx_trace;
@@ -19,16 +19,17 @@ static int __read_mostly	tracer_enabled;
 static atomic_t			sched_ref;
 
 static void
-sched_switch_func(void *private, void *__rq, struct task_struct *prev,
+probe_sched_switch(struct rq *__rq, struct task_struct *prev,
 			struct task_struct *next)
 {
-	struct trace_array **ptr = private;
-	struct trace_array *tr = *ptr;
 	struct trace_array_cpu *data;
 	unsigned long flags;
 	long disabled;
 	int cpu;
 
+	if (!atomic_read(&sched_ref))
+		return;
+
 	tracing_record_cmdline(prev);
 	tracing_record_cmdline(next);
 
@@ -37,95 +38,42 @@ sched_switch_func(void *private, void *_
 
 	local_irq_save(flags);
 	cpu = raw_smp_processor_id();
-	data = tr->data[cpu];
+	data = ctx_trace->data[cpu];
 	disabled = atomic_inc_return(&data->disabled);
 
 	if (likely(disabled == 1))
-		tracing_sched_switch_trace(tr, data, prev, next, flags);
+		tracing_sched_switch_trace(ctx_trace, data, prev, next, flags);
 
 	atomic_dec(&data->disabled);
 	local_irq_restore(flags);
 }
 
-static notrace void
-sched_switch_callback(void *probe_data, void *call_data,
-		      const char *format, va_list *args)
-{
-	struct task_struct *prev;
-	struct task_struct *next;
-	struct rq *__rq;
-
-	if (!atomic_read(&sched_ref))
-		return;
-
-	/* skip prev_pid %d next_pid %d prev_state %ld */
-	(void)va_arg(*args, int);
-	(void)va_arg(*args, int);
-	(void)va_arg(*args, long);
-	__rq = va_arg(*args, typeof(__rq));
-	prev = va_arg(*args, typeof(prev));
-	next = va_arg(*args, typeof(next));
-
-	/*
-	 * If tracer_switch_func only points to the local
-	 * switch func, it still needs the ptr passed to it.
-	 */
-	sched_switch_func(probe_data, __rq, prev, next);
-}
-
 static void
-wakeup_func(void *private, void *__rq, struct task_struct *wakee, struct
-			task_struct *curr)
+probe_sched_wakeup(struct rq *__rq, struct task_struct *wakee)
 {
-	struct trace_array **ptr = private;
-	struct trace_array *tr = *ptr;
 	struct trace_array_cpu *data;
 	unsigned long flags;
 	long disabled;
 	int cpu;
 
-	if (!tracer_enabled)
+	if (!likely(tracer_enabled))
 		return;
 
-	tracing_record_cmdline(curr);
+	tracing_record_cmdline(current);
 
 	local_irq_save(flags);
 	cpu = raw_smp_processor_id();
-	data = tr->data[cpu];
+	data = ctx_trace->data[cpu];
 	disabled = atomic_inc_return(&data->disabled);
 
 	if (likely(disabled == 1))
-		tracing_sched_wakeup_trace(tr, data, wakee, curr, flags);
+		tracing_sched_wakeup_trace(ctx_trace, data, wakee, current,
+			flags);
 
 	atomic_dec(&data->disabled);
 	local_irq_restore(flags);
 }
 
-static notrace void
-wake_up_callback(void *probe_data, void *call_data,
-		 const char *format, va_list *args)
-{
-	struct task_struct *curr;
-	struct task_struct *task;
-	struct rq *__rq;
-
-	if (likely(!tracer_enabled))
-		return;
-
-	/* Skip pid %d state %ld */
-	(void)va_arg(*args, int);
-	(void)va_arg(*args, long);
-	/* now get the meat: "rq %p task %p rq->curr %p" */
-	__rq = va_arg(*args, typeof(__rq));
-	task = va_arg(*args, typeof(task));
-	curr = va_arg(*args, typeof(curr));
-
-	tracing_record_cmdline(task);
-	tracing_record_cmdline(curr);
-
-	wakeup_func(probe_data, __rq, task, curr);
-}
-
 static void sched_switch_reset(struct trace_array *tr)
 {
 	int cpu;
@@ -140,60 +88,40 @@ static int tracing_sched_register(void)
 {
 	int ret;
 
-	ret = marker_probe_register("kernel_sched_wakeup",
-			"pid %d state %ld ## rq %p task %p rq->curr %p",
-			wake_up_callback,
-			&ctx_trace);
+	ret = register_trace_sched_wakeup(probe_sched_wakeup);
 	if (ret) {
-		pr_info("wakeup trace: Couldn't add marker"
+		pr_info("wakeup trace: Couldn't activate tracepoint"
 			" probe to kernel_sched_wakeup\n");
 		return ret;
 	}
 
-	ret = marker_probe_register("kernel_sched_wakeup_new",
-			"pid %d state %ld ## rq %p task %p rq->curr %p",
-			wake_up_callback,
-			&ctx_trace);
+	ret = register_trace_sched_wakeup_new(probe_sched_wakeup);
 	if (ret) {
-		pr_info("wakeup trace: Couldn't add marker"
+		pr_info("wakeup trace: Couldn't activate tracepoint"
 			" probe to kernel_sched_wakeup_new\n");
 		goto fail_deprobe;
 	}
 
-	ret = marker_probe_register("kernel_sched_schedule",
-		"prev_pid %d next_pid %d prev_state %ld "
-		"## rq %p prev %p next %p",
-		sched_switch_callback,
-		&ctx_trace);
+	ret = register_trace_sched_switch(probe_sched_switch);
 	if (ret) {
-		pr_info("sched trace: Couldn't add marker"
+		pr_info("sched trace: Couldn't activate tracepoint"
 			" probe to kernel_sched_schedule\n");
 		goto fail_deprobe_wake_new;
 	}
 
 	return ret;
 fail_deprobe_wake_new:
-	marker_probe_unregister("kernel_sched_wakeup_new",
-				wake_up_callback,
-				&ctx_trace);
+	unregister_trace_sched_wakeup_new(probe_sched_wakeup);
 fail_deprobe:
-	marker_probe_unregister("kernel_sched_wakeup",
-				wake_up_callback,
-				&ctx_trace);
+	unregister_trace_sched_wakeup(probe_sched_wakeup);
 	return ret;
 }
 
 static void tracing_sched_unregister(void)
 {
-	marker_probe_unregister("kernel_sched_schedule",
-				sched_switch_callback,
-				&ctx_trace);
-	marker_probe_unregister("kernel_sched_wakeup_new",
-				wake_up_callback,
-				&ctx_trace);
-	marker_probe_unregister("kernel_sched_wakeup",
-				wake_up_callback,
-				&ctx_trace);
+	unregister_trace_sched_switch(probe_sched_switch);
+	unregister_trace_sched_wakeup_new(probe_sched_wakeup);
+	unregister_trace_sched_wakeup(probe_sched_wakeup);
 }
 
 static void tracing_start_sched_switch(void)
Index: linux-2.6-lttng/kernel/trace/trace_sched_wakeup.c
===================================================================
--- linux-2.6-lttng.orig/kernel/trace/trace_sched_wakeup.c	2008-07-09 23:50:26.000000000 -0400
+++ linux-2.6-lttng/kernel/trace/trace_sched_wakeup.c	2008-07-09 23:53:33.000000000 -0400
@@ -15,8 +15,8 @@
 #include <linux/kallsyms.h>
 #include <linux/uaccess.h>
 #include <linux/ftrace.h>
-#include <linux/marker.h>
 
+#include "../sched-trace.h"
 #include "trace.h"
 
 static struct trace_array	*wakeup_trace;
@@ -109,18 +109,18 @@ static int report_latency(cycle_t delta)
 }
 
 static void notrace
-wakeup_sched_switch(void *private, void *rq, struct task_struct *prev,
+probe_wakeup_sched_switch(struct rq *rq, struct task_struct *prev,
 	struct task_struct *next)
 {
 	unsigned long latency = 0, t0 = 0, t1 = 0;
-	struct trace_array **ptr = private;
-	struct trace_array *tr = *ptr;
 	struct trace_array_cpu *data;
 	cycle_t T0, T1, delta;
 	unsigned long flags;
 	long disabled;
 	int cpu;
 
+	tracing_record_cmdline(prev);
+
 	if (unlikely(!tracer_enabled))
 		return;
 
@@ -137,11 +137,11 @@ wakeup_sched_switch(void *private, void 
 		return;
 
 	/* The task we are waiting for is waking up */
-	data = tr->data[wakeup_cpu];
+	data = wakeup_trace->data[wakeup_cpu];
 
 	/* disable local data, not wakeup_cpu data */
 	cpu = raw_smp_processor_id();
-	disabled = atomic_inc_return(&tr->data[cpu]->disabled);
+	disabled = atomic_inc_return(&wakeup_trace->data[cpu]->disabled);
 	if (likely(disabled != 1))
 		goto out;
 
@@ -151,7 +151,7 @@ wakeup_sched_switch(void *private, void 
 	if (unlikely(!tracer_enabled || next != wakeup_task))
 		goto out_unlock;
 
-	trace_function(tr, data, CALLER_ADDR1, CALLER_ADDR2, flags);
+	trace_function(wakeup_trace, data, CALLER_ADDR1, CALLER_ADDR2, flags);
 
 	/*
 	 * usecs conversion is slow so we try to delay the conversion
@@ -170,38 +170,13 @@ wakeup_sched_switch(void *private, void 
 	t0 = nsecs_to_usecs(T0);
 	t1 = nsecs_to_usecs(T1);
 
-	update_max_tr(tr, wakeup_task, wakeup_cpu);
+	update_max_tr(wakeup_trace, wakeup_task, wakeup_cpu);
 
 out_unlock:
-	__wakeup_reset(tr);
+	__wakeup_reset(wakeup_trace);
 	spin_unlock_irqrestore(&wakeup_lock, flags);
 out:
-	atomic_dec(&tr->data[cpu]->disabled);
-}
-
-static notrace void
-sched_switch_callback(void *probe_data, void *call_data,
-		      const char *format, va_list *args)
-{
-	struct task_struct *prev;
-	struct task_struct *next;
-	struct rq *__rq;
-
-	/* skip prev_pid %d next_pid %d prev_state %ld */
-	(void)va_arg(*args, int);
-	(void)va_arg(*args, int);
-	(void)va_arg(*args, long);
-	__rq = va_arg(*args, typeof(__rq));
-	prev = va_arg(*args, typeof(prev));
-	next = va_arg(*args, typeof(next));
-
-	tracing_record_cmdline(prev);
-
-	/*
-	 * If tracer_switch_func only points to the local
-	 * switch func, it still needs the ptr passed to it.
-	 */
-	wakeup_sched_switch(probe_data, __rq, prev, next);
+	atomic_dec(&wakeup_trace->data[cpu]->disabled);
 }
 
 static void __wakeup_reset(struct trace_array *tr)
@@ -235,19 +210,24 @@ static void wakeup_reset(struct trace_ar
 }
 
 static void
-wakeup_check_start(struct trace_array *tr, struct task_struct *p,
-		   struct task_struct *curr)
+probe_wakeup(struct rq *rq, struct task_struct *p)
 {
 	int cpu = smp_processor_id();
 	unsigned long flags;
 	long disabled;
 
+	if (likely(!tracer_enabled))
+		return;
+
+	tracing_record_cmdline(p);
+	tracing_record_cmdline(current);
+
 	if (likely(!rt_task(p)) ||
 			p->prio >= wakeup_prio ||
-			p->prio >= curr->prio)
+			p->prio >= current->prio)
 		return;
 
-	disabled = atomic_inc_return(&tr->data[cpu]->disabled);
+	disabled = atomic_inc_return(&wakeup_trace->data[cpu]->disabled);
 	if (unlikely(disabled != 1))
 		goto out;
 
@@ -259,7 +239,7 @@ wakeup_check_start(struct trace_array *t
 		goto out_locked;
 
 	/* reset the trace */
-	__wakeup_reset(tr);
+	__wakeup_reset(wakeup_trace);
 
 	wakeup_cpu = task_cpu(p);
 	wakeup_prio = p->prio;
@@ -269,74 +249,37 @@ wakeup_check_start(struct trace_array *t
 
 	local_save_flags(flags);
 
-	tr->data[wakeup_cpu]->preempt_timestamp = ftrace_now(cpu);
-	trace_function(tr, tr->data[wakeup_cpu],
+	wakeup_trace->data[wakeup_cpu]->preempt_timestamp = ftrace_now(cpu);
+	trace_function(wakeup_trace, wakeup_trace->data[wakeup_cpu],
 		       CALLER_ADDR1, CALLER_ADDR2, flags);
 
 out_locked:
 	spin_unlock(&wakeup_lock);
 out:
-	atomic_dec(&tr->data[cpu]->disabled);
-}
-
-static notrace void
-wake_up_callback(void *probe_data, void *call_data,
-		 const char *format, va_list *args)
-{
-	struct trace_array **ptr = probe_data;
-	struct trace_array *tr = *ptr;
-	struct task_struct *curr;
-	struct task_struct *task;
-	struct rq *__rq;
-
-	if (likely(!tracer_enabled))
-		return;
-
-	/* Skip pid %d state %ld */
-	(void)va_arg(*args, int);
-	(void)va_arg(*args, long);
-	/* now get the meat: "rq %p task %p rq->curr %p" */
-	__rq = va_arg(*args, typeof(__rq));
-	task = va_arg(*args, typeof(task));
-	curr = va_arg(*args, typeof(curr));
-
-	tracing_record_cmdline(task);
-	tracing_record_cmdline(curr);
-
-	wakeup_check_start(tr, task, curr);
+	atomic_dec(&wakeup_trace->data[cpu]->disabled);
 }
 
 static void start_wakeup_tracer(struct trace_array *tr)
 {
 	int ret;
 
-	ret = marker_probe_register("kernel_sched_wakeup",
-			"pid %d state %ld ## rq %p task %p rq->curr %p",
-			wake_up_callback,
-			&wakeup_trace);
+	ret = register_trace_sched_wakeup(probe_wakeup);
 	if (ret) {
-		pr_info("wakeup trace: Couldn't add marker"
+		pr_info("wakeup trace: Couldn't activate tracepoint"
 			" probe to kernel_sched_wakeup\n");
 		return;
 	}
 
-	ret = marker_probe_register("kernel_sched_wakeup_new",
-			"pid %d state %ld ## rq %p task %p rq->curr %p",
-			wake_up_callback,
-			&wakeup_trace);
+	ret = register_trace_sched_wakeup_new(probe_wakeup);
 	if (ret) {
-		pr_info("wakeup trace: Couldn't add marker"
+		pr_info("wakeup trace: Couldn't activate tracepoint"
 			" probe to kernel_sched_wakeup_new\n");
 		goto fail_deprobe;
 	}
 
-	ret = marker_probe_register("kernel_sched_schedule",
-		"prev_pid %d next_pid %d prev_state %ld "
-		"## rq %p prev %p next %p",
-		sched_switch_callback,
-		&wakeup_trace);
+	ret = register_trace_sched_switch(probe_wakeup_sched_switch);
 	if (ret) {
-		pr_info("sched trace: Couldn't add marker"
+		pr_info("sched trace: Couldn't activate tracepoint"
 			" probe to kernel_sched_schedule\n");
 		goto fail_deprobe_wake_new;
 	}
@@ -357,28 +300,18 @@ static void start_wakeup_tracer(struct t
 
 	return;
 fail_deprobe_wake_new:
-	marker_probe_unregister("kernel_sched_wakeup_new",
-				wake_up_callback,
-				&wakeup_trace);
+	unregister_trace_sched_wakeup_new(probe_wakeup);
 fail_deprobe:
-	marker_probe_unregister("kernel_sched_wakeup",
-				wake_up_callback,
-				&wakeup_trace);
+	unregister_trace_sched_wakeup(probe_wakeup);
 }
 
 static void stop_wakeup_tracer(struct trace_array *tr)
 {
 	tracer_enabled = 0;
 	unregister_ftrace_function(&trace_ops);
-	marker_probe_unregister("kernel_sched_schedule",
-				sched_switch_callback,
-				&wakeup_trace);
-	marker_probe_unregister("kernel_sched_wakeup_new",
-				wake_up_callback,
-				&wakeup_trace);
-	marker_probe_unregister("kernel_sched_wakeup",
-				wake_up_callback,
-				&wakeup_trace);
+	unregister_trace_sched_switch(probe_wakeup_sched_switch);
+	unregister_trace_sched_wakeup_new(probe_wakeup);
+	unregister_trace_sched_wakeup(probe_wakeup);
 }
 
 static void wakeup_tracer_init(struct trace_array *tr)
 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch 12/15] LTTng instrumentation - hugetlb (update)
  2008-07-09 14:59 ` [patch 12/15] LTTng instrumentation - hugetlb Mathieu Desnoyers
@ 2008-07-11 14:30   ` Mathieu Desnoyers
  0 siblings, 0 replies; 58+ messages in thread
From: Mathieu Desnoyers @ 2008-07-11 14:30 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel
  Cc: William Lee Irwin III, Masami Hiramatsu, Peter Zijlstra,
	Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie, Steven Rostedt,
	Eduard - Gabriel Munteanu

Instrumentation of hugetlb activity (alloc/free/reserve/grab/release).

Those tracepoints are used by LTTng.

About the performance impact of tracepoints (which is comparable to markers),
even without immediate values optimizations, tests done by Hideo Aoki on ia64
show no regression. His test case was using hackbench on a kernel where
scheduler instrumentation (about 5 events in core scheduler code) was added.
See the "Tracepoints" patch header for performance result detail.

Changelog :
- instrument page grab, buddy allocator alloc, page release.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: William Lee Irwin III <wli@holomorphy.com>
CC: Masami Hiramatsu <mhiramat@redhat.com>
CC: 'Peter Zijlstra' <peterz@infradead.org>
CC: "Frank Ch. Eigler" <fche@redhat.com>
CC: 'Ingo Molnar' <mingo@elte.hu>
CC: 'Hideo AOKI' <haoki@redhat.com>
CC: Takashi Nishiie <t-nishiie@np.css.fujitsu.com>
CC: 'Steven Rostedt' <rostedt@goodmis.org>
CC: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
---
 mm/hugetlb-trace.h |   28 ++++++++++++++++++++++++++++
 mm/hugetlb.c       |   41 +++++++++++++++++++++++++++++------------
 2 files changed, 57 insertions(+), 12 deletions(-)

Index: linux-2.6-lttng/mm/hugetlb.c
===================================================================
--- linux-2.6-lttng.orig/mm/hugetlb.c	2008-07-10 00:23:34.000000000 -0400
+++ linux-2.6-lttng/mm/hugetlb.c	2008-07-11 10:23:48.000000000 -0400
@@ -14,6 +14,7 @@
 #include <linux/mempolicy.h>
 #include <linux/cpuset.h>
 #include <linux/mutex.h>
+#include "hugetlb-trace.h"
 
 #include <asm/page.h>
 #include <asm/pgtable.h>
@@ -123,6 +124,7 @@ static struct page *dequeue_huge_page_vm
 static void update_and_free_page(struct page *page)
 {
 	int i;
+	trace_hugetlb_page_release(page);
 	nr_huge_pages--;
 	nr_huge_pages_node[page_to_nid(page)]--;
 	for (i = 0; i < (HPAGE_SIZE / PAGE_SIZE); i++) {
@@ -141,6 +143,7 @@ static void free_huge_page(struct page *
 	int nid = page_to_nid(page);
 	struct address_space *mapping;
 
+	trace_hugetlb_page_free(page);
 	mapping = (struct address_space *) page_private(page);
 	set_page_private(page, 0);
 	BUG_ON(page_count(page));
@@ -205,7 +208,8 @@ static struct page *alloc_fresh_huge_pag
 	if (page) {
 		if (arch_prepare_hugepage(page)) {
 			__free_pages(page, HUGETLB_PAGE_ORDER);
-			return NULL;
+			page = NULL;
+			goto end;
 		}
 		set_compound_page_dtor(page, free_huge_page);
 		spin_lock(&hugetlb_lock);
@@ -214,7 +218,8 @@ static struct page *alloc_fresh_huge_pag
 		spin_unlock(&hugetlb_lock);
 		put_page(page); /* free it into the hugepage allocator */
 	}
-
+end:
+	trace_hugetlb_page_grab(page);
 	return page;
 }
 
@@ -288,7 +293,8 @@ static struct page *alloc_buddy_huge_pag
 	spin_lock(&hugetlb_lock);
 	if (surplus_huge_pages >= nr_overcommit_huge_pages) {
 		spin_unlock(&hugetlb_lock);
-		return NULL;
+		page = NULL;
+		goto end;
 	} else {
 		nr_huge_pages++;
 		surplus_huge_pages++;
@@ -321,7 +327,8 @@ static struct page *alloc_buddy_huge_pag
 		__count_vm_event(HTLB_BUDDY_PGALLOC_FAIL);
 	}
 	spin_unlock(&hugetlb_lock);
-
+end:
+	trace_hugetlb_buddy_pgalloc(page);
 	return page;
 }
 
@@ -510,6 +517,7 @@ static struct page *alloc_huge_page(stru
 		set_page_refcounted(page);
 		set_page_private(page, (unsigned long) mapping);
 	}
+	trace_hugetlb_page_alloc(page);
 	return page;
 }
 
@@ -1292,27 +1300,36 @@ out:
 
 int hugetlb_reserve_pages(struct inode *inode, long from, long to)
 {
-	long ret, chg;
+	int ret;
+	long chg;
 
 	chg = region_chg(&inode->i_mapping->private_list, from, to);
-	if (chg < 0)
-		return chg;
+	if (chg < 0) {
+		ret = chg;
+		goto end;
+	}
 
-	if (hugetlb_get_quota(inode->i_mapping, chg))
-		return -ENOSPC;
+	if (hugetlb_get_quota(inode->i_mapping, chg)) {
+		ret = -ENOSPC;
+		goto end;
+	}
 	ret = hugetlb_acct_memory(chg);
 	if (ret < 0) {
 		hugetlb_put_quota(inode->i_mapping, chg);
-		return ret;
+		goto end;
 	}
 	region_add(&inode->i_mapping->private_list, from, to);
-	return 0;
+end:
+	trace_hugetlb_pages_reserve(inode, from, to, ret);
+	return ret;
 }
 
 void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
 {
-	long chg = region_truncate(&inode->i_mapping->private_list, offset);
+	long chg;
 
+	trace_hugetlb_pages_unreserve(inode, offset, freed);
+	chg = region_truncate(&inode->i_mapping->private_list, offset);
 	spin_lock(&inode->i_lock);
 	inode->i_blocks -= BLOCKS_PER_HUGEPAGE * freed;
 	spin_unlock(&inode->i_lock);
Index: linux-2.6-lttng/mm/hugetlb-trace.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/mm/hugetlb-trace.h	2008-07-11 10:24:34.000000000 -0400
@@ -0,0 +1,28 @@
+#ifndef _HUGETLB_TRACE_H
+#define _HUGETLB_TRACE_H
+
+#include <linux/tracepoint.h>
+
+DEFINE_TRACE(hugetlb_page_release,
+	TPPROTO(struct page *page),
+	TPARGS(page));
+DEFINE_TRACE(hugetlb_page_grab,
+	TPPROTO(struct page *page),
+	TPARGS(page));
+DEFINE_TRACE(hugetlb_buddy_pgalloc,
+	TPPROTO(struct page *page),
+	TPARGS(page));
+DEFINE_TRACE(hugetlb_page_alloc,
+	TPPROTO(struct page *page),
+	TPARGS(page));
+DEFINE_TRACE(hugetlb_page_free,
+	TPPROTO(struct page *page),
+	TPARGS(page));
+DEFINE_TRACE(hugetlb_pages_reserve,
+	TPPROTO(struct inode *inode, long from, long to, int ret),
+	TPARGS(inode, from, to, ret));
+DEFINE_TRACE(hugetlb_pages_unreserve,
+	TPPROTO(struct inode *inode, long offset, long freed),
+	TPARGS(inode, offset, freed));
+
+#endif
-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 58+ messages in thread
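The DEFINE_TRACE() declarations in hugetlb-trace.h above expand to typed trace_*()/register_trace_*() functions. A minimal userspace mock of that shape (single probe slot, no RCU, no linker sections — the macro and event names here are illustrative only, not the kernel macro's actual expansion):

```c
#include <stddef.h>

/* Userspace mock of the DEFINE_TRACE pattern: one probe slot per
 * tracepoint, typed registration, no-op when nothing is connected.
 * This illustrates the shape of the generated API only; the real
 * kernel macro uses per-section tracepoint structs and RCU. */

#define MOCK_DEFINE_TRACE(name, proto, args)				\
	static void (*__probe_##name)(proto);				\
	static void trace_##name(proto)					\
	{								\
		if (__probe_##name)					\
			__probe_##name(args);				\
	}								\
	static int register_trace_##name(void (*probe)(proto))		\
	{								\
		__probe_##name = probe;					\
		return 0;						\
	}								\
	static void unregister_trace_##name(void (*probe)(proto))	\
	{								\
		if (__probe_##name == probe)				\
			__probe_##name = NULL;				\
	}

/* A hypothetical single-argument event for demonstration. */
MOCK_DEFINE_TRACE(demo_event, long value, value)

static long recorded = -1;
static void demo_probe(long value) { recorded = value; }
```

Note that type checking falls out for free: registering a probe whose prototype does not match the tracepoint's proto is a compile error, which is the whole point of tracepoints versus format-string markers.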

* Re: [PATCH] ftrace memory barriers
       [not found]               ` <Pine.LNX.4.58.0807141153250.29493@gandalf.stny.rr.com>
@ 2008-07-14 16:25                 ` Mathieu Desnoyers
  2008-07-14 16:35                   ` Steven Rostedt
  0 siblings, 1 reply; 58+ messages in thread
From: Mathieu Desnoyers @ 2008-07-14 16:25 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: Peter Zijlstra, linux-kernel

(CC'ed LKML)

* Steven Rostedt (rostedt@goodmis.org) wrote:
> 
> On Mon, 14 Jul 2008, Mathieu Desnoyers wrote:
> >
> > Thanks a lot Steve! By the way, you might be interested in looking at
> > the extra tracing enable/disabled tests you have in ftrace : it should
> > be enough to register/unregister the probes to activate/deactivate
> > tracing. I don't see why you would need if() checks in the probe itself.
> >
> 
> That was actually one of the causes of the bugs I had. We had two markers,
> one tracing schedule switches, the other wake ups. As soon as one is
> enabled, tracing starts, and you will miss the tracing of the other one.
> 
> The result was a strange trace where things were waking up and then the waker
> was being scheduled in??
> 
> I need to enable all markers at the same time, which can be done with the
> if () check. Not to mention, I like to be able to shut off tracing on oops
> without needing to do anything drastic (like deactivating probes).
> 
> -- Steve
> 

I see the reason why you need this, but beware that the activation
boolean you are using in your check won't be seen as active at the same
time by all CPUs because of out-of-order read/writes. So while the
update is per se atomic, the replication of the updated value to all
processors isn't. However, when one CPU sees the new version, this
version will stay valid.

This is one of the reasons why I don't aim to provide atomic activation
of multiple markers : thinking that the state will be atomically changed
on all CPUs without using the proper memory barriers is a bit bogus
anyway. Does ftrace expect such atomic propagation of the change across
all CPUs or does it deal with the information it receives on a per-cpu
basis ?

In any case, you should also make sure that you call synchronize_sched()
after the probe registration to make sure the quiescent state is reached
and that all probes are connected (so that no marker still sees the old
data structures). This should be done before you activate your boolean
in the probes.
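The ordering described above — publish the probe table first, wait for a grace period, and only then flip the enable flag — can be sketched with userspace C11 atomics (illustration only; the kernel gets the same effect with RCU-style publication plus synchronize_sched(), not this API):

```c
#include <stdatomic.h>
#include <stddef.h>

/* Userspace illustration (NOT the kernel API) of publish-then-enable:
 * make the probe table visible before setting the flag checked in the
 * hot path, so a reader that observes enabled != 0 also observes the
 * connected probes. */

typedef void (*probe_fn)(void);

static _Atomic(probe_fn *) probes;
static atomic_int enabled;

static int fired;
static void count_probe(void) { fired++; }
static probe_fn probe_table[] = { count_probe, NULL };

static void publish_probes(probe_fn *funcs)
{
	/* Step 1: publish the table (release pairs with acquire below). */
	atomic_store_explicit(&probes, funcs, memory_order_release);
	/* Step 2: only now flip the flag tested in the hot path. */
	atomic_store_explicit(&enabled, 1, memory_order_release);
}

static int fire_tracepoint(void)
{
	probe_fn *funcs;
	int n = 0;

	if (!atomic_load_explicit(&enabled, memory_order_acquire))
		return 0;
	funcs = atomic_load_explicit(&probes, memory_order_acquire);
	for (; funcs && *funcs; funcs++) {
		(*funcs)();
		n++;
	}
	return n;
}
```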

I agree that the extra check is very well suited to the OOPS case.

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] ftrace memory barriers
  2008-07-14 16:25                 ` [PATCH] ftrace memory barriers Mathieu Desnoyers
@ 2008-07-14 16:35                   ` Steven Rostedt
  0 siblings, 0 replies; 58+ messages in thread
From: Steven Rostedt @ 2008-07-14 16:35 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: Peter Zijlstra, linux-kernel


On Mon, 14 Jul 2008, Mathieu Desnoyers wrote:

>
> (CC'ed LKML)
>
> * Steven Rostedt (rostedt@goodmis.org) wrote:
> >
> > On Mon, 14 Jul 2008, Mathieu Desnoyers wrote:
> > >
> > > Thanks a lot Steve! By the way, you might be interested in looking at
> > > the extra tracing enable/disabled tests you have in ftrace : it should
> > > be enough to register/unregister the probes to activate/deactivate
> > > tracing. I don't see why you would need if() checks in the probe itself.
> > >
> >
> > That was actually one of the causes of the bugs I had. We had two markers,
> > one tracing schedule switches, the other wake ups. As soon as one is
> > enabled, tracing starts, and you will miss the tracing of the other one.
> >
> > The result was a strange trace where things were waking up and then the waker
> > was being scheduled in??
> >
> > I need to enable all markers at the same time, which can be done with the
> > if () check. Not to mention, I like to be able to shut off tracing on oops
> > without needing to do anything drastic (like deactivating probes).
> >
> > -- Steve
> >
>
> I see the reason why you need this, but beware that the activation
> boolean you are using in your check won't be seen as active at the same
> time by all CPUs because of out-of-order read/writes. So while the
> update is per se atomic, the replication of the updated value to all
> processors isn't. However, when one CPU sees the new version, this
> version will stay valid.

Yep, I understand that. It was the same-CPU trace that provided the funny
output. If it were a trace over separate CPUs, then that is something
people would accept.

>
> This is one of the reasons why I don't aim to provide atomic activation
> of multiple markers : thinking that the state will be atomically changed
> on all CPUs without using the proper memory barriers is a bit bogus
> anyway. Does ftrace expect such atomic propagation of the change across
> all CPUs or does it deal with the information it receives on a per-cpu
> basis ?

All the buffers and traces are per-cpu. We sort out the rest on output,
and we use the clock as the interleaver. Doing that doesn't guarantee that
events actually happen in the order they appear, but it gives one a
good idea of where to look.

>
> In any case, you should also make sure that you call synchronize_sched()
> after the probe registration to make sure the quiescent state is reached
> and that all probes are connected (so that no marker still sees the old
> data structures). This should be done before you activate your boolean
> in the probes.

Good to know, thanks. Hmm, I'm not sure that code can schedule :-/ Maybe
it can. I'll have to take a look.

>
> I agree that the extra check is very well suited to the OOPS case.


Thanks,

-- Steve


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch 01/15] Kernel Tracepoints
  2008-07-09 14:59 ` [patch 01/15] Kernel Tracepoints Mathieu Desnoyers
@ 2008-07-15  7:50   ` Peter Zijlstra
  2008-07-15 13:25     ` Mathieu Desnoyers
  0 siblings, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2008-07-15  7:50 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: akpm, Ingo Molnar, linux-kernel, Masami Hiramatsu,
	Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie, Steven Rostedt,
	Alexander Viro, Eduard - Gabriel Munteanu

On Wed, 2008-07-09 at 10:59 -0400, Mathieu Desnoyers wrote:
> plain text document attachment (tracepoints.patch)
> Implementation of kernel tracepoints. Inspired from the Linux Kernel Markers.
> Allows complete typing verification. No format string required. See the
> tracepoint Documentation and Samples patches for usage examples.

I think the patch description (aka changelog, not to be confused with
the below) could use a lot more attention. There are a lot of things
going on in this code, none of which are mentioned.

I often read changelogs when I try to understand a piece of code; this
one is utterly unfulfilling.

Aside from that, I think the general picture looks good.

I sprinkled some comments in the code below...

> Changelog :
> - Use #name ":" #proto as string to identify the tracepoint in the
>   tracepoint table. This will make sure no type mismatch happens due to
>   connection of a probe with the wrong type to a tracepoint declared with
>   the same name in a different header.
> - Add tracepoint_entry_free_old.
> 
> Masami Hiramatsu <mhiramat@redhat.com> :
> Tested on x86-64.
> 
> Performance impact of a tracepoint : same as markers, except that it adds about
> 70 bytes of instructions in an unlikely branch of each instrumented function
> (the for loop, the stack setup and the function call). It currently adds a
> memory read, a test and a conditional branch at the instrumentation site (in the
> hot path). Immediate values will eventually change this into a load immediate,
> test and branch, which removes the memory read which will make the i-cache
> impact smaller (changing the memory read for a load immediate removes 3-4 bytes
> per site on x86_32 (depending on mov prefixes), or 7-8 bytes on x86_64, it also
> saves the d-cache hit).
> 
> About the performance impact of tracepoints (which is comparable to markers),
> even without immediate values optimizations, tests done by Hideo Aoki on ia64
> show no regression. His test case was using hackbench on a kernel where
> scheduler instrumentation (about 5 events in core scheduler code) was added.

> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
> CC: 'Peter Zijlstra' <peterz@infradead.org>
> CC: "Frank Ch. Eigler" <fche@redhat.com>
> CC: 'Ingo Molnar' <mingo@elte.hu>
> CC: 'Hideo AOKI' <haoki@redhat.com>
> CC: Takashi Nishiie <t-nishiie@np.css.fujitsu.com>
> CC: 'Steven Rostedt' <rostedt@goodmis.org>
> CC: Alexander Viro <viro@zeniv.linux.org.uk>
> CC: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
> ---
>  include/asm-generic/vmlinux.lds.h |    6 
>  include/linux/module.h            |   17 +
>  include/linux/tracepoint.h        |  123 +++++++++
>  init/Kconfig                      |    7 
>  kernel/Makefile                   |    1 
>  kernel/module.c                   |   66 +++++
>  kernel/tracepoint.c               |  474 ++++++++++++++++++++++++++++++++++++++
>  7 files changed, 692 insertions(+), 2 deletions(-)
> 
> Index: linux-2.6-lttng/init/Kconfig
> ===================================================================
> --- linux-2.6-lttng.orig/init/Kconfig	2008-07-09 10:55:46.000000000 -0400
> +++ linux-2.6-lttng/init/Kconfig	2008-07-09 10:55:58.000000000 -0400
> @@ -782,6 +782,13 @@ config PROFILING
>  	  Say Y here to enable the extended profiling support mechanisms used
>  	  by profilers such as OProfile.
>  
> +config TRACEPOINTS
> +	bool "Activate tracepoints"
> +	default y
> +	help
> +	  Place an empty function call at each tracepoint site. Can be
> +	  dynamically changed for a probe function.
> +
>  config MARKERS
>  	bool "Activate markers"
>  	help
> Index: linux-2.6-lttng/kernel/Makefile
> ===================================================================
> --- linux-2.6-lttng.orig/kernel/Makefile	2008-07-09 10:55:46.000000000 -0400
> +++ linux-2.6-lttng/kernel/Makefile	2008-07-09 10:55:58.000000000 -0400
> @@ -77,6 +77,7 @@ obj-$(CONFIG_SYSCTL) += utsname_sysctl.o
>  obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
>  obj-$(CONFIG_TASKSTATS) += taskstats.o tsacct.o
>  obj-$(CONFIG_MARKERS) += marker.o
> +obj-$(CONFIG_TRACEPOINTS) += tracepoint.o
>  obj-$(CONFIG_LATENCYTOP) += latencytop.o
>  obj-$(CONFIG_FTRACE) += trace/
>  obj-$(CONFIG_TRACING) += trace/
> Index: linux-2.6-lttng/include/linux/tracepoint.h
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6-lttng/include/linux/tracepoint.h	2008-07-09 10:55:58.000000000 -0400
> @@ -0,0 +1,123 @@
> +#ifndef _LINUX_TRACEPOINT_H
> +#define _LINUX_TRACEPOINT_H
> +
> +/*
> + * Kernel Tracepoint API.
> + *
> + * See Documentation/tracepoint.txt.
> + *
> + * (C) Copyright 2008 Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> + *
> + * Heavily inspired from the Linux Kernel Markers.
> + *
> + * This file is released under the GPLv2.
> + * See the file COPYING for more details.
> + */
> +
> +#include <linux/types.h>
> +
> +struct module;
> +struct tracepoint;
> +
> +struct tracepoint {
> +	const char *name;		/* Tracepoint name */
> +	int state;			/* State. */
> +	void **funcs;
> +} __attribute__((aligned(8)));
> +
> +
> +#define TPPROTO(args...)	args
> +#define TPARGS(args...)		args
> +
> +#ifdef CONFIG_TRACEPOINTS
> +
> +#define __DO_TRACE(tp, proto, args)					\
> +	do {								\
> +		int i;							\
> +		void **funcs;						\
> +		preempt_disable();					\
> +		funcs = (tp)->funcs;					\
> +		smp_read_barrier_depends();				\
> +		if (funcs) {						\
> +			for (i = 0; funcs[i]; i++) {			\

can't you get rid of 'i' and write:

  void **func;

  preempt_disable();
  func = (tp)->funcs;
  smp_read_barrier_depends();
  for (; func && *func; func++)
    ((void (*)(proto))(*func))(args);
  preempt_enable();

Also, why is the preempt_disable needed?
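In userspace terms, the NULL-terminated probe-array walk that __DO_TRACE performs can be simulated as follows (sketch only; preempt_disable() and smp_read_barrier_depends() have no equivalent here, and the function names are invented for the demo):

```c
#include <stddef.h>

/* Simulation of __DO_TRACE's dispatch: walk a NULL-terminated array of
 * probe function pointers and call each one with the tracepoint
 * arguments.  The kernel version brackets this with
 * preempt_disable()/preempt_enable() and orders the funcs load with
 * smp_read_barrier_depends() so probes can be swapped under RCU. */

typedef void (*probe_fn)(int arg);

static int ncalls;
static int last_seen;

static void probe_one(int arg) { ncalls++; last_seen = arg; }
static void probe_two(int arg) { ncalls++; last_seen = arg; }

/* NULL-terminated, matching the multi-probe array layout. */
static probe_fn demo_funcs[] = { probe_one, probe_two, NULL };

static void do_trace(probe_fn *funcs, int arg)
{
	int i;

	if (!funcs)
		return;		/* no probe connected: cheap no-op */
	for (i = 0; funcs[i]; i++)
		funcs[i](arg);
}
```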

> +				((void(*)(proto))(funcs[i]))(args);	\
> +			}						\
> +		}							\
> +		preempt_enable();					\
> +	} while (0)
> +
> +/*
> + * Make sure the alignment of the structure in the __tracepoints section will
> + * not add unwanted padding between the beginning of the section and the
> + * structure. Force alignment to the same alignment as the section start.
> + */
> +#define DEFINE_TRACE(name, proto, args)					\
> +	static inline void trace_##name(proto)				\
> +	{								\
> +		static const char __tpstrtab_##name[]			\
> +		__attribute__((section("__tracepoints_strings")))	\
> +		= #name ":" #proto;					\
> +		static struct tracepoint __tracepoint_##name		\
> +		__attribute__((section("__tracepoints"), aligned(8))) =	\
> +		{ __tpstrtab_##name, 0, NULL };				\
> +		if (unlikely(__tracepoint_##name.state))		\
> +			__DO_TRACE(&__tracepoint_##name,		\
> +				TPPROTO(proto), TPARGS(args));		\
> +	}								\
> +	static inline int register_trace_##name(void (*probe)(proto))	\
> +	{								\
> +		return tracepoint_probe_register(#name ":" #proto,	\
> +			(void *)probe);					\
> +	}								\
> +	static inline void unregister_trace_##name(void (*probe)(proto))\
> +	{								\
> +		tracepoint_probe_unregister(#name ":" #proto,		\
> +			(void *)probe);					\
> +	}
> +
> +extern void tracepoint_update_probe_range(struct tracepoint *begin,
> +	struct tracepoint *end);
> +
> +#else /* !CONFIG_TRACEPOINTS */
> +#define DEFINE_TRACE(name, proto, args)			\
> +	static inline void _do_trace_##name(struct tracepoint *tp, proto) \
> +	{ }								\
> +	static inline void trace_##name(proto)				\
> +	{ }								\
> +	static inline int register_trace_##name(void (*probe)(proto))	\
> +	{								\
> +		return -ENOSYS;						\
> +	}								\
> +	static inline void unregister_trace_##name(void (*probe)(proto))\
> +	{ }
> +
> +static inline void tracepoint_update_probe_range(struct tracepoint *begin,
> +	struct tracepoint *end)
> +{ }
> +#endif /* CONFIG_TRACEPOINTS */
> +
> +/*
> + * Connect a probe to a tracepoint.
> + * Internal API, should not be used directly.
> + */
> +extern int tracepoint_probe_register(const char *name, void *probe);
> +
> +/*
> + * Disconnect a probe from a tracepoint.
> + * Internal API, should not be used directly.
> + */
> +extern int tracepoint_probe_unregister(const char *name, void *probe);
> +
> +struct tracepoint_iter {
> +	struct module *module;
> +	struct tracepoint *tracepoint;
> +};
> +
> +extern void tracepoint_iter_start(struct tracepoint_iter *iter);
> +extern void tracepoint_iter_next(struct tracepoint_iter *iter);
> +extern void tracepoint_iter_stop(struct tracepoint_iter *iter);
> +extern void tracepoint_iter_reset(struct tracepoint_iter *iter);
> +extern int tracepoint_get_iter_range(struct tracepoint **tracepoint,
> +	struct tracepoint *begin, struct tracepoint *end);
> +
> +#endif
> Index: linux-2.6-lttng/include/asm-generic/vmlinux.lds.h
> ===================================================================
> --- linux-2.6-lttng.orig/include/asm-generic/vmlinux.lds.h	2008-07-09 10:55:46.000000000 -0400
> +++ linux-2.6-lttng/include/asm-generic/vmlinux.lds.h	2008-07-09 10:55:58.000000000 -0400
> @@ -52,7 +52,10 @@
>  	. = ALIGN(8);							\
>  	VMLINUX_SYMBOL(__start___markers) = .;				\
>  	*(__markers)							\
> -	VMLINUX_SYMBOL(__stop___markers) = .;
> +	VMLINUX_SYMBOL(__stop___markers) = .;				\
> +	VMLINUX_SYMBOL(__start___tracepoints) = .;			\
> +	*(__tracepoints)						\
> +	VMLINUX_SYMBOL(__stop___tracepoints) = .;
>  
>  #define RO_DATA(align)							\
>  	. = ALIGN((align));						\
> @@ -61,6 +64,7 @@
>  		*(.rodata) *(.rodata.*)					\
>  		*(__vermagic)		/* Kernel version magic */	\
>  		*(__markers_strings)	/* Markers: strings */		\
> +		*(__tracepoints_strings)/* Tracepoints: strings */	\
>  	}								\
>  									\
>  	.rodata1          : AT(ADDR(.rodata1) - LOAD_OFFSET) {		\
> Index: linux-2.6-lttng/kernel/tracepoint.c
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6-lttng/kernel/tracepoint.c	2008-07-09 10:55:58.000000000 -0400
> @@ -0,0 +1,474 @@
> +/*
> + * Copyright (C) 2008 Mathieu Desnoyers
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
> + */
> +#include <linux/module.h>
> +#include <linux/mutex.h>
> +#include <linux/types.h>
> +#include <linux/jhash.h>
> +#include <linux/list.h>
> +#include <linux/rcupdate.h>
> +#include <linux/tracepoint.h>
> +#include <linux/err.h>
> +#include <linux/slab.h>
> +
> +extern struct tracepoint __start___tracepoints[];
> +extern struct tracepoint __stop___tracepoints[];
> +
> +/* Set to 1 to enable tracepoint debug output */
> +static const int tracepoint_debug;
> +
> +/*
> + * tracepoints_mutex nests inside module_mutex. Tracepoints mutex protects the
> + * builtin and module tracepoints and the hash table.
> + */
> +static DEFINE_MUTEX(tracepoints_mutex);
> +
> +/*
> + * Tracepoint hash table, containing the active tracepoints.
> + * Protected by tracepoints_mutex.
> + */
> +#define TRACEPOINT_HASH_BITS 6
> +#define TRACEPOINT_TABLE_SIZE (1 << TRACEPOINT_HASH_BITS)
> +
> +/*
> + * Note about RCU :
> + * It is used to delay the freeing of multiple-probe arrays until a quiescent
> + * state is reached.
> + * Tracepoint entries modifications are protected by the tracepoints_mutex.
> + */
> +struct tracepoint_entry {
> +	struct hlist_node hlist;
> +	void **funcs;
> +	int refcount;	/* Number of times armed. 0 if disarmed. */
> +	struct rcu_head rcu;
> +	void *oldptr;
> +	unsigned char rcu_pending:1;
> +	char name[0];
> +};
> +
> +static struct hlist_head tracepoint_table[TRACEPOINT_TABLE_SIZE];
> +
> +static void free_old_closure(struct rcu_head *head)
> +{
> +	struct tracepoint_entry *entry = container_of(head,
> +		struct tracepoint_entry, rcu);
> +	kfree(entry->oldptr);
> +	/* Make sure we free the data before setting the pending flag to 0 */
> +	smp_wmb();
> +	entry->rcu_pending = 0;
> +}
> +
> +static void tracepoint_entry_free_old(struct tracepoint_entry *entry, void *old)
> +{
> +	if (!old)
> +		return;
> +	entry->oldptr = old;
> +	entry->rcu_pending = 1;
> +	/* write rcu_pending before calling the RCU callback */
> +	smp_wmb();
> +#ifdef CONFIG_PREEMPT_RCU
> +	synchronize_sched();	/* Until we have the call_rcu_sched() */
> +#endif

Does this have something to do with the preempt_disable above?

> +	call_rcu(&entry->rcu, free_old_closure);
> +}
> +
> +static void debug_print_probes(struct tracepoint_entry *entry)
> +{
> +	int i;
> +
> +	if (!tracepoint_debug)
> +		return;
> +
> +	for (i = 0; entry->funcs[i]; i++)
> +		printk(KERN_DEBUG "Probe %d : %p\n", i, entry->funcs[i]);
> +}
> +
> +static void *
> +tracepoint_entry_add_probe(struct tracepoint_entry *entry, void *probe)
> +{
> +	int nr_probes = 0;
> +	void **old, **new;
> +
> +	WARN_ON(!probe);
> +
> +	debug_print_probes(entry);
> +	old = entry->funcs;
> +	if (old) {
> +		/* (N -> N+1), (N != 0, 1) probes */
> +		for (nr_probes = 0; old[nr_probes]; nr_probes++)
> +			if (old[nr_probes] == probe)
> +				return ERR_PTR(-EBUSY);

-EEXIST ?

> +	}
> +	/* + 2 : one for new probe, one for NULL func */
> +	new = kzalloc((nr_probes + 2) * sizeof(void *), GFP_KERNEL);
> +	if (new == NULL)
> +		return ERR_PTR(-ENOMEM);
> +	if (old)
> +		memcpy(new, old, nr_probes * sizeof(void *));
> +	new[nr_probes] = probe;
> +	entry->refcount = nr_probes + 1;
> +	entry->funcs = new;
> +	debug_print_probes(entry);
> +	return old;
> +}
> +
> +static void *
> +tracepoint_entry_remove_probe(struct tracepoint_entry *entry, void *probe)
> +{
> +	int nr_probes = 0, nr_del = 0, i;
> +	void **old, **new;
> +
> +	old = entry->funcs;
> +
> +	debug_print_probes(entry);
> +	/* (N -> M), (N > 1, M >= 0) probes */
> +	for (nr_probes = 0; old[nr_probes]; nr_probes++) {
> +		if ((!probe || old[nr_probes] == probe))
> +			nr_del++;
> +	}
> +
> +	if (nr_probes - nr_del == 0) {
> +		/* N -> 0, (N > 1) */
> +		entry->funcs = NULL;
> +		entry->refcount = 0;
> +		debug_print_probes(entry);
> +		return old;
> +	} else {
> +		int j = 0;
> +		/* N -> M, (N > 1, M > 0) */
> +		/* + 1 for NULL */
> +		new = kzalloc((nr_probes - nr_del + 1)
> +			* sizeof(void *), GFP_KERNEL);
> +		if (new == NULL)
> +			return ERR_PTR(-ENOMEM);
> +		for (i = 0; old[i]; i++)
> +			if ((probe && old[i] != probe))
> +				new[j++] = old[i];
> +		entry->refcount = nr_probes - nr_del;
> +		entry->funcs = new;
> +	}
> +	debug_print_probes(entry);
> +	return old;
> +}
> +
> +/*
> + * Get tracepoint if the tracepoint is present in the tracepoint hash table.
> + * Must be called with tracepoints_mutex held.
> + * Returns NULL if not present.
> + */
> +static struct tracepoint_entry *get_tracepoint(const char *name)
> +{
> +	struct hlist_head *head;
> +	struct hlist_node *node;
> +	struct tracepoint_entry *e;
> +	u32 hash = jhash(name, strlen(name), 0);
> +
> +	head = &tracepoint_table[hash & ((1 << TRACEPOINT_HASH_BITS)-1)];
> +	hlist_for_each_entry(e, node, head, hlist) {
> +		if (!strcmp(name, e->name))
> +			return e;
> +	}
> +	return NULL;
> +}
> +
> +/*
> + * Add the tracepoint to the tracepoint hash table. Must be called with
> + * tracepoints_mutex held.
> + */
> +static struct tracepoint_entry *add_tracepoint(const char *name)
> +{
> +	struct hlist_head *head;
> +	struct hlist_node *node;
> +	struct tracepoint_entry *e;
> +	size_t name_len = strlen(name) + 1;
> +	u32 hash = jhash(name, name_len-1, 0);
> +
> +	head = &tracepoint_table[hash & ((1 << TRACEPOINT_HASH_BITS)-1)];
> +	hlist_for_each_entry(e, node, head, hlist) {
> +		if (!strcmp(name, e->name)) {
> +			printk(KERN_NOTICE
> +				"tracepoint %s busy\n", name);
> +			return ERR_PTR(-EBUSY);	/* Already there */

-EEXIST

> +		}
> +	}
> +	/*
> +	 * Using kmalloc here to allocate a variable length element. Could
> +	 * cause some memory fragmentation if overused.
> +	 */
> +	e = kmalloc(sizeof(struct tracepoint_entry) + name_len, GFP_KERNEL);
> +	if (!e)
> +		return ERR_PTR(-ENOMEM);
> +	memcpy(&e->name[0], name, name_len);
> +	e->funcs = NULL;
> +	e->refcount = 0;
> +	e->rcu_pending = 0;
> +	hlist_add_head(&e->hlist, head);
> +	return e;
> +}
> +
> +/*
> + * Remove the tracepoint from the tracepoint hash table. Must be called with
> + * mutex_lock held.
> + */
> +static int remove_tracepoint(const char *name)
> +{
> +	struct hlist_head *head;
> +	struct hlist_node *node;
> +	struct tracepoint_entry *e;
> +	int found = 0;
> +	size_t len = strlen(name) + 1;
> +	u32 hash = jhash(name, len-1, 0);
> +
> +	head = &tracepoint_table[hash & ((1 << TRACEPOINT_HASH_BITS)-1)];
> +	hlist_for_each_entry(e, node, head, hlist) {
> +		if (!strcmp(name, e->name)) {
> +			found = 1;
> +			break;
> +		}
> +	}
> +	if (!found)
> +		return -ENOENT;
> +	if (e->refcount)
> +		return -EBUSY;

ok, this really is busy.

> +	hlist_del(&e->hlist);
> +	/* Make sure the call_rcu has been executed */
> +	if (e->rcu_pending)
> +		rcu_barrier();
> +	kfree(e);
> +	return 0;
> +}
> +
> +/*
> + * Sets the probe callback corresponding to one tracepoint.
> + */
> +static void set_tracepoint(struct tracepoint_entry **entry,
> +	struct tracepoint *elem, int active)
> +{
> +	WARN_ON(strcmp((*entry)->name, elem->name) != 0);
> +
> +	smp_wmb();
> +	/*
> +	 * We also make sure that the new probe callbacks array is consistent
> +	 * before setting a pointer to it.
> +	 */
> +	rcu_assign_pointer(elem->funcs, (*entry)->funcs);

rcu_assign_pointer() already does that wmb !?
Also, it's polite to reference the pairing site in the barrier comment.
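
For reference, a rough sketch of how rcu_assign_pointer() was defined
around that time (the exact form in rcupdate.h may differ slightly):

```
/* Sketch: publish v into p only after a write barrier, so readers
 * dereferencing p see fully-initialized data behind the pointer. */
#define rcu_assign_pointer(p, v)	({ \
						smp_wmb(); \
						(p) = (v); \
					})
```

The smp_wmb() is built into the assignment, which is why the explicit
barrier before it is redundant.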

> +	elem->state = active;
> +}
> +
> +/*
> + * Disable a tracepoint and its probe callback.
> + * Note: only waiting an RCU grace period after setting elem->call to the
> + * empty function ensures that the original callback is not used anymore.
> + * This is ensured by the preempt_disable around the call site.
> + */
> +static void disable_tracepoint(struct tracepoint *elem)
> +{
> +	elem->state = 0;
> +}
> +
> +/**
> + * tracepoint_update_probe_range - Update a probe range
> + * @begin: beginning of the range
> + * @end: end of the range
> + *
> + * Updates the probe callback corresponding to a range of tracepoints.
> + */
> +void tracepoint_update_probe_range(struct tracepoint *begin,
> +	struct tracepoint *end)
> +{
> +	struct tracepoint *iter;
> +	struct tracepoint_entry *mark_entry;
> +
> +	mutex_lock(&tracepoints_mutex);
> +	for (iter = begin; iter < end; iter++) {
> +		mark_entry = get_tracepoint(iter->name);
> +		if (mark_entry) {
> +			set_tracepoint(&mark_entry, iter,
> +					!!mark_entry->refcount);
> +		} else {
> +			disable_tracepoint(iter);
> +		}
> +	}
> +	mutex_unlock(&tracepoints_mutex);
> +}
> +
> +/*
> + * Update probes, removing the faulty probes.
> + */
> +static void tracepoint_update_probes(void)
> +{
> +	/* Core kernel tracepoints */
> +	tracepoint_update_probe_range(__start___tracepoints,
> +		__stop___tracepoints);
> +	/* tracepoints in modules. */
> +	module_update_tracepoints();
> +}
> +
> +/**
> + * tracepoint_probe_register -  Connect a probe to a tracepoint
> + * @name: tracepoint name
> + * @probe: probe handler
> + *
> + * Returns 0 if ok, error value on error.
> + * The probe address must at least be aligned on the architecture pointer size.
> + */
> +int tracepoint_probe_register(const char *name, void *probe)
> +{
> +	struct tracepoint_entry *entry;
> +	int ret = 0;
> +	void *old;
> +
> +	mutex_lock(&tracepoints_mutex);
> +	entry = get_tracepoint(name);
> +	if (!entry) {
> +		entry = add_tracepoint(name);
> +		if (IS_ERR(entry)) {
> +			ret = PTR_ERR(entry);
> +			goto end;
> +		}
> +	}
> +	/*
> +	 * If we detect that a call_rcu is pending for this tracepoint,
> +	 * make sure it's executed now.
> +	 */
> +	if (entry->rcu_pending)
> +		rcu_barrier();
> +	old = tracepoint_entry_add_probe(entry, probe);
> +	if (IS_ERR(old)) {
> +		ret = PTR_ERR(old);
> +		goto end;
> +	}
> +	mutex_unlock(&tracepoints_mutex);
> +	tracepoint_update_probes();		/* may update entry */
> +	mutex_lock(&tracepoints_mutex);
> +	entry = get_tracepoint(name);
> +	WARN_ON(!entry);
> +	tracepoint_entry_free_old(entry, old);
> +end:
> +	mutex_unlock(&tracepoints_mutex);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(tracepoint_probe_register);
> +
> +/**
> + * tracepoint_probe_unregister -  Disconnect a probe from a tracepoint
> + * @name: tracepoint name
> + * @probe: probe function pointer
> + *
> + * We do not need to call a synchronize_sched to make sure the probes have
> + * finished running before doing a module unload, because the module unload
> + * itself uses stop_machine(), which ensures that every preempt-disabled
> + * section has finished.
> + */
> +int tracepoint_probe_unregister(const char *name, void *probe)
> +{
> +	struct tracepoint_entry *entry;
> +	void *old;
> +	int ret = -ENOENT;
> +
> +	mutex_lock(&tracepoints_mutex);
> +	entry = get_tracepoint(name);
> +	if (!entry)
> +		goto end;
> +	if (entry->rcu_pending)
> +		rcu_barrier();
> +	old = tracepoint_entry_remove_probe(entry, probe);
> +	mutex_unlock(&tracepoints_mutex);
> +	tracepoint_update_probes();		/* may update entry */
> +	mutex_lock(&tracepoints_mutex);
> +	entry = get_tracepoint(name);
> +	if (!entry)
> +		goto end;
> +	tracepoint_entry_free_old(entry, old);
> +	remove_tracepoint(name);	/* Ignore busy error message */
> +	ret = 0;
> +end:
> +	mutex_unlock(&tracepoints_mutex);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(tracepoint_probe_unregister);
> +
> +/**
> + * tracepoint_get_iter_range - Get a next tracepoint iterator given a range.
> + * @tracepoint: current tracepoint (in), next tracepoint (out)
> + * @begin: beginning of the range
> + * @end: end of the range
> + *
> + * Returns whether a next tracepoint has been found (1) or not (0).
> + * Will return the first tracepoint in the range if the input tracepoint is
> + * NULL.
> + */
> +int tracepoint_get_iter_range(struct tracepoint **tracepoint,
> +	struct tracepoint *begin, struct tracepoint *end)
> +{
> +	if (!*tracepoint && begin != end) {
> +		*tracepoint = begin;
> +		return 1;
> +	}
> +	if (*tracepoint >= begin && *tracepoint < end)
> +		return 1;
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(tracepoint_get_iter_range);
> +
> +static void tracepoint_get_iter(struct tracepoint_iter *iter)
> +{
> +	int found = 0;
> +
> +	/* Core kernel tracepoints */
> +	if (!iter->module) {
> +		found = tracepoint_get_iter_range(&iter->tracepoint,
> +				__start___tracepoints, __stop___tracepoints);
> +		if (found)
> +			goto end;
> +	}
> +	/* tracepoints in modules. */
> +	found = module_get_iter_tracepoints(iter);
> +end:
> +	if (!found)
> +		tracepoint_iter_reset(iter);
> +}
> +
> +void tracepoint_iter_start(struct tracepoint_iter *iter)
> +{
> +	tracepoint_get_iter(iter);
> +}
> +EXPORT_SYMBOL_GPL(tracepoint_iter_start);
> +
> +void tracepoint_iter_next(struct tracepoint_iter *iter)
> +{
> +	iter->tracepoint++;
> +	/*
> +	 * iter->tracepoint may be invalid because we blindly incremented it.
> +	 * Make sure it is valid by marshalling on the tracepoints, getting the
> +	 * tracepoints from following modules if necessary.
> +	 */
> +	tracepoint_get_iter(iter);
> +}
> +EXPORT_SYMBOL_GPL(tracepoint_iter_next);
> +
> +void tracepoint_iter_stop(struct tracepoint_iter *iter)
> +{
> +}
> +EXPORT_SYMBOL_GPL(tracepoint_iter_stop);
> +
> +void tracepoint_iter_reset(struct tracepoint_iter *iter)
> +{
> +	iter->module = NULL;
> +	iter->tracepoint = NULL;
> +}
> +EXPORT_SYMBOL_GPL(tracepoint_iter_reset);
> Index: linux-2.6-lttng/kernel/module.c
> ===================================================================
> --- linux-2.6-lttng.orig/kernel/module.c	2008-07-09 10:55:46.000000000 -0400
> +++ linux-2.6-lttng/kernel/module.c	2008-07-09 10:55:58.000000000 -0400
> @@ -47,6 +47,7 @@
>  #include <asm/sections.h>
>  #include <linux/license.h>
>  #include <asm/sections.h>
> +#include <linux/tracepoint.h>
>  
>  #if 0
>  #define DEBUGP printk
> @@ -1824,6 +1825,8 @@ static struct module *load_module(void _
>  #endif
>  	unsigned int markersindex;
>  	unsigned int markersstringsindex;
> +	unsigned int tracepointsindex;
> +	unsigned int tracepointsstringsindex;
>  	struct module *mod;
>  	long err = 0;
>  	void *percpu = NULL, *ptr = NULL; /* Stops spurious gcc warning */
> @@ -2110,6 +2113,9 @@ static struct module *load_module(void _
>  	markersindex = find_sec(hdr, sechdrs, secstrings, "__markers");
>   	markersstringsindex = find_sec(hdr, sechdrs, secstrings,
>  					"__markers_strings");
> +	tracepointsindex = find_sec(hdr, sechdrs, secstrings, "__tracepoints");
> +	tracepointsstringsindex = find_sec(hdr, sechdrs, secstrings,
> +					"__tracepoints_strings");
>  
>  	/* Now do relocations. */
>  	for (i = 1; i < hdr->e_shnum; i++) {
> @@ -2137,6 +2143,12 @@ static struct module *load_module(void _
>  	mod->num_markers =
>  		sechdrs[markersindex].sh_size / sizeof(*mod->markers);
>  #endif
> +#ifdef CONFIG_TRACEPOINTS
> +	mod->tracepoints = (void *)sechdrs[tracepointsindex].sh_addr;
> +	mod->num_tracepoints =
> +		sechdrs[tracepointsindex].sh_size / sizeof(*mod->tracepoints);
> +#endif
> +
>  
>          /* Find duplicate symbols */
>  	err = verify_export_symbols(mod);
> @@ -2155,11 +2167,16 @@ static struct module *load_module(void _
>  
>  	add_kallsyms(mod, sechdrs, symindex, strindex, secstrings);
>  
> +	if (!mod->taints) {
>  #ifdef CONFIG_MARKERS
> -	if (!mod->taints)
>  		marker_update_probe_range(mod->markers,
>  			mod->markers + mod->num_markers);
>  #endif
> +#ifdef CONFIG_TRACEPOINTS
> +		tracepoint_update_probe_range(mod->tracepoints,
> +			mod->tracepoints + mod->num_tracepoints);
> +#endif
> +	}
>  	err = module_finalize(hdr, sechdrs, mod);
>  	if (err < 0)
>  		goto cleanup;
> @@ -2710,3 +2727,50 @@ void module_update_markers(void)
>  	mutex_unlock(&module_mutex);
>  }
>  #endif
> +
> +#ifdef CONFIG_TRACEPOINTS
> +void module_update_tracepoints(void)
> +{
> +	struct module *mod;
> +
> +	mutex_lock(&module_mutex);
> +	list_for_each_entry(mod, &modules, list)
> +		if (!mod->taints)
> +			tracepoint_update_probe_range(mod->tracepoints,
> +				mod->tracepoints + mod->num_tracepoints);
> +	mutex_unlock(&module_mutex);
> +}
> +
> +/*
> + * Returns 0 if current not found.
> + * Returns 1 if current found.
> + */
> +int module_get_iter_tracepoints(struct tracepoint_iter *iter)
> +{
> +	struct module *iter_mod;
> +	int found = 0;
> +
> +	mutex_lock(&module_mutex);
> +	list_for_each_entry(iter_mod, &modules, list) {
> +		if (!iter_mod->taints) {
> +			/*
> +			 * Sorted module list
> +			 */
> +			if (iter_mod < iter->module)
> +				continue;
> +			else if (iter_mod > iter->module)
> +				iter->tracepoint = NULL;
> +			found = tracepoint_get_iter_range(&iter->tracepoint,
> +				iter_mod->tracepoints,
> +				iter_mod->tracepoints
> +					+ iter_mod->num_tracepoints);
> +			if (found) {
> +				iter->module = iter_mod;
> +				break;
> +			}
> +		}
> +	}
> +	mutex_unlock(&module_mutex);
> +	return found;
> +}
> +#endif
> Index: linux-2.6-lttng/include/linux/module.h
> ===================================================================
> --- linux-2.6-lttng.orig/include/linux/module.h	2008-07-09 10:55:46.000000000 -0400
> +++ linux-2.6-lttng/include/linux/module.h	2008-07-09 10:57:22.000000000 -0400
> @@ -16,6 +16,7 @@
>  #include <linux/kobject.h>
>  #include <linux/moduleparam.h>
>  #include <linux/marker.h>
> +#include <linux/tracepoint.h>
>  #include <asm/local.h>
>  
>  #include <asm/module.h>
> @@ -331,6 +332,10 @@ struct module
>  	struct marker *markers;
>  	unsigned int num_markers;
>  #endif
> +#ifdef CONFIG_TRACEPOINTS
> +	struct tracepoint *tracepoints;
> +	unsigned int num_tracepoints;
> +#endif
>  
>  #ifdef CONFIG_MODULE_UNLOAD
>  	/* What modules depend on me? */
> @@ -454,6 +459,9 @@ extern void print_modules(void);
>  
>  extern void module_update_markers(void);
>  
> +extern void module_update_tracepoints(void);
> +extern int module_get_iter_tracepoints(struct tracepoint_iter *iter);
> +
>  #else /* !CONFIG_MODULES... */
>  #define EXPORT_SYMBOL(sym)
>  #define EXPORT_SYMBOL_GPL(sym)
> @@ -558,6 +566,15 @@ static inline void module_update_markers
>  {
>  }
>  
> +static inline void module_update_tracepoints(void)
> +{
> +}
> +
> +static inline int module_get_iter_tracepoints(struct tracepoint_iter *iter)
> +{
> +	return 0;
> +}
> +
>  #endif /* CONFIG_MODULES */
>  
>  struct device_driver;
> 



* Re: [patch 01/15] Kernel Tracepoints
  2008-07-15  7:50   ` Peter Zijlstra
@ 2008-07-15 13:25     ` Mathieu Desnoyers
  2008-07-15 13:59       ` Peter Zijlstra
  2008-07-15 14:03       ` Peter Zijlstra
  0 siblings, 2 replies; 58+ messages in thread
From: Mathieu Desnoyers @ 2008-07-15 13:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: akpm, Ingo Molnar, linux-kernel, Masami Hiramatsu,
	Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie, Steven Rostedt,
	Alexander Viro, Eduard - Gabriel Munteanu

* Peter Zijlstra (peterz@infradead.org) wrote:
> On Wed, 2008-07-09 at 10:59 -0400, Mathieu Desnoyers wrote:
> > plain text document attachment (tracepoints.patch)
> > Implementation of kernel tracepoints. Inspired from the Linux Kernel Markers.
> > Allows complete typing verification. No format string required. See the
> > tracepoint Documentation and Samples patches for usage examples.
> 

Hi Peter,

Thanks for the review,

> I think the patch description (aka changelog, not to be confused with
> the below) could use a lot more attention.. There are a lot of things
> going on in this code non of which are mentioned.
> 
> I often read changelogs when I try to understand a piece of code, this
> one is utterly unfulfilling.
> 

Yes, given that I started from the marker code as a base, I did not
re-explain everything that was left unchanged. It's worth giving more
details about this, though; I'll address it.

> Aside from that, I think the general picture looks good.
> 
> I sprinkled some comments in the code below...
> 

Thanks, let's look at them,

> > Changelog :
> > - Use #name ":" #proto as string to identify the tracepoint in the
> >   tracepoint table. This will make sure no type mismatch happens due to
> >   connection of a probe with the wrong type to a tracepoint declared with
> >   the same name in a different header.
> > - Add tracepoint_entry_free_old.
> > 
> > Masami Hiramatsu <mhiramat@redhat.com> :
> > Tested on x86-64.
> > 
> > Performance impact of a tracepoint : same as markers, except that it adds about
> > 70 bytes of instructions in an unlikely branch of each instrumented function
> > (the for loop, the stack setup and the function call). It currently adds a
> > memory read, a test and a conditional branch at the instrumentation site (in the
> > hot path). Immediate values will eventually change this into a load immediate,
> > test and branch, which removes the memory read which will make the i-cache
> > impact smaller (changing the memory read for a load immediate removes 3-4 bytes
> > per site on x86_32 (depending on mov prefixes), or 7-8 bytes on x86_64, it also
> > saves the d-cache hit).
> > 
> > About the performance impact of tracepoints (which is comparable to markers),
> > even without immediate values optimizations, tests done by Hideo Aoki on ia64
> > show no regression. His test case was using hackbench on a kernel where
> > scheduler instrumentation (about 5 events in code scheduler code) was added.
> 
> > Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> > Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
> > CC: 'Peter Zijlstra' <peterz@infradead.org>
> > CC: "Frank Ch. Eigler" <fche@redhat.com>
> > CC: 'Ingo Molnar' <mingo@elte.hu>
> > CC: 'Hideo AOKI' <haoki@redhat.com>
> > CC: Takashi Nishiie <t-nishiie@np.css.fujitsu.com>
> > CC: 'Steven Rostedt' <rostedt@goodmis.org>
> > CC: Alexander Viro <viro@zeniv.linux.org.uk>
> > CC: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
> > ---
> >  include/asm-generic/vmlinux.lds.h |    6 
> >  include/linux/module.h            |   17 +
> >  include/linux/tracepoint.h        |  123 +++++++++
> >  init/Kconfig                      |    7 
> >  kernel/Makefile                   |    1 
> >  kernel/module.c                   |   66 +++++
> >  kernel/tracepoint.c               |  474 ++++++++++++++++++++++++++++++++++++++
> >  7 files changed, 692 insertions(+), 2 deletions(-)
> > 
> > Index: linux-2.6-lttng/init/Kconfig
> > ===================================================================
> > --- linux-2.6-lttng.orig/init/Kconfig	2008-07-09 10:55:46.000000000 -0400
> > +++ linux-2.6-lttng/init/Kconfig	2008-07-09 10:55:58.000000000 -0400
> > @@ -782,6 +782,13 @@ config PROFILING
> >  	  Say Y here to enable the extended profiling support mechanisms used
> >  	  by profilers such as OProfile.
> >  
> > +config TRACEPOINTS
> > +	bool "Activate tracepoints"
> > +	default y
> > +	help
> > +	  Place an empty function call at each tracepoint site. Can be
> > +	  dynamically changed for a probe function.
> > +
> >  config MARKERS
> >  	bool "Activate markers"
> >  	help
> > Index: linux-2.6-lttng/kernel/Makefile
> > ===================================================================
> > --- linux-2.6-lttng.orig/kernel/Makefile	2008-07-09 10:55:46.000000000 -0400
> > +++ linux-2.6-lttng/kernel/Makefile	2008-07-09 10:55:58.000000000 -0400
> > @@ -77,6 +77,7 @@ obj-$(CONFIG_SYSCTL) += utsname_sysctl.o
> >  obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
> >  obj-$(CONFIG_TASKSTATS) += taskstats.o tsacct.o
> >  obj-$(CONFIG_MARKERS) += marker.o
> > +obj-$(CONFIG_TRACEPOINTS) += tracepoint.o
> >  obj-$(CONFIG_LATENCYTOP) += latencytop.o
> >  obj-$(CONFIG_FTRACE) += trace/
> >  obj-$(CONFIG_TRACING) += trace/
> > Index: linux-2.6-lttng/include/linux/tracepoint.h
> > ===================================================================
> > --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> > +++ linux-2.6-lttng/include/linux/tracepoint.h	2008-07-09 10:55:58.000000000 -0400
> > @@ -0,0 +1,123 @@
> > +#ifndef _LINUX_TRACEPOINT_H
> > +#define _LINUX_TRACEPOINT_H
> > +
> > +/*
> > + * Kernel Tracepoint API.
> > + *
> > + * See Documentation/tracepoint.txt.
> > + *
> > + * (C) Copyright 2008 Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> > + *
> > + * Heavily inspired from the Linux Kernel Markers.
> > + *
> > + * This file is released under the GPLv2.
> > + * See the file COPYING for more details.
> > + */
> > +
> > +#include <linux/types.h>
> > +
> > +struct module;
> > +struct tracepoint;
> > +
> > +struct tracepoint {
> > +	const char *name;		/* Tracepoint name */
> > +	int state;			/* State. */
> > +	void **funcs;
> > +} __attribute__((aligned(8)));
> > +
> > +
> > +#define TPPROTO(args...)	args
> > +#define TPARGS(args...)		args
> > +
> > +#ifdef CONFIG_TRACEPOINTS
> > +
> > +#define __DO_TRACE(tp, proto, args)					\
> > +	do {								\
> > +		int i;							\
> > +		void **funcs;						\
> > +		preempt_disable();					\
> > +		funcs = (tp)->funcs;					\
> > +		smp_read_barrier_depends();				\
> > +		if (funcs) {						\
> > +			for (i = 0; funcs[i]; i++) {			\
> 
> can't you get rid of 'i' and write:
> 
>   void **func;
> 
>   preempt_disable();
>   func = (tp)->funcs;
>   smp_read_barrier_depends();
>   for (; func; func++)
>     ((void (*)(proto))func)(args);
>   preempt_enable();
> 

Yes, I thought there would be an optimization to do here; I'll use your
proposal. This code snippet is especially important since it generates
instructions at every tracepoint site, so saving a few bytes matters.

Given that (tp)->funcs references an array of function pointers and that
it can be NULL, the if (funcs) test must still be there and we must use

#define __DO_TRACE(tp, proto, args)					\
	do {								\
		void *func;						\
									\
		preempt_disable();					\
		if ((tp)->funcs) {					\
			func = rcu_dereference((tp)->funcs);		\
			for (; func; func++) {				\
				((void(*)(proto))(func))(args);		\
			}						\
		}							\
		preempt_enable();					\
	} while (0)
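
One nit on the loop as written: it increments a plain `void *` and calls
it directly, whereas the array holds pointers to functions, so what is
presumably intended is to walk the NULL-terminated array with a pointer
to each slot and call *func. A minimal userspace sketch of that
iteration (the names here are illustrative, not from the patch):

```c
#include <assert.h>
#include <stdio.h>

/* Probe signature used for this sketch: one int argument. */
typedef void (*probe_fn)(int);

static int ncalls;

static void probe_a(int v) { ncalls++; printf("probe_a(%d)\n", v); }
static void probe_b(int v) { ncalls++; printf("probe_b(%d)\n", v); }

/* Walk a NULL-terminated array of probe pointers, calling each one --
 * the equivalent of the __DO_TRACE loop body. Note the *it_func:
 * the slot is dereferenced, then the resulting function is called. */
static int call_probes(probe_fn *funcs, int v)
{
	probe_fn *it_func;
	int n = 0;

	for (it_func = funcs; *it_func; it_func++) {
		(*it_func)(v);
		n++;
	}
	return n;
}
```

With `probe_fn funcs[] = { probe_a, probe_b, NULL };`, calling
`call_probes(funcs, 42)` invokes both probes and returns 2.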


The resulting assembly is a bit more dense than my previous
implementation, which is good :

On x86_64 :

 820:   bf 01 00 00 00          mov    $0x1,%edi
 825:   e8 00 00 00 00          callq  82a <thread_return+0x136>
 82a:   48 8b 05 00 00 00 00    mov    0x0(%rip),%rax        # 831 <thread_retur
n+0x13d>
 831:   48 85 c0                test   %rax,%rax
 834:   74 22                   je     858 <thread_return+0x164>
 836:   48 8b 18                mov    (%rax),%rbx
 839:   48 85 db                test   %rbx,%rbx
 83c:   74 1a                   je     858 <thread_return+0x164>
 83e:   66 90                   xchg   %ax,%ax
 840:   48 8b 95 68 ff ff ff    mov    -0x98(%rbp),%rdx
 847:   48 8b b5 60 ff ff ff    mov    -0xa0(%rbp),%rsi
 84e:   4c 89 e7                mov    %r12,%rdi
 851:   ff d3                   callq  *%rbx
 853:   48 ff c3                inc    %rbx
 856:   75 e8                   jne    840 <thread_return+0x14c>
 858:   bf 01 00 00 00          mov    $0x1,%edi
 85d:   e8 00 00 00 00          callq  862 <thread_return+0x16e>
 862:

That's 66 bytes total (for a 2-argument tracepoint). Note that these
bytes sit outside the critical path, in an unlikely branch. The branch
test bytes have been discussed thoroughly in the "Immediate Values"
work, which can be seen as an optimization to be integrated later.

The current branch code added to the critical path is, on x86_64 :

 5ff:   b8 00 00 00 00          mov    $0x0,%eax
 604:   85 c0                   test   %eax,%eax
 606:   0f 85 14 02 00 00       jne    820 <thread_return+0x12c>

The immediate values can make this smaller by using a 2-byte movb
instruction to populate eax, which would save 3 bytes of cache-hot
cachelines.

> Also, why is the preempt_disable needed?
> 

Addition and removal of tracepoints is synchronized by RCU, using the
scheduler (and preempt_disable) to guarantee that a quiescent state is
reached (this is really "classic" RCU). The update side uses
rcu_barrier_sched() with call_rcu_sched(), and the read/execute side
uses preempt_disable()/preempt_enable().


> > +				((void(*)(proto))(funcs[i]))(args);	\
> > +			}						\
> > +		}							\
> > +		preempt_enable();					\
> > +	} while (0)
> > +
> > +/*
> > + * Make sure the alignment of the structure in the __tracepoints section will
> > + * not add unwanted padding between the beginning of the section and the
> > + * structure. Force alignment to the same alignment as the section start.
> > + */
> > +#define DEFINE_TRACE(name, proto, args)					\
> > +	static inline void trace_##name(proto)				\
> > +	{								\
> > +		static const char __tpstrtab_##name[]			\
> > +		__attribute__((section("__tracepoints_strings")))	\
> > +		= #name ":" #proto;					\
> > +		static struct tracepoint __tracepoint_##name		\
> > +		__attribute__((section("__tracepoints"), aligned(8))) =	\
> > +		{ __tpstrtab_##name, 0, NULL };				\
> > +		if (unlikely(__tracepoint_##name.state))		\
> > +			__DO_TRACE(&__tracepoint_##name,		\
> > +				TPPROTO(proto), TPARGS(args));		\
> > +	}								\
> > +	static inline int register_trace_##name(void (*probe)(proto))	\
> > +	{								\
> > +		return tracepoint_probe_register(#name ":" #proto,	\
> > +			(void *)probe);					\
> > +	}								\
> > +	static inline void unregister_trace_##name(void (*probe)(proto))\
> > +	{								\
> > +		tracepoint_probe_unregister(#name ":" #proto,		\
> > +			(void *)probe);					\
> > +	}
> > +
> > +extern void tracepoint_update_probe_range(struct tracepoint *begin,
> > +	struct tracepoint *end);
> > +
> > +#else /* !CONFIG_TRACEPOINTS */
> > +#define DEFINE_TRACE(name, proto, args)			\
> > +	static inline void _do_trace_##name(struct tracepoint *tp, proto) \
> > +	{ }								\
> > +	static inline void trace_##name(proto)				\
> > +	{ }								\
> > +	static inline int register_trace_##name(void (*probe)(proto))	\
> > +	{								\
> > +		return -ENOSYS;						\
> > +	}								\
> > +	static inline void unregister_trace_##name(void (*probe)(proto))\
> > +	{ }
> > +
> > +static inline void tracepoint_update_probe_range(struct tracepoint *begin,
> > +	struct tracepoint *end)
> > +{ }
> > +#endif /* CONFIG_TRACEPOINTS */
> > +
> > +/*
> > + * Connect a probe to a tracepoint.
> > + * Internal API, should not be used directly.
> > + */
> > +extern int tracepoint_probe_register(const char *name, void *probe);
> > +
> > +/*
> > + * Disconnect a probe from a tracepoint.
> > + * Internal API, should not be used directly.
> > + */
> > +extern int tracepoint_probe_unregister(const char *name, void *probe);
> > +
> > +struct tracepoint_iter {
> > +	struct module *module;
> > +	struct tracepoint *tracepoint;
> > +};
> > +
> > +extern void tracepoint_iter_start(struct tracepoint_iter *iter);
> > +extern void tracepoint_iter_next(struct tracepoint_iter *iter);
> > +extern void tracepoint_iter_stop(struct tracepoint_iter *iter);
> > +extern void tracepoint_iter_reset(struct tracepoint_iter *iter);
> > +extern int tracepoint_get_iter_range(struct tracepoint **tracepoint,
> > +	struct tracepoint *begin, struct tracepoint *end);
> > +
> > +#endif
> > Index: linux-2.6-lttng/include/asm-generic/vmlinux.lds.h
> > ===================================================================
> > --- linux-2.6-lttng.orig/include/asm-generic/vmlinux.lds.h	2008-07-09 10:55:46.000000000 -0400
> > +++ linux-2.6-lttng/include/asm-generic/vmlinux.lds.h	2008-07-09 10:55:58.000000000 -0400
> > @@ -52,7 +52,10 @@
> >  	. = ALIGN(8);							\
> >  	VMLINUX_SYMBOL(__start___markers) = .;				\
> >  	*(__markers)							\
> > -	VMLINUX_SYMBOL(__stop___markers) = .;
> > +	VMLINUX_SYMBOL(__stop___markers) = .;				\
> > +	VMLINUX_SYMBOL(__start___tracepoints) = .;			\
> > +	*(__tracepoints)						\
> > +	VMLINUX_SYMBOL(__stop___tracepoints) = .;
> >  
> >  #define RO_DATA(align)							\
> >  	. = ALIGN((align));						\
> > @@ -61,6 +64,7 @@
> >  		*(.rodata) *(.rodata.*)					\
> >  		*(__vermagic)		/* Kernel version magic */	\
> >  		*(__markers_strings)	/* Markers: strings */		\
> > +		*(__tracepoints_strings)/* Tracepoints: strings */	\
> >  	}								\
> >  									\
> >  	.rodata1          : AT(ADDR(.rodata1) - LOAD_OFFSET) {		\
> > Index: linux-2.6-lttng/kernel/tracepoint.c
> > ===================================================================
> > --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> > +++ linux-2.6-lttng/kernel/tracepoint.c	2008-07-09 10:55:58.000000000 -0400
> > @@ -0,0 +1,474 @@
> > +/*
> > + * Copyright (C) 2008 Mathieu Desnoyers
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License as published by
> > + * the Free Software Foundation; either version 2 of the License, or
> > + * (at your option) any later version.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU General Public License
> > + * along with this program; if not, write to the Free Software
> > + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
> > + */
> > +#include <linux/module.h>
> > +#include <linux/mutex.h>
> > +#include <linux/types.h>
> > +#include <linux/jhash.h>
> > +#include <linux/list.h>
> > +#include <linux/rcupdate.h>
> > +#include <linux/tracepoint.h>
> > +#include <linux/err.h>
> > +#include <linux/slab.h>
> > +
> > +extern struct tracepoint __start___tracepoints[];
> > +extern struct tracepoint __stop___tracepoints[];
> > +
> > +/* Set to 1 to enable tracepoint debug output */
> > +static const int tracepoint_debug;
> > +
> > +/*
> > + * tracepoints_mutex nests inside module_mutex. Tracepoints mutex protects the
> > + * builtin and module tracepoints and the hash table.
> > + */
> > +static DEFINE_MUTEX(tracepoints_mutex);
> > +
> > +/*
> > + * Tracepoint hash table, containing the active tracepoints.
> > + * Protected by tracepoints_mutex.
> > + */
> > +#define TRACEPOINT_HASH_BITS 6
> > +#define TRACEPOINT_TABLE_SIZE (1 << TRACEPOINT_HASH_BITS)
> > +
> > +/*
> > + * Note about RCU :
> > + * It is used to delay the freeing of old probe arrays until a quiescent
> > + * state is reached.
> > + * Tracepoint entries modifications are protected by the tracepoints_mutex.
> > + */
> > +struct tracepoint_entry {
> > +	struct hlist_node hlist;
> > +	void **funcs;
> > +	int refcount;	/* Number of times armed. 0 if disarmed. */
> > +	struct rcu_head rcu;
> > +	void *oldptr;
> > +	unsigned char rcu_pending:1;
> > +	char name[0];
> > +};
> > +
> > +static struct hlist_head tracepoint_table[TRACEPOINT_TABLE_SIZE];
> > +
> > +static void free_old_closure(struct rcu_head *head)
> > +{
> > +	struct tracepoint_entry *entry = container_of(head,
> > +		struct tracepoint_entry, rcu);
> > +	kfree(entry->oldptr);
> > +	/* Make sure we free the data before setting the pending flag to 0 */
> > +	smp_wmb();
> > +	entry->rcu_pending = 0;
> > +}
> > +
> > +static void tracepoint_entry_free_old(struct tracepoint_entry *entry, void *old)
> > +{
> > +	if (!old)
> > +		return;
> > +	entry->oldptr = old;
> > +	entry->rcu_pending = 1;
> > +	/* write rcu_pending before calling the RCU callback */
> > +	smp_wmb();
> > +#ifdef CONFIG_PREEMPT_RCU
> > +	synchronize_sched();	/* Until we have the call_rcu_sched() */
> > +#endif
> 
> Does this have something to do with the preempt_disable above?
> 

Yes, it does. We make sure the previous array containing probes, which
has been scheduled for deletion by the rcu callback, is indeed freed
before we proceed to the next update. It therefore limits the rate of
modification of a single tracepoint to one update per RCU period. The
objective here is to permit fast batch add/removal of probes on
_different_ tracepoints.

This use of "synchronize_sched()" can be changed for call_rcu_sched() in
linux-next, I'll fix this.


> > +	call_rcu(&entry->rcu, free_old_closure);
> > +}
> > +
> > +static void debug_print_probes(struct tracepoint_entry *entry)
> > +{
> > +	int i;
> > +
> > +	if (!tracepoint_debug)
> > +		return;
> > +
> > +	for (i = 0; entry->funcs[i]; i++)
> > +		printk(KERN_DEBUG "Probe %d : %p\n", i, entry->funcs[i]);
> > +}
> > +
> > +static void *
> > +tracepoint_entry_add_probe(struct tracepoint_entry *entry, void *probe)
> > +{
> > +	int nr_probes = 0;
> > +	void **old, **new;
> > +
> > +	WARN_ON(!probe);
> > +
> > +	debug_print_probes(entry);
> > +	old = entry->funcs;
> > +	if (old) {
> > +		/* (N -> N+1), (N != 0, 1) probes */
> > +		for (nr_probes = 0; old[nr_probes]; nr_probes++)
> > +			if (old[nr_probes] == probe)
> > +				return ERR_PTR(-EBUSY);
> 
> -EEXIST ?
> 

good point.

> > +	}
> > +	/* + 2 : one for new probe, one for NULL func */
> > +	new = kzalloc((nr_probes + 2) * sizeof(void *), GFP_KERNEL);
> > +	if (new == NULL)
> > +		return ERR_PTR(-ENOMEM);
> > +	if (old)
> > +		memcpy(new, old, nr_probes * sizeof(void *));
> > +	new[nr_probes] = probe;
> > +	entry->refcount = nr_probes + 1;
> > +	entry->funcs = new;
> > +	debug_print_probes(entry);
> > +	return old;
> > +}
> > +
> > +static void *
> > +tracepoint_entry_remove_probe(struct tracepoint_entry *entry, void *probe)
> > +{
> > +	int nr_probes = 0, nr_del = 0, i;
> > +	void **old, **new;
> > +
> > +	old = entry->funcs;
> > +
> > +	debug_print_probes(entry);
> > +	/* (N -> M), (N > 1, M >= 0) probes */
> > +	for (nr_probes = 0; old[nr_probes]; nr_probes++) {
> > +		if ((!probe || old[nr_probes] == probe))
> > +			nr_del++;
> > +	}
> > +
> > +	if (nr_probes - nr_del == 0) {
> > +		/* N -> 0, (N > 1) */
> > +		entry->funcs = NULL;
> > +		entry->refcount = 0;
> > +		debug_print_probes(entry);
> > +		return old;
> > +	} else {
> > +		int j = 0;
> > +		/* N -> M, (N > 1, M > 0) */
> > +		/* + 1 for NULL */
> > +		new = kzalloc((nr_probes - nr_del + 1)
> > +			* sizeof(void *), GFP_KERNEL);
> > +		if (new == NULL)
> > +			return ERR_PTR(-ENOMEM);
> > +		for (i = 0; old[i]; i++)
> > +			if ((probe && old[i] != probe))
> > +				new[j++] = old[i];
> > +		entry->refcount = nr_probes - nr_del;
> > +		entry->funcs = new;
> > +	}
> > +	debug_print_probes(entry);
> > +	return old;
> > +}
> > +
> > +/*
> > + * Get tracepoint if the tracepoint is present in the tracepoint hash table.
> > + * Must be called with tracepoints_mutex held.
> > + * Returns NULL if not present.
> > + */
> > +static struct tracepoint_entry *get_tracepoint(const char *name)
> > +{
> > +	struct hlist_head *head;
> > +	struct hlist_node *node;
> > +	struct tracepoint_entry *e;
> > +	u32 hash = jhash(name, strlen(name), 0);
> > +
> > +	head = &tracepoint_table[hash & ((1 << TRACEPOINT_HASH_BITS)-1)];
> > +	hlist_for_each_entry(e, node, head, hlist) {
> > +		if (!strcmp(name, e->name))
> > +			return e;
> > +	}
> > +	return NULL;
> > +}
> > +
> > +/*
> > + * Add the tracepoint to the tracepoint hash table. Must be called with
> > + * tracepoints_mutex held.
> > + */
> > +static struct tracepoint_entry *add_tracepoint(const char *name)
> > +{
> > +	struct hlist_head *head;
> > +	struct hlist_node *node;
> > +	struct tracepoint_entry *e;
> > +	size_t name_len = strlen(name) + 1;
> > +	u32 hash = jhash(name, name_len-1, 0);
> > +
> > +	head = &tracepoint_table[hash & ((1 << TRACEPOINT_HASH_BITS)-1)];
> > +	hlist_for_each_entry(e, node, head, hlist) {
> > +		if (!strcmp(name, e->name)) {
> > +			printk(KERN_NOTICE
> > +				"tracepoint %s busy\n", name);
> > +			return ERR_PTR(-EBUSY);	/* Already there */
> 
> -EEXIST
> 

Yes.

> > +		}
> > +	}
> > +	/*
> > +	 * Using kmalloc here to allocate a variable length element. Could
> > +	 * cause some memory fragmentation if overused.
> > +	 */
> > +	e = kmalloc(sizeof(struct tracepoint_entry) + name_len, GFP_KERNEL);
> > +	if (!e)
> > +		return ERR_PTR(-ENOMEM);
> > +	memcpy(&e->name[0], name, name_len);
> > +	e->funcs = NULL;
> > +	e->refcount = 0;
> > +	e->rcu_pending = 0;
> > +	hlist_add_head(&e->hlist, head);
> > +	return e;
> > +}
> > +
> > +/*
> > + * Remove the tracepoint from the tracepoint hash table. Must be called with
> > + * mutex_lock held.
> > + */
> > +static int remove_tracepoint(const char *name)
> > +{
> > +	struct hlist_head *head;
> > +	struct hlist_node *node;
> > +	struct tracepoint_entry *e;
> > +	int found = 0;
> > +	size_t len = strlen(name) + 1;
> > +	u32 hash = jhash(name, len-1, 0);
> > +
> > +	head = &tracepoint_table[hash & ((1 << TRACEPOINT_HASH_BITS)-1)];
> > +	hlist_for_each_entry(e, node, head, hlist) {
> > +		if (!strcmp(name, e->name)) {
> > +			found = 1;
> > +			break;
> > +		}
> > +	}
> > +	if (!found)
> > +		return -ENOENT;
> > +	if (e->refcount)
> > +		return -EBUSY;
> 
> ok, this really is busy.
> 
> > +	hlist_del(&e->hlist);
> > +	/* Make sure the call_rcu has been executed */
> > +	if (e->rcu_pending)
> > +		rcu_barrier();
> > +	kfree(e);
> > +	return 0;
> > +}
> > +
> > +/*
> > + * Sets the probe callback corresponding to one tracepoint.
> > + */
> > +static void set_tracepoint(struct tracepoint_entry **entry,
> > +	struct tracepoint *elem, int active)
> > +{
> > +	WARN_ON(strcmp((*entry)->name, elem->name) != 0);
> > +
> > +	smp_wmb();
> > +	/*
> > +	 * We also make sure that the new probe callbacks array is consistent
> > +	 * before setting a pointer to it.
> > +	 */
> > +	rcu_assign_pointer(elem->funcs, (*entry)->funcs);
> 
> rcu_assign_pointer() already does that wmb !?
> Also, its polite to reference the pairing site in the barrier comment.
> 

Good point. I'll then remove the wmb, and change the
smp_read_barrier_depends for a rcu_dereference in __DO_TRACE. It will
make things clearer.

The comments becomes :

        /*
         * rcu_assign_pointer has a smp_wmb() which makes sure that the new
         * probe callbacks array is consistent before setting a pointer to it.
         * This array is referenced by __DO_TRACE from
         * include/linux/tracepoints.h. A matching rcu_dereference() is used.
         */

I'll release a new version including those changes shortly,

Thanks,

Mathieu

> > +	elem->state = active;
> > +}
> > +
> > +/*
> > + * Disable a tracepoint and its probe callback.
> > + * Note: only waiting an RCU period after setting elem->call to the empty
> > + * function ensures that the original callback is not used anymore. This is
> > + * ensured by preempt_disable around the call site.
> > + */
> > +static void disable_tracepoint(struct tracepoint *elem)
> > +{
> > +	elem->state = 0;
> > +}
> > +
> > +/**
> > + * tracepoint_update_probe_range - Update a probe range
> > + * @begin: beginning of the range
> > + * @end: end of the range
> > + *
> > + * Updates the probe callback corresponding to a range of tracepoints.
> > + */
> > +void tracepoint_update_probe_range(struct tracepoint *begin,
> > +	struct tracepoint *end)
> > +{
> > +	struct tracepoint *iter;
> > +	struct tracepoint_entry *mark_entry;
> > +
> > +	mutex_lock(&tracepoints_mutex);
> > +	for (iter = begin; iter < end; iter++) {
> > +		mark_entry = get_tracepoint(iter->name);
> > +		if (mark_entry) {
> > +			set_tracepoint(&mark_entry, iter,
> > +					!!mark_entry->refcount);
> > +		} else {
> > +			disable_tracepoint(iter);
> > +		}
> > +	}
> > +	mutex_unlock(&tracepoints_mutex);
> > +}
> > +
> > +/*
> > + * Update probes, removing the faulty probes.
> > + */
> > +static void tracepoint_update_probes(void)
> > +{
> > +	/* Core kernel tracepoints */
> > +	tracepoint_update_probe_range(__start___tracepoints,
> > +		__stop___tracepoints);
> > +	/* tracepoints in modules. */
> > +	module_update_tracepoints();
> > +}
> > +
> > +/**
> > + * tracepoint_probe_register -  Connect a probe to a tracepoint
> > + * @name: tracepoint name
> > + * @probe: probe handler
> > + *
> > + * Returns 0 if ok, error value on error.
> > + * The probe address must at least be aligned on the architecture pointer size.
> > + */
> > +int tracepoint_probe_register(const char *name, void *probe)
> > +{
> > +	struct tracepoint_entry *entry;
> > +	int ret = 0;
> > +	void *old;
> > +
> > +	mutex_lock(&tracepoints_mutex);
> > +	entry = get_tracepoint(name);
> > +	if (!entry) {
> > +		entry = add_tracepoint(name);
> > +		if (IS_ERR(entry)) {
> > +			ret = PTR_ERR(entry);
> > +			goto end;
> > +		}
> > +	}
> > +	/*
> > +	 * If we detect that a call_rcu is pending for this tracepoint,
> > +	 * make sure it's executed now.
> > +	 */
> > +	if (entry->rcu_pending)
> > +		rcu_barrier();
> > +	old = tracepoint_entry_add_probe(entry, probe);
> > +	if (IS_ERR(old)) {
> > +		ret = PTR_ERR(old);
> > +		goto end;
> > +	}
> > +	mutex_unlock(&tracepoints_mutex);
> > +	tracepoint_update_probes();		/* may update entry */
> > +	mutex_lock(&tracepoints_mutex);
> > +	entry = get_tracepoint(name);
> > +	WARN_ON(!entry);
> > +	tracepoint_entry_free_old(entry, old);
> > +end:
> > +	mutex_unlock(&tracepoints_mutex);
> > +	return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(tracepoint_probe_register);
> > +
> > +/**
> > + * tracepoint_probe_unregister -  Disconnect a probe from a tracepoint
> > + * @name: tracepoint name
> > + * @probe: probe function pointer
> > + *
> > + * We do not need to call a synchronize_sched to make sure the probes have
> > + * finished running before doing a module unload, because the module unload
> > + * itself uses stop_machine(), which ensures that every preempt-disabled
> > + * section has finished.
> > + */
> > +int tracepoint_probe_unregister(const char *name, void *probe)
> > +{
> > +	struct tracepoint_entry *entry;
> > +	void *old;
> > +	int ret = -ENOENT;
> > +
> > +	mutex_lock(&tracepoints_mutex);
> > +	entry = get_tracepoint(name);
> > +	if (!entry)
> > +		goto end;
> > +	if (entry->rcu_pending)
> > +		rcu_barrier();
> > +	old = tracepoint_entry_remove_probe(entry, probe);
> > +	mutex_unlock(&tracepoints_mutex);
> > +	tracepoint_update_probes();		/* may update entry */
> > +	mutex_lock(&tracepoints_mutex);
> > +	entry = get_tracepoint(name);
> > +	if (!entry)
> > +		goto end;
> > +	tracepoint_entry_free_old(entry, old);
> > +	remove_tracepoint(name);	/* Ignore busy error message */
> > +	ret = 0;
> > +end:
> > +	mutex_unlock(&tracepoints_mutex);
> > +	return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(tracepoint_probe_unregister);
> > +
> > +/**
> > + * tracepoint_get_iter_range - Get a next tracepoint iterator given a range.
> > + * @tracepoint: current tracepoints (in), next tracepoint (out)
> > + * @begin: beginning of the range
> > + * @end: end of the range
> > + *
> > + * Returns whether a next tracepoint has been found (1) or not (0).
> > + * Will return the first tracepoint in the range if the input tracepoint is
> > + * NULL.
> > + */
> > +int tracepoint_get_iter_range(struct tracepoint **tracepoint,
> > +	struct tracepoint *begin, struct tracepoint *end)
> > +{
> > +	if (!*tracepoint && begin != end) {
> > +		*tracepoint = begin;
> > +		return 1;
> > +	}
> > +	if (*tracepoint >= begin && *tracepoint < end)
> > +		return 1;
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(tracepoint_get_iter_range);
> > +
> > +static void tracepoint_get_iter(struct tracepoint_iter *iter)
> > +{
> > +	int found = 0;
> > +
> > +	/* Core kernel tracepoints */
> > +	if (!iter->module) {
> > +		found = tracepoint_get_iter_range(&iter->tracepoint,
> > +				__start___tracepoints, __stop___tracepoints);
> > +		if (found)
> > +			goto end;
> > +	}
> > +	/* tracepoints in modules. */
> > +	found = module_get_iter_tracepoints(iter);
> > +end:
> > +	if (!found)
> > +		tracepoint_iter_reset(iter);
> > +}
> > +
> > +void tracepoint_iter_start(struct tracepoint_iter *iter)
> > +{
> > +	tracepoint_get_iter(iter);
> > +}
> > +EXPORT_SYMBOL_GPL(tracepoint_iter_start);
> > +
> > +void tracepoint_iter_next(struct tracepoint_iter *iter)
> > +{
> > +	iter->tracepoint++;
> > +	/*
> > +	 * iter->tracepoint may be invalid because we blindly incremented it.
> > +	 * Make sure it is valid by checking it against the tracepoint
> > +	 * ranges, moving on to following modules if necessary.
> > +	 */
> > +	tracepoint_get_iter(iter);
> > +}
> > +EXPORT_SYMBOL_GPL(tracepoint_iter_next);
> > +
> > +void tracepoint_iter_stop(struct tracepoint_iter *iter)
> > +{
> > +}
> > +EXPORT_SYMBOL_GPL(tracepoint_iter_stop);
> > +
> > +void tracepoint_iter_reset(struct tracepoint_iter *iter)
> > +{
> > +	iter->module = NULL;
> > +	iter->tracepoint = NULL;
> > +}
> > +EXPORT_SYMBOL_GPL(tracepoint_iter_reset);
> > Index: linux-2.6-lttng/kernel/module.c
> > ===================================================================
> > --- linux-2.6-lttng.orig/kernel/module.c	2008-07-09 10:55:46.000000000 -0400
> > +++ linux-2.6-lttng/kernel/module.c	2008-07-09 10:55:58.000000000 -0400
> > @@ -47,6 +47,7 @@
> >  #include <asm/sections.h>
> >  #include <linux/license.h>
> >  #include <asm/sections.h>
> > +#include <linux/tracepoint.h>
> >  
> >  #if 0
> >  #define DEBUGP printk
> > @@ -1824,6 +1825,8 @@ static struct module *load_module(void _
> >  #endif
> >  	unsigned int markersindex;
> >  	unsigned int markersstringsindex;
> > +	unsigned int tracepointsindex;
> > +	unsigned int tracepointsstringsindex;
> >  	struct module *mod;
> >  	long err = 0;
> >  	void *percpu = NULL, *ptr = NULL; /* Stops spurious gcc warning */
> > @@ -2110,6 +2113,9 @@ static struct module *load_module(void _
> >  	markersindex = find_sec(hdr, sechdrs, secstrings, "__markers");
> >   	markersstringsindex = find_sec(hdr, sechdrs, secstrings,
> >  					"__markers_strings");
> > +	tracepointsindex = find_sec(hdr, sechdrs, secstrings, "__tracepoints");
> > +	tracepointsstringsindex = find_sec(hdr, sechdrs, secstrings,
> > +					"__tracepoints_strings");
> >  
> >  	/* Now do relocations. */
> >  	for (i = 1; i < hdr->e_shnum; i++) {
> > @@ -2137,6 +2143,12 @@ static struct module *load_module(void _
> >  	mod->num_markers =
> >  		sechdrs[markersindex].sh_size / sizeof(*mod->markers);
> >  #endif
> > +#ifdef CONFIG_TRACEPOINTS
> > +	mod->tracepoints = (void *)sechdrs[tracepointsindex].sh_addr;
> > +	mod->num_tracepoints =
> > +		sechdrs[tracepointsindex].sh_size / sizeof(*mod->tracepoints);
> > +#endif
> > +
> >  
> >          /* Find duplicate symbols */
> >  	err = verify_export_symbols(mod);
> > @@ -2155,11 +2167,16 @@ static struct module *load_module(void _
> >  
> >  	add_kallsyms(mod, sechdrs, symindex, strindex, secstrings);
> >  
> > +	if (!mod->taints) {
> >  #ifdef CONFIG_MARKERS
> > -	if (!mod->taints)
> >  		marker_update_probe_range(mod->markers,
> >  			mod->markers + mod->num_markers);
> >  #endif
> > +#ifdef CONFIG_TRACEPOINTS
> > +		tracepoint_update_probe_range(mod->tracepoints,
> > +			mod->tracepoints + mod->num_tracepoints);
> > +#endif
> > +	}
> >  	err = module_finalize(hdr, sechdrs, mod);
> >  	if (err < 0)
> >  		goto cleanup;
> > @@ -2710,3 +2727,50 @@ void module_update_markers(void)
> >  	mutex_unlock(&module_mutex);
> >  }
> >  #endif
> > +
> > +#ifdef CONFIG_TRACEPOINTS
> > +void module_update_tracepoints(void)
> > +{
> > +	struct module *mod;
> > +
> > +	mutex_lock(&module_mutex);
> > +	list_for_each_entry(mod, &modules, list)
> > +		if (!mod->taints)
> > +			tracepoint_update_probe_range(mod->tracepoints,
> > +				mod->tracepoints + mod->num_tracepoints);
> > +	mutex_unlock(&module_mutex);
> > +}
> > +
> > +/*
> > + * Returns 0 if current not found.
> > + * Returns 1 if current found.
> > + */
> > +int module_get_iter_tracepoints(struct tracepoint_iter *iter)
> > +{
> > +	struct module *iter_mod;
> > +	int found = 0;
> > +
> > +	mutex_lock(&module_mutex);
> > +	list_for_each_entry(iter_mod, &modules, list) {
> > +		if (!iter_mod->taints) {
> > +			/*
> > +			 * Sorted module list
> > +			 */
> > +			if (iter_mod < iter->module)
> > +				continue;
> > +			else if (iter_mod > iter->module)
> > +				iter->tracepoint = NULL;
> > +			found = tracepoint_get_iter_range(&iter->tracepoint,
> > +				iter_mod->tracepoints,
> > +				iter_mod->tracepoints
> > +					+ iter_mod->num_tracepoints);
> > +			if (found) {
> > +				iter->module = iter_mod;
> > +				break;
> > +			}
> > +		}
> > +	}
> > +	mutex_unlock(&module_mutex);
> > +	return found;
> > +}
> > +#endif
> > Index: linux-2.6-lttng/include/linux/module.h
> > ===================================================================
> > --- linux-2.6-lttng.orig/include/linux/module.h	2008-07-09 10:55:46.000000000 -0400
> > +++ linux-2.6-lttng/include/linux/module.h	2008-07-09 10:57:22.000000000 -0400
> > @@ -16,6 +16,7 @@
> >  #include <linux/kobject.h>
> >  #include <linux/moduleparam.h>
> >  #include <linux/marker.h>
> > +#include <linux/tracepoint.h>
> >  #include <asm/local.h>
> >  
> >  #include <asm/module.h>
> > @@ -331,6 +332,10 @@ struct module
> >  	struct marker *markers;
> >  	unsigned int num_markers;
> >  #endif
> > +#ifdef CONFIG_TRACEPOINTS
> > +	struct tracepoint *tracepoints;
> > +	unsigned int num_tracepoints;
> > +#endif
> >  
> >  #ifdef CONFIG_MODULE_UNLOAD
> >  	/* What modules depend on me? */
> > @@ -454,6 +459,9 @@ extern void print_modules(void);
> >  
> >  extern void module_update_markers(void);
> >  
> > +extern void module_update_tracepoints(void);
> > +extern int module_get_iter_tracepoints(struct tracepoint_iter *iter);
> > +
> >  #else /* !CONFIG_MODULES... */
> >  #define EXPORT_SYMBOL(sym)
> >  #define EXPORT_SYMBOL_GPL(sym)
> > @@ -558,6 +566,15 @@ static inline void module_update_markers
> >  {
> >  }
> >  
> > +static inline void module_update_tracepoints(void)
> > +{
> > +}
> > +
> > +static inline int module_get_iter_tracepoints(struct tracepoint_iter *iter)
> > +{
> > +	return 0;
> > +}
> > +
> >  #endif /* CONFIG_MODULES */
> >  
> >  struct device_driver;
> > 
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: [patch 01/15] Kernel Tracepoints
  2008-07-15 13:25     ` Mathieu Desnoyers
@ 2008-07-15 13:59       ` Peter Zijlstra
  2008-07-15 14:27         ` Mathieu Desnoyers
  2008-07-15 14:03       ` Peter Zijlstra
  1 sibling, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2008-07-15 13:59 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: akpm, Ingo Molnar, linux-kernel, Masami Hiramatsu,
	Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie, Steven Rostedt,
	Alexander Viro, Eduard - Gabriel Munteanu

On Tue, 2008-07-15 at 09:25 -0400, Mathieu Desnoyers wrote:
> * Peter Zijlstra (peterz@infradead.org) wrote:
> > On Wed, 2008-07-09 at 10:59 -0400, Mathieu Desnoyers wrote:

> > > +#define __DO_TRACE(tp, proto, args)					\
> > > +	do {								\
> > > +		int i;							\
> > > +		void **funcs;						\
> > > +		preempt_disable();					\
> > > +		funcs = (tp)->funcs;					\
> > > +		smp_read_barrier_depends();				\
> > > +		if (funcs) {						\
> > > +			for (i = 0; funcs[i]; i++) {			\
> > 
> > can't you get rid of 'i' and write:
> > 
> >   void **func;
> > 
> >   preempt_disable();
> >   func = (tp)->funcs;
> >   smp_read_barrier_depends();
> >   for (; func; func++)
> >     ((void (*)(proto))func)(args);
> >   preempt_enable();
> > 
> 
> Yes, I thought there would be an optimization to do here, I'll use your
> proposal. This code snippet is especially important since it will
> generate instructions near every tracepoint site. Saving a few bytes
> becomes important.
> 
> Given that (tp)->funcs references an array of function pointers and that
> it can be NULL, the if (funcs) test must still be there and we must use
> 
> #define __DO_TRACE(tp, proto, args)					\
> 	do {								\
> 		void *func;						\
> 									\
> 		preempt_disable();					\
> 		if ((tp)->funcs) {					\
> 			func = rcu_dereference((tp)->funcs);		\
> 			for (; func; func++) {				\
> 				((void(*)(proto))(func))(args);		\
> 			}						\
> 		}							\
> 		preempt_enable();					\
> 	} while (0)
> 
> 
> The resulting assembly is a bit more dense than my previous
> > implementation, which is good:

> My version also has that if ((tp)->funcs), but it's hidden in the
for (; func; func++) loop. The only thing your version does is an extra
test of tp->funcs but without read depends barrier - not sure if that is
ok.





* Re: [patch 01/15] Kernel Tracepoints
  2008-07-15 13:25     ` Mathieu Desnoyers
  2008-07-15 13:59       ` Peter Zijlstra
@ 2008-07-15 14:03       ` Peter Zijlstra
  2008-07-15 14:46         ` Mathieu Desnoyers
  2008-07-15 19:02         ` Mathieu Desnoyers
  1 sibling, 2 replies; 58+ messages in thread
From: Peter Zijlstra @ 2008-07-15 14:03 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: akpm, Ingo Molnar, linux-kernel, Masami Hiramatsu,
	Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie, Steven Rostedt,
	Alexander Viro, Eduard - Gabriel Munteanu, Paul E McKenney

On Tue, 2008-07-15 at 09:25 -0400, Mathieu Desnoyers wrote:
> * Peter Zijlstra (peterz@infradead.org) wrote:
> > On Wed, 2008-07-09 at 10:59 -0400, Mathieu Desnoyers wrote:

> > > +#define __DO_TRACE(tp, proto, args)					\
> > > +	do {								\
> > > +		int i;							\
> > > +		void **funcs;						\
> > > +		preempt_disable();					\
> > > +		funcs = (tp)->funcs;					\
> > > +		smp_read_barrier_depends();				\
> > > +		if (funcs) {						\
> > > +			for (i = 0; funcs[i]; i++) {			\
> > 
> > Also, why is the preempt_disable needed?
> > 
> 
> Addition and removal of tracepoints is synchronized by RCU using the
> scheduler (and preempt_disable) as guarantees to find a quiescent state
> (this is really RCU "classic"). The update side uses rcu_barrier_sched()
> with call_rcu_sched() and the read/execute side uses
> "preempt_disable()/preempt_enable()".

> > > +static void tracepoint_entry_free_old(struct tracepoint_entry *entry, void *old)
> > > +{
> > > +	if (!old)
> > > +		return;
> > > +	entry->oldptr = old;
> > > +	entry->rcu_pending = 1;
> > > +	/* write rcu_pending before calling the RCU callback */
> > > +	smp_wmb();
> > > +#ifdef CONFIG_PREEMPT_RCU
> > > +	synchronize_sched();	/* Until we have the call_rcu_sched() */
> > > +#endif
> > 
> > Does this have something to do with the preempt_disable above?
> > 
> 
> Yes, it does. We make sure the previous array containing probes, which
> has been scheduled for deletion by the rcu callback, is indeed freed
> before we proceed to the next update. It therefore limits the rate of
> modification of a single tracepoint to one update per RCU period. The
> objective here is to permit fast batch add/removal of probes on
> _different_ tracepoints.
> 
> This use of "synchronize_sched()" can be changed for call_rcu_sched() in
> linux-next, I'll fix this.

Right, I thought as much, it's just that the raw preempt_disable()
without comments leaves one wondering if there is anything else going
on.

Would it make sense to add:

rcu_read_sched_lock()
rcu_read_sched_unlock()

to match:

call_rcu_sched()
rcu_barrier_sched()
synchronize_sched()

?



* Re: [patch 01/15] Kernel Tracepoints
  2008-07-15 13:59       ` Peter Zijlstra
@ 2008-07-15 14:27         ` Mathieu Desnoyers
  2008-07-15 14:42           ` Peter Zijlstra
  0 siblings, 1 reply; 58+ messages in thread
From: Mathieu Desnoyers @ 2008-07-15 14:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: akpm, Ingo Molnar, linux-kernel, Masami Hiramatsu,
	Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie, Steven Rostedt,
	Alexander Viro, Eduard - Gabriel Munteanu

* Peter Zijlstra (peterz@infradead.org) wrote:
> On Tue, 2008-07-15 at 09:25 -0400, Mathieu Desnoyers wrote:
> > * Peter Zijlstra (peterz@infradead.org) wrote:
> > > On Wed, 2008-07-09 at 10:59 -0400, Mathieu Desnoyers wrote:
> 
> > > > +#define __DO_TRACE(tp, proto, args)					\
> > > > +	do {								\
> > > > +		int i;							\
> > > > +		void **funcs;						\
> > > > +		preempt_disable();					\
> > > > +		funcs = (tp)->funcs;					\
> > > > +		smp_read_barrier_depends();				\
> > > > +		if (funcs) {						\
> > > > +			for (i = 0; funcs[i]; i++) {			\
> > > 
> > > can't you get rid of 'i' and write:
> > > 
> > >   void **func;
> > > 
> > >   preempt_disable();
> > >   func = (tp)->funcs;
> > >   smp_read_barrier_depends();
> > >   for (; func; func++)
> > >     ((void (*)(proto))func)(args);
> > >   preempt_enable();
> > > 
> > 
> > Yes, I thought there would be an optimization to do here, I'll use your
> > proposal. This code snippet is especially important since it will
> > generate instructions near every tracepoint site. Saving a few bytes
> > becomes important.
> > 
> > Given that (tp)->funcs references an array of function pointers and that
> > it can be NULL, the if (funcs) test must still be there and we must use
> > 
> > #define __DO_TRACE(tp, proto, args)					\
> > 	do {								\
> > 		void *func;						\
> > 									\
> > 		preempt_disable();					\
> > 		if ((tp)->funcs) {					\
> > 			func = rcu_dereference((tp)->funcs);		\
> > 			for (; func; func++) {				\
> > 				((void(*)(proto))(func))(args);		\
> > 			}						\
> > 		}							\
> > 		preempt_enable();					\
> > 	} while (0)
> > 
> > 
> > The resulting assembly is a bit more dense than my previous
> > implementation, which is good:
> 
> > My version also has that if ((tp)->funcs), but it's hidden in the
> for (; func; func++) loop. The only thing your version does is an extra
> test of tp->funcs but without read depends barrier - not sure if that is
> ok.
> 

Hrm, you are right, the implementation I just proposed is bogus. (but so
was yours) ;)

func is an iterator on the funcs array. My typing of func is thus wrong,
it should be void **. Otherwise I'm just incrementing the function
address which is plain wrong.

The read barrier is included in rcu_dereference() now. But given that we
have to take a pointer to the array as an iterator, we would have to
rcu_dereference() our iterator multiple times and then have many read
barrier depends, which we don't need. This is why I would go back to a
smp_read_barrier_depends().

Also, I use a NULL entry at the end of the funcs array as an end of
array identifier. However, I cannot use this in the for loop both as a
check for a NULL array and a check for a NULL array element. This is why an
if () test is needed in addition to the for loop test. (This is actually
what is wrong in the implementation you proposed: you treat func both
as a pointer to the function pointer array and as a function pointer)

Something like this seems better :

#define __DO_TRACE(tp, proto, args)                                     \
        do {                                                            \
                void **it_func;                                         \
                                                                        \
                preempt_disable();                                      \
                it_func = (tp)->funcs;                                  \
                if (it_func) {                                          \
                        smp_read_barrier_depends();                     \
                        for (; *it_func; it_func++)                     \
                                ((void(*)(proto))(*it_func))(args);     \
                }                                                       \
                preempt_enable();                                       \
        } while (0)

What do you think ?

Mathieu

> 
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch 01/15] Kernel Tracepoints
  2008-07-15 14:27         ` Mathieu Desnoyers
@ 2008-07-15 14:42           ` Peter Zijlstra
  2008-07-15 15:22             ` Mathieu Desnoyers
  0 siblings, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2008-07-15 14:42 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: akpm, Ingo Molnar, linux-kernel, Masami Hiramatsu,
	Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie, Steven Rostedt,
	Alexander Viro, Eduard - Gabriel Munteanu, Paul E McKenney

On Tue, 2008-07-15 at 10:27 -0400, Mathieu Desnoyers wrote:
> * Peter Zijlstra (peterz@infradead.org) wrote:
> > On Tue, 2008-07-15 at 09:25 -0400, Mathieu Desnoyers wrote:
> > > * Peter Zijlstra (peterz@infradead.org) wrote:
> > > > On Wed, 2008-07-09 at 10:59 -0400, Mathieu Desnoyers wrote:
> > 
> > > > > +#define __DO_TRACE(tp, proto, args)					\
> > > > > +	do {								\
> > > > > +		int i;							\
> > > > > +		void **funcs;						\
> > > > > +		preempt_disable();					\
> > > > > +		funcs = (tp)->funcs;					\
> > > > > +		smp_read_barrier_depends();				\
> > > > > +		if (funcs) {						\
> > > > > +			for (i = 0; funcs[i]; i++) {			\
> > > > 
> > > > can't you get rid of 'i' and write:
> > > > 
> > > >   void **func;
> > > > 
> > > >   preempt_disable();
> > > >   func = (tp)->funcs;
> > > >   smp_read_barrier_depends();
> > > >   for (; func; func++)
> > > >     ((void (*)(proto))func)(args);
> > > >   preempt_enable();
> > > > 
> > > 
> > > Yes, I thought there would be an optimization to do here, I'll use your
> > > proposal. This code snippet is especially important since it will
> > > generate instructions near every tracepoint site. Saving a few bytes
> > > becomes important.
> > > 
> > > Given that (tp)->funcs references an array of function pointers and that
> > > it can be NULL, the if (funcs) test must still be there and we must use
> > > 
> > > #define __DO_TRACE(tp, proto, args)					\
> > > 	do {								\
> > > 		void *func;						\
> > > 									\
> > > 		preempt_disable();					\
> > > 		if ((tp)->funcs) {					\
> > > 			func = rcu_dereference((tp)->funcs);		\
> > > 			for (; func; func++) {				\
> > > 				((void(*)(proto))(func))(args);		\
> > > 			}						\
> > > 		}							\
> > > 		preempt_enable();					\
> > > 	} while (0)
> > > 
> > > 
> > > The resulting assembly is a bit more dense than my previous
> > > implementation, which is good :
> > 
> > My version also has that if ((tp)->funcs), but it's hidden in the 
> > for (; func; func++) loop. The only thing your version does is an extra
> > test of tp->funcs but without read depends barrier - not sure if that is
> > ok.
> > 
> 
> Hrm, you are right, the implementation I just proposed is bogus. (but so
> was yours) ;)
> 
> func is an iterator on the funcs array. My typing of func is thus wrong,
> it should be void **. Otherwise I'm just incrementing the function
> address which is plain wrong.
> 
> The read barrier is included in rcu_dereference() now. But given that we
> have to take a pointer to the array as an iterator, we would have to
> rcu_dereference() our iterator multiple times and then have many read
> barrier depends, which we don't need. This is why I would go back to a
> smp_read_barrier_depends().
> 
> Also, I use a NULL entry at the end of the funcs array as an end of
> array identifier. However, I cannot use this in the for loop both as a
> check for NULL array and check for NULL array element. This is why an if
> () test is needed in addition to the for loop test. (this is actually
> what is wrong in the implementation you proposed : you treat func both
> as a pointer to the function pointer array and as a function pointer)

Ah, D'0h! Indeed.

> Something like this seems better :
> 
> #define __DO_TRACE(tp, proto, args)                                     \
>         do {                                                            \
>                 void **it_func;                                         \
>                                                                         \
>                 preempt_disable();                                      \
>                 it_func = (tp)->funcs;                                  \
>                 if (it_func) {                                          \
>                         smp_read_barrier_depends();                     \
>                         for (; *it_func; it_func++)                     \
>                                 ((void(*)(proto))(*it_func))(args);     \
>                 }                                                       \
>                 preempt_enable();                                       \
>         } while (0)
> 
> What do you think ?

I'm confused by the barrier games here.

Why not:

  void **it_func;

  preempt_disable();
  it_func = rcu_dereference((tp)->funcs);
  if (it_func) {
    for (; *it_func; it_func++)
      ((void(*)(proto))(*it_func))(args);
  }
  preempt_enable();

That is, why can we skip the barrier when !it_func? is that because at
that time we don't actually dereference it_func and therefore cannot
observe stale data?

If so, does this really matter since we're already in an unlikely
section? Again, if so, this deserves a comment ;-)

[ still think those preempt_* calls should be called
  rcu_read_sched_lock() or such. ]

Anyway, does this still generate better code?



* Re: [patch 01/15] Kernel Tracepoints
  2008-07-15 14:03       ` Peter Zijlstra
@ 2008-07-15 14:46         ` Mathieu Desnoyers
  2008-07-15 15:13           ` Peter Zijlstra
  2008-07-15 19:02         ` Mathieu Desnoyers
  1 sibling, 1 reply; 58+ messages in thread
From: Mathieu Desnoyers @ 2008-07-15 14:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: akpm, Ingo Molnar, linux-kernel, Masami Hiramatsu,
	Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie, Steven Rostedt,
	Alexander Viro, Eduard - Gabriel Munteanu, Paul E McKenney

* Peter Zijlstra (peterz@infradead.org) wrote:
> On Tue, 2008-07-15 at 09:25 -0400, Mathieu Desnoyers wrote:
> > * Peter Zijlstra (peterz@infradead.org) wrote:
> > > On Wed, 2008-07-09 at 10:59 -0400, Mathieu Desnoyers wrote:
> 
> > > > +#define __DO_TRACE(tp, proto, args)					\
> > > > +	do {								\
> > > > +		int i;							\
> > > > +		void **funcs;						\
> > > > +		preempt_disable();					\
> > > > +		funcs = (tp)->funcs;					\
> > > > +		smp_read_barrier_depends();				\
> > > > +		if (funcs) {						\
> > > > +			for (i = 0; funcs[i]; i++) {			\
> > > 
> > > Also, why is the preempt_disable needed?
> > > 
> > 
> > Addition and removal of tracepoints is synchronized by RCU using the
> > scheduler (and preempt_disable) as guarantees to find a quiescent state
> > (this is really RCU "classic"). The update side uses rcu_barrier_sched()
> > with call_rcu_sched() and the read/execute side uses
> > "preempt_disable()/preempt_enable()".
> 
> > > > +static void tracepoint_entry_free_old(struct tracepoint_entry *entry, void *old)
> > > > +{
> > > > +	if (!old)
> > > > +		return;
> > > > +	entry->oldptr = old;
> > > > +	entry->rcu_pending = 1;
> > > > +	/* write rcu_pending before calling the RCU callback */
> > > > +	smp_wmb();
> > > > +#ifdef CONFIG_PREEMPT_RCU
> > > > +	synchronize_sched();	/* Until we have the call_rcu_sched() */
> > > > +#endif
> > > 
> > > Does this have something to do with the preempt_disable above?
> > > 
> > 
> > Yes, it does. We make sure the previous array containing probes, which
> > has been scheduled for deletion by the rcu callback, is indeed freed
> > before we proceed to the next update. It therefore limits the rate of
> > modification of a single tracepoint to one update per RCU period. The
> > objective here is to permit fast batch add/removal of probes on
> > _different_ tracepoints.
> > 
> > This use of "synchronize_sched()" can be changed for call_rcu_sched() in
> > linux-next, I'll fix this.
> 
> Right, I thought as much, it's just that the raw preempt_disable()
> without comments leaves one wondering if there is anything else going
> on.
> 
> Would it make sense to add:
> 
> rcu_read_sched_lock()
> rcu_read_sched_unlock()
> 
> to match:
> 
> call_rcu_sched()
> rcu_barrier_sched()
> synchronize_sched()
> 
> ?
> 

Yes, I would add them to include/linux/rcupdate.h. I'll include it with
my next release.

Talking about headers, I have noticed that placing headers with the code
may not be as clean as I would hope. For instance, the kernel/irq-trace.h
header, when included from kernel/irq/handle.c, has to be included with:

#include "../irq-trace.h"

Which is not _that_ bad, but when we want to instrument the irq handler
found in arch/x86/kernel/cpu/mcheck/mce_intel_64.c, including
#include "../../../../../kernel/irq-trace.h" makes me go "yeeeek!"

How about creating include/trace/irq.h and friends ?

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: [patch 01/15] Kernel Tracepoints
  2008-07-15 14:46         ` Mathieu Desnoyers
@ 2008-07-15 15:13           ` Peter Zijlstra
  2008-07-15 18:22             ` Mathieu Desnoyers
  2008-07-15 18:52             ` Masami Hiramatsu
  0 siblings, 2 replies; 58+ messages in thread
From: Peter Zijlstra @ 2008-07-15 15:13 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: akpm, Ingo Molnar, linux-kernel, Masami Hiramatsu,
	Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie, Steven Rostedt,
	Alexander Viro, Eduard - Gabriel Munteanu, Paul E McKenney

On Tue, 2008-07-15 at 10:46 -0400, Mathieu Desnoyers wrote:

> Talking about headers, I have noticed that placing headers with the code
> may not be as clean as I would hope. For instance, the kernel/irq-trace.h
> header, when included from kernel/irq/handle.c, has to be included with:
> 
> #include "../irq-trace.h"
> 
> Which is not _that_ bad, but when we want to instrument the irq handler
> found in arch/x86/kernel/cpu/mcheck/mce_intel_64.c, including
> #include "../../../../../kernel/irq-trace.h" makes me go "yeeeek!"
> 
> How about creating include/trace/irq.h and friends ?

Might as well.. anybody else got opinions?



* Re: [patch 01/15] Kernel Tracepoints
  2008-07-15 14:42           ` Peter Zijlstra
@ 2008-07-15 15:22             ` Mathieu Desnoyers
  2008-07-15 15:31               ` Peter Zijlstra
  0 siblings, 1 reply; 58+ messages in thread
From: Mathieu Desnoyers @ 2008-07-15 15:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: akpm, Ingo Molnar, linux-kernel, Masami Hiramatsu,
	Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie, Steven Rostedt,
	Alexander Viro, Eduard - Gabriel Munteanu, Paul E McKenney

* Peter Zijlstra (peterz@infradead.org) wrote:
> On Tue, 2008-07-15 at 10:27 -0400, Mathieu Desnoyers wrote:
> > * Peter Zijlstra (peterz@infradead.org) wrote:
> > > On Tue, 2008-07-15 at 09:25 -0400, Mathieu Desnoyers wrote:
> > > > * Peter Zijlstra (peterz@infradead.org) wrote:
> > > > > On Wed, 2008-07-09 at 10:59 -0400, Mathieu Desnoyers wrote:
> > > 
> > > > > > +#define __DO_TRACE(tp, proto, args)					\
> > > > > > +	do {								\
> > > > > > +		int i;							\
> > > > > > +		void **funcs;						\
> > > > > > +		preempt_disable();					\
> > > > > > +		funcs = (tp)->funcs;					\
> > > > > > +		smp_read_barrier_depends();				\
> > > > > > +		if (funcs) {						\
> > > > > > +			for (i = 0; funcs[i]; i++) {			\
> > > > > 
> > > > > can't you get rid of 'i' and write:
> > > > > 
> > > > >   void **func;
> > > > > 
> > > > >   preempt_disable();
> > > > >   func = (tp)->funcs;
> > > > >   smp_read_barrier_depends();
> > > > >   for (; func; func++)
> > > > >     ((void (*)(proto))func)(args);
> > > > >   preempt_enable();
> > > > > 
> > > > 
> > > > Yes, I thought there would be an optimization to do here, I'll use your
> > > > proposal. This code snippet is especially important since it will
> > > > generate instructions near every tracepoint site. Saving a few bytes
> > > > becomes important.
> > > > 
> > > > Given that (tp)->funcs references an array of function pointers and that
> > > > it can be NULL, the if (funcs) test must still be there and we must use
> > > > 
> > > > #define __DO_TRACE(tp, proto, args)					\
> > > > 	do {								\
> > > > 		void *func;						\
> > > > 									\
> > > > 		preempt_disable();					\
> > > > 		if ((tp)->funcs) {					\
> > > > 			func = rcu_dereference((tp)->funcs);		\
> > > > 			for (; func; func++) {				\
> > > > 				((void(*)(proto))(func))(args);		\
> > > > 			}						\
> > > > 		}							\
> > > > 		preempt_enable();					\
> > > > 	} while (0)
> > > > 
> > > > 
> > > > The resulting assembly is a bit more dense than my previous
> > > > implementation, which is good :
> > > 
> > > My version also has that if ((tp)->funcs), but it's hidden in the 
> > > for (; func; func++) loop. The only thing your version does is an extra
> > > test of tp->funcs but without read depends barrier - not sure if that is
> > > ok.
> > > 
> > 
> > Hrm, you are right, the implementation I just proposed is bogus. (but so
> > was yours) ;)
> > 
> > func is an iterator on the funcs array. My typing of func is thus wrong,
> > it should be void **. Otherwise I'm just incrementing the function
> > address which is plain wrong.
> > 
> > The read barrier is included in rcu_dereference() now. But given that we
> > have to take a pointer to the array as an iterator, we would have to
> > rcu_dereference() our iterator multiple times and then have many read
> > barrier depends, which we don't need. This is why I would go back to a
> > smp_read_barrier_depends().
> > 
> > Also, I use a NULL entry at the end of the funcs array as an end of
> > array identifier. However, I cannot use this in the for loop both as a
> > check for NULL array and check for NULL array element. This is why an if
> > () test is needed in addition to the for loop test. (this is actually
> > what is wrong in the implementation you proposed : you treat func both
> > as a pointer to the function pointer array and as a function pointer)
> 
> Ah, D'0h! Indeed.
> 
> > Something like this seems better :
> > 
> > #define __DO_TRACE(tp, proto, args)                                     \
> >         do {                                                            \
> >                 void **it_func;                                         \
> >                                                                         \
> >                 preempt_disable();                                      \
> >                 it_func = (tp)->funcs;                                  \
> >                 if (it_func) {                                          \
> >                         smp_read_barrier_depends();                     \
> >                         for (; *it_func; it_func++)                     \
> >                                 ((void(*)(proto))(*it_func))(args);     \
> >                 }                                                       \
> >                 preempt_enable();                                       \
> >         } while (0)
> > 
> > What do you think ?
> 
> I'm confused by the barrier games here.
> 
> Why not:
> 
>   void **it_func;
> 
>   preempt_disable();
>   it_func = rcu_dereference((tp)->funcs);
>   if (it_func) {
>     for (; *it_func; it_func++)
>       ((void(*)(proto))(*it_func))(args);
>   }
>   preempt_enable();
> 
> That is, why can we skip the barrier when !it_func? is that because at
> that time we don't actually dereference it_func and therefore cannot
> observe stale data?
> 

Exactly. I used the implementation of rcu_assign_pointer as a hint that
we did not need barriers when setting the pointer to NULL, and thus we
should not need the read barrier when reading the NULL pointer, because
it references no data.

#define rcu_assign_pointer(p, v) \
        ({ \
                if (!__builtin_constant_p(v) || \
                    ((v) != NULL)) \
                        smp_wmb(); \
                (p) = (v); \
        })

#define rcu_dereference(p)     ({ \
                                typeof(p) _________p1 = ACCESS_ONCE(p); \
                                smp_read_barrier_depends(); \
                                (_________p1); \
                                })

But I think you are right, since we are already in unlikely code, using
rcu_dereference as you do is better than my use of read barrier depends.
It should not change anything in the assembly result except on alpha,
where the read_barrier_depends() is not a nop.

I wonder if there would be a way to add this kind of NULL pointer case
check without overhead in rcu_dereference() on alpha. I guess not, since
the pointer is almost never known at compile-time. And I guess Paul must
already have thought about it. The only case where we could add this
test is when we know that we have a if (ptr != NULL) test following the
rcu_dereference(); we could then assume the compiler will merge the two
branches since they depend on the same condition.

> If so, does this really matter since we're already in an unlikely
> section? Again, if so, this deserves a comment ;-)
> 
> [ still think those preempt_* calls should be called
>   rcu_read_sched_lock() or such. ]
> 
> Anyway, does this still generate better code?
> 

On x86_64 :

 820:   bf 01 00 00 00          mov    $0x1,%edi
 825:   e8 00 00 00 00          callq  82a <thread_return+0x136>
 82a:   48 8b 1d 00 00 00 00    mov    0x0(%rip),%rbx        # 831 <thread_return+0x13d>
 831:   48 85 db                test   %rbx,%rbx
 834:   75 21                   jne    857 <thread_return+0x163>
 836:   eb 27                   jmp    85f <thread_return+0x16b>
 838:   0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
 83f:   00 
 840:   48 8b 95 68 ff ff ff    mov    -0x98(%rbp),%rdx
 847:   48 8b b5 60 ff ff ff    mov    -0xa0(%rbp),%rsi
 84e:   4c 89 e7                mov    %r12,%rdi
 851:   48 83 c3 08             add    $0x8,%rbx
 855:   ff d0                   callq  *%rax
 857:   48 8b 03                mov    (%rbx),%rax
 85a:   48 85 c0                test   %rax,%rax
 85d:   75 e1                   jne    840 <thread_return+0x14c>
 85f:   bf 01 00 00 00          mov    $0x1,%edi
 864:

for 68 bytes.

My original implementation was 77 bytes, so yes, we have a win.

Mathieu


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: [patch 01/15] Kernel Tracepoints
  2008-07-15 15:22             ` Mathieu Desnoyers
@ 2008-07-15 15:31               ` Peter Zijlstra
  2008-07-15 15:50                 ` Mathieu Desnoyers
                                   ` (2 more replies)
  0 siblings, 3 replies; 58+ messages in thread
From: Peter Zijlstra @ 2008-07-15 15:31 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: akpm, Ingo Molnar, linux-kernel, Masami Hiramatsu,
	Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie, Steven Rostedt,
	Alexander Viro, Eduard - Gabriel Munteanu, Paul E McKenney

On Tue, 2008-07-15 at 11:22 -0400, Mathieu Desnoyers wrote:
> * Peter Zijlstra (peterz@infradead.org) wrote:
> > 
> > I'm confused by the barrier games here.
> > 
> > Why not:
> > 
> >   void **it_func;
> > 
> >   preempt_disable();
> >   it_func = rcu_dereference((tp)->funcs);
> >   if (it_func) {
> >     for (; *it_func; it_func++)
> >       ((void(*)(proto))(*it_func))(args);
> >   }
> >   preempt_enable();
> > 
> > That is, why can we skip the barrier when !it_func? is that because at
> > that time we don't actually dereference it_func and therefore cannot
> > observe stale data?
> > 
> 
> Exactly. I used the implementation of rcu_assign_pointer as a hint that
> we did not need barriers when setting the pointer to NULL, and thus we
> should not need the read barrier when reading the NULL pointer, because
> it references no data.
> 
> #define rcu_assign_pointer(p, v) \
>         ({ \
>                 if (!__builtin_constant_p(v) || \
>                     ((v) != NULL)) \
>                         smp_wmb(); \
>                 (p) = (v); \
>         })

Yeah, I saw that,.. made me wonder. It basically assumes that when we
write:

  rcu_assign_pointer(foo, NULL);

foo will not be used as an index or offset.

I guess Paul has thought it through and verified all in-kernel use
cases, but it still makes me feel uncomfortable.

> #define rcu_dereference(p)     ({ \
>                                 typeof(p) _________p1 = ACCESS_ONCE(p); \
>                                 smp_read_barrier_depends(); \
>                                 (_________p1); \
>                                 })
> 
> But I think you are right, since we are already in unlikely code, using
> rcu_dereference as you do is better than my use of read barrier depends.
> It should not change anything in the assembly result except on alpha,
> where the read_barrier_depends() is not a nop.
> 
> I wonder if there would be a way to add this kind of NULL pointer case
> check without overhead in rcu_dereference() on alpha. I guess not, since
> the pointer is almost never known at compile-time. And I guess Paul must
> already have thought about it. The only case where we could add this
> test is when we know that we have a if (ptr != NULL) test following the
> rcu_dereference(); we could then assume the compiler will merge the two
> branches since they depend on the same condition.

I remember seeing a thread about all this special casing NULL, but have
never been able to find it again - my google skillz always fail me.

Basically it doesn't work if you use the variable as an index/offset,
because in that case 0 is a valid offset and you still generate a data
dependency.

IIRC the conclusion was that the gains were too small to spend more time
on it, although I would like to hear about the special case in
rcu_assign_pointer.

/me goes use git blame....

> > If so, does this really matter since we're already in an unlikely
> > section? Again, if so, this deserves a comment ;-)
> > 
> > [ still think those preempt_* calls should be called
> >   rcu_read_sched_lock() or such. ]
> > 
> > Anyway, does this still generate better code?
> > 
> 
> On x86_64 :
> 
>  820:   bf 01 00 00 00          mov    $0x1,%edi
>  825:   e8 00 00 00 00          callq  82a <thread_return+0x136>
>  82a:   48 8b 1d 00 00 00 00    mov    0x0(%rip),%rbx        # 831 <thread_return+0x13d>
>  831:   48 85 db                test   %rbx,%rbx
>  834:   75 21                   jne    857 <thread_return+0x163>
>  836:   eb 27                   jmp    85f <thread_return+0x16b>
>  838:   0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
>  83f:   00 
>  840:   48 8b 95 68 ff ff ff    mov    -0x98(%rbp),%rdx
>  847:   48 8b b5 60 ff ff ff    mov    -0xa0(%rbp),%rsi
>  84e:   4c 89 e7                mov    %r12,%rdi
>  851:   48 83 c3 08             add    $0x8,%rbx
>  855:   ff d0                   callq  *%rax
>  857:   48 8b 03                mov    (%rbx),%rax
>  85a:   48 85 c0                test   %rax,%rax
>  85d:   75 e1                   jne    840 <thread_return+0x14c>
>  85f:   bf 01 00 00 00          mov    $0x1,%edi
>  864:
> 
> for 68 bytes.
> 
> My original implementation was 77 bytes, so yes, we have a win.

Ah, good good ! :-)



* Re: [patch 01/15] Kernel Tracepoints
  2008-07-15 15:31               ` Peter Zijlstra
@ 2008-07-15 15:50                 ` Mathieu Desnoyers
  2008-08-01 21:10                   ` Paul E. McKenney
  2008-07-15 16:08                 ` Mathieu Desnoyers
  2008-07-15 17:50                 ` Mathieu Desnoyers
  2 siblings, 1 reply; 58+ messages in thread
From: Mathieu Desnoyers @ 2008-07-15 15:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: akpm, Ingo Molnar, linux-kernel, Masami Hiramatsu,
	Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie, Steven Rostedt,
	Alexander Viro, Eduard - Gabriel Munteanu, Paul E McKenney

* Peter Zijlstra (peterz@infradead.org) wrote:
> On Tue, 2008-07-15 at 11:22 -0400, Mathieu Desnoyers wrote:
> > * Peter Zijlstra (peterz@infradead.org) wrote:
> > > 
> > > I'm confused by the barrier games here.
> > > 
> > > Why not:
> > > 
> > >   void **it_func;
> > > 
> > >   preempt_disable();
> > >   it_func = rcu_dereference((tp)->funcs);
> > >   if (it_func) {
> > >     for (; *it_func; it_func++)
> > >       ((void(*)(proto))(*it_func))(args);
> > >   }
> > >   preempt_enable();
> > > 
> > > That is, why can we skip the barrier when !it_func? is that because at
> > > that time we don't actually dereference it_func and therefore cannot
> > > observe stale data?
> > > 
> > 
> > Exactly. I used the implementation of rcu_assign_pointer as a hint that
> > we did not need barriers when setting the pointer to NULL, and thus we
> > should not need the read barrier when reading the NULL pointer, because
> > it references no data.
> > 
> > #define rcu_assign_pointer(p, v) \
> >         ({ \
> >                 if (!__builtin_constant_p(v) || \
> >                     ((v) != NULL)) \
> >                         smp_wmb(); \
> >                 (p) = (v); \
> >         })
> 
> Yeah, I saw that,.. made me wonder. It basically assumes that when we
> write:
> 
>   rcu_assign_pointer(foo, NULL);
> 
> foo will not be used as an index or offset.
> 
> I guess Paul has thought it through and verified all in-kernel use
> cases, but it still makes me feel uncomfortable.
> 
> > #define rcu_dereference(p)     ({ \
> >                                 typeof(p) _________p1 = ACCESS_ONCE(p); \
> >                                 smp_read_barrier_depends(); \
> >                                 (_________p1); \
> >                                 })
> > 
> > But I think you are right, since we are already in unlikely code, using
> > rcu_dereference as you do is better than my use of read barrier depends.
> > It should not change anything in the assembly result except on alpha,
> > where the read_barrier_depends() is not a nop.
> > 
> > I wonder if there would be a way to add this kind of NULL pointer case
> > check without overhead in rcu_dereference() on alpha. I guess not, since
> > the pointer is almost never known at compile-time. And I guess Paul must
> > already have thought about it. The only case where we could add this
> > test is when we know that we have a if (ptr != NULL) test following the
> > rcu_dereference(); we could then assume the compiler will merge the two
> > branches since they depend on the same condition.
> 
> I remember seeing a thread about all this special casing NULL, but have
> never been able to find it again - my google skillz always fail me.
> 
> Basically it doesn't work if you use the variable as an index/offset,
> because in that case 0 is a valid offset and you still generate a data
> dependency.
> 
> IIRC the conclusion was that the gains were too small to spend more time
> on it, although I would like to hear about the special case in
> rcu_assign_pointer.
> 
> /me goes use git blame....
> 

Seems to come from :

commit d99c4f6b13b3149bc83703ab1493beaeaaaf8a2d

which refers to this discussion :

http://www.mail-archive.com/netdev@vger.kernel.org/msg54852.html

Mathieu


> > > If so, does this really matter since we're already in an unlikely
> > > section? Again, if so, this deserves a comment ;-)
> > > 
> > > [ still think those preempt_* calls should be called
> > >   rcu_read_sched_lock() or such. ]
> > > 
> > > Anyway, does this still generate better code?
> > > 
> > 
> > On x86_64 :
> > 
> >  820:   bf 01 00 00 00          mov    $0x1,%edi
> >  825:   e8 00 00 00 00          callq  82a <thread_return+0x136>
> >  82a:   48 8b 1d 00 00 00 00    mov    0x0(%rip),%rbx        # 831 <thread_return+0x13d>
> >  831:   48 85 db                test   %rbx,%rbx
> >  834:   75 21                   jne    857 <thread_return+0x163>
> >  836:   eb 27                   jmp    85f <thread_return+0x16b>
> >  838:   0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
> >  83f:   00 
> >  840:   48 8b 95 68 ff ff ff    mov    -0x98(%rbp),%rdx
> >  847:   48 8b b5 60 ff ff ff    mov    -0xa0(%rbp),%rsi
> >  84e:   4c 89 e7                mov    %r12,%rdi
> >  851:   48 83 c3 08             add    $0x8,%rbx
> >  855:   ff d0                   callq  *%rax
> >  857:   48 8b 03                mov    (%rbx),%rax
> >  85a:   48 85 c0                test   %rax,%rax
> >  85d:   75 e1                   jne    840 <thread_return+0x14c>
> >  85f:   bf 01 00 00 00          mov    $0x1,%edi
> >  864:
> > 
> > for 68 bytes.
> > 
> > My original implementation was 77 bytes, so yes, we have a win.
> 
> Ah, good good ! :-)
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: [patch 01/15] Kernel Tracepoints
  2008-07-15 15:31               ` Peter Zijlstra
  2008-07-15 15:50                 ` Mathieu Desnoyers
@ 2008-07-15 16:08                 ` Mathieu Desnoyers
  2008-07-15 16:25                   ` Peter Zijlstra
                                     ` (2 more replies)
  2008-07-15 17:50                 ` Mathieu Desnoyers
  2 siblings, 3 replies; 58+ messages in thread
From: Mathieu Desnoyers @ 2008-07-15 16:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: akpm, Ingo Molnar, linux-kernel, Masami Hiramatsu,
	Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie, Steven Rostedt,
	Alexander Viro, Eduard - Gabriel Munteanu, Paul E McKenney

* Peter Zijlstra (peterz@infradead.org) wrote:
> On Tue, 2008-07-15 at 11:22 -0400, Mathieu Desnoyers wrote:
> > * Peter Zijlstra (peterz@infradead.org) wrote:
> > > 
> > > I'm confused by the barrier games here.
> > > 
> > > Why not:
> > > 
> > >   void **it_func;
> > > 
> > >   preempt_disable();
> > >   it_func = rcu_dereference((tp)->funcs);
> > >   if (it_func) {
> > >     for (; *it_func; it_func++)
> > >       ((void(*)(proto))(*it_func))(args);
> > >   }
> > >   preempt_enable();
> > > 
> > > That is, why can we skip the barrier when !it_func? is that because at
> > > that time we don't actually dereference it_func and therefore cannot
> > > observe stale data?
> > > 
> > 
> > Exactly. I used the implementation of rcu_assign_pointer as a hint that
> > we did not need barriers when setting the pointer to NULL, and thus we
> > should not need the read barrier when reading the NULL pointer, because
> > it references no data.
> > 
> > #define rcu_assign_pointer(p, v) \
> >         ({ \
> >                 if (!__builtin_constant_p(v) || \
> >                     ((v) != NULL)) \
> >                         smp_wmb(); \
> >                 (p) = (v); \
> >         })
> 
> Yeah, I saw that,.. made me wonder. It basically assumes that when we
> write:
> 
>   rcu_assign_pointer(foo, NULL);
> 
> foo will not be used as an index or offset.
> 
> I guess Paul has thought it through and verified all in-kernel use
> cases, but it still makes me feel uncomfortable.
> 
> > #define rcu_dereference(p)     ({ \
> >                                 typeof(p) _________p1 = ACCESS_ONCE(p); \
> >                                 smp_read_barrier_depends(); \
> >                                 (_________p1); \
> >                                 })
> > 
> > But I think you are right, since we are already in unlikely code, using
> > rcu_dereference as you do is better than my use of read barrier depends.
> > It should not change anything in the assembly result except on alpha,
> > where the read_barrier_depends() is not a nop.
> > 
> > I wonder if there would be a way to add this kind of NULL pointer case
> > check without overhead in rcu_dereference() on alpha. I guess not, since
> > the pointer is almost never known at compile-time. And I guess Paul must
> > already have thought about it. The only case where we could add this
> > test is when we know that we have a if (ptr != NULL) test following the
> > rcu_dereference(); we could then assume the compiler will merge the two
> > branches since they depend on the same condition.
> 
> I remember seeing a thread about all this special casing NULL, but have
> never been able to find it again - my google skillz always fail me.
> 
> Basically it doesn't work if you use the variable as an index/offset,
> because in that case 0 is a valid offset and you still generate a data
> dependency.
> 
> IIRC the conclusion was that the gains were too small to spend more time
> on it, although I would like to hear about the special case in
> rcu_assign_pointer.
> 
> /me goes use git blame....
> 

Actually, we could probably do the following, which also adds an extra
coherency check about non-NULL pointer assumptions :

#ifdef CONFIG_RCU_DEBUG /* this would be new */
#define DEBUG_RCU_BUG_ON(x) BUG_ON(x)
#else
#define DEBUG_RCU_BUG_ON(x)
#endif

#define rcu_dereference(p)     ({ \
                                typeof(p) _________p1 = ACCESS_ONCE(p); \
                                if (p != NULL) \
                                  smp_read_barrier_depends(); \
                                (_________p1); \
                                })

#define rcu_dereference_non_null(p)     ({ \
                                typeof(p) _________p1 = ACCESS_ONCE(p); \
                                DEBUG_RCU_BUG_ON(p == NULL); \
                                smp_read_barrier_depends(); \
                                (_________p1); \
                                })

The use-case where rcu_dereference() would be used is when it is
followed by a null pointer check (grepping through the sources shows me
this is a very, very common case). In rare cases, it is assumed that the
pointer is never NULL and it is used just after the rcu_dereference. In
those cases, the extra test could be saved on alpha by using
rcu_dereference_non_null(p), which would check that the pointer is indeed
never NULL under some debug kernel configuration.

Does it make sense?

Mathieu

> > > If so, does this really matter since we're already in an unlikely
> > > section? Again, if so, this deserves a comment ;-)
> > > 
> > > [ still think those preempt_* calls should be called
> > >   rcu_read_sched_lock() or such. ]
> > > 
> > > Anyway, does this still generate better code?
> > > 
> > 
> > On x86_64 :
> > 
> >  820:   bf 01 00 00 00          mov    $0x1,%edi
> >  825:   e8 00 00 00 00          callq  82a <thread_return+0x136>
> >  82a:   48 8b 1d 00 00 00 00    mov    0x0(%rip),%rbx        # 831 <thread_return+0x13d>
> >  831:   48 85 db                test   %rbx,%rbx
> >  834:   75 21                   jne    857 <thread_return+0x163>
> >  836:   eb 27                   jmp    85f <thread_return+0x16b>
> >  838:   0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
> >  83f:   00 
> >  840:   48 8b 95 68 ff ff ff    mov    -0x98(%rbp),%rdx
> >  847:   48 8b b5 60 ff ff ff    mov    -0xa0(%rbp),%rsi
> >  84e:   4c 89 e7                mov    %r12,%rdi
> >  851:   48 83 c3 08             add    $0x8,%rbx
> >  855:   ff d0                   callq  *%rax
> >  857:   48 8b 03                mov    (%rbx),%rax
> >  85a:   48 85 c0                test   %rax,%rax
> >  85d:   75 e1                   jne    840 <thread_return+0x14c>
> >  85f:   bf 01 00 00 00          mov    $0x1,%edi
> >  864:
> > 
> > for 68 bytes.
> > 
> > My original implementation was 77 bytes, so yes, we have a win.
> 
> Ah, good good ! :-)
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: [patch 01/15] Kernel Tracepoints
  2008-07-15 16:08                 ` Mathieu Desnoyers
@ 2008-07-15 16:25                   ` Peter Zijlstra
  2008-07-15 16:51                     ` Mathieu Desnoyers
  2008-08-01 21:10                     ` Paul E. McKenney
  2008-07-15 16:26                   ` Mathieu Desnoyers
  2008-08-01 21:10                   ` Paul E. McKenney
  2 siblings, 2 replies; 58+ messages in thread
From: Peter Zijlstra @ 2008-07-15 16:25 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: akpm, Ingo Molnar, linux-kernel, Masami Hiramatsu,
	Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie, Steven Rostedt,
	Alexander Viro, Eduard - Gabriel Munteanu, Paul E McKenney

On Tue, 2008-07-15 at 12:08 -0400, Mathieu Desnoyers wrote:
> * Peter Zijlstra (peterz@infradead.org) wrote:
> > On Tue, 2008-07-15 at 11:22 -0400, Mathieu Desnoyers wrote:
> > > * Peter Zijlstra (peterz@infradead.org) wrote:
> > > > 
> > > > I'm confused by the barrier games here.
> > > > 
> > > > Why not:
> > > > 
> > > >   void **it_func;
> > > > 
> > > >   preempt_disable();
> > > >   it_func = rcu_dereference((tp)->funcs);
> > > >   if (it_func) {
> > > >     for (; *it_func; it_func++)
> > > >       ((void(*)(proto))(*it_func))(args);
> > > >   }
> > > >   preempt_enable();
> > > > 
> > > > That is, why can we skip the barrier when !it_func? is that because at
> > > > that time we don't actually dereference it_func and therefore cannot
> > > > observe stale data?
> > > > 
> > > 
> > > Exactly. I used the implementation of rcu_assign_pointer as a hint that
> > > we did not need barriers when setting the pointer to NULL, and thus we
> > > should not need the read barrier when reading the NULL pointer, because
> > > it references no data.
> > > 
> > > #define rcu_assign_pointer(p, v) \
> > >         ({ \
> > >                 if (!__builtin_constant_p(v) || \
> > >                     ((v) != NULL)) \
> > >                         smp_wmb(); \
> > >                 (p) = (v); \
> > >         })
> > 
> > Yeah, I saw that,.. made me wonder. It basically assumes that when we
> > write:
> > 
> >   rcu_assign_pointer(foo, NULL);
> > 
> > foo will not be used as an index or offset.
> > 
> > I guess Paul has thought it through and verified all in-kernel use
> > cases, but it still makes me feel uncomfortable.
> > 
> > > #define rcu_dereference(p)     ({ \
> > >                                 typeof(p) _________p1 = ACCESS_ONCE(p); \
> > >                                 smp_read_barrier_depends(); \
> > >                                 (_________p1); \
> > >                                 })
> > > 
> > > But I think you are right, since we are already in unlikely code, using
> > > rcu_dereference as you do is better than my use of read barrier depends.
> > > It should not change anything in the assembly result except on alpha,
> > > where the read_barrier_depends() is not a nop.
> > > 
> > > I wonder if there would be a way to add this kind of NULL pointer case
> > > check without overhead in rcu_dereference() on alpha. I guess not, since
> > > the pointer is almost never known at compile-time. And I guess Paul must
> > > already have thought about it. The only case where we could add this
> > > test is when we know that we have a if (ptr != NULL) test following the
> > > rcu_dereference(); we could then assume the compiler will merge the two
> > > branches since they depend on the same condition.
> > 
> > I remember seeing a thread about all this special casing NULL, but have
> > never been able to find it again - my google skillz always fail me.
> > 
> > Basically it doesn't work if you use the variable as an index/offset,
> > because in that case 0 is a valid offset and you still generate a data
> > dependency.
> > 
> > IIRC the conclusion was that the gains were too small to spend more time
> > on it, although I would like to hear about the special case in
> > rcu_assign_pointer.
> > 
> > /me goes use git blame....
> > 
> 
> Actually, we could probably do the following, which also adds an extra
> coherency check about non-NULL pointer assumptions :
> 
> #ifdef CONFIG_RCU_DEBUG /* this would be new */
> #define DEBUG_RCU_BUG_ON(x) BUG_ON(x)
> #else
> #define DEBUG_RCU_BUG_ON(x)
> #endif
> 
> #define rcu_dereference(p)     ({ \
>                                 typeof(p) _________p1 = ACCESS_ONCE(p); \
>                                 if (p != NULL) \
>                                   smp_read_barrier_depends(); \
>                                 (_________p1); \
>                                 })
> 
> #define rcu_dereference_non_null(p)     ({ \
>                                 typeof(p) _________p1 = ACCESS_ONCE(p); \
>                                 DEBUG_RCU_BUG_ON(p == NULL); \
>                                 smp_read_barrier_depends(); \
>                                 (_________p1); \
>                                 })
> 
> The use-case where rcu_dereference() would be used is when it is
> followed by a null pointer check (grepping through the sources shows me
> this is a very, very common case). In rare cases, it is assumed that the
> pointer is never NULL and it is used just after the rcu_dereference. In
> those cases, the extra test could be saved on alpha by using
> rcu_dereference_non_null(p), which would check that the pointer is indeed
> never NULL under some debug kernel configuration.
> 
> Does it make sense?

This would break the case where the dereferenced variable is used as an
index/offset where 0 is a valid value and still generates data
dependencies.

So if with your new version we do:

  i = rcu_dereference(foo);
  j = table[i];

which translates into:

  i = ACCESS_ONCE(foo);
  if (i)
    smp_read_barrier_depends();
  j = table[i];

which when i == 0, would fail to do the barrier and can thus cause j to
be a wrong value.

Sadly I'll have to defer to Paul to explain exactly how that can happen
- I always get my head in a horrible twist with this case.





* Re: [patch 01/15] Kernel Tracepoints
  2008-07-15 16:08                 ` Mathieu Desnoyers
  2008-07-15 16:25                   ` Peter Zijlstra
@ 2008-07-15 16:26                   ` Mathieu Desnoyers
  2008-08-01 21:10                   ` Paul E. McKenney
  2 siblings, 0 replies; 58+ messages in thread
From: Mathieu Desnoyers @ 2008-07-15 16:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: akpm, Ingo Molnar, linux-kernel, Masami Hiramatsu,
	Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie, Steven Rostedt,
	Alexander Viro, Eduard - Gabriel Munteanu, Paul E McKenney

* Mathieu Desnoyers (mathieu.desnoyers@polymtl.ca) wrote:
> * Peter Zijlstra (peterz@infradead.org) wrote:
> > On Tue, 2008-07-15 at 11:22 -0400, Mathieu Desnoyers wrote:
> > > * Peter Zijlstra (peterz@infradead.org) wrote:
> > > > 
> > > > I'm confused by the barrier games here.
> > > > 
> > > > Why not:
> > > > 
> > > >   void **it_func;
> > > > 
> > > >   preempt_disable();
> > > >   it_func = rcu_dereference((tp)->funcs);
> > > >   if (it_func) {
> > > >     for (; *it_func; it_func++)
> > > >       ((void(*)(proto))(*it_func))(args);
> > > >   }
> > > >   preempt_enable();
> > > > 
> > > > That is, why can we skip the barrier when !it_func? is that because at
> > > > that time we don't actually dereference it_func and therefore cannot
> > > > observe stale data?
> > > > 
> > > 
> > > Exactly. I used the implementation of rcu_assign_pointer as a hint that
> > > we did not need barriers when setting the pointer to NULL, and thus we
> > > should not need the read barrier when reading the NULL pointer, because
> > > it references no data.
> > > 
> > > #define rcu_assign_pointer(p, v) \
> > >         ({ \
> > >                 if (!__builtin_constant_p(v) || \
> > >                     ((v) != NULL)) \
> > >                         smp_wmb(); \
> > >                 (p) = (v); \
> > >         })
> > 
> > Yeah, I saw that,.. made me wonder. It basically assumes that when we
> > write:
> > 
> >   rcu_assign_pointer(foo, NULL);
> > 
> > foo will not be used as an index or offset.
> > 
> > I guess Paul has thought it through and verified all in-kernel use
> > cases, but it still makes me feel uncomfortable.
> > 
> > > #define rcu_dereference(p)     ({ \
> > >                                 typeof(p) _________p1 = ACCESS_ONCE(p); \
> > >                                 smp_read_barrier_depends(); \
> > >                                 (_________p1); \
> > >                                 })
> > > 
> > > But I think you are right, since we are already in unlikely code, using
> > > rcu_dereference as you do is better than my use of read barrier depends.
> > > It should not change anything in the assembly result except on alpha,
> > > where the read_barrier_depends() is not a nop.
> > > 
> > > I wonder if there would be a way to add this kind of NULL pointer case
> > > check without overhead in rcu_dereference() on alpha. I guess not, since
> > > the pointer is almost never known at compile-time. And I guess Paul must
> > > already have thought about it. The only case where we could add this
> > > test is when we know that we have a if (ptr != NULL) test following the
> > > rcu_dereference(); we could then assume the compiler will merge the two
> > > branches since they depend on the same condition.
> > 
> > I remember seeing a thread about all this special casing NULL, but have
> > never been able to find it again - my google skillz always fail me.
> > 
> > Basically it doesn't work if you use the variable as an index/offset,
> > because in that case 0 is a valid offset and you still generate a data
> > dependency.
> > 
> > IIRC the conclusion was that the gains were too small to spend more time
> > on it, although I would like to hear about the special case in
> > rcu_assign_pointer.
> > 
> > /me goes use git blame....
> > 
> 
> Actually, we could probably do the following, which also adds an extra
> coherency check about non-NULL pointer assumptions :
> 
> #ifdef CONFIG_RCU_DEBUG /* this would be new */
> #define DEBUG_RCU_BUG_ON(x) BUG_ON(x)
> #else
> #define DEBUG_RCU_BUG_ON(x)
> #endif
> 
> #define rcu_dereference(p)     ({ \
>                                 typeof(p) _________p1 = ACCESS_ONCE(p); \
>                                 if (p != NULL) \

Actually this line should be :
                                if (_________p1 != NULL) \

>                                   smp_read_barrier_depends(); \
>                                 (_________p1); \
>                                 })
> 
> #define rcu_dereference_non_null(p)     ({ \
>                                 typeof(p) _________p1 = ACCESS_ONCE(p); \
>                                 DEBUG_RCU_BUG_ON(p == NULL); \

And this one :
                                DEBUG_RCU_BUG_ON(_________p1 == NULL); \

Mathieu


>                                 smp_read_barrier_depends(); \
>                                 (_________p1); \
>                                 })
> 
> The use-case where rcu_dereference() would be used is when it is
> followed by a null pointer check (grepping through the sources shows me
> this is a very, very common case). In rare cases, it is assumed that the
> pointer is never NULL and it is used just after the rcu_dereference. In
> those cases, the extra test could be saved on alpha by using
> rcu_dereference_non_null(p), which would check that the pointer is indeed
> never NULL under some debug kernel configuration.
> 
> Does it make sense?
> 
> Mathieu
> 
> > > > If so, does this really matter since we're already in an unlikely
> > > > section? Again, if so, this deserves a comment ;-)
> > > > 
> > > > [ still think those preempt_* calls should be called
> > > >   rcu_read_sched_lock() or such. ]
> > > > 
> > > > Anyway, does this still generate better code?
> > > > 
> > > 
> > > On x86_64 :
> > > 
> > >  820:   bf 01 00 00 00          mov    $0x1,%edi
> > >  825:   e8 00 00 00 00          callq  82a <thread_return+0x136>
> > >  82a:   48 8b 1d 00 00 00 00    mov    0x0(%rip),%rbx        # 831 <thread_return+0x13d>
> > >  831:   48 85 db                test   %rbx,%rbx
> > >  834:   75 21                   jne    857 <thread_return+0x163>
> > >  836:   eb 27                   jmp    85f <thread_return+0x16b>
> > >  838:   0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
> > >  83f:   00 
> > >  840:   48 8b 95 68 ff ff ff    mov    -0x98(%rbp),%rdx
> > >  847:   48 8b b5 60 ff ff ff    mov    -0xa0(%rbp),%rsi
> > >  84e:   4c 89 e7                mov    %r12,%rdi
> > >  851:   48 83 c3 08             add    $0x8,%rbx
> > >  855:   ff d0                   callq  *%rax
> > >  857:   48 8b 03                mov    (%rbx),%rax
> > >  85a:   48 85 c0                test   %rax,%rax
> > >  85d:   75 e1                   jne    840 <thread_return+0x14c>
> > >  85f:   bf 01 00 00 00          mov    $0x1,%edi
> > >  864:
> > > 
> > > for 68 bytes.
> > > 
> > > My original implementation was 77 bytes, so yes, we have a win.
> > 
> > Ah, good good ! :-)
> > 
> 
> -- 
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: [patch 01/15] Kernel Tracepoints
  2008-07-15 16:25                   ` Peter Zijlstra
@ 2008-07-15 16:51                     ` Mathieu Desnoyers
  2008-08-01 21:10                       ` Paul E. McKenney
  2008-08-01 21:10                     ` Paul E. McKenney
  1 sibling, 1 reply; 58+ messages in thread
From: Mathieu Desnoyers @ 2008-07-15 16:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: akpm, Ingo Molnar, linux-kernel, Masami Hiramatsu,
	Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie, Steven Rostedt,
	Alexander Viro, Eduard - Gabriel Munteanu, Paul E McKenney

* Peter Zijlstra (peterz@infradead.org) wrote:
> On Tue, 2008-07-15 at 12:08 -0400, Mathieu Desnoyers wrote:
> > * Peter Zijlstra (peterz@infradead.org) wrote:
> > > On Tue, 2008-07-15 at 11:22 -0400, Mathieu Desnoyers wrote:
> > > > * Peter Zijlstra (peterz@infradead.org) wrote:
> > > > > 
> > > > > I'm confused by the barrier games here.
> > > > > 
> > > > > Why not:
> > > > > 
> > > > >   void **it_func;
> > > > > 
> > > > >   preempt_disable();
> > > > >   it_func = rcu_dereference((tp)->funcs);
> > > > >   if (it_func) {
> > > > >     for (; *it_func; it_func++)
> > > > >       ((void(*)(proto))(*it_func))(args);
> > > > >   }
> > > > >   preempt_enable();
> > > > > 
> > > > > That is, why can we skip the barrier when !it_func? is that because at
> > > > > that time we don't actually dereference it_func and therefore cannot
> > > > > observe stale data?
> > > > > 
> > > > 
> > > > Exactly. I used the implementation of rcu_assign_pointer as a hint that
> > > > we did not need barriers when setting the pointer to NULL, and thus we
> > > > should not need the read barrier when reading the NULL pointer, because
> > > > it references no data.
> > > > 
> > > > #define rcu_assign_pointer(p, v) \
> > > >         ({ \
> > > >                 if (!__builtin_constant_p(v) || \
> > > >                     ((v) != NULL)) \
> > > >                         smp_wmb(); \
> > > >                 (p) = (v); \
> > > >         })
> > > 
> > > Yeah, I saw that,.. made me wonder. It basically assumes that when we
> > > write:
> > > 
> > >   rcu_assign_pointer(foo, NULL);
> > > 
> > > foo will not be used as an index or offset.
> > > 
> > > I guess Paul has thought it through and verified all in-kernel use
> > > cases, but it still makes me feel uncomfortable.
> > > 
> > > > #define rcu_dereference(p)     ({ \
> > > >                                 typeof(p) _________p1 = ACCESS_ONCE(p); \
> > > >                                 smp_read_barrier_depends(); \
> > > >                                 (_________p1); \
> > > >                                 })
> > > > 
> > > > But I think you are right, since we are already in unlikely code, using
> > > > rcu_dereference as you do is better than my use of read barrier depends.
> > > > It should not change anything in the assembly result except on alpha,
> > > > where the read_barrier_depends() is not a nop.
> > > > 
> > > > I wonder if there would be a way to add this kind of NULL pointer case
> > > > check without overhead in rcu_dereference() on alpha. I guess not, since
> > > > the pointer is almost never known at compile-time. And I guess Paul must
> > > > already have thought about it. The only case where we could add this
> > > > test is when we know that we have a if (ptr != NULL) test following the
> > > > rcu_dereference(); we could then assume the compiler will merge the two
> > > > branches since they depend on the same condition.
> > > 
> > > I remember seeing a thread about all this special casing NULL, but have
> > > never been able to find it again - my google skillz always fail me.
> > > 
> > > Basically it doesn't work if you use the variable as an index/offset,
> > > because in that case 0 is a valid offset and you still generate a data
> > > dependency.
> > > 
> > > IIRC the conclusion was that the gains were too small to spend more time
> > > on it, although I would like to hear about the special case in
> > > rcu_assign_pointer.
> > > 
> > > /me goes use git blame....
> > > 
> > 
> > Actually, we could probably do the following, which also adds an extra
> > coherency check about non-NULL pointer assumptions :
> > 
> > #ifdef CONFIG_RCU_DEBUG /* this would be new */
> > #define DEBUG_RCU_BUG_ON(x) BUG_ON(x)
> > #else
> > #define DEBUG_RCU_BUG_ON(x)
> > #endif
> > 
> > #define rcu_dereference(p)     ({ \
> >                                 typeof(p) _________p1 = ACCESS_ONCE(p); \
> >                                 if (p != NULL) \
> >                                   smp_read_barrier_depends(); \
> >                                 (_________p1); \
> >                                 })
> > 
> > #define rcu_dereference_non_null(p)     ({ \
> >                                 typeof(p) _________p1 = ACCESS_ONCE(p); \
> >                                 DEBUG_RCU_BUG_ON(p == NULL); \
> >                                 smp_read_barrier_depends(); \
> >                                 (_________p1); \
> >                                 })
> > 
> > The use-case where rcu_dereference() would be used is when it is
> > followed by a null pointer check (grepping through the sources shows me
> > this is a very, very common case). In rare cases, it is assumed that the
> > pointer is never NULL and it is used just after the rcu_dereference. In
> > those cases, the extra test could be saved on alpha by using
> > rcu_dereference_non_null(p), which would check that the pointer is indeed
> > never NULL under some debug kernel configuration.
> > 
> > Does it make sense?
> 
> This would break the case where the dereferenced variable is used as an
> index/offset where 0 is a valid value and still generates data
> dependencies.
> 
> So if with your new version we do:
> 
>   i = rcu_dereference(foo);
>   j = table[i];
> 
> which translates into:
> 
>   i = ACCESS_ONCE(foo);
>   if (i)
>     smp_read_barrier_depends();
>   j = table[i];
> 
> which when i == 0, would fail to do the barrier and can thus cause j to
> be a wrong value.
> 
> Sadly I'll have to defer to Paul to explain exactly how that can happen
> - I always get my head in a horrible twist with this case.
> 

I completely agree with you. However, given the current
rcu_assign_pointer() implementation, we already have this problem. My
proposal assumes the current rcu_assign_pointer() behavior is correct
and that those pointers are never used as indexes/offsets.

We could enforce this as a compile-time check with something along the
lines of :

#define BUILD_BUG_ON_NOT_OFFSETABLE(x) (void)(x)[0]

And use it both in rcu_assign_pointer() and rcu_dereference().  It would
check for any type passed to rcu_assign_pointer and rcu_dereference
which is not either a pointer or an array.

Then if someone really wants to shoot themselves in the foot by casting a
pointer to a long after the rcu_dereference, that's their problem.

Hrm, looking at rcu_assign_pointer tells me that the ((v) != NULL) test
should probably already complain if v is not a pointer. So my build test
is probably unneeded.

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: [patch 01/15] Kernel Tracepoints
  2008-07-15 15:31               ` Peter Zijlstra
  2008-07-15 15:50                 ` Mathieu Desnoyers
  2008-07-15 16:08                 ` Mathieu Desnoyers
@ 2008-07-15 17:50                 ` Mathieu Desnoyers
  2 siblings, 0 replies; 58+ messages in thread
From: Mathieu Desnoyers @ 2008-07-15 17:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: akpm, Ingo Molnar, linux-kernel, Masami Hiramatsu,
	Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie, Steven Rostedt,
	Alexander Viro, Eduard - Gabriel Munteanu, Paul E McKenney

* Peter Zijlstra (peterz@infradead.org) wrote:
> > > Anyway, does this still generate better code?
> > > 
> > 
> > On x86_64 :
> > 
> >  820:   bf 01 00 00 00          mov    $0x1,%edi
> >  825:   e8 00 00 00 00          callq  82a <thread_return+0x136>
> >  82a:   48 8b 1d 00 00 00 00    mov    0x0(%rip),%rbx        # 831 <thread_return+0x13d>
> >  831:   48 85 db                test   %rbx,%rbx
> >  834:   75 21                   jne    857 <thread_return+0x163>
> >  836:   eb 27                   jmp    85f <thread_return+0x16b>
> >  838:   0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
> >  83f:   00 
> >  840:   48 8b 95 68 ff ff ff    mov    -0x98(%rbp),%rdx
> >  847:   48 8b b5 60 ff ff ff    mov    -0xa0(%rbp),%rsi
> >  84e:   4c 89 e7                mov    %r12,%rdi
> >  851:   48 83 c3 08             add    $0x8,%rbx
> >  855:   ff d0                   callq  *%rax
> >  857:   48 8b 03                mov    (%rbx),%rax
> >  85a:   48 85 c0                test   %rax,%rax
> >  85d:   75 e1                   jne    840 <thread_return+0x14c>
> >  85f:   bf 01 00 00 00          mov    $0x1,%edi
> >  864:
> > 
> > for 68 bytes.
> > 
> > My original implementation was 77 bytes, so yes, we have a win.
> 
> Ah, good good ! :-)
> 

For the same number of instruction bytes, here is yet another improvement. I
removed the it_func[0] NULL test, which can never trigger: we never
have an empty array. If the array becomes empty, the array pointer is set to
NULL and the array itself is eventually freed once a quiescent state is reached.

/*
 * it_func[0] is never NULL because there is at least one element in the array
 * when the array itself is non NULL.
 */
#define __DO_TRACE(tp, proto, args)                                     \
        do {                                                            \
                void **it_func;                                         \
                                                                        \
                preempt_disable();                                      \
                it_func = rcu_dereference((tp)->funcs);                 \
                if (it_func) {                                          \
                        do {                                            \
                                ((void(*)(proto))(*it_func))(args);     \
                        } while (*(++it_func));                         \
                }                                                       \
                preempt_enable();                                       \
        } while (0)

P.S.: I'll replace the preempt_disable/enable pairs with RCU read locks
when I port this patchset to linux-next. I temporarily keep the preempt
disable/enable statements.

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: [patch 01/15] Kernel Tracepoints
  2008-07-15 15:13           ` Peter Zijlstra
@ 2008-07-15 18:22             ` Mathieu Desnoyers
  2008-07-15 18:33               ` Steven Rostedt
  2008-07-15 18:52             ` Masami Hiramatsu
  1 sibling, 1 reply; 58+ messages in thread
From: Mathieu Desnoyers @ 2008-07-15 18:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: akpm, Ingo Molnar, linux-kernel, Masami Hiramatsu,
	Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie, Steven Rostedt,
	Alexander Viro, Eduard - Gabriel Munteanu, Paul E McKenney

* Peter Zijlstra (peterz@infradead.org) wrote:
> On Tue, 2008-07-15 at 10:46 -0400, Mathieu Desnoyers wrote:
> 
> > Talking about headers, I have noticed that placing headers with the code
> > may not be as clean as I would hope. For instance, the kernel/irq-trace.h
> > header, when included from kernel/irq/handle.c, has to be included with:
> > 
> > #include "../irq-trace.h"
> > 
> > Which is not _that_ bad, but when we want to instrument the irq handler
> > found in arch/x86/kernel/cpu/mcheck/mce_intel_64.c, including
> > #include "../../../../../kernel/irq-trace.h" makes me go "yeeeek!"
> > 
> > How about creating include/trace/irq.h and friends ?
> 
> Might as well.. anybody else got opinions?
> 

I'm also wondering if it's better to have :

filemap.h
fs.h
hugetlb.h
ipc.h
ipv4.h
ipv6.h
irq.h
kernel.h
memory.h
net.h
page.h
sched.h
swap.h
timer.h

all in include/trace/ or to create subdirectories first, like :

include/trace/net/
include/trace/mm/
...

or to go the other way around and re-use the existing subdirectories :

include/net/trace/
include/mm/trace/
...

?

Mathieu



-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: [patch 01/15] Kernel Tracepoints
  2008-07-15 18:22             ` Mathieu Desnoyers
@ 2008-07-15 18:33               ` Steven Rostedt
  0 siblings, 0 replies; 58+ messages in thread
From: Steven Rostedt @ 2008-07-15 18:33 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, akpm, Ingo Molnar, linux-kernel,
	Masami Hiramatsu, Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie,
	Alexander Viro, Eduard - Gabriel Munteanu, Paul E McKenney


On Tue, 15 Jul 2008, Mathieu Desnoyers wrote:
>
> I'm also wondering if it's better to have :
>
> filemap.h
> fs.h
> hugetlb.h
> ipc.h
> ipv4.h
> ipv6.h
> irq.h
> kernel.h
> memory.h
> net.h
> page.h
> sched.h
> swap.h
> timer.h

This might be a better idea.

>
> all in include/trace/ or to create subdirectories first, like :
>
> include/trace/net/
> include/trace/mm/

I think that is too much. A single trace directory should be sufficient.

> ....
>
> or to go the other way around and re-use the existing subdirectories :
>
> include/net/trace/
> include/mm/trace/
> ....

I'm definitely against that.

-- Steve



* Re: [patch 01/15] Kernel Tracepoints
  2008-07-15 15:13           ` Peter Zijlstra
  2008-07-15 18:22             ` Mathieu Desnoyers
@ 2008-07-15 18:52             ` Masami Hiramatsu
  2008-07-15 19:08               ` Mathieu Desnoyers
  1 sibling, 1 reply; 58+ messages in thread
From: Masami Hiramatsu @ 2008-07-15 18:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mathieu Desnoyers, akpm, Ingo Molnar, linux-kernel,
	Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie, Steven Rostedt,
	Alexander Viro, Eduard - Gabriel Munteanu, Paul E McKenney

Hi,

Peter Zijlstra wrote:
> On Tue, 2008-07-15 at 10:46 -0400, Mathieu Desnoyers wrote:
> 
>> Talking about headers, I have noticed that placing headers with the code
>> may not be as clean as I would hope. For instance, the kernel/irq-trace.h
>> header, when included from kernel/irq/handle.c, has to be included with:
>>
>> #include "../irq-trace.h"
>>
> >> Which is not _that_ bad, but when we want to instrument the irq handler
>> found in arch/x86/kernel/cpu/mcheck/mce_intel_64.c, including
>> #include "../../../../../kernel/irq-trace.h" makes me go "yeeeek!"
>>
>> How about creating include/trace/irq.h and friends ?
> 
> Might as well.. anybody else got opinions?

I just wonder why DEFINE_TRACE is used in separate headers
instead of include/linux/irq.h directly.

anyway, #include <trace/XXX.h> is good to me.

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com



* Re: [patch 01/15] Kernel Tracepoints
  2008-07-15 14:03       ` Peter Zijlstra
  2008-07-15 14:46         ` Mathieu Desnoyers
@ 2008-07-15 19:02         ` Mathieu Desnoyers
  2008-07-15 19:52           ` Peter Zijlstra
  1 sibling, 1 reply; 58+ messages in thread
From: Mathieu Desnoyers @ 2008-07-15 19:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: akpm, Ingo Molnar, linux-kernel, Masami Hiramatsu,
	Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie, Steven Rostedt,
	Alexander Viro, Eduard - Gabriel Munteanu, Paul E McKenney

* Peter Zijlstra (peterz@infradead.org) wrote:
> On Tue, 2008-07-15 at 09:25 -0400, Mathieu Desnoyers wrote:
> > * Peter Zijlstra (peterz@infradead.org) wrote:
> > > On Wed, 2008-07-09 at 10:59 -0400, Mathieu Desnoyers wrote:
> 
> > > > +#define __DO_TRACE(tp, proto, args)					\
> > > > +	do {								\
> > > > +		int i;							\
> > > > +		void **funcs;						\
> > > > +		preempt_disable();					\
> > > > +		funcs = (tp)->funcs;					\
> > > > +		smp_read_barrier_depends();				\
> > > > +		if (funcs) {						\
> > > > +			for (i = 0; funcs[i]; i++) {			\
> > > 
> > > Also, why is the preempt_disable needed?
> > > 
> > 
> > Addition and removal of tracepoints is synchronized by RCU using the
> > scheduler (and preempt_disable) as guarantees to find a quiescent state
> > (this is really RCU "classic"). The update side uses rcu_barrier_sched()
> > with call_rcu_sched() and the read/execute side uses
> > "preempt_disable()/preempt_enable()".
> 
> > > > +static void tracepoint_entry_free_old(struct tracepoint_entry *entry, void *old)
> > > > +{
> > > > +	if (!old)
> > > > +		return;
> > > > +	entry->oldptr = old;
> > > > +	entry->rcu_pending = 1;
> > > > +	/* write rcu_pending before calling the RCU callback */
> > > > +	smp_wmb();
> > > > +#ifdef CONFIG_PREEMPT_RCU
> > > > +	synchronize_sched();	/* Until we have the call_rcu_sched() */
> > > > +#endif
> > > 
> > > Does this have something to do with the preempt_disable above?
> > > 
> > 
> > Yes, it does. We make sure the previous array containing probes, which
> > has been scheduled for deletion by the rcu callback, is indeed freed
> > before we proceed to the next update. It therefore limits the rate of
> > modification of a single tracepoint to one update per RCU period. The
> > objective here is to permit fast batch add/removal of probes on
> > _different_ tracepoints.
> > 
> > This use of "synchronize_sched()" can be changed for call_rcu_sched() in
> > linux-next, I'll fix this.
> 
> Right, I thought as much, it's just that the raw preempt_disable()
> without comments leaves one wondering if there is anything else going
> on.
> 
> Would it make sense to add:
> 
> rcu_read_sched_lock()
> rcu_read_sched_unlock()
> 
> to match:
> 
> call_rcu_sched()
> rcu_barrier_sched()
> synchronize_sched()
> 
> ?
> 

Actually I think it's better to call them
rcu_read_lock_sched() and rcu_read_unlock_sched() to match the _bh()
equivalent already in rcupdate.h.

Mathieu


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: [patch 01/15] Kernel Tracepoints
  2008-07-15 18:52             ` Masami Hiramatsu
@ 2008-07-15 19:08               ` Mathieu Desnoyers
  0 siblings, 0 replies; 58+ messages in thread
From: Mathieu Desnoyers @ 2008-07-15 19:08 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Peter Zijlstra, akpm, Ingo Molnar, linux-kernel,
	Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie, Steven Rostedt,
	Alexander Viro, Eduard - Gabriel Munteanu, Paul E McKenney

* Masami Hiramatsu (mhiramat@redhat.com) wrote:
> Hi,
> 
> Peter Zijlstra wrote:
> > On Tue, 2008-07-15 at 10:46 -0400, Mathieu Desnoyers wrote:
> > 
> >> Talking about headers, I have noticed that placing headers with the code
> >> may not be as clean as I would hope. For instance, the kernel/irq-trace.h
> >> header, when included from kernel/irq/handle.c, has to be included with:
> >>
> >> #include "../irq-trace.h"
> >>
> >> Which is not _that_ bad, but when we want to instrument the irq handler
> >> found in arch/x86/kernel/cpu/mcheck/mce_intel_64.c, including
> >> #include "../../../../../kernel/irq-trace.h" makes me go "yeeeek!"
> >>
> >> How about creating include/trace/irq.h and friends ?
> > 
> > Might as well.. anybody else got opinions?
> 
> I just wonder why DEFINE_TRACE is used in separate headers
> instead of include/linux/irq.h directly.
> 
> anyway, #include <trace/XXX.h> is good to me.
> 

Having these headers all placed nicely together will make it easier for
people who are looking for already existing tracepoints to locate them.

It's also worth noting that I am considering deploying a standard set of
tracepoints for userspace in a relatively short time frame. e.g. having
the ability to add tracepoints to pthread mutexes seems like an
interesting thing to have. And that will definitely require those
headers to sit somewhere around /usr/include/trace/ or something
similar, otherwise trying to locate those tracepoints will be hellish.
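For illustration, a consolidated header under include/trace/ might look like the sketch below. DEFINE_TRACE, TPPROTO and TPARGS are the macros introduced in patch 01/15; the specific tracepoint names and prototypes here are hypothetical, not taken from the patchset:

```c
/* Hypothetical include/trace/irq.h -- names and prototypes are
 * illustrative assumptions only. */
#ifndef _TRACE_IRQ_H
#define _TRACE_IRQ_H

#include <linux/tracepoint.h>

DEFINE_TRACE(irq_entry,
	TPPROTO(unsigned int id, struct pt_regs *regs),
	TPARGS(id, regs));

DEFINE_TRACE(irq_exit,
	TPPROTO(irqreturn_t retval),
	TPARGS(retval));

#endif /* _TRACE_IRQ_H */
```

An instrumentation site such as kernel/irq/handle.c, or the mce handler deep under arch/x86/, could then write #include &lt;trace/irq.h&gt; regardless of its own depth in the tree.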

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: [patch 01/15] Kernel Tracepoints
  2008-07-15 19:02         ` Mathieu Desnoyers
@ 2008-07-15 19:52           ` Peter Zijlstra
  0 siblings, 0 replies; 58+ messages in thread
From: Peter Zijlstra @ 2008-07-15 19:52 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: akpm, Ingo Molnar, linux-kernel, Masami Hiramatsu,
	Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie, Steven Rostedt,
	Alexander Viro, Eduard - Gabriel Munteanu, Paul E McKenney

On Tue, 2008-07-15 at 15:02 -0400, Mathieu Desnoyers wrote:
> * Peter Zijlstra (peterz@infradead.org) wrote:

> > Would it make sense to add:
> > 
> > rcu_read_sched_lock()
> > rcu_read_sched_unlock()
> > 
> > to match:
> > 
> > call_rcu_sched()
> > rcu_barrier_sched()
> > synchronize_sched()
> > 
> > ?
> > 
> 
> Actually I think it's better to call them
> rcu_read_lock_sched() and rcu_read_unlock_sched() to match the _bh()
> equivalent already in rcupdate.h.

Sure.



* Re: [patch 01/15] Kernel Tracepoints
  2008-07-15 15:50                 ` Mathieu Desnoyers
@ 2008-08-01 21:10                   ` Paul E. McKenney
  0 siblings, 0 replies; 58+ messages in thread
From: Paul E. McKenney @ 2008-08-01 21:10 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, akpm, Ingo Molnar, linux-kernel,
	Masami Hiramatsu, Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie,
	Steven Rostedt, Alexander Viro, Eduard - Gabriel Munteanu

On Tue, Jul 15, 2008 at 11:50:18AM -0400, Mathieu Desnoyers wrote:
> * Peter Zijlstra (peterz@infradead.org) wrote:
> > On Tue, 2008-07-15 at 11:22 -0400, Mathieu Desnoyers wrote:
> > > * Peter Zijlstra (peterz@infradead.org) wrote:
> > > > 
> > > > I'm confused by the barrier games here.
> > > > 
> > > > Why not:
> > > > 
> > > >   void **it_func;
> > > > 
> > > >   preempt_disable();
> > > >   it_func = rcu_dereference((tp)->funcs);
> > > >   if (it_func) {
> > > >     for (; *it_func; it_func++)
> > > >       ((void(*)(proto))(*it_func))(args);
> > > >   }
> > > >   preempt_enable();
> > > > 
> > > > That is, why can we skip the barrier when !it_func? is that because at
> > > > that time we don't actually dereference it_func and therefore cannot
> > > > observe stale data?
> > > > 
> > > 
> > > Exactly. I used the implementation of rcu_assign_pointer as a hint that
> > > we did not need barriers when setting the pointer to NULL, and thus we
> > > should not need the read barrier when reading the NULL pointer, because
> > > it references no data.
> > > 
> > > #define rcu_assign_pointer(p, v) \
> > >         ({ \
> > >                 if (!__builtin_constant_p(v) || \
> > >                     ((v) != NULL)) \
> > >                         smp_wmb(); \
> > >                 (p) = (v); \
> > >         })
> > 
> > Yeah, I saw that,.. made me wonder. It basically assumes that when we
> > write:
> > 
> >   rcu_assign_pointer(foo, NULL);
> > 
> > foo will not be used as an index or offset.
> > 
> > I guess Paul has thought it through and verified all in-kernel use
> > cases, but it still makes me feel uncomfortable.

The idea was to create an rcu_assign_index() for that case, if and when
it arose, something like the following:

	#define rcu_assign_index(p, v, a) \
		({ \
			smp_wmb(); \
			(p) = (v); \
		})

							Thanx, Paul

> > > #define rcu_dereference(p)     ({ \
> > >                                 typeof(p) _________p1 = ACCESS_ONCE(p); \
> > >                                 smp_read_barrier_depends(); \
> > >                                 (_________p1); \
> > >                                 })
> > > 
> > > But I think you are right, since we are already in unlikely code, using
> > > rcu_dereference as you do is better than my use of read barrier depends.
> > > It should not change anything in the assembly result except on alpha,
> > > where the read_barrier_depends() is not a nop.
> > > 
> > > I wonder if there would be a way to add this kind of NULL pointer case
> > > check without overhead in rcu_dereference() on alpha. I guess not, since
> > > the pointer is almost never known at compile-time. And I guess Paul must
> > > already have thought about it. The only case where we could add this
> > > test is when we know that we have an if (ptr != NULL) test following the
> > > rcu_dereference(); we could then assume the compiler will merge the two
> > > branches since they depend on the same condition.
> > 
> > I remember seeing a thread about all this special casing NULL, but have
> > never been able to find it again - my google skillz always fail me.
> > 
> > Basically it doesn't work if you use the variable as an index/offset,
> > because in that case 0 is a valid offset and you still generate a data
> > dependency.
> > 
> > IIRC the conclusion was that the gains were too small to spend more time
> > on it, although I would like to hear about the special case in
> > rcu_assign_pointer.
> > 
> > /me goes use git blame....
> > 
> 
> Seems to come from :
> 
> commit d99c4f6b13b3149bc83703ab1493beaeaaaf8a2d
> 
> which refers to this discussion :
> 
> http://www.mail-archive.com/netdev@vger.kernel.org/msg54852.html
> 
> Mathieu
> 
> 
> > > > If so, does this really matter since we're already in an unlikely
> > > > section? Again, if so, this deserves a comment ;-)
> > > > 
> > > > [ still think those preempt_* calls should be called
> > > >   rcu_read_sched_lock() or such. ]
> > > > 
> > > > Anyway, does this still generate better code?
> > > > 
> > > 
> > > On x86_64 :
> > > 
> > >  820:   bf 01 00 00 00          mov    $0x1,%edi
> > >  825:   e8 00 00 00 00          callq  82a <thread_return+0x136>
> > >  82a:   48 8b 1d 00 00 00 00    mov    0x0(%rip),%rbx        # 831 <thread_return+0x13d>
> > >  831:   48 85 db                test   %rbx,%rbx
> > >  834:   75 21                   jne    857 <thread_return+0x163>
> > >  836:   eb 27                   jmp    85f <thread_return+0x16b>
> > >  838:   0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
> > >  83f:   00 
> > >  840:   48 8b 95 68 ff ff ff    mov    -0x98(%rbp),%rdx
> > >  847:   48 8b b5 60 ff ff ff    mov    -0xa0(%rbp),%rsi
> > >  84e:   4c 89 e7                mov    %r12,%rdi
> > >  851:   48 83 c3 08             add    $0x8,%rbx
> > >  855:   ff d0                   callq  *%rax
> > >  857:   48 8b 03                mov    (%rbx),%rax
> > >  85a:   48 85 c0                test   %rax,%rax
> > >  85d:   75 e1                   jne    840 <thread_return+0x14c>
> > >  85f:   bf 01 00 00 00          mov    $0x1,%edi
> > >  864:
> > > 
> > > for 68 bytes.
> > > 
> > > My original implementation was 77 bytes, so yes, we have a win.
> > 
> > Ah, good good ! :-)
> > 
> 
> -- 
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: [patch 01/15] Kernel Tracepoints
  2008-07-15 16:08                 ` Mathieu Desnoyers
  2008-07-15 16:25                   ` Peter Zijlstra
  2008-07-15 16:26                   ` Mathieu Desnoyers
@ 2008-08-01 21:10                   ` Paul E. McKenney
  2 siblings, 0 replies; 58+ messages in thread
From: Paul E. McKenney @ 2008-08-01 21:10 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, akpm, Ingo Molnar, linux-kernel,
	Masami Hiramatsu, Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie,
	Steven Rostedt, Alexander Viro, Eduard - Gabriel Munteanu

On Tue, Jul 15, 2008 at 12:08:13PM -0400, Mathieu Desnoyers wrote:
> * Peter Zijlstra (peterz@infradead.org) wrote:
> > On Tue, 2008-07-15 at 11:22 -0400, Mathieu Desnoyers wrote:
> > > * Peter Zijlstra (peterz@infradead.org) wrote:
> > > > 
> > > > I'm confused by the barrier games here.
> > > > 
> > > > Why not:
> > > > 
> > > >   void **it_func;
> > > > 
> > > >   preempt_disable();
> > > >   it_func = rcu_dereference((tp)->funcs);
> > > >   if (it_func) {
> > > >     for (; *it_func; it_func++)
> > > >       ((void(*)(proto))(*it_func))(args);
> > > >   }
> > > >   preempt_enable();
> > > > 
> > > > That is, why can we skip the barrier when !it_func? is that because at
> > > > that time we don't actually dereference it_func and therefore cannot
> > > > observe stale data?
> > > > 
> > > 
> > > Exactly. I used the implementation of rcu_assign_pointer as a hint that
> > > we did not need barriers when setting the pointer to NULL, and thus we
> > > should not need the read barrier when reading the NULL pointer, because
> > > it references no data.
> > > 
> > > #define rcu_assign_pointer(p, v) \
> > >         ({ \
> > >                 if (!__builtin_constant_p(v) || \
> > >                     ((v) != NULL)) \
> > >                         smp_wmb(); \
> > >                 (p) = (v); \
> > >         })
> > 
> > Yeah, I saw that,.. made me wonder. It basically assumes that when we
> > write:
> > 
> >   rcu_assign_pointer(foo, NULL);
> > 
> > foo will not be used as an index or offset.
> > 
> > I guess Paul has thought it through and verified all in-kernel use
> > cases, but it still makes me feel uncomfortable.
> > 
> > > #define rcu_dereference(p)     ({ \
> > >                                 typeof(p) _________p1 = ACCESS_ONCE(p); \
> > >                                 smp_read_barrier_depends(); \
> > >                                 (_________p1); \
> > >                                 })
> > > 
> > > But I think you are right, since we are already in unlikely code, using
> > > rcu_dereference as you do is better than my use of read barrier depends.
> > > It should not change anything in the assembly result except on alpha,
> > > where the read_barrier_depends() is not a nop.
> > > 
> > > I wonder if there would be a way to add this kind of NULL pointer case
> > > check without overhead in rcu_dereference() on alpha. I guess not, since
> > > the pointer is almost never known at compile-time. And I guess Paul must
> > > already have thought about it. The only case where we could add this
> > > test is when we know that we have an if (ptr != NULL) test following the
> > > rcu_dereference(); we could then assume the compiler will merge the two
> > > branches since they depend on the same condition.
> > 
> > I remember seeing a thread about all this special casing NULL, but have
> > never been able to find it again - my google skillz always fail me.
> > 
> > Basically it doesn't work if you use the variable as an index/offset,
> > because in that case 0 is a valid offset and you still generate a data
> > dependency.
> > 
> > IIRC the conclusion was that the gains were too small to spend more time
> > on it, although I would like to hear about the special case in
> > rcu_assign_pointer.
> > 
> > /me goes use git blame....
> > 
> 
> Actually, we could probably do the following, which also adds an extra
> coherency check about non-NULL pointer assumptions :
> 
> #ifdef CONFIG_RCU_DEBUG /* this would be new */
> #define DEBUG_RCU_BUG_ON(x) BUG_ON(x)
> #else
> #define DEBUG_RCU_BUG_ON(x)
> #endif
> 
> #define rcu_dereference(p)     ({ \
>                                 typeof(p) _________p1 = ACCESS_ONCE(p); \
>                                 if (p != NULL) \
>                                   smp_read_barrier_depends(); \
>                                 (_________p1); \
>                                 })
> 
> #define rcu_dereference_non_null(p)     ({ \
>                                 typeof(p) _________p1 = ACCESS_ONCE(p); \
>                                 DEBUG_RCU_BUG_ON(p == NULL); \
>                                 smp_read_barrier_depends(); \
>                                 (_________p1); \
>                                 })

The big question is "why"?  smp_read_barrier_depends() is pretty
lightweight, after all.

						Thanx, Paul

> The use-case where rcu_dereference() would be used is when it is
> followed by a null pointer check (grepping through the sources shows me
> this is a very very common case). In rare cases, it is assumed that the
> pointer is never NULL and it is used just after the rcu_dereference. In
> those cases, the extra test could be saved on alpha by using
> rcu_dereference_non_null(p), which would check that the pointer is indeed
> never NULL under some debug kernel configuration.
> 
> Does it make sense ?
> 
> Mathieu
> 
> > > > If so, does this really matter since we're already in an unlikely
> > > > section? Again, if so, this deserves a comment ;-)
> > > > 
> > > > [ still think those preempt_* calls should be called
> > > >   rcu_read_sched_lock() or such. ]
> > > > 
> > > > Anyway, does this still generate better code?
> > > > 
> > > 
> > > On x86_64 :
> > > 
> > >  820:   bf 01 00 00 00          mov    $0x1,%edi
> > >  825:   e8 00 00 00 00          callq  82a <thread_return+0x136>
> > >  82a:   48 8b 1d 00 00 00 00    mov    0x0(%rip),%rbx        # 831 <thread_return+0x13d>
> > >  831:   48 85 db                test   %rbx,%rbx
> > >  834:   75 21                   jne    857 <thread_return+0x163>
> > >  836:   eb 27                   jmp    85f <thread_return+0x16b>
> > >  838:   0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
> > >  83f:   00 
> > >  840:   48 8b 95 68 ff ff ff    mov    -0x98(%rbp),%rdx
> > >  847:   48 8b b5 60 ff ff ff    mov    -0xa0(%rbp),%rsi
> > >  84e:   4c 89 e7                mov    %r12,%rdi
> > >  851:   48 83 c3 08             add    $0x8,%rbx
> > >  855:   ff d0                   callq  *%rax
> > >  857:   48 8b 03                mov    (%rbx),%rax
> > >  85a:   48 85 c0                test   %rax,%rax
> > >  85d:   75 e1                   jne    840 <thread_return+0x14c>
> > >  85f:   bf 01 00 00 00          mov    $0x1,%edi
> > >  864:
> > > 
> > > for 68 bytes.
> > > 
> > > My original implementation was 77 bytes, so yes, we have a win.
> > 
> > Ah, good good ! :-)
> > 
> 
> -- 
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: [patch 01/15] Kernel Tracepoints
  2008-07-15 16:25                   ` Peter Zijlstra
  2008-07-15 16:51                     ` Mathieu Desnoyers
@ 2008-08-01 21:10                     ` Paul E. McKenney
  1 sibling, 0 replies; 58+ messages in thread
From: Paul E. McKenney @ 2008-08-01 21:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mathieu Desnoyers, akpm, Ingo Molnar, linux-kernel,
	Masami Hiramatsu, Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie,
	Steven Rostedt, Alexander Viro, Eduard - Gabriel Munteanu

On Tue, Jul 15, 2008 at 06:25:49PM +0200, Peter Zijlstra wrote:
> On Tue, 2008-07-15 at 12:08 -0400, Mathieu Desnoyers wrote:
> > * Peter Zijlstra (peterz@infradead.org) wrote:
> > > On Tue, 2008-07-15 at 11:22 -0400, Mathieu Desnoyers wrote:
> > > > * Peter Zijlstra (peterz@infradead.org) wrote:
> > > > > 
> > > > > I'm confused by the barrier games here.
> > > > > 
> > > > > Why not:
> > > > > 
> > > > >   void **it_func;
> > > > > 
> > > > >   preempt_disable();
> > > > >   it_func = rcu_dereference((tp)->funcs);
> > > > >   if (it_func) {
> > > > >     for (; *it_func; it_func++)
> > > > >       ((void(*)(proto))(*it_func))(args);
> > > > >   }
> > > > >   preempt_enable();
> > > > > 
> > > > > That is, why can we skip the barrier when !it_func? is that because at
> > > > > that time we don't actually dereference it_func and therefore cannot
> > > > > observe stale data?
> > > > > 
> > > > 
> > > > Exactly. I used the implementation of rcu_assign_pointer as a hint that
> > > > we did not need barriers when setting the pointer to NULL, and thus we
> > > > should not need the read barrier when reading the NULL pointer, because
> > > > it references no data.
> > > > 
> > > > #define rcu_assign_pointer(p, v) \
> > > >         ({ \
> > > >                 if (!__builtin_constant_p(v) || \
> > > >                     ((v) != NULL)) \
> > > >                         smp_wmb(); \
> > > >                 (p) = (v); \
> > > >         })
> > > 
> > > Yeah, I saw that,.. made me wonder. It basically assumes that when we
> > > write:
> > > 
> > >   rcu_assign_pointer(foo, NULL);
> > > 
> > > foo will not be used as an index or offset.
> > > 
> > > I guess Paul has thought it through and verified all in-kernel use
> > > cases, but it still makes me feel uncomfortable.
> > > 
> > > > #define rcu_dereference(p)     ({ \
> > > >                                 typeof(p) _________p1 = ACCESS_ONCE(p); \
> > > >                                 smp_read_barrier_depends(); \
> > > >                                 (_________p1); \
> > > >                                 })
> > > > 
> > > > But I think you are right, since we are already in unlikely code, using
> > > > rcu_dereference as you do is better than my use of read barrier depends.
> > > > It should not change anything in the assembly result except on alpha,
> > > > where the read_barrier_depends() is not a nop.
> > > > 
> > > > I wonder if there would be a way to add this kind of NULL pointer case
> > > > check without overhead in rcu_dereference() on alpha. I guess not, since
> > > > the pointer is almost never known at compile-time. And I guess Paul must
> > > > already have thought about it. The only case where we could add this
> > > > test is when we know that we have an if (ptr != NULL) test following the
> > > > rcu_dereference(); we could then assume the compiler will merge the two
> > > > branches since they depend on the same condition.
> > > 
> > > I remember seeing a thread about all this special casing NULL, but have
> > > never been able to find it again - my google skillz always fail me.
> > > 
> > > Basically it doesn't work if you use the variable as an index/offset,
> > > because in that case 0 is a valid offset and you still generate a data
> > > dependency.
> > > 
> > > IIRC the conclusion was that the gains were too small to spend more time
> > > on it, although I would like to hear about the special case in
> > > rcu_assign_pointer.
> > > 
> > > /me goes use git blame....
> > > 
> > 
> > Actually, we could probably do the following, which also adds an extra
> > coherency check about non-NULL pointer assumptions :
> > 
> > #ifdef CONFIG_RCU_DEBUG /* this would be new */
> > #define DEBUG_RCU_BUG_ON(x) BUG_ON(x)
> > #else
> > #define DEBUG_RCU_BUG_ON(x)
> > #endif
> > 
> > #define rcu_dereference(p)     ({ \
> >                                 typeof(p) _________p1 = ACCESS_ONCE(p); \
> >                                 if (p != NULL) \
> >                                   smp_read_barrier_depends(); \
> >                                 (_________p1); \
> >                                 })
> > 
> > #define rcu_dereference_non_null(p)     ({ \
> >                                 typeof(p) _________p1 = ACCESS_ONCE(p); \
> >                                 DEBUG_RCU_BUG_ON(p == NULL); \
> >                                 smp_read_barrier_depends(); \
> >                                 (_________p1); \
> >                                 })
> > 
> > The use-case where rcu_dereference() would be used is when it is
> > followed by a null pointer check (grepping through the sources shows me
> > this is a very very common case). In rare cases, it is assumed that the
> > pointer is never NULL and it is used just after the rcu_dereference. In
> > those cases, the extra test could be saved on alpha by using
> > rcu_dereference_non_null(p), which would check that the pointer is indeed
> > never NULL under some debug kernel configuration.
> > 
> > Does it make sense ?
> 
> This would break the case where the dereferenced variable is used as an
> index/offset where 0 is a valid value and still generates data
> dependencies.
> 
> So if with your new version we do:
> 
>   i = rcu_dereference(foo);
>   j = table[i];
> 
> which translates into:
> 
>   i = ACCESS_ONCE(foo);
>   if (i)
>     smp_read_barrier_depends();
>   j = table[i];
> 
> which when i == 0, would fail to do the barrier and can thus cause j to
> be a wrong value.
> 
> Sadly I'll have to defer to Paul to explain exactly how that can happen
> - I always get my head in a horrible twist with this case.

Does http://lkml.org/lkml/2008/2/2/255 help?  (Hmmm...  I was intending
to do a more formal write up of this, but clearly haven't gotten to it...)

							Thanx, Paul


* Re: [patch 01/15] Kernel Tracepoints
  2008-07-15 16:51                     ` Mathieu Desnoyers
@ 2008-08-01 21:10                       ` Paul E. McKenney
  2008-08-02  0:03                         ` Peter Zijlstra
  0 siblings, 1 reply; 58+ messages in thread
From: Paul E. McKenney @ 2008-08-01 21:10 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, akpm, Ingo Molnar, linux-kernel,
	Masami Hiramatsu, Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie,
	Steven Rostedt, Alexander Viro, Eduard - Gabriel Munteanu

On Tue, Jul 15, 2008 at 12:51:23PM -0400, Mathieu Desnoyers wrote:
> * Peter Zijlstra (peterz@infradead.org) wrote:
> > On Tue, 2008-07-15 at 12:08 -0400, Mathieu Desnoyers wrote:
> > > * Peter Zijlstra (peterz@infradead.org) wrote:
> > > > On Tue, 2008-07-15 at 11:22 -0400, Mathieu Desnoyers wrote:
> > > > > * Peter Zijlstra (peterz@infradead.org) wrote:
> > > > > > 
> > > > > > I'm confused by the barrier games here.
> > > > > > 
> > > > > > Why not:
> > > > > > 
> > > > > >   void **it_func;
> > > > > > 
> > > > > >   preempt_disable();
> > > > > >   it_func = rcu_dereference((tp)->funcs);
> > > > > >   if (it_func) {
> > > > > >     for (; *it_func; it_func++)
> > > > > >       ((void(*)(proto))(*it_func))(args);
> > > > > >   }
> > > > > >   preempt_enable();
> > > > > > 
> > > > > > That is, why can we skip the barrier when !it_func? is that because at
> > > > > > that time we don't actually dereference it_func and therefore cannot
> > > > > > observe stale data?
> > > > > > 
> > > > > 
> > > > > Exactly. I used the implementation of rcu_assign_pointer as a hint that
> > > > > we did not need barriers when setting the pointer to NULL, and thus we
> > > > > should not need the read barrier when reading the NULL pointer, because
> > > > > it references no data.
> > > > > 
> > > > > #define rcu_assign_pointer(p, v) \
> > > > >         ({ \
> > > > >                 if (!__builtin_constant_p(v) || \
> > > > >                     ((v) != NULL)) \
> > > > >                         smp_wmb(); \
> > > > >                 (p) = (v); \
> > > > >         })
> > > > 
> > > > Yeah, I saw that,.. made me wonder. It basically assumes that when we
> > > > write:
> > > > 
> > > >   rcu_assign_pointer(foo, NULL);
> > > > 
> > > > foo will not be used as an index or offset.
> > > > 
> > > > I guess Paul has thought it through and verified all in-kernel use
> > > > cases, but it still makes me feel uncomfortable.
> > > > 
> > > > > #define rcu_dereference(p)     ({ \
> > > > >                                 typeof(p) _________p1 = ACCESS_ONCE(p); \
> > > > >                                 smp_read_barrier_depends(); \
> > > > >                                 (_________p1); \
> > > > >                                 })
> > > > > 
> > > > > But I think you are right, since we are already in unlikely code, using
> > > > > rcu_dereference as you do is better than my use of read barrier depends.
> > > > > It should not change anything in the assembly result except on alpha,
> > > > > where the read_barrier_depends() is not a nop.
> > > > > 
> > > > > I wonder if there would be a way to add this kind of NULL pointer case
> > > > > check without overhead in rcu_dereference() on alpha. I guess not, since
> > > > > the pointer is almost never known at compile-time. And I guess Paul must
> > > > > already have thought about it. The only case where we could add this
> > > > > test is when we know that we have an if (ptr != NULL) test following the
> > > > > rcu_dereference(); we could then assume the compiler will merge the two
> > > > > branches since they depend on the same condition.
> > > > 
> > > > I remember seeing a thread about all this special casing NULL, but have
> > > > never been able to find it again - my google skillz always fail me.
> > > > 
> > > > Basically it doesn't work if you use the variable as an index/offset,
> > > > because in that case 0 is a valid offset and you still generate a data
> > > > dependency.
> > > > 
> > > > IIRC the conclusion was that the gains were too small to spend more time
> > > > on it, although I would like to hear about the special case in
> > > > rcu_assign_pointer.
> > > > 
> > > > /me goes use git blame....
> > > > 
> > > 
> > > Actually, we could probably do the following, which also adds an extra
> > > coherency check about non-NULL pointer assumptions :
> > > 
> > > #ifdef CONFIG_RCU_DEBUG /* this would be new */
> > > #define DEBUG_RCU_BUG_ON(x) BUG_ON(x)
> > > #else
> > > #define DEBUG_RCU_BUG_ON(x)
> > > #endif
> > > 
> > > #define rcu_dereference(p)     ({ \
> > >                                 typeof(p) _________p1 = ACCESS_ONCE(p); \
> > >                                 if (p != NULL) \
> > >                                   smp_read_barrier_depends(); \
> > >                                 (_________p1); \
> > >                                 })
> > > 
> > > #define rcu_dereference_non_null(p)     ({ \
> > >                                 typeof(p) _________p1 = ACCESS_ONCE(p); \
> > >                                 DEBUG_RCU_BUG_ON(_________p1 == NULL); \
> > >                                 smp_read_barrier_depends(); \
> > >                                 (_________p1); \
> > >                                 })
> > > 
> > > The use-case where rcu_dereference() would be used is when it is
> > > followed by a null pointer check (grepping through the sources shows me
> > > this is a very very common case). In rare cases, it is assumed that the
> > > pointer is never NULL and it is used just after the rcu_dereference. In
> > > those cases, the extra test could be saved on alpha by using
> > > rcu_dereference_non_null(p), which would check that the pointer is indeed
> > > never NULL under some debug kernel configuration.
> > > 
> > > Does it make sense?
> > 
> > This would break the case where the dereferenced variable is used as an
> > index/offset where 0 is a valid value and still generates data
> > dependencies.
> > 
> > So if with your new version we do:
> > 
> >   i = rcu_dereference(foo);
> >   j = table[i];
> > 
> > which translates into:
> > 
> >   i = ACCESS_ONCE(foo);
> >   if (i)
> >     smp_read_barrier_depends();
> >   j = table[i];
> > 
> > which, when i == 0, would fail to do the barrier and can thus cause j to
> > be a wrong value.
> > 
> > Sadly I'll have to defer to Paul to explain exactly how that can happen
> > - I always get my head in a horrible twist with this case.
> > 
> 
> I completely agree with you. However, given the current
> rcu_assign_pointer() implementation, we already have this problem. My
> proposal assumes the current rcu_assign_pointer() behavior is correct
> and that those are never ever used for index/offsets.
> 
> We could enforce this as a compile-time check with something along the
> lines of :
> 
> #define BUILD_BUG_ON_NOT_OFFSETABLE(x) (void)(x)[0]
> 
> And use it both in rcu_assign_pointer() and rcu_dereference().  It would
> check for any type passed to rcu_assign_pointer and rcu_dereference
> which is not either a pointer or an array.
> 
> Then if someone really wants to shoot himself in the foot by casting a
> pointer to a long after the rcu_deref, that's his problem.
> 
> Hrm, looking at rcu_assign_pointer tells me that the ((v) != NULL) test
> should probably already complain if v is not a pointer. So my build test
> is probably unneeded.

Yeah, I was thinking in terms of rcu_dereference() working with both
rcu_assign_pointer() and an as-yet-mythical rcu_assign_index().  Perhaps
this would be a good time to get better names:

Current:	rcu_assign_pointer()	rcu_dereference()
New Pointers:	rcu_publish_pointer()	rcu_subscribe_pointer()
New Indexes:	rcu_publish_index()	rcu_subscribe_index()

And, while I am at it, work in a way of checking for either being in
the appropriate RCU read-side critical section and/or having the
needed lock/mutex/whatever held -- something I believe PeterZ was
prototyping some months back.

Though I still am having a hard time with the conditional in
rcu_dereference() vs. the smp_read_barrier_depends()...

						Thanx, Paul

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch 01/15] Kernel Tracepoints
  2008-08-01 21:10                       ` Paul E. McKenney
@ 2008-08-02  0:03                         ` Peter Zijlstra
  2008-08-02  0:17                           ` Paul E. McKenney
  0 siblings, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2008-08-02  0:03 UTC (permalink / raw)
  To: paulmck
  Cc: Mathieu Desnoyers, akpm, Ingo Molnar, linux-kernel,
	Masami Hiramatsu, Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie,
	Steven Rostedt, Alexander Viro, Eduard - Gabriel Munteanu

On Fri, 2008-08-01 at 14:10 -0700, Paul E. McKenney wrote:

> Yeah, I was thinking in terms of rcu_dereference() working with both
> rcu_assign_pointer() and an as-yet-mythical rcu_assign_index().  Perhaps
> this would be a good time to get better names:
> 
> Current:	rcu_assign_pointer()	rcu_dereference()
> New Pointers:	rcu_publish_pointer()	rcu_subscribe_pointer()
> New Indexes:	rcu_publish_index()	rcu_subscribe_index()

Is it really worth the effort, splitting it out into these two cases?

> And, while I am at it, work in a way of checking for either being in
> the appropriate RCU read-side critical section and/or having the
> needed lock/mutex/whatever held -- something I believe PeterZ was
> prototyping some months back.

Yeah - I have (bitrotted a bit, but should be salvageable) lockdep
annotations for rcu_dereference().

The problem with them is the huge number of false positives.  Take for
example the radix tree code: it's perfectly fine to use the radix tree
code without RCU - say you do the old rwlock style - yet it still uses
rcu_dereference().

I never figured out a suitable way to annotate that.



* Re: [patch 01/15] Kernel Tracepoints
  2008-08-02  0:03                         ` Peter Zijlstra
@ 2008-08-02  0:17                           ` Paul E. McKenney
  0 siblings, 0 replies; 58+ messages in thread
From: Paul E. McKenney @ 2008-08-02  0:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mathieu Desnoyers, akpm, Ingo Molnar, linux-kernel,
	Masami Hiramatsu, Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie,
	Steven Rostedt, Alexander Viro, Eduard - Gabriel Munteanu

On Sat, Aug 02, 2008 at 02:03:57AM +0200, Peter Zijlstra wrote:
> On Fri, 2008-08-01 at 14:10 -0700, Paul E. McKenney wrote:
> 
> > Yeah, I was thinking in terms of rcu_dereference() working with both
> > rcu_assign_pointer() and an as-yet-mythical rcu_assign_index().  Perhaps
> > this would be a good time to get better names:
> > 
> > Current:	rcu_assign_pointer()	rcu_dereference()
> > New Pointers:	rcu_publish_pointer()	rcu_subscribe_pointer()
> > New Indexes:	rcu_publish_index()	rcu_subscribe_index()
> 
> Is it really worth the effort, splitting it out into these two cases?

Either we should split out into the pointer/index cases, or the
definition of rcu_assign_pointer() should probably lose its current
check for NULL...

> > And, while I am at it, work in a way of checking for either being in
> > the appropriate RCU read-side critical section and/or having the
> > needed lock/mutex/whatever held -- something I believe PeterZ was
> > prototyping some months back.
> 
> Yeah - I have (bitrotted a bit, but should be salvageable) lockdep
> annotations for rcu_dereference().
> 
> The problem with them is the huge number of false positives.  Take for
> example the radix tree code: it's perfectly fine to use the radix tree
> code without RCU - say you do the old rwlock style - yet it still uses
> rcu_dereference().
> 
> I never figured out a suitable way to annotate that.

My thought was to add a second argument that contained a boolean.  If
the rcu_dereference() was either within an RCU read-side critical
section on the one hand or if the boolean evaluated to "true" on the
other, then no assertion.  This would require SPIN_LOCK_HELD() or
similar primitives.  (And one of the reasons for the renaming would be
the opportunity to work such checks into the new API.)

Of course, in the case of radix tree, it would be necessary to pass the
boolean in through the radix-tree read-side APIs, which would perhaps be
a bit annoying.

							Thanx, Paul


end of thread, other threads:[~2008-08-02  0:17 UTC | newest]

Thread overview: 58+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-07-09 14:59 [patch 00/15] Tracepoints v3 for linux-next Mathieu Desnoyers
2008-07-09 14:59 ` [patch 01/15] Kernel Tracepoints Mathieu Desnoyers
2008-07-15  7:50   ` Peter Zijlstra
2008-07-15 13:25     ` Mathieu Desnoyers
2008-07-15 13:59       ` Peter Zijlstra
2008-07-15 14:27         ` Mathieu Desnoyers
2008-07-15 14:42           ` Peter Zijlstra
2008-07-15 15:22             ` Mathieu Desnoyers
2008-07-15 15:31               ` Peter Zijlstra
2008-07-15 15:50                 ` Mathieu Desnoyers
2008-08-01 21:10                   ` Paul E. McKenney
2008-07-15 16:08                 ` Mathieu Desnoyers
2008-07-15 16:25                   ` Peter Zijlstra
2008-07-15 16:51                     ` Mathieu Desnoyers
2008-08-01 21:10                       ` Paul E. McKenney
2008-08-02  0:03                         ` Peter Zijlstra
2008-08-02  0:17                           ` Paul E. McKenney
2008-08-01 21:10                     ` Paul E. McKenney
2008-07-15 16:26                   ` Mathieu Desnoyers
2008-08-01 21:10                   ` Paul E. McKenney
2008-07-15 17:50                 ` Mathieu Desnoyers
2008-07-15 14:03       ` Peter Zijlstra
2008-07-15 14:46         ` Mathieu Desnoyers
2008-07-15 15:13           ` Peter Zijlstra
2008-07-15 18:22             ` Mathieu Desnoyers
2008-07-15 18:33               ` Steven Rostedt
2008-07-15 18:52             ` Masami Hiramatsu
2008-07-15 19:08               ` Mathieu Desnoyers
2008-07-15 19:02         ` Mathieu Desnoyers
2008-07-15 19:52           ` Peter Zijlstra
2008-07-09 14:59 ` [patch 02/15] Tracepoints Documentation Mathieu Desnoyers
2008-07-09 14:59 ` [patch 03/15] Tracepoints Samples Mathieu Desnoyers
2008-07-09 14:59 ` [patch 04/15] LTTng instrumentation - irq Mathieu Desnoyers
2008-07-09 16:39   ` Masami Hiramatsu
2008-07-09 17:05     ` [patch 04/15] LTTng instrumentation - irq (update) Mathieu Desnoyers
2008-07-09 14:59 ` [patch 05/15] LTTng instrumentation - scheduler Mathieu Desnoyers
2008-07-09 15:34   ` [patch 05/15] LTTng instrumentation - scheduler (repost) Mathieu Desnoyers
2008-07-09 15:39     ` Ingo Molnar
2008-07-09 16:00       ` Mathieu Desnoyers
2008-07-09 16:21     ` [patch 05/15] LTTng instrumentation - scheduler (merge ftrace markers) Mathieu Desnoyers
2008-07-09 19:09       ` [PATCH] ftrace port to tracepoints (linux-next) Mathieu Desnoyers
2008-07-10  3:14         ` Takashi Nishiie
2008-07-10  3:57           ` [PATCH] ftrace port to tracepoints (linux-next) (nitpick update) Mathieu Desnoyers
     [not found]         ` <20080711143709.GB11500@Krystal>
     [not found]           ` <Pine.LNX.4.58.0807141112540.30484@gandalf.stny.rr.com>
     [not found]             ` <20080714153334.GA651@Krystal>
     [not found]               ` <Pine.LNX.4.58.0807141153250.29493@gandalf.stny.rr.com>
2008-07-14 16:25                 ` [PATCH] ftrace memory barriers Mathieu Desnoyers
2008-07-14 16:35                   ` Steven Rostedt
2008-07-09 14:59 ` [patch 06/15] LTTng instrumentation - timer Mathieu Desnoyers
2008-07-09 14:59 ` [patch 07/15] LTTng instrumentation - kernel Mathieu Desnoyers
2008-07-09 14:59 ` [patch 08/15] LTTng instrumentation - filemap Mathieu Desnoyers
2008-07-09 14:59 ` [patch 09/15] LTTng instrumentation - swap Mathieu Desnoyers
2008-07-09 14:59 ` [patch 10/15] LTTng instrumentation - memory page faults Mathieu Desnoyers
2008-07-09 14:59 ` [patch 11/15] LTTng instrumentation - page Mathieu Desnoyers
2008-07-09 14:59 ` [patch 12/15] LTTng instrumentation - hugetlb Mathieu Desnoyers
2008-07-11 14:30   ` [patch 12/15] LTTng instrumentation - hugetlb (update) Mathieu Desnoyers
2008-07-09 14:59 ` [patch 13/15] LTTng instrumentation - net Mathieu Desnoyers
2008-07-09 14:59 ` [patch 14/15] LTTng instrumentation - ipv4 Mathieu Desnoyers
2008-07-09 14:59 ` Mathieu Desnoyers
2008-07-09 17:01 ` [patch 00/15] Tracepoints v3 for linux-next Masami Hiramatsu
2008-07-09 17:11   ` [patch 15/15] LTTng instrumentation - ipv6 Mathieu Desnoyers
