All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC][PATCH 0/2] Syscalls tracing
@ 2009-03-07  4:52 Frederic Weisbecker
  2009-03-07  4:52 ` [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure Frederic Weisbecker
                   ` (3 more replies)
  0 siblings, 4 replies; 40+ messages in thread
From: Frederic Weisbecker @ 2009-03-07  4:52 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, Lai Jiangshan, Steven Rostedt, Peter Zijlstra,
	Mathieu Desnoyers, Jiaying Zhang, Martin Bligh,
	Frederic Weisbecker

Here is a first attempt, quick one-shot, to provide a syscall tracing
infrastructure on ftrace.

The RFC prefix is here to reflect its ugliness on various parts.
The compromise between tracing reliabilty and speed is hard to balance.
For example I guess the basic and horrid string mask should be dropped in favour
of something else, which takes care of the volatile strings from the userspace.

But I hope a lot of ideas to make it better will come along this discussion.

Frederic Weisbecker (2):
  tracing/ftrace: syscall tracing infrastructure
  tracing/x86: basic implementation of syscall tracing for x86-64

 arch/x86/Kconfig                   |    1 +
 arch/x86/include/asm/ftrace.h      |    7 +
 arch/x86/include/asm/thread_info.h |    9 +-
 arch/x86/kernel/ftrace.c           |   15 ++
 arch/x86/kernel/ptrace.c           |    7 +
 include/linux/ftrace.h             |   44 ++++++
 kernel/trace/Kconfig               |   10 ++
 kernel/trace/Makefile              |    1 +
 kernel/trace/trace.c               |    6 +
 kernel/trace/trace.h               |   32 ++++
 kernel/trace/trace_syscalls.c      |  284 ++++++++++++++++++++++++++++++++++++
 11 files changed, 413 insertions(+), 3 deletions(-)
 create mode 100644 kernel/trace/trace_syscalls.c


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure
  2009-03-07  4:52 [RFC][PATCH 0/2] Syscalls tracing Frederic Weisbecker
@ 2009-03-07  4:52 ` Frederic Weisbecker
  2009-03-07  7:49   ` Frederic Weisbecker
                     ` (4 more replies)
  2009-03-07  4:53 ` [RFC][PATCH 2/2] tracing/x86: basic implementation of syscall tracing for x86-64 Frederic Weisbecker
                   ` (2 subsequent siblings)
  3 siblings, 5 replies; 40+ messages in thread
From: Frederic Weisbecker @ 2009-03-07  4:52 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, Lai Jiangshan, Steven Rostedt, Peter Zijlstra,
	Mathieu Desnoyers, Jiaying Zhang, Martin Bligh,
	Frederic Weisbecker

The most important drawback that implies the syscall tracing is the variable
number of parameters that takes a syscall. Assuming there is a large number of
syscall, the goal is to provide as much as possible a generic layer on it.

What we want on tracing time (the hot path):

- Save the parameters and the number of the syscall. That's all. The rest of the
  job can be done on the "output path".

On output time:

- Get the raw-binary saved parameters and the syscall number as well as some
  usual infos such as the pid, the time...
- Depending on the tracing requirements, the user should have the choice to
  dump the raw datas or a formatted version.

The core infrastructure has a static array which contains the necessary
informations for each syscall (ok, I've only implemented 4 for now).

When a syscall triggers, the tracer look at this array (O(1) access) and get the
number of parameters for this syscall. Then these parameters are saved in a
binary form on the ring buffer with the syscall number (a generic version, arch
independent).

On output time the parameters, the name of the syscall and very basic
informations to format the parameters are picket from the ring-buffer and the
generic syscall array.

Depending of the level of implementation we want for the syscall inside this
array, we can provide just a single mask where the n bit matches the n
parameter. If the bit is set, then the parameter will be considered as a string
pointer. Otherwise its raw hexadecimal value will be printed.

If needed, we can otherwise provide a specific callback to print the syscall
event.

The probem with these approaches comes from the fact that the string might have
disappeared from the user memory on output time. Even if we protect against
faults, it can give unreliable traces.
So perhaps we could think about copying the string on tracing time.

Concerning the return value, it doesn't make any problem, a single raw
value is printed for each one.

Note that the current implementation can be easily scaled further to give only
binary informations to the user and the matching metadata to decode it.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
---
 include/linux/ftrace.h        |   44 +++++++
 kernel/trace/Kconfig          |   10 ++
 kernel/trace/Makefile         |    1 +
 kernel/trace/trace.c          |    6 +
 kernel/trace/trace.h          |   32 +++++
 kernel/trace/trace_syscalls.c |  284 +++++++++++++++++++++++++++++++++++++++++
 6 files changed, 377 insertions(+), 0 deletions(-)
 create mode 100644 kernel/trace/trace_syscalls.c

diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index e1583f2..1798692 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -503,4 +503,48 @@ static inline void trace_hw_branch_oops(void) {}
 
 #endif /* CONFIG_HW_BRANCH_TRACER */
 
+/* Index for the ftrace syscalls array */
+enum syscall_trace_nr {
+	SYSCALL_TRACE_OPEN = 1,
+	SYSCALL_TRACE_CLOSE,
+	SYSCALL_TRACE_READ,
+	SYSCALL_TRACE_WRITE,
+
+	__SYSCALL_TRACE_END
+};
+
+/*
+ * A syscall entry in the ftrace syscalls array.
+ *
+ * @name: name of the syscall
+ * @nb_args: number of parameters it takes
+ * @simple_format: if true, we use string_mask, otherwise the print callback
+ * @string_mask: rudimentary format mask. If bit nr is set, the nr parameter
+ *		 will be displayed as a string, otherwise it will be considered
+ *		 as a raw number (hex displayed).
+ * @print: implement it if you want a custom output for a given syscall
+ */
+struct syscall_trace_entry {
+	const char	*name;
+	int		nb_args;
+	bool		simple_format;
+	union {
+		unsigned long string_mask;
+		int (*print)(unsigned long *);
+	} output;
+};
+
+#ifdef CONFIG_FTRACE_SYSCALLS
+extern short arch_syscall_trace_nr[];
+extern void start_ftrace_syscalls(void);
+extern void stop_ftrace_syscalls(void);
+extern void ftrace_syscall_enter(struct pt_regs *regs);
+extern void ftrace_syscall_exit(struct pt_regs *regs);
+#else
+static inline void start_ftrace_syscalls(void) { }
+static inline void stop_ftrace_syscalls(void) { }
+static inline void ftrace_syscall_enter(struct pt_regs *regs) { }
+static inline void ftrace_syscall_exit(struct pt_regs *regs) { }
+#endif
+
 #endif /* _LINUX_FTRACE_H */
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 8e4a2a6..95a0ad1 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -34,6 +34,9 @@ config HAVE_FTRACE_MCOUNT_RECORD
 config HAVE_HW_BRANCH_TRACER
 	bool
 
+config HAVE_FTRACE_SYSCALLS
+	bool
+
 config TRACER_MAX_TRACE
 	bool
 
@@ -175,6 +178,13 @@ config EVENT_TRACER
 	  allowing the user to pick and choose which trace point they
 	  want to trace.
 
+config FTRACE_SYSCALLS
+	bool "Trace syscalls"
+	depends on HAVE_FTRACE_SYSCALLS
+	select TRACING
+	help
+	  Basic tracer to catch the syscall entry and exit events.
+
 config BOOT_TRACER
 	bool "Trace boot initcalls"
 	select TRACING
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index c7a2943..c3feea0 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -43,5 +43,6 @@ obj-$(CONFIG_BLK_DEV_IO_TRACE)	+= blktrace.o
 obj-$(CONFIG_EVENT_TRACER) += trace_events.o
 obj-$(CONFIG_EVENT_TRACER) += events.o
 obj-$(CONFIG_EVENT_TRACER) += trace_export.o
+obj-$(CONFIG_FTRACE_SYSCALLS) += trace_syscalls.o
 
 libftrace-y := ftrace.o
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index cc94f86..5152be3 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -242,6 +242,12 @@ static DECLARE_WAIT_QUEUE_HEAD(trace_wait);
 unsigned long trace_flags = TRACE_ITER_PRINT_PARENT | TRACE_ITER_PRINTK |
 	TRACE_ITER_ANNOTATE | TRACE_ITER_CONTEXT_INFO;
 
+
+struct trace_array *__get_global_trace(void)
+{
+	return &global_trace;
+}
+
 /**
  * trace_wake_up - wake up tasks waiting for trace input
  *
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 2bfb7d1..9583830 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -30,6 +30,8 @@ enum trace_type {
 	TRACE_GRAPH_ENT,
 	TRACE_USER_STACK,
 	TRACE_HW_BRANCHES,
+	TRACE_SYSCALL_ENTER,
+	TRACE_SYSCALL_EXIT,
 	TRACE_KMEM_ALLOC,
 	TRACE_KMEM_FREE,
 	TRACE_POWER,
@@ -192,6 +194,19 @@ struct kmemtrace_free_entry {
 	const void *ptr;
 };
 
+struct syscall_trace_enter {
+	struct trace_entry	ent;
+	enum syscall_trace_nr	nr;
+	unsigned long		args[];
+};
+
+struct syscall_trace_exit {
+	struct trace_entry	ent;
+	enum syscall_trace_nr	nr;
+	unsigned long		ret;
+};
+
+
 /*
  * trace_flag_type is an enumeration that holds different
  * states when a trace occurs. These are:
@@ -304,6 +319,10 @@ extern void __ftrace_bad_type(void);
 			  TRACE_KMEM_ALLOC);	\
 		IF_ASSIGN(var, ent, struct kmemtrace_free_entry,	\
 			  TRACE_KMEM_FREE);	\
+		IF_ASSIGN(var, ent, struct syscall_trace_enter,		\
+			  TRACE_SYSCALL_ENTER);				\
+		IF_ASSIGN(var, ent, struct syscall_trace_exit,		\
+			  TRACE_SYSCALL_EXIT);				\
 		__ftrace_bad_type();					\
 	} while (0)
 
@@ -568,6 +587,7 @@ extern int trace_selftest_startup_branch(struct tracer *trace,
 #endif /* CONFIG_FTRACE_STARTUP_TEST */
 
 extern void *head_page(struct trace_array_cpu *data);
+extern struct trace_array *__get_global_trace(void);
 extern long ns2usecs(cycle_t nsec);
 extern int
 trace_vprintk(unsigned long ip, int depth, const char *fmt, va_list args);
@@ -627,6 +647,18 @@ static inline int ftrace_trace_task(struct task_struct *task)
 	return test_tsk_trace_trace(task);
 }
 
+#ifdef CONFIG_FTRACE_SYSCALLS
+extern enum print_line_t
+print_syscall_enter(struct trace_iterator *iter, int flags);
+extern enum print_line_t
+print_syscall_exit(struct trace_iterator *iter, int flags);
+#else
+static inline enum print_line_t
+print_syscall_enter(struct trace_iterator *iter, int flags) { }
+static inline enum print_line_t
+print_syscall_exit(struct trace_iterator *iter, int flags) { }
+#endif
+
 /*
  * trace_iterator_flags is an enumeration that defines bit
  * positions into trace_flags that controls the output.
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
new file mode 100644
index 0000000..470bb9d
--- /dev/null
+++ b/kernel/trace/trace_syscalls.c
@@ -0,0 +1,284 @@
+#include <linux/kernel.h>
+#include <linux/ftrace.h>
+#include <asm/syscall.h>
+
+#include "trace_output.h"
+#include "trace.h"
+
+/* Create basic entry on syscall array (with the rudimentary string mask) */
+#define SYS_TRACE_ENTRY(sname, args, mask)			  \
+	{.name = #sname, .nb_args = args, .simple_format = true,  \
+	.output.string_mask = mask}
+
+/*
+ * Create a custom defined entry on the syscall array, you will have to
+ * implement a callback to output the syscall.
+ */
+#define SYS_TRACE_ENTRY_SPECIAL(sname, args, printer)		  \
+	{.name = #sname, .nb_args = args, .simple_format = false, \
+	.print = printer}
+
+static const struct syscall_trace_entry syscall_trace_entries[] = {
+	/* For open, the first argument is a string, hence the given mask */
+	[SYSCALL_TRACE_OPEN]	= SYS_TRACE_ENTRY(open, 3, 0x1),
+	[SYSCALL_TRACE_CLOSE]	= SYS_TRACE_ENTRY(close, 1, 0),
+	[SYSCALL_TRACE_READ]	= SYS_TRACE_ENTRY(read, 3, 0),
+	[SYSCALL_TRACE_WRITE]	= SYS_TRACE_ENTRY(read, 3, 0),
+};
+
+
+static atomic_t refcount;
+
+enum print_line_t
+print_syscall_enter(struct trace_iterator *iter, int flags)
+{
+	struct trace_seq *s = &iter->seq;
+	struct trace_entry *ent = iter->ent;
+	struct syscall_trace_enter *trace;
+	enum syscall_trace_nr syscall;
+	const struct syscall_trace_entry *entry;
+	char *str = NULL;
+	int i, ret;
+
+	trace_assign_type(trace, ent);
+
+	syscall = trace->nr;
+
+	if (!syscall || syscall >= __SYSCALL_TRACE_END)
+		return TRACE_TYPE_HANDLED;
+
+	entry = &syscall_trace_entries[syscall];
+	if (!entry->simple_format) {
+		ret = entry->output.print(trace->args);
+		if (!ret)
+			return TRACE_TYPE_PARTIAL_LINE;
+	}
+
+	ret = trace_seq_printf(s, "syscall %s :", entry->name);
+	if (!ret)
+		return TRACE_TYPE_PARTIAL_LINE;
+
+	for (i = 0; i < entry->nb_args; i++) {
+		/*
+		 * FIXME: the string comes from the user, but on tracing
+		 * output time, it may have disappeared from the memory.
+		 * May be better to strdup it on tracing time.
+		 */
+		if (entry->output.string_mask & (1 << i)) {
+			/* 128 is enough for most path */
+			str = strndup_user((char *)trace->args[i], 128);
+
+			/* If it faulted badly, the error shoud have been
+			 * handled while servicing the syscall, just ignore.
+			 */
+			if (str == ERR_PTR(-EFAULT) || str == ERR_PTR(-EINVAL)
+			    || str == ERR_PTR(-ENOMEM))
+				return TRACE_TYPE_HANDLED;
+
+			ret = trace_seq_printf(s, " %s", str);
+		} else {
+			ret = trace_seq_printf(s, " 0x%lx", trace->args[i]);
+		}
+		if (!ret)
+			goto end_partial;
+	}
+	if (!trace_seq_printf(s, "\n"))
+		goto end_partial;
+
+	kfree(str);
+	return TRACE_TYPE_HANDLED;
+
+end_partial:
+	kfree(str);
+	return TRACE_TYPE_PARTIAL_LINE;
+}
+
+enum print_line_t
+print_syscall_exit(struct trace_iterator *iter, int flags)
+{
+	struct trace_seq *s = &iter->seq;
+	struct trace_entry *ent = iter->ent;
+	struct syscall_trace_exit *trace;
+	enum syscall_trace_nr syscall;
+	const struct syscall_trace_entry *entry;
+	int ret;
+
+	trace_assign_type(trace, ent);
+
+	syscall = trace->nr;
+
+	if (!syscall || syscall >= __SYSCALL_TRACE_END)
+		return TRACE_TYPE_HANDLED;
+
+	entry = &syscall_trace_entries[syscall];
+
+	ret = trace_seq_printf(s, "syscall %s -> 0x%lx\n", entry->name,
+				trace->ret);
+	if (!ret)
+		return TRACE_TYPE_PARTIAL_LINE;
+
+	return TRACE_TYPE_HANDLED;
+}
+
+void start_ftrace_syscalls(void)
+{
+	unsigned long flags;
+	struct task_struct *g, *t;
+
+	if (atomic_inc_return(&refcount) != 1)
+		goto out;
+
+	read_lock_irqsave(&tasklist_lock, flags);
+
+	do_each_thread(g, t) {
+		set_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
+	} while_each_thread(g, t);
+
+	read_unlock_irqrestore(&tasklist_lock, flags);
+out:
+	atomic_dec(&refcount);
+}
+
+void stop_ftrace_syscalls(void)
+{
+	unsigned long flags;
+	struct task_struct *g, *t;
+
+	if (atomic_dec_return(&refcount))
+		goto out;
+
+	read_lock_irqsave(&tasklist_lock, flags);
+
+	do_each_thread(g, t) {
+		clear_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
+	} while_each_thread(g, t);
+
+	read_unlock_irqrestore(&tasklist_lock, flags);
+out:
+	atomic_inc(&refcount);
+}
+
+void ftrace_syscall_enter(struct pt_regs *regs)
+{
+	struct trace_array *tr = __get_global_trace();
+	struct syscall_trace_enter *entry;
+	const struct syscall_trace_entry *sys_data;
+	struct ring_buffer_event *event;
+	struct trace_array_cpu *data;
+	int size;
+	int syscall_nr;
+	int nr_offset;
+	int cpu;
+
+	syscall_nr = syscall_get_nr(current, regs);
+	nr_offset = arch_syscall_trace_nr[syscall_nr];
+	if (!nr_offset || nr_offset >= __SYSCALL_TRACE_END)
+		return;
+
+	cpu = raw_smp_processor_id();
+	data = tr->data[cpu];
+
+	if (unlikely(atomic_read(&data->disabled)))
+		return;
+
+	sys_data = &syscall_trace_entries[nr_offset];
+	size = sizeof(*entry) + sizeof(unsigned long) * sys_data->nb_args;
+
+	event = trace_buffer_lock_reserve(tr, TRACE_SYSCALL_ENTER, size, 0, 0);
+	if (!event)
+		return;
+
+	entry = ring_buffer_event_data(event);
+	entry->nr = nr_offset;
+	syscall_get_arguments(current, regs, 0, sys_data->nb_args, entry->args);
+
+	ring_buffer_unlock_commit(tr->buffer, event);
+	trace_wake_up();
+}
+
+void ftrace_syscall_exit(struct pt_regs *regs)
+{
+	struct trace_array *tr = __get_global_trace();
+	struct syscall_trace_exit *entry;
+	const struct syscall_trace_entry *sys_data;
+	struct ring_buffer_event *event;
+	struct trace_array_cpu *data;
+	int syscall_nr;
+	int nr_offset;
+	int cpu;
+
+	syscall_nr = syscall_get_nr(current, regs);
+	nr_offset = arch_syscall_trace_nr[syscall_nr];
+	if (!nr_offset || nr_offset >= __SYSCALL_TRACE_END)
+		return;
+
+	cpu = raw_smp_processor_id();
+	data = tr->data[cpu];
+
+	if (unlikely(atomic_read(&data->disabled)))
+		return;
+
+	sys_data = &syscall_trace_entries[nr_offset];
+
+	event = trace_buffer_lock_reserve(tr, TRACE_SYSCALL_EXIT,
+				sizeof(*entry), 0, 0);
+	if (!event)
+		return;
+
+	entry = ring_buffer_event_data(event);
+	entry->nr = nr_offset;
+	entry->ret = syscall_get_return_value(current, regs);
+
+	ring_buffer_unlock_commit(tr->buffer, event);
+	trace_wake_up();
+}
+
+static int init_syscall_tracer(struct trace_array *tr)
+{
+	start_ftrace_syscalls();
+
+	return 0;
+}
+
+static void reset_syscall_tracer(struct trace_array *tr)
+{
+	stop_ftrace_syscalls();
+}
+
+static struct trace_event syscall_enter_event = {
+	.type	 	= TRACE_SYSCALL_ENTER,
+	.trace		= print_syscall_enter,
+};
+
+static struct trace_event syscall_exit_event = {
+	.type	 	= TRACE_SYSCALL_EXIT,
+	.trace		= print_syscall_exit,
+};
+
+static struct tracer syscall_tracer __read_mostly = {
+	.name	     	= "syscall",
+	.init		= init_syscall_tracer,
+	.reset		= reset_syscall_tracer
+};
+
+__init int register_ftrace_syscalls(void)
+{
+	int ret;
+
+	ret = register_ftrace_event(&syscall_enter_event);
+	if (!ret) {
+		printk(KERN_WARNING "event %d failed to register\n",
+		       syscall_enter_event.type);
+		WARN_ON_ONCE(1);
+	}
+
+	ret = register_ftrace_event(&syscall_exit_event);
+	if (!ret) {
+		printk(KERN_WARNING "event %d failed to register\n",
+		       syscall_exit_event.type);
+		WARN_ON_ONCE(1);
+	}
+
+	return register_tracer(&syscall_tracer);
+}
+device_initcall(register_ftrace_syscalls);
-- 
1.6.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [RFC][PATCH 2/2] tracing/x86: basic implementation of syscall tracing for x86-64
  2009-03-07  4:52 [RFC][PATCH 0/2] Syscalls tracing Frederic Weisbecker
  2009-03-07  4:52 ` [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure Frederic Weisbecker
@ 2009-03-07  4:53 ` Frederic Weisbecker
  2009-03-13  4:27   ` [tip:tracing/syscalls] tracing/x86: basic implementation of syscall tracing for x86 Frederic Weisbecker
                     ` (2 more replies)
  2009-03-07 12:15 ` [RFC][PATCH 0/2] Syscalls tracing Peter Zijlstra
  2009-03-07 20:26 ` Frank Ch. Eigler
  3 siblings, 3 replies; 40+ messages in thread
From: Frederic Weisbecker @ 2009-03-07  4:53 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, Lai Jiangshan, Steven Rostedt, Peter Zijlstra,
	Mathieu Desnoyers, Jiaying Zhang, Martin Bligh,
	Frederic Weisbecker

Provide the ptrace hooks and arch specific syscall numbers to ftrace arch indepedant
syscall numbers.
For now it only supports 4 syscalls to provide an example.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
---
 arch/x86/Kconfig                   |    1 +
 arch/x86/include/asm/ftrace.h      |    7 +++++++
 arch/x86/include/asm/thread_info.h |    9 ++++++---
 arch/x86/kernel/ftrace.c           |   15 +++++++++++++++
 arch/x86/kernel/ptrace.c           |    7 +++++++
 5 files changed, 36 insertions(+), 3 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2a0ddfe..b5aec1b 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -35,6 +35,7 @@ config X86
 	select HAVE_FUNCTION_GRAPH_TRACER
 	select HAVE_FUNCTION_TRACE_MCOUNT_TEST
 	select HAVE_FTRACE_NMI_ENTER if DYNAMIC_FTRACE
+	select HAVE_FTRACE_SYSCALLS if X86_64
 	select HAVE_KVM
 	select HAVE_ARCH_KGDB
 	select HAVE_ARCH_TRACEHOOK
diff --git a/arch/x86/include/asm/ftrace.h b/arch/x86/include/asm/ftrace.h
index db24c22..b2e6855 100644
--- a/arch/x86/include/asm/ftrace.h
+++ b/arch/x86/include/asm/ftrace.h
@@ -28,6 +28,13 @@
 
 #endif
 
+/* FIXME: I don't want to stay hardcoded */
+#ifdef CONFIG_X86_64
+#define FTRACE_SYSCALL_MAX	296
+#else
+#define FTRACE_SYSCALL_MAX	333
+#endif
+
 #ifdef CONFIG_FUNCTION_TRACER
 #define MCOUNT_ADDR		((long)(mcount))
 #define MCOUNT_INSN_SIZE	5 /* sizeof mcount call */
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index f6ba6f8..83d2b73 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -95,6 +95,7 @@ struct thread_info {
 #define TIF_FORCED_TF		24	/* true if TF in eflags artificially */
 #define TIF_DEBUGCTLMSR		25	/* uses thread_struct.debugctlmsr */
 #define TIF_DS_AREA_MSR		26      /* uses thread_struct.ds_area_msr */
+#define TIF_SYSCALL_FTRACE	27	/* for ftrace syscall instrumentation */
 
 #define _TIF_SYSCALL_TRACE	(1 << TIF_SYSCALL_TRACE)
 #define _TIF_NOTIFY_RESUME	(1 << TIF_NOTIFY_RESUME)
@@ -117,15 +118,17 @@ struct thread_info {
 #define _TIF_FORCED_TF		(1 << TIF_FORCED_TF)
 #define _TIF_DEBUGCTLMSR	(1 << TIF_DEBUGCTLMSR)
 #define _TIF_DS_AREA_MSR	(1 << TIF_DS_AREA_MSR)
+#define _TIF_SYSCALL_FTRACE	(1 << TIF_SYSCALL_FTRACE)
 
 /* work to do in syscall_trace_enter() */
 #define _TIF_WORK_SYSCALL_ENTRY	\
-	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU | \
+	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU | _TIF_SYSCALL_FTRACE |	\
 	 _TIF_SYSCALL_AUDIT | _TIF_SECCOMP | _TIF_SINGLESTEP)
 
 /* work to do in syscall_trace_leave() */
 #define _TIF_WORK_SYSCALL_EXIT	\
-	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | _TIF_SINGLESTEP)
+	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | _TIF_SINGLESTEP |	\
+	 _TIF_SYSCALL_FTRACE)
 
 /* work to do on interrupt/exception return */
 #define _TIF_WORK_MASK							\
@@ -134,7 +137,7 @@ struct thread_info {
 	   _TIF_SINGLESTEP|_TIF_SECCOMP|_TIF_SYSCALL_EMU))
 
 /* work to do on any return to user space */
-#define _TIF_ALLWORK_MASK (0x0000FFFF & ~_TIF_SECCOMP)
+#define _TIF_ALLWORK_MASK ((0x0000FFFF & ~_TIF_SECCOMP) | _TIF_SYSCALL_FTRACE)
 
 /* Only used for 64 bit */
 #define _TIF_DO_NOTIFY_MASK						\
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index a85da17..5bba98b 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -453,3 +453,18 @@ void prepare_ftrace_return(unsigned long *parent, unsigned long self_addr)
 	}
 }
 #endif /* CONFIG_FUNCTION_GRAPH_TRACER */
+
+#ifdef CONFIG_FTRACE_SYSCALLS
+#ifdef CONFIG_X86_64
+/*
+ * Array that make the arch specific syscall numbers
+ * matching the ftrace syscall numbers.
+ */
+short arch_syscall_trace_nr[FTRACE_SYSCALL_MAX] = {
+	[__NR_open]	= SYSCALL_TRACE_OPEN,
+	[__NR_close]	= SYSCALL_TRACE_CLOSE,
+	[__NR_read]	= SYSCALL_TRACE_READ,
+	[__NR_write]	= SYSCALL_TRACE_WRITE
+};
+#endif
+#endif
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 3d9672e..99749d6 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -21,6 +21,7 @@
 #include <linux/audit.h>
 #include <linux/seccomp.h>
 #include <linux/signal.h>
+#include <linux/ftrace.h>
 
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
@@ -1416,6 +1417,9 @@ asmregparm long syscall_trace_enter(struct pt_regs *regs)
 	    tracehook_report_syscall_entry(regs))
 		ret = -1L;
 
+	if (unlikely(test_thread_flag(TIF_SYSCALL_FTRACE)))
+		ftrace_syscall_enter(regs);
+
 	if (unlikely(current->audit_context)) {
 		if (IS_IA32)
 			audit_syscall_entry(AUDIT_ARCH_I386,
@@ -1439,6 +1443,9 @@ asmregparm void syscall_trace_leave(struct pt_regs *regs)
 	if (unlikely(current->audit_context))
 		audit_syscall_exit(AUDITSC_RESULT(regs->ax), regs->ax);
 
+	if (unlikely(test_thread_flag(TIF_SYSCALL_FTRACE)))
+		ftrace_syscall_exit(regs);
+
 	if (test_thread_flag(TIF_SYSCALL_TRACE))
 		tracehook_report_syscall_exit(regs, 0);
 
-- 
1.6.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure
  2009-03-07  4:52 ` [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure Frederic Weisbecker
@ 2009-03-07  7:49   ` Frederic Weisbecker
  2009-03-09 13:52     ` Steven Rostedt
  2009-03-08 16:24   ` Ingo Molnar
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 40+ messages in thread
From: Frederic Weisbecker @ 2009-03-07  7:49 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, Lai Jiangshan, Steven Rostedt, Peter Zijlstra,
	Mathieu Desnoyers, Jiaying Zhang, Martin Bligh

On Sat, Mar 07, 2009 at 05:52:59AM +0100, Frederic Weisbecker wrote:
> The most important drawback that implies the syscall tracing is the variable
> number of parameters that takes a syscall. Assuming there is a large number of
> syscall, the goal is to provide as much as possible a generic layer on it.
> 
> What we want on tracing time (the hot path):
> 
> - Save the parameters and the number of the syscall. That's all. The rest of the
>   job can be done on the "output path".
> 
> On output time:
> 
> - Get the raw-binary saved parameters and the syscall number as well as some
>   usual infos such as the pid, the time...
> - Depending on the tracing requirements, the user should have the choice to
>   dump the raw datas or a formatted version.
> 
> The core infrastructure has a static array which contains the necessary
> informations for each syscall (ok, I've only implemented 4 for now).
> 
> When a syscall triggers, the tracer look at this array (O(1) access) and get the
> number of parameters for this syscall. Then these parameters are saved in a
> binary form on the ring buffer with the syscall number (a generic version, arch
> independent).
> 
> On output time the parameters, the name of the syscall and very basic
> informations to format the parameters are picket from the ring-buffer and the
> generic syscall array.
> 
> Depending of the level of implementation we want for the syscall inside this
> array, we can provide just a single mask where the n bit matches the n
> parameter. If the bit is set, then the parameter will be considered as a string
> pointer. Otherwise its raw hexadecimal value will be printed.
> 
> If needed, we can otherwise provide a specific callback to print the syscall
> event.
> 
> The probem with these approaches comes from the fact that the string might have
> disappeared from the user memory on output time. Even if we protect against
> faults, it can give unreliable traces.
> So perhaps we could think about copying the string on tracing time.
> 
> Concerning the return value, it doesn't make any problem, a single raw
> value is printed for each one.
> 
> Note that the current implementation can be easily scaled further to give only
> binary informations to the user and the matching metadata to decode it.
> 
> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> ---


I made some silly mistakes on this one.
Here is a v2:

---

>From f1fd4e7885f1f96f27fd58b20a1aa5cb054cb15f Mon Sep 17 00:00:00 2001
From: Frederic Weisbecker <fweisbec@gmail.com>
Date: Sat, 7 Mar 2009 08:46:46 +0100
Subject: [PATCH] tracing/ftrace: syscall tracing infrastructure

The most important drawback that implies the syscall tracing is the variable
number of parameters that takes a syscall. Assuming there is a large number of
syscall, the goal is to provide as much as possible a generic layer on it.

What we want on tracing time (the hot path):

- Save the parameters and the number of the syscall. That's all. The rest of the
  job can be done on the "output path".

On output time:

- Get the raw-binary saved parameters and the syscall number as well as some
  usual infos such as the pid, the time...
- Depending on the tracing requirements, the user should have the choice to
  dump the raw datas or a formatted version.

The core infrastructure has a static array which contains the necessary
informations for each syscall (ok, I've only implemented 4 for now).

When a syscall triggers, the tracer look at this array (O(1) access) and get the
number of parameters for this syscall. Then these parameters are saved in a
binary form on the ring buffer with the syscall number (a generic version, arch
independent).

On output time the parameters, the name of the syscall and very basic
informations to format the parameters are picket from the ring-buffer and the
generic syscall array.

Depending of the level of implementation we want for the syscall inside this
array, we can provide just a single mask where the n bit matches the n
parameter. If the bit is set, then the parameter will be considered as a string
pointer. Otherwise its raw hexadecimal value will be printed.

If needed, we can otherwise provide a specific callback to print the syscall
event.

The probem with these approaches comes from the fact that the string might have
disappeared from the user memory on output time. Even if we protect against
faults, it can give unreliable traces.
So perhaps we could think about copying the string on tracing time.

Concerning the return value, it doesn't make any problem, a single raw
value is printed for each one.

Note that the current implementation can be easily scaled further to give only
binary informations to the user and the matching metadata to decode it.

Changes V2:

- Fixes mistakes on refcounting
- Fixes SYS_TRACE_ENTRY_SPECIAL macro which forgot the union
- Always return after the print callback is called

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
---
 include/linux/ftrace.h        |   44 +++++++
 kernel/trace/Kconfig          |   10 ++
 kernel/trace/Makefile         |    1 +
 kernel/trace/trace.c          |    6 +
 kernel/trace/trace.h          |   32 +++++
 kernel/trace/trace_syscalls.c |  285 +++++++++++++++++++++++++++++++++++++++++
 6 files changed, 378 insertions(+), 0 deletions(-)
 create mode 100644 kernel/trace/trace_syscalls.c

diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index e1583f2..1798692 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -503,4 +503,48 @@ static inline void trace_hw_branch_oops(void) {}
 
 #endif /* CONFIG_HW_BRANCH_TRACER */
 
+/* Index for the ftrace syscalls array */
+enum syscall_trace_nr {
+	SYSCALL_TRACE_OPEN = 1,
+	SYSCALL_TRACE_CLOSE,
+	SYSCALL_TRACE_READ,
+	SYSCALL_TRACE_WRITE,
+
+	__SYSCALL_TRACE_END
+};
+
+/*
+ * A syscall entry in the ftrace syscalls array.
+ *
+ * @name: name of the syscall
+ * @nb_args: number of parameters it takes
+ * @simple_format: if true, we use string_mask, otherwise the print callback
+ * @string_mask: rudimentary format mask. If bit nr is set, the nr parameter
+ *		 will be displayed as a string, otherwise it will be considered
+ *		 as a raw number (hex displayed).
+ * @print: implement it if you want a custom output for a given syscall
+ */
+struct syscall_trace_entry {
+	const char	*name;
+	int		nb_args;
+	bool		simple_format;
+	union {
+		unsigned long string_mask;
+		int (*print)(unsigned long *);
+	} output;
+};
+
+#ifdef CONFIG_FTRACE_SYSCALLS
+extern short arch_syscall_trace_nr[];
+extern void start_ftrace_syscalls(void);
+extern void stop_ftrace_syscalls(void);
+extern void ftrace_syscall_enter(struct pt_regs *regs);
+extern void ftrace_syscall_exit(struct pt_regs *regs);
+#else
+static inline void start_ftrace_syscalls(void) { }
+static inline void stop_ftrace_syscalls(void) { }
+static inline void ftrace_syscall_enter(struct pt_regs *regs) { }
+static inline void ftrace_syscall_exit(struct pt_regs *regs) { }
+#endif
+
 #endif /* _LINUX_FTRACE_H */
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 8e4a2a6..95a0ad1 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -34,6 +34,9 @@ config HAVE_FTRACE_MCOUNT_RECORD
 config HAVE_HW_BRANCH_TRACER
 	bool
 
+config HAVE_FTRACE_SYSCALLS
+	bool
+
 config TRACER_MAX_TRACE
 	bool
 
@@ -175,6 +178,13 @@ config EVENT_TRACER
 	  allowing the user to pick and choose which trace point they
 	  want to trace.
 
+config FTRACE_SYSCALLS
+	bool "Trace syscalls"
+	depends on HAVE_FTRACE_SYSCALLS
+	select TRACING
+	help
+	  Basic tracer to catch the syscall entry and exit events.
+
 config BOOT_TRACER
 	bool "Trace boot initcalls"
 	select TRACING
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index c7a2943..c3feea0 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -43,5 +43,6 @@ obj-$(CONFIG_BLK_DEV_IO_TRACE)	+= blktrace.o
 obj-$(CONFIG_EVENT_TRACER) += trace_events.o
 obj-$(CONFIG_EVENT_TRACER) += events.o
 obj-$(CONFIG_EVENT_TRACER) += trace_export.o
+obj-$(CONFIG_FTRACE_SYSCALLS) += trace_syscalls.o
 
 libftrace-y := ftrace.o
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index cc94f86..5152be3 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -242,6 +242,12 @@ static DECLARE_WAIT_QUEUE_HEAD(trace_wait);
 unsigned long trace_flags = TRACE_ITER_PRINT_PARENT | TRACE_ITER_PRINTK |
 	TRACE_ITER_ANNOTATE | TRACE_ITER_CONTEXT_INFO;
 
+
+struct trace_array *__get_global_trace(void)
+{
+	return &global_trace;
+}
+
 /**
  * trace_wake_up - wake up tasks waiting for trace input
  *
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 2bfb7d1..9583830 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -30,6 +30,8 @@ enum trace_type {
 	TRACE_GRAPH_ENT,
 	TRACE_USER_STACK,
 	TRACE_HW_BRANCHES,
+	TRACE_SYSCALL_ENTER,
+	TRACE_SYSCALL_EXIT,
 	TRACE_KMEM_ALLOC,
 	TRACE_KMEM_FREE,
 	TRACE_POWER,
@@ -192,6 +194,19 @@ struct kmemtrace_free_entry {
 	const void *ptr;
 };
 
+struct syscall_trace_enter {
+	struct trace_entry	ent;
+	enum syscall_trace_nr	nr;
+	unsigned long		args[];
+};
+
+struct syscall_trace_exit {
+	struct trace_entry	ent;
+	enum syscall_trace_nr	nr;
+	unsigned long		ret;
+};
+
+
 /*
  * trace_flag_type is an enumeration that holds different
  * states when a trace occurs. These are:
@@ -304,6 +319,10 @@ extern void __ftrace_bad_type(void);
 			  TRACE_KMEM_ALLOC);	\
 		IF_ASSIGN(var, ent, struct kmemtrace_free_entry,	\
 			  TRACE_KMEM_FREE);	\
+		IF_ASSIGN(var, ent, struct syscall_trace_enter,		\
+			  TRACE_SYSCALL_ENTER);				\
+		IF_ASSIGN(var, ent, struct syscall_trace_exit,		\
+			  TRACE_SYSCALL_EXIT);				\
 		__ftrace_bad_type();					\
 	} while (0)
 
@@ -568,6 +587,7 @@ extern int trace_selftest_startup_branch(struct tracer *trace,
 #endif /* CONFIG_FTRACE_STARTUP_TEST */
 
 extern void *head_page(struct trace_array_cpu *data);
+extern struct trace_array *__get_global_trace(void);
 extern long ns2usecs(cycle_t nsec);
 extern int
 trace_vprintk(unsigned long ip, int depth, const char *fmt, va_list args);
@@ -627,6 +647,18 @@ static inline int ftrace_trace_task(struct task_struct *task)
 	return test_tsk_trace_trace(task);
 }
 
+#ifdef CONFIG_FTRACE_SYSCALLS
+extern enum print_line_t
+print_syscall_enter(struct trace_iterator *iter, int flags);
+extern enum print_line_t
+print_syscall_exit(struct trace_iterator *iter, int flags);
+#else
+static inline enum print_line_t
+print_syscall_enter(struct trace_iterator *iter, int flags) { }
+static inline enum print_line_t
+print_syscall_exit(struct trace_iterator *iter, int flags) { }
+#endif
+
 /*
  * trace_iterator_flags is an enumeration that defines bit
  * positions into trace_flags that controls the output.
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
new file mode 100644
index 0000000..3fc5e54
--- /dev/null
+++ b/kernel/trace/trace_syscalls.c
@@ -0,0 +1,285 @@
+#include <linux/kernel.h>
+#include <linux/ftrace.h>
+#include <asm/syscall.h>
+
+#include "trace_output.h"
+#include "trace.h"
+
+/* Create basic entry on syscall array (with the rudimentary string mask) */
+#define SYS_TRACE_ENTRY(sname, args, mask)			  \
+	{.name = #sname, .nb_args = args, .simple_format = true,  \
+	.output.string_mask = mask}
+
+/*
+ * Create a custom defined entry on the syscall array, you will have to
+ * implement a callback to output the syscall.
+ */
+#define SYS_TRACE_ENTRY_SPECIAL(sname, args, printer)		  \
+	{.name = #sname, .nb_args = args, .simple_format = false, \
+	.output.print = printer}
+
+static const struct syscall_trace_entry syscall_trace_entries[] = {
+	/* For open, the first argument is a string, hence the given mask */
+	[SYSCALL_TRACE_OPEN]	= SYS_TRACE_ENTRY(open, 3, 0x1),
+	[SYSCALL_TRACE_CLOSE]	= SYS_TRACE_ENTRY(close, 1, 0),
+	[SYSCALL_TRACE_READ]	= SYS_TRACE_ENTRY(read, 3, 0),
+	[SYSCALL_TRACE_WRITE]	= SYS_TRACE_ENTRY(read, 3, 0),
+};
+
+
+static atomic_t refcount;
+
+enum print_line_t
+print_syscall_enter(struct trace_iterator *iter, int flags)
+{
+	struct trace_seq *s = &iter->seq;
+	struct trace_entry *ent = iter->ent;
+	struct syscall_trace_enter *trace;
+	enum syscall_trace_nr syscall;
+	const struct syscall_trace_entry *entry;
+	char *str = NULL;
+	int i, ret;
+
+	trace_assign_type(trace, ent);
+
+	syscall = trace->nr;
+
+	if (!syscall || syscall >= __SYSCALL_TRACE_END)
+		return TRACE_TYPE_HANDLED;
+
+	entry = &syscall_trace_entries[syscall];
+	if (!entry->simple_format) {
+		ret = entry->output.print(trace->args);
+		if (!ret)
+			return TRACE_TYPE_PARTIAL_LINE;
+		return TRACE_TYPE_HANDLED;
+	}
+
+	ret = trace_seq_printf(s, "syscall %s :", entry->name);
+	if (!ret)
+		return TRACE_TYPE_PARTIAL_LINE;
+
+	for (i = 0; i < entry->nb_args; i++) {
+		/*
+		 * FIXME: the string comes from the user, but on tracing
+		 * output time, it may have disappeared from the memory.
+		 * May be better to strdup it on tracing time.
+		 */
+		if (entry->output.string_mask & (1 << i)) {
+			/* 128 is enough for most path */
+			str = strndup_user((char *)trace->args[i], 128);
+
+			/* If it faulted badly, the error shoud have been
+			 * handled while servicing the syscall, just ignore.
+			 */
+			if (str == ERR_PTR(-EFAULT) || str == ERR_PTR(-EINVAL)
+			    || str == ERR_PTR(-ENOMEM))
+				return TRACE_TYPE_HANDLED;
+
+			ret = trace_seq_printf(s, " %s", str);
+		} else {
+			ret = trace_seq_printf(s, " 0x%lx", trace->args[i]);
+		}
+		if (!ret)
+			goto end_partial;
+	}
+	if (!trace_seq_printf(s, "\n"))
+		goto end_partial;
+
+	kfree(str);
+	return TRACE_TYPE_HANDLED;
+
+end_partial:
+	kfree(str);
+	return TRACE_TYPE_PARTIAL_LINE;
+}
+
+enum print_line_t
+print_syscall_exit(struct trace_iterator *iter, int flags)
+{
+	struct trace_seq *s = &iter->seq;
+	struct trace_entry *ent = iter->ent;
+	struct syscall_trace_exit *trace;
+	enum syscall_trace_nr syscall;
+	const struct syscall_trace_entry *entry;
+	int ret;
+
+	trace_assign_type(trace, ent);
+
+	syscall = trace->nr;
+
+	if (!syscall || syscall >= __SYSCALL_TRACE_END)
+		return TRACE_TYPE_HANDLED;
+
+	entry = &syscall_trace_entries[syscall];
+
+	ret = trace_seq_printf(s, "syscall %s -> 0x%lx\n", entry->name,
+				trace->ret);
+	if (!ret)
+		return TRACE_TYPE_PARTIAL_LINE;
+
+	return TRACE_TYPE_HANDLED;
+}
+
+void start_ftrace_syscalls(void)
+{
+	unsigned long flags;
+	struct task_struct *g, *t;
+
+	if (atomic_inc_return(&refcount) != 1)
+		goto out;
+
+	read_lock_irqsave(&tasklist_lock, flags);
+
+	do_each_thread(g, t) {
+		set_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
+	} while_each_thread(g, t);
+
+	read_unlock_irqrestore(&tasklist_lock, flags);
+
+	return;
+out:
+	atomic_dec(&refcount);
+}
+
+void stop_ftrace_syscalls(void)
+{
+	unsigned long flags;
+	struct task_struct *g, *t;
+
+	if (atomic_dec_return(&refcount))
+		return;
+
+	read_lock_irqsave(&tasklist_lock, flags);
+
+	do_each_thread(g, t) {
+		clear_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
+	} while_each_thread(g, t);
+
+	read_unlock_irqrestore(&tasklist_lock, flags);
+}
+
+void ftrace_syscall_enter(struct pt_regs *regs)
+{
+	struct trace_array *tr = __get_global_trace();
+	struct syscall_trace_enter *entry;
+	const struct syscall_trace_entry *sys_data;
+	struct ring_buffer_event *event;
+	struct trace_array_cpu *data;
+	int size;
+	int syscall_nr;
+	int nr_offset;
+	int cpu;
+
+	syscall_nr = syscall_get_nr(current, regs);
+	nr_offset = arch_syscall_trace_nr[syscall_nr];
+	if (!nr_offset || nr_offset >= __SYSCALL_TRACE_END)
+		return;
+
+	cpu = raw_smp_processor_id();
+	data = tr->data[cpu];
+
+	if (unlikely(atomic_read(&data->disabled)))
+		return;
+
+	sys_data = &syscall_trace_entries[nr_offset];
+	size = sizeof(*entry) + sizeof(unsigned long) * sys_data->nb_args;
+
+	event = trace_buffer_lock_reserve(tr, TRACE_SYSCALL_ENTER, size, 0, 0);
+	if (!event)
+		return;
+
+	entry = ring_buffer_event_data(event);
+	entry->nr = nr_offset;
+	syscall_get_arguments(current, regs, 0, sys_data->nb_args, entry->args);
+
+	ring_buffer_unlock_commit(tr->buffer, event);
+	trace_wake_up();
+}
+
+void ftrace_syscall_exit(struct pt_regs *regs)
+{
+	struct trace_array *tr = __get_global_trace();
+	struct syscall_trace_exit *entry;
+	const struct syscall_trace_entry *sys_data;
+	struct ring_buffer_event *event;
+	struct trace_array_cpu *data;
+	int syscall_nr;
+	int nr_offset;
+	int cpu;
+
+	syscall_nr = syscall_get_nr(current, regs);
+	nr_offset = arch_syscall_trace_nr[syscall_nr];
+	if (!nr_offset || nr_offset >= __SYSCALL_TRACE_END)
+		return;
+
+	cpu = raw_smp_processor_id();
+	data = tr->data[cpu];
+
+	if (unlikely(atomic_read(&data->disabled)))
+		return;
+
+	sys_data = &syscall_trace_entries[nr_offset];
+
+	event = trace_buffer_lock_reserve(tr, TRACE_SYSCALL_EXIT,
+				sizeof(*entry), 0, 0);
+	if (!event)
+		return;
+
+	entry = ring_buffer_event_data(event);
+	entry->nr = nr_offset;
+	entry->ret = syscall_get_return_value(current, regs);
+
+	ring_buffer_unlock_commit(tr->buffer, event);
+	trace_wake_up();
+}
+
+static int init_syscall_tracer(struct trace_array *tr)
+{
+	start_ftrace_syscalls();
+
+	return 0;
+}
+
+static void reset_syscall_tracer(struct trace_array *tr)
+{
+	stop_ftrace_syscalls();
+}
+
+static struct trace_event syscall_enter_event = {
+	.type	 	= TRACE_SYSCALL_ENTER,
+	.trace		= print_syscall_enter,
+};
+
+static struct trace_event syscall_exit_event = {
+	.type	 	= TRACE_SYSCALL_EXIT,
+	.trace		= print_syscall_exit,
+};
+
+static struct tracer syscall_tracer __read_mostly = {
+	.name	     	= "syscall",
+	.init		= init_syscall_tracer,
+	.reset		= reset_syscall_tracer
+};
+
+__init int register_ftrace_syscalls(void)
+{
+	int ret;
+
+	ret = register_ftrace_event(&syscall_enter_event);
+	if (!ret) {
+		printk(KERN_WARNING "event %d failed to register\n",
+		       syscall_enter_event.type);
+		WARN_ON_ONCE(1);
+	}
+
+	ret = register_ftrace_event(&syscall_exit_event);
+	if (!ret) {
+		printk(KERN_WARNING "event %d failed to register\n",
+		       syscall_exit_event.type);
+		WARN_ON_ONCE(1);
+	}
+
+	return register_tracer(&syscall_tracer);
+}
+device_initcall(register_ftrace_syscalls);
-- 
1.6.1



^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 0/2] Syscalls tracing
  2009-03-07  4:52 [RFC][PATCH 0/2] Syscalls tracing Frederic Weisbecker
  2009-03-07  4:52 ` [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure Frederic Weisbecker
  2009-03-07  4:53 ` [RFC][PATCH 2/2] tracing/x86: basic implementation of syscall tracing for x86-64 Frederic Weisbecker
@ 2009-03-07 12:15 ` Peter Zijlstra
  2009-03-07 16:02   ` Frederic Weisbecker
  2009-03-07 20:26 ` Frank Ch. Eigler
  3 siblings, 1 reply; 40+ messages in thread
From: Peter Zijlstra @ 2009-03-07 12:15 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Ingo Molnar, LKML, Lai Jiangshan, Steven Rostedt,
	Mathieu Desnoyers, Jiaying Zhang, Martin Bligh, Arjan van de Ven

On Sat, 2009-03-07 at 05:52 +0100, Frederic Weisbecker wrote:
> Here is a first attempt, quick one-shot, to provide a syscall tracing
> infrastructure on ftrace.
> 
> The RFC prefix is here to reflect its ugliness on various parts.
> The compromise between tracing reliabilty and speed is hard to balance.
> For example I guess the basic and horrid string mask should be dropped in favour
> of something else, which takes care of the volatile strings from the userspace.
> 
> But I hope a lot of ideas to make it better will come along this discussion.

Can't you abuse the SYSCALL_DEFINE macros? This current approach looks
like it will replicate the syscall table.




^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 0/2] Syscalls tracing
  2009-03-07 12:15 ` [RFC][PATCH 0/2] Syscalls tracing Peter Zijlstra
@ 2009-03-07 16:02   ` Frederic Weisbecker
  2009-03-08 11:24     ` Frederic Weisbecker
  0 siblings, 1 reply; 40+ messages in thread
From: Frederic Weisbecker @ 2009-03-07 16:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, LKML, Lai Jiangshan, Steven Rostedt,
	Mathieu Desnoyers, Jiaying Zhang, Martin Bligh, Arjan van de Ven

On Sat, Mar 07, 2009 at 01:15:18PM +0100, Peter Zijlstra wrote:
> On Sat, 2009-03-07 at 05:52 +0100, Frederic Weisbecker wrote:
> > Here is a first attempt, quick one-shot, to provide a syscall tracing
> > infrastructure on ftrace.
> > 
> > The RFC prefix is here to reflect its ugliness on various parts.
> > The compromise between tracing reliabilty and speed is hard to balance.
> > For example I guess the basic and horrid string mask should be dropped in favour
> > of something else, which takes care of the volatile strings from the userspace.
> > 
> > But I hope a lot of ideas to make it better will come along this discussion.
> 
> Can't you abuse the SYSCALL_DEFINE macros? This current approach looks
> like it will replicate the syscall table.
> 


Ah, I did not even think about it.
I will be able to get the number of parameters. Sounds good. But I will
still need a way to store their format somewhere.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 0/2] Syscalls tracing
  2009-03-07  4:52 [RFC][PATCH 0/2] Syscalls tracing Frederic Weisbecker
                   ` (2 preceding siblings ...)
  2009-03-07 12:15 ` [RFC][PATCH 0/2] Syscalls tracing Peter Zijlstra
@ 2009-03-07 20:26 ` Frank Ch. Eigler
  2009-03-07 21:53   ` Frederic Weisbecker
  3 siblings, 1 reply; 40+ messages in thread
From: Frank Ch. Eigler @ 2009-03-07 20:26 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Ingo Molnar, LKML, Lai Jiangshan, Steven Rostedt, Peter Zijlstra,
	Mathieu Desnoyers, Jiaying Zhang, Martin Bligh

Frederic Weisbecker <fweisbec@gmail.com> writes:

> Here is a first attempt, quick one-shot, to provide a syscall tracing
> infrastructure on ftrace.

Please see also the utrace-based thread syscall/signal/lifecycle.
tracer I posted a few times, and is just about to be reposted as a
part of the larger utrace submission.

System call metadata (name, argument count, and getting fancier from
there) would be nice to have for other clients too, such as the audit
subsystem.

The main drawback of this general approach however is the notion that
ftrace is the solitary user of system call tracing, thus dedicating
that new task flag to this purpose.  Therefore, your code has nothing
like reference counting or sharing; nothing yet to avoid overhead on
threads that need no tracing, nor to allow more than one tracing
widget to consume events.  These are the sorts of services that utrace
provides.

- FChE

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 0/2] Syscalls tracing
  2009-03-07 20:26 ` Frank Ch. Eigler
@ 2009-03-07 21:53   ` Frederic Weisbecker
  0 siblings, 0 replies; 40+ messages in thread
From: Frederic Weisbecker @ 2009-03-07 21:53 UTC (permalink / raw)
  To: Frank Ch. Eigler
  Cc: Ingo Molnar, LKML, Lai Jiangshan, Steven Rostedt, Peter Zijlstra,
	Mathieu Desnoyers, Jiaying Zhang, Martin Bligh

On Sat, Mar 07, 2009 at 03:26:49PM -0500, Frank Ch. Eigler wrote:
> Frederic Weisbecker <fweisbec@gmail.com> writes:
> 
> > Here is a first attempt, quick one-shot, to provide a syscall tracing
> > infrastructure on ftrace.
> 
> Please see also the utrace-based thread syscall/signal/lifecycle.
> tracer I posted a few times, and is just about to be reposted as a
> part of the larger utrace submission.


Yes I've looked at it and as I will explain below, I don't have
any bad feelings against utrace.

 
> System call metadata (name, argument count, and getting fancier from
> there) would be nice to have for other clients too, such as the audit
> subsystem.
> 
> The main drawback of this general approach however is the notion that
> ftrace is the solitary user of system call tracing, thus dedicating
> that new task flag to this purpose.  Therefore, your code has nothing
> like reference counting or sharing; nothing yet to avoid overhead on
> threads that need no tracing, nor to allow more than one tracing
> widget to consume events.  These are the sorts of services that utrace
> provides.
> 
> - FChE


Yes. But it was a first attempt scribbled in one day. I already knew
that a fair amount of bits coming along these patches started on the wrong
way. But that doesn't matter, I just wanted to test the feelings of anyone
interested about the main axis inside:

- having a O(1) access table which stores the number of parameters for each
  syscalls, so that the traced path can be very fast and efficient:
  we know what to save quickly and we only save the relevant bits from
  the path, that said, 1 syscall number (which behaves here as a unique
  identifier key) and its parameters values.

- provide a good way to extract the results. It needs a better and larger work
  on the syscalls metadata. Ie: the string mask is a bit ridiculous, as well as
  the individual callback to display a syscall (because it implies too much work
  and code that can be a lot factorized) but I expect that a discussion about a
  good to produce efficient metadata could born after it.

For example we could have an array of char where the types are encoding using
an enum which describes the types of the parameters.
Depending of these parameters, we can have one callback for each type to display
them. We can also export these metadatas to userspace so that the users are able
to only extract the binary traces and format and give some sense by formatting it
later using the metadas.


Anyway, since this is a first scribbling, there is neither reference count nor
per task tracing granularity :-)

But, that's said, I already expressed my feelings about utrace by the past.
It's a good library, seems easy to use and powerful and I would like to see
it merged.

I don't know what other people think about it. But IMHO, I think it would be good
if you could, for now, make an attempt to sort the only relevant bits to provide
syscalls tracing and submit it.

It would exclude a large part that concerns the signals/fork/exec tracing
which will probably be needed later.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 0/2] Syscalls tracing
  2009-03-07 16:02   ` Frederic Weisbecker
@ 2009-03-08 11:24     ` Frederic Weisbecker
  2009-03-08 11:28       ` Frederic Weisbecker
  0 siblings, 1 reply; 40+ messages in thread
From: Frederic Weisbecker @ 2009-03-08 11:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, LKML, Lai Jiangshan, Steven Rostedt,
	Mathieu Desnoyers, Jiaying Zhang, Martin Bligh, Arjan van de Ven

On Sat, Mar 07, 2009 at 05:02:47PM +0100, Frederic Weisbecker wrote:
> On Sat, Mar 07, 2009 at 01:15:18PM +0100, Peter Zijlstra wrote:
> > On Sat, 2009-03-07 at 05:52 +0100, Frederic Weisbecker wrote:
> > > Here is a first attempt, quick one-shot, to provide a syscall tracing
> > > infrastructure on ftrace.
> > > 
> > > The RFC prefix is here to reflect its ugliness on various parts.
> > > The compromise between tracing reliabilty and speed is hard to balance.
> > > For example I guess the basic and horrid string mask should be dropped in favour
> > > of something else, which takes care of the volatile strings from the userspace.
> > > 
> > > But I hope a lot of ideas to make it better will come along this discussion.
> > 
> > Can't you abuse the SYSCALL_DEFINE macros? This current approach looks
> > like it will replicate the syscall table.
> > 
> 
> 
> Ah, I did not even think about it.
> I will be able to get the number of parameters. Sounds good. But I will
> still need a way to store their format somewhere.
>

Ok, we can iterate through sections datas for each one and then generate the format string
depending of the types of the parameters. We can even to it once at boot time.
The last thing is the need to match the exact syscall entry from this section when we enter
a syscall. Don't know yet how I will do that but I will think about it.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 0/2] Syscalls tracing
  2009-03-08 11:24     ` Frederic Weisbecker
@ 2009-03-08 11:28       ` Frederic Weisbecker
  0 siblings, 0 replies; 40+ messages in thread
From: Frederic Weisbecker @ 2009-03-08 11:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, LKML, Lai Jiangshan, Steven Rostedt,
	Mathieu Desnoyers, Jiaying Zhang, Martin Bligh, Arjan van de Ven

On Sun, Mar 08, 2009 at 12:24:41PM +0100, Frederic Weisbecker wrote:
> On Sat, Mar 07, 2009 at 05:02:47PM +0100, Frederic Weisbecker wrote:
> > On Sat, Mar 07, 2009 at 01:15:18PM +0100, Peter Zijlstra wrote:
> > > On Sat, 2009-03-07 at 05:52 +0100, Frederic Weisbecker wrote:
> > > > Here is a first attempt, quick one-shot, to provide a syscall tracing
> > > > infrastructure on ftrace.
> > > > 
> > > > The RFC prefix is here to reflect its ugliness on various parts.
> > > > The compromise between tracing reliabilty and speed is hard to balance.
> > > > For example I guess the basic and horrid string mask should be dropped in favour
> > > > of something else, which takes care of the volatile strings from the userspace.
> > > > 
> > > > But I hope a lot of ideas to make it better will come along this discussion.
> > > 
> > > Can't you abuse the SYSCALL_DEFINE macros? This current approach looks
> > > like it will replicate the syscall table.
> > > 
> > 
> > 
> > Ah, I did not even think about it.
> > I will be able to get the number of parameters. Sounds good. But I will
> > still need a way to store their format somewhere.
> >
> 
> Ok, we can iterate through sections datas for each one and then generate the format string
> depending of the types of the parameters. We can even to it once at boot time.
> The last thing is the need to match the exact syscall entry from this section when we enter
> a syscall. Don't know yet how I will do that but I will think about it.
> 

Ah and the string mask is actually useful to determine if we one parameter is a string.
If so we can decide to save it on tracing time or not.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure
  2009-03-07  4:52 ` [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure Frederic Weisbecker
  2009-03-07  7:49   ` Frederic Weisbecker
@ 2009-03-08 16:24   ` Ingo Molnar
  2009-03-09 20:40     ` Frederic Weisbecker
  2009-03-09 13:44   ` Steven Rostedt
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 40+ messages in thread
From: Ingo Molnar @ 2009-03-08 16:24 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Lai Jiangshan, Steven Rostedt, Peter Zijlstra,
	Mathieu Desnoyers, Jiaying Zhang, Martin Bligh


* Frederic Weisbecker <fweisbec@gmail.com> wrote:

> +static const struct syscall_trace_entry syscall_trace_entries[] = {
> +	/* For open, the first argument is a string, hence the given mask */
> +	[SYSCALL_TRACE_OPEN]	= SYS_TRACE_ENTRY(open, 3, 0x1),
> +	[SYSCALL_TRACE_CLOSE]	= SYS_TRACE_ENTRY(close, 1, 0),
> +	[SYSCALL_TRACE_READ]	= SYS_TRACE_ENTRY(read, 3, 0),
> +	[SYSCALL_TRACE_WRITE]	= SYS_TRACE_ENTRY(read, 3, 0),
> +};

s/read/write in the last entry i guess.

But i think the whole concept of duplicating the syscall table 
is the wrong way around.

Note that we dont have to build this information at all - in 
2.6.29-rc1 all syscalls got wrapper macros:

 SYSCALL_DEFINE1(nice, int, increment)
 SYSCALL_DEFINE2(sched_setparam, pid_t, pid, struct sched_param __user *, param)
 SYSCALL_DEFINE3(sched_setaffinity, pid_t, pid, unsigned int, len,
                 unsigned long __user *, user_mask_ptr)

We also have the syscall table itself.

So what we can do: by changing the SYSCALL_DEFINEX() macros we 
can emit the following information into a table:

  (syscall_fn_address, syscall_name_string, #of arguments, array 
   of argument names and type sizeof()s)

then during bootup we can match up the sys_call_table[] to the 
secondary table we built, so that we can order the secondary 
table based on syscall NR. 99% of the syscalls will match up 
just fine.

	Ingo

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure
  2009-03-07  4:52 ` [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure Frederic Weisbecker
  2009-03-07  7:49   ` Frederic Weisbecker
  2009-03-08 16:24   ` Ingo Molnar
@ 2009-03-09 13:44   ` Steven Rostedt
  2009-03-09 20:42     ` Frederic Weisbecker
  2009-03-13  4:27   ` [tip:tracing/syscalls] tracing/ftrace: syscall tracing infrastructure, basics Frederic Weisbecker
  2009-03-16 19:36   ` [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure Masami Hiramatsu
  4 siblings, 1 reply; 40+ messages in thread
From: Steven Rostedt @ 2009-03-09 13:44 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Ingo Molnar, LKML, Lai Jiangshan, Peter Zijlstra,
	Mathieu Desnoyers, Jiaying Zhang, Martin Bligh


On Sat, 7 Mar 2009, Frederic Weisbecker wrote:
>  
>  libftrace-y := ftrace.o
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index cc94f86..5152be3 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -242,6 +242,12 @@ static DECLARE_WAIT_QUEUE_HEAD(trace_wait);
>  unsigned long trace_flags = TRACE_ITER_PRINT_PARENT | TRACE_ITER_PRINTK |
>  	TRACE_ITER_ANNOTATE | TRACE_ITER_CONTEXT_INFO;
>  
> +
> +struct trace_array *__get_global_trace(void)
> +{
> +	return &global_trace;
> +}
> +


Don't do that. It's opening up a can of worms ;-)

>  
> @@ -568,6 +587,7 @@ extern int trace_selftest_startup_branch(struct tracer *trace,
>  #endif /* CONFIG_FTRACE_STARTUP_TEST */
>  
>  extern void *head_page(struct trace_array_cpu *data);
> +extern struct trace_array *__get_global_trace(void);

Remove that too.

> +void ftrace_syscall_enter(struct pt_regs *regs)
> +{
> +	struct trace_array *tr = __get_global_trace();

Don't do that.

> +	struct syscall_trace_enter *entry;
> +	const struct syscall_trace_entry *sys_data;
> +	struct ring_buffer_event *event;
> +	struct trace_array_cpu *data;
> +	int size;
> +	int syscall_nr;
> +	int nr_offset;
> +	int cpu;
> +
> +	syscall_nr = syscall_get_nr(current, regs);
> +	nr_offset = arch_syscall_trace_nr[syscall_nr];
> +	if (!nr_offset || nr_offset >= __SYSCALL_TRACE_END)
> +		return;
> +
> +	cpu = raw_smp_processor_id();
> +	data = tr->data[cpu];
> +
> +	if (unlikely(atomic_read(&data->disabled)))
> +		return;
> +
> +	sys_data = &syscall_trace_entries[nr_offset];
> +	size = sizeof(*entry) + sizeof(unsigned long) * sys_data->nb_args;
> +
> +	event = trace_buffer_lock_reserve(tr, TRACE_SYSCALL_ENTER, size, 0, 0);

Here you want:

	trace_current_buffer_lock_reserve(TRACE_SYSCALL_ENTER, size, 0, 0);


-- Steve


> +	if (!event)
> +		return;
> +
> +	entry = ring_buffer_event_data(event);
> +	entry->nr = nr_offset;
> +	syscall_get_arguments(current, regs, 0, sys_data->nb_args, entry->args);
> +
> +	ring_buffer_unlock_commit(tr->buffer, event);
> +	trace_wake_up();
> +}
> +
> +void ftrace_syscall_exit(struct pt_regs *regs)
> +{
> +	struct trace_array *tr = __get_global_trace();
> +	struct syscall_trace_exit *entry;
> +	const struct syscall_trace_entry *sys_data;
> +	struct ring_buffer_event *event;
> +	struct trace_array_cpu *data;
> +	int syscall_nr;
> +	int nr_offset;
> +	int cpu;
> +
> +	syscall_nr = syscall_get_nr(current, regs);
> +	nr_offset = arch_syscall_trace_nr[syscall_nr];
> +	if (!nr_offset || nr_offset >= __SYSCALL_TRACE_END)
> +		return;
> +
> +	cpu = raw_smp_processor_id();
> +	data = tr->data[cpu];
> +
> +	if (unlikely(atomic_read(&data->disabled)))
> +		return;
> +
> +	sys_data = &syscall_trace_entries[nr_offset];
> +
> +	event = trace_buffer_lock_reserve(tr, TRACE_SYSCALL_EXIT,
> +				sizeof(*entry), 0, 0);

Same here.

> +	if (!event)
> +		return;
> +
> +	entry = ring_buffer_event_data(event);
> +	entry->nr = nr_offset;
> +	entry->ret = syscall_get_return_value(current, regs);
> +
> +	ring_buffer_unlock_commit(tr->buffer, event);
> +	trace_wake_up();
> +}
> +

-- Steve


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure
  2009-03-07  7:49   ` Frederic Weisbecker
@ 2009-03-09 13:52     ` Steven Rostedt
  2009-03-09 20:47       ` Frederic Weisbecker
  0 siblings, 1 reply; 40+ messages in thread
From: Steven Rostedt @ 2009-03-09 13:52 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Ingo Molnar, LKML, Lai Jiangshan, Peter Zijlstra,
	Mathieu Desnoyers, Jiaying Zhang, Martin Bligh


On Sat, 7 Mar 2009, Frederic Weisbecker wrote:
> +
> +
> +static atomic_t refcount;
> +
> +enum print_line_t
> +print_syscall_enter(struct trace_iterator *iter, int flags)
> +{
> +	struct trace_seq *s = &iter->seq;
> +	struct trace_entry *ent = iter->ent;
> +	struct syscall_trace_enter *trace;
> +	enum syscall_trace_nr syscall;
> +	const struct syscall_trace_entry *entry;
> +	char *str = NULL;
> +	int i, ret;
> +
> +	trace_assign_type(trace, ent);
> +
> +	syscall = trace->nr;
> +
> +	if (!syscall || syscall >= __SYSCALL_TRACE_END)

You still need to print a "\n" otherwise the output of the next line
will start on the current line.

> +		return TRACE_TYPE_HANDLED;
> +
> +	entry = &syscall_trace_entries[syscall];
> +	if (!entry->simple_format) {
> +		ret = entry->output.print(trace->args);
> +		if (!ret)
> +			return TRACE_TYPE_PARTIAL_LINE;
> +		return TRACE_TYPE_HANDLED;
> +	}
> +
> +	ret = trace_seq_printf(s, "syscall %s :", entry->name);
> +	if (!ret)
> +		return TRACE_TYPE_PARTIAL_LINE;
> +
> +	for (i = 0; i < entry->nb_args; i++) {
> +		/*
> +		 * FIXME: the string comes from the user, but on tracing
> +		 * output time, it may have disappeared from the memory.
> +		 * May be better to strdup it on tracing time.
> +		 */
> +		if (entry->output.string_mask & (1 << i)) {
> +			/* 128 is enough for most path */
> +			str = strndup_user((char *)trace->args[i], 128);
> +
> +			/* If it faulted badly, the error shoud have been
> +			 * handled while servicing the syscall, just ignore.
> +			 */
> +			if (str == ERR_PTR(-EFAULT) || str == ERR_PTR(-EINVAL)
> +			    || str == ERR_PTR(-ENOMEM))
> +				return TRACE_TYPE_HANDLED;
> +
> +			ret = trace_seq_printf(s, " %s", str);
> +		} else {
> +			ret = trace_seq_printf(s, " 0x%lx", trace->args[i]);
> +		}
> +		if (!ret)
> +			goto end_partial;
> +	}
> +	if (!trace_seq_printf(s, "\n"))
> +		goto end_partial;
> +
> +	kfree(str);
> +	return TRACE_TYPE_HANDLED;
> +
> +end_partial:
> +	kfree(str);
> +	return TRACE_TYPE_PARTIAL_LINE;
> +}
> +
> +enum print_line_t
> +print_syscall_exit(struct trace_iterator *iter, int flags)
> +{
> +	struct trace_seq *s = &iter->seq;
> +	struct trace_entry *ent = iter->ent;
> +	struct syscall_trace_exit *trace;
> +	enum syscall_trace_nr syscall;
> +	const struct syscall_trace_entry *entry;
> +	int ret;
> +
> +	trace_assign_type(trace, ent);
> +
> +	syscall = trace->nr;
> +
> +	if (!syscall || syscall >= __SYSCALL_TRACE_END)
> +		return TRACE_TYPE_HANDLED;
> +
> +	entry = &syscall_trace_entries[syscall];
> +
> +	ret = trace_seq_printf(s, "syscall %s -> 0x%lx\n", entry->name,
> +				trace->ret);
> +	if (!ret)
> +		return TRACE_TYPE_PARTIAL_LINE;
> +
> +	return TRACE_TYPE_HANDLED;
> +}
> +
> +void start_ftrace_syscalls(void)
> +{
> +	unsigned long flags;
> +	struct task_struct *g, *t;
> +
> +	if (atomic_inc_return(&refcount) != 1)
> +		goto out;
> +
> +	read_lock_irqsave(&tasklist_lock, flags);
> +
> +	do_each_thread(g, t) {
> +		set_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
> +	} while_each_thread(g, t);
> +
> +	read_unlock_irqrestore(&tasklist_lock, flags);
> +
> +	return;
> +out:
> +	atomic_dec(&refcount);
> +}
> +
> +void stop_ftrace_syscalls(void)
> +{
> +	unsigned long flags;
> +	struct task_struct *g, *t;
> +
> +	if (atomic_dec_return(&refcount))
> +		return;
> +
> +	read_lock_irqsave(&tasklist_lock, flags);
> +
> +	do_each_thread(g, t) {
> +		clear_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
> +	} while_each_thread(g, t);
> +
> +	read_unlock_irqrestore(&tasklist_lock, flags);
> +}
> +
> +void ftrace_syscall_enter(struct pt_regs *regs)
> +{
> +	struct trace_array *tr = __get_global_trace();
> +	struct syscall_trace_enter *entry;
> +	const struct syscall_trace_entry *sys_data;
> +	struct ring_buffer_event *event;
> +	struct trace_array_cpu *data;
> +	int size;
> +	int syscall_nr;
> +	int nr_offset;
> +	int cpu;
> +
> +	syscall_nr = syscall_get_nr(current, regs);
> +	nr_offset = arch_syscall_trace_nr[syscall_nr];
> +	if (!nr_offset || nr_offset >= __SYSCALL_TRACE_END)
> +		return;
> +
> +	cpu = raw_smp_processor_id();
> +	data = tr->data[cpu];
> +
> +	if (unlikely(atomic_read(&data->disabled)))
> +		return;
> +
> +	sys_data = &syscall_trace_entries[nr_offset];
> +	size = sizeof(*entry) + sizeof(unsigned long) * sys_data->nb_args;
> +
> +	event = trace_buffer_lock_reserve(tr, TRACE_SYSCALL_ENTER, size, 0, 0);
> +	if (!event)
> +		return;
> +
> +	entry = ring_buffer_event_data(event);
> +	entry->nr = nr_offset;
> +	syscall_get_arguments(current, regs, 0, sys_data->nb_args, entry->args);
> +
> +	ring_buffer_unlock_commit(tr->buffer, event);

BTW, you should always use the matching lock. Above you used 
trace_buffer_lock_reserve, and here you use ring_buffer_unlock_commit. 
This may not work in the future. There is a trace_buffer_unlock_commit 
too.

But you are going to change these to trace_current_buffer_lock_reserve and 
trace_current_buffer_unlock_commit anyway ;-)


> +	trace_wake_up();
> +}
> +
> +void ftrace_syscall_exit(struct pt_regs *regs)
> +{
> +	struct trace_array *tr = __get_global_trace();
> +	struct syscall_trace_exit *entry;
> +	const struct syscall_trace_entry *sys_data;
> +	struct ring_buffer_event *event;
> +	struct trace_array_cpu *data;
> +	int syscall_nr;
> +	int nr_offset;
> +	int cpu;
> +
> +	syscall_nr = syscall_get_nr(current, regs);
> +	nr_offset = arch_syscall_trace_nr[syscall_nr];
> +	if (!nr_offset || nr_offset >= __SYSCALL_TRACE_END)
> +		return;
> +
> +	cpu = raw_smp_processor_id();
> +	data = tr->data[cpu];
> +
> +	if (unlikely(atomic_read(&data->disabled)))
> +		return;
> +
> +	sys_data = &syscall_trace_entries[nr_offset];
> +
> +	event = trace_buffer_lock_reserve(tr, TRACE_SYSCALL_EXIT,
> +				sizeof(*entry), 0, 0);
> +	if (!event)
> +		return;
> +
> +	entry = ring_buffer_event_data(event);
> +	entry->nr = nr_offset;
> +	entry->ret = syscall_get_return_value(current, regs);
> +
> +	ring_buffer_unlock_commit(tr->buffer, event);

Same here.

-- Steve

> +	trace_wake_up();
> +}
> +
> +static int init_syscall_tracer(struct trace_array *tr)
> +{
> +	start_ftrace_syscalls();
> +
> +	return 0;
> +}
> +
> +static void reset_syscall_tracer(struct trace_array *tr)
> +{
> +	stop_ftrace_syscalls();
> +}
> +
> +static struct trace_event syscall_enter_event = {
> +	.type	 	= TRACE_SYSCALL_ENTER,
> +	.trace		= print_syscall_enter,
> +};
> +
> +static struct trace_event syscall_exit_event = {
> +	.type	 	= TRACE_SYSCALL_EXIT,
> +	.trace		= print_syscall_exit,
> +};
> +
> +static struct tracer syscall_tracer __read_mostly = {
> +	.name	     	= "syscall",
> +	.init		= init_syscall_tracer,
> +	.reset		= reset_syscall_tracer
> +};
> +
> +__init int register_ftrace_syscalls(void)
> +{
> +	int ret;
> +
> +	ret = register_ftrace_event(&syscall_enter_event);
> +	if (!ret) {
> +		printk(KERN_WARNING "event %d failed to register\n",
> +		       syscall_enter_event.type);
> +		WARN_ON_ONCE(1);
> +	}
> +
> +	ret = register_ftrace_event(&syscall_exit_event);
> +	if (!ret) {
> +		printk(KERN_WARNING "event %d failed to register\n",
> +		       syscall_exit_event.type);
> +		WARN_ON_ONCE(1);
> +	}
> +
> +	return register_tracer(&syscall_tracer);
> +}
> +device_initcall(register_ftrace_syscalls);
> -- 
> 1.6.1
> 
> 
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure
  2009-03-08 16:24   ` Ingo Molnar
@ 2009-03-09 20:40     ` Frederic Weisbecker
  0 siblings, 0 replies; 40+ messages in thread
From: Frederic Weisbecker @ 2009-03-09 20:40 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, Lai Jiangshan, Steven Rostedt, Peter Zijlstra,
	Mathieu Desnoyers, Jiaying Zhang, Martin Bligh

On Sun, Mar 08, 2009 at 05:24:44PM +0100, Ingo Molnar wrote:
> 
> * Frederic Weisbecker <fweisbec@gmail.com> wrote:
> 
> > +static const struct syscall_trace_entry syscall_trace_entries[] = {
> > +	/* For open, the first argument is a string, hence the given mask */
> > +	[SYSCALL_TRACE_OPEN]	= SYS_TRACE_ENTRY(open, 3, 0x1),
> > +	[SYSCALL_TRACE_CLOSE]	= SYS_TRACE_ENTRY(close, 1, 0),
> > +	[SYSCALL_TRACE_READ]	= SYS_TRACE_ENTRY(read, 3, 0),
> > +	[SYSCALL_TRACE_WRITE]	= SYS_TRACE_ENTRY(read, 3, 0),
> > +};
> 
> s/read/write in the last entry i guess.
> 
> But i think the whole concept of duplicating the syscall table 
> is the wrong way around.
> 
> Note that we dont have to build this information at all - in 
> 2.6.29-rc1 all syscalls got wrapper macros:
> 
>  SYSCALL_DEFINE1(nice, int, increment)
>  SYSCALL_DEFINE2(sched_setparam, pid_t, pid, struct sched_param __user *, param)
>  SYSCALL_DEFINE3(sched_setaffinity, pid_t, pid, unsigned int, len,
>                  unsigned long __user *, user_mask_ptr)
> 
> We also have the syscall table itself.
> 
> So what we can do: by changing the SYSCALL_DEFINEX() macros we 
> can emit the following information into a table:
> 
>   (syscall_fn_address, syscall_name_string, #of arguments, array 
>    of argument names and type sizeof()s)
> 
> then during bootup we can match up the sys_call_table[] to the 
> secondary table we built, so that we can order the secondary 
> table based on syscall NR. 99% of the syscalls will match up 
> just fine.
> 
> 	Ingo

Yes, I didn't know the SYSCALL_DEFINEx() macro until Peter told me.
That's indeed a much better way to proceed, I will look at all that.


Thanks!


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure
  2009-03-09 13:44   ` Steven Rostedt
@ 2009-03-09 20:42     ` Frederic Weisbecker
  0 siblings, 0 replies; 40+ messages in thread
From: Frederic Weisbecker @ 2009-03-09 20:42 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, LKML, Lai Jiangshan, Peter Zijlstra,
	Mathieu Desnoyers, Jiaying Zhang, Martin Bligh

On Mon, Mar 09, 2009 at 09:44:35AM -0400, Steven Rostedt wrote:
> 
> On Sat, 7 Mar 2009, Frederic Weisbecker wrote:
> >  
> >  libftrace-y := ftrace.o
> > diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> > index cc94f86..5152be3 100644
> > --- a/kernel/trace/trace.c
> > +++ b/kernel/trace/trace.c
> > @@ -242,6 +242,12 @@ static DECLARE_WAIT_QUEUE_HEAD(trace_wait);
> >  unsigned long trace_flags = TRACE_ITER_PRINT_PARENT | TRACE_ITER_PRINTK |
> >  	TRACE_ITER_ANNOTATE | TRACE_ITER_CONTEXT_INFO;
> >  
> > +
> > +struct trace_array *__get_global_trace(void)
> > +{
> > +	return &global_trace;
> > +}
> > +
> 
> 
> Don't do that. It's opening up a can of worms ;-)



That's right. I just wanted to have the insertion done in the dedicated file.

 
> >  
> > @@ -568,6 +587,7 @@ extern int trace_selftest_startup_branch(struct tracer *trace,
> >  #endif /* CONFIG_FTRACE_STARTUP_TEST */
> >  
> >  extern void *head_page(struct trace_array_cpu *data);
> > +extern struct trace_array *__get_global_trace(void);
> 
> Remove that too.
> 
> > +void ftrace_syscall_enter(struct pt_regs *regs)
> > +{
> > +	struct trace_array *tr = __get_global_trace();
> 
> Don't do that.
> 
> > +	struct syscall_trace_enter *entry;
> > +	const struct syscall_trace_entry *sys_data;
> > +	struct ring_buffer_event *event;
> > +	struct trace_array_cpu *data;
> > +	int size;
> > +	int syscall_nr;
> > +	int nr_offset;
> > +	int cpu;
> > +
> > +	syscall_nr = syscall_get_nr(current, regs);
> > +	nr_offset = arch_syscall_trace_nr[syscall_nr];
> > +	if (!nr_offset || nr_offset >= __SYSCALL_TRACE_END)
> > +		return;
> > +
> > +	cpu = raw_smp_processor_id();
> > +	data = tr->data[cpu];
> > +
> > +	if (unlikely(atomic_read(&data->disabled)))
> > +		return;
> > +
> > +	sys_data = &syscall_trace_entries[nr_offset];
> > +	size = sizeof(*entry) + sizeof(unsigned long) * sys_data->nb_args;
> > +
> > +	event = trace_buffer_lock_reserve(tr, TRACE_SYSCALL_ENTER, size, 0, 0);
> 
> Here you want:
> 
> 	trace_current_buffer_lock_reserve(TRACE_SYSCALL_ENTER, size, 0, 0);
> 
> 
> -- Steve


Heh! I didn't know this one! :-)
Thanks!


> 
> > +	if (!event)
> > +		return;
> > +
> > +	entry = ring_buffer_event_data(event);
> > +	entry->nr = nr_offset;
> > +	syscall_get_arguments(current, regs, 0, sys_data->nb_args, entry->args);
> > +
> > +	ring_buffer_unlock_commit(tr->buffer, event);
> > +	trace_wake_up();
> > +}
> > +
> > +void ftrace_syscall_exit(struct pt_regs *regs)
> > +{
> > +	struct trace_array *tr = __get_global_trace();
> > +	struct syscall_trace_exit *entry;
> > +	const struct syscall_trace_entry *sys_data;
> > +	struct ring_buffer_event *event;
> > +	struct trace_array_cpu *data;
> > +	int syscall_nr;
> > +	int nr_offset;
> > +	int cpu;
> > +
> > +	syscall_nr = syscall_get_nr(current, regs);
> > +	nr_offset = arch_syscall_trace_nr[syscall_nr];
> > +	if (!nr_offset || nr_offset >= __SYSCALL_TRACE_END)
> > +		return;
> > +
> > +	cpu = raw_smp_processor_id();
> > +	data = tr->data[cpu];
> > +
> > +	if (unlikely(atomic_read(&data->disabled)))
> > +		return;
> > +
> > +	sys_data = &syscall_trace_entries[nr_offset];
> > +
> > +	event = trace_buffer_lock_reserve(tr, TRACE_SYSCALL_EXIT,
> > +				sizeof(*entry), 0, 0);
> 
> Same here.
> 
> > +	if (!event)
> > +		return;
> > +
> > +	entry = ring_buffer_event_data(event);
> > +	entry->nr = nr_offset;
> > +	entry->ret = syscall_get_return_value(current, regs);
> > +
> > +	ring_buffer_unlock_commit(tr->buffer, event);
> > +	trace_wake_up();
> > +}
> > +
> 
> -- Steve
> 


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure
  2009-03-09 13:52     ` Steven Rostedt
@ 2009-03-09 20:47       ` Frederic Weisbecker
  2009-03-10  0:52         ` Steven Rostedt
  0 siblings, 1 reply; 40+ messages in thread
From: Frederic Weisbecker @ 2009-03-09 20:47 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, LKML, Lai Jiangshan, Peter Zijlstra,
	Mathieu Desnoyers, Jiaying Zhang, Martin Bligh

On Mon, Mar 09, 2009 at 09:52:20AM -0400, Steven Rostedt wrote:
> 
> On Sat, 7 Mar 2009, Frederic Weisbecker wrote:
> > +
> > +
> > +static atomic_t refcount;
> > +
> > +enum print_line_t
> > +print_syscall_enter(struct trace_iterator *iter, int flags)
> > +{
> > +	struct trace_seq *s = &iter->seq;
> > +	struct trace_entry *ent = iter->ent;
> > +	struct syscall_trace_enter *trace;
> > +	enum syscall_trace_nr syscall;
> > +	const struct syscall_trace_entry *entry;
> > +	char *str = NULL;
> > +	int i, ret;
> > +
> > +	trace_assign_type(trace, ent);
> > +
> > +	syscall = trace->nr;
> > +
> > +	if (!syscall || syscall >= __SYSCALL_TRACE_END)
> 
> You still need to print a "\n" otherwise the output of the next line
> will start on the current line.

Are you sure?
This entry will be ignored. And the previous one did print a newline.
So the next one should not have any problem.

 
> > +		return TRACE_TYPE_HANDLED;
> > +
> > +	entry = &syscall_trace_entries[syscall];
> > +	if (!entry->simple_format) {
> > +		ret = entry->output.print(trace->args);
> > +		if (!ret)
> > +			return TRACE_TYPE_PARTIAL_LINE;
> > +		return TRACE_TYPE_HANDLED;
> > +	}
> > +
> > +	ret = trace_seq_printf(s, "syscall %s :", entry->name);
> > +	if (!ret)
> > +		return TRACE_TYPE_PARTIAL_LINE;
> > +
> > +	for (i = 0; i < entry->nb_args; i++) {
> > +		/*
> > +		 * FIXME: the string comes from the user, but on tracing
> > +		 * output time, it may have disappeared from the memory.
> > +		 * May be better to strdup it on tracing time.
> > +		 */
> > +		if (entry->output.string_mask & (1 << i)) {
> > +			/* 128 is enough for most path */
> > +			str = strndup_user((char *)trace->args[i], 128);
> > +
> > +			/* If it faulted badly, the error shoud have been
> > +			 * handled while servicing the syscall, just ignore.
> > +			 */
> > +			if (str == ERR_PTR(-EFAULT) || str == ERR_PTR(-EINVAL)
> > +			    || str == ERR_PTR(-ENOMEM))
> > +				return TRACE_TYPE_HANDLED;
> > +
> > +			ret = trace_seq_printf(s, " %s", str);
> > +		} else {
> > +			ret = trace_seq_printf(s, " 0x%lx", trace->args[i]);
> > +		}
> > +		if (!ret)
> > +			goto end_partial;
> > +	}
> > +	if (!trace_seq_printf(s, "\n"))
> > +		goto end_partial;
> > +
> > +	kfree(str);
> > +	return TRACE_TYPE_HANDLED;
> > +
> > +end_partial:
> > +	kfree(str);
> > +	return TRACE_TYPE_PARTIAL_LINE;
> > +}
> > +
> > +enum print_line_t
> > +print_syscall_exit(struct trace_iterator *iter, int flags)
> > +{
> > +	struct trace_seq *s = &iter->seq;
> > +	struct trace_entry *ent = iter->ent;
> > +	struct syscall_trace_exit *trace;
> > +	enum syscall_trace_nr syscall;
> > +	const struct syscall_trace_entry *entry;
> > +	int ret;
> > +
> > +	trace_assign_type(trace, ent);
> > +
> > +	syscall = trace->nr;
> > +
> > +	if (!syscall || syscall >= __SYSCALL_TRACE_END)
> > +		return TRACE_TYPE_HANDLED;
> > +
> > +	entry = &syscall_trace_entries[syscall];
> > +
> > +	ret = trace_seq_printf(s, "syscall %s -> 0x%lx\n", entry->name,
> > +				trace->ret);
> > +	if (!ret)
> > +		return TRACE_TYPE_PARTIAL_LINE;
> > +
> > +	return TRACE_TYPE_HANDLED;
> > +}
> > +
> > +void start_ftrace_syscalls(void)
> > +{
> > +	unsigned long flags;
> > +	struct task_struct *g, *t;
> > +
> > +	if (atomic_inc_return(&refcount) != 1)
> > +		goto out;
> > +
> > +	read_lock_irqsave(&tasklist_lock, flags);
> > +
> > +	do_each_thread(g, t) {
> > +		set_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
> > +	} while_each_thread(g, t);
> > +
> > +	read_unlock_irqrestore(&tasklist_lock, flags);
> > +
> > +	return;
> > +out:
> > +	atomic_dec(&refcount);
> > +}
> > +
> > +void stop_ftrace_syscalls(void)
> > +{
> > +	unsigned long flags;
> > +	struct task_struct *g, *t;
> > +
> > +	if (atomic_dec_return(&refcount))
> > +		return;
> > +
> > +	read_lock_irqsave(&tasklist_lock, flags);
> > +
> > +	do_each_thread(g, t) {
> > +		clear_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
> > +	} while_each_thread(g, t);
> > +
> > +	read_unlock_irqrestore(&tasklist_lock, flags);
> > +}
> > +
> > +void ftrace_syscall_enter(struct pt_regs *regs)
> > +{
> > +	struct trace_array *tr = __get_global_trace();
> > +	struct syscall_trace_enter *entry;
> > +	const struct syscall_trace_entry *sys_data;
> > +	struct ring_buffer_event *event;
> > +	struct trace_array_cpu *data;
> > +	int size;
> > +	int syscall_nr;
> > +	int nr_offset;
> > +	int cpu;
> > +
> > +	syscall_nr = syscall_get_nr(current, regs);
> > +	nr_offset = arch_syscall_trace_nr[syscall_nr];
> > +	if (!nr_offset || nr_offset >= __SYSCALL_TRACE_END)
> > +		return;
> > +
> > +	cpu = raw_smp_processor_id();
> > +	data = tr->data[cpu];
> > +
> > +	if (unlikely(atomic_read(&data->disabled)))
> > +		return;
> > +
> > +	sys_data = &syscall_trace_entries[nr_offset];
> > +	size = sizeof(*entry) + sizeof(unsigned long) * sys_data->nb_args;
> > +
> > +	event = trace_buffer_lock_reserve(tr, TRACE_SYSCALL_ENTER, size, 0, 0);
> > +	if (!event)
> > +		return;
> > +
> > +	entry = ring_buffer_event_data(event);
> > +	entry->nr = nr_offset;
> > +	syscall_get_arguments(current, regs, 0, sys_data->nb_args, entry->args);
> > +
> > +	ring_buffer_unlock_commit(tr->buffer, event);
> 
> BTW, you should always use the matching lock. Above you used 
> trace_buffer_lock_reserve, and here you use ring_buffer_unlock_commit. 
> This may not work in the future. There is a trace_buffer_unlock_commit 
> too.

Ok. I just wondered a bit about the tests against stacktraces flags
which are unecessary here.

> 
> But you are going to change these to trace_current_buffer_lock_reserve and 
> trace_current_buffer_unlock_commit anyway ;-)
> 


Exactly :-)

Thanks!

Frederic.


> > +	trace_wake_up();
> > +}
> > +
> > +void ftrace_syscall_exit(struct pt_regs *regs)
> > +{
> > +	struct trace_array *tr = __get_global_trace();
> > +	struct syscall_trace_exit *entry;
> > +	const struct syscall_trace_entry *sys_data;
> > +	struct ring_buffer_event *event;
> > +	struct trace_array_cpu *data;
> > +	int syscall_nr;
> > +	int nr_offset;
> > +	int cpu;
> > +
> > +	syscall_nr = syscall_get_nr(current, regs);
> > +	nr_offset = arch_syscall_trace_nr[syscall_nr];
> > +	if (!nr_offset || nr_offset >= __SYSCALL_TRACE_END)
> > +		return;
> > +
> > +	cpu = raw_smp_processor_id();
> > +	data = tr->data[cpu];
> > +
> > +	if (unlikely(atomic_read(&data->disabled)))
> > +		return;
> > +
> > +	sys_data = &syscall_trace_entries[nr_offset];
> > +
> > +	event = trace_buffer_lock_reserve(tr, TRACE_SYSCALL_EXIT,
> > +				sizeof(*entry), 0, 0);
> > +	if (!event)
> > +		return;
> > +
> > +	entry = ring_buffer_event_data(event);
> > +	entry->nr = nr_offset;
> > +	entry->ret = syscall_get_return_value(current, regs);
> > +
> > +	ring_buffer_unlock_commit(tr->buffer, event);
> 
> Same here.
> 
> -- Steve
> 
> > +	trace_wake_up();
> > +}
> > +
> > +static int init_syscall_tracer(struct trace_array *tr)
> > +{
> > +	start_ftrace_syscalls();
> > +
> > +	return 0;
> > +}
> > +
> > +static void reset_syscall_tracer(struct trace_array *tr)
> > +{
> > +	stop_ftrace_syscalls();
> > +}
> > +
> > +static struct trace_event syscall_enter_event = {
> > +	.type	 	= TRACE_SYSCALL_ENTER,
> > +	.trace		= print_syscall_enter,
> > +};
> > +
> > +static struct trace_event syscall_exit_event = {
> > +	.type	 	= TRACE_SYSCALL_EXIT,
> > +	.trace		= print_syscall_exit,
> > +};
> > +
> > +static struct tracer syscall_tracer __read_mostly = {
> > +	.name	     	= "syscall",
> > +	.init		= init_syscall_tracer,
> > +	.reset		= reset_syscall_tracer
> > +};
> > +
> > +__init int register_ftrace_syscalls(void)
> > +{
> > +	int ret;
> > +
> > +	ret = register_ftrace_event(&syscall_enter_event);
> > +	if (!ret) {
> > +		printk(KERN_WARNING "event %d failed to register\n",
> > +		       syscall_enter_event.type);
> > +		WARN_ON_ONCE(1);
> > +	}
> > +
> > +	ret = register_ftrace_event(&syscall_exit_event);
> > +	if (!ret) {
> > +		printk(KERN_WARNING "event %d failed to register\n",
> > +		       syscall_exit_event.type);
> > +		WARN_ON_ONCE(1);
> > +	}
> > +
> > +	return register_tracer(&syscall_tracer);
> > +}
> > +device_initcall(register_ftrace_syscalls);
> > -- 
> > 1.6.1
> > 
> > 
> > 


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure
  2009-03-09 20:47       ` Frederic Weisbecker
@ 2009-03-10  0:52         ` Steven Rostedt
  2009-03-10 10:08           ` Frederic Weisbecker
  0 siblings, 1 reply; 40+ messages in thread
From: Steven Rostedt @ 2009-03-10  0:52 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Ingo Molnar, LKML, Lai Jiangshan, Peter Zijlstra,
	Mathieu Desnoyers, Jiaying Zhang, Martin Bligh



On Mon, 9 Mar 2009, Frederic Weisbecker wrote:

> On Mon, Mar 09, 2009 at 09:52:20AM -0400, Steven Rostedt wrote:
> > 
> > On Sat, 7 Mar 2009, Frederic Weisbecker wrote:
> > > +
> > > +
> > > +static atomic_t refcount;
> > > +
> > > +enum print_line_t
> > > +print_syscall_enter(struct trace_iterator *iter, int flags)
> > > +{
> > > +	struct trace_seq *s = &iter->seq;
> > > +	struct trace_entry *ent = iter->ent;
> > > +	struct syscall_trace_enter *trace;
> > > +	enum syscall_trace_nr syscall;
> > > +	const struct syscall_trace_entry *entry;
> > > +	char *str = NULL;
> > > +	int i, ret;
> > > +
> > > +	trace_assign_type(trace, ent);
> > > +
> > > +	syscall = trace->nr;
> > > +
> > > +	if (!syscall || syscall >= __SYSCALL_TRACE_END)
> > 
> > You still need to print a "\n" otherwise the output of the next line
> > will start on the current line.
> 
> Are you sure?
> This entry will be ignored. And the previous one did print a newline.
> So the next one should not have any problem.

Does this take over the printline? Otherwise we could be printing a 
header, (comm, timestamp, cpu, etc).

-- Steve

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure
  2009-03-10  0:52         ` Steven Rostedt
@ 2009-03-10 10:08           ` Frederic Weisbecker
  0 siblings, 0 replies; 40+ messages in thread
From: Frederic Weisbecker @ 2009-03-10 10:08 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, LKML, Lai Jiangshan, Peter Zijlstra,
	Mathieu Desnoyers, Jiaying Zhang, Martin Bligh

On Mon, Mar 09, 2009 at 08:52:00PM -0400, Steven Rostedt wrote:
> 
> 
> On Mon, 9 Mar 2009, Frederic Weisbecker wrote:
> 
> > On Mon, Mar 09, 2009 at 09:52:20AM -0400, Steven Rostedt wrote:
> > > 
> > > On Sat, 7 Mar 2009, Frederic Weisbecker wrote:
> > > > +
> > > > +
> > > > +static atomic_t refcount;
> > > > +
> > > > +enum print_line_t
> > > > +print_syscall_enter(struct trace_iterator *iter, int flags)
> > > > +{
> > > > +	struct trace_seq *s = &iter->seq;
> > > > +	struct trace_entry *ent = iter->ent;
> > > > +	struct syscall_trace_enter *trace;
> > > > +	enum syscall_trace_nr syscall;
> > > > +	const struct syscall_trace_entry *entry;
> > > > +	char *str = NULL;
> > > > +	int i, ret;
> > > > +
> > > > +	trace_assign_type(trace, ent);
> > > > +
> > > > +	syscall = trace->nr;
> > > > +
> > > > +	if (!syscall || syscall >= __SYSCALL_TRACE_END)
> > > 
> > > You still need to print a "\n" otherwise the output of the next line
> > > will start on the current line.
> > 
> > Are you sure?
> > This entry will be ignored. And the previous one did print a newline.
> > So the next one should not have any problem.
> 
> Does this take over the printline? Otherwise we could be printing a 
> header, (comm, timestamp, cpu, etc).
> 
> -- Steve

Ah ok, I did not think of the headers.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [tip:tracing/syscalls] tracing/ftrace: syscall tracing infrastructure, basics
  2009-03-07  4:52 ` [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure Frederic Weisbecker
                     ` (2 preceding siblings ...)
  2009-03-09 13:44   ` Steven Rostedt
@ 2009-03-13  4:27   ` Frederic Weisbecker
  2009-03-16 19:36   ` [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure Masami Hiramatsu
  4 siblings, 0 replies; 40+ messages in thread
From: Frederic Weisbecker @ 2009-03-13  4:27 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, fweisbec, rostedt, tglx, laijs, mingo

Commit-ID:  ee08c6eccb7d1295516f7cf420fddf7b14e9146f
Gitweb:     http://git.kernel.org/tip/ee08c6eccb7d1295516f7cf420fddf7b14e9146f
Author:     Frederic Weisbecker <fweisbec@gmail.com>
AuthorDate: Sat, 7 Mar 2009 05:52:59 +0100
Commit:     Ingo Molnar <mingo@elte.hu>
CommitDate: Fri, 13 Mar 2009 06:25:43 +0100

tracing/ftrace: syscall tracing infrastructure, basics

Provide basic callbacks to do syscall tracing.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <1236401580-5758-2-git-send-email-fweisbec@gmail.com>
[ simplified it to a trace_printk() for now. ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>


---
 include/linux/ftrace.h        |   21 ++++++++
 kernel/trace/Kconfig          |   10 ++++
 kernel/trace/Makefile         |    1 +
 kernel/trace/trace.h          |    2 +
 kernel/trace/trace_syscalls.c |  113 +++++++++++++++++++++++++++++++++++++++++
 5 files changed, 147 insertions(+), 0 deletions(-)

diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index e1583f2..c146c10 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -503,4 +503,25 @@ static inline void trace_hw_branch_oops(void) {}
 
 #endif /* CONFIG_HW_BRANCH_TRACER */
 
+/*
+ * A syscall entry in the ftrace syscalls array.
+ *
+ * @syscall_nr: syscall number
+ */
+struct syscall_trace_entry {
+	int		syscall_nr;
+};
+
+#ifdef CONFIG_FTRACE_SYSCALLS
+extern void start_ftrace_syscalls(void);
+extern void stop_ftrace_syscalls(void);
+extern void ftrace_syscall_enter(struct pt_regs *regs);
+extern void ftrace_syscall_exit(struct pt_regs *regs);
+#else
+static inline void start_ftrace_syscalls(void) { }
+static inline void stop_ftrace_syscalls(void) { }
+static inline void ftrace_syscall_enter(struct pt_regs *regs) { }
+static inline void ftrace_syscall_exit(struct pt_regs *regs) { }
+#endif
+
 #endif /* _LINUX_FTRACE_H */
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 8e4a2a6..95a0ad1 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -34,6 +34,9 @@ config HAVE_FTRACE_MCOUNT_RECORD
 config HAVE_HW_BRANCH_TRACER
 	bool
 
+config HAVE_FTRACE_SYSCALLS
+	bool
+
 config TRACER_MAX_TRACE
 	bool
 
@@ -175,6 +178,13 @@ config EVENT_TRACER
 	  allowing the user to pick and choose which trace point they
 	  want to trace.
 
+config FTRACE_SYSCALLS
+	bool "Trace syscalls"
+	depends on HAVE_FTRACE_SYSCALLS
+	select TRACING
+	help
+	  Basic tracer to catch the syscall entry and exit events.
+
 config BOOT_TRACER
 	bool "Trace boot initcalls"
 	select TRACING
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index c7a2943..c3feea0 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -43,5 +43,6 @@ obj-$(CONFIG_BLK_DEV_IO_TRACE)	+= blktrace.o
 obj-$(CONFIG_EVENT_TRACER) += trace_events.o
 obj-$(CONFIG_EVENT_TRACER) += events.o
 obj-$(CONFIG_EVENT_TRACER) += trace_export.o
+obj-$(CONFIG_FTRACE_SYSCALLS) += trace_syscalls.o
 
 libftrace-y := ftrace.o
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index c5e1d88..3d49daa 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -30,6 +30,8 @@ enum trace_type {
 	TRACE_GRAPH_ENT,
 	TRACE_USER_STACK,
 	TRACE_HW_BRANCHES,
+	TRACE_SYSCALL_ENTER,
+	TRACE_SYSCALL_EXIT,
 	TRACE_KMEM_ALLOC,
 	TRACE_KMEM_FREE,
 	TRACE_POWER,
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
new file mode 100644
index 0000000..66cf974
--- /dev/null
+++ b/kernel/trace/trace_syscalls.c
@@ -0,0 +1,113 @@
+#include <linux/ftrace.h>
+#include <linux/kernel.h>
+
+#include <asm/syscall.h>
+
+#include "trace_output.h"
+#include "trace.h"
+
+static atomic_t refcount;
+
+void start_ftrace_syscalls(void)
+{
+	unsigned long flags;
+	struct task_struct *g, *t;
+
+	if (atomic_inc_return(&refcount) != 1)
+		goto out;
+
+	read_lock_irqsave(&tasklist_lock, flags);
+
+	do_each_thread(g, t) {
+		set_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
+	} while_each_thread(g, t);
+
+	read_unlock_irqrestore(&tasklist_lock, flags);
+out:
+	atomic_dec(&refcount);
+}
+
+void stop_ftrace_syscalls(void)
+{
+	unsigned long flags;
+	struct task_struct *g, *t;
+
+	if (atomic_dec_return(&refcount))
+		goto out;
+
+	read_lock_irqsave(&tasklist_lock, flags);
+
+	do_each_thread(g, t) {
+		clear_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
+	} while_each_thread(g, t);
+
+	read_unlock_irqrestore(&tasklist_lock, flags);
+out:
+	atomic_inc(&refcount);
+}
+
+void ftrace_syscall_enter(struct pt_regs *regs)
+{
+	int syscall_nr;
+
+	syscall_nr = syscall_get_nr(current, regs);
+
+	trace_printk("syscall %d enter\n", syscall_nr);
+}
+
+void ftrace_syscall_exit(struct pt_regs *regs)
+{
+	int syscall_nr;
+
+	syscall_nr = syscall_get_nr(current, regs);
+
+	trace_printk("syscall %d exit\n", syscall_nr);
+}
+
+static int init_syscall_tracer(struct trace_array *tr)
+{
+	start_ftrace_syscalls();
+
+	return 0;
+}
+
+static void reset_syscall_tracer(struct trace_array *tr)
+{
+	stop_ftrace_syscalls();
+}
+
+static struct trace_event syscall_enter_event = {
+	.type		= TRACE_SYSCALL_ENTER,
+};
+
+static struct trace_event syscall_exit_event = {
+	.type		= TRACE_SYSCALL_EXIT,
+};
+
+static struct tracer syscall_tracer __read_mostly = {
+	.name		= "syscall",
+	.init		= init_syscall_tracer,
+	.reset		= reset_syscall_tracer
+};
+
+__init int register_ftrace_syscalls(void)
+{
+	int ret;
+
+	ret = register_ftrace_event(&syscall_enter_event);
+	if (!ret) {
+		printk(KERN_WARNING "event %d failed to register\n",
+		       syscall_enter_event.type);
+		WARN_ON_ONCE(1);
+	}
+
+	ret = register_ftrace_event(&syscall_exit_event);
+	if (!ret) {
+		printk(KERN_WARNING "event %d failed to register\n",
+		       syscall_exit_event.type);
+		WARN_ON_ONCE(1);
+	}
+
+	return register_tracer(&syscall_tracer);
+}
+device_initcall(register_ftrace_syscalls);

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [tip:tracing/syscalls] tracing/x86: basic implementation of syscall tracing for x86
  2009-03-07  4:53 ` [RFC][PATCH 2/2] tracing/x86: basic implementation of syscall tracing for x86-64 Frederic Weisbecker
@ 2009-03-13  4:27   ` Frederic Weisbecker
  2009-03-13 16:09   ` [tip:tracing/syscalls] tracing/syscalls: support for syscalls tracing on x86, fix Ingo Molnar
  2009-03-13 16:32   ` [RFC][PATCH 2/2] tracing/x86: basic implementation of syscall tracing for x86-64 Mathieu Desnoyers
  2 siblings, 0 replies; 40+ messages in thread
From: Frederic Weisbecker @ 2009-03-13  4:27 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, fweisbec, rostedt, tglx, laijs, mingo

Commit-ID:  1b3fa2ce64363c289b3b14723cca7290bf91cfce
Gitweb:     http://git.kernel.org/tip/1b3fa2ce64363c289b3b14723cca7290bf91cfce
Author:     Frederic Weisbecker <fweisbec@gmail.com>
AuthorDate: Sat, 7 Mar 2009 05:53:00 +0100
Commit:     Ingo Molnar <mingo@elte.hu>
CommitDate: Fri, 13 Mar 2009 06:25:44 +0100

tracing/x86: basic implementation of syscall tracing for x86

Provide the x86 trace callbacks to trace syscalls.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <1236401580-5758-3-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


---
 arch/x86/Kconfig                   |    1 +
 arch/x86/include/asm/thread_info.h |    9 ++++++---
 arch/x86/kernel/ptrace.c           |    7 +++++++
 3 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index bdcee12..b0a638b 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -35,6 +35,7 @@ config X86
 	select HAVE_FUNCTION_GRAPH_TRACER
 	select HAVE_FUNCTION_TRACE_MCOUNT_TEST
 	select HAVE_FTRACE_NMI_ENTER if DYNAMIC_FTRACE
+	select HAVE_FTRACE_SYSCALLS
 	select HAVE_KVM
 	select HAVE_ARCH_KGDB
 	select HAVE_ARCH_TRACEHOOK
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index df9d5f7..8820a73 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -94,6 +94,7 @@ struct thread_info {
 #define TIF_FORCED_TF		24	/* true if TF in eflags artificially */
 #define TIF_DEBUGCTLMSR		25	/* uses thread_struct.debugctlmsr */
 #define TIF_DS_AREA_MSR		26      /* uses thread_struct.ds_area_msr */
+#define TIF_SYSCALL_FTRACE	27	/* for ftrace syscall instrumentation */
 
 #define _TIF_SYSCALL_TRACE	(1 << TIF_SYSCALL_TRACE)
 #define _TIF_NOTIFY_RESUME	(1 << TIF_NOTIFY_RESUME)
@@ -115,15 +116,17 @@ struct thread_info {
 #define _TIF_FORCED_TF		(1 << TIF_FORCED_TF)
 #define _TIF_DEBUGCTLMSR	(1 << TIF_DEBUGCTLMSR)
 #define _TIF_DS_AREA_MSR	(1 << TIF_DS_AREA_MSR)
+#define _TIF_SYSCALL_FTRACE	(1 << TIF_SYSCALL_FTRACE)
 
 /* work to do in syscall_trace_enter() */
 #define _TIF_WORK_SYSCALL_ENTRY	\
-	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU | \
+	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU | _TIF_SYSCALL_FTRACE |	\
 	 _TIF_SYSCALL_AUDIT | _TIF_SECCOMP | _TIF_SINGLESTEP)
 
 /* work to do in syscall_trace_leave() */
 #define _TIF_WORK_SYSCALL_EXIT	\
-	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | _TIF_SINGLESTEP)
+	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | _TIF_SINGLESTEP |	\
+	 _TIF_SYSCALL_FTRACE)
 
 /* work to do on interrupt/exception return */
 #define _TIF_WORK_MASK							\
@@ -132,7 +135,7 @@ struct thread_info {
 	   _TIF_SINGLESTEP|_TIF_SECCOMP|_TIF_SYSCALL_EMU))
 
 /* work to do on any return to user space */
-#define _TIF_ALLWORK_MASK (0x0000FFFF & ~_TIF_SECCOMP)
+#define _TIF_ALLWORK_MASK ((0x0000FFFF & ~_TIF_SECCOMP) | _TIF_SYSCALL_FTRACE)
 
 /* Only used for 64 bit */
 #define _TIF_DO_NOTIFY_MASK						\
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 3d9672e..99749d6 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -21,6 +21,7 @@
 #include <linux/audit.h>
 #include <linux/seccomp.h>
 #include <linux/signal.h>
+#include <linux/ftrace.h>
 
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
@@ -1416,6 +1417,9 @@ asmregparm long syscall_trace_enter(struct pt_regs *regs)
 	    tracehook_report_syscall_entry(regs))
 		ret = -1L;
 
+	if (unlikely(test_thread_flag(TIF_SYSCALL_FTRACE)))
+		ftrace_syscall_enter(regs);
+
 	if (unlikely(current->audit_context)) {
 		if (IS_IA32)
 			audit_syscall_entry(AUDIT_ARCH_I386,
@@ -1439,6 +1443,9 @@ asmregparm void syscall_trace_leave(struct pt_regs *regs)
 	if (unlikely(current->audit_context))
 		audit_syscall_exit(AUDITSC_RESULT(regs->ax), regs->ax);
 
+	if (unlikely(test_thread_flag(TIF_SYSCALL_FTRACE)))
+		ftrace_syscall_exit(regs);
+
 	if (test_thread_flag(TIF_SYSCALL_TRACE))
 		tracehook_report_syscall_exit(regs, 0);
 

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [tip:tracing/syscalls] tracing/syscalls: support for syscalls tracing on x86, fix
  2009-03-07  4:53 ` [RFC][PATCH 2/2] tracing/x86: basic implementation of syscall tracing for x86-64 Frederic Weisbecker
  2009-03-13  4:27   ` [tip:tracing/syscalls] tracing/x86: basic implementation of syscall tracing for x86 Frederic Weisbecker
@ 2009-03-13 16:09   ` Ingo Molnar
  2009-03-13 16:32   ` [RFC][PATCH 2/2] tracing/x86: basic implementation of syscall tracing for x86-64 Mathieu Desnoyers
  2 siblings, 0 replies; 40+ messages in thread
From: Ingo Molnar @ 2009-03-13 16:09 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, fweisbec, rostedt, tglx, laijs, mingo

Commit-ID:  ccd50dfd92ea2c4ba9e39531ac55db53393e783e
Gitweb:     http://git.kernel.org/tip/ccd50dfd92ea2c4ba9e39531ac55db53393e783e
Author:     Ingo Molnar <mingo@elte.hu>
AuthorDate: Fri, 13 Mar 2009 17:02:17 +0100
Commit:     Ingo Molnar <mingo@elte.hu>
CommitDate: Fri, 13 Mar 2009 17:02:17 +0100

tracing/syscalls: support for syscalls tracing on x86, fix

Impact: build fix

 kernel/built-in.o: In function `ftrace_syscall_exit':
 (.text+0x76667): undefined reference to `syscall_nr_to_meta'

ftrace.o is built:

obj-$(CONFIG_DYNAMIC_FTRACE)    += ftrace.o
obj-$(CONFIG_FUNCTION_GRAPH_TRACER) += ftrace.o

But now a CONFIG_FTRACE_SYSCALLS dependency is needed too.

Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <1236401580-5758-3-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


---
 arch/x86/kernel/Makefile |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 339ce35..84000eb 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -66,7 +66,8 @@ obj-$(CONFIG_X86_MPPARSE)	+= mpparse.o
 obj-y				+= apic/
 obj-$(CONFIG_X86_REBOOTFIXUPS)	+= reboot_fixups_32.o
 obj-$(CONFIG_DYNAMIC_FTRACE)	+= ftrace.o
-obj-$(CONFIG_FUNCTION_GRAPH_TRACER)	+= ftrace.o
+obj-$(CONFIG_FUNCTION_GRAPH_TRACER) += ftrace.o
+obj-$(CONFIG_FTRACE_SYSCALLS)	+= ftrace.o
 obj-$(CONFIG_KEXEC)		+= machine_kexec_$(BITS).o
 obj-$(CONFIG_KEXEC)		+= relocate_kernel_$(BITS).o crash.o
 obj-$(CONFIG_CRASH_DUMP)	+= crash_dump_$(BITS).o

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 2/2] tracing/x86: basic implementation of syscall tracing for x86-64
  2009-03-07  4:53 ` [RFC][PATCH 2/2] tracing/x86: basic implementation of syscall tracing for x86-64 Frederic Weisbecker
  2009-03-13  4:27   ` [tip:tracing/syscalls] tracing/x86: basic implementation of syscall tracing for x86 Frederic Weisbecker
  2009-03-13 16:09   ` [tip:tracing/syscalls] tracing/syscalls: support for syscalls tracing on x86, fix Ingo Molnar
@ 2009-03-13 16:32   ` Mathieu Desnoyers
  2009-03-13 16:37     ` Ingo Molnar
  2 siblings, 1 reply; 40+ messages in thread
From: Mathieu Desnoyers @ 2009-03-13 16:32 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Ingo Molnar, LKML, Lai Jiangshan, Steven Rostedt, Peter Zijlstra,
	Jiaying Zhang, Martin Bligh

* Frederic Weisbecker (fweisbec@gmail.com) wrote:
> Provide the ptrace hooks and arch specific syscall numbers to ftrace arch indepedant
> syscall numbers.
> For now it only supports 4 syscalls to provide an example.
> 

Hi Frederic,

I already have the equivalent TIF_KERNEL_TRACE flag in my LTTng tree
added to every Linux architecture. You might want to re-use this work
rather than re-doing this. I don't mind changing the flag name.

Please see the LTTng tree and patchset at http://www.lttng.org. I'll
answer more precisely if you need more information.

Mathieu

> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> ---
>  arch/x86/Kconfig                   |    1 +
>  arch/x86/include/asm/ftrace.h      |    7 +++++++
>  arch/x86/include/asm/thread_info.h |    9 ++++++---
>  arch/x86/kernel/ftrace.c           |   15 +++++++++++++++
>  arch/x86/kernel/ptrace.c           |    7 +++++++
>  5 files changed, 36 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 2a0ddfe..b5aec1b 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -35,6 +35,7 @@ config X86
>  	select HAVE_FUNCTION_GRAPH_TRACER
>  	select HAVE_FUNCTION_TRACE_MCOUNT_TEST
>  	select HAVE_FTRACE_NMI_ENTER if DYNAMIC_FTRACE
> +	select HAVE_FTRACE_SYSCALLS if X86_64
>  	select HAVE_KVM
>  	select HAVE_ARCH_KGDB
>  	select HAVE_ARCH_TRACEHOOK
> diff --git a/arch/x86/include/asm/ftrace.h b/arch/x86/include/asm/ftrace.h
> index db24c22..b2e6855 100644
> --- a/arch/x86/include/asm/ftrace.h
> +++ b/arch/x86/include/asm/ftrace.h
> @@ -28,6 +28,13 @@
>  
>  #endif
>  
> +/* FIXME: I don't want to stay hardcoded */
> +#ifdef CONFIG_X86_64
> +#define FTRACE_SYSCALL_MAX	296
> +#else
> +#define FTRACE_SYSCALL_MAX	333
> +#endif
> +
>  #ifdef CONFIG_FUNCTION_TRACER
>  #define MCOUNT_ADDR		((long)(mcount))
>  #define MCOUNT_INSN_SIZE	5 /* sizeof mcount call */
> diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
> index f6ba6f8..83d2b73 100644
> --- a/arch/x86/include/asm/thread_info.h
> +++ b/arch/x86/include/asm/thread_info.h
> @@ -95,6 +95,7 @@ struct thread_info {
>  #define TIF_FORCED_TF		24	/* true if TF in eflags artificially */
>  #define TIF_DEBUGCTLMSR		25	/* uses thread_struct.debugctlmsr */
>  #define TIF_DS_AREA_MSR		26      /* uses thread_struct.ds_area_msr */
> +#define TIF_SYSCALL_FTRACE	27	/* for ftrace syscall instrumentation */
>  
>  #define _TIF_SYSCALL_TRACE	(1 << TIF_SYSCALL_TRACE)
>  #define _TIF_NOTIFY_RESUME	(1 << TIF_NOTIFY_RESUME)
> @@ -117,15 +118,17 @@ struct thread_info {
>  #define _TIF_FORCED_TF		(1 << TIF_FORCED_TF)
>  #define _TIF_DEBUGCTLMSR	(1 << TIF_DEBUGCTLMSR)
>  #define _TIF_DS_AREA_MSR	(1 << TIF_DS_AREA_MSR)
> +#define _TIF_SYSCALL_FTRACE	(1 << TIF_SYSCALL_FTRACE)
>  
>  /* work to do in syscall_trace_enter() */
>  #define _TIF_WORK_SYSCALL_ENTRY	\
> -	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU | \
> +	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU | _TIF_SYSCALL_FTRACE |	\
>  	 _TIF_SYSCALL_AUDIT | _TIF_SECCOMP | _TIF_SINGLESTEP)
>  
>  /* work to do in syscall_trace_leave() */
>  #define _TIF_WORK_SYSCALL_EXIT	\
> -	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | _TIF_SINGLESTEP)
> +	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | _TIF_SINGLESTEP |	\
> +	 _TIF_SYSCALL_FTRACE)
>  
>  /* work to do on interrupt/exception return */
>  #define _TIF_WORK_MASK							\
> @@ -134,7 +137,7 @@ struct thread_info {
>  	   _TIF_SINGLESTEP|_TIF_SECCOMP|_TIF_SYSCALL_EMU))
>  
>  /* work to do on any return to user space */
> -#define _TIF_ALLWORK_MASK (0x0000FFFF & ~_TIF_SECCOMP)
> +#define _TIF_ALLWORK_MASK ((0x0000FFFF & ~_TIF_SECCOMP) | _TIF_SYSCALL_FTRACE)
>  
>  /* Only used for 64 bit */
>  #define _TIF_DO_NOTIFY_MASK						\
> diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
> index a85da17..5bba98b 100644
> --- a/arch/x86/kernel/ftrace.c
> +++ b/arch/x86/kernel/ftrace.c
> @@ -453,3 +453,18 @@ void prepare_ftrace_return(unsigned long *parent, unsigned long self_addr)
>  	}
>  }
>  #endif /* CONFIG_FUNCTION_GRAPH_TRACER */
> +
> +#ifdef CONFIG_FTRACE_SYSCALLS
> +#ifdef CONFIG_X86_64
> +/*
> + * Array that make the arch specific syscall numbers
> + * matching the ftrace syscall numbers.
> + */
> +short arch_syscall_trace_nr[FTRACE_SYSCALL_MAX] = {
> +	[__NR_open]	= SYSCALL_TRACE_OPEN,
> +	[__NR_close]	= SYSCALL_TRACE_CLOSE,
> +	[__NR_read]	= SYSCALL_TRACE_READ,
> +	[__NR_write]	= SYSCALL_TRACE_WRITE
> +};
> +#endif
> +#endif
> diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
> index 3d9672e..99749d6 100644
> --- a/arch/x86/kernel/ptrace.c
> +++ b/arch/x86/kernel/ptrace.c
> @@ -21,6 +21,7 @@
>  #include <linux/audit.h>
>  #include <linux/seccomp.h>
>  #include <linux/signal.h>
> +#include <linux/ftrace.h>
>  
>  #include <asm/uaccess.h>
>  #include <asm/pgtable.h>
> @@ -1416,6 +1417,9 @@ asmregparm long syscall_trace_enter(struct pt_regs *regs)
>  	    tracehook_report_syscall_entry(regs))
>  		ret = -1L;
>  
> +	if (unlikely(test_thread_flag(TIF_SYSCALL_FTRACE)))
> +		ftrace_syscall_enter(regs);
> +
>  	if (unlikely(current->audit_context)) {
>  		if (IS_IA32)
>  			audit_syscall_entry(AUDIT_ARCH_I386,
> @@ -1439,6 +1443,9 @@ asmregparm void syscall_trace_leave(struct pt_regs *regs)
>  	if (unlikely(current->audit_context))
>  		audit_syscall_exit(AUDITSC_RESULT(regs->ax), regs->ax);
>  
> +	if (unlikely(test_thread_flag(TIF_SYSCALL_FTRACE)))
> +		ftrace_syscall_exit(regs);
> +
>  	if (test_thread_flag(TIF_SYSCALL_TRACE))
>  		tracehook_report_syscall_exit(regs, 0);
>  
> -- 
> 1.6.1
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 2/2] tracing/x86: basic implementation of syscall tracing for x86-64
  2009-03-13 16:32   ` [RFC][PATCH 2/2] tracing/x86: basic implementation of syscall tracing for x86-64 Mathieu Desnoyers
@ 2009-03-13 16:37     ` Ingo Molnar
  2009-03-13 16:50       ` Mathieu Desnoyers
  0 siblings, 1 reply; 40+ messages in thread
From: Ingo Molnar @ 2009-03-13 16:37 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Frederic Weisbecker, LKML, Lai Jiangshan, Steven Rostedt,
	Peter Zijlstra, Jiaying Zhang, Martin Bligh


* Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:

> * Frederic Weisbecker (fweisbec@gmail.com) wrote:
>
> > Provide the ptrace hooks and arch specific syscall numbers 
> > to ftrace arch indepedant syscall numbers. For now it only 
> > supports 4 syscalls to provide an example.
> > 
> 
> Hi Frederic,
> 
> I already have the equivalent TIF_KERNEL_TRACE flag in my 
> LTTng tree added to every Linux architecture. You might want 
> to re-use this work rather than re-doing this. I don't mind 
> changing the flag name.

Yeah. Note that the TIF bits are just one part - there are other 
bits needed for HAVE_FTRACE_SYSCALLS arch support.

Also, i'd eventually expect the TIF bits to be converted to a 
tracehook callback, not spread it to other architectures.

	Ingo

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 2/2] tracing/x86: basic implementation of syscall tracing for x86-64
  2009-03-13 16:37     ` Ingo Molnar
@ 2009-03-13 16:50       ` Mathieu Desnoyers
  2009-03-15  4:44         ` Ingo Molnar
  0 siblings, 1 reply; 40+ messages in thread
From: Mathieu Desnoyers @ 2009-03-13 16:50 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Frederic Weisbecker, LKML, Lai Jiangshan, Steven Rostedt,
	Peter Zijlstra, Jiaying Zhang, Martin Bligh

* Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> 
> > * Frederic Weisbecker (fweisbec@gmail.com) wrote:
> >
> > > Provide the ptrace hooks and arch specific syscall numbers 
> > > to ftrace arch indepedant syscall numbers. For now it only 
> > > supports 4 syscalls to provide an example.
> > > 
> > 
> > Hi Frederic,
> > 
> > I already have the equivalent TIF_KERNEL_TRACE flag in my 
> > LTTng tree added to every Linux architecture. You might want 
> > to re-use this work rather than re-doing this. I don't mind 
> > changing the flag name.
> 
> Yeah. Note that the TIF bits are just one part - there are other 
> bits needed for HAVE_FTRACE_SYSCALLS arch support.
> 
> Also, i'd eventually expect the TIF bits to be converted to a 
> tracehook callback, not spread it to other architectures.
> 

The nice part about the TIF bit is that it permits adding this syscall
tracing feature without increasing the size of the thread_info struct
nor adding any extra tests in entry.S. Does the tracehook callback have
these features ?

Mathieu

> 	Ingo

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 2/2] tracing/x86: basic implementation of syscall tracing for x86-64
  2009-03-13 16:50       ` Mathieu Desnoyers
@ 2009-03-15  4:44         ` Ingo Molnar
  2009-03-15 15:16           ` Frederic Weisbecker
  2009-03-15 19:22           ` Mathieu Desnoyers
  0 siblings, 2 replies; 40+ messages in thread
From: Ingo Molnar @ 2009-03-15  4:44 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Frederic Weisbecker, LKML, Lai Jiangshan, Steven Rostedt,
	Peter Zijlstra, Jiaying Zhang, Martin Bligh


* Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:

> * Ingo Molnar (mingo@elte.hu) wrote:
> > 
> > * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> > 
> > > * Frederic Weisbecker (fweisbec@gmail.com) wrote:
> > >
> > > > Provide the ptrace hooks and arch specific syscall numbers 
> > > > to ftrace arch indepedant syscall numbers. For now it only 
> > > > supports 4 syscalls to provide an example.
> > > > 
> > > 
> > > Hi Frederic,
> > > 
> > > I already have the equivalent TIF_KERNEL_TRACE flag in my 
> > > LTTng tree added to every Linux architecture. You might want 
> > > to re-use this work rather than re-doing this. I don't mind 
> > > changing the flag name.
> > 
> > Yeah. Note that the TIF bits are just one part - there are other 
> > bits needed for HAVE_FTRACE_SYSCALLS arch support.
> > 
> > Also, i'd eventually expect the TIF bits to be converted to a 
> > tracehook callback, not spread it to other architectures.
> > 
> 
> The nice part about the TIF bit is that it permits adding this 
> syscall tracing feature without increasing the size of the 
> thread_info struct nor adding any extra tests in entry.S. Does 
> the tracehook callback have these features ?

yes. Tracehook just factors out common interfacing points - with 
one specific implementation for now: ptrace. For syscall tracing 
that means it wraps TIF_SYSCALL_TRACE in essence.

	Ingo

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 2/2] tracing/x86: basic implementation of syscall tracing for x86-64
  2009-03-15  4:44         ` Ingo Molnar
@ 2009-03-15 15:16           ` Frederic Weisbecker
  2009-03-15 19:22           ` Mathieu Desnoyers
  1 sibling, 0 replies; 40+ messages in thread
From: Frederic Weisbecker @ 2009-03-15 15:16 UTC (permalink / raw)
  To: Ingo Molnar, Mathieu Desnoyers
  Cc: LKML, Lai Jiangshan, Steven Rostedt, Peter Zijlstra,
	Jiaying Zhang, Martin Bligh

On Sun, Mar 15, 2009 at 05:44:05AM +0100, Ingo Molnar wrote:
> 
> * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> 
> > * Ingo Molnar (mingo@elte.hu) wrote:
> > > 
> > > * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> > > 
> > > > * Frederic Weisbecker (fweisbec@gmail.com) wrote:
> > > >
> > > > > Provide the ptrace hooks and arch specific syscall numbers 
> > > > > to ftrace arch indepedant syscall numbers. For now it only 
> > > > > supports 4 syscalls to provide an example.
> > > > > 
> > > > 
> > > > Hi Frederic,
> > > > 
> > > > I already have the equivalent TIF_KERNEL_TRACE flag in my 
> > > > LTTng tree added to every Linux architecture. You might want 
> > > > to re-use this work rather than re-doing this. I don't mind 
> > > > changing the flag name.
> > > 
> > > Yeah. Note that the TIF bits are just one part - there are other 
> > > bits needed for HAVE_FTRACE_SYSCALLS arch support.
> > > 
> > > Also, i'd eventually expect the TIF bits to be converted to a 
> > > tracehook callback, not spread it to other architectures.
> > > 
> > 
> > The nice part about the TIF bit is that it permits adding this 
> > syscall tracing feature without increasing the size of the 
> > thread_info struct nor adding any extra tests in entry.S. Does 
> > the tracehook callback have these features ?
> 
> yes. Tracehook just factors out common interfacing points - with 
> one specific implementation for now: ptrace. For syscall tracing 
> that means it wraps TIF_SYSCALL_TRACE in essence.
> 
> 	Ingo


Since I didn't want to experience conflicts against ptrace, I haven't
played with the tracehooks.

But anyway, I wrote this low level part of syscall tracing having in mind
the fact that utrace does this work very much better. That's why I wrote it
to be very basic and simple enough to do the job, but waiting for a better
code that is already written somewhere else.

So now there is some code which does that on both utrace and Lttng.

I would be pleased to see these patches which handle these flags properly
on LKML :-)

Frederic.




^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 2/2] tracing/x86: basic implementation of syscall tracing for x86-64
  2009-03-15  4:44         ` Ingo Molnar
  2009-03-15 15:16           ` Frederic Weisbecker
@ 2009-03-15 19:22           ` Mathieu Desnoyers
  1 sibling, 0 replies; 40+ messages in thread
From: Mathieu Desnoyers @ 2009-03-15 19:22 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Frederic Weisbecker, LKML, Lai Jiangshan, Steven Rostedt,
	Peter Zijlstra, Jiaying Zhang, Martin Bligh

* Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> 
> > * Ingo Molnar (mingo@elte.hu) wrote:
> > > 
> > > * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> > > 
> > > > * Frederic Weisbecker (fweisbec@gmail.com) wrote:
> > > >
> > > > > Provide the ptrace hooks and arch specific syscall numbers 
> > > > > to ftrace arch indepedant syscall numbers. For now it only 
> > > > > supports 4 syscalls to provide an example.
> > > > > 
> > > > 
> > > > Hi Frederic,
> > > > 
> > > > I already have the equivalent TIF_KERNEL_TRACE flag in my 
> > > > LTTng tree added to every Linux architecture. You might want 
> > > > to re-use this work rather than re-doing this. I don't mind 
> > > > changing the flag name.
> > > 
> > > Yeah. Note that the TIF bits are just one part - there are other 
> > > bits needed for HAVE_FTRACE_SYSCALLS arch support.
> > > 
> > > Also, i'd eventually expect the TIF bits to be converted to a 
> > > tracehook callback, not spread it to other architectures.
> > > 
> > 
> > The nice part about the TIF bit is that it permits adding this 
> > syscall tracing feature without increasing the size of the 
> > thread_info struct nor adding any extra tests in entry.S. Does 
> > the tracehook callback have these features ?
> 
> yes. Tracehook just factors out common interfacing points - with 
> one specific implementation for now: ptrace. For syscall tracing 
> that means it wraps TIF_SYSCALL_TRACE in essence.
> 
> 	Ingo

What overhead and latency does tracehooks add when

- they are being enabled/disabled
- once they are enabled ?

Do they require to stop each and every thread upon activation ?

I'll post my lttng trace flags for RFC, as it could save Frederic some
duplicated work. Feel free to comment.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure
  2009-03-07  4:52 ` [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure Frederic Weisbecker
                     ` (3 preceding siblings ...)
  2009-03-13  4:27   ` [tip:tracing/syscalls] tracing/ftrace: syscall tracing infrastructure, basics Frederic Weisbecker
@ 2009-03-16 19:36   ` Masami Hiramatsu
  2009-03-16 20:15     ` Frederic Weisbecker
  4 siblings, 1 reply; 40+ messages in thread
From: Masami Hiramatsu @ 2009-03-16 19:36 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Ingo Molnar, LKML, Lai Jiangshan, Steven Rostedt, Peter Zijlstra,
	Mathieu Desnoyers, Jiaying Zhang, Martin Bligh, Roland McGrath

Hi Frederic,

Out of curiously, could you tell me why did you introduce new TIF instead of
using utrace's tracehook API?
For example, systemtap already use current utrace/tracehook api for tracing
all syscalls. If there is any new advantage, it would be good to enhance
existing utrace/tracehook.

Thank you,

Frederic Weisbecker wrote:
> The most important drawback that implies the syscall tracing is the variable
> number of parameters that takes a syscall. Assuming there is a large number of
> syscall, the goal is to provide as much as possible a generic layer on it.
> 
> What we want on tracing time (the hot path):
> 
> - Save the parameters and the number of the syscall. That's all. The rest of the
>   job can be done on the "output path".
> 
> On output time:
> 
> - Get the raw-binary saved parameters and the syscall number as well as some
>   usual infos such as the pid, the time...
> - Depending on the tracing requirements, the user should have the choice to
>   dump the raw datas or a formatted version.
> 
> The core infrastructure has a static array which contains the necessary
> informations for each syscall (ok, I've only implemented 4 for now).
> 
> When a syscall triggers, the tracer look at this array (O(1) access) and get the
> number of parameters for this syscall. Then these parameters are saved in a
> binary form on the ring buffer with the syscall number (a generic version, arch
> independent).
> 
> On output time the parameters, the name of the syscall and very basic
> informations to format the parameters are picket from the ring-buffer and the
> generic syscall array.
> 
> Depending of the level of implementation we want for the syscall inside this
> array, we can provide just a single mask where the n bit matches the n
> parameter. If the bit is set, then the parameter will be considered as a string
> pointer. Otherwise its raw hexadecimal value will be printed.
> 
> If needed, we can otherwise provide a specific callback to print the syscall
> event.
> 
> The probem with these approaches comes from the fact that the string might have
> disappeared from the user memory on output time. Even if we protect against
> faults, it can give unreliable traces.
> So perhaps we could think about copying the string on tracing time.
> 
> Concerning the return value, it doesn't make any problem, a single raw
> value is printed for each one.
> 
> Note that the current implementation can be easily scaled further to give only
> binary informations to the user and the matching metadata to decode it.
> 
> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> ---
>  include/linux/ftrace.h        |   44 +++++++
>  kernel/trace/Kconfig          |   10 ++
>  kernel/trace/Makefile         |    1 +
>  kernel/trace/trace.c          |    6 +
>  kernel/trace/trace.h          |   32 +++++
>  kernel/trace/trace_syscalls.c |  284 +++++++++++++++++++++++++++++++++++++++++
>  6 files changed, 377 insertions(+), 0 deletions(-)
>  create mode 100644 kernel/trace/trace_syscalls.c
> 
> diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
> index e1583f2..1798692 100644
> --- a/include/linux/ftrace.h
> +++ b/include/linux/ftrace.h
> @@ -503,4 +503,48 @@ static inline void trace_hw_branch_oops(void) {}
>  
>  #endif /* CONFIG_HW_BRANCH_TRACER */
>  
> +/* Index for the ftrace syscalls array */
> +enum syscall_trace_nr {
> +	SYSCALL_TRACE_OPEN = 1,
> +	SYSCALL_TRACE_CLOSE,
> +	SYSCALL_TRACE_READ,
> +	SYSCALL_TRACE_WRITE,
> +
> +	__SYSCALL_TRACE_END
> +};
> +
> +/*
> + * A syscall entry in the ftrace syscalls array.
> + *
> + * @name: name of the syscall
> + * @nb_args: number of parameters it takes
> + * @simple_format: if true, we use string_mask, otherwise the print callback
> + * @string_mask: rudimentary format mask. If bit nr is set, the nr parameter
> + *		 will be displayed as a string, otherwise it will be considered
> + *		 as a raw number (hex displayed).
> + * @print: implement it if you want a custom output for a given syscall
> + */
> +struct syscall_trace_entry {
> +	const char	*name;
> +	int		nb_args;
> +	bool		simple_format;
> +	union {
> +		unsigned long string_mask;
> +		int (*print)(unsigned long *);
> +	} output;
> +};
> +
> +#ifdef CONFIG_FTRACE_SYSCALLS
> +extern short arch_syscall_trace_nr[];
> +extern void start_ftrace_syscalls(void);
> +extern void stop_ftrace_syscalls(void);
> +extern void ftrace_syscall_enter(struct pt_regs *regs);
> +extern void ftrace_syscall_exit(struct pt_regs *regs);
> +#else
> +static inline void start_ftrace_syscalls(void) { }
> +static inline void stop_ftrace_syscalls(void) { }
> +static inline void ftrace_syscall_enter(struct pt_regs *regs) { }
> +static inline void ftrace_syscall_exit(struct pt_regs *regs) { }
> +#endif
> +
>  #endif /* _LINUX_FTRACE_H */
> diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
> index 8e4a2a6..95a0ad1 100644
> --- a/kernel/trace/Kconfig
> +++ b/kernel/trace/Kconfig
> @@ -34,6 +34,9 @@ config HAVE_FTRACE_MCOUNT_RECORD
>  config HAVE_HW_BRANCH_TRACER
>  	bool
>  
> +config HAVE_FTRACE_SYSCALLS
> +	bool
> +
>  config TRACER_MAX_TRACE
>  	bool
>  
> @@ -175,6 +178,13 @@ config EVENT_TRACER
>  	  allowing the user to pick and choose which trace point they
>  	  want to trace.
>  
> +config FTRACE_SYSCALLS
> +	bool "Trace syscalls"
> +	depends on HAVE_FTRACE_SYSCALLS
> +	select TRACING
> +	help
> +	  Basic tracer to catch the syscall entry and exit events.
> +
>  config BOOT_TRACER
>  	bool "Trace boot initcalls"
>  	select TRACING
> diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
> index c7a2943..c3feea0 100644
> --- a/kernel/trace/Makefile
> +++ b/kernel/trace/Makefile
> @@ -43,5 +43,6 @@ obj-$(CONFIG_BLK_DEV_IO_TRACE)	+= blktrace.o
>  obj-$(CONFIG_EVENT_TRACER) += trace_events.o
>  obj-$(CONFIG_EVENT_TRACER) += events.o
>  obj-$(CONFIG_EVENT_TRACER) += trace_export.o
> +obj-$(CONFIG_FTRACE_SYSCALLS) += trace_syscalls.o
>  
>  libftrace-y := ftrace.o
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index cc94f86..5152be3 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -242,6 +242,12 @@ static DECLARE_WAIT_QUEUE_HEAD(trace_wait);
>  unsigned long trace_flags = TRACE_ITER_PRINT_PARENT | TRACE_ITER_PRINTK |
>  	TRACE_ITER_ANNOTATE | TRACE_ITER_CONTEXT_INFO;
>  
> +
> +struct trace_array *__get_global_trace(void)
> +{
> +	return &global_trace;
> +}
> +
>  /**
>   * trace_wake_up - wake up tasks waiting for trace input
>   *
> diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
> index 2bfb7d1..9583830 100644
> --- a/kernel/trace/trace.h
> +++ b/kernel/trace/trace.h
> @@ -30,6 +30,8 @@ enum trace_type {
>  	TRACE_GRAPH_ENT,
>  	TRACE_USER_STACK,
>  	TRACE_HW_BRANCHES,
> +	TRACE_SYSCALL_ENTER,
> +	TRACE_SYSCALL_EXIT,
>  	TRACE_KMEM_ALLOC,
>  	TRACE_KMEM_FREE,
>  	TRACE_POWER,
> @@ -192,6 +194,19 @@ struct kmemtrace_free_entry {
>  	const void *ptr;
>  };
>  
> +struct syscall_trace_enter {
> +	struct trace_entry	ent;
> +	enum syscall_trace_nr	nr;
> +	unsigned long		args[];
> +};
> +
> +struct syscall_trace_exit {
> +	struct trace_entry	ent;
> +	enum syscall_trace_nr	nr;
> +	unsigned long		ret;
> +};
> +
> +
>  /*
>   * trace_flag_type is an enumeration that holds different
>   * states when a trace occurs. These are:
> @@ -304,6 +319,10 @@ extern void __ftrace_bad_type(void);
>  			  TRACE_KMEM_ALLOC);	\
>  		IF_ASSIGN(var, ent, struct kmemtrace_free_entry,	\
>  			  TRACE_KMEM_FREE);	\
> +		IF_ASSIGN(var, ent, struct syscall_trace_enter,		\
> +			  TRACE_SYSCALL_ENTER);				\
> +		IF_ASSIGN(var, ent, struct syscall_trace_exit,		\
> +			  TRACE_SYSCALL_EXIT);				\
>  		__ftrace_bad_type();					\
>  	} while (0)
>  
> @@ -568,6 +587,7 @@ extern int trace_selftest_startup_branch(struct tracer *trace,
>  #endif /* CONFIG_FTRACE_STARTUP_TEST */
>  
>  extern void *head_page(struct trace_array_cpu *data);
> +extern struct trace_array *__get_global_trace(void);
>  extern long ns2usecs(cycle_t nsec);
>  extern int
>  trace_vprintk(unsigned long ip, int depth, const char *fmt, va_list args);
> @@ -627,6 +647,18 @@ static inline int ftrace_trace_task(struct task_struct *task)
>  	return test_tsk_trace_trace(task);
>  }
>  
> +#ifdef CONFIG_FTRACE_SYSCALLS
> +extern enum print_line_t
> +print_syscall_enter(struct trace_iterator *iter, int flags);
> +extern enum print_line_t
> +print_syscall_exit(struct trace_iterator *iter, int flags);
> +#else
> +static inline enum print_line_t
> +print_syscall_enter(struct trace_iterator *iter, int flags) { }
> +static inline enum print_line_t
> +print_syscall_exit(struct trace_iterator *iter, int flags) { }
> +#endif
> +
>  /*
>   * trace_iterator_flags is an enumeration that defines bit
>   * positions into trace_flags that controls the output.
> diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
> new file mode 100644
> index 0000000..470bb9d
> --- /dev/null
> +++ b/kernel/trace/trace_syscalls.c
> @@ -0,0 +1,284 @@
> +#include <linux/kernel.h>
> +#include <linux/ftrace.h>
> +#include <asm/syscall.h>
> +
> +#include "trace_output.h"
> +#include "trace.h"
> +
> +/* Create basic entry on syscall array (with the rudimentary string mask) */
> +#define SYS_TRACE_ENTRY(sname, args, mask)			  \
> +	{.name = #sname, .nb_args = args, .simple_format = true,  \
> +	.output.string_mask = mask}
> +
> +/*
> + * Create a custom defined entry on the syscall array, you will have to
> + * implement a callback to output the syscall.
> + */
> +#define SYS_TRACE_ENTRY_SPECIAL(sname, args, printer)		  \
> +	{.name = #sname, .nb_args = args, .simple_format = false, \
> +	.print = printer}
> +
> +static const struct syscall_trace_entry syscall_trace_entries[] = {
> +	/* For open, the first argument is a string, hence the given mask */
> +	[SYSCALL_TRACE_OPEN]	= SYS_TRACE_ENTRY(open, 3, 0x1),
> +	[SYSCALL_TRACE_CLOSE]	= SYS_TRACE_ENTRY(close, 1, 0),
> +	[SYSCALL_TRACE_READ]	= SYS_TRACE_ENTRY(read, 3, 0),
> +	[SYSCALL_TRACE_WRITE]	= SYS_TRACE_ENTRY(read, 3, 0),
> +};
> +
> +
> +static atomic_t refcount;
> +
> +enum print_line_t
> +print_syscall_enter(struct trace_iterator *iter, int flags)
> +{
> +	struct trace_seq *s = &iter->seq;
> +	struct trace_entry *ent = iter->ent;
> +	struct syscall_trace_enter *trace;
> +	enum syscall_trace_nr syscall;
> +	const struct syscall_trace_entry *entry;
> +	char *str = NULL;
> +	int i, ret;
> +
> +	trace_assign_type(trace, ent);
> +
> +	syscall = trace->nr;
> +
> +	if (!syscall || syscall >= __SYSCALL_TRACE_END)
> +		return TRACE_TYPE_HANDLED;
> +
> +	entry = &syscall_trace_entries[syscall];
> +	if (!entry->simple_format) {
> +		ret = entry->output.print(trace->args);
> +		if (!ret)
> +			return TRACE_TYPE_PARTIAL_LINE;
> +	}
> +
> +	ret = trace_seq_printf(s, "syscall %s :", entry->name);
> +	if (!ret)
> +		return TRACE_TYPE_PARTIAL_LINE;
> +
> +	for (i = 0; i < entry->nb_args; i++) {
> +		/*
> +		 * FIXME: the string comes from the user, but on tracing
> +		 * output time, it may have disappeared from the memory.
> +		 * May be better to strdup it on tracing time.
> +		 */
> +		if (entry->output.string_mask & (1 << i)) {
> +			/* 128 is enough for most path */
> +			str = strndup_user((char *)trace->args[i], 128);
> +
> +			/* If it faulted badly, the error shoud have been
> +			 * handled while servicing the syscall, just ignore.
> +			 */
> +			if (str == ERR_PTR(-EFAULT) || str == ERR_PTR(-EINVAL)
> +			    || str == ERR_PTR(-ENOMEM))
> +				return TRACE_TYPE_HANDLED;
> +
> +			ret = trace_seq_printf(s, " %s", str);
> +		} else {
> +			ret = trace_seq_printf(s, " 0x%lx", trace->args[i]);
> +		}
> +		if (!ret)
> +			goto end_partial;
> +	}
> +	if (!trace_seq_printf(s, "\n"))
> +		goto end_partial;
> +
> +	kfree(str);
> +	return TRACE_TYPE_HANDLED;
> +
> +end_partial:
> +	kfree(str);
> +	return TRACE_TYPE_PARTIAL_LINE;
> +}
> +
> +enum print_line_t
> +print_syscall_exit(struct trace_iterator *iter, int flags)
> +{
> +	struct trace_seq *s = &iter->seq;
> +	struct trace_entry *ent = iter->ent;
> +	struct syscall_trace_exit *trace;
> +	enum syscall_trace_nr syscall;
> +	const struct syscall_trace_entry *entry;
> +	int ret;
> +
> +	trace_assign_type(trace, ent);
> +
> +	syscall = trace->nr;
> +
> +	if (!syscall || syscall >= __SYSCALL_TRACE_END)
> +		return TRACE_TYPE_HANDLED;
> +
> +	entry = &syscall_trace_entries[syscall];
> +
> +	ret = trace_seq_printf(s, "syscall %s -> 0x%lx\n", entry->name,
> +				trace->ret);
> +	if (!ret)
> +		return TRACE_TYPE_PARTIAL_LINE;
> +
> +	return TRACE_TYPE_HANDLED;
> +}
> +
> +void start_ftrace_syscalls(void)
> +{
> +	unsigned long flags;
> +	struct task_struct *g, *t;
> +
> +	if (atomic_inc_return(&refcount) != 1)
> +		goto out;
> +
> +	read_lock_irqsave(&tasklist_lock, flags);
> +
> +	do_each_thread(g, t) {
> +		set_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
> +	} while_each_thread(g, t);
> +
> +	read_unlock_irqrestore(&tasklist_lock, flags);
> +out:
> +	atomic_dec(&refcount);
> +}
> +
> +void stop_ftrace_syscalls(void)
> +{
> +	unsigned long flags;
> +	struct task_struct *g, *t;
> +
> +	if (atomic_dec_return(&refcount))
> +		goto out;
> +
> +	read_lock_irqsave(&tasklist_lock, flags);
> +
> +	do_each_thread(g, t) {
> +		clear_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
> +	} while_each_thread(g, t);
> +
> +	read_unlock_irqrestore(&tasklist_lock, flags);
> +out:
> +	atomic_inc(&refcount);
> +}
> +
> +void ftrace_syscall_enter(struct pt_regs *regs)
> +{
> +	struct trace_array *tr = __get_global_trace();
> +	struct syscall_trace_enter *entry;
> +	const struct syscall_trace_entry *sys_data;
> +	struct ring_buffer_event *event;
> +	struct trace_array_cpu *data;
> +	int size;
> +	int syscall_nr;
> +	int nr_offset;
> +	int cpu;
> +
> +	syscall_nr = syscall_get_nr(current, regs);
> +	nr_offset = arch_syscall_trace_nr[syscall_nr];
> +	if (!nr_offset || nr_offset >= __SYSCALL_TRACE_END)
> +		return;
> +
> +	cpu = raw_smp_processor_id();
> +	data = tr->data[cpu];
> +
> +	if (unlikely(atomic_read(&data->disabled)))
> +		return;
> +
> +	sys_data = &syscall_trace_entries[nr_offset];
> +	size = sizeof(*entry) + sizeof(unsigned long) * sys_data->nb_args;
> +
> +	event = trace_buffer_lock_reserve(tr, TRACE_SYSCALL_ENTER, size, 0, 0);
> +	if (!event)
> +		return;
> +
> +	entry = ring_buffer_event_data(event);
> +	entry->nr = nr_offset;
> +	syscall_get_arguments(current, regs, 0, sys_data->nb_args, entry->args);
> +
> +	ring_buffer_unlock_commit(tr->buffer, event);
> +	trace_wake_up();
> +}
> +
> +void ftrace_syscall_exit(struct pt_regs *regs)
> +{
> +	struct trace_array *tr = __get_global_trace();
> +	struct syscall_trace_exit *entry;
> +	const struct syscall_trace_entry *sys_data;
> +	struct ring_buffer_event *event;
> +	struct trace_array_cpu *data;
> +	int syscall_nr;
> +	int nr_offset;
> +	int cpu;
> +
> +	syscall_nr = syscall_get_nr(current, regs);
> +	nr_offset = arch_syscall_trace_nr[syscall_nr];
> +	if (!nr_offset || nr_offset >= __SYSCALL_TRACE_END)
> +		return;
> +
> +	cpu = raw_smp_processor_id();
> +	data = tr->data[cpu];
> +
> +	if (unlikely(atomic_read(&data->disabled)))
> +		return;
> +
> +	sys_data = &syscall_trace_entries[nr_offset];
> +
> +	event = trace_buffer_lock_reserve(tr, TRACE_SYSCALL_EXIT,
> +				sizeof(*entry), 0, 0);
> +	if (!event)
> +		return;
> +
> +	entry = ring_buffer_event_data(event);
> +	entry->nr = nr_offset;
> +	entry->ret = syscall_get_return_value(current, regs);
> +
> +	ring_buffer_unlock_commit(tr->buffer, event);
> +	trace_wake_up();
> +}
> +
> +static int init_syscall_tracer(struct trace_array *tr)
> +{
> +	start_ftrace_syscalls();
> +
> +	return 0;
> +}
> +
> +static void reset_syscall_tracer(struct trace_array *tr)
> +{
> +	stop_ftrace_syscalls();
> +}
> +
> +static struct trace_event syscall_enter_event = {
> +	.type	 	= TRACE_SYSCALL_ENTER,
> +	.trace		= print_syscall_enter,
> +};
> +
> +static struct trace_event syscall_exit_event = {
> +	.type	 	= TRACE_SYSCALL_EXIT,
> +	.trace		= print_syscall_exit,
> +};
> +
> +static struct tracer syscall_tracer __read_mostly = {
> +	.name	     	= "syscall",
> +	.init		= init_syscall_tracer,
> +	.reset		= reset_syscall_tracer
> +};
> +
> +__init int register_ftrace_syscalls(void)
> +{
> +	int ret;
> +
> +	ret = register_ftrace_event(&syscall_enter_event);
> +	if (!ret) {
> +		printk(KERN_WARNING "event %d failed to register\n",
> +		       syscall_enter_event.type);
> +		WARN_ON_ONCE(1);
> +	}
> +
> +	ret = register_ftrace_event(&syscall_exit_event);
> +	if (!ret) {
> +		printk(KERN_WARNING "event %d failed to register\n",
> +		       syscall_exit_event.type);
> +		WARN_ON_ONCE(1);
> +	}
> +
> +	return register_tracer(&syscall_tracer);
> +}
> +device_initcall(register_ftrace_syscalls);

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure
  2009-03-16 19:36   ` [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure Masami Hiramatsu
@ 2009-03-16 20:15     ` Frederic Weisbecker
  2009-03-16 20:38       ` Masami Hiramatsu
  0 siblings, 1 reply; 40+ messages in thread
From: Frederic Weisbecker @ 2009-03-16 20:15 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Ingo Molnar, LKML, Lai Jiangshan, Steven Rostedt, Peter Zijlstra,
	Mathieu Desnoyers, Jiaying Zhang, Martin Bligh, Roland McGrath

On Mon, Mar 16, 2009 at 03:36:58PM -0400, Masami Hiramatsu wrote:
> Hi Frederic,
> 
> Out of curiously, could you tell me why did you introduce new TIF instead of
> using utrace's tracehook API?
> For example, systemtap already use current utrace/tracehook api for tracing
> all syscalls. If there is any new advantage, it would be good to enhance
> existing utrace/tracehook.
> 
> Thank you,


Hi,

Because the tracehooks rely on ptrace work and both ftrace and ptrace would overlap
in a same flag, creating racy issues. At least ss far as I understood it.
How does systemtap manage it?

Thanks.

> 
> Frederic Weisbecker wrote:
> > The most important drawback that implies the syscall tracing is the variable
> > number of parameters that takes a syscall. Assuming there is a large number of
> > syscall, the goal is to provide as much as possible a generic layer on it.
> > 
> > What we want on tracing time (the hot path):
> > 
> > - Save the parameters and the number of the syscall. That's all. The rest of the
> >   job can be done on the "output path".
> > 
> > On output time:
> > 
> > - Get the raw-binary saved parameters and the syscall number as well as some
> >   usual infos such as the pid, the time...
> > - Depending on the tracing requirements, the user should have the choice to
> >   dump the raw datas or a formatted version.
> > 
> > The core infrastructure has a static array which contains the necessary
> > informations for each syscall (ok, I've only implemented 4 for now).
> > 
> > When a syscall triggers, the tracer look at this array (O(1) access) and get the
> > number of parameters for this syscall. Then these parameters are saved in a
> > binary form on the ring buffer with the syscall number (a generic version, arch
> > independent).
> > 
> > On output time the parameters, the name of the syscall and very basic
> > informations to format the parameters are picket from the ring-buffer and the
> > generic syscall array.
> > 
> > Depending of the level of implementation we want for the syscall inside this
> > array, we can provide just a single mask where the n bit matches the n
> > parameter. If the bit is set, then the parameter will be considered as a string
> > pointer. Otherwise its raw hexadecimal value will be printed.
> > 
> > If needed, we can otherwise provide a specific callback to print the syscall
> > event.
> > 
> > The probem with these approaches comes from the fact that the string might have
> > disappeared from the user memory on output time. Even if we protect against
> > faults, it can give unreliable traces.
> > So perhaps we could think about copying the string on tracing time.
> > 
> > Concerning the return value, it doesn't make any problem, a single raw
> > value is printed for each one.
> > 
> > Note that the current implementation can be easily scaled further to give only
> > binary informations to the user and the matching metadata to decode it.
> > 
> > Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> > ---
> >  include/linux/ftrace.h        |   44 +++++++
> >  kernel/trace/Kconfig          |   10 ++
> >  kernel/trace/Makefile         |    1 +
> >  kernel/trace/trace.c          |    6 +
> >  kernel/trace/trace.h          |   32 +++++
> >  kernel/trace/trace_syscalls.c |  284 +++++++++++++++++++++++++++++++++++++++++
> >  6 files changed, 377 insertions(+), 0 deletions(-)
> >  create mode 100644 kernel/trace/trace_syscalls.c
> > 
> > diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
> > index e1583f2..1798692 100644
> > --- a/include/linux/ftrace.h
> > +++ b/include/linux/ftrace.h
> > @@ -503,4 +503,48 @@ static inline void trace_hw_branch_oops(void) {}
> >  
> >  #endif /* CONFIG_HW_BRANCH_TRACER */
> >  
> > +/* Index for the ftrace syscalls array */
> > +enum syscall_trace_nr {
> > +	SYSCALL_TRACE_OPEN = 1,
> > +	SYSCALL_TRACE_CLOSE,
> > +	SYSCALL_TRACE_READ,
> > +	SYSCALL_TRACE_WRITE,
> > +
> > +	__SYSCALL_TRACE_END
> > +};
> > +
> > +/*
> > + * A syscall entry in the ftrace syscalls array.
> > + *
> > + * @name: name of the syscall
> > + * @nb_args: number of parameters it takes
> > + * @simple_format: if true, we use string_mask, otherwise the print callback
> > + * @string_mask: rudimentary format mask. If bit nr is set, the nr parameter
> > + *		 will be displayed as a string, otherwise it will be considered
> > + *		 as a raw number (hex displayed).
> > + * @print: implement it if you want a custom output for a given syscall
> > + */
> > +struct syscall_trace_entry {
> > +	const char	*name;
> > +	int		nb_args;
> > +	bool		simple_format;
> > +	union {
> > +		unsigned long string_mask;
> > +		int (*print)(unsigned long *);
> > +	} output;
> > +};
> > +
> > +#ifdef CONFIG_FTRACE_SYSCALLS
> > +extern short arch_syscall_trace_nr[];
> > +extern void start_ftrace_syscalls(void);
> > +extern void stop_ftrace_syscalls(void);
> > +extern void ftrace_syscall_enter(struct pt_regs *regs);
> > +extern void ftrace_syscall_exit(struct pt_regs *regs);
> > +#else
> > +static inline void start_ftrace_syscalls(void) { }
> > +static inline void stop_ftrace_syscalls(void) { }
> > +static inline void ftrace_syscall_enter(struct pt_regs *regs) { }
> > +static inline void ftrace_syscall_exit(struct pt_regs *regs) { }
> > +#endif
> > +
> >  #endif /* _LINUX_FTRACE_H */
> > diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
> > index 8e4a2a6..95a0ad1 100644
> > --- a/kernel/trace/Kconfig
> > +++ b/kernel/trace/Kconfig
> > @@ -34,6 +34,9 @@ config HAVE_FTRACE_MCOUNT_RECORD
> >  config HAVE_HW_BRANCH_TRACER
> >  	bool
> >  
> > +config HAVE_FTRACE_SYSCALLS
> > +	bool
> > +
> >  config TRACER_MAX_TRACE
> >  	bool
> >  
> > @@ -175,6 +178,13 @@ config EVENT_TRACER
> >  	  allowing the user to pick and choose which trace point they
> >  	  want to trace.
> >  
> > +config FTRACE_SYSCALLS
> > +	bool "Trace syscalls"
> > +	depends on HAVE_FTRACE_SYSCALLS
> > +	select TRACING
> > +	help
> > +	  Basic tracer to catch the syscall entry and exit events.
> > +
> >  config BOOT_TRACER
> >  	bool "Trace boot initcalls"
> >  	select TRACING
> > diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
> > index c7a2943..c3feea0 100644
> > --- a/kernel/trace/Makefile
> > +++ b/kernel/trace/Makefile
> > @@ -43,5 +43,6 @@ obj-$(CONFIG_BLK_DEV_IO_TRACE)	+= blktrace.o
> >  obj-$(CONFIG_EVENT_TRACER) += trace_events.o
> >  obj-$(CONFIG_EVENT_TRACER) += events.o
> >  obj-$(CONFIG_EVENT_TRACER) += trace_export.o
> > +obj-$(CONFIG_FTRACE_SYSCALLS) += trace_syscalls.o
> >  
> >  libftrace-y := ftrace.o
> > diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> > index cc94f86..5152be3 100644
> > --- a/kernel/trace/trace.c
> > +++ b/kernel/trace/trace.c
> > @@ -242,6 +242,12 @@ static DECLARE_WAIT_QUEUE_HEAD(trace_wait);
> >  unsigned long trace_flags = TRACE_ITER_PRINT_PARENT | TRACE_ITER_PRINTK |
> >  	TRACE_ITER_ANNOTATE | TRACE_ITER_CONTEXT_INFO;
> >  
> > +
> > +struct trace_array *__get_global_trace(void)
> > +{
> > +	return &global_trace;
> > +}
> > +
> >  /**
> >   * trace_wake_up - wake up tasks waiting for trace input
> >   *
> > diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
> > index 2bfb7d1..9583830 100644
> > --- a/kernel/trace/trace.h
> > +++ b/kernel/trace/trace.h
> > @@ -30,6 +30,8 @@ enum trace_type {
> >  	TRACE_GRAPH_ENT,
> >  	TRACE_USER_STACK,
> >  	TRACE_HW_BRANCHES,
> > +	TRACE_SYSCALL_ENTER,
> > +	TRACE_SYSCALL_EXIT,
> >  	TRACE_KMEM_ALLOC,
> >  	TRACE_KMEM_FREE,
> >  	TRACE_POWER,
> > @@ -192,6 +194,19 @@ struct kmemtrace_free_entry {
> >  	const void *ptr;
> >  };
> >  
> > +struct syscall_trace_enter {
> > +	struct trace_entry	ent;
> > +	enum syscall_trace_nr	nr;
> > +	unsigned long		args[];
> > +};
> > +
> > +struct syscall_trace_exit {
> > +	struct trace_entry	ent;
> > +	enum syscall_trace_nr	nr;
> > +	unsigned long		ret;
> > +};
> > +
> > +
> >  /*
> >   * trace_flag_type is an enumeration that holds different
> >   * states when a trace occurs. These are:
> > @@ -304,6 +319,10 @@ extern void __ftrace_bad_type(void);
> >  			  TRACE_KMEM_ALLOC);	\
> >  		IF_ASSIGN(var, ent, struct kmemtrace_free_entry,	\
> >  			  TRACE_KMEM_FREE);	\
> > +		IF_ASSIGN(var, ent, struct syscall_trace_enter,		\
> > +			  TRACE_SYSCALL_ENTER);				\
> > +		IF_ASSIGN(var, ent, struct syscall_trace_exit,		\
> > +			  TRACE_SYSCALL_EXIT);				\
> >  		__ftrace_bad_type();					\
> >  	} while (0)
> >  
> > @@ -568,6 +587,7 @@ extern int trace_selftest_startup_branch(struct tracer *trace,
> >  #endif /* CONFIG_FTRACE_STARTUP_TEST */
> >  
> >  extern void *head_page(struct trace_array_cpu *data);
> > +extern struct trace_array *__get_global_trace(void);
> >  extern long ns2usecs(cycle_t nsec);
> >  extern int
> >  trace_vprintk(unsigned long ip, int depth, const char *fmt, va_list args);
> > @@ -627,6 +647,18 @@ static inline int ftrace_trace_task(struct task_struct *task)
> >  	return test_tsk_trace_trace(task);
> >  }
> >  
> > +#ifdef CONFIG_FTRACE_SYSCALLS
> > +extern enum print_line_t
> > +print_syscall_enter(struct trace_iterator *iter, int flags);
> > +extern enum print_line_t
> > +print_syscall_exit(struct trace_iterator *iter, int flags);
> > +#else
> > +static inline enum print_line_t
> > +print_syscall_enter(struct trace_iterator *iter, int flags) { }
> > +static inline enum print_line_t
> > +print_syscall_exit(struct trace_iterator *iter, int flags) { }
> > +#endif
> > +
> >  /*
> >   * trace_iterator_flags is an enumeration that defines bit
> >   * positions into trace_flags that controls the output.
> > diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
> > new file mode 100644
> > index 0000000..470bb9d
> > --- /dev/null
> > +++ b/kernel/trace/trace_syscalls.c
> > @@ -0,0 +1,284 @@
> > +#include <linux/kernel.h>
> > +#include <linux/ftrace.h>
> > +#include <asm/syscall.h>
> > +
> > +#include "trace_output.h"
> > +#include "trace.h"
> > +
> > +/* Create basic entry on syscall array (with the rudimentary string mask) */
> > +#define SYS_TRACE_ENTRY(sname, args, mask)			  \
> > +	{.name = #sname, .nb_args = args, .simple_format = true,  \
> > +	.output.string_mask = mask}
> > +
> > +/*
> > + * Create a custom defined entry on the syscall array, you will have to
> > + * implement a callback to output the syscall.
> > + */
> > +#define SYS_TRACE_ENTRY_SPECIAL(sname, args, printer)		  \
> > +	{.name = #sname, .nb_args = args, .simple_format = false, \
> > +	.print = printer}
> > +
> > +static const struct syscall_trace_entry syscall_trace_entries[] = {
> > +	/* For open, the first argument is a string, hence the given mask */
> > +	[SYSCALL_TRACE_OPEN]	= SYS_TRACE_ENTRY(open, 3, 0x1),
> > +	[SYSCALL_TRACE_CLOSE]	= SYS_TRACE_ENTRY(close, 1, 0),
> > +	[SYSCALL_TRACE_READ]	= SYS_TRACE_ENTRY(read, 3, 0),
> > +	[SYSCALL_TRACE_WRITE]	= SYS_TRACE_ENTRY(read, 3, 0),
> > +};
> > +
> > +
> > +static atomic_t refcount;
> > +
> > +enum print_line_t
> > +print_syscall_enter(struct trace_iterator *iter, int flags)
> > +{
> > +	struct trace_seq *s = &iter->seq;
> > +	struct trace_entry *ent = iter->ent;
> > +	struct syscall_trace_enter *trace;
> > +	enum syscall_trace_nr syscall;
> > +	const struct syscall_trace_entry *entry;
> > +	char *str = NULL;
> > +	int i, ret;
> > +
> > +	trace_assign_type(trace, ent);
> > +
> > +	syscall = trace->nr;
> > +
> > +	if (!syscall || syscall >= __SYSCALL_TRACE_END)
> > +		return TRACE_TYPE_HANDLED;
> > +
> > +	entry = &syscall_trace_entries[syscall];
> > +	if (!entry->simple_format) {
> > +		ret = entry->output.print(trace->args);
> > +		if (!ret)
> > +			return TRACE_TYPE_PARTIAL_LINE;
> > +	}
> > +
> > +	ret = trace_seq_printf(s, "syscall %s :", entry->name);
> > +	if (!ret)
> > +		return TRACE_TYPE_PARTIAL_LINE;
> > +
> > +	for (i = 0; i < entry->nb_args; i++) {
> > +		/*
> > +		 * FIXME: the string comes from the user, but on tracing
> > +		 * output time, it may have disappeared from the memory.
> > +		 * May be better to strdup it on tracing time.
> > +		 */
> > +		if (entry->output.string_mask & (1 << i)) {
> > +			/* 128 is enough for most path */
> > +			str = strndup_user((char *)trace->args[i], 128);
> > +
> > +			/* If it faulted badly, the error shoud have been
> > +			 * handled while servicing the syscall, just ignore.
> > +			 */
> > +			if (str == ERR_PTR(-EFAULT) || str == ERR_PTR(-EINVAL)
> > +			    || str == ERR_PTR(-ENOMEM))
> > +				return TRACE_TYPE_HANDLED;
> > +
> > +			ret = trace_seq_printf(s, " %s", str);
> > +		} else {
> > +			ret = trace_seq_printf(s, " 0x%lx", trace->args[i]);
> > +		}
> > +		if (!ret)
> > +			goto end_partial;
> > +	}
> > +	if (!trace_seq_printf(s, "\n"))
> > +		goto end_partial;
> > +
> > +	kfree(str);
> > +	return TRACE_TYPE_HANDLED;
> > +
> > +end_partial:
> > +	kfree(str);
> > +	return TRACE_TYPE_PARTIAL_LINE;
> > +}
> > +
> > +enum print_line_t
> > +print_syscall_exit(struct trace_iterator *iter, int flags)
> > +{
> > +	struct trace_seq *s = &iter->seq;
> > +	struct trace_entry *ent = iter->ent;
> > +	struct syscall_trace_exit *trace;
> > +	enum syscall_trace_nr syscall;
> > +	const struct syscall_trace_entry *entry;
> > +	int ret;
> > +
> > +	trace_assign_type(trace, ent);
> > +
> > +	syscall = trace->nr;
> > +
> > +	if (!syscall || syscall >= __SYSCALL_TRACE_END)
> > +		return TRACE_TYPE_HANDLED;
> > +
> > +	entry = &syscall_trace_entries[syscall];
> > +
> > +	ret = trace_seq_printf(s, "syscall %s -> 0x%lx\n", entry->name,
> > +				trace->ret);
> > +	if (!ret)
> > +		return TRACE_TYPE_PARTIAL_LINE;
> > +
> > +	return TRACE_TYPE_HANDLED;
> > +}
> > +
> > +void start_ftrace_syscalls(void)
> > +{
> > +	unsigned long flags;
> > +	struct task_struct *g, *t;
> > +
> > +	if (atomic_inc_return(&refcount) != 1)
> > +		goto out;
> > +
> > +	read_lock_irqsave(&tasklist_lock, flags);
> > +
> > +	do_each_thread(g, t) {
> > +		set_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
> > +	} while_each_thread(g, t);
> > +
> > +	read_unlock_irqrestore(&tasklist_lock, flags);
> > +out:
> > +	atomic_dec(&refcount);
> > +}
> > +
> > +void stop_ftrace_syscalls(void)
> > +{
> > +	unsigned long flags;
> > +	struct task_struct *g, *t;
> > +
> > +	if (atomic_dec_return(&refcount))
> > +		goto out;
> > +
> > +	read_lock_irqsave(&tasklist_lock, flags);
> > +
> > +	do_each_thread(g, t) {
> > +		clear_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
> > +	} while_each_thread(g, t);
> > +
> > +	read_unlock_irqrestore(&tasklist_lock, flags);
> > +out:
> > +	atomic_inc(&refcount);
> > +}
> > +
> > +void ftrace_syscall_enter(struct pt_regs *regs)
> > +{
> > +	struct trace_array *tr = __get_global_trace();
> > +	struct syscall_trace_enter *entry;
> > +	const struct syscall_trace_entry *sys_data;
> > +	struct ring_buffer_event *event;
> > +	struct trace_array_cpu *data;
> > +	int size;
> > +	int syscall_nr;
> > +	int nr_offset;
> > +	int cpu;
> > +
> > +	syscall_nr = syscall_get_nr(current, regs);
> > +	nr_offset = arch_syscall_trace_nr[syscall_nr];
> > +	if (!nr_offset || nr_offset >= __SYSCALL_TRACE_END)
> > +		return;
> > +
> > +	cpu = raw_smp_processor_id();
> > +	data = tr->data[cpu];
> > +
> > +	if (unlikely(atomic_read(&data->disabled)))
> > +		return;
> > +
> > +	sys_data = &syscall_trace_entries[nr_offset];
> > +	size = sizeof(*entry) + sizeof(unsigned long) * sys_data->nb_args;
> > +
> > +	event = trace_buffer_lock_reserve(tr, TRACE_SYSCALL_ENTER, size, 0, 0);
> > +	if (!event)
> > +		return;
> > +
> > +	entry = ring_buffer_event_data(event);
> > +	entry->nr = nr_offset;
> > +	syscall_get_arguments(current, regs, 0, sys_data->nb_args, entry->args);
> > +
> > +	ring_buffer_unlock_commit(tr->buffer, event);
> > +	trace_wake_up();
> > +}
> > +
> > +void ftrace_syscall_exit(struct pt_regs *regs)
> > +{
> > +	struct trace_array *tr = __get_global_trace();
> > +	struct syscall_trace_exit *entry;
> > +	const struct syscall_trace_entry *sys_data;
> > +	struct ring_buffer_event *event;
> > +	struct trace_array_cpu *data;
> > +	int syscall_nr;
> > +	int nr_offset;
> > +	int cpu;
> > +
> > +	syscall_nr = syscall_get_nr(current, regs);
> > +	nr_offset = arch_syscall_trace_nr[syscall_nr];
> > +	if (!nr_offset || nr_offset >= __SYSCALL_TRACE_END)
> > +		return;
> > +
> > +	cpu = raw_smp_processor_id();
> > +	data = tr->data[cpu];
> > +
> > +	if (unlikely(atomic_read(&data->disabled)))
> > +		return;
> > +
> > +	sys_data = &syscall_trace_entries[nr_offset];
> > +
> > +	event = trace_buffer_lock_reserve(tr, TRACE_SYSCALL_EXIT,
> > +				sizeof(*entry), 0, 0);
> > +	if (!event)
> > +		return;
> > +
> > +	entry = ring_buffer_event_data(event);
> > +	entry->nr = nr_offset;
> > +	entry->ret = syscall_get_return_value(current, regs);
> > +
> > +	ring_buffer_unlock_commit(tr->buffer, event);
> > +	trace_wake_up();
> > +}
> > +
> > +static int init_syscall_tracer(struct trace_array *tr)
> > +{
> > +	start_ftrace_syscalls();
> > +
> > +	return 0;
> > +}
> > +
> > +static void reset_syscall_tracer(struct trace_array *tr)
> > +{
> > +	stop_ftrace_syscalls();
> > +}
> > +
> > +static struct trace_event syscall_enter_event = {
> > +	.type	 	= TRACE_SYSCALL_ENTER,
> > +	.trace		= print_syscall_enter,
> > +};
> > +
> > +static struct trace_event syscall_exit_event = {
> > +	.type	 	= TRACE_SYSCALL_EXIT,
> > +	.trace		= print_syscall_exit,
> > +};
> > +
> > +static struct tracer syscall_tracer __read_mostly = {
> > +	.name	     	= "syscall",
> > +	.init		= init_syscall_tracer,
> > +	.reset		= reset_syscall_tracer
> > +};
> > +
> > +__init int register_ftrace_syscalls(void)
> > +{
> > +	int ret;
> > +
> > +	ret = register_ftrace_event(&syscall_enter_event);
> > +	if (!ret) {
> > +		printk(KERN_WARNING "event %d failed to register\n",
> > +		       syscall_enter_event.type);
> > +		WARN_ON_ONCE(1);
> > +	}
> > +
> > +	ret = register_ftrace_event(&syscall_exit_event);
> > +	if (!ret) {
> > +		printk(KERN_WARNING "event %d failed to register\n",
> > +		       syscall_exit_event.type);
> > +		WARN_ON_ONCE(1);
> > +	}
> > +
> > +	return register_tracer(&syscall_tracer);
> > +}
> > +device_initcall(register_ftrace_syscalls);
> 
> -- 
> Masami Hiramatsu
> 
> Software Engineer
> Hitachi Computer Products (America) Inc.
> Software Solutions Division
> 
> e-mail: mhiramat@redhat.com
> 


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure
  2009-03-16 20:15     ` Frederic Weisbecker
@ 2009-03-16 20:38       ` Masami Hiramatsu
  2009-03-16 21:45         ` Mathieu Desnoyers
  0 siblings, 1 reply; 40+ messages in thread
From: Masami Hiramatsu @ 2009-03-16 20:38 UTC (permalink / raw)
  To: Frederic Weisbecker, Roland McGrath, Frank Ch. Eigler
  Cc: Ingo Molnar, LKML, Lai Jiangshan, Steven Rostedt, Peter Zijlstra,
	Mathieu Desnoyers, Jiaying Zhang, Martin Bligh

Frederic Weisbecker wrote:
> On Mon, Mar 16, 2009 at 03:36:58PM -0400, Masami Hiramatsu wrote:
>> Hi Frederic,
>>
>> Out of curiously, could you tell me why did you introduce new TIF instead of
>> using utrace's tracehook API?
>> For example, systemtap already use current utrace/tracehook api for tracing
>> all syscalls. If there is any new advantage, it would be good to enhance
>> existing utrace/tracehook.
>>
>> Thank you,
> 
> 
> Hi,
> 
> Because the tracehooks rely on ptrace work and both ftrace and ptrace would overlap
> in a same flag, creating racy issues. At least ss far as I understood it.
> How does systemtap manage it?

As far as I know, utrace supports multiple trace-engines on a process.
Since ptrace is just an engine of utrace, you can add another engine on utrace.

utrace-+-ptrace_engine---owner_process
       |
       +-systemtap_module
       |
       +-ftrace_plugin

Here, Frank had posted an example of utrace->ftrace engine.
http://lkml.org/lkml/2009/1/27/294

And here is the latest his patch(which seems to support syscall tracing...)
http://git.kernel.org/?p=linux/kernel/git/frob/linux-2.6-utrace.git;a=blob;f=kernel/trace/trace_process.c;h=619815f6c2543d0d82824139773deb4ca460a280;hb=ab20efa8d8b5ded96e8f8c3663dda3b4cb532124

Frank, Roland, could you explain that?
Perhaps, we'd better discuss utrace on LKML for avoiding any redundant
code problems.

Thank you,

> 
> Thanks.
> 
>> Frederic Weisbecker wrote:
>>> The most important drawback that implies the syscall tracing is the variable
>>> number of parameters that takes a syscall. Assuming there is a large number of
>>> syscall, the goal is to provide as much as possible a generic layer on it.
>>>
>>> What we want on tracing time (the hot path):
>>>
>>> - Save the parameters and the number of the syscall. That's all. The rest of the
>>>   job can be done on the "output path".
>>>
>>> On output time:
>>>
>>> - Get the raw-binary saved parameters and the syscall number as well as some
>>>   usual infos such as the pid, the time...
>>> - Depending on the tracing requirements, the user should have the choice to
>>>   dump the raw datas or a formatted version.
>>>
>>> The core infrastructure has a static array which contains the necessary
>>> informations for each syscall (ok, I've only implemented 4 for now).
>>>
>>> When a syscall triggers, the tracer look at this array (O(1) access) and get the
>>> number of parameters for this syscall. Then these parameters are saved in a
>>> binary form on the ring buffer with the syscall number (a generic version, arch
>>> independent).
>>>
>>> On output time the parameters, the name of the syscall and very basic
>>> informations to format the parameters are picket from the ring-buffer and the
>>> generic syscall array.
>>>
>>> Depending of the level of implementation we want for the syscall inside this
>>> array, we can provide just a single mask where the n bit matches the n
>>> parameter. If the bit is set, then the parameter will be considered as a string
>>> pointer. Otherwise its raw hexadecimal value will be printed.
>>>
>>> If needed, we can otherwise provide a specific callback to print the syscall
>>> event.
>>>
>>> The probem with these approaches comes from the fact that the string might have
>>> disappeared from the user memory on output time. Even if we protect against
>>> faults, it can give unreliable traces.
>>> So perhaps we could think about copying the string on tracing time.
>>>
>>> Concerning the return value, it doesn't make any problem, a single raw
>>> value is printed for each one.
>>>
>>> Note that the current implementation can be easily scaled further to give only
>>> binary informations to the user and the matching metadata to decode it.
>>>
>>> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
>>> ---
>>>  include/linux/ftrace.h        |   44 +++++++
>>>  kernel/trace/Kconfig          |   10 ++
>>>  kernel/trace/Makefile         |    1 +
>>>  kernel/trace/trace.c          |    6 +
>>>  kernel/trace/trace.h          |   32 +++++
>>>  kernel/trace/trace_syscalls.c |  284 +++++++++++++++++++++++++++++++++++++++++
>>>  6 files changed, 377 insertions(+), 0 deletions(-)
>>>  create mode 100644 kernel/trace/trace_syscalls.c
>>>
>>> diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
>>> index e1583f2..1798692 100644
>>> --- a/include/linux/ftrace.h
>>> +++ b/include/linux/ftrace.h
>>> @@ -503,4 +503,48 @@ static inline void trace_hw_branch_oops(void) {}
>>>  
>>>  #endif /* CONFIG_HW_BRANCH_TRACER */
>>>  
>>> +/* Index for the ftrace syscalls array */
>>> +enum syscall_trace_nr {
>>> +	SYSCALL_TRACE_OPEN = 1,
>>> +	SYSCALL_TRACE_CLOSE,
>>> +	SYSCALL_TRACE_READ,
>>> +	SYSCALL_TRACE_WRITE,
>>> +
>>> +	__SYSCALL_TRACE_END
>>> +};
>>> +
>>> +/*
>>> + * A syscall entry in the ftrace syscalls array.
>>> + *
>>> + * @name: name of the syscall
>>> + * @nb_args: number of parameters it takes
>>> + * @simple_format: if true, we use string_mask, otherwise the print callback
>>> + * @string_mask: rudimentary format mask. If bit nr is set, the nr parameter
>>> + *		 will be displayed as a string, otherwise it will be considered
>>> + *		 as a raw number (hex displayed).
>>> + * @print: implement it if you want a custom output for a given syscall
>>> + */
>>> +struct syscall_trace_entry {
>>> +	const char	*name;
>>> +	int		nb_args;
>>> +	bool		simple_format;
>>> +	union {
>>> +		unsigned long string_mask;
>>> +		int (*print)(unsigned long *);
>>> +	} output;
>>> +};
>>> +
>>> +#ifdef CONFIG_FTRACE_SYSCALLS
>>> +extern short arch_syscall_trace_nr[];
>>> +extern void start_ftrace_syscalls(void);
>>> +extern void stop_ftrace_syscalls(void);
>>> +extern void ftrace_syscall_enter(struct pt_regs *regs);
>>> +extern void ftrace_syscall_exit(struct pt_regs *regs);
>>> +#else
>>> +static inline void start_ftrace_syscalls(void) { }
>>> +static inline void stop_ftrace_syscalls(void) { }
>>> +static inline void ftrace_syscall_enter(struct pt_regs *regs) { }
>>> +static inline void ftrace_syscall_exit(struct pt_regs *regs) { }
>>> +#endif
>>> +
>>>  #endif /* _LINUX_FTRACE_H */
>>> diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
>>> index 8e4a2a6..95a0ad1 100644
>>> --- a/kernel/trace/Kconfig
>>> +++ b/kernel/trace/Kconfig
>>> @@ -34,6 +34,9 @@ config HAVE_FTRACE_MCOUNT_RECORD
>>>  config HAVE_HW_BRANCH_TRACER
>>>  	bool
>>>  
>>> +config HAVE_FTRACE_SYSCALLS
>>> +	bool
>>> +
>>>  config TRACER_MAX_TRACE
>>>  	bool
>>>  
>>> @@ -175,6 +178,13 @@ config EVENT_TRACER
>>>  	  allowing the user to pick and choose which trace point they
>>>  	  want to trace.
>>>  
>>> +config FTRACE_SYSCALLS
>>> +	bool "Trace syscalls"
>>> +	depends on HAVE_FTRACE_SYSCALLS
>>> +	select TRACING
>>> +	help
>>> +	  Basic tracer to catch the syscall entry and exit events.
>>> +
>>>  config BOOT_TRACER
>>>  	bool "Trace boot initcalls"
>>>  	select TRACING
>>> diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
>>> index c7a2943..c3feea0 100644
>>> --- a/kernel/trace/Makefile
>>> +++ b/kernel/trace/Makefile
>>> @@ -43,5 +43,6 @@ obj-$(CONFIG_BLK_DEV_IO_TRACE)	+= blktrace.o
>>>  obj-$(CONFIG_EVENT_TRACER) += trace_events.o
>>>  obj-$(CONFIG_EVENT_TRACER) += events.o
>>>  obj-$(CONFIG_EVENT_TRACER) += trace_export.o
>>> +obj-$(CONFIG_FTRACE_SYSCALLS) += trace_syscalls.o
>>>  
>>>  libftrace-y := ftrace.o
>>> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
>>> index cc94f86..5152be3 100644
>>> --- a/kernel/trace/trace.c
>>> +++ b/kernel/trace/trace.c
>>> @@ -242,6 +242,12 @@ static DECLARE_WAIT_QUEUE_HEAD(trace_wait);
>>>  unsigned long trace_flags = TRACE_ITER_PRINT_PARENT | TRACE_ITER_PRINTK |
>>>  	TRACE_ITER_ANNOTATE | TRACE_ITER_CONTEXT_INFO;
>>>  
>>> +
>>> +struct trace_array *__get_global_trace(void)
>>> +{
>>> +	return &global_trace;
>>> +}
>>> +
>>>  /**
>>>   * trace_wake_up - wake up tasks waiting for trace input
>>>   *
>>> diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
>>> index 2bfb7d1..9583830 100644
>>> --- a/kernel/trace/trace.h
>>> +++ b/kernel/trace/trace.h
>>> @@ -30,6 +30,8 @@ enum trace_type {
>>>  	TRACE_GRAPH_ENT,
>>>  	TRACE_USER_STACK,
>>>  	TRACE_HW_BRANCHES,
>>> +	TRACE_SYSCALL_ENTER,
>>> +	TRACE_SYSCALL_EXIT,
>>>  	TRACE_KMEM_ALLOC,
>>>  	TRACE_KMEM_FREE,
>>>  	TRACE_POWER,
>>> @@ -192,6 +194,19 @@ struct kmemtrace_free_entry {
>>>  	const void *ptr;
>>>  };
>>>  
>>> +struct syscall_trace_enter {
>>> +	struct trace_entry	ent;
>>> +	enum syscall_trace_nr	nr;
>>> +	unsigned long		args[];
>>> +};
>>> +
>>> +struct syscall_trace_exit {
>>> +	struct trace_entry	ent;
>>> +	enum syscall_trace_nr	nr;
>>> +	unsigned long		ret;
>>> +};
>>> +
>>> +
>>>  /*
>>>   * trace_flag_type is an enumeration that holds different
>>>   * states when a trace occurs. These are:
>>> @@ -304,6 +319,10 @@ extern void __ftrace_bad_type(void);
>>>  			  TRACE_KMEM_ALLOC);	\
>>>  		IF_ASSIGN(var, ent, struct kmemtrace_free_entry,	\
>>>  			  TRACE_KMEM_FREE);	\
>>> +		IF_ASSIGN(var, ent, struct syscall_trace_enter,		\
>>> +			  TRACE_SYSCALL_ENTER);				\
>>> +		IF_ASSIGN(var, ent, struct syscall_trace_exit,		\
>>> +			  TRACE_SYSCALL_EXIT);				\
>>>  		__ftrace_bad_type();					\
>>>  	} while (0)
>>>  
>>> @@ -568,6 +587,7 @@ extern int trace_selftest_startup_branch(struct tracer *trace,
>>>  #endif /* CONFIG_FTRACE_STARTUP_TEST */
>>>  
>>>  extern void *head_page(struct trace_array_cpu *data);
>>> +extern struct trace_array *__get_global_trace(void);
>>>  extern long ns2usecs(cycle_t nsec);
>>>  extern int
>>>  trace_vprintk(unsigned long ip, int depth, const char *fmt, va_list args);
>>> @@ -627,6 +647,18 @@ static inline int ftrace_trace_task(struct task_struct *task)
>>>  	return test_tsk_trace_trace(task);
>>>  }
>>>  
>>> +#ifdef CONFIG_FTRACE_SYSCALLS
>>> +extern enum print_line_t
>>> +print_syscall_enter(struct trace_iterator *iter, int flags);
>>> +extern enum print_line_t
>>> +print_syscall_exit(struct trace_iterator *iter, int flags);
>>> +#else
>>> +static inline enum print_line_t
>>> +print_syscall_enter(struct trace_iterator *iter, int flags) { }
>>> +static inline enum print_line_t
>>> +print_syscall_exit(struct trace_iterator *iter, int flags) { }
>>> +#endif
>>> +
>>>  /*
>>>   * trace_iterator_flags is an enumeration that defines bit
>>>   * positions into trace_flags that controls the output.
>>> diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
>>> new file mode 100644
>>> index 0000000..470bb9d
>>> --- /dev/null
>>> +++ b/kernel/trace/trace_syscalls.c
>>> @@ -0,0 +1,284 @@
>>> +#include <linux/kernel.h>
>>> +#include <linux/ftrace.h>
>>> +#include <asm/syscall.h>
>>> +
>>> +#include "trace_output.h"
>>> +#include "trace.h"
>>> +
>>> +/* Create basic entry on syscall array (with the rudimentary string mask) */
>>> +#define SYS_TRACE_ENTRY(sname, args, mask)			  \
>>> +	{.name = #sname, .nb_args = args, .simple_format = true,  \
>>> +	.output.string_mask = mask}
>>> +
>>> +/*
>>> + * Create a custom defined entry on the syscall array, you will have to
>>> + * implement a callback to output the syscall.
>>> + */
>>> +#define SYS_TRACE_ENTRY_SPECIAL(sname, args, printer)		  \
>>> +	{.name = #sname, .nb_args = args, .simple_format = false, \
>>> +	.print = printer}
>>> +
>>> +static const struct syscall_trace_entry syscall_trace_entries[] = {
>>> +	/* For open, the first argument is a string, hence the given mask */
>>> +	[SYSCALL_TRACE_OPEN]	= SYS_TRACE_ENTRY(open, 3, 0x1),
>>> +	[SYSCALL_TRACE_CLOSE]	= SYS_TRACE_ENTRY(close, 1, 0),
>>> +	[SYSCALL_TRACE_READ]	= SYS_TRACE_ENTRY(read, 3, 0),
>>> +	[SYSCALL_TRACE_WRITE]	= SYS_TRACE_ENTRY(read, 3, 0),
>>> +};
>>> +
>>> +
>>> +static atomic_t refcount;
>>> +
>>> +enum print_line_t
>>> +print_syscall_enter(struct trace_iterator *iter, int flags)
>>> +{
>>> +	struct trace_seq *s = &iter->seq;
>>> +	struct trace_entry *ent = iter->ent;
>>> +	struct syscall_trace_enter *trace;
>>> +	enum syscall_trace_nr syscall;
>>> +	const struct syscall_trace_entry *entry;
>>> +	char *str = NULL;
>>> +	int i, ret;
>>> +
>>> +	trace_assign_type(trace, ent);
>>> +
>>> +	syscall = trace->nr;
>>> +
>>> +	if (!syscall || syscall >= __SYSCALL_TRACE_END)
>>> +		return TRACE_TYPE_HANDLED;
>>> +
>>> +	entry = &syscall_trace_entries[syscall];
>>> +	if (!entry->simple_format) {
>>> +		ret = entry->output.print(trace->args);
>>> +		if (!ret)
>>> +			return TRACE_TYPE_PARTIAL_LINE;
>>> +	}
>>> +
>>> +	ret = trace_seq_printf(s, "syscall %s :", entry->name);
>>> +	if (!ret)
>>> +		return TRACE_TYPE_PARTIAL_LINE;
>>> +
>>> +	for (i = 0; i < entry->nb_args; i++) {
>>> +		/*
>>> +		 * FIXME: the string comes from the user, but on tracing
>>> +		 * output time, it may have disappeared from the memory.
>>> +		 * May be better to strdup it on tracing time.
>>> +		 */
>>> +		if (entry->output.string_mask & (1 << i)) {
>>> +			/* 128 is enough for most path */
>>> +			str = strndup_user((char *)trace->args[i], 128);
>>> +
>>> +			/* If it faulted badly, the error shoud have been
>>> +			 * handled while servicing the syscall, just ignore.
>>> +			 */
>>> +			if (str == ERR_PTR(-EFAULT) || str == ERR_PTR(-EINVAL)
>>> +			    || str == ERR_PTR(-ENOMEM))
>>> +				return TRACE_TYPE_HANDLED;
>>> +
>>> +			ret = trace_seq_printf(s, " %s", str);
>>> +		} else {
>>> +			ret = trace_seq_printf(s, " 0x%lx", trace->args[i]);
>>> +		}
>>> +		if (!ret)
>>> +			goto end_partial;
>>> +	}
>>> +	if (!trace_seq_printf(s, "\n"))
>>> +		goto end_partial;
>>> +
>>> +	kfree(str);
>>> +	return TRACE_TYPE_HANDLED;
>>> +
>>> +end_partial:
>>> +	kfree(str);
>>> +	return TRACE_TYPE_PARTIAL_LINE;
>>> +}
>>> +
>>> +enum print_line_t
>>> +print_syscall_exit(struct trace_iterator *iter, int flags)
>>> +{
>>> +	struct trace_seq *s = &iter->seq;
>>> +	struct trace_entry *ent = iter->ent;
>>> +	struct syscall_trace_exit *trace;
>>> +	enum syscall_trace_nr syscall;
>>> +	const struct syscall_trace_entry *entry;
>>> +	int ret;
>>> +
>>> +	trace_assign_type(trace, ent);
>>> +
>>> +	syscall = trace->nr;
>>> +
>>> +	if (!syscall || syscall >= __SYSCALL_TRACE_END)
>>> +		return TRACE_TYPE_HANDLED;
>>> +
>>> +	entry = &syscall_trace_entries[syscall];
>>> +
>>> +	ret = trace_seq_printf(s, "syscall %s -> 0x%lx\n", entry->name,
>>> +				trace->ret);
>>> +	if (!ret)
>>> +		return TRACE_TYPE_PARTIAL_LINE;
>>> +
>>> +	return TRACE_TYPE_HANDLED;
>>> +}
>>> +
>>> +void start_ftrace_syscalls(void)
>>> +{
>>> +	unsigned long flags;
>>> +	struct task_struct *g, *t;
>>> +
>>> +	if (atomic_inc_return(&refcount) != 1)
>>> +		goto out;
>>> +
>>> +	read_lock_irqsave(&tasklist_lock, flags);
>>> +
>>> +	do_each_thread(g, t) {
>>> +		set_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
>>> +	} while_each_thread(g, t);
>>> +
>>> +	read_unlock_irqrestore(&tasklist_lock, flags);
>>> +out:
>>> +	atomic_dec(&refcount);
>>> +}
>>> +
>>> +void stop_ftrace_syscalls(void)
>>> +{
>>> +	unsigned long flags;
>>> +	struct task_struct *g, *t;
>>> +
>>> +	if (atomic_dec_return(&refcount))
>>> +		goto out;
>>> +
>>> +	read_lock_irqsave(&tasklist_lock, flags);
>>> +
>>> +	do_each_thread(g, t) {
>>> +		clear_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
>>> +	} while_each_thread(g, t);
>>> +
>>> +	read_unlock_irqrestore(&tasklist_lock, flags);
>>> +out:
>>> +	atomic_inc(&refcount);
>>> +}
>>> +
>>> +void ftrace_syscall_enter(struct pt_regs *regs)
>>> +{
>>> +	struct trace_array *tr = __get_global_trace();
>>> +	struct syscall_trace_enter *entry;
>>> +	const struct syscall_trace_entry *sys_data;
>>> +	struct ring_buffer_event *event;
>>> +	struct trace_array_cpu *data;
>>> +	int size;
>>> +	int syscall_nr;
>>> +	int nr_offset;
>>> +	int cpu;
>>> +
>>> +	syscall_nr = syscall_get_nr(current, regs);
>>> +	nr_offset = arch_syscall_trace_nr[syscall_nr];
>>> +	if (!nr_offset || nr_offset >= __SYSCALL_TRACE_END)
>>> +		return;
>>> +
>>> +	cpu = raw_smp_processor_id();
>>> +	data = tr->data[cpu];
>>> +
>>> +	if (unlikely(atomic_read(&data->disabled)))
>>> +		return;
>>> +
>>> +	sys_data = &syscall_trace_entries[nr_offset];
>>> +	size = sizeof(*entry) + sizeof(unsigned long) * sys_data->nb_args;
>>> +
>>> +	event = trace_buffer_lock_reserve(tr, TRACE_SYSCALL_ENTER, size, 0, 0);
>>> +	if (!event)
>>> +		return;
>>> +
>>> +	entry = ring_buffer_event_data(event);
>>> +	entry->nr = nr_offset;
>>> +	syscall_get_arguments(current, regs, 0, sys_data->nb_args, entry->args);
>>> +
>>> +	ring_buffer_unlock_commit(tr->buffer, event);
>>> +	trace_wake_up();
>>> +}
>>> +
>>> +void ftrace_syscall_exit(struct pt_regs *regs)
>>> +{
>>> +	struct trace_array *tr = __get_global_trace();
>>> +	struct syscall_trace_exit *entry;
>>> +	const struct syscall_trace_entry *sys_data;
>>> +	struct ring_buffer_event *event;
>>> +	struct trace_array_cpu *data;
>>> +	int syscall_nr;
>>> +	int nr_offset;
>>> +	int cpu;
>>> +
>>> +	syscall_nr = syscall_get_nr(current, regs);
>>> +	nr_offset = arch_syscall_trace_nr[syscall_nr];
>>> +	if (!nr_offset || nr_offset >= __SYSCALL_TRACE_END)
>>> +		return;
>>> +
>>> +	cpu = raw_smp_processor_id();
>>> +	data = tr->data[cpu];
>>> +
>>> +	if (unlikely(atomic_read(&data->disabled)))
>>> +		return;
>>> +
>>> +	sys_data = &syscall_trace_entries[nr_offset];
>>> +
>>> +	event = trace_buffer_lock_reserve(tr, TRACE_SYSCALL_EXIT,
>>> +				sizeof(*entry), 0, 0);
>>> +	if (!event)
>>> +		return;
>>> +
>>> +	entry = ring_buffer_event_data(event);
>>> +	entry->nr = nr_offset;
>>> +	entry->ret = syscall_get_return_value(current, regs);
>>> +
>>> +	ring_buffer_unlock_commit(tr->buffer, event);
>>> +	trace_wake_up();
>>> +}
>>> +
>>> +static int init_syscall_tracer(struct trace_array *tr)
>>> +{
>>> +	start_ftrace_syscalls();
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +static void reset_syscall_tracer(struct trace_array *tr)
>>> +{
>>> +	stop_ftrace_syscalls();
>>> +}
>>> +
>>> +static struct trace_event syscall_enter_event = {
>>> +	.type	 	= TRACE_SYSCALL_ENTER,
>>> +	.trace		= print_syscall_enter,
>>> +};
>>> +
>>> +static struct trace_event syscall_exit_event = {
>>> +	.type	 	= TRACE_SYSCALL_EXIT,
>>> +	.trace		= print_syscall_exit,
>>> +};
>>> +
>>> +static struct tracer syscall_tracer __read_mostly = {
>>> +	.name	     	= "syscall",
>>> +	.init		= init_syscall_tracer,
>>> +	.reset		= reset_syscall_tracer
>>> +};
>>> +
>>> +__init int register_ftrace_syscalls(void)
>>> +{
>>> +	int ret;
>>> +
>>> +	ret = register_ftrace_event(&syscall_enter_event);
>>> +	if (!ret) {
>>> +		printk(KERN_WARNING "event %d failed to register\n",
>>> +		       syscall_enter_event.type);
>>> +		WARN_ON_ONCE(1);
>>> +	}
>>> +
>>> +	ret = register_ftrace_event(&syscall_exit_event);
>>> +	if (!ret) {
>>> +		printk(KERN_WARNING "event %d failed to register\n",
>>> +		       syscall_exit_event.type);
>>> +		WARN_ON_ONCE(1);
>>> +	}
>>> +
>>> +	return register_tracer(&syscall_tracer);
>>> +}
>>> +device_initcall(register_ftrace_syscalls);
>> -- 
>> Masami Hiramatsu
>>
>> Software Engineer
>> Hitachi Computer Products (America) Inc.
>> Software Solutions Division
>>
>> e-mail: mhiramat@redhat.com
>>
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure
  2009-03-16 20:38       ` Masami Hiramatsu
@ 2009-03-16 21:45         ` Mathieu Desnoyers
  2009-03-16 22:18           ` Frank Ch. Eigler
  2009-03-17  5:24           ` Oleg Nesterov
  0 siblings, 2 replies; 40+ messages in thread
From: Mathieu Desnoyers @ 2009-03-16 21:45 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Frederic Weisbecker, Roland McGrath, Frank Ch. Eigler,
	Ingo Molnar, LKML, Lai Jiangshan, Steven Rostedt, Peter Zijlstra,
	Jiaying Zhang, Martin Bligh

* Masami Hiramatsu (mhiramat@redhat.com) wrote:
> Frederic Weisbecker wrote:
> > On Mon, Mar 16, 2009 at 03:36:58PM -0400, Masami Hiramatsu wrote:
> >> Hi Frederic,
> >>
> >> Out of curiously, could you tell me why did you introduce new TIF instead of
> >> using utrace's tracehook API?
> >> For example, systemtap already use current utrace/tracehook api for tracing
> >> all syscalls. If there is any new advantage, it would be good to enhance
> >> existing utrace/tracehook.
> >>
> >> Thank you,
> > 
> > 
> > Hi,
> > 
> > Because the tracehooks rely on ptrace work and both ftrace and ptrace would overlap
> > in a same flag, creating racy issues. At least ss far as I understood it.
> > How does systemtap manage it?
> 
> As far as I know, utrace supports multiple trace-engines on a process.
> Since ptrace is just an engine of utrace, you can add another engine on utrace.
> 
> utrace-+-ptrace_engine---owner_process
>        |
>        +-systemtap_module
>        |
>        +-ftrace_plugin
> 
> Here, Frank had posted an example of utrace->ftrace engine.
> http://lkml.org/lkml/2009/1/27/294
> 
> And here is the latest his patch(which seems to support syscall tracing...)
> http://git.kernel.org/?p=linux/kernel/git/frob/linux-2.6-utrace.git;a=blob;f=kernel/trace/trace_process.c;h=619815f6c2543d0d82824139773deb4ca460a280;hb=ab20efa8d8b5ded96e8f8c3663dda3b4cb532124
> 

Reminder : we are looking at system-wide tracing here. Here are some
comments about the current utrace implementation.

Looking at include/linux/utrace.h from the tree

17  * A tracing engine starts by calling utrace_attach_task() or
18  * utrace_attach_pid() on the chosen thread, passing in a set of hooks
19  * (&struct utrace_engine_ops), and some associated data.  This produces a
20  * &struct utrace_engine, which is the handle used for all other
21  * operations.  An attached engine has its ops vector, its data, and an
22  * event mask controlled by utrace_set_events().

So if the system has, say 3000 threads, then we have 3000 struct
utrace_engine created ? I wonder what effet this could have one
cachelines if this is used to trace hot paths like system call
entry/exit. Have you benchmarked this kind of scenario under tbench ?


24  * For each event bit that is set, that engine will get the
25  * appropriate ops->report_*() callback when the event occurs.  The
26  * &struct utrace_engine_ops need not provide callbacks for an event
27  * unless the engine sets one of the associated event bits.

Looking at utrace_set_events(), we seem to be limited to 32 events on a
32-bits architectures because it uses a bitmask ? Isn't it a bit small?


682 /**
683  * utrace_set_events_pid - choose which event reports a tracing engine gets
684  * @pid:                thread to affect
685  * @engine:             attached engine to affect
686  * @eventmask:          new event mask
687  *
688  * This is the same as utrace_set_events(), but takes a &struct pid
689  * pointer rather than a &struct task_struct pointer.  The caller must
690  * hold a ref on @pid, but does not need to worry about the task
691  * staying valid.  If it's been reaped so that @pid points nowhere,
692  * then this call returns -%ESRCH.


Comments like "but does not need to worry about the task staying valid"
does not make me feel safe and comfortable at all, could you explain
how you can assume that derefencing an "invalid" pointer will return
NULL ?

About the utrace_attach_task() :

244         if (unlikely(target->flags & PF_KTHREAD))
245                 /*
246                  * Silly kernel, utrace is for users!
247                  */
248                 return ERR_PTR(-EPERM);

So we cannot trace kernel threads ?


118 /*
119  * Called without locks, when we might be the first utrace engine to attach.
120  * If this is a newborn thread and we are not the creator, we have to wait
121  * for it.  The creator gets the first chance to attach.  The PF_STARTING
122  * flag is cleared after its report_clone hook has had a chance to run.
123  */
124 static inline int utrace_attach_delay(struct task_struct *target)
125 {
126         if ((target->flags & PF_STARTING) && target->real_parent != current)
127                 do {
128                         schedule_timeout_interruptible(1);
129                         if (signal_pending(current))
130                                 return -ERESTARTNOINTR;
131                 } while (target->flags & PF_STARTING);
132
133         return 0;
134 }

Why do we absolutely have to poll until the thread has started ?


utrace_add_engine()
  set_notify_resume(target);

ok, so this is where the TIF_NOTIFY_RESUME thread flag is set. I notice
that it is set asynchronously with the execution of the target thread
(as I do with my TIF_KERNEL_TRACE thread flag).

However, on x86_64, _TIF_DO_NOTIFY_MASK is only tested in 
entry_64.S

int_signal:
and
retint_signal:

code paths. However, if there is no syscall tracing to do upon syscall
entry, the thread flags are not re-read at syscall exit and you will
miss the syscall exit returning from your target thread if this thread
was blocked while you set its TIF_NOTIFY_RESUME. Or is it handled in
some subtle way I did not figure out ? BTW re-reading the TIF flags from
the thread_info at syscall exit on the fast path is out of question
because it considerably degrades the kernel performances. entry_*.S is
a very, very critical path.

Mathieu


> Frank, Roland, could you explain that?
> Perhaps, we'd better discuss utrace on LKML for avoiding any redundant
> code problems.
> 
> Thank you,
> 
> > 
> > Thanks.
> > 
> >> Frederic Weisbecker wrote:
> >>> The most important drawback that implies the syscall tracing is the variable
> >>> number of parameters that takes a syscall. Assuming there is a large number of
> >>> syscall, the goal is to provide as much as possible a generic layer on it.
> >>>
> >>> What we want on tracing time (the hot path):
> >>>
> >>> - Save the parameters and the number of the syscall. That's all. The rest of the
> >>>   job can be done on the "output path".
> >>>
> >>> On output time:
> >>>
> >>> - Get the raw-binary saved parameters and the syscall number as well as some
> >>>   usual infos such as the pid, the time...
> >>> - Depending on the tracing requirements, the user should have the choice to
> >>>   dump the raw datas or a formatted version.
> >>>
> >>> The core infrastructure has a static array which contains the necessary
> >>> informations for each syscall (ok, I've only implemented 4 for now).
> >>>
> >>> When a syscall triggers, the tracer look at this array (O(1) access) and get the
> >>> number of parameters for this syscall. Then these parameters are saved in a
> >>> binary form on the ring buffer with the syscall number (a generic version, arch
> >>> independent).
> >>>
> >>> On output time the parameters, the name of the syscall and very basic
> >>> informations to format the parameters are picket from the ring-buffer and the
> >>> generic syscall array.
> >>>
> >>> Depending of the level of implementation we want for the syscall inside this
> >>> array, we can provide just a single mask where the n bit matches the n
> >>> parameter. If the bit is set, then the parameter will be considered as a string
> >>> pointer. Otherwise its raw hexadecimal value will be printed.
> >>>
> >>> If needed, we can otherwise provide a specific callback to print the syscall
> >>> event.
> >>>
> >>> The probem with these approaches comes from the fact that the string might have
> >>> disappeared from the user memory on output time. Even if we protect against
> >>> faults, it can give unreliable traces.
> >>> So perhaps we could think about copying the string on tracing time.
> >>>
> >>> Concerning the return value, it doesn't make any problem, a single raw
> >>> value is printed for each one.
> >>>
> >>> Note that the current implementation can be easily scaled further to give only
> >>> binary informations to the user and the matching metadata to decode it.
> >>>
> >>> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> >>> ---
> >>>  include/linux/ftrace.h        |   44 +++++++
> >>>  kernel/trace/Kconfig          |   10 ++
> >>>  kernel/trace/Makefile         |    1 +
> >>>  kernel/trace/trace.c          |    6 +
> >>>  kernel/trace/trace.h          |   32 +++++
> >>>  kernel/trace/trace_syscalls.c |  284 +++++++++++++++++++++++++++++++++++++++++
> >>>  6 files changed, 377 insertions(+), 0 deletions(-)
> >>>  create mode 100644 kernel/trace/trace_syscalls.c
> >>>
> >>> diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
> >>> index e1583f2..1798692 100644
> >>> --- a/include/linux/ftrace.h
> >>> +++ b/include/linux/ftrace.h
> >>> @@ -503,4 +503,48 @@ static inline void trace_hw_branch_oops(void) {}
> >>>  
> >>>  #endif /* CONFIG_HW_BRANCH_TRACER */
> >>>  
> >>> +/* Index for the ftrace syscalls array */
> >>> +enum syscall_trace_nr {
> >>> +	SYSCALL_TRACE_OPEN = 1,
> >>> +	SYSCALL_TRACE_CLOSE,
> >>> +	SYSCALL_TRACE_READ,
> >>> +	SYSCALL_TRACE_WRITE,
> >>> +
> >>> +	__SYSCALL_TRACE_END
> >>> +};
> >>> +
> >>> +/*
> >>> + * A syscall entry in the ftrace syscalls array.
> >>> + *
> >>> + * @name: name of the syscall
> >>> + * @nb_args: number of parameters it takes
> >>> + * @simple_format: if true, we use string_mask, otherwise the print callback
> >>> + * @string_mask: rudimentary format mask. If bit nr is set, the nr parameter
> >>> + *		 will be displayed as a string, otherwise it will be considered
> >>> + *		 as a raw number (hex displayed).
> >>> + * @print: implement it if you want a custom output for a given syscall
> >>> + */
> >>> +struct syscall_trace_entry {
> >>> +	const char	*name;
> >>> +	int		nb_args;
> >>> +	bool		simple_format;
> >>> +	union {
> >>> +		unsigned long string_mask;
> >>> +		int (*print)(unsigned long *);
> >>> +	} output;
> >>> +};
> >>> +
> >>> +#ifdef CONFIG_FTRACE_SYSCALLS
> >>> +extern short arch_syscall_trace_nr[];
> >>> +extern void start_ftrace_syscalls(void);
> >>> +extern void stop_ftrace_syscalls(void);
> >>> +extern void ftrace_syscall_enter(struct pt_regs *regs);
> >>> +extern void ftrace_syscall_exit(struct pt_regs *regs);
> >>> +#else
> >>> +static inline void start_ftrace_syscalls(void) { }
> >>> +static inline void stop_ftrace_syscalls(void) { }
> >>> +static inline void ftrace_syscall_enter(struct pt_regs *regs) { }
> >>> +static inline void ftrace_syscall_exit(struct pt_regs *regs) { }
> >>> +#endif
> >>> +
> >>>  #endif /* _LINUX_FTRACE_H */
> >>> diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
> >>> index 8e4a2a6..95a0ad1 100644
> >>> --- a/kernel/trace/Kconfig
> >>> +++ b/kernel/trace/Kconfig
> >>> @@ -34,6 +34,9 @@ config HAVE_FTRACE_MCOUNT_RECORD
> >>>  config HAVE_HW_BRANCH_TRACER
> >>>  	bool
> >>>  
> >>> +config HAVE_FTRACE_SYSCALLS
> >>> +	bool
> >>> +
> >>>  config TRACER_MAX_TRACE
> >>>  	bool
> >>>  
> >>> @@ -175,6 +178,13 @@ config EVENT_TRACER
> >>>  	  allowing the user to pick and choose which trace point they
> >>>  	  want to trace.
> >>>  
> >>> +config FTRACE_SYSCALLS
> >>> +	bool "Trace syscalls"
> >>> +	depends on HAVE_FTRACE_SYSCALLS
> >>> +	select TRACING
> >>> +	help
> >>> +	  Basic tracer to catch the syscall entry and exit events.
> >>> +
> >>>  config BOOT_TRACER
> >>>  	bool "Trace boot initcalls"
> >>>  	select TRACING
> >>> diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
> >>> index c7a2943..c3feea0 100644
> >>> --- a/kernel/trace/Makefile
> >>> +++ b/kernel/trace/Makefile
> >>> @@ -43,5 +43,6 @@ obj-$(CONFIG_BLK_DEV_IO_TRACE)	+= blktrace.o
> >>>  obj-$(CONFIG_EVENT_TRACER) += trace_events.o
> >>>  obj-$(CONFIG_EVENT_TRACER) += events.o
> >>>  obj-$(CONFIG_EVENT_TRACER) += trace_export.o
> >>> +obj-$(CONFIG_FTRACE_SYSCALLS) += trace_syscalls.o
> >>>  
> >>>  libftrace-y := ftrace.o
> >>> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> >>> index cc94f86..5152be3 100644
> >>> --- a/kernel/trace/trace.c
> >>> +++ b/kernel/trace/trace.c
> >>> @@ -242,6 +242,12 @@ static DECLARE_WAIT_QUEUE_HEAD(trace_wait);
> >>>  unsigned long trace_flags = TRACE_ITER_PRINT_PARENT | TRACE_ITER_PRINTK |
> >>>  	TRACE_ITER_ANNOTATE | TRACE_ITER_CONTEXT_INFO;
> >>>  
> >>> +
> >>> +struct trace_array *__get_global_trace(void)
> >>> +{
> >>> +	return &global_trace;
> >>> +}
> >>> +
> >>>  /**
> >>>   * trace_wake_up - wake up tasks waiting for trace input
> >>>   *
> >>> diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
> >>> index 2bfb7d1..9583830 100644
> >>> --- a/kernel/trace/trace.h
> >>> +++ b/kernel/trace/trace.h
> >>> @@ -30,6 +30,8 @@ enum trace_type {
> >>>  	TRACE_GRAPH_ENT,
> >>>  	TRACE_USER_STACK,
> >>>  	TRACE_HW_BRANCHES,
> >>> +	TRACE_SYSCALL_ENTER,
> >>> +	TRACE_SYSCALL_EXIT,
> >>>  	TRACE_KMEM_ALLOC,
> >>>  	TRACE_KMEM_FREE,
> >>>  	TRACE_POWER,
> >>> @@ -192,6 +194,19 @@ struct kmemtrace_free_entry {
> >>>  	const void *ptr;
> >>>  };
> >>>  
> >>> +struct syscall_trace_enter {
> >>> +	struct trace_entry	ent;
> >>> +	enum syscall_trace_nr	nr;
> >>> +	unsigned long		args[];
> >>> +};
> >>> +
> >>> +struct syscall_trace_exit {
> >>> +	struct trace_entry	ent;
> >>> +	enum syscall_trace_nr	nr;
> >>> +	unsigned long		ret;
> >>> +};
> >>> +
> >>> +
> >>>  /*
> >>>   * trace_flag_type is an enumeration that holds different
> >>>   * states when a trace occurs. These are:
> >>> @@ -304,6 +319,10 @@ extern void __ftrace_bad_type(void);
> >>>  			  TRACE_KMEM_ALLOC);	\
> >>>  		IF_ASSIGN(var, ent, struct kmemtrace_free_entry,	\
> >>>  			  TRACE_KMEM_FREE);	\
> >>> +		IF_ASSIGN(var, ent, struct syscall_trace_enter,		\
> >>> +			  TRACE_SYSCALL_ENTER);				\
> >>> +		IF_ASSIGN(var, ent, struct syscall_trace_exit,		\
> >>> +			  TRACE_SYSCALL_EXIT);				\
> >>>  		__ftrace_bad_type();					\
> >>>  	} while (0)
> >>>  
> >>> @@ -568,6 +587,7 @@ extern int trace_selftest_startup_branch(struct tracer *trace,
> >>>  #endif /* CONFIG_FTRACE_STARTUP_TEST */
> >>>  
> >>>  extern void *head_page(struct trace_array_cpu *data);
> >>> +extern struct trace_array *__get_global_trace(void);
> >>>  extern long ns2usecs(cycle_t nsec);
> >>>  extern int
> >>>  trace_vprintk(unsigned long ip, int depth, const char *fmt, va_list args);
> >>> @@ -627,6 +647,18 @@ static inline int ftrace_trace_task(struct task_struct *task)
> >>>  	return test_tsk_trace_trace(task);
> >>>  }
> >>>  
> >>> +#ifdef CONFIG_FTRACE_SYSCALLS
> >>> +extern enum print_line_t
> >>> +print_syscall_enter(struct trace_iterator *iter, int flags);
> >>> +extern enum print_line_t
> >>> +print_syscall_exit(struct trace_iterator *iter, int flags);
> >>> +#else
> >>> +static inline enum print_line_t
> >>> +print_syscall_enter(struct trace_iterator *iter, int flags) { }
> >>> +static inline enum print_line_t
> >>> +print_syscall_exit(struct trace_iterator *iter, int flags) { }
> >>> +#endif
> >>> +
> >>>  /*
> >>>   * trace_iterator_flags is an enumeration that defines bit
> >>>   * positions into trace_flags that controls the output.
> >>> diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
> >>> new file mode 100644
> >>> index 0000000..470bb9d
> >>> --- /dev/null
> >>> +++ b/kernel/trace/trace_syscalls.c
> >>> @@ -0,0 +1,284 @@
> >>> +#include <linux/kernel.h>
> >>> +#include <linux/ftrace.h>
> >>> +#include <asm/syscall.h>
> >>> +
> >>> +#include "trace_output.h"
> >>> +#include "trace.h"
> >>> +
> >>> +/* Create basic entry on syscall array (with the rudimentary string mask) */
> >>> +#define SYS_TRACE_ENTRY(sname, args, mask)			  \
> >>> +	{.name = #sname, .nb_args = args, .simple_format = true,  \
> >>> +	.output.string_mask = mask}
> >>> +
> >>> +/*
> >>> + * Create a custom defined entry on the syscall array, you will have to
> >>> + * implement a callback to output the syscall.
> >>> + */
> >>> +#define SYS_TRACE_ENTRY_SPECIAL(sname, args, printer)		  \
> >>> +	{.name = #sname, .nb_args = args, .simple_format = false, \
> >>> +	.print = printer}
> >>> +
> >>> +static const struct syscall_trace_entry syscall_trace_entries[] = {
> >>> +	/* For open, the first argument is a string, hence the given mask */
> >>> +	[SYSCALL_TRACE_OPEN]	= SYS_TRACE_ENTRY(open, 3, 0x1),
> >>> +	[SYSCALL_TRACE_CLOSE]	= SYS_TRACE_ENTRY(close, 1, 0),
> >>> +	[SYSCALL_TRACE_READ]	= SYS_TRACE_ENTRY(read, 3, 0),
> >>> +	[SYSCALL_TRACE_WRITE]	= SYS_TRACE_ENTRY(read, 3, 0),
> >>> +};
> >>> +
> >>> +
> >>> +static atomic_t refcount;
> >>> +
> >>> +enum print_line_t
> >>> +print_syscall_enter(struct trace_iterator *iter, int flags)
> >>> +{
> >>> +	struct trace_seq *s = &iter->seq;
> >>> +	struct trace_entry *ent = iter->ent;
> >>> +	struct syscall_trace_enter *trace;
> >>> +	enum syscall_trace_nr syscall;
> >>> +	const struct syscall_trace_entry *entry;
> >>> +	char *str = NULL;
> >>> +	int i, ret;
> >>> +
> >>> +	trace_assign_type(trace, ent);
> >>> +
> >>> +	syscall = trace->nr;
> >>> +
> >>> +	if (!syscall || syscall >= __SYSCALL_TRACE_END)
> >>> +		return TRACE_TYPE_HANDLED;
> >>> +
> >>> +	entry = &syscall_trace_entries[syscall];
> >>> +	if (!entry->simple_format) {
> >>> +		ret = entry->output.print(trace->args);
> >>> +		if (!ret)
> >>> +			return TRACE_TYPE_PARTIAL_LINE;
> >>> +	}
> >>> +
> >>> +	ret = trace_seq_printf(s, "syscall %s :", entry->name);
> >>> +	if (!ret)
> >>> +		return TRACE_TYPE_PARTIAL_LINE;
> >>> +
> >>> +	for (i = 0; i < entry->nb_args; i++) {
> >>> +		/*
> >>> +		 * FIXME: the string comes from the user, but on tracing
> >>> +		 * output time, it may have disappeared from the memory.
> >>> +		 * May be better to strdup it on tracing time.
> >>> +		 */
> >>> +		if (entry->output.string_mask & (1 << i)) {
> >>> +			/* 128 is enough for most path */
> >>> +			str = strndup_user((char *)trace->args[i], 128);
> >>> +
> >>> +			/* If it faulted badly, the error shoud have been
> >>> +			 * handled while servicing the syscall, just ignore.
> >>> +			 */
> >>> +			if (str == ERR_PTR(-EFAULT) || str == ERR_PTR(-EINVAL)
> >>> +			    || str == ERR_PTR(-ENOMEM))
> >>> +				return TRACE_TYPE_HANDLED;
> >>> +
> >>> +			ret = trace_seq_printf(s, " %s", str);
> >>> +		} else {
> >>> +			ret = trace_seq_printf(s, " 0x%lx", trace->args[i]);
> >>> +		}
> >>> +		if (!ret)
> >>> +			goto end_partial;
> >>> +	}
> >>> +	if (!trace_seq_printf(s, "\n"))
> >>> +		goto end_partial;
> >>> +
> >>> +	kfree(str);
> >>> +	return TRACE_TYPE_HANDLED;
> >>> +
> >>> +end_partial:
> >>> +	kfree(str);
> >>> +	return TRACE_TYPE_PARTIAL_LINE;
> >>> +}
> >>> +
> >>> +enum print_line_t
> >>> +print_syscall_exit(struct trace_iterator *iter, int flags)
> >>> +{
> >>> +	struct trace_seq *s = &iter->seq;
> >>> +	struct trace_entry *ent = iter->ent;
> >>> +	struct syscall_trace_exit *trace;
> >>> +	enum syscall_trace_nr syscall;
> >>> +	const struct syscall_trace_entry *entry;
> >>> +	int ret;
> >>> +
> >>> +	trace_assign_type(trace, ent);
> >>> +
> >>> +	syscall = trace->nr;
> >>> +
> >>> +	if (!syscall || syscall >= __SYSCALL_TRACE_END)
> >>> +		return TRACE_TYPE_HANDLED;
> >>> +
> >>> +	entry = &syscall_trace_entries[syscall];
> >>> +
> >>> +	ret = trace_seq_printf(s, "syscall %s -> 0x%lx\n", entry->name,
> >>> +				trace->ret);
> >>> +	if (!ret)
> >>> +		return TRACE_TYPE_PARTIAL_LINE;
> >>> +
> >>> +	return TRACE_TYPE_HANDLED;
> >>> +}
> >>> +
> >>> +void start_ftrace_syscalls(void)
> >>> +{
> >>> +	unsigned long flags;
> >>> +	struct task_struct *g, *t;
> >>> +
> >>> +	if (atomic_inc_return(&refcount) != 1)
> >>> +		goto out;
> >>> +
> >>> +	read_lock_irqsave(&tasklist_lock, flags);
> >>> +
> >>> +	do_each_thread(g, t) {
> >>> +		set_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
> >>> +	} while_each_thread(g, t);
> >>> +
> >>> +	read_unlock_irqrestore(&tasklist_lock, flags);
> >>> +out:
> >>> +	atomic_dec(&refcount);
> >>> +}
> >>> +
> >>> +void stop_ftrace_syscalls(void)
> >>> +{
> >>> +	unsigned long flags;
> >>> +	struct task_struct *g, *t;
> >>> +
> >>> +	if (atomic_dec_return(&refcount))
> >>> +		goto out;
> >>> +
> >>> +	read_lock_irqsave(&tasklist_lock, flags);
> >>> +
> >>> +	do_each_thread(g, t) {
> >>> +		clear_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
> >>> +	} while_each_thread(g, t);
> >>> +
> >>> +	read_unlock_irqrestore(&tasklist_lock, flags);
> >>> +out:
> >>> +	atomic_inc(&refcount);
> >>> +}
> >>> +
> >>> +void ftrace_syscall_enter(struct pt_regs *regs)
> >>> +{
> >>> +	struct trace_array *tr = __get_global_trace();
> >>> +	struct syscall_trace_enter *entry;
> >>> +	const struct syscall_trace_entry *sys_data;
> >>> +	struct ring_buffer_event *event;
> >>> +	struct trace_array_cpu *data;
> >>> +	int size;
> >>> +	int syscall_nr;
> >>> +	int nr_offset;
> >>> +	int cpu;
> >>> +
> >>> +	syscall_nr = syscall_get_nr(current, regs);
> >>> +	nr_offset = arch_syscall_trace_nr[syscall_nr];
> >>> +	if (!nr_offset || nr_offset >= __SYSCALL_TRACE_END)
> >>> +		return;
> >>> +
> >>> +	cpu = raw_smp_processor_id();
> >>> +	data = tr->data[cpu];
> >>> +
> >>> +	if (unlikely(atomic_read(&data->disabled)))
> >>> +		return;
> >>> +
> >>> +	sys_data = &syscall_trace_entries[nr_offset];
> >>> +	size = sizeof(*entry) + sizeof(unsigned long) * sys_data->nb_args;
> >>> +
> >>> +	event = trace_buffer_lock_reserve(tr, TRACE_SYSCALL_ENTER, size, 0, 0);
> >>> +	if (!event)
> >>> +		return;
> >>> +
> >>> +	entry = ring_buffer_event_data(event);
> >>> +	entry->nr = nr_offset;
> >>> +	syscall_get_arguments(current, regs, 0, sys_data->nb_args, entry->args);
> >>> +
> >>> +	ring_buffer_unlock_commit(tr->buffer, event);
> >>> +	trace_wake_up();
> >>> +}
> >>> +
> >>> +void ftrace_syscall_exit(struct pt_regs *regs)
> >>> +{
> >>> +	struct trace_array *tr = __get_global_trace();
> >>> +	struct syscall_trace_exit *entry;
> >>> +	const struct syscall_trace_entry *sys_data;
> >>> +	struct ring_buffer_event *event;
> >>> +	struct trace_array_cpu *data;
> >>> +	int syscall_nr;
> >>> +	int nr_offset;
> >>> +	int cpu;
> >>> +
> >>> +	syscall_nr = syscall_get_nr(current, regs);
> >>> +	nr_offset = arch_syscall_trace_nr[syscall_nr];
> >>> +	if (!nr_offset || nr_offset >= __SYSCALL_TRACE_END)
> >>> +		return;
> >>> +
> >>> +	cpu = raw_smp_processor_id();
> >>> +	data = tr->data[cpu];
> >>> +
> >>> +	if (unlikely(atomic_read(&data->disabled)))
> >>> +		return;
> >>> +
> >>> +	sys_data = &syscall_trace_entries[nr_offset];
> >>> +
> >>> +	event = trace_buffer_lock_reserve(tr, TRACE_SYSCALL_EXIT,
> >>> +				sizeof(*entry), 0, 0);
> >>> +	if (!event)
> >>> +		return;
> >>> +
> >>> +	entry = ring_buffer_event_data(event);
> >>> +	entry->nr = nr_offset;
> >>> +	entry->ret = syscall_get_return_value(current, regs);
> >>> +
> >>> +	ring_buffer_unlock_commit(tr->buffer, event);
> >>> +	trace_wake_up();
> >>> +}
> >>> +
> >>> +static int init_syscall_tracer(struct trace_array *tr)
> >>> +{
> >>> +	start_ftrace_syscalls();
> >>> +
> >>> +	return 0;
> >>> +}
> >>> +
> >>> +static void reset_syscall_tracer(struct trace_array *tr)
> >>> +{
> >>> +	stop_ftrace_syscalls();
> >>> +}
> >>> +
> >>> +static struct trace_event syscall_enter_event = {
> >>> +	.type	 	= TRACE_SYSCALL_ENTER,
> >>> +	.trace		= print_syscall_enter,
> >>> +};
> >>> +
> >>> +static struct trace_event syscall_exit_event = {
> >>> +	.type	 	= TRACE_SYSCALL_EXIT,
> >>> +	.trace		= print_syscall_exit,
> >>> +};
> >>> +
> >>> +static struct tracer syscall_tracer __read_mostly = {
> >>> +	.name	     	= "syscall",
> >>> +	.init		= init_syscall_tracer,
> >>> +	.reset		= reset_syscall_tracer
> >>> +};
> >>> +
> >>> +__init int register_ftrace_syscalls(void)
> >>> +{
> >>> +	int ret;
> >>> +
> >>> +	ret = register_ftrace_event(&syscall_enter_event);
> >>> +	if (!ret) {
> >>> +		printk(KERN_WARNING "event %d failed to register\n",
> >>> +		       syscall_enter_event.type);
> >>> +		WARN_ON_ONCE(1);
> >>> +	}
> >>> +
> >>> +	ret = register_ftrace_event(&syscall_exit_event);
> >>> +	if (!ret) {
> >>> +		printk(KERN_WARNING "event %d failed to register\n",
> >>> +		       syscall_exit_event.type);
> >>> +		WARN_ON_ONCE(1);
> >>> +	}
> >>> +
> >>> +	return register_tracer(&syscall_tracer);
> >>> +}
> >>> +device_initcall(register_ftrace_syscalls);
> >> -- 
> >> Masami Hiramatsu
> >>
> >> Software Engineer
> >> Hitachi Computer Products (America) Inc.
> >> Software Solutions Division
> >>
> >> e-mail: mhiramat@redhat.com
> >>
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/
> 
> -- 
> Masami Hiramatsu
> 
> Software Engineer
> Hitachi Computer Products (America) Inc.
> Software Solutions Division
> 
> e-mail: mhiramat@redhat.com
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure
  2009-03-16 21:45         ` Mathieu Desnoyers
@ 2009-03-16 22:18           ` Frank Ch. Eigler
  2009-03-16 23:46             ` Frederic Weisbecker
  2009-03-23 16:42             ` Mathieu Desnoyers
  2009-03-17  5:24           ` Oleg Nesterov
  1 sibling, 2 replies; 40+ messages in thread
From: Frank Ch. Eigler @ 2009-03-16 22:18 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Masami Hiramatsu, Frederic Weisbecker, Roland McGrath,
	Ingo Molnar, LKML, Lai Jiangshan, Steven Rostedt, Peter Zijlstra,
	Jiaying Zhang, Martin Bligh, utrace-devel

Hi -


On Mon, Mar 16, 2009 at 05:45:26PM -0400, Mathieu Desnoyers wrote:

> [...]
> > As far as I know, utrace supports multiple trace-engines on a process.
> > Since ptrace is just an engine of utrace, you can add another engine on utrace.
> > 
> > utrace-+-ptrace_engine---owner_process
> >        |
> >        +-systemtap_module
> >        |
> >        +-ftrace_plugin

Right.  In this way, utrace is simply a multiplexing intermediary.


> > Here, Frank had posted an example of utrace->ftrace engine.
> > http://lkml.org/lkml/2009/1/27/294
> > 
> > And here is the latest his patch(which seems to support syscall tracing...)
> > http://git.kernel.org/?p=linux/kernel/git/frob/linux-2.6-utrace.git;a=blob;f=kernel/trace/trace_process.c;h=619815f6c2543d0d82824139773deb4ca460a280;hb=ab20efa8d8b5ded96e8f8c3663dda3b4cb532124
> > 
> 
> Reminder : we are looking at system-wide tracing here. Here are some
> comments about the current utrace implementation.
> 
> Looking at include/linux/utrace.h from the tree
> 
> 17  * A tracing engine starts by calling utrace_attach_task() or
> 18  * utrace_attach_pid() on the chosen thread, passing in a set of hooks
> 19  * (&struct utrace_engine_ops), and some associated data.  This produces a
> 20  * &struct utrace_engine, which is the handle used for all other
> 21  * operations.  An attached engine has its ops vector, its data, and an
> 22  * event mask controlled by utrace_set_events().
> 
> So if the system has, say 3000 threads, then we have 3000 struct
> utrace_engine created ? I wonder what effet this could have one
> cachelines if this is used to trace hot paths like system call
> entry/exit. Have you benchmarked this kind of scenario under tbench ?

It has not been a problem, since utrace_engines are designed to be
lightweight.  Starting or stopping a systemtap script of the form

    probe process.syscall {}

appears to have no noticable impact on a tbench suite.


> 24  * For each event bit that is set, that engine will get the
> 25  * appropriate ops->report_*() callback when the event occurs.  The
> 26  * &struct utrace_engine_ops need not provide callbacks for an event
> 27  * unless the engine sets one of the associated event bits.
> 
> Looking at utrace_set_events(), we seem to be limited to 32 events on a
> 32-bits architectures because it uses a bitmask ? Isn't it a bit small?

There are only a few types of thread events that involve different
classes of treatment, or different degrees of freedom in terms of
interference with the uninstrumented fast path of the threads.

For example, it does not make sense to have different flag bits for
different system calls, since choosing to trace *any* system call
involves taking the thread off of the fast path with the TIF_ flag.
Once it's off the fast path, it doesn't matter whether the utrace core
or some client performs syscall discrimination, so it is left to the
client.


> 682 /**
> 683  * utrace_set_events_pid - choose which event reports a tracing engine gets
> 684  * @pid:                thread to affect
> 685  * @engine:             attached engine to affect
> 686  * @eventmask:          new event mask
> 687  *
> 688  * This is the same as utrace_set_events(), but takes a &struct pid
> 689  * pointer rather than a &struct task_struct pointer.  The caller must
> 690  * hold a ref on @pid, but does not need to worry about the task
> 691  * staying valid.  If it's been reaped so that @pid points nowhere,
> 692  * then this call returns -%ESRCH.
> 
> 
> Comments like "but does not need to worry about the task staying valid"
> does not make me feel safe and comfortable at all, could you explain
> how you can assume that derefencing an "invalid" pointer will return
> NULL ?

(We're doing a final round of "internal" (pre-LKML) reviews of the
utrace implementation right now on utrace-devel@redhat.com, where such
comments get fastest attention from the experts.)

For this particular issue, the utrace documentation file explains the
liveness rules for the various pointers that can be fed to or received
from utrace functions.  This is not about "feeling" safe, it's about
what the mechanism is deliberately designed to permit.


> About the utrace_attach_task() :
> 
> 244         if (unlikely(target->flags & PF_KTHREAD))
> 245                 /*
> 246                  * Silly kernel, utrace is for users!
> 247                  */
> 248                 return ERR_PTR(-EPERM);
> 
> So we cannot trace kernel threads ?

I'm not quite sure about all the reasons for this, but I believe that
kernel threads don't tend to engage in job control / signal /
system-call activities the same way as normal user threads do.


> 118 /*
> 119  * Called without locks, when we might be the first utrace engine to attach.
> 120  * If this is a newborn thread and we are not the creator, we have to wait
> 121  * for it.  The creator gets the first chance to attach.  The PF_STARTING
> 122  * flag is cleared after its report_clone hook has had a chance to run.
> 123  */
> 124 static inline int utrace_attach_delay(struct task_struct *target)
> 125 {
> 126         if ((target->flags & PF_STARTING) && target->real_parent != current)
> 127                 do {
> 128                         schedule_timeout_interruptible(1);
> 129                         if (signal_pending(current))
> 130                                 return -ERESTARTNOINTR;
> 131                 } while (target->flags & PF_STARTING);
> 132
> 133         return 0;
> 134 }
> 
> Why do we absolutely have to poll until the thread has started ?

(I don't know off the top of my head - Roland?)


> utrace_add_engine()
>   set_notify_resume(target);
> 
> ok, so this is where the TIF_NOTIFY_RESUME thread flag is set. I notice
> that it is set asynchronously with the execution of the target thread
> (as I do with my TIF_KERNEL_TRACE thread flag).
> 
> However, on x86_64, _TIF_DO_NOTIFY_MASK is only tested in 
> entry_64.S
> 
> int_signal:
> and
> retint_signal:
> 
> code paths. However, if there is no syscall tracing to do upon syscall
> entry, the thread flags are not re-read at syscall exit and you will
> miss the syscall exit returning from your target thread if this thread
> was blocked while you set its TIF_NOTIFY_RESUME. Or is it handled in
> some subtle way I did not figure out ? BTW re-reading the TIF flags from
> the thread_info at syscall exit on the fast path is out of question
> because it considerably degrades the kernel performances. entry_*.S is
> a very, very critical path.

(I don't know off the top of my head - Roland?)


- FChE

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure
  2009-03-16 22:18           ` Frank Ch. Eigler
@ 2009-03-16 23:46             ` Frederic Weisbecker
  2009-03-23 16:42             ` Mathieu Desnoyers
  1 sibling, 0 replies; 40+ messages in thread
From: Frederic Weisbecker @ 2009-03-16 23:46 UTC (permalink / raw)
  To: Frank Ch. Eigler
  Cc: Mathieu Desnoyers, Masami Hiramatsu, Roland McGrath, Ingo Molnar,
	LKML, Lai Jiangshan, Steven Rostedt, Peter Zijlstra,
	Jiaying Zhang, Martin Bligh, utrace-devel

On Mon, Mar 16, 2009 at 06:18:00PM -0400, Frank Ch. Eigler wrote:
> Hi -
> 
> 
> On Mon, Mar 16, 2009 at 05:45:26PM -0400, Mathieu Desnoyers wrote:
> 
> > [...]
> > > As far as I know, utrace supports multiple trace-engines on a process.
> > > Since ptrace is just an engine of utrace, you can add another engine on utrace.
> > > 
> > > utrace-+-ptrace_engine---owner_process
> > >        |
> > >        +-systemtap_module
> > >        |
> > >        +-ftrace_plugin
> 
> Right.  In this way, utrace is simply a multiplexing intermediary.
> 
> 
> > > Here, Frank had posted an example of utrace->ftrace engine.
> > > http://lkml.org/lkml/2009/1/27/294
> > > 
> > > And here is the latest his patch(which seems to support syscall tracing...)
> > > http://git.kernel.org/?p=linux/kernel/git/frob/linux-2.6-utrace.git;a=blob;f=kernel/trace/trace_process.c;h=619815f6c2543d0d82824139773deb4ca460a280;hb=ab20efa8d8b5ded96e8f8c3663dda3b4cb532124
> > > 
> > 
> > Reminder : we are looking at system-wide tracing here. Here are some
> > comments about the current utrace implementation.
> > 
> > Looking at include/linux/utrace.h from the tree
> > 
> > 17  * A tracing engine starts by calling utrace_attach_task() or
> > 18  * utrace_attach_pid() on the chosen thread, passing in a set of hooks
> > 19  * (&struct utrace_engine_ops), and some associated data.  This produces a
> > 20  * &struct utrace_engine, which is the handle used for all other
> > 21  * operations.  An attached engine has its ops vector, its data, and an
> > 22  * event mask controlled by utrace_set_events().
> > 
> > So if the system has, say 3000 threads, then we have 3000 struct
> > utrace_engine created ? I wonder what effet this could have one
> > cachelines if this is used to trace hot paths like system call
> > entry/exit. Have you benchmarked this kind of scenario under tbench ?
> 
> It has not been a problem, since utrace_engines are designed to be
> lightweight.  Starting or stopping a systemtap script of the form
> 
>     probe process.syscall {}
> 
> appears to have no noticable impact on a tbench suite.
> 
> 
> > 24  * For each event bit that is set, that engine will get the
> > 25  * appropriate ops->report_*() callback when the event occurs.  The
> > 26  * &struct utrace_engine_ops need not provide callbacks for an event
> > 27  * unless the engine sets one of the associated event bits.
> > 
> > Looking at utrace_set_events(), we seem to be limited to 32 events on a
> > 32-bits architectures because it uses a bitmask ? Isn't it a bit small?
> 
> There are only a few types of thread events that involve different
> classes of treatment, or different degrees of freedom in terms of
> interference with the uninstrumented fast path of the threads.
> 
> For example, it does not make sense to have different flag bits for
> different system calls, since choosing to trace *any* system call
> involves taking the thread off of the fast path with the TIF_ flag.
> Once it's off the fast path, it doesn't matter whether the utrace core
> or some client performs syscall discrimination, so it is left to the
> client.
> 
> 
> > 682 /**
> > 683  * utrace_set_events_pid - choose which event reports a tracing engine gets
> > 684  * @pid:                thread to affect
> > 685  * @engine:             attached engine to affect
> > 686  * @eventmask:          new event mask
> > 687  *
> > 688  * This is the same as utrace_set_events(), but takes a &struct pid
> > 689  * pointer rather than a &struct task_struct pointer.  The caller must
> > 690  * hold a ref on @pid, but does not need to worry about the task
> > 691  * staying valid.  If it's been reaped so that @pid points nowhere,
> > 692  * then this call returns -%ESRCH.
> > 
> > 
> > Comments like "but does not need to worry about the task staying valid"
> > does not make me feel safe and comfortable at all, could you explain
> > how you can assume that derefencing an "invalid" pointer will return
> > NULL ?
> 
> (We're doing a final round of "internal" (pre-LKML) reviews of the
> utrace implementation right now on utrace-devel@redhat.com, where such
> comments get fastest attention from the experts.)
> 
> For this particular issue, the utrace documentation file explains the
> liveness rules for the various pointers that can be fed to or received
> from utrace functions.  This is not about "feeling" safe, it's about
> what the mechanism is deliberately designed to permit.
> 
> 
> > About the utrace_attach_task() :
> > 
> > 244         if (unlikely(target->flags & PF_KTHREAD))
> > 245                 /*
> > 246                  * Silly kernel, utrace is for users!
> > 247                  */
> > 248                 return ERR_PTR(-EPERM);
> > 
> > So we cannot trace kernel threads ?
> 
> I'm not quite sure about all the reasons for this, but I believe that
> kernel threads don't tend to engage in job control / signal /
> system-call activities the same way as normal user threads do.
> 


Some of them use some syscalls, but it doesn't involve a user/kernel switch.
So it's not tracable by hooking syscall_entry/exit or using tracehooks.
It would require specific hooks on sys_* functions for that.

So this check is right (writing on each thread info seems somewhat costly so
it's better if it is avoided like here).

Frederic.

 
> > 118 /*
> > 119  * Called without locks, when we might be the first utrace engine to attach.
> > 120  * If this is a newborn thread and we are not the creator, we have to wait
> > 121  * for it.  The creator gets the first chance to attach.  The PF_STARTING
> > 122  * flag is cleared after its report_clone hook has had a chance to run.
> > 123  */
> > 124 static inline int utrace_attach_delay(struct task_struct *target)
> > 125 {
> > 126         if ((target->flags & PF_STARTING) && target->real_parent != current)
> > 127                 do {
> > 128                         schedule_timeout_interruptible(1);
> > 129                         if (signal_pending(current))
> > 130                                 return -ERESTARTNOINTR;
> > 131                 } while (target->flags & PF_STARTING);
> > 132
> > 133         return 0;
> > 134 }
> > 
> > Why do we absolutely have to poll until the thread has started ?
> 
> (I don't know off the top of my head - Roland?)
> 
> 
> > utrace_add_engine()
> >   set_notify_resume(target);
> > 
> > ok, so this is where the TIF_NOTIFY_RESUME thread flag is set. I notice
> > that it is set asynchronously with the execution of the target thread
> > (as I do with my TIF_KERNEL_TRACE thread flag).
> > 
> > However, on x86_64, _TIF_DO_NOTIFY_MASK is only tested in 
> > entry_64.S
> > 
> > int_signal:
> > and
> > retint_signal:
> > 
> > code paths. However, if there is no syscall tracing to do upon syscall
> > entry, the thread flags are not re-read at syscall exit and you will
> > miss the syscall exit returning from your target thread if this thread
> > was blocked while you set its TIF_NOTIFY_RESUME. Or is it handled in
> > some subtle way I did not figure out ? BTW re-reading the TIF flags from
> > the thread_info at syscall exit on the fast path is out of question
> > because it considerably degrades the kernel performances. entry_*.S is
> > a very, very critical path.
> 
> (I don't know off the top of my head - Roland?)
> 
> 
> - FChE


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure
  2009-03-16 21:45         ` Mathieu Desnoyers
  2009-03-16 22:18           ` Frank Ch. Eigler
@ 2009-03-17  5:24           ` Oleg Nesterov
  2009-03-17 16:00             ` Mathieu Desnoyers
  1 sibling, 1 reply; 40+ messages in thread
From: Oleg Nesterov @ 2009-03-17  5:24 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Lai Jiangshan, Peter Zijlstra, Frederic Weisbecker, linux-kernel,
	Steven Rostedt, Jiaying Zhang, Martin Bligh, utrace-devel

On 03/16, Mathieu Desnoyers wrote:
>
> utrace_add_engine()
>   set_notify_resume(target);
>
> ok, so this is where the TIF_NOTIFY_RESUME thread flag is set. I notice
> that it is set asynchronously with the execution of the target thread
> (as I do with my TIF_KERNEL_TRACE thread flag).
>
> However, on x86_64, _TIF_DO_NOTIFY_MASK is only tested in
> entry_64.S
>
> int_signal:
> and
> retint_signal:
>
> code paths. However, if there is no syscall tracing to do upon syscall
> entry, the thread flags are not re-read at syscall exit and you will
> miss the syscall exit returning from your target thread if this thread
> was blocked while you set its TIF_NOTIFY_RESUME.

Afaics, TIF_NOTIFY_RESUME is not needed to trace syscall entry/exit.
If engine wants the syscall tracing, utrace_set_events(UTRACE_SYSCALL_xxx)
sets TIF_SYSCALL_TRACE. And syscall_trace_enter/syscall_trace_leave call
tracehook_report_syscall_xxx().

Oleg.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure
  2009-03-17  5:24           ` Oleg Nesterov
@ 2009-03-17 16:00             ` Mathieu Desnoyers
  2009-03-19 10:34               ` Roland McGrath
  0 siblings, 1 reply; 40+ messages in thread
From: Mathieu Desnoyers @ 2009-03-17 16:00 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Lai Jiangshan, Peter Zijlstra, Frederic Weisbecker, linux-kernel,
	Steven Rostedt, Jiaying Zhang, Martin Bligh, utrace-devel

* Oleg Nesterov (oleg@redhat.com) wrote:
> On 03/16, Mathieu Desnoyers wrote:
> >
> > utrace_add_engine()
> >   set_notify_resume(target);
> >
> > ok, so this is where the TIF_NOTIFY_RESUME thread flag is set. I notice
> > that it is set asynchronously with the execution of the target thread
> > (as I do with my TIF_KERNEL_TRACE thread flag).
> >
> > However, on x86_64, _TIF_DO_NOTIFY_MASK is only tested in
> > entry_64.S
> >
> > int_signal:
> > and
> > retint_signal:
> >
> > code paths. However, if there is no syscall tracing to do upon syscall
> > entry, the thread flags are not re-read at syscall exit and you will
> > miss the syscall exit returning from your target thread if this thread
> > was blocked while you set its TIF_NOTIFY_RESUME.
> 
> Afaics, TIF_NOTIFY_RESUME is not needed to trace syscall entry/exit.
> If engine wants the syscall tracing, utrace_set_events(UTRACE_SYSCALL_xxx)
> sets TIF_SYSCALL_TRACE. And syscall_trace_enter/syscall_trace_leave call
> tracehook_report_syscall_xxx().
> 
> Oleg.

I recall that TIF_SYSCALL_TRACE also suffers from the same problem as
TIF_NOTIFY_RESUME if set asynchronously with the target thread's
execution at least on x86_64 and arm. Do you take care to stop the
target thread in utrace_set_events ?

Mathieu

> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure
  2009-03-17 16:00             ` Mathieu Desnoyers
@ 2009-03-19 10:34               ` Roland McGrath
  2009-03-23 17:33                 ` Mathieu Desnoyers
  0 siblings, 1 reply; 40+ messages in thread
From: Roland McGrath @ 2009-03-19 10:34 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Oleg Nesterov, Lai Jiangshan, Peter Zijlstra,
	Frederic Weisbecker, linux-kernel, Steven Rostedt, Martin Bligh,
	utrace-devel, Jiaying Zhang

The utrace API itself is not a good fit for global tracing, since its
purpose is tracing and control of individual user threads.  There is
no reason to allocate its per-task data structures when you are going
to treat all tasks the same anyway.  The points that I think are being
missed are about the possibilities of overloading TIF_SYSCALL_TRACE.

It's true that ptrace uses TIF_SYSCALL_TRACE as a flag for whether you are
in the middle of a PTRACE_SYSCALL, so it can be confused by setting it for 
other purposes on a task that is also ptrace'd (but not with PTRACE_SYSCALL).
Until we are able to do away with these parts of the old ptrace innards,
you can't overload TIF_SYSCALL_TRACE without perturbing ptrace behavior.

The utrace code does not have this problem.  It keeps its own state bits,
so for it, TIF_SYSCALL_TRACE means exactly "the task will call
tracehook_report_syscall_*" and no more.  To use TIF_SYSCALL_TRACE for
another purpose, just set it on all the tasks you like (and/or set it on
new tasks in fork.c) and add your code (tracepoints, whatever) to
tracehook_report_syscall_* alongside the calls there into utrace.  There is
exactly one place in utrace code that clears TIF_SYSCALL_TRACE, and you
just add "&& !global_syscall_tracing_enabled" to the condition there.  You
don't need to bother clearing TIF_SYSCALL_TRACE on any task when you're
done.  If your "global_syscall_tracing_enabled" (or whatever it is) is
clear, each task will lazily fall into utrace at its next syscall
entry/exit and then utrace will reset TIF_SYSCALL_TRACE when it finds no
reason left to have it on.

I'm not really going to delve into utrace internals in this thread.  Please
raise those questions in review of the utrace patches when current code is
actually posted, where they belong.  Here I'll just mention the relevant
things that relate to the underlying issue you raised about synchronization.
As thoroughly documented, utrace_set_events() is a quick, asynchronous call
that itself makes no guarantees about how quickly a running task will start
to report the newly-requested events.  For purposes relevant here, it just
sets TIF_SYSCALL_TRACE and nothing else.  In utrace, if you want synchronous
assurance that a task misses no events you ask for, then you must first use
utrace_control (et al) to stop it and synchronize.  That is not something
that makes much sense at all for a "flip on global tracing" operation, which
is not generally especially synchronous with anything else.  If you want
best effort that a task will pick up newly-requested events Real Soon Now,
you can use utrace_control with just UTRACE_REPORT.  For purposes relevant
here, this just does set_notify_resume().  That will send an IPI if the task
is running, and then it will start noticing before it returns to user mode.
So:
	set_tsk_thread_flag(task, TIF_SYSCALL_TRACE);
	set_notify_resume(task);
is what I would expect you to do on each task if you want to quickly get it
to start hitting tracehook_report_syscall_*.  (I'm a bit dubious that there
is really any need to speed it up with set_notify_resume, but that's just me.)

Finally, some broader points about TIF_SYSCALL_TRACE that I think
have been overlooked.  The key special feature of TIF_SYSCALL_TRACE
is that it gets you to a place where full user_regset access is
available.  Debuggers need this to read (and write) the full user
register state arbitrarily, which they also need to do user
backtraces and the like.  If you do not need user_regset to work,
then you don't need to be on this (slowest) code path.

If you are only interested in reading syscall arguments and results
(or even in changing syscall results in exit tracing) then you do
not need user_regset and you do not need to take the slowest syscall
path.  (If you are doing backtraces but already rely on full kernel
stack unwinding to do it, you also do not need user_regset.)  From
anywhere inside the kernel, you can use the asm/syscall.h calls to
read syscall args, whichever entry path the task took.

The other mechanism to hook into every syscall entry/exit is
TIF_SYSCALL_AUDIT.  On some machines (like x86), this takes a third,
"warm" code path that is faster than the TIF_SYSCALL_TRACE path
(though obviously still off the fastest direct code path).  It can
be faster precisely because it doesn't need to allow for user_regset
access, nor for modification of syscall arguments in entry tracing.
For normal read-only tracing of just the actual syscall details,
it has all you need.

Unfortunately the arch code all looks like:

	if (unlikely(current->audit_context))
		 audit_syscall_{entry,exit}(...);

So we need to change that to:

	if (unlikely(test_thread_flag(TIF_SYSCALL_AUDIT)))
		 audit_syscall_{entry,exit}(...);

But that is pretty easy to get right, even doing it blind on arch's
you can't test.  Far better than adding new asm hackery for each arch
that's almost identical to TIF_SYSCALL_TRACE or TIF_SYSCALL_AUDIT (and
finding out that some are fresh out of TIF bits in the range that
their asm code can handle).

TIF_SYSCALL_AUDIT is only set when allocating audit_context, and its
paths already have !context tests so overloading is harmless today.
(Whereas with TIF_SYSCALL_TRACE, you have to wait for later ptrace
cleanups or write off using ptrace simultaneously.)

Then you can do the lazy disable in audit_syscall_{entry,exit} with:

	if (unlikely(!context)) {
		if (unlikely(!global_syscall_tracing_enabled))
			clear_thread_flag(TIF_SYSCALL_AUDIT);
		return;
	}

Plus add there your tracepoint or whatnot.

Unless you really plan to use user_regset in your tracepoints, then
I think this is a better plan for global syscall tracing than either
fiddling with TIF_SYSCALL_TRACE or adding new arch asm requirements.
(IMHO, the latter is the worst idea on the table.)


Thanks,
Roland

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure
  2009-03-16 22:18           ` Frank Ch. Eigler
  2009-03-16 23:46             ` Frederic Weisbecker
@ 2009-03-23 16:42             ` Mathieu Desnoyers
  2009-03-23 16:52               ` Frank Ch. Eigler
  1 sibling, 1 reply; 40+ messages in thread
From: Mathieu Desnoyers @ 2009-03-23 16:42 UTC (permalink / raw)
  To: Frank Ch. Eigler
  Cc: Masami Hiramatsu, Frederic Weisbecker, Roland McGrath,
	Ingo Molnar, LKML, Lai Jiangshan, Steven Rostedt, Peter Zijlstra,
	Jiaying Zhang, Martin Bligh, utrace-devel

* Frank Ch. Eigler (fche@redhat.com) wrote:
> Hi -
> 
> 
> On Mon, Mar 16, 2009 at 05:45:26PM -0400, Mathieu Desnoyers wrote:
> 
> > [...]
> > > As far as I know, utrace supports multiple trace-engines on a process.
> > > Since ptrace is just an engine of utrace, you can add another engine on utrace.
> > > 
> > > utrace-+-ptrace_engine---owner_process
> > >        |
> > >        +-systemtap_module
> > >        |
> > >        +-ftrace_plugin
> 
> Right.  In this way, utrace is simply a multiplexing intermediary.
> 
> 
> > > Here, Frank had posted an example of utrace->ftrace engine.
> > > http://lkml.org/lkml/2009/1/27/294
> > > 
> > > And here is the latest his patch(which seems to support syscall tracing...)
> > > http://git.kernel.org/?p=linux/kernel/git/frob/linux-2.6-utrace.git;a=blob;f=kernel/trace/trace_process.c;h=619815f6c2543d0d82824139773deb4ca460a280;hb=ab20efa8d8b5ded96e8f8c3663dda3b4cb532124
> > > 
> > 
> > Reminder : we are looking at system-wide tracing here. Here are some
> > comments about the current utrace implementation.
> > 
> > Looking at include/linux/utrace.h from the tree
> > 
> > 17  * A tracing engine starts by calling utrace_attach_task() or
> > 18  * utrace_attach_pid() on the chosen thread, passing in a set of hooks
> > 19  * (&struct utrace_engine_ops), and some associated data.  This produces a
> > 20  * &struct utrace_engine, which is the handle used for all other
> > 21  * operations.  An attached engine has its ops vector, its data, and an
> > 22  * event mask controlled by utrace_set_events().
> > 
> > So if the system has, say 3000 threads, then we have 3000 struct
> > utrace_engine created ? I wonder what effet this could have one
> > cachelines if this is used to trace hot paths like system call
> > entry/exit. Have you benchmarked this kind of scenario under tbench ?
> 
> It has not been a problem, since utrace_engines are designed to be
> lightweight.  Starting or stopping a systemtap script of the form
> 
>     probe process.syscall {}
> 
> appears to have no noticable impact on a tbench suite.
> 

Do you mean starting this script for a single process or for _all_ the
processes and threads running on the system ?

> 
> > 24  * For each event bit that is set, that engine will get the
> > 25  * appropriate ops->report_*() callback when the event occurs.  The
> > 26  * &struct utrace_engine_ops need not provide callbacks for an event
> > 27  * unless the engine sets one of the associated event bits.
> > 
> > Looking at utrace_set_events(), we seem to be limited to 32 events on a
> > 32-bits architectures because it uses a bitmask ? Isn't it a bit small?
> 
> There are only a few types of thread events that involve different
> classes of treatment, or different degrees of freedom in terms of
> interference with the uninstrumented fast path of the threads.
> 
> For example, it does not make sense to have different flag bits for
> different system calls, since choosing to trace *any* system call
> involves taking the thread off of the fast path with the TIF_ flag.
> Once it's off the fast path, it doesn't matter whether the utrace core
> or some client performs syscall discrimination, so it is left to the
> client.
> 

If we limit ourself to thread-interaction events, I agree that they are
limited. But in the system-wide tracing scenario, the criterions for
filtering can apply to many more event categories.

Referring to Roland's reply, I think using utrace to enable system-wide
collection of data would just be a waste of resources. Going through a
more lightweight system-wide activation seems more appropriate to me.
Utrace is still a very promising tool to trace process-specific activity
though.

Mathieu

> 
> > 682 /**
> > 683  * utrace_set_events_pid - choose which event reports a tracing engine gets
> > 684  * @pid:                thread to affect
> > 685  * @engine:             attached engine to affect
> > 686  * @eventmask:          new event mask
> > 687  *
> > 688  * This is the same as utrace_set_events(), but takes a &struct pid
> > 689  * pointer rather than a &struct task_struct pointer.  The caller must
> > 690  * hold a ref on @pid, but does not need to worry about the task
> > 691  * staying valid.  If it's been reaped so that @pid points nowhere,
> > 692  * then this call returns -%ESRCH.
> > 
> > 
> > Comments like "but does not need to worry about the task staying valid"
> > does not make me feel safe and comfortable at all, could you explain
> > how you can assume that derefencing an "invalid" pointer will return
> > NULL ?
> 
> (We're doing a final round of "internal" (pre-LKML) reviews of the
> utrace implementation right now on utrace-devel@redhat.com, where such
> comments get fastest attention from the experts.)
> 
> For this particular issue, the utrace documentation file explains the
> liveness rules for the various pointers that can be fed to or received
> from utrace functions.  This is not about "feeling" safe, it's about
> what the mechanism is deliberately designed to permit.
> 
> 
> > About the utrace_attach_task() :
> > 
> > 244         if (unlikely(target->flags & PF_KTHREAD))
> > 245                 /*
> > 246                  * Silly kernel, utrace is for users!
> > 247                  */
> > 248                 return ERR_PTR(-EPERM);
> > 
> > So we cannot trace kernel threads ?
> 
> I'm not quite sure about all the reasons for this, but I believe that
> kernel threads don't tend to engage in job control / signal /
> system-call activities the same way as normal user threads do.
> 
> 
> > 118 /*
> > 119  * Called without locks, when we might be the first utrace engine to attach.
> > 120  * If this is a newborn thread and we are not the creator, we have to wait
> > 121  * for it.  The creator gets the first chance to attach.  The PF_STARTING
> > 122  * flag is cleared after its report_clone hook has had a chance to run.
> > 123  */
> > 124 static inline int utrace_attach_delay(struct task_struct *target)
> > 125 {
> > 126         if ((target->flags & PF_STARTING) && target->real_parent != current)
> > 127                 do {
> > 128                         schedule_timeout_interruptible(1);
> > 129                         if (signal_pending(current))
> > 130                                 return -ERESTARTNOINTR;
> > 131                 } while (target->flags & PF_STARTING);
> > 132
> > 133         return 0;
> > 134 }
> > 
> > Why do we absolutely have to poll until the thread has started ?
> 
> (I don't know off the top of my head - Roland?)
> 
> 
> > utrace_add_engine()
> >   set_notify_resume(target);
> > 
> > ok, so this is where the TIF_NOTIFY_RESUME thread flag is set. I notice
> > that it is set asynchronously with the execution of the target thread
> > (as I do with my TIF_KERNEL_TRACE thread flag).
> > 
> > However, on x86_64, _TIF_DO_NOTIFY_MASK is only tested in 
> > entry_64.S
> > 
> > int_signal:
> > and
> > retint_signal:
> > 
> > code paths. However, if there is no syscall tracing to do upon syscall
> > entry, the thread flags are not re-read at syscall exit and you will
> > miss the syscall exit returning from your target thread if this thread
> > was blocked while you set its TIF_NOTIFY_RESUME. Or is it handled in
> > some subtle way I did not figure out ? BTW re-reading the TIF flags from
> > the thread_info at syscall exit on the fast path is out of question
> > because it considerably degrades the kernel performances. entry_*.S is
> > a very, very critical path.
> 
> (I don't know off the top of my head - Roland?)
> 
> 
> - FChE

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure
  2009-03-23 16:42             ` Mathieu Desnoyers
@ 2009-03-23 16:52               ` Frank Ch. Eigler
  2009-03-23 17:03                 ` Mathieu Desnoyers
  0 siblings, 1 reply; 40+ messages in thread
From: Frank Ch. Eigler @ 2009-03-23 16:52 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Masami Hiramatsu, Frederic Weisbecker, Roland McGrath,
	Ingo Molnar, LKML, Lai Jiangshan, Steven Rostedt, Peter Zijlstra,
	Jiaying Zhang, Martin Bligh, utrace-devel

Hi -

On Mon, Mar 23, 2009 at 12:42:08PM -0400, Mathieu Desnoyers wrote:
> [...]

(Please trim emails you're responding to.)

> [...]
> > > So if the system has, say 3000 threads, then we have 3000 struct
> > > utrace_engine created ? I wonder what effet this could have one
> > > cachelines if this is used to trace hot paths like system call
> > > entry/exit. Have you benchmarked this kind of scenario under tbench ?
> > 
> > It has not been a problem, since utrace_engines are designed to be
> > lightweight.  Starting or stopping a systemtap script of the form
> > 
> >     probe process.syscall {}
> > 
> > appears to have no noticable impact on a tbench suite.
> 
> Do you mean starting this script for a single process or for _all_ the
> processes and threads running on the system ?

The script above usually applies to all threads.


> > > Looking at utrace_set_events(), we seem to be limited to 32 events on a
> > > 32-bits architectures because it uses a bitmask ? Isn't it a bit small?
> > 
> > There are only a few types of thread events that involve different
> > classes of treatment, or different degrees of freedom in terms of
> > interference with the uninstrumented fast path of the threads. [...]
> 
> If we limit ourself to thread-interaction events, I agree that they are
> limited. But in the system-wide tracing scenario, the criterions for
> filtering can apply to many more event categories.

If those different criteria have equivalent impact on running threads,
there is no need to differentiate them at the low (utrace event flag)
level.  Could you offer an example to clarify?


> Referring to Roland's reply, I think using utrace to enable
> system-wide collection of data would just be a waste of
> resources. Going through a more lightweight system-wide activation
> seems more appropriate to me.  [...]

Perhaps.  OTOH it also makes sense to me to use (and improve) one
general facility, if it can do the right thing almost as fast as a
wholly separate facility that's specialized for one small purpose.
The decision would probably rest with a more data-based comparison of
performance & code size.


- FChE

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure
  2009-03-23 16:52               ` Frank Ch. Eigler
@ 2009-03-23 17:03                 ` Mathieu Desnoyers
  0 siblings, 0 replies; 40+ messages in thread
From: Mathieu Desnoyers @ 2009-03-23 17:03 UTC (permalink / raw)
  To: Frank Ch. Eigler
  Cc: Masami Hiramatsu, Frederic Weisbecker, Roland McGrath,
	Ingo Molnar, LKML, Lai Jiangshan, Steven Rostedt, Peter Zijlstra,
	Jiaying Zhang, Martin Bligh, utrace-devel

* Frank Ch. Eigler (fche@redhat.com) wrote:
> Hi -
> 
> On Mon, Mar 23, 2009 at 12:42:08PM -0400, Mathieu Desnoyers wrote:
> > [...]
> 
> (Please trim emails you're responding to.)
> 
> > [...]
> > > > So if the system has, say 3000 threads, then we have 3000 struct
> > > > utrace_engine created ? I wonder what effet this could have one
> > > > cachelines if this is used to trace hot paths like system call
> > > > entry/exit. Have you benchmarked this kind of scenario under tbench ?
> > > 
> > > It has not been a problem, since utrace_engines are designed to be
> > > lightweight.  Starting or stopping a systemtap script of the form
> > > 
> > >     probe process.syscall {}
> > > 
> > > appears to have no noticable impact on a tbench suite.
> > 
> > Do you mean starting this script for a single process or for _all_ the
> > processes and threads running on the system ?
> 
> The script above usually applies to all threads.
> 

Hrm, I already spent more time installing and benchmarking systemtap
than I should, so I don't have time currently to run further systemtap
benchmarks, but I seriously doubt about this. Have you run the following
benchmark ?

Baseline :
vanilla kernel, without utrace

Comparison with :
utrace-enabled kernel, with the syscall probe activated

?

If you are comparing a utrace-enabled kernel with and without the
syscall probes activated, then you are probably missing some performance
impact. Also make sure AUDIT SYSCALL, secure computing and
frame pointers are disabled in your baseline kernel too.

If this is what you did, I would really like to see the numbers.

> 
> > > > Looking at utrace_set_events(), we seem to be limited to 32 events on a
> > > > 32-bits architectures because it uses a bitmask ? Isn't it a bit small?
> > > 
> > > There are only a few types of thread events that involve different
> > > classes of treatment, or different degrees of freedom in terms of
> > > interference with the uninstrumented fast path of the threads. [...]
> > 
> > If we limit ourself to thread-interaction events, I agree that they are
> > limited. But in the system-wide tracing scenario, the criterions for
> > filtering can apply to many more event categories.
> 
> If those different criteria have equivalent impact on running threads,
> there is no need to differentiate them at the low (utrace event flag)
> level.  Could you offer an example to clarify?
> 
> 
> > Referring to Roland's reply, I think using utrace to enable
> > system-wide collection of data would just be a waste of
> > resources. Going through a more lightweight system-wide activation
> > seems more appropriate to me.  [...]
> 
> Perhaps.  OTOH it also makes sense to me to use (and improve) one
> general facility, if it can do the right thing almost as fast as a
> wholly separate facility that's specialized for one small purpose.
> The decision would probably rest with a more data-based comparison of
> performance & code size.
> 

Sure.

Mathieu

> 
> - FChE

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure
  2009-03-19 10:34               ` Roland McGrath
@ 2009-03-23 17:33                 ` Mathieu Desnoyers
  0 siblings, 0 replies; 40+ messages in thread
From: Mathieu Desnoyers @ 2009-03-23 17:33 UTC (permalink / raw)
  To: Roland McGrath
  Cc: Oleg Nesterov, Lai Jiangshan, Peter Zijlstra,
	Frederic Weisbecker, linux-kernel, Steven Rostedt, Martin Bligh,
	utrace-devel, Jiaying Zhang

Hi Roland,

* Roland McGrath (roland@redhat.com) wrote:
> The utrace API itself is not a good fit for global tracing, since its
> purpose is tracing and control of individual user threads.  There is
> no reason to allocate its per-task data structures when you are going
> to treat all tasks the same anyway.  The points that I think are being
> missed are about the possibilities of overloading TIF_SYSCALL_TRACE.
> 
> It's true that ptrace uses TIF_SYSCALL_TRACE as a flag for whether you are
> in the middle of a PTRACE_SYSCALL, so it can be confused by setting it for 
> other purposes on a task that is also ptrace'd (but not with PTRACE_SYSCALL).
> Until we are able to do away with these parts of the old ptrace innards,
> you can't overload TIF_SYSCALL_TRACE without perturbing ptrace behavior.
> 

Yes, this is why I went with a different thread flag in my
TIF_KERNEL_TRACE implementation.

> The utrace code does not have this problem.  It keeps its own state bits,
> so for it, TIF_SYSCALL_TRACE means exactly "the task will call
> tracehook_report_syscall_*" and no more.  To use TIF_SYSCALL_TRACE for
> another purpose, just set it on all the tasks you like (and/or set it on
> new tasks in fork.c) and add your code (tracepoints, whatever) to
> tracehook_report_syscall_* alongside the calls there into utrace.  There is
> exactly one place in utrace code that clears TIF_SYSCALL_TRACE, and you
> just add "&& !global_syscall_tracing_enabled" to the condition there.  You
> don't need to bother clearing TIF_SYSCALL_TRACE on any task when you're
> done.  If your "global_syscall_tracing_enabled" (or whatever it is) is
> clear, each task will lazily fall into utrace at its next syscall
> entry/exit and then utrace will reset TIF_SYSCALL_TRACE when it finds no
> reason left to have it on.

I wonder how racy enabling system-wide tracing and disabling utrace
tracing on a specific thread would be ? How do you ensure that the
global tracing flag and per-thread flags are updated consistently ?

I also wonder about added performance impact caused by the
tracehook_report_syscall_* call. Ideally, system-wide syscall tracing
should call directly into a tracing callback, write to the trace
buffers, and return. With utrace, we would have to call an intermediate
callback, which would then call our tracer, then test utrace flags to
check if utrace should be called, and then return. Function calls are
quite costly nowadays :(

> 
> I'm not really going to delve into utrace internals in this thread.  Please
> raise those questions in review of the utrace patches when current code is
> actually posted, where they belong.  Here I'll just mention the relevant
> things that relate to the underlying issue you raised about synchronization.
> As thoroughly documented, utrace_set_events() is a quick, asynchronous call
> that itself makes no guarantees about how quickly a running task will start
> to report the newly-requested events.  For purposes relevant here, it just
> sets TIF_SYSCALL_TRACE and nothing else.  In utrace, if you want synchronous
> assurance that a task misses no events you ask for, then you must first use
> utrace_control (et al) to stop it and synchronize.  That is not something
> that makes much sense at all for a "flip on global tracing" operation, which
> is not generally especially synchronous with anything else.  If you want
> best effort that a task will pick up newly-requested events Real Soon Now,
> you can use utrace_control with just UTRACE_REPORT.  For purposes relevant
> here, this just does set_notify_resume().  That will send an IPI if the task
> is running, and then it will start noticing before it returns to user mode.
> So:
> 	set_tsk_thread_flag(task, TIF_SYSCALL_TRACE);
> 	set_notify_resume(task);
> is what I would expect you to do on each task if you want to quickly get it
> to start hitting tracehook_report_syscall_*.  (I'm a bit dubious that there
> is really any need to speed it up with set_notify_resume, but that's just me.)


Ideally, when we start tracing, setting the flag can be asynchronous,
but we need to have a way to figure out when tracing is actually active
(e.g. rcu quiescent state). So this can be seen as synchronous
activation. Stopping all tasks does not really make much sense for
system-wide tracing, especially if there are alternatives.

> 
> Finally, some broader points about TIF_SYSCALL_TRACE that I think
> have been overlooked.  The key special feature of TIF_SYSCALL_TRACE
> is that it gets you to a place where full user_regset access is
> available.  Debuggers need this to read (and write) the full user
> register state arbitrarily, which they also need to do user
> backtraces and the like.  If you do not need user_regset to work,
> then you don't need to be on this (slowest) code path.

LTTng had userspace backtraces on syscall entry and irq entry a while
ago, and this way particularly useful. But I agree than if this is not
needed, we should go for the warm path.

> 
> If you are only interested in reading syscall arguments and results
> (or even in changing syscall results in exit tracing) then you do
> not need user_regset and you do not need to take the slowest syscall
> path.  (If you are doing backtraces but already rely on full kernel
> stack unwinding to do it, you also do not need user_regset.)  From
> anywhere inside the kernel, you can use the asm/syscall.h calls to
> read syscall args, whichever entry path the task took.
> 
> The other mechanism to hook into every syscall entry/exit is
> TIF_SYSCALL_AUDIT.  On some machines (like x86), this takes a third,
> "warm" code path that is faster than the TIF_SYSCALL_TRACE path
> (though obviously still off the fastest direct code path).  It can
> be faster precisely because it doesn't need to allow for user_regset
> access, nor for modification of syscall arguments in entry tracing.
> For normal read-only tracing of just the actual syscall details,
> it has all you need.
> 
> Unfortunately the arch code all looks like:
> 
> 	if (unlikely(current->audit_context))
> 		 audit_syscall_{entry,exit}(...);
> 
> So we need to change that to:
> 
> 	if (unlikely(test_thread_flag(TIF_SYSCALL_AUDIT)))
> 		 audit_syscall_{entry,exit}(...);
> 
> But that is pretty easy to get right, even doing it blind on arch's
> you can't test.  Far better than adding new asm hackery for each arch
> that's almost identical to TIF_SYSCALL_TRACE or TIF_SYSCALL_AUDIT (and
> finding out that some are fresh out of TIF bits in the range that
> their asm code can handle).
> 
> TIF_SYSCALL_AUDIT is only set when allocating audit_context, and its
> paths already have !context tests so overloading is harmless today.
> (Whereas with TIF_SYSCALL_TRACE, you have to wait for later ptrace
> cleanups or write off using ptrace simultaneously.)
> 
> Then you can do the lazy disable in audit_syscall_{entry,exit} with:
> 
> 	if (unlikely(!context)) {
> 		if (unlikely(!global_syscall_tracing_enabled))
> 			clear_thread_flag(TIF_SYSCALL_AUDIT);
> 		return;
> 	}
> 
> Plus add there your tracepoint or whatnot.
> 
> Unless you really plan to use user_regset in your tracepoints, then
> I think this is a better plan for global syscall tracing than either
> fiddling with TIF_SYSCALL_TRACE or adding new arch asm requirements.
> (IMHO, the latter is the worst idea on the table.)
> 

Thanks for this thorough review of TIF flags. Hrm, racing with other
pieces of infrastructure is never fun, and given we might want to save
the userspace stack in some probes, I think it could be a good idea to
go with our own flag.

Mathieu

> 
> Thanks,
> Roland

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2009-03-23 17:33 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-03-07  4:52 [RFC][PATCH 0/2] Syscalls tracing Frederic Weisbecker
2009-03-07  4:52 ` [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure Frederic Weisbecker
2009-03-07  7:49   ` Frederic Weisbecker
2009-03-09 13:52     ` Steven Rostedt
2009-03-09 20:47       ` Frederic Weisbecker
2009-03-10  0:52         ` Steven Rostedt
2009-03-10 10:08           ` Frederic Weisbecker
2009-03-08 16:24   ` Ingo Molnar
2009-03-09 20:40     ` Frederic Weisbecker
2009-03-09 13:44   ` Steven Rostedt
2009-03-09 20:42     ` Frederic Weisbecker
2009-03-13  4:27   ` [tip:tracing/syscalls] tracing/ftrace: syscall tracing infrastructure, basics Frederic Weisbecker
2009-03-16 19:36   ` [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure Masami Hiramatsu
2009-03-16 20:15     ` Frederic Weisbecker
2009-03-16 20:38       ` Masami Hiramatsu
2009-03-16 21:45         ` Mathieu Desnoyers
2009-03-16 22:18           ` Frank Ch. Eigler
2009-03-16 23:46             ` Frederic Weisbecker
2009-03-23 16:42             ` Mathieu Desnoyers
2009-03-23 16:52               ` Frank Ch. Eigler
2009-03-23 17:03                 ` Mathieu Desnoyers
2009-03-17  5:24           ` Oleg Nesterov
2009-03-17 16:00             ` Mathieu Desnoyers
2009-03-19 10:34               ` Roland McGrath
2009-03-23 17:33                 ` Mathieu Desnoyers
2009-03-07  4:53 ` [RFC][PATCH 2/2] tracing/x86: basic implementation of syscall tracing for x86-64 Frederic Weisbecker
2009-03-13  4:27   ` [tip:tracing/syscalls] tracing/x86: basic implementation of syscall tracing for x86 Frederic Weisbecker
2009-03-13 16:09   ` [tip:tracing/syscalls] tracing/syscalls: support for syscalls tracing on x86, fix Ingo Molnar
2009-03-13 16:32   ` [RFC][PATCH 2/2] tracing/x86: basic implementation of syscall tracing for x86-64 Mathieu Desnoyers
2009-03-13 16:37     ` Ingo Molnar
2009-03-13 16:50       ` Mathieu Desnoyers
2009-03-15  4:44         ` Ingo Molnar
2009-03-15 15:16           ` Frederic Weisbecker
2009-03-15 19:22           ` Mathieu Desnoyers
2009-03-07 12:15 ` [RFC][PATCH 0/2] Syscalls tracing Peter Zijlstra
2009-03-07 16:02   ` Frederic Weisbecker
2009-03-08 11:24     ` Frederic Weisbecker
2009-03-08 11:28       ` Frederic Weisbecker
2009-03-07 20:26 ` Frank Ch. Eigler
2009-03-07 21:53   ` Frederic Weisbecker

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.