* [PATCH 01/12] map syscall name to number
  2009-08-10 20:52 [PATCH 00/12] add syscall tracepoints V3 Jason Baron
@ 2009-08-10 20:52 ` Jason Baron
  2009-08-10 20:52 ` [PATCH 02/12] call arch_init_ftrace_syscalls at boot Jason Baron
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 88+ messages in thread
From: Jason Baron @ 2009-08-10 20:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: fweisbec, mingo, laijs, rostedt, peterz, mathieu.desnoyers,
	jiayingz, mbligh, lizf

Add a new function that translates a syscall name to its number at runtime.
This allows the syscall event tracer to map syscall names to numbers.

Signed-off-by: Jason Baron <jbaron@redhat.com>

---
 arch/x86/kernel/ftrace.c |   16 ++++++++++++++++
 1 files changed, 16 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index 8e96634..afb31d7 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -500,6 +500,22 @@ struct syscall_metadata *syscall_nr_to_meta(int nr)
 	return syscalls_metadata[nr];
 }
 
+int syscall_name_to_nr(char *name)
+{
+	int i;
+
+	if (!syscalls_metadata)
+		return -1;
+
+	for (i = 0; i < FTRACE_SYSCALL_MAX; i++) {
+		if (syscalls_metadata[i]) {
+			if (!strcmp(syscalls_metadata[i]->name, name))
+				return i;
+		}
+	}
+	return -1;
+}
+
 void arch_init_ftrace_syscalls(void)
 {
 	int i;
-- 
1.6.2.5
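
For readers who want to try the lookup outside the kernel, here is a minimal
user-space sketch of the same idea: a linear scan over a table of metadata
pointers that returns the index on a name match and -1 otherwise. Everything
prefixed with demo_ (the table, its size, the struct) is a hypothetical
stand-in and not part of the patch.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for FTRACE_SYSCALL_MAX; the real value is arch specific. */
#define DEMO_SYSCALL_MAX 4

/* Hypothetical, reduced version of struct syscall_metadata. */
struct demo_syscall_metadata {
	const char *name;
};

struct demo_syscall_metadata demo_read  = { "sys_read" };
struct demo_syscall_metadata demo_write = { "sys_write" };

/* Slots 2 and 3 are NULL to model syscalls that have no metadata entry. */
struct demo_syscall_metadata *demo_table[DEMO_SYSCALL_MAX] = {
	&demo_read, &demo_write, NULL, NULL,
};

/* Same shape as syscall_name_to_nr(): linear scan, -1 when nothing matches. */
int demo_name_to_nr(struct demo_syscall_metadata **table, const char *name)
{
	int i;

	if (!table)
		return -1;

	for (i = 0; i < DEMO_SYSCALL_MAX; i++) {
		/* Skip empty slots before dereferencing the entry. */
		if (table[i] && !strcmp(table[i]->name, name))
			return i;
	}
	return -1;
}
```

The scan is O(n) in the table size, which seems acceptable here because the
lookup only runs when an event is initialized or toggled, not on the syscall
path itself.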



* [PATCH 00/12] add syscall tracepoints V3
@ 2009-08-10 20:52 Jason Baron
  2009-08-10 20:52 ` [PATCH 01/12] map syscall name to number Jason Baron
                   ` (12 more replies)
  0 siblings, 13 replies; 88+ messages in thread
From: Jason Baron @ 2009-08-10 20:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: fweisbec, mingo, laijs, rostedt, peterz, mathieu.desnoyers,
	jiayingz, mbligh, lizf

hi,

The following is an implementation of Frederic's syscall tracer on top of
tracepoints. It adds the ability to toggle the entry/exit of each syscall
via the standard events/syscalls/syscall_blah/enable interface. The
implementation adds two tracepoints: one on syscall entry and one on exit.

The patchset now also adds 'perf' tool support for counting the number of
syscall events. As a check, I did a simple strace of 'cat'ing a file, and
then verified that 'perf stat' gave a similar count.

For example:

#   perf stat -e syscalls:sys_enter_brk -e syscalls:sys_exit_brk -e syscalls:sys_enter_mmap -e syscalls:sys_enter_mmap -e syscalls:sys_enter_access -e syscalls:sys_exit_access  -e syscalls:sys_enter_close -e syscalls:sys_exit_close -e syscalls:sys_enter_read -e syscalls:sys_exit_read -e syscalls:sys_enter_write -e syscalls:sys_exit_write -e syscalls:sys_enter_mprotect -e syscalls:sys_exit_mprotect -e  syscalls:sys_enter_open -e syscalls:sys_exit_open -e syscalls:sys_enter_newfstat -e syscalls:sys_exit_newfstat -e syscalls:sys_enter_exit_group -e syscalls:sys_exit_exit_group cat /tmp/foo 


 Performance counter stats for 'cat /tmp/foo':

              3  syscalls:sys_enter_brk  
              3  syscalls:sys_exit_brk   
              9  syscalls:sys_enter_mmap 
              9  syscalls:sys_enter_mmap 
              1  syscalls:sys_enter_access
              1  syscalls:sys_exit_access
              6  syscalls:sys_enter_close
              6  syscalls:sys_exit_close 
              3  syscalls:sys_enter_read 
              3  syscalls:sys_exit_read  
              1  syscalls:sys_enter_write
              1  syscalls:sys_exit_write 
              3  syscalls:sys_enter_mprotect
              3  syscalls:sys_exit_mprotect
              4  syscalls:sys_enter_open 
              4  syscalls:sys_exit_open  
              5  syscalls:sys_enter_newfstat
              5  syscalls:sys_exit_newfstat
              1  syscalls:sys_enter_exit_group
              0  syscalls:sys_exit_exit_group

    0.000864861  seconds time elapsed


thanks,

-Jason


 arch/x86/include/asm/ftrace.h  |    4 +-
 arch/x86/kernel/ftrace.c       |   41 ++++--
 arch/x86/kernel/ptrace.c       |    6 +-
 arch/x86/kernel/sys_x86_64.c   |    8 +-
 include/linux/ftrace_event.h   |    5 +-
 include/linux/perf_counter.h   |    2 +
 include/linux/syscalls.h       |  125 ++++++++++++++++-
 include/linux/tracepoint.h     |   31 ++++-
 include/trace/ftrace.h         |    4 +-
 include/trace/syscall.h        |   54 ++++++--
 kernel/trace/trace.h           |    6 -
 kernel/trace/trace_events.c    |   33 +++--
 kernel/trace/trace_syscalls.c  |  311 ++++++++++++++++++++++++++++------------
 kernel/tracepoint.c            |   38 +++++
 tools/perf/util/parse-events.c |    8 +-
 15 files changed, 522 insertions(+), 154 deletions(-)



* [PATCH 02/12] call arch_init_ftrace_syscalls at boot
  2009-08-10 20:52 [PATCH 00/12] add syscall tracepoints V3 Jason Baron
  2009-08-10 20:52 ` [PATCH 01/12] map syscall name to number Jason Baron
@ 2009-08-10 20:52 ` Jason Baron
  2009-08-10 20:52 ` [PATCH 03/12] add DECLARE_TRACE_WITH_CALLBACK() macro Jason Baron
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 88+ messages in thread
From: Jason Baron @ 2009-08-10 20:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: fweisbec, mingo, laijs, rostedt, peterz, mathieu.desnoyers,
	jiayingz, mbligh, lizf

Call arch_init_ftrace_syscalls at boot, so we can determine the set of syscalls
for the syscall trace events.

Signed-off-by: Jason Baron <jbaron@redhat.com>

---
 arch/x86/kernel/ftrace.c      |   15 ++++-----------
 include/trace/syscall.h       |    1 -
 kernel/trace/trace_syscalls.c |    1 -
 3 files changed, 4 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index afb31d7..0d93d40 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -516,31 +516,24 @@ int syscall_name_to_nr(char *name)
 	return -1;
 }
 
-void arch_init_ftrace_syscalls(void)
+static int __init arch_init_ftrace_syscalls(void)
 {
 	int i;
 	struct syscall_metadata *meta;
 	unsigned long **psys_syscall_table = &sys_call_table;
-	static atomic_t refs;
-
-	if (atomic_inc_return(&refs) != 1)
-		goto end;
 
 	syscalls_metadata = kzalloc(sizeof(*syscalls_metadata) *
 					FTRACE_SYSCALL_MAX, GFP_KERNEL);
 	if (!syscalls_metadata) {
 		WARN_ON(1);
-		return;
+		return -ENOMEM;
 	}
 
 	for (i = 0; i < FTRACE_SYSCALL_MAX; i++) {
 		meta = find_syscall_meta(psys_syscall_table[i]);
 		syscalls_metadata[i] = meta;
 	}
-	return;
-
-	/* Paranoid: avoid overflow */
-end:
-	atomic_dec(&refs);
+	return 0;
 }
+arch_initcall(arch_init_ftrace_syscalls);
 #endif
diff --git a/include/trace/syscall.h b/include/trace/syscall.h
index 8cfe515..c55fcce 100644
--- a/include/trace/syscall.h
+++ b/include/trace/syscall.h
@@ -19,7 +19,6 @@ struct syscall_metadata {
 };
 
 #ifdef CONFIG_FTRACE_SYSCALLS
-extern void arch_init_ftrace_syscalls(void);
 extern struct syscall_metadata *syscall_nr_to_meta(int nr);
 extern void start_ftrace_syscalls(void);
 extern void stop_ftrace_syscalls(void);
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index 5e57964..08aed43 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -106,7 +106,6 @@ void start_ftrace_syscalls(void)
 	if (++refcount != 1)
 		goto unlock;
 
-	arch_init_ftrace_syscalls();
 	read_lock_irqsave(&tasklist_lock, flags);
 
 	do_each_thread(g, t) {
-- 
1.6.2.5



* [PATCH 03/12] add DECLARE_TRACE_WITH_CALLBACK() macro
  2009-08-10 20:52 [PATCH 00/12] add syscall tracepoints V3 Jason Baron
  2009-08-10 20:52 ` [PATCH 01/12] map syscall name to number Jason Baron
  2009-08-10 20:52 ` [PATCH 02/12] call arch_init_ftrace_syscalls at boot Jason Baron
@ 2009-08-10 20:52 ` Jason Baron
  2009-08-10 20:52 ` [PATCH 04/12] add syscall tracepoints Jason Baron
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 88+ messages in thread
From: Jason Baron @ 2009-08-10 20:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: fweisbec, mingo, laijs, rostedt, peterz, mathieu.desnoyers,
	jiayingz, mbligh, lizf

Introduce a new 'DECLARE_TRACE_WITH_CALLBACK()' macro, so that tracepoints can
be associated with external register/unregister functions.

Signed-off-by: Jason Baron <jbaron@redhat.com>

---
 include/linux/tracepoint.h |   31 +++++++++++++++++++++++++++----
 1 files changed, 27 insertions(+), 4 deletions(-)

diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index b9dc4ca..5984ed0 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -60,8 +60,10 @@ struct tracepoint {
  * Make sure the alignment of the structure in the __tracepoints section will
  * not add unwanted padding between the beginning of the section and the
  * structure. Force alignment to the same alignment as the section start.
+ * An optional set of (un)registration functions can be passed to perform any
+ * additional (un)registration work.
  */
-#define DECLARE_TRACE(name, proto, args)				\
+#define DECLARE_TRACE_WITH_CALLBACK(name, proto, args, reg, unreg)	\
 	extern struct tracepoint __tracepoint_##name;			\
 	static inline void trace_##name(proto)				\
 	{								\
@@ -71,13 +73,30 @@ struct tracepoint {
 	}								\
 	static inline int register_trace_##name(void (*probe)(proto))	\
 	{								\
-		return tracepoint_probe_register(#name, (void *)probe);	\
+		int ret;						\
+		void (*func)(void) = reg;				\
+									\
+		ret = tracepoint_probe_register(#name, (void *)probe);	\
+		if (func && !ret)					\
+			func();						\
+		return ret;						\
 	}								\
 	static inline int unregister_trace_##name(void (*probe)(proto))	\
 	{								\
-		return tracepoint_probe_unregister(#name, (void *)probe);\
+		int ret;						\
+		void (*func)(void) = unreg;				\
+									\
+		ret = tracepoint_probe_unregister(#name, (void *)probe);\
+		if (func && !ret)					\
+			func();						\
+		return ret;						\
 	}
 
+
+#define DECLARE_TRACE(name, proto, args)				 \
+	DECLARE_TRACE_WITH_CALLBACK(name, TP_PROTO(proto), TP_ARGS(args),\
+					NULL, NULL);
+
 #define DEFINE_TRACE(name)						\
 	static const char __tpstrtab_##name[]				\
 	__attribute__((section("__tracepoints_strings"))) = #name;	\
@@ -94,7 +113,7 @@ extern void tracepoint_update_probe_range(struct tracepoint *begin,
 	struct tracepoint *end);
 
 #else /* !CONFIG_TRACEPOINTS */
-#define DECLARE_TRACE(name, proto, args)				\
+#define DECLARE_TRACE_WITH_CALLBACK(name, proto, args, reg, unreg)	\
 	static inline void _do_trace_##name(struct tracepoint *tp, proto) \
 	{ }								\
 	static inline void trace_##name(proto)				\
@@ -108,6 +127,10 @@ extern void tracepoint_update_probe_range(struct tracepoint *begin,
 		return -ENOSYS;						\
 	}
 
+#define DECLARE_TRACE(name, proto, args)				 \
+	DECLARE_TRACE_WITH_CALLBACK(name, TP_PROTO(proto), TP_ARGS(args),\
+					NULL, NULL);
+
 #define DEFINE_TRACE(name)
 #define EXPORT_TRACEPOINT_SYMBOL_GPL(name)
 #define EXPORT_TRACEPOINT_SYMBOL(name)
-- 
1.6.2.5
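
To illustrate the control flow of the generated register wrapper, here is a
small user-space sketch. It assumes a fake core registration function
(demo_probe_register, hypothetical) whose result can be set by the caller, and
shows that the extra hook runs only when a hook was supplied and the core
registration returned 0. None of the demo_ names exist in the kernel.

```c
#include <stddef.h>

int demo_hook_calls;   /* number of times the extra hook has run */
int demo_core_result;  /* value the fake core registration returns */

/* Hypothetical stand-in for tracepoint_probe_register(). */
int demo_probe_register(const char *name, void *probe)
{
	(void)name;
	(void)probe;
	return demo_core_result;
}

/* Hypothetical extra registration work, e.g. what syscall_regfunc() would do. */
void demo_hook(void)
{
	demo_hook_calls++;
}

/*
 * Mirrors the body of register_trace_##name() after this patch:
 * do the core registration first, then call the hook only on success.
 */
int demo_register(const char *name, void *probe, void (*hook)(void))
{
	int ret = demo_probe_register(name, probe);

	if (hook && !ret)
		hook();
	return ret;
}
```

Passing NULL for the hook gives the behaviour of plain DECLARE_TRACE(), which
is how the patch keeps existing tracepoint users unchanged.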



* [PATCH 04/12] add syscall tracepoints
  2009-08-10 20:52 [PATCH 00/12] add syscall tracepoints V3 Jason Baron
                   ` (2 preceding siblings ...)
  2009-08-10 20:52 ` [PATCH 03/12] add DECLARE_TRACE_WITH_CALLBACK() macro Jason Baron
@ 2009-08-10 20:52 ` Jason Baron
  2009-08-10 20:52 ` [PATCH 05/12] update FTRACE_SYSCALL_MAX Jason Baron
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 88+ messages in thread
From: Jason Baron @ 2009-08-10 20:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: fweisbec, mingo, laijs, rostedt, peterz, mathieu.desnoyers,
	jiayingz, mbligh, lizf

Add two tracepoints in the syscall entry and exit paths, conditioned on
TIF_SYSCALL_FTRACE. These support the syscall trace event code.

Signed-off-by: Jason Baron <jbaron@redhat.com>

---
 arch/x86/kernel/ptrace.c |    6 ++++--
 include/trace/syscall.h  |   20 ++++++++++++++++++++
 kernel/tracepoint.c      |   38 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 62 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index cabdabc..1625ce9 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -37,6 +37,8 @@
 #include <asm/hw_breakpoint.h>
 
 #include <trace/syscall.h>
+DEFINE_TRACE(syscall_enter);
+DEFINE_TRACE(syscall_exit);
 
 #include "tls.h"
 
@@ -1549,7 +1551,7 @@ asmregparm long syscall_trace_enter(struct pt_regs *regs)
 		ret = -1L;
 
 	if (unlikely(test_thread_flag(TIF_SYSCALL_FTRACE)))
-		ftrace_syscall_enter(regs);
+		trace_syscall_enter(regs, regs->orig_ax);
 
 	if (unlikely(current->audit_context)) {
 		if (IS_IA32)
@@ -1575,7 +1577,7 @@ asmregparm void syscall_trace_leave(struct pt_regs *regs)
 		audit_syscall_exit(AUDITSC_RESULT(regs->ax), regs->ax);
 
 	if (unlikely(test_thread_flag(TIF_SYSCALL_FTRACE)))
-		ftrace_syscall_exit(regs);
+		trace_syscall_exit(regs, regs->ax);
 
 	if (test_thread_flag(TIF_SYSCALL_TRACE))
 		tracehook_report_syscall_exit(regs, 0);
diff --git a/include/trace/syscall.h b/include/trace/syscall.h
index c55fcce..3951d77 100644
--- a/include/trace/syscall.h
+++ b/include/trace/syscall.h
@@ -1,8 +1,28 @@
 #ifndef _TRACE_SYSCALL_H
 #define _TRACE_SYSCALL_H
 
+#include <linux/tracepoint.h>
+
 #include <asm/ptrace.h>
 
+
+extern void syscall_regfunc(void);
+extern void syscall_unregfunc(void);
+
+DECLARE_TRACE_WITH_CALLBACK(syscall_enter,
+	TP_PROTO(struct pt_regs *regs, long id),
+	TP_ARGS(regs, id),
+	syscall_regfunc,
+	syscall_unregfunc
+);
+
+DECLARE_TRACE_WITH_CALLBACK(syscall_exit,
+	TP_PROTO(struct pt_regs *regs, long ret),
+	TP_ARGS(regs, ret),
+	syscall_regfunc,
+	syscall_unregfunc
+);
+
 /*
  * A syscall entry in the ftrace syscalls array.
  *
diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
index 1ef5d3a..070a42b 100644
--- a/kernel/tracepoint.c
+++ b/kernel/tracepoint.c
@@ -24,6 +24,7 @@
 #include <linux/tracepoint.h>
 #include <linux/err.h>
 #include <linux/slab.h>
+#include <linux/sched.h>
 
 extern struct tracepoint __start___tracepoints[];
 extern struct tracepoint __stop___tracepoints[];
@@ -577,3 +578,40 @@ static int init_tracepoints(void)
 __initcall(init_tracepoints);
 
 #endif /* CONFIG_MODULES */
+
+static DEFINE_MUTEX(regfunc_mutex);
+static int sys_tracepoint_refcount;
+
+void syscall_regfunc(void)
+{
+	unsigned long flags;
+	struct task_struct *g, *t;
+
+	mutex_lock(&regfunc_mutex);
+	if (!sys_tracepoint_refcount) {
+		read_lock_irqsave(&tasklist_lock, flags);
+		do_each_thread(g, t) {
+			set_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
+		} while_each_thread(g, t);
+		read_unlock_irqrestore(&tasklist_lock, flags);
+	}
+	sys_tracepoint_refcount++;
+	mutex_unlock(&regfunc_mutex);
+}
+
+void syscall_unregfunc(void)
+{
+	unsigned long flags;
+	struct task_struct *g, *t;
+
+	mutex_lock(&regfunc_mutex);
+	sys_tracepoint_refcount--;
+	if (!sys_tracepoint_refcount) {
+		read_lock_irqsave(&tasklist_lock, flags);
+		do_each_thread(g, t) {
+			clear_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
+		} while_each_thread(g, t);
+		read_unlock_irqrestore(&tasklist_lock, flags);
+	}
+	mutex_unlock(&regfunc_mutex);
+}
-- 
1.6.2.5
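
The reference counting in syscall_regfunc()/syscall_unregfunc() can be modelled
in user space as below. This is only a sketch: the task list is replaced by a
fixed array of flags, and the mutex and tasklist_lock are left out because the
model is single threaded. All demo_ names are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical stand-in for "all threads in the system". */
#define DEMO_NTASKS 3

bool demo_task_flag[DEMO_NTASKS];  /* models TIF_SYSCALL_FTRACE per task */
int demo_refcount;                 /* models sys_tracepoint_refcount */

/* Set or clear the flag on every modelled task. */
void demo_set_all(bool value)
{
	int i;

	for (i = 0; i < DEMO_NTASKS; i++)
		demo_task_flag[i] = value;
}

/* First user sets the flag everywhere; later users only bump the count. */
void demo_regfunc(void)
{
	if (!demo_refcount)
		demo_set_all(true);
	demo_refcount++;
}

/* Last user to leave clears the flag everywhere. */
void demo_unregfunc(void)
{
	demo_refcount--;
	if (!demo_refcount)
		demo_set_all(false);
}
```

Both the syscall_enter and syscall_exit tracepoints share the same pair of
callbacks in the patch, so the flag stays set until the last probe on either
tracepoint has been removed.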



* [PATCH 05/12] update FTRACE_SYSCALL_MAX
  2009-08-10 20:52 [PATCH 00/12] add syscall tracepoints V3 Jason Baron
                   ` (3 preceding siblings ...)
  2009-08-10 20:52 ` [PATCH 04/12] add syscall tracepoints Jason Baron
@ 2009-08-10 20:52 ` Jason Baron
  2009-08-11 11:00   ` Frederic Weisbecker
  2009-08-10 20:52 ` [PATCH 06/12] trace_event - raw_init bailout Jason Baron
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 88+ messages in thread
From: Jason Baron @ 2009-08-10 20:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: fweisbec, mingo, laijs, rostedt, peterz, mathieu.desnoyers,
	jiayingz, mbligh, lizf

Update FTRACE_SYSCALL_MAX to the current number of syscalls.

Signed-off-by: Jason Baron <jbaron@redhat.com>

---
 arch/x86/include/asm/ftrace.h |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/ftrace.h b/arch/x86/include/asm/ftrace.h
index bd2c651..7113654 100644
--- a/arch/x86/include/asm/ftrace.h
+++ b/arch/x86/include/asm/ftrace.h
@@ -30,9 +30,9 @@
 
 /* FIXME: I don't want to stay hardcoded */
 #ifdef CONFIG_X86_64
-# define FTRACE_SYSCALL_MAX     296
+# define FTRACE_SYSCALL_MAX     299
 #else
-# define FTRACE_SYSCALL_MAX     333
+# define FTRACE_SYSCALL_MAX     337
 #endif
 
 #ifdef CONFIG_FUNCTION_TRACER
-- 
1.6.2.5



* [PATCH 06/12] trace_event - raw_init bailout
  2009-08-10 20:52 [PATCH 00/12] add syscall tracepoints V3 Jason Baron
                   ` (4 preceding siblings ...)
  2009-08-10 20:52 ` [PATCH 05/12] update FTRACE_SYSCALL_MAX Jason Baron
@ 2009-08-10 20:52 ` Jason Baron
  2009-08-10 20:52 ` [PATCH 07/12] add ftrace_event_call void * 'data' field Jason Baron
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 88+ messages in thread
From: Jason Baron @ 2009-08-10 20:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: fweisbec, mingo, laijs, rostedt, peterz, mathieu.desnoyers,
	jiayingz, mbligh, lizf

Allow the return value of raw_init() to bail us out of creating a trace event
file.

Signed-off-by: Jason Baron <jbaron@redhat.com>

---
 kernel/trace/trace_events.c |   29 +++++++++++++++++++----------
 1 files changed, 19 insertions(+), 10 deletions(-)

diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index e0cbede..f95f847 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -925,15 +925,6 @@ event_create_dir(struct ftrace_event_call *call, struct dentry *d_events,
 	if (strcmp(call->system, TRACE_SYSTEM) != 0)
 		d_events = event_subsystem_dir(call->system, d_events);
 
-	if (call->raw_init) {
-		ret = call->raw_init();
-		if (ret < 0) {
-			pr_warning("Could not initialize trace point"
-				   " events/%s\n", call->name);
-			return ret;
-		}
-	}
-
 	call->dir = debugfs_create_dir(call->name, d_events);
 	if (!call->dir) {
 		pr_warning("Could not create debugfs "
@@ -1058,6 +1049,7 @@ static void trace_module_add_events(struct module *mod)
 	struct ftrace_module_file_ops *file_ops = NULL;
 	struct ftrace_event_call *call, *start, *end;
 	struct dentry *d_events;
+	int ret;
 
 	start = mod->trace_events;
 	end = mod->trace_events + mod->num_trace_events;
@@ -1073,7 +1065,15 @@ static void trace_module_add_events(struct module *mod)
 		/* The linker may leave blanks */
 		if (!call->name)
 			continue;
-
+		if (call->raw_init) {
+			ret = call->raw_init();
+			if (ret < 0) {
+				if (ret != -ENOSYS)
+					pr_warning("Could not initialize trace "
+					"point events/%s\n", call->name);
+				continue;
+			}
+		}
 		/*
 		 * This module has events, create file ops for this module
 		 * if not already done.
@@ -1225,6 +1225,15 @@ static __init int event_trace_init(void)
 		/* The linker may leave blanks */
 		if (!call->name)
 			continue;
+		if (call->raw_init) {
+			ret = call->raw_init();
+			if (ret < 0) {
+				if (ret != -ENOSYS)
+					pr_warning("Could not initialize trace "
+					"point events/%s\n", call->name);
+				continue;
+			}
+		}
 		list_add(&call->list, &ftrace_events);
 		event_create_dir(call, d_events, &ftrace_event_id_fops,
 				 &ftrace_enable_fops, &ftrace_event_filter_fops,
-- 
1.6.2.5
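
The bailout logic can be sketched in user space as follows. The sketch assumes
a reduced event structure with only a name and a raw_init hook; an event is
skipped when raw_init() returns a negative value, and a warning is printed
unless that value is -ENOSYS. The demo_ names and the warning counter are
hypothetical and only there to make the behaviour observable.

```c
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, reduced version of struct ftrace_event_call. */
struct demo_event {
	const char *name;
	int (*raw_init)(void);
};

int demo_init_ok(void)    { return 0; }
int demo_init_nosys(void) { return -ENOSYS; }
int demo_init_fail(void)  { return -ENOMEM; }

/*
 * Walk the events the way event_trace_init() does after this patch.
 * Returns how many events were accepted; *warned counts the failures
 * that were reported (anything negative other than -ENOSYS).
 */
int demo_add_events(const struct demo_event *ev, size_t n, int *warned)
{
	int added = 0;
	size_t i;

	*warned = 0;
	for (i = 0; i < n; i++) {
		/* The linker may leave blanks */
		if (!ev[i].name)
			continue;
		if (ev[i].raw_init) {
			int ret = ev[i].raw_init();

			if (ret < 0) {
				if (ret != -ENOSYS) {
					fprintf(stderr, "Could not initialize trace "
						"point events/%s\n", ev[i].name);
					(*warned)++;
				}
				continue;
			}
		}
		added++;
	}
	return added;
}
```

Treating -ENOSYS as a silent skip matters for the syscall events later in the
series, where a name that has no syscall number on the running architecture is
expected rather than an error.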



* [PATCH 07/12] add ftrace_event_call void * 'data' field
  2009-08-10 20:52 [PATCH 00/12] add syscall tracepoints V3 Jason Baron
                   ` (5 preceding siblings ...)
  2009-08-10 20:52 ` [PATCH 06/12] trace_event - raw_init bailout Jason Baron
@ 2009-08-10 20:52 ` Jason Baron
  2009-08-11 10:09   ` Frederic Weisbecker
  2009-08-10 20:52 ` [PATCH 08/12] add trace events for each syscall entry/exit Jason Baron
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 88+ messages in thread
From: Jason Baron @ 2009-08-10 20:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: fweisbec, mingo, laijs, rostedt, peterz, mathieu.desnoyers,
	jiayingz, mbligh, lizf

Add an optional void * 'data' pointer to 'ftrace_event_call' that is
passed to regfunc and unregfunc.

Signed-off-by: Jason Baron <jbaron@redhat.com>

---
 include/linux/ftrace_event.h |    5 +++--
 include/trace/ftrace.h       |    4 ++--
 kernel/trace/trace_events.c  |    4 ++--
 3 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
index ac8c6f8..8544f12 100644
--- a/include/linux/ftrace_event.h
+++ b/include/linux/ftrace_event.h
@@ -112,8 +112,8 @@ struct ftrace_event_call {
 	struct dentry		*dir;
 	struct trace_event	*event;
 	int			enabled;
-	int			(*regfunc)(void);
-	void			(*unregfunc)(void);
+	int			(*regfunc)(void *);
+	void			(*unregfunc)(void *);
 	int			id;
 	int			(*raw_init)(void);
 	int			(*show_format)(struct trace_seq *s);
@@ -122,6 +122,7 @@ struct ftrace_event_call {
 	int			filter_active;
 	struct event_filter	*filter;
 	void			*mod;
+	void			*data;
 
 	atomic_t		profile_count;
 	int			(*profile_enable)(struct ftrace_event_call *);
diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h
index 80e5f6c..a0de384 100644
--- a/include/trace/ftrace.h
+++ b/include/trace/ftrace.h
@@ -568,7 +568,7 @@ static void ftrace_raw_event_##call(proto)				\
 		trace_nowake_buffer_unlock_commit(event, irq_flags, pc); \
 }									\
 									\
-static int ftrace_raw_reg_event_##call(void)				\
+static int ftrace_raw_reg_event_##call(void *ptr)			\
 {									\
 	int ret;							\
 									\
@@ -579,7 +579,7 @@ static int ftrace_raw_reg_event_##call(void)				\
 	return ret;							\
 }									\
 									\
-static void ftrace_raw_unreg_event_##call(void)				\
+static void ftrace_raw_unreg_event_##call(void *ptr)			\
 {									\
 	unregister_trace_##call(ftrace_raw_event_##call);		\
 }									\
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index f95f847..1d289e2 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -86,14 +86,14 @@ static void ftrace_event_enable_disable(struct ftrace_event_call *call,
 		if (call->enabled) {
 			call->enabled = 0;
 			tracing_stop_cmdline_record();
-			call->unregfunc();
+			call->unregfunc(call->data);
 		}
 		break;
 	case 1:
 		if (!call->enabled) {
 			call->enabled = 1;
 			tracing_start_cmdline_record();
-			call->regfunc();
+			call->regfunc(call->data);
 		}
 		break;
 	}
-- 
1.6.2.5
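
As a rough user-space illustration of why the extra pointer is useful: one
pair of callbacks can serve many events if each event carries its own data.
The sketch below assumes a reduced event structure and records the pointer the
callback received; all demo_ names are hypothetical and the cmdline recording
calls from the real function are left out.

```c
#include <stddef.h>

/* Hypothetical, reduced version of struct ftrace_event_call. */
struct demo_event_call {
	int enabled;
	int (*regfunc)(void *);
	void (*unregfunc)(void *);
	void *data;
};

/* Records what the shared callbacks were last handed. */
const char *demo_last_name;

int demo_reg(void *ptr)
{
	demo_last_name = ptr;
	return 0;
}

void demo_unreg(void *ptr)
{
	(void)ptr;
	demo_last_name = NULL;
}

/* Same switch as ftrace_event_enable_disable(), minus cmdline recording. */
void demo_enable_disable(struct demo_event_call *call, int enable)
{
	if (enable && !call->enabled) {
		call->enabled = 1;
		call->regfunc(call->data);
	} else if (!enable && call->enabled) {
		call->enabled = 0;
		call->unregfunc(call->data);
	}
}
```

In the series the data field is later filled with the syscall name string, so
a single reg/unreg pair can look up the right syscall number per event.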



* [PATCH 08/12] add trace events for each syscall entry/exit
  2009-08-10 20:52 [PATCH 00/12] add syscall tracepoints V3 Jason Baron
                   ` (6 preceding siblings ...)
  2009-08-10 20:52 ` [PATCH 07/12] add ftrace_event_call void * 'data' field Jason Baron
@ 2009-08-10 20:52 ` Jason Baron
  2009-08-11 10:50   ` Frederic Weisbecker
  2009-08-25 12:50   ` Hendrik Brueckner
  2009-08-10 20:52 ` [PATCH 09/12] add support traceopint ids Jason Baron
                   ` (4 subsequent siblings)
  12 siblings, 2 replies; 88+ messages in thread
From: Jason Baron @ 2009-08-10 20:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: fweisbec, mingo, laijs, rostedt, peterz, mathieu.desnoyers,
	jiayingz, mbligh, lizf

Layer Frederic's syscall tracer on tracepoints. We create trace events by
hooking into the SYSCALL_DEFINE macros. This allows us to individually toggle
syscall entry and exit points on/off.

Signed-off-by: Jason Baron <jbaron@redhat.com>

---
 include/linux/syscalls.h      |   61 +++++++++++++-
 include/trace/syscall.h       |   18 ++--
 kernel/trace/trace_syscalls.c |  183 ++++++++++++++++++++---------------------
 3 files changed, 159 insertions(+), 103 deletions(-)

diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 80de700..5e5b4d3 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -64,6 +64,7 @@ struct perf_counter_attr;
 #include <linux/sem.h>
 #include <asm/siginfo.h>
 #include <asm/signal.h>
+#include <linux/unistd.h>
 #include <linux/quota.h>
 #include <linux/key.h>
 #include <trace/syscall.h>
@@ -112,6 +113,59 @@ struct perf_counter_attr;
 #define __SC_STR_TDECL5(t, a, ...)	#t, __SC_STR_TDECL4(__VA_ARGS__)
 #define __SC_STR_TDECL6(t, a, ...)	#t, __SC_STR_TDECL5(__VA_ARGS__)
 
+
+#define SYSCALL_TRACE_ENTER_EVENT(sname)				\
+	static struct ftrace_event_call event_enter_##sname;		\
+	static int init_enter_##sname(void)				\
+	{								\
+		int num;						\
+		num = syscall_name_to_nr("sys"#sname);			\
+		if (num < 0)						\
+			return -ENOSYS;					\
+		register_ftrace_event(&event_syscall_enter);		\
+		INIT_LIST_HEAD(&event_enter_##sname.fields);		\
+		init_preds(&event_enter_##sname);			\
+		return 0;						\
+	}								\
+	static struct ftrace_event_call __used				\
+	  __attribute__((__aligned__(4)))				\
+	  __attribute__((section("_ftrace_events")))			\
+	  event_enter_##sname = {					\
+		.name                   = "sys_enter"#sname,		\
+		.system                 = "syscalls",			\
+		.event                  = &event_syscall_enter,		\
+		.raw_init		= init_enter_##sname,		\
+		.regfunc		= reg_event_syscall_enter,	\
+		.unregfunc		= unreg_event_syscall_enter,	\
+		.data			= "sys"#sname,			\
+	}
+
+#define SYSCALL_TRACE_EXIT_EVENT(sname)					\
+	static struct ftrace_event_call event_exit_##sname;		\
+	static int init_exit_##sname(void)				\
+	{								\
+		int num;						\
+		num = syscall_name_to_nr("sys"#sname);			\
+		if (num < 0)						\
+			return -ENOSYS;					\
+		register_ftrace_event(&event_syscall_exit);		\
+		INIT_LIST_HEAD(&event_exit_##sname.fields);		\
+		init_preds(&event_exit_##sname);			\
+		return 0;						\
+	}								\
+	static struct ftrace_event_call __used				\
+	  __attribute__((__aligned__(4)))				\
+	  __attribute__((section("_ftrace_events")))			\
+	  event_exit_##sname = {					\
+		.name                   = "sys_exit"#sname,		\
+		.system                 = "syscalls",			\
+		.event                  = &event_syscall_exit,		\
+		.raw_init		= init_exit_##sname,		\
+		.regfunc		= reg_event_syscall_exit,	\
+		.unregfunc		= unreg_event_syscall_exit,	\
+		.data			= "sys"#sname,			\
+	}
+
 #define SYSCALL_METADATA(sname, nb)				\
 	static const struct syscall_metadata __used		\
 	  __attribute__((__aligned__(4)))			\
@@ -121,7 +175,9 @@ struct perf_counter_attr;
 		.nb_args 	= nb,				\
 		.types		= types_##sname,		\
 		.args		= args_##sname,			\
-	}
+	};							\
+	SYSCALL_TRACE_ENTER_EVENT(sname);			\
+	SYSCALL_TRACE_EXIT_EVENT(sname);
 
 #define SYSCALL_DEFINE0(sname)					\
 	static const struct syscall_metadata __used		\
@@ -131,8 +187,9 @@ struct perf_counter_attr;
 		.name 		= "sys_"#sname,			\
 		.nb_args 	= 0,				\
 	};							\
+	SYSCALL_TRACE_ENTER_EVENT(_##sname);			\
+	SYSCALL_TRACE_EXIT_EVENT(_##sname);			\
 	asmlinkage long sys_##sname(void)
-
 #else
 #define SYSCALL_DEFINE0(name)	   asmlinkage long sys_##name(void)
 #endif
diff --git a/include/trace/syscall.h b/include/trace/syscall.h
index 3951d77..73fb8b4 100644
--- a/include/trace/syscall.h
+++ b/include/trace/syscall.h
@@ -2,6 +2,8 @@
 #define _TRACE_SYSCALL_H
 
 #include <linux/tracepoint.h>
+#include <linux/unistd.h>
+#include <linux/ftrace_event.h>
 
 #include <asm/ptrace.h>
 
@@ -40,15 +42,13 @@ struct syscall_metadata {
 
 #ifdef CONFIG_FTRACE_SYSCALLS
 extern struct syscall_metadata *syscall_nr_to_meta(int nr);
-extern void start_ftrace_syscalls(void);
-extern void stop_ftrace_syscalls(void);
-extern void ftrace_syscall_enter(struct pt_regs *regs);
-extern void ftrace_syscall_exit(struct pt_regs *regs);
-#else
-static inline void start_ftrace_syscalls(void)			{ }
-static inline void stop_ftrace_syscalls(void)			{ }
-static inline void ftrace_syscall_enter(struct pt_regs *regs)	{ }
-static inline void ftrace_syscall_exit(struct pt_regs *regs)	{ }
+extern int syscall_name_to_nr(char *name);
+extern struct trace_event event_syscall_enter;
+extern struct trace_event event_syscall_exit;
+extern int reg_event_syscall_enter(void *ptr);
+extern void unreg_event_syscall_enter(void *ptr);
+extern int reg_event_syscall_exit(void *ptr);
+extern void unreg_event_syscall_exit(void *ptr);
 #endif
 
 #endif /* _TRACE_SYSCALL_H */
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index 08aed43..c7ae25e 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -1,15 +1,16 @@
 #include <trace/syscall.h>
 #include <linux/kernel.h>
+#include <linux/ftrace.h>
 #include <asm/syscall.h>
 
 #include "trace_output.h"
 #include "trace.h"
 
-/* Keep a counter of the syscall tracing users */
-static int refcount;
-
-/* Prevent from races on thread flags toggling */
 static DEFINE_MUTEX(syscall_trace_lock);
+static int sys_refcount_enter;
+static int sys_refcount_exit;
+static DECLARE_BITMAP(enabled_enter_syscalls, FTRACE_SYSCALL_MAX);
+static DECLARE_BITMAP(enabled_exit_syscalls, FTRACE_SYSCALL_MAX);
 
 /* Option to display the parameters types */
 enum {
@@ -95,53 +96,7 @@ print_syscall_exit(struct trace_iterator *iter, int flags)
 	return TRACE_TYPE_HANDLED;
 }
 
-void start_ftrace_syscalls(void)
-{
-	unsigned long flags;
-	struct task_struct *g, *t;
-
-	mutex_lock(&syscall_trace_lock);
-
-	/* Don't enable the flag on the tasks twice */
-	if (++refcount != 1)
-		goto unlock;
-
-	read_lock_irqsave(&tasklist_lock, flags);
-
-	do_each_thread(g, t) {
-		set_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
-	} while_each_thread(g, t);
-
-	read_unlock_irqrestore(&tasklist_lock, flags);
-
-unlock:
-	mutex_unlock(&syscall_trace_lock);
-}
-
-void stop_ftrace_syscalls(void)
-{
-	unsigned long flags;
-	struct task_struct *g, *t;
-
-	mutex_lock(&syscall_trace_lock);
-
-	/* There are perhaps still some users */
-	if (--refcount)
-		goto unlock;
-
-	read_lock_irqsave(&tasklist_lock, flags);
-
-	do_each_thread(g, t) {
-		clear_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
-	} while_each_thread(g, t);
-
-	read_unlock_irqrestore(&tasklist_lock, flags);
-
-unlock:
-	mutex_unlock(&syscall_trace_lock);
-}
-
-void ftrace_syscall_enter(struct pt_regs *regs)
+void ftrace_syscall_enter(struct pt_regs *regs, long id)
 {
 	struct syscall_trace_enter *entry;
 	struct syscall_metadata *sys_data;
@@ -150,6 +105,8 @@ void ftrace_syscall_enter(struct pt_regs *regs)
 	int syscall_nr;
 
 	syscall_nr = syscall_get_nr(current, regs);
+	if (!test_bit(syscall_nr, enabled_enter_syscalls))
+		return;
 
 	sys_data = syscall_nr_to_meta(syscall_nr);
 	if (!sys_data)
@@ -170,7 +127,7 @@ void ftrace_syscall_enter(struct pt_regs *regs)
 	trace_wake_up();
 }
 
-void ftrace_syscall_exit(struct pt_regs *regs)
+void ftrace_syscall_exit(struct pt_regs *regs, long ret)
 {
 	struct syscall_trace_exit *entry;
 	struct syscall_metadata *sys_data;
@@ -178,6 +135,8 @@ void ftrace_syscall_exit(struct pt_regs *regs)
 	int syscall_nr;
 
 	syscall_nr = syscall_get_nr(current, regs);
+	if (!test_bit(syscall_nr, enabled_exit_syscalls))
+		return;
 
 	sys_data = syscall_nr_to_meta(syscall_nr);
 	if (!sys_data)
@@ -196,54 +155,94 @@ void ftrace_syscall_exit(struct pt_regs *regs)
 	trace_wake_up();
 }
 
-static int init_syscall_tracer(struct trace_array *tr)
+int reg_event_syscall_enter(void *ptr)
 {
-	start_ftrace_syscalls();
-
-	return 0;
+	int ret = 0;
+	int num;
+	char *name;
+
+	name = (char *)ptr;
+	num = syscall_name_to_nr(name);
+	if (num < 0 || num >= FTRACE_SYSCALL_MAX)
+		return -ENOSYS;
+	mutex_lock(&syscall_trace_lock);
+	if (!sys_refcount_enter)
+		ret = register_trace_syscall_enter(ftrace_syscall_enter);
+	if (ret) {
+		pr_info("event trace: Could not activate "
+				"syscall entry trace point");
+	} else {
+		set_bit(num, enabled_enter_syscalls);
+		sys_refcount_enter++;
+	}
+	mutex_unlock(&syscall_trace_lock);
+	return ret;
 }
 
-static void reset_syscall_tracer(struct trace_array *tr)
+void unreg_event_syscall_enter(void *ptr)
 {
-	stop_ftrace_syscalls();
-	tracing_reset_online_cpus(tr);
-}
-
-static struct trace_event syscall_enter_event = {
-	.type	 	= TRACE_SYSCALL_ENTER,
-	.trace		= print_syscall_enter,
-};
-
-static struct trace_event syscall_exit_event = {
-	.type	 	= TRACE_SYSCALL_EXIT,
-	.trace		= print_syscall_exit,
-};
+	int num;
+	char *name;
 
-static struct tracer syscall_tracer __read_mostly = {
-	.name	     	= "syscall",
-	.init		= init_syscall_tracer,
-	.reset		= reset_syscall_tracer,
-	.flags		= &syscalls_flags,
-};
+	name = (char *)ptr;
+	num = syscall_name_to_nr(name);
+	if (num < 0 || num >= FTRACE_SYSCALL_MAX)
+		return;
+	mutex_lock(&syscall_trace_lock);
+	sys_refcount_enter--;
+	clear_bit(num, enabled_enter_syscalls);
+	if (!sys_refcount_enter)
+		unregister_trace_syscall_enter(ftrace_syscall_enter);
+	mutex_unlock(&syscall_trace_lock);
+}
 
-__init int register_ftrace_syscalls(void)
+int reg_event_syscall_exit(void *ptr)
 {
-	int ret;
-
-	ret = register_ftrace_event(&syscall_enter_event);
-	if (!ret) {
-		printk(KERN_WARNING "event %d failed to register\n",
-		       syscall_enter_event.type);
-		WARN_ON_ONCE(1);
+	int ret = 0;
+	int num;
+	char *name;
+
+	name = (char *)ptr;
+	num = syscall_name_to_nr(name);
+	if (num < 0 || num >= FTRACE_SYSCALL_MAX)
+		return -ENOSYS;
+	mutex_lock(&syscall_trace_lock);
+	if (!sys_refcount_exit)
+		ret = register_trace_syscall_exit(ftrace_syscall_exit);
+	if (ret) {
+		pr_info("event trace: Could not activate "
+				"syscall exit trace point\n");
+	} else {
+		set_bit(num, enabled_exit_syscalls);
+		sys_refcount_exit++;
 	}
+	mutex_unlock(&syscall_trace_lock);
+	return ret;
+}
 
-	ret = register_ftrace_event(&syscall_exit_event);
-	if (!ret) {
-		printk(KERN_WARNING "event %d failed to register\n",
-		       syscall_exit_event.type);
-		WARN_ON_ONCE(1);
-	}
+void unreg_event_syscall_exit(void *ptr)
+{
+	int num;
+	char *name;
 
-	return register_tracer(&syscall_tracer);
+	name = (char *)ptr;
+	num = syscall_name_to_nr(name);
+	if (num < 0 || num >= FTRACE_SYSCALL_MAX)
+		return;
+	mutex_lock(&syscall_trace_lock);
+	sys_refcount_exit--;
+	clear_bit(num, enabled_exit_syscalls);
+	if (!sys_refcount_exit)
+		unregister_trace_syscall_exit(ftrace_syscall_exit);
+	mutex_unlock(&syscall_trace_lock);
 }
-device_initcall(register_ftrace_syscalls);
+
+struct trace_event event_syscall_enter = {
+	.trace			= print_syscall_enter,
+	.type			= TRACE_SYSCALL_ENTER
+};
+
+struct trace_event event_syscall_exit = {
+	.trace			= print_syscall_exit,
+	.type			= TRACE_SYSCALL_EXIT
+};
-- 
1.6.2.5



* [PATCH 09/12] add support traceopint ids
  2009-08-10 20:52 [PATCH 00/12] add syscall tracepoints V3 Jason Baron
                   ` (7 preceding siblings ...)
  2009-08-10 20:52 ` [PATCH 08/12] add trace events for each syscall entry/exit Jason Baron
@ 2009-08-10 20:52 ` Jason Baron
  2009-08-11 11:28   ` Frederic Weisbecker
  2009-08-10 20:53 ` [PATCH 10/12] add perf counter support Jason Baron
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 88+ messages in thread
From: Jason Baron @ 2009-08-10 20:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: fweisbec, mingo, laijs, rostedt, peterz, mathieu.desnoyers,
	jiayingz, mbligh, lizf

This patch associates an id with each syscall trace event, so that we can
identify each syscall trace event using the 'perf' tool.
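
As a rough user-space illustration of the idea (a sketch, not the kernel code): each syscall's metadata records the ids handed out when its enter and exit events are registered, and a consumer can then check a record's type against the stored id. The `demo_` names, `NR_DEMO_SYSCALLS` and `fake_register_event()` are hypothetical stand-ins, not part of this patch.

```c
#include <stddef.h>

#define NR_DEMO_SYSCALLS 4	/* hypothetical table size */

struct demo_metadata {
	const char *name;
	int enter_id;	/* id of the "enter" event, 0 while unset */
	int exit_id;	/* id of the "exit" event, 0 while unset */
};

static struct demo_metadata demo_table[NR_DEMO_SYSCALLS] = {
	{ "sys_read", 0, 0 },
	{ "sys_write", 0, 0 },
	{ "sys_open", 0, 0 },
	{ "sys_close", 0, 0 },
};

static int demo_next_id = 1;

/* Stand-in for event registration: hands out a fresh non-zero id. */
int fake_register_event(void)
{
	return demo_next_id++;
}

/* Register both events for syscall nr and remember their ids. */
int demo_init_syscall(int nr)
{
	if (nr < 0 || nr >= NR_DEMO_SYSCALLS)
		return -1;
	demo_table[nr].enter_id = fake_register_event();
	demo_table[nr].exit_id = fake_register_event();
	return 0;
}

/* Return 1 if 'type' is the enter id recorded for syscall nr. */
int demo_is_enter_record(int nr, int type)
{
	if (nr < 0 || nr >= NR_DEMO_SYSCALLS)
		return 0;
	return demo_table[nr].enter_id != 0 &&
	       demo_table[nr].enter_id == type;
}
```

The check in `demo_is_enter_record()` plays the same role as the `entry->enter_id != ent->type` test in the print functions in the diff.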

Signed-off-by: Jason Baron <jbaron@redhat.com>

---
 arch/x86/kernel/ftrace.c      |   10 ++++++++++
 include/linux/syscalls.h      |   22 ++++++++++++++++++----
 include/trace/syscall.h       |    8 ++++++++
 kernel/trace/trace.h          |    6 ------
 kernel/trace/trace_syscalls.c |   26 ++++++++++++++++----------
 5 files changed, 52 insertions(+), 20 deletions(-)

diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index 0d93d40..3cff121 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -516,6 +516,16 @@ int syscall_name_to_nr(char *name)
 	return -1;
 }
 
+void set_syscall_enter_id(int num, int id)
+{
+	syscalls_metadata[num]->enter_id = id;
+}
+
+void set_syscall_exit_id(int num, int id)
+{
+	syscalls_metadata[num]->exit_id = id;
+}
+
 static int __init arch_init_ftrace_syscalls(void)
 {
 	int i;
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 5e5b4d3..ce4b01c 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -116,13 +116,20 @@ struct perf_counter_attr;
 
 #define SYSCALL_TRACE_ENTER_EVENT(sname)				\
 	static struct ftrace_event_call event_enter_##sname;		\
+	struct trace_event enter_syscall_print_##sname = {		\
+		.trace                  = print_syscall_enter,		\
+	};								\
 	static int init_enter_##sname(void)				\
 	{								\
-		int num;						\
+		int num, id;						\
 		num = syscall_name_to_nr("sys"#sname);			\
 		if (num < 0)						\
 			return -ENOSYS;					\
-		register_ftrace_event(&event_syscall_enter);		\
+		id = register_ftrace_event(&enter_syscall_print_##sname);\
+		if (!id)						\
+			return -ENODEV;					\
+		event_enter_##sname.id = id;				\
+		set_syscall_enter_id(num, id);				\
 		INIT_LIST_HEAD(&event_enter_##sname.fields);		\
 		init_preds(&event_enter_##sname);			\
 		return 0;						\
@@ -142,13 +149,20 @@ struct perf_counter_attr;
 
 #define SYSCALL_TRACE_EXIT_EVENT(sname)					\
 	static struct ftrace_event_call event_exit_##sname;		\
+	struct trace_event exit_syscall_print_##sname = {		\
+		.trace                  = print_syscall_exit,		\
+	};								\
 	static int init_exit_##sname(void)				\
 	{								\
-		int num;						\
+		int num, id;						\
 		num = syscall_name_to_nr("sys"#sname);			\
 		if (num < 0)						\
 			return -ENOSYS;					\
-		register_ftrace_event(&event_syscall_exit);		\
+		id = register_ftrace_event(&exit_syscall_print_##sname);\
+		if (!id)						\
+			return -ENODEV;					\
+		event_exit_##sname.id = id;				\
+		set_syscall_exit_id(num, id);				\
 		INIT_LIST_HEAD(&event_exit_##sname.fields);		\
 		init_preds(&event_exit_##sname);			\
 		return 0;						\
diff --git a/include/trace/syscall.h b/include/trace/syscall.h
index 73fb8b4..df62840 100644
--- a/include/trace/syscall.h
+++ b/include/trace/syscall.h
@@ -32,23 +32,31 @@ DECLARE_TRACE_WITH_CALLBACK(syscall_exit,
  * @nb_args: number of parameters it takes
  * @types: list of types as strings
  * @args: list of args as strings (args[i] matches types[i])
+ * @enter_id: associated ftrace enter event id
+ * @exit_id: associated ftrace exit event id
  */
 struct syscall_metadata {
 	const char	*name;
 	int		nb_args;
 	const char	**types;
 	const char	**args;
+	int		enter_id;
+	int		exit_id;
 };
 
 #ifdef CONFIG_FTRACE_SYSCALLS
 extern struct syscall_metadata *syscall_nr_to_meta(int nr);
 extern int syscall_name_to_nr(char *name);
+void set_syscall_enter_id(int num, int id);
+void set_syscall_exit_id(int num, int id);
 extern struct trace_event event_syscall_enter;
 extern struct trace_event event_syscall_exit;
 extern int reg_event_syscall_enter(void *ptr);
 extern void unreg_event_syscall_enter(void *ptr);
 extern int reg_event_syscall_exit(void *ptr);
 extern void unreg_event_syscall_exit(void *ptr);
+enum print_line_t print_syscall_enter(struct trace_iterator *iter, int flags);
+enum print_line_t print_syscall_exit(struct trace_iterator *iter, int flags);
 #endif
 
 #endif /* _TRACE_SYSCALL_H */
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 44308f3..606073c 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -38,8 +38,6 @@ enum trace_type {
 	TRACE_GRAPH_ENT,
 	TRACE_USER_STACK,
 	TRACE_HW_BRANCHES,
-	TRACE_SYSCALL_ENTER,
-	TRACE_SYSCALL_EXIT,
 	TRACE_KMEM_ALLOC,
 	TRACE_KMEM_FREE,
 	TRACE_POWER,
@@ -334,10 +332,6 @@ extern void __ftrace_bad_type(void);
 			  TRACE_KMEM_ALLOC);	\
 		IF_ASSIGN(var, ent, struct kmemtrace_free_entry,	\
 			  TRACE_KMEM_FREE);	\
-		IF_ASSIGN(var, ent, struct syscall_trace_enter,		\
-			  TRACE_SYSCALL_ENTER);				\
-		IF_ASSIGN(var, ent, struct syscall_trace_exit,		\
-			  TRACE_SYSCALL_EXIT);				\
 		IF_ASSIGN(var, ent, struct ksym_trace_entry, TRACE_KSYM);\
 		__ftrace_bad_type();					\
 	} while (0)
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index c7ae25e..e58a9c1 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -36,14 +36,18 @@ print_syscall_enter(struct trace_iterator *iter, int flags)
 	struct syscall_metadata *entry;
 	int i, ret, syscall;
 
-	trace_assign_type(trace, ent);
-
+	trace = (typeof(trace))ent;
 	syscall = trace->nr;
-
 	entry = syscall_nr_to_meta(syscall);
+
 	if (!entry)
 		goto end;
 
+	if (entry->enter_id != ent->type) {
+		WARN_ON_ONCE(1);
+		goto end;
+	}
+
 	ret = trace_seq_printf(s, "%s(", entry->name);
 	if (!ret)
 		return TRACE_TYPE_PARTIAL_LINE;
@@ -78,16 +82,20 @@ print_syscall_exit(struct trace_iterator *iter, int flags)
 	struct syscall_metadata *entry;
 	int ret;
 
-	trace_assign_type(trace, ent);
-
+	trace = (typeof(trace))ent;
 	syscall = trace->nr;
-
 	entry = syscall_nr_to_meta(syscall);
+
 	if (!entry) {
 		trace_seq_printf(s, "\n");
 		return TRACE_TYPE_HANDLED;
 	}
 
+	if (entry->exit_id != ent->type) {
+		WARN_ON_ONCE(1);
+		return TRACE_TYPE_UNHANDLED;
+	}
+
 	ret = trace_seq_printf(s, "%s -> 0x%lx\n", entry->name,
 				trace->ret);
 	if (!ret)
@@ -114,7 +122,7 @@ void ftrace_syscall_enter(struct pt_regs *regs, long id)
 
 	size = sizeof(*entry) + sizeof(unsigned long) * sys_data->nb_args;
 
-	event = trace_current_buffer_lock_reserve(TRACE_SYSCALL_ENTER, size,
+	event = trace_current_buffer_lock_reserve(sys_data->enter_id, size,
 							0, 0);
 	if (!event)
 		return;
@@ -142,7 +150,7 @@ void ftrace_syscall_exit(struct pt_regs *regs, long ret)
 	if (!sys_data)
 		return;
 
-	event = trace_current_buffer_lock_reserve(TRACE_SYSCALL_EXIT,
+	event = trace_current_buffer_lock_reserve(sys_data->exit_id,
 				sizeof(*entry), 0, 0);
 	if (!event)
 		return;
@@ -239,10 +247,8 @@ void unreg_event_syscall_exit(void *ptr)
 
 struct trace_event event_syscall_enter = {
 	.trace			= print_syscall_enter,
-	.type			= TRACE_SYSCALL_ENTER
 };
 
 struct trace_event event_syscall_exit = {
 	.trace			= print_syscall_exit,
-	.type			= TRACE_SYSCALL_EXIT
 };
-- 
1.6.2.5



* [PATCH 10/12] add perf counter support
  2009-08-10 20:52 [PATCH 00/12] add syscall tracepoints V3 Jason Baron
                   ` (8 preceding siblings ...)
  2009-08-10 20:52 ` [PATCH 09/12] add support traceopint ids Jason Baron
@ 2009-08-10 20:53 ` Jason Baron
  2009-08-11 12:12   ` Frederic Weisbecker
  2009-08-10 20:53 ` [PATCH 11/12] add more namespace area to 'perf list' output Jason Baron
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 88+ messages in thread
From: Jason Baron @ 2009-08-10 20:53 UTC (permalink / raw)
  To: linux-kernel
  Cc: fweisbec, mingo, laijs, rostedt, peterz, mathieu.desnoyers,
	jiayingz, mbligh, lizf

Make 'perf stat -e syscalls:sys_enter_blah' work with syscall style tracepoints.
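
The enable/disable helpers in this patch use a counter that starts at -1, so the first enable brings it to 0 and the last disable brings it back below 0. A minimal user-space sketch of that pattern, assuming C11 atomics in place of the kernel's atomic_t (the `demo_` names are hypothetical and the flag stands in for probe registration):

```c
#include <stdatomic.h>

atomic_int demo_profile_count = -1;	/* -1 means "no users" */
int demo_probe_registered;		/* stands in for the real probe */

int demo_profile_enable(void)
{
	/* fetch_add returns the old value: old == -1 means new == 0,
	 * i.e. this caller is the first user. */
	if (atomic_fetch_add(&demo_profile_count, 1) == -1)
		demo_probe_registered = 1;
	return 0;
}

void demo_profile_disable(void)
{
	/* old == 0 means the count just dropped back to -1,
	 * i.e. the last user went away. */
	if (atomic_fetch_sub(&demo_profile_count, 1) == 0)
		demo_probe_registered = 0;
}
```

Only the first enable and the last disable touch the probe; intermediate calls only move the counter.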

Signed-off-by: Jason Baron <jbaron@redhat.com>

---
 include/linux/perf_counter.h  |    2 +
 include/linux/syscalls.h      |   52 +++++++++++++++++-
 include/trace/syscall.h       |    7 +++
 kernel/trace/trace_syscalls.c |  121 +++++++++++++++++++++++++++++++++++++++++
 4 files changed, 181 insertions(+), 1 deletions(-)

diff --git a/include/linux/perf_counter.h b/include/linux/perf_counter.h
index c484834..aaf0c74 100644
--- a/include/linux/perf_counter.h
+++ b/include/linux/perf_counter.h
@@ -734,6 +734,8 @@ extern int sysctl_perf_counter_mlock;
 extern int sysctl_perf_counter_sample_rate;
 
 extern void perf_counter_init(void);
+extern void perf_tpcounter_event(int event_id, u64 addr, u64 count,
+				 void *record, int entry_size);
 
 #ifndef perf_misc_flags
 #define perf_misc_flags(regs)	(user_mode(regs) ? PERF_EVENT_MISC_USER : \
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index ce4b01c..5541e75 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -98,6 +98,53 @@ struct perf_counter_attr;
 #define __SC_TEST5(t5, a5, ...)	__SC_TEST(t5); __SC_TEST4(__VA_ARGS__)
 #define __SC_TEST6(t6, a6, ...)	__SC_TEST(t6); __SC_TEST5(__VA_ARGS__)
 
+#ifdef CONFIG_EVENT_PROFILE
+#define TRACE_SYS_ENTER_PROFILE(sname)					       \
+static int prof_sysenter_enable_##sname(struct ftrace_event_call *event_call)  \
+{									       \
+	int ret = 0;							       \
+	if (!atomic_inc_return(&event_enter_##sname.profile_count))	       \
+		ret = reg_prof_syscall_enter("sys"#sname);		       \
+	return ret;							       \
+}									       \
+									       \
+static void prof_sysenter_disable_##sname(struct ftrace_event_call *event_call)\
+{									       \
+	if (atomic_add_negative(-1, &event_enter_##sname.profile_count))       \
+		unreg_prof_syscall_enter("sys"#sname);			       \
+}
+
+#define TRACE_SYS_EXIT_PROFILE(sname)					       \
+static int prof_sysexit_enable_##sname(struct ftrace_event_call *event_call)   \
+{									       \
+	int ret = 0;							       \
+	if (!atomic_inc_return(&event_exit_##sname.profile_count))	       \
+		ret = reg_prof_syscall_exit("sys"#sname);		       \
+	return ret;							       \
+}									       \
+									       \
+static void prof_sysexit_disable_##sname(struct ftrace_event_call *event_call) \
+{                                                                              \
+	if (atomic_add_negative(-1, &event_exit_##sname.profile_count))	       \
+		unreg_prof_syscall_exit("sys"#sname);			       \
+}
+
+#define TRACE_SYS_ENTER_PROFILE_INIT(sname)				       \
+	.profile_count = ATOMIC_INIT(-1),				       \
+	.profile_enable = prof_sysenter_enable_##sname,			       \
+	.profile_disable = prof_sysenter_disable_##sname,
+
+#define TRACE_SYS_EXIT_PROFILE_INIT(sname)				       \
+	.profile_count = ATOMIC_INIT(-1),				       \
+	.profile_enable = prof_sysexit_enable_##sname,			       \
+	.profile_disable = prof_sysexit_disable_##sname,
+#else
+#define TRACE_SYS_ENTER_PROFILE(sname)
+#define TRACE_SYS_ENTER_PROFILE_INIT(sname)
+#define TRACE_SYS_EXIT_PROFILE(sname)
+#define TRACE_SYS_EXIT_PROFILE_INIT(sname)
+#endif
+
 #ifdef CONFIG_FTRACE_SYSCALLS
 #define __SC_STR_ADECL1(t, a)		#a
 #define __SC_STR_ADECL2(t, a, ...)	#a, __SC_STR_ADECL1(__VA_ARGS__)
@@ -113,7 +160,6 @@ struct perf_counter_attr;
 #define __SC_STR_TDECL5(t, a, ...)	#t, __SC_STR_TDECL4(__VA_ARGS__)
 #define __SC_STR_TDECL6(t, a, ...)	#t, __SC_STR_TDECL5(__VA_ARGS__)
 
-
 #define SYSCALL_TRACE_ENTER_EVENT(sname)				\
 	static struct ftrace_event_call event_enter_##sname;		\
 	struct trace_event enter_syscall_print_##sname = {		\
@@ -134,6 +180,7 @@ struct perf_counter_attr;
 		init_preds(&event_enter_##sname);			\
 		return 0;						\
 	}								\
+	TRACE_SYS_ENTER_PROFILE(sname);					\
 	static struct ftrace_event_call __used				\
 	  __attribute__((__aligned__(4)))				\
 	  __attribute__((section("_ftrace_events")))			\
@@ -145,6 +192,7 @@ struct perf_counter_attr;
 		.regfunc		= reg_event_syscall_enter,	\
 		.unregfunc		= unreg_event_syscall_enter,	\
 		.data			= "sys"#sname,			\
+		TRACE_SYS_ENTER_PROFILE_INIT(sname)			\
 	}
 
 #define SYSCALL_TRACE_EXIT_EVENT(sname)					\
@@ -167,6 +215,7 @@ struct perf_counter_attr;
 		init_preds(&event_exit_##sname);			\
 		return 0;						\
 	}								\
+	TRACE_SYS_EXIT_PROFILE(sname);					\
 	static struct ftrace_event_call __used				\
 	  __attribute__((__aligned__(4)))				\
 	  __attribute__((section("_ftrace_events")))			\
@@ -178,6 +227,7 @@ struct perf_counter_attr;
 		.regfunc		= reg_event_syscall_exit,	\
 		.unregfunc		= unreg_event_syscall_exit,	\
 		.data			= "sys"#sname,			\
+		TRACE_SYS_EXIT_PROFILE_INIT(sname)			\
 	}
 
 #define SYSCALL_METADATA(sname, nb)				\
diff --git a/include/trace/syscall.h b/include/trace/syscall.h
index df62840..3ab6dd1 100644
--- a/include/trace/syscall.h
+++ b/include/trace/syscall.h
@@ -58,5 +58,12 @@ extern void unreg_event_syscall_exit(void *ptr);
 enum print_line_t print_syscall_enter(struct trace_iterator *iter, int flags);
 enum print_line_t print_syscall_exit(struct trace_iterator *iter, int flags);
 #endif
+#ifdef CONFIG_EVENT_PROFILE
+int reg_prof_syscall_enter(char *name);
+void unreg_prof_syscall_enter(char *name);
+int reg_prof_syscall_exit(char *name);
+void unreg_prof_syscall_exit(char *name);
+
+#endif
 
 #endif /* _TRACE_SYSCALL_H */
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index e58a9c1..f4eaec3 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -1,6 +1,7 @@
 #include <trace/syscall.h>
 #include <linux/kernel.h>
 #include <linux/ftrace.h>
+#include <linux/perf_counter.h>
 #include <asm/syscall.h>
 
 #include "trace_output.h"
@@ -252,3 +253,123 @@ struct trace_event event_syscall_enter = {
 struct trace_event event_syscall_exit = {
 	.trace			= print_syscall_exit,
 };
+
+#ifdef CONFIG_EVENT_PROFILE
+static DECLARE_BITMAP(enabled_prof_enter_syscalls, FTRACE_SYSCALL_MAX);
+static DECLARE_BITMAP(enabled_prof_exit_syscalls, FTRACE_SYSCALL_MAX);
+static int sys_prof_refcount_enter;
+static int sys_prof_refcount_exit;
+
+static void prof_syscall_enter(struct pt_regs *regs, long id)
+{
+	struct syscall_metadata *sys_data;
+	int syscall_nr;
+
+	syscall_nr = syscall_get_nr(current, regs);
+	if (!test_bit(syscall_nr, enabled_prof_enter_syscalls))
+		return;
+
+	sys_data = syscall_nr_to_meta(syscall_nr);
+	if (!sys_data)
+		return;
+
+	perf_tpcounter_event(sys_data->enter_id, 0, 1, NULL, 0);
+}
+
+int reg_prof_syscall_enter(char *name)
+{
+	int ret = 0;
+	int num;
+
+	num = syscall_name_to_nr(name);
+	if (num < 0 || num >= FTRACE_SYSCALL_MAX)
+		return -ENOSYS;
+
+	mutex_lock(&syscall_trace_lock);
+	if (!sys_prof_refcount_enter)
+		ret = register_trace_syscall_enter(prof_syscall_enter);
+	if (ret) {
+		pr_info("event trace: Could not activate "
+				"syscall entry trace point\n");
+	} else {
+		set_bit(num, enabled_prof_enter_syscalls);
+		sys_prof_refcount_enter++;
+	}
+	mutex_unlock(&syscall_trace_lock);
+	return ret;
+}
+
+void unreg_prof_syscall_enter(char *name)
+{
+	int num;
+
+	num = syscall_name_to_nr(name);
+	if (num < 0 || num >= FTRACE_SYSCALL_MAX)
+		return;
+
+	mutex_lock(&syscall_trace_lock);
+	sys_prof_refcount_enter--;
+	clear_bit(num, enabled_prof_enter_syscalls);
+	if (!sys_prof_refcount_enter)
+		unregister_trace_syscall_enter(prof_syscall_enter);
+	mutex_unlock(&syscall_trace_lock);
+}
+
+static void prof_syscall_exit(struct pt_regs *regs, long ret)
+{
+	struct syscall_metadata *sys_data;
+	int syscall_nr;
+
+	syscall_nr = syscall_get_nr(current, regs);
+	if (!test_bit(syscall_nr, enabled_prof_exit_syscalls))
+		return;
+
+	sys_data = syscall_nr_to_meta(syscall_nr);
+	if (!sys_data)
+		return;
+
+	perf_tpcounter_event(sys_data->exit_id, 0, 1, NULL, 0);
+}
+
+int reg_prof_syscall_exit(char *name)
+{
+	int ret = 0;
+	int num;
+
+	num = syscall_name_to_nr(name);
+	if (num < 0 || num >= FTRACE_SYSCALL_MAX)
+		return -ENOSYS;
+
+	mutex_lock(&syscall_trace_lock);
+	if (!sys_prof_refcount_exit)
+		ret = register_trace_syscall_exit(prof_syscall_exit);
+	if (ret) {
+		pr_info("event trace: Could not activate "
+				"syscall exit trace point\n");
+	} else {
+		set_bit(num, enabled_prof_exit_syscalls);
+		sys_prof_refcount_exit++;
+	}
+	mutex_unlock(&syscall_trace_lock);
+	return ret;
+}
+
+void unreg_prof_syscall_exit(char *name)
+{
+	int num;
+
+	num = syscall_name_to_nr(name);
+	if (num < 0 || num >= FTRACE_SYSCALL_MAX)
+		return;
+
+	mutex_lock(&syscall_trace_lock);
+	sys_prof_refcount_exit--;
+	clear_bit(num, enabled_prof_exit_syscalls);
+	if (!sys_prof_refcount_exit)
+		unregister_trace_syscall_exit(prof_syscall_exit);
+	mutex_unlock(&syscall_trace_lock);
+}
+
+#endif
+
+
-- 
1.6.2.5



* [PATCH 11/12] add more namespace area to 'perf list' output
  2009-08-10 20:52 [PATCH 00/12] add syscall tracepoints V3 Jason Baron
                   ` (9 preceding siblings ...)
  2009-08-10 20:53 ` [PATCH 10/12] add perf counter support Jason Baron
@ 2009-08-10 20:53 ` Jason Baron
  2009-08-10 20:53 ` [PATCH 12/12] convert x86_64 mmap and uname to use DEFINE_SYSCALL Jason Baron
  2009-08-25 12:31 ` [PATCH 00/12] add syscall tracepoints V3 - s390 arch update Hendrik Brueckner
  12 siblings, 0 replies; 88+ messages in thread
From: Jason Baron @ 2009-08-10 20:53 UTC (permalink / raw)
  To: linux-kernel
  Cc: fweisbec, mingo, laijs, rostedt, peterz, mathieu.desnoyers,
	jiayingz, mbligh, lizf

The new syscall tracepoints can be too long for the 'perf list' output. Add
a few more characters.
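
To see why the wider field matters, here is a small sketch (not perf code) that formats one line the way the diff below does and reports the column of the opening bracket. The helper name and the "Tracepoint event" label are illustrative assumptions.

```c
#include <stdio.h>
#include <string.h>

/*
 * Format one 'perf list'-style line into buf and return the offset of
 * the '[' that starts the type description, or -1 if it is missing.
 */
int demo_bracket_column(char *buf, size_t len, int width, const char *name)
{
	const char *p;

	snprintf(buf, len, "  %-*s [%s]", width, name, "Tracepoint event");
	p = strchr(buf, '[');
	return p ? (int)(p - buf) : -1;
}
```

With a 41-character name such as syscalls:sys_enter_sched_get_priority_max, a width of 40 pushes the bracket one column to the right of shorter entries, while a width of 42 keeps it aligned.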

Signed-off-by: Jason Baron <jbaron@redhat.com>

---
 tools/perf/util/parse-events.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index 4858d83..a5d661b 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -606,7 +606,7 @@ static void print_tracepoint_events(void)
 								evt_path, st) {
 			snprintf(evt_path, MAXPATHLEN, "%s:%s",
 				 sys_dirent.d_name, evt_dirent.d_name);
-			fprintf(stderr, "  %-40s [%s]\n", evt_path,
+			fprintf(stderr, "  %-42s [%s]\n", evt_path,
 				event_type_descriptors[PERF_TYPE_TRACEPOINT+1]);
 		}
 		closedir(evt_dir);
@@ -640,7 +640,7 @@ void print_events(void)
 			sprintf(name, "%s OR %s", syms->symbol, syms->alias);
 		else
 			strcpy(name, syms->symbol);
-		fprintf(stderr, "  %-40s [%s]\n", name,
+		fprintf(stderr, "  %-42s [%s]\n", name,
 			event_type_descriptors[type]);
 
 		prev_type = type;
@@ -654,7 +654,7 @@ void print_events(void)
 				continue;
 
 			for (i = 0; i < PERF_COUNT_HW_CACHE_RESULT_MAX; i++) {
-				fprintf(stderr, "  %-40s [%s]\n",
+				fprintf(stderr, "  %-42s [%s]\n",
 					event_cache_name(type, op, i),
 					event_type_descriptors[4]);
 			}
@@ -662,7 +662,7 @@ void print_events(void)
 	}
 
 	fprintf(stderr, "\n");
-	fprintf(stderr, "  %-40s [raw hardware event descriptor]\n",
+	fprintf(stderr, "  %-42s [raw hardware event descriptor]\n",
 		"rNNN");
 	fprintf(stderr, "\n");
 
-- 
1.6.2.5



* [PATCH 12/12] convert x86_64 mmap and uname to use DEFINE_SYSCALL
  2009-08-10 20:52 [PATCH 00/12] add syscall tracepoints V3 Jason Baron
                   ` (10 preceding siblings ...)
  2009-08-10 20:53 ` [PATCH 11/12] add more namespace area to 'perf list' output Jason Baron
@ 2009-08-10 20:53 ` Jason Baron
  2009-08-25 12:31 ` [PATCH 00/12] add syscall tracepoints V3 - s390 arch update Hendrik Brueckner
  12 siblings, 0 replies; 88+ messages in thread
From: Jason Baron @ 2009-08-10 20:53 UTC (permalink / raw)
  To: linux-kernel
  Cc: fweisbec, mingo, laijs, rostedt, peterz, mathieu.desnoyers,
	jiayingz, mbligh, lizf

A number of syscalls are not using the SYSCALL_DEFINE macros. I'm not sure why.
Convert x86_64 uname and mmap to use SYSCALL_DEFINE.
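
The conversion matters because, in this series, the per-syscall metadata and trace events are emitted by the defining macro, so a syscall declared with a plain prototype gets none. A simplified sketch of that mechanism, assuming a hypothetical one-argument macro (`DEMO_SYSCALL_DEFINE1` and the `demo_` names are illustrative and much simpler than the real macros):

```c
struct demo_meta {
	const char *name;
	int nb_args;
};

/*
 * Emit a metadata record next to the function definition, so that
 * defining the syscall and describing it cannot get out of sync.
 */
#define DEMO_SYSCALL_DEFINE1(sname, type, arg)			\
	static const struct demo_meta demo_meta_##sname = {	\
		.name = "sys_" #sname,				\
		.nb_args = 1,					\
	};							\
	long demo_sys_##sname(type arg)

DEMO_SYSCALL_DEFINE1(double_it, long, x)
{
	return 2 * x;
}

/* Accessor so callers can look at the generated record. */
const struct demo_meta *demo_meta_of_double_it(void)
{
	return &demo_meta_double_it;
}
```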

Signed-off-by: Jason Baron <jbaron@redhat.com>

---
 arch/x86/kernel/sys_x86_64.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
index 6bc211a..45e00eb 100644
--- a/arch/x86/kernel/sys_x86_64.c
+++ b/arch/x86/kernel/sys_x86_64.c
@@ -18,9 +18,9 @@
 #include <asm/ia32.h>
 #include <asm/syscalls.h>
 
-asmlinkage long sys_mmap(unsigned long addr, unsigned long len,
-		unsigned long prot, unsigned long flags,
-		unsigned long fd, unsigned long off)
+SYSCALL_DEFINE6(mmap, unsigned long, addr, unsigned long, len,
+		unsigned long, prot, unsigned long, flags,
+		unsigned long, fd, unsigned long, off)
 {
 	long error;
 	struct file *file;
@@ -226,7 +226,7 @@ bottomup:
 }
 
 
-asmlinkage long sys_uname(struct new_utsname __user *name)
+SYSCALL_DEFINE1(uname, struct new_utsname __user *, name)
 {
 	int err;
 	down_read(&uts_sem);
-- 
1.6.2.5



* Re: [PATCH 07/12] add ftrace_event_call void * 'data' field
  2009-08-10 20:52 ` [PATCH 07/12] add ftrace_event_call void * 'data' field Jason Baron
@ 2009-08-11 10:09   ` Frederic Weisbecker
  2009-08-17 22:19     ` Steven Rostedt
  0 siblings, 1 reply; 88+ messages in thread
From: Frederic Weisbecker @ 2009-08-11 10:09 UTC (permalink / raw)
  To: Jason Baron
  Cc: linux-kernel, mingo, laijs, rostedt, peterz, mathieu.desnoyers,
	jiayingz, mbligh, lizf

On Mon, Aug 10, 2009 at 04:52:44PM -0400, Jason Baron wrote:
> Add an optional 'void *' data pointer to 'ftrace_event_call' that is
> passed to regfunc and unregfunc.
> 
> Signed-off-by: Jason Baron <jbaron@redhat.com>
> 
> ---
>  include/linux/ftrace_event.h |    5 +++--
>  include/trace/ftrace.h       |    4 ++--
>  kernel/trace/trace_events.c  |    4 ++--
>  3 files changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
> index ac8c6f8..8544f12 100644
> --- a/include/linux/ftrace_event.h
> +++ b/include/linux/ftrace_event.h
> @@ -112,8 +112,8 @@ struct ftrace_event_call {
>  	struct dentry		*dir;
>  	struct trace_event	*event;
>  	int			enabled;
> -	int			(*regfunc)(void);
> -	void			(*unregfunc)(void);
> +	int			(*regfunc)(void *);
> +	void			(*unregfunc)(void *);
>  	int			id;
>  	int			(*raw_init)(void);
>  	int			(*show_format)(struct trace_seq *s);
> @@ -122,6 +122,7 @@ struct ftrace_event_call {
>  	int			filter_active;
>  	struct event_filter	*filter;
>  	void			*mod;
> +	void			*data;
>  
>  	atomic_t		profile_count;
>  	int			(*profile_enable)(struct ftrace_event_call *);
> diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h
> index 80e5f6c..a0de384 100644
> --- a/include/trace/ftrace.h
> +++ b/include/trace/ftrace.h
> @@ -568,7 +568,7 @@ static void ftrace_raw_event_##call(proto)				\
>  		trace_nowake_buffer_unlock_commit(event, irq_flags, pc); \
>  }									\
>  									\
> -static int ftrace_raw_reg_event_##call(void)				\
> +static int ftrace_raw_reg_event_##call(void *ptr)			\


Shouldn't it have a __used attribute here, or something?


>  {									\
>  	int ret;							\
>  									\
> @@ -579,7 +579,7 @@ static int ftrace_raw_reg_event_##call(void)				\
>  	return ret;							\
>  }									\
>  									\
> -static void ftrace_raw_unreg_event_##call(void)				\
> +static void ftrace_raw_unreg_event_##call(void *ptr)			\



Same here.



>  {									\
>  	unregister_trace_##call(ftrace_raw_event_##call);		\
>  }									\
> diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
> index f95f847..1d289e2 100644
> --- a/kernel/trace/trace_events.c
> +++ b/kernel/trace/trace_events.c
> @@ -86,14 +86,14 @@ static void ftrace_event_enable_disable(struct ftrace_event_call *call,
>  		if (call->enabled) {
>  			call->enabled = 0;
>  			tracing_stop_cmdline_record();
> -			call->unregfunc();
> +			call->unregfunc(call->data);
>  		}
>  		break;
>  	case 1:
>  		if (!call->enabled) {
>  			call->enabled = 1;
>  			tracing_start_cmdline_record();
> -			call->regfunc();
> +			call->regfunc(call->data);
>  		}
>  		break;
>  	}
> -- 
> 1.6.2.5
> 



* Re: [PATCH 08/12] add trace events for each syscall entry/exit
  2009-08-10 20:52 ` [PATCH 08/12] add trace events for each syscall entry/exit Jason Baron
@ 2009-08-11 10:50   ` Frederic Weisbecker
  2009-08-11 11:45     ` Ingo Molnar
  2009-08-25 12:50   ` Hendrik Brueckner
  1 sibling, 1 reply; 88+ messages in thread
From: Frederic Weisbecker @ 2009-08-11 10:50 UTC (permalink / raw)
  To: Jason Baron
  Cc: linux-kernel, mingo, laijs, rostedt, peterz, mathieu.desnoyers,
	jiayingz, mbligh, lizf

On Mon, Aug 10, 2009 at 04:52:47PM -0400, Jason Baron wrote:
> Layer Frederic's syscall tracer on tracepoints. We create trace events via
> hooking into the SYSCALL_DEFINE macros. This allows us to individually toggle
> syscall entry and exit points on/off.
> 
> Signed-off-by: Jason Baron <jbaron@redhat.com>
> 
> ---
>  include/linux/syscalls.h      |   61 +++++++++++++-
>  include/trace/syscall.h       |   18 ++--
>  kernel/trace/trace_syscalls.c |  183 ++++++++++++++++++++---------------------
>  3 files changed, 159 insertions(+), 103 deletions(-)
> 
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 80de700..5e5b4d3 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -64,6 +64,7 @@ struct perf_counter_attr;
>  #include <linux/sem.h>
>  #include <asm/siginfo.h>
>  #include <asm/signal.h>
> +#include <linux/unistd.h>
>  #include <linux/quota.h>
>  #include <linux/key.h>
>  #include <trace/syscall.h>
> @@ -112,6 +113,59 @@ struct perf_counter_attr;
>  #define __SC_STR_TDECL5(t, a, ...)	#t, __SC_STR_TDECL4(__VA_ARGS__)
>  #define __SC_STR_TDECL6(t, a, ...)	#t, __SC_STR_TDECL5(__VA_ARGS__)
>  
> +
> +#define SYSCALL_TRACE_ENTER_EVENT(sname)				\
> +	static struct ftrace_event_call event_enter_##sname;		\
> +	static int init_enter_##sname(void)				\
> +	{								\
> +		int num;						\
> +		num = syscall_name_to_nr("sys"#sname);			\
> +		if (num < 0)						\
> +			return -ENOSYS;					\
> +		register_ftrace_event(&event_syscall_enter);		\
> +		INIT_LIST_HEAD(&event_enter_##sname.fields);		\
> +		init_preds(&event_enter_##sname);			\
> +		return 0;						\
> +	}								\



Rather than generating one init function per syscall, you could pass the
struct ftrace_event_call *event as a parameter to a single generic function,
static int init_enter_syscall(). That would require adding this parameter to
the raw_init() callback. If I remember correctly, Masami does that in his
kprobes events patchset.

Maybe we can leave it as is and update it once his patchset has been
integrated.
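
A user-space sketch of what such a generic init could look like, assuming raw_init() were changed to receive the event. The `demo_` struct, the two-entry syscall table and the function names are hypothetical and only illustrate the shape of the suggestion:

```c
#include <stddef.h>
#include <string.h>

struct demo_event_call {
	const char *name;
	const char *syscall_name;	/* plays the role of ->data */
	int initialized;
	int (*raw_init)(struct demo_event_call *call);
};

static const char *demo_syscalls[] = { "sys_read", "sys_write" };

/* Linear name lookup, in the spirit of syscall_name_to_nr(). */
int demo_name_to_nr(const char *name)
{
	size_t i;

	for (i = 0; i < sizeof(demo_syscalls) / sizeof(demo_syscalls[0]); i++)
		if (!strcmp(demo_syscalls[i], name))
			return (int)i;
	return -1;
}

/* One init function shared by every enter event. */
int demo_init_enter_syscall(struct demo_event_call *call)
{
	if (demo_name_to_nr(call->syscall_name) < 0)
		return -1;
	call->initialized = 1;
	return 0;
}
```

Every event would then point its raw_init at the same function instead of a macro-generated init_enter_##sname().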


> +	static struct ftrace_event_call __used				\
> +	  __attribute__((__aligned__(4)))				\
> +	  __attribute__((section("_ftrace_events")))			\
> +	  event_enter_##sname = {					\
> +		.name                   = "sys_enter"#sname,		\
> +		.system                 = "syscalls",			\
> +		.event                  = &event_syscall_enter,		\
> +		.raw_init		= init_enter_##sname,		\
> +		.regfunc		= reg_event_syscall_enter,	\
> +		.unregfunc		= unreg_event_syscall_enter,	\
> +		.data			= "sys"#sname,			\
> +	}
> +
> +#define SYSCALL_TRACE_EXIT_EVENT(sname)					\
> +	static struct ftrace_event_call event_exit_##sname;		\
> +	static int init_exit_##sname(void)				\
> +	{								\
> +		int num;						\
> +		num = syscall_name_to_nr("sys"#sname);			\
> +		if (num < 0)						\
> +			return -ENOSYS;					\
> +		register_ftrace_event(&event_syscall_exit);		\
> +		INIT_LIST_HEAD(&event_exit_##sname.fields);		\
> +		init_preds(&event_exit_##sname);			\
> +		return 0;						\
> +	}								\
> +	static struct ftrace_event_call __used				\
> +	  __attribute__((__aligned__(4)))				\
> +	  __attribute__((section("_ftrace_events")))			\
> +	  event_exit_##sname = {					\
> +		.name                   = "sys_exit"#sname,		\
> +		.system                 = "syscalls",			\
> +		.event                  = &event_syscall_exit,		\
> +		.raw_init		= init_exit_##sname,		\
> +		.regfunc		= reg_event_syscall_exit,	\
> +		.unregfunc		= unreg_event_syscall_exit,	\
> +		.data			= "sys"#sname,			\
> +	}
> +
>  #define SYSCALL_METADATA(sname, nb)				\
>  	static const struct syscall_metadata __used		\
>  	  __attribute__((__aligned__(4)))			\
> @@ -121,7 +175,9 @@ struct perf_counter_attr;
>  		.nb_args 	= nb,				\
>  		.types		= types_##sname,		\
>  		.args		= args_##sname,			\
> -	}
> +	};							\
> +	SYSCALL_TRACE_ENTER_EVENT(sname);			\
> +	SYSCALL_TRACE_EXIT_EVENT(sname);
>  
>  #define SYSCALL_DEFINE0(sname)					\
>  	static const struct syscall_metadata __used		\
> @@ -131,8 +187,9 @@ struct perf_counter_attr;
>  		.name 		= "sys_"#sname,			\
>  		.nb_args 	= 0,				\
>  	};							\
> +	SYSCALL_TRACE_ENTER_EVENT(_##sname);			\
> +	SYSCALL_TRACE_EXIT_EVENT(_##sname);			\
>  	asmlinkage long sys_##sname(void)
> -
>  #else
>  #define SYSCALL_DEFINE0(name)	   asmlinkage long sys_##name(void)
>  #endif
> diff --git a/include/trace/syscall.h b/include/trace/syscall.h
> index 3951d77..73fb8b4 100644
> --- a/include/trace/syscall.h
> +++ b/include/trace/syscall.h
> @@ -2,6 +2,8 @@
>  #define _TRACE_SYSCALL_H
>  
>  #include <linux/tracepoint.h>
> +#include <linux/unistd.h>
> +#include <linux/ftrace_event.h>
>  
>  #include <asm/ptrace.h>
>  
> @@ -40,15 +42,13 @@ struct syscall_metadata {
>  
>  #ifdef CONFIG_FTRACE_SYSCALLS
>  extern struct syscall_metadata *syscall_nr_to_meta(int nr);
> -extern void start_ftrace_syscalls(void);
> -extern void stop_ftrace_syscalls(void);
> -extern void ftrace_syscall_enter(struct pt_regs *regs);
> -extern void ftrace_syscall_exit(struct pt_regs *regs);
> -#else
> -static inline void start_ftrace_syscalls(void)			{ }
> -static inline void stop_ftrace_syscalls(void)			{ }
> -static inline void ftrace_syscall_enter(struct pt_regs *regs)	{ }
> -static inline void ftrace_syscall_exit(struct pt_regs *regs)	{ }
> +extern int syscall_name_to_nr(char *name);
> +extern struct trace_event event_syscall_enter;
> +extern struct trace_event event_syscall_exit;
> +extern int reg_event_syscall_enter(void *ptr);
> +extern void unreg_event_syscall_enter(void *ptr);
> +extern int reg_event_syscall_exit(void *ptr);
> +extern void unreg_event_syscall_exit(void *ptr);
>  #endif
>  
>  #endif /* _TRACE_SYSCALL_H */
> diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
> index 08aed43..c7ae25e 100644
> --- a/kernel/trace/trace_syscalls.c
> +++ b/kernel/trace/trace_syscalls.c
> @@ -1,15 +1,16 @@
>  #include <trace/syscall.h>
>  #include <linux/kernel.h>
> +#include <linux/ftrace.h>
>  #include <asm/syscall.h>
>  
>  #include "trace_output.h"
>  #include "trace.h"
>  
> -/* Keep a counter of the syscall tracing users */
> -static int refcount;
> -
> -/* Prevent from races on thread flags toggling */
>  static DEFINE_MUTEX(syscall_trace_lock);
> +static int sys_refcount_enter;
> +static int sys_refcount_exit;
> +static DECLARE_BITMAP(enabled_enter_syscalls, FTRACE_SYSCALL_MAX);
> +static DECLARE_BITMAP(enabled_exit_syscalls, FTRACE_SYSCALL_MAX);
>  
>  /* Option to display the parameters types */
>  enum {
> @@ -95,53 +96,7 @@ print_syscall_exit(struct trace_iterator *iter, int flags)
>  	return TRACE_TYPE_HANDLED;
>  }
>  
> -void start_ftrace_syscalls(void)
> -{
> -	unsigned long flags;
> -	struct task_struct *g, *t;
> -
> -	mutex_lock(&syscall_trace_lock);
> -
> -	/* Don't enable the flag on the tasks twice */
> -	if (++refcount != 1)
> -		goto unlock;
> -
> -	read_lock_irqsave(&tasklist_lock, flags);
> -
> -	do_each_thread(g, t) {
> -		set_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
> -	} while_each_thread(g, t);
> -
> -	read_unlock_irqrestore(&tasklist_lock, flags);
> -
> -unlock:
> -	mutex_unlock(&syscall_trace_lock);
> -}
> -
> -void stop_ftrace_syscalls(void)
> -{
> -	unsigned long flags;
> -	struct task_struct *g, *t;
> -
> -	mutex_lock(&syscall_trace_lock);
> -
> -	/* There are perhaps still some users */
> -	if (--refcount)
> -		goto unlock;
> -
> -	read_lock_irqsave(&tasklist_lock, flags);
> -
> -	do_each_thread(g, t) {
> -		clear_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
> -	} while_each_thread(g, t);
> -
> -	read_unlock_irqrestore(&tasklist_lock, flags);
> -
> -unlock:
> -	mutex_unlock(&syscall_trace_lock);
> -}
> -
> -void ftrace_syscall_enter(struct pt_regs *regs)
> +void ftrace_syscall_enter(struct pt_regs *regs, long id)
>  {
>  	struct syscall_trace_enter *entry;
>  	struct syscall_metadata *sys_data;
> @@ -150,6 +105,8 @@ void ftrace_syscall_enter(struct pt_regs *regs)
>  	int syscall_nr;
>  
>  	syscall_nr = syscall_get_nr(current, regs);
> +	if (!test_bit(syscall_nr, enabled_enter_syscalls))
> +		return;
>  
>  	sys_data = syscall_nr_to_meta(syscall_nr);
>  	if (!sys_data)
> @@ -170,7 +127,7 @@ void ftrace_syscall_enter(struct pt_regs *regs)
>  	trace_wake_up();
>  }
>  
> -void ftrace_syscall_exit(struct pt_regs *regs)
> +void ftrace_syscall_exit(struct pt_regs *regs, long ret)
>  {
>  	struct syscall_trace_exit *entry;
>  	struct syscall_metadata *sys_data;
> @@ -178,6 +135,8 @@ void ftrace_syscall_exit(struct pt_regs *regs)
>  	int syscall_nr;
>  
>  	syscall_nr = syscall_get_nr(current, regs);
> +	if (!test_bit(syscall_nr, enabled_exit_syscalls))
> +		return;
>  
>  	sys_data = syscall_nr_to_meta(syscall_nr);
>  	if (!sys_data)
> @@ -196,54 +155,94 @@ void ftrace_syscall_exit(struct pt_regs *regs)
>  	trace_wake_up();
>  }
>  
> -static int init_syscall_tracer(struct trace_array *tr)
> +int reg_event_syscall_enter(void *ptr)
>  {
> -	start_ftrace_syscalls();
> -
> -	return 0;
> +	int ret = 0;
> +	int num;
> +	char *name;
> +
> +	name = (char *)ptr;
> +	num = syscall_name_to_nr(name);
> +	if (num < 0 || num >= FTRACE_SYSCALL_MAX)



I wonder if we should WARN once in this case. At least we would
be aware of new, not-yet-supported syscalls.
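
A rough userspace sketch of the warn-once idea, for illustration only. In the
kernel this would presumably be a WARN_ONCE() or WARN_ON_ONCE() on the failing
branch; check_syscall_nr(), SKETCH_SYSCALL_MAX and warnings_emitted below are
hypothetical names and are not part of the patch.

```c
#include <errno.h>
#include <stdio.h>

#define SKETCH_SYSCALL_MAX 337	/* hypothetical bound, same as the 32-bit value in patch 05/12 */

static int warnings_emitted;	/* hypothetical counter so the behaviour is observable */

/* Reject out-of-range syscall numbers, complaining only the first time. */
static int check_syscall_nr(int num)
{
	static int warned;

	if (num < 0 || num >= SKETCH_SYSCALL_MAX) {
		if (!warned) {
			warned = 1;
			warnings_emitted++;
			fprintf(stderr, "unsupported syscall nr %d\n", num);
		}
		return -ENOSYS;
	}
	return 0;
}
```

Every failing call still returns -ENOSYS; only the message is limited to one
occurrence, so the log is not flooded.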



> +		return -ENOSYS;
> +	mutex_lock(&syscall_trace_lock);
> +	if (!sys_refcount_enter)
> +		ret = register_trace_syscall_enter(ftrace_syscall_enter);
> +	if (ret) {
> +		pr_info("event trace: Could not activate "
> +				"syscall entry trace point\n");
> +	} else {
> +		set_bit(num, enabled_enter_syscalls);
> +		sys_refcount_enter++;
> +	}
> +	mutex_unlock(&syscall_trace_lock);
> +	return ret;
>  }
>  
> -static void reset_syscall_tracer(struct trace_array *tr)
> +void unreg_event_syscall_enter(void *ptr)
>  {
> -	stop_ftrace_syscalls();
> -	tracing_reset_online_cpus(tr);
> -}
> -
> -static struct trace_event syscall_enter_event = {
> -	.type	 	= TRACE_SYSCALL_ENTER,
> -	.trace		= print_syscall_enter,
> -};
> -
> -static struct trace_event syscall_exit_event = {
> -	.type	 	= TRACE_SYSCALL_EXIT,
> -	.trace		= print_syscall_exit,
> -};
> +	int num;
> +	char *name;
>  
> -static struct tracer syscall_tracer __read_mostly = {
> -	.name	     	= "syscall",
> -	.init		= init_syscall_tracer,
> -	.reset		= reset_syscall_tracer,
> -	.flags		= &syscalls_flags,
> -};
> +	name = (char *)ptr;
> +	num = syscall_name_to_nr(name);
> +	if (num < 0 || num >= FTRACE_SYSCALL_MAX)
> +		return;
> +	mutex_lock(&syscall_trace_lock);
> +	sys_refcount_enter--;
> +	clear_bit(num, enabled_enter_syscalls);
> +	if (!sys_refcount_enter)
> +		unregister_trace_syscall_enter(ftrace_syscall_enter);
> +	mutex_unlock(&syscall_trace_lock);
> +}
>  
> -__init int register_ftrace_syscalls(void)
> +int reg_event_syscall_exit(void *ptr)
>  {
> -	int ret;
> -
> -	ret = register_ftrace_event(&syscall_enter_event);
> -	if (!ret) {
> -		printk(KERN_WARNING "event %d failed to register\n",
> -		       syscall_enter_event.type);
> -		WARN_ON_ONCE(1);
> +	int ret = 0;
> +	int num;
> +	char *name;
> +
> +	name = (char *)ptr;
> +	num = syscall_name_to_nr(name);
> +	if (num < 0 || num >= FTRACE_SYSCALL_MAX)
> +		return -ENOSYS;
> +	mutex_lock(&syscall_trace_lock);
> +	if (!sys_refcount_exit)
> +		ret = register_trace_syscall_exit(ftrace_syscall_exit);
> +	if (ret) {
> +		pr_info("event trace: Could not activate "
> +				"syscall exit trace point\n");
> +	} else {
> +		set_bit(num, enabled_exit_syscalls);
> +		sys_refcount_exit++;
>  	}
> +	mutex_unlock(&syscall_trace_lock);
> +	return ret;
> +}
>  
> -	ret = register_ftrace_event(&syscall_exit_event);
> -	if (!ret) {
> -		printk(KERN_WARNING "event %d failed to register\n",
> -		       syscall_exit_event.type);
> -		WARN_ON_ONCE(1);
> -	}
> +void unreg_event_syscall_exit(void *ptr)
> +{
> +	int num;
> +	char *name;
>  
> -	return register_tracer(&syscall_tracer);
> +	name = (char *)ptr;
> +	num = syscall_name_to_nr(name);
> +	if (num < 0 || num >= FTRACE_SYSCALL_MAX)
> +		return;
> +	mutex_lock(&syscall_trace_lock);
> +	sys_refcount_exit--;
> +	clear_bit(num, enabled_exit_syscalls);
> +	if (!sys_refcount_exit)
> +		unregister_trace_syscall_exit(ftrace_syscall_exit);
> +	mutex_unlock(&syscall_trace_lock);
>  }
> -device_initcall(register_ftrace_syscalls);
> +
> +struct trace_event event_syscall_enter = {
> +	.trace			= print_syscall_enter,
> +	.type			= TRACE_SYSCALL_ENTER
> +};
> +
> +struct trace_event event_syscall_exit = {
> +	.trace			= print_syscall_exit,
> +	.type			= TRACE_SYSCALL_EXIT
> +};
> -- 
> 1.6.2.5
> 

Nice.

It's a pity that enter and exit have to be kept so separate
when their callbacks are pretty much the same.

But I guess if we want to decouple the two cleanly, we don't have a choice.
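
To illustrate how similar the two sides are, here is a minimal userspace sketch
in which enter and exit share one helper and differ only in the state they
pass in. It is an assumption about how the duplication could be folded, not
what the patch does: struct syscall_dir_state and the *_syscall_dir() helpers
are hypothetical, locking is left out, and probe_attached merely stands in for
the register_trace_syscall_*() calls.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_SYSCALL_MAX 337	/* hypothetical bound */
#define SKETCH_WORD_BITS (sizeof(unsigned long) * CHAR_BIT)
#define SKETCH_WORDS ((SKETCH_SYSCALL_MAX + SKETCH_WORD_BITS - 1) / SKETCH_WORD_BITS)

/* One instance per direction: a user count plus a per-syscall bitmap. */
struct syscall_dir_state {
	int refcount;
	bool probe_attached;	/* stands in for the tracepoint probe */
	unsigned long enabled[SKETCH_WORDS];
};

static struct syscall_dir_state enter_state, exit_state;

static int reg_syscall_dir(struct syscall_dir_state *st, int num)
{
	size_t n = (size_t)num;

	if (num < 0 || num >= SKETCH_SYSCALL_MAX)
		return -1;
	if (!st->refcount)
		st->probe_attached = true;	/* first user attaches the probe */
	st->enabled[n / SKETCH_WORD_BITS] |= 1UL << (n % SKETCH_WORD_BITS);
	st->refcount++;
	return 0;
}

static void unreg_syscall_dir(struct syscall_dir_state *st, int num)
{
	size_t n = (size_t)num;

	if (num < 0 || num >= SKETCH_SYSCALL_MAX)
		return;
	st->refcount--;
	st->enabled[n / SKETCH_WORD_BITS] &= ~(1UL << (n % SKETCH_WORD_BITS));
	if (!st->refcount)
		st->probe_attached = false;	/* last user detaches the probe */
}

static bool syscall_dir_enabled(const struct syscall_dir_state *st, int num)
{
	size_t n = (size_t)num;

	if (num < 0 || num >= SKETCH_SYSCALL_MAX)
		return false;
	return st->enabled[n / SKETCH_WORD_BITS] & (1UL << (n % SKETCH_WORD_BITS));
}
```

Calling reg_syscall_dir(&enter_state, nr) leaves exit_state untouched, so the
two directions stay decoupled even though the code is shared.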



* Re: [PATCH 05/12] update FTRACE_SYSCALL_MAX
  2009-08-10 20:52 ` [PATCH 05/12] update FTRACE_SYSCALL_MAX Jason Baron
@ 2009-08-11 11:00   ` Frederic Weisbecker
  2009-08-11 19:39     ` Matt Fleming
  2009-08-24 13:41     ` Paul Mundt
  0 siblings, 2 replies; 88+ messages in thread
From: Frederic Weisbecker @ 2009-08-11 11:00 UTC (permalink / raw)
  To: Jason Baron
  Cc: linux-kernel, mingo, laijs, rostedt, peterz, mathieu.desnoyers,
	jiayingz, mbligh, lizf

On Mon, Aug 10, 2009 at 04:52:35PM -0400, Jason Baron wrote:
> update FTRACE_SYSCALL_MAX to the current number of syscalls
> 
> Signed-off-by: Jason Baron <jbaron@redhat.com>
> 
> ---
>  arch/x86/include/asm/ftrace.h |    4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/include/asm/ftrace.h b/arch/x86/include/asm/ftrace.h
> index bd2c651..7113654 100644
> --- a/arch/x86/include/asm/ftrace.h
> +++ b/arch/x86/include/asm/ftrace.h
> @@ -30,9 +30,9 @@
>  
>  /* FIXME: I don't want to stay hardcoded */
>  #ifdef CONFIG_X86_64
> -# define FTRACE_SYSCALL_MAX     296
> +# define FTRACE_SYSCALL_MAX     299
>  #else
> -# define FTRACE_SYSCALL_MAX     333
> +# define FTRACE_SYSCALL_MAX     337
>  #endif


I don't remember why we had to use a hardcoded number.
Is there no way to stay in sync with the current number of
syscalls? We want to avoid patching the kernel each time a
new syscall is added :-)
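
One generic way to avoid a hardcoded count is to derive it from the table
itself at compile time. The sketch below is plain userspace C with a made-up
three-entry table; whether the real x86 syscall table can be sized like this
is not established in this thread, and all sketch_* names are hypothetical.

```c
#include <stddef.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

typedef long (*sketch_syscall_fn)(void);

/* Stand-ins for real syscall handlers. */
static long sketch_read(void)  { return 0; }
static long sketch_write(void) { return 0; }
static long sketch_open(void)  { return 0; }

static const sketch_syscall_fn sketch_call_table[] = {
	sketch_read,
	sketch_write,
	sketch_open,
};

/* The limit follows the table, so adding an entry needs no second edit. */
#define SKETCH_SYSCALL_MAX ARRAY_SIZE(sketch_call_table)

static size_t sketch_syscall_count(void)
{
	return SKETCH_SYSCALL_MAX;
}
```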

  
>  #ifdef CONFIG_FUNCTION_TRACER
> -- 
> 1.6.2.5
> 



* Re: [PATCH 09/12] add support traceopint ids
  2009-08-10 20:52 ` [PATCH 09/12] add support traceopint ids Jason Baron
@ 2009-08-11 11:28   ` Frederic Weisbecker
  0 siblings, 0 replies; 88+ messages in thread
From: Frederic Weisbecker @ 2009-08-11 11:28 UTC (permalink / raw)
  To: Jason Baron
  Cc: linux-kernel, mingo, laijs, rostedt, peterz, mathieu.desnoyers,
	jiayingz, mbligh, lizf

On Mon, Aug 10, 2009 at 04:52:53PM -0400, Jason Baron wrote:
> This patch associates an id with each syscall trace event, so that we can
> identify each syscall trace event using the 'perf' tool.
> 
> Signed-off-by: Jason Baron <jbaron@redhat.com>
> 
> ---
>  arch/x86/kernel/ftrace.c      |   10 ++++++++++
>  include/linux/syscalls.h      |   22 ++++++++++++++++++----
>  include/trace/syscall.h       |    8 ++++++++
>  kernel/trace/trace.h          |    6 ------
>  kernel/trace/trace_syscalls.c |   26 ++++++++++++++++----------
>  5 files changed, 52 insertions(+), 20 deletions(-)
> 
> diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
> index 0d93d40..3cff121 100644
> --- a/arch/x86/kernel/ftrace.c
> +++ b/arch/x86/kernel/ftrace.c
> @@ -516,6 +516,16 @@ int syscall_name_to_nr(char *name)
>  	return -1;
>  }
>  
> +void set_syscall_enter_id(int num, int id)
> +{
> +	syscalls_metadata[num]->enter_id = id;
> +}
> +
> +void set_syscall_exit_id(int num, int id)
> +{
> +	syscalls_metadata[num]->exit_id = id;
> +}
> +
>  static int __init arch_init_ftrace_syscalls(void)
>  {
>  	int i;
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 5e5b4d3..ce4b01c 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -116,13 +116,20 @@ struct perf_counter_attr;
>  
>  #define SYSCALL_TRACE_ENTER_EVENT(sname)				\
>  	static struct ftrace_event_call event_enter_##sname;		\
> +	struct trace_event enter_syscall_print_##sname = {		\
> +		.trace                  = print_syscall_enter,		\
> +	};								\
>  	static int init_enter_##sname(void)				\
>  	{								\
> -		int num;						\
> +		int num, id;						\
>  		num = syscall_name_to_nr("sys"#sname);			\
>  		if (num < 0)						\
>  			return -ENOSYS;					\
> -		register_ftrace_event(&event_syscall_enter);		\
> +		id = register_ftrace_event(&enter_syscall_print_##sname);\



Kprobes also need a unique id per event despite sharing a single print
callback. Since this issue is not isolated to syscalls, we need a generic
event number generator from ftrace.

IIRC Masami's kprobe patchset brought this. In that case,
we need to remember to fix this on the syscall tracing side once it's merged.
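
As a minimal sketch of what such a generator could look like (an assumption,
not the interface from the kprobe patchset): a counter that hands out the next
free id and returns 0 when the range is exhausted, matching the `if (!id)`
check used in this patch. SKETCH_FIRST_DYNAMIC_ID, SKETCH_MAX_EVENT_ID and
sketch_alloc_event_id() are hypothetical, and locking is omitted.

```c
#define SKETCH_FIRST_DYNAMIC_ID	64	/* hypothetical: ids below are static types */
#define SKETCH_MAX_EVENT_ID	65535	/* hypothetical upper bound */

static int sketch_next_event_id = SKETCH_FIRST_DYNAMIC_ID;

/* Return a fresh event id, or 0 if none are left. */
static int sketch_alloc_event_id(void)
{
	if (sketch_next_event_id > SKETCH_MAX_EVENT_ID)
		return 0;
	return sketch_next_event_id++;
}
```

Any event type that only needs an id, not its own print callback, could then
call the allocator directly instead of registering a dummy trace_event.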




> +		if (!id)						\
> +			return -ENODEV;					\
> +		event_enter_##sname.id = id;				\
> +		set_syscall_enter_id(num, id);				\
>  		INIT_LIST_HEAD(&event_enter_##sname.fields);		\
>  		init_preds(&event_enter_##sname);			\
>  		return 0;						\
> @@ -142,13 +149,20 @@ struct perf_counter_attr;
>  
>  #define SYSCALL_TRACE_EXIT_EVENT(sname)					\
>  	static struct ftrace_event_call event_exit_##sname;		\
> +	struct trace_event exit_syscall_print_##sname = {		\
> +		.trace                  = print_syscall_exit,		\
> +	};								\
>  	static int init_exit_##sname(void)				\
>  	{								\
> -		int num;						\
> +		int num, id;						\
>  		num = syscall_name_to_nr("sys"#sname);			\
>  		if (num < 0)						\
>  			return -ENOSYS;					\
> -		register_ftrace_event(&event_syscall_exit);		\
> +		id = register_ftrace_event(&exit_syscall_print_##sname);\
> +		if (!id)						\
> +			return -ENODEV;					\
> +		event_exit_##sname.id = id;				\
> +		set_syscall_exit_id(num, id);				\
>  		INIT_LIST_HEAD(&event_exit_##sname.fields);		\
>  		init_preds(&event_exit_##sname);			\
>  		return 0;						\
> diff --git a/include/trace/syscall.h b/include/trace/syscall.h
> index 73fb8b4..df62840 100644
> --- a/include/trace/syscall.h
> +++ b/include/trace/syscall.h
> @@ -32,23 +32,31 @@ DECLARE_TRACE_WITH_CALLBACK(syscall_exit,
>   * @nb_args: number of parameters it takes
>   * @types: list of types as strings
>   * @args: list of args as strings (args[i] matches types[i])
> + * @enter_id: associated ftrace enter event id
> + * @exit_id: associated ftrace exit event id
>   */
>  struct syscall_metadata {
>  	const char	*name;
>  	int		nb_args;
>  	const char	**types;
>  	const char	**args;
> +	int		enter_id;
> +	int		exit_id;
>  };
>  
>  #ifdef CONFIG_FTRACE_SYSCALLS
>  extern struct syscall_metadata *syscall_nr_to_meta(int nr);
>  extern int syscall_name_to_nr(char *name);
> +void set_syscall_enter_id(int num, int id);
> +void set_syscall_exit_id(int num, int id);
>  extern struct trace_event event_syscall_enter;
>  extern struct trace_event event_syscall_exit;
>  extern int reg_event_syscall_enter(void *ptr);
>  extern void unreg_event_syscall_enter(void *ptr);
>  extern int reg_event_syscall_exit(void *ptr);
>  extern void unreg_event_syscall_exit(void *ptr);
> +enum print_line_t print_syscall_enter(struct trace_iterator *iter, int flags);
> +enum print_line_t print_syscall_exit(struct trace_iterator *iter, int flags);
>  #endif
>  
>  #endif /* _TRACE_SYSCALL_H */
> diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
> index 44308f3..606073c 100644
> --- a/kernel/trace/trace.h
> +++ b/kernel/trace/trace.h
> @@ -38,8 +38,6 @@ enum trace_type {
>  	TRACE_GRAPH_ENT,
>  	TRACE_USER_STACK,
>  	TRACE_HW_BRANCHES,
> -	TRACE_SYSCALL_ENTER,
> -	TRACE_SYSCALL_EXIT,
>  	TRACE_KMEM_ALLOC,
>  	TRACE_KMEM_FREE,
>  	TRACE_POWER,
> @@ -334,10 +332,6 @@ extern void __ftrace_bad_type(void);
>  			  TRACE_KMEM_ALLOC);	\
>  		IF_ASSIGN(var, ent, struct kmemtrace_free_entry,	\
>  			  TRACE_KMEM_FREE);	\
> -		IF_ASSIGN(var, ent, struct syscall_trace_enter,		\
> -			  TRACE_SYSCALL_ENTER);				\
> -		IF_ASSIGN(var, ent, struct syscall_trace_exit,		\
> -			  TRACE_SYSCALL_EXIT);				\
>  		IF_ASSIGN(var, ent, struct ksym_trace_entry, TRACE_KSYM);\
>  		__ftrace_bad_type();					\
>  	} while (0)
> diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
> index c7ae25e..e58a9c1 100644
> --- a/kernel/trace/trace_syscalls.c
> +++ b/kernel/trace/trace_syscalls.c
> @@ -36,14 +36,18 @@ print_syscall_enter(struct trace_iterator *iter, int flags)
>  	struct syscall_metadata *entry;
>  	int i, ret, syscall;
>  
> -	trace_assign_type(trace, ent);
> -
> +	trace = (typeof(trace))ent;
>  	syscall = trace->nr;
> -
>  	entry = syscall_nr_to_meta(syscall);
> +
>  	if (!entry)
>  		goto end;
>  
> +	if (entry->enter_id != ent->type) {
> +		WARN_ON_ONCE(1);
> +		goto end;
> +	}
> +
>  	ret = trace_seq_printf(s, "%s(", entry->name);
>  	if (!ret)
>  		return TRACE_TYPE_PARTIAL_LINE;
> @@ -78,16 +82,20 @@ print_syscall_exit(struct trace_iterator *iter, int flags)
>  	struct syscall_metadata *entry;
>  	int ret;
>  
> -	trace_assign_type(trace, ent);
> -
> +	trace = (typeof(trace))ent;
>  	syscall = trace->nr;
> -
>  	entry = syscall_nr_to_meta(syscall);
> +
>  	if (!entry) {
>  		trace_seq_printf(s, "\n");
>  		return TRACE_TYPE_HANDLED;
>  	}
>  
> +	if (entry->exit_id != ent->type) {
> +		WARN_ON_ONCE(1);
> +		return TRACE_TYPE_UNHANDLED;
> +	}
> +
>  	ret = trace_seq_printf(s, "%s -> 0x%lx\n", entry->name,
>  				trace->ret);
>  	if (!ret)
> @@ -114,7 +122,7 @@ void ftrace_syscall_enter(struct pt_regs *regs, long id)
>  
>  	size = sizeof(*entry) + sizeof(unsigned long) * sys_data->nb_args;
>  
> -	event = trace_current_buffer_lock_reserve(TRACE_SYSCALL_ENTER, size,
> +	event = trace_current_buffer_lock_reserve(sys_data->enter_id, size,
>  							0, 0);
>  	if (!event)
>  		return;
> @@ -142,7 +150,7 @@ void ftrace_syscall_exit(struct pt_regs *regs, long ret)
>  	if (!sys_data)
>  		return;
>  
> -	event = trace_current_buffer_lock_reserve(TRACE_SYSCALL_EXIT,
> +	event = trace_current_buffer_lock_reserve(sys_data->exit_id,
>  				sizeof(*entry), 0, 0);
>  	if (!event)
>  		return;
> @@ -239,10 +247,8 @@ void unreg_event_syscall_exit(void *ptr)
>  
>  struct trace_event event_syscall_enter = {
>  	.trace			= print_syscall_enter,
> -	.type			= TRACE_SYSCALL_ENTER
>  };
>  
>  struct trace_event event_syscall_exit = {
>  	.trace			= print_syscall_exit,
> -	.type			= TRACE_SYSCALL_EXIT
>  };


Do you still need the two above now that you have defined individual
print callbacks in include/linux/syscalls.h?

> -- 
> 1.6.2.5
> 



* Re: [PATCH 08/12] add trace events for each syscall entry/exit
  2009-08-11 10:50   ` Frederic Weisbecker
@ 2009-08-11 11:45     ` Ingo Molnar
  2009-08-11 12:01       ` Frederic Weisbecker
  0 siblings, 1 reply; 88+ messages in thread
From: Ingo Molnar @ 2009-08-11 11:45 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Jason Baron, linux-kernel, laijs, rostedt, peterz,
	mathieu.desnoyers, jiayingz, mbligh, lizf


* Frederic Weisbecker <fweisbec@gmail.com> wrote:

> > +struct trace_event event_syscall_enter = {
> > +	.trace			= print_syscall_enter,
> > +	.type			= TRACE_SYSCALL_ENTER
> > +};
> > +
> > +struct trace_event event_syscall_exit = {
> > +	.trace			= print_syscall_exit,
> > +	.type			= TRACE_SYSCALL_EXIT
> > +};
> > -- 
> > 1.6.2.5
> > 
> 
> Nice.
> 
> > It's a pity that enter and exit have to be kept so separate
> > when their callbacks are pretty much the same.
> > 
> > But I guess if we want to decouple the two cleanly, we don't have a
> > choice.

Yeah - and enter and exit are different, in terms of state.

One thing that would be nice in the future (as an add-on - this 
patch-set looks useful already) is to allow the sampling of user 
register state as well via these tracepoints. That way we'd have a 
much faster (and completely transparent) implementation of strace in 
essence, with unique features such as system-wide or per cpu 
strace-ing.

	Ingo


* Re: [PATCH 08/12] add trace events for each syscall entry/exit
  2009-08-11 11:45     ` Ingo Molnar
@ 2009-08-11 12:01       ` Frederic Weisbecker
  0 siblings, 0 replies; 88+ messages in thread
From: Frederic Weisbecker @ 2009-08-11 12:01 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jason Baron, linux-kernel, laijs, rostedt, peterz,
	mathieu.desnoyers, jiayingz, mbligh, lizf

On Tue, Aug 11, 2009 at 01:45:12PM +0200, Ingo Molnar wrote:
> 
> * Frederic Weisbecker <fweisbec@gmail.com> wrote:
> 
> > > +struct trace_event event_syscall_enter = {
> > > +	.trace			= print_syscall_enter,
> > > +	.type			= TRACE_SYSCALL_ENTER
> > > +};
> > > +
> > > +struct trace_event event_syscall_exit = {
> > > +	.trace			= print_syscall_exit,
> > > +	.type			= TRACE_SYSCALL_EXIT
> > > +};
> > > -- 
> > > 1.6.2.5
> > > 
> > 
> > Nice.
> > 
> > > It's a pity that enter and exit have to be kept so separate
> > > when their callbacks are pretty much the same.
> > > 
> > > But I guess if we want to decouple the two cleanly, we don't have a
> > > choice.
> 
> Yeah - and enter and exit are different, in terms of state.
> 
> One thing that would be nice in the future (as an add-on - this 
> patch-set looks useful already) is to allow the sampling of user 
> register state as well via these tracepoints. That way we'd have a 
> much faster (and completely transparent) implementation of strace in 
> essence, with unique features such as system-wide or per cpu 
> strace-ing.
> 
> 	Ingo


Indeed, a missing piece is the syscall record sampling with registers.
Actually, IMO the registers themselves are not the right thing to export
to perfcounter. They are too low-level.

What we need are the fetched arguments, because a lot of them are addresses
(even user addresses), which are pretty useless for perf tools.

We can already, and easily, implement that for simple args, like we do for
ftrace. That's pretty trivial.



* Re: [PATCH 10/12] add perf counter support
  2009-08-10 20:53 ` [PATCH 10/12] add perf counter support Jason Baron
@ 2009-08-11 12:12   ` Frederic Weisbecker
  2009-08-11 12:17     ` Ingo Molnar
  0 siblings, 1 reply; 88+ messages in thread
From: Frederic Weisbecker @ 2009-08-11 12:12 UTC (permalink / raw)
  To: Jason Baron
  Cc: linux-kernel, mingo, laijs, rostedt, peterz, mathieu.desnoyers,
	jiayingz, mbligh, lizf

On Mon, Aug 10, 2009 at 04:53:02PM -0400, Jason Baron wrote:
> Make 'perf stat -e syscalls:sys_enter_blah' work with syscall style tracepoints.


It would be nice to also be able to type:

perf stat -e syscalls:blah

and then have both enter/exit counters.
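
A small sketch of that expansion, assuming the tool would simply build both
event names from the bare syscall name. The "sys_enter_"/"sys_exit_" prefixes
follow the event names in this series; the helper names and the static
buffers (which make it non-reentrant) are hypothetical.

```c
#include <stddef.h>
#include <stdio.h>

#define SKETCH_NAME_LEN 64

/* Build "sys_enter_<syscall>"; returns NULL if the name does not fit. */
static const char *sketch_enter_event(const char *syscall)
{
	static char buf[SKETCH_NAME_LEN];
	int n = snprintf(buf, sizeof(buf), "sys_enter_%s", syscall);

	return (n < 0 || (size_t)n >= sizeof(buf)) ? NULL : buf;
}

/* Build "sys_exit_<syscall>"; returns NULL if the name does not fit. */
static const char *sketch_exit_event(const char *syscall)
{
	static char buf[SKETCH_NAME_LEN];
	int n = snprintf(buf, sizeof(buf), "sys_exit_%s", syscall);

	return (n < 0 || (size_t)n >= sizeof(buf)) ? NULL : buf;
}
```

A request for syscalls:blah would then open one counter per returned name.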

Frederic.


> 
> Signed-off-by: Jason Baron <jbaron@redhat.com>
> 
> ---
>  include/linux/perf_counter.h  |    2 +
>  include/linux/syscalls.h      |   52 +++++++++++++++++-
>  include/trace/syscall.h       |    7 +++
>  kernel/trace/trace_syscalls.c |  121 +++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 181 insertions(+), 1 deletions(-)
> 
> diff --git a/include/linux/perf_counter.h b/include/linux/perf_counter.h
> index c484834..aaf0c74 100644
> --- a/include/linux/perf_counter.h
> +++ b/include/linux/perf_counter.h
> @@ -734,6 +734,8 @@ extern int sysctl_perf_counter_mlock;
>  extern int sysctl_perf_counter_sample_rate;
>  
>  extern void perf_counter_init(void);
> +extern void perf_tpcounter_event(int event_id, u64 addr, u64 count,
> +				 void *record, int entry_size);
>  
>  #ifndef perf_misc_flags
>  #define perf_misc_flags(regs)	(user_mode(regs) ? PERF_EVENT_MISC_USER : \
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index ce4b01c..5541e75 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -98,6 +98,53 @@ struct perf_counter_attr;
>  #define __SC_TEST5(t5, a5, ...)	__SC_TEST(t5); __SC_TEST4(__VA_ARGS__)
>  #define __SC_TEST6(t6, a6, ...)	__SC_TEST(t6); __SC_TEST5(__VA_ARGS__)
>  
> +#ifdef CONFIG_EVENT_PROFILE
> +#define TRACE_SYS_ENTER_PROFILE(sname)					       \
> +static int prof_sysenter_enable_##sname(struct ftrace_event_call *event_call)  \
> +{									       \
> +	int ret = 0;							       \
> +	if (!atomic_inc_return(&event_enter_##sname.profile_count))	       \
> +		ret = reg_prof_syscall_enter("sys"#sname);		       \
> +	return ret;							       \
> +}									       \
> +									       \
> +static void prof_sysenter_disable_##sname(struct ftrace_event_call *event_call)\
> +{									       \
> +	if (atomic_add_negative(-1, &event_enter_##sname.profile_count))       \
> +		unreg_prof_syscall_enter("sys"#sname);			       \
> +}
> +
> +#define TRACE_SYS_EXIT_PROFILE(sname)					       \
> +static int prof_sysexit_enable_##sname(struct ftrace_event_call *event_call)   \
> +{									       \
> +	int ret = 0;							       \
> +	if (!atomic_inc_return(&event_exit_##sname.profile_count))	       \
> +		ret = reg_prof_syscall_exit("sys"#sname);		       \
> +	return ret;							       \
> +}									       \
> +									       \
> +static void prof_sysexit_disable_##sname(struct ftrace_event_call *event_call) \
> +{                                                                              \
> +	if (atomic_add_negative(-1, &event_exit_##sname.profile_count))	       \
> +		unreg_prof_syscall_exit("sys"#sname);			       \
> +}
> +
> +#define TRACE_SYS_ENTER_PROFILE_INIT(sname)				       \
> +	.profile_count = ATOMIC_INIT(-1),				       \
> +	.profile_enable = prof_sysenter_enable_##sname,			       \
> +	.profile_disable = prof_sysenter_disable_##sname,
> +
> +#define TRACE_SYS_EXIT_PROFILE_INIT(sname)				       \
> +	.profile_count = ATOMIC_INIT(-1),				       \
> +	.profile_enable = prof_sysexit_enable_##sname,			       \
> +	.profile_disable = prof_sysexit_disable_##sname,
> +#else
> +#define TRACE_SYS_ENTER_PROFILE(sname)
> +#define TRACE_SYS_ENTER_PROFILE_INIT(sname)
> +#define TRACE_SYS_EXIT_PROFILE(sname)
> +#define TRACE_SYS_EXIT_PROFILE_INIT(sname)
> +#endif
> +
>  #ifdef CONFIG_FTRACE_SYSCALLS
>  #define __SC_STR_ADECL1(t, a)		#a
>  #define __SC_STR_ADECL2(t, a, ...)	#a, __SC_STR_ADECL1(__VA_ARGS__)
> @@ -113,7 +160,6 @@ struct perf_counter_attr;
>  #define __SC_STR_TDECL5(t, a, ...)	#t, __SC_STR_TDECL4(__VA_ARGS__)
>  #define __SC_STR_TDECL6(t, a, ...)	#t, __SC_STR_TDECL5(__VA_ARGS__)
>  
> -
>  #define SYSCALL_TRACE_ENTER_EVENT(sname)				\
>  	static struct ftrace_event_call event_enter_##sname;		\
>  	struct trace_event enter_syscall_print_##sname = {		\
> @@ -134,6 +180,7 @@ struct perf_counter_attr;
>  		init_preds(&event_enter_##sname);			\
>  		return 0;						\
>  	}								\
> +	TRACE_SYS_ENTER_PROFILE(sname);					\
>  	static struct ftrace_event_call __used				\
>  	  __attribute__((__aligned__(4)))				\
>  	  __attribute__((section("_ftrace_events")))			\
> @@ -145,6 +192,7 @@ struct perf_counter_attr;
>  		.regfunc		= reg_event_syscall_enter,	\
>  		.unregfunc		= unreg_event_syscall_enter,	\
>  		.data			= "sys"#sname,			\
> +		TRACE_SYS_ENTER_PROFILE_INIT(sname)			\
>  	}
>  
>  #define SYSCALL_TRACE_EXIT_EVENT(sname)					\
> @@ -167,6 +215,7 @@ struct perf_counter_attr;
>  		init_preds(&event_exit_##sname);			\
>  		return 0;						\
>  	}								\
> +	TRACE_SYS_EXIT_PROFILE(sname);					\
>  	static struct ftrace_event_call __used				\
>  	  __attribute__((__aligned__(4)))				\
>  	  __attribute__((section("_ftrace_events")))			\
> @@ -178,6 +227,7 @@ struct perf_counter_attr;
>  		.regfunc		= reg_event_syscall_exit,	\
>  		.unregfunc		= unreg_event_syscall_exit,	\
>  		.data			= "sys"#sname,			\
> +		TRACE_SYS_EXIT_PROFILE_INIT(sname)			\
>  	}
>  
>  #define SYSCALL_METADATA(sname, nb)				\
> diff --git a/include/trace/syscall.h b/include/trace/syscall.h
> index df62840..3ab6dd1 100644
> --- a/include/trace/syscall.h
> +++ b/include/trace/syscall.h
> @@ -58,5 +58,12 @@ extern void unreg_event_syscall_exit(void *ptr);
>  enum print_line_t print_syscall_enter(struct trace_iterator *iter, int flags);
>  enum print_line_t print_syscall_exit(struct trace_iterator *iter, int flags);
>  #endif
> +#ifdef CONFIG_EVENT_PROFILE
> +int reg_prof_syscall_enter(char *name);
> +void unreg_prof_syscall_enter(char *name);
> +int reg_prof_syscall_exit(char *name);
> +void unreg_prof_syscall_exit(char *name);
> +
> +#endif
>  
>  #endif /* _TRACE_SYSCALL_H */
> diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
> index e58a9c1..f4eaec3 100644
> --- a/kernel/trace/trace_syscalls.c
> +++ b/kernel/trace/trace_syscalls.c
> @@ -1,6 +1,7 @@
>  #include <trace/syscall.h>
>  #include <linux/kernel.h>
>  #include <linux/ftrace.h>
> +#include <linux/perf_counter.h>
>  #include <asm/syscall.h>
>  
>  #include "trace_output.h"
> @@ -252,3 +253,123 @@ struct trace_event event_syscall_enter = {
>  struct trace_event event_syscall_exit = {
>  	.trace			= print_syscall_exit,
>  };
> +
> +#ifdef CONFIG_EVENT_PROFILE
> +static DECLARE_BITMAP(enabled_prof_enter_syscalls, FTRACE_SYSCALL_MAX);
> +static DECLARE_BITMAP(enabled_prof_exit_syscalls, FTRACE_SYSCALL_MAX);
> +static int sys_prof_refcount_enter;
> +static int sys_prof_refcount_exit;
> +
> +static void prof_syscall_enter(struct pt_regs *regs, long id)
> +{
> +	struct syscall_metadata *sys_data;
> +	int syscall_nr;
> +
> +	syscall_nr = syscall_get_nr(current, regs);
> +	if (!test_bit(syscall_nr, enabled_prof_enter_syscalls))
> +		return;
> +
> +	sys_data = syscall_nr_to_meta(syscall_nr);
> +	if (!sys_data)
> +		return;
> +
> +	perf_tpcounter_event(sys_data->enter_id, 0, 1, NULL, 0);
> +}
> +
> +int reg_prof_syscall_enter(char *name)
> +{
> +	int ret = 0;
> +	int num;
> +
> +	num = syscall_name_to_nr(name);
> +	if (num < 0 || num >= FTRACE_SYSCALL_MAX)
> +		return -ENOSYS;
> +
> +	mutex_lock(&syscall_trace_lock);
> +	if (!sys_prof_refcount_enter)
> +		ret = register_trace_syscall_enter(prof_syscall_enter);
> +	if (ret) {
> +		pr_info("event trace: Could not activate "
> +				"syscall entry trace point\n");
> +	} else {
> +		set_bit(num, enabled_prof_enter_syscalls);
> +		sys_prof_refcount_enter++;
> +	}
> +	mutex_unlock(&syscall_trace_lock);
> +	return ret;
> +}
> +
> +void unreg_prof_syscall_enter(char *name)
> +{
> +	int num;
> +
> +	num = syscall_name_to_nr(name);
> +	if (num < 0 || num >= FTRACE_SYSCALL_MAX)
> +		return;
> +
> +	mutex_lock(&syscall_trace_lock);
> +	sys_prof_refcount_enter--;
> +	clear_bit(num, enabled_prof_enter_syscalls);
> +	if (!sys_prof_refcount_enter)
> +		unregister_trace_syscall_enter(prof_syscall_enter);
> +	mutex_unlock(&syscall_trace_lock);
> +}
> +
> +static void prof_syscall_exit(struct pt_regs *regs, long ret)
> +{
> +	struct syscall_metadata *sys_data;
> +	int syscall_nr;
> +
> +	syscall_nr = syscall_get_nr(current, regs);
> +	if (!test_bit(syscall_nr, enabled_prof_exit_syscalls))
> +		return;
> +
> +	sys_data = syscall_nr_to_meta(syscall_nr);
> +	if (!sys_data)
> +		return;
> +
> +	perf_tpcounter_event(sys_data->exit_id, 0, 1, NULL, 0);
> +}
> +
> +int reg_prof_syscall_exit(char *name)
> +{
> +	int ret = 0;
> +	int num;
> +
> +	num = syscall_name_to_nr(name);
> +	if (num < 0 || num >= FTRACE_SYSCALL_MAX)
> +		return -ENOSYS;
> +
> +	mutex_lock(&syscall_trace_lock);
> +	if (!sys_prof_refcount_exit)
> +		ret = register_trace_syscall_exit(prof_syscall_exit);
> +	if (ret) {
> +		pr_info("event trace: Could not activate "
> +				"syscall exit trace point\n");
> +	} else {
> +		set_bit(num, enabled_prof_exit_syscalls);
> +		sys_prof_refcount_exit++;
> +	}
> +	mutex_unlock(&syscall_trace_lock);
> +	return ret;
> +}
> +
> +void unreg_prof_syscall_exit(char *name)
> +{
> +	int num;
> +
> +	num = syscall_name_to_nr(name);
> +	if (num < 0 || num >= FTRACE_SYSCALL_MAX)
> +		return;
> +
> +	mutex_lock(&syscall_trace_lock);
> +	sys_prof_refcount_exit--;
> +	clear_bit(num, enabled_prof_exit_syscalls);
> +	if (!sys_prof_refcount_exit)
> +		unregister_trace_syscall_exit(prof_syscall_exit);
> +	mutex_unlock(&syscall_trace_lock);
> +}
> +
> +#endif
> +
> +
> -- 
> 1.6.2.5
> 



* Re: [PATCH 10/12] add perf counter support
  2009-08-11 12:12   ` Frederic Weisbecker
@ 2009-08-11 12:17     ` Ingo Molnar
  2009-08-11 12:25       ` Frederic Weisbecker
  0 siblings, 1 reply; 88+ messages in thread
From: Ingo Molnar @ 2009-08-11 12:17 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Jason Baron, linux-kernel, laijs, rostedt, peterz,
	mathieu.desnoyers, jiayingz, mbligh, lizf


* Frederic Weisbecker <fweisbec@gmail.com> wrote:

> On Mon, Aug 10, 2009 at 04:53:02PM -0400, Jason Baron wrote:
> > Make 'perf stat -e syscalls:sys_enter_blah' work with syscall style tracepoints.
> 
> 
> It would be nice to also be able to type:
> 
> perf stat -e syscalls:blah
> 
> and then have both enter/exit counters.

Plus wildcard/regex support would be nice as well in the listing of 
events. That way one could do a shortcut of:

 perf stat -e syscalls:*

to trace all syscalls. Or:

 perf stat -e syscalls:*read*

to see all the read variants - etc.

	Ingo
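
For illustration, the glob-style selection suggested above could be sketched in
user space with POSIX fnmatch(). This is only a sketch under assumptions: the
event list and the helper count_matching_events() are hypothetical, and nothing
here claims to be how perf parses its -e argument.

```c
#include <fnmatch.h>
#include <stddef.h>

/* Hypothetical event list; real names would come from the kernel's event directory. */
static const char *const event_names[] = {
	"syscalls:sys_enter_read",
	"syscalls:sys_exit_read",
	"syscalls:sys_enter_pread64",
	"syscalls:sys_enter_write",
	"sched:sched_switch",
};

#define NR_EVENTS (sizeof(event_names) / sizeof(event_names[0]))

/* Count the events whose full "subsystem:name" string matches a shell-style glob. */
static size_t count_matching_events(const char *pattern)
{
	size_t i, hits = 0;

	for (i = 0; i < NR_EVENTS; i++) {
		/* fnmatch() returns 0 on a match; flags == 0 lets '*' span any character. */
		if (fnmatch(pattern, event_names[i], 0) == 0)
			hits++;
	}
	return hits;
}
```

With this list, "syscalls:*" selects every syscall event and "syscalls:*read*"
selects only the read variants, which is the shortcut described above.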

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 10/12] add perf counter support
  2009-08-11 12:17     ` Ingo Molnar
@ 2009-08-11 12:25       ` Frederic Weisbecker
  0 siblings, 0 replies; 88+ messages in thread
From: Frederic Weisbecker @ 2009-08-11 12:25 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jason Baron, linux-kernel, laijs, rostedt, peterz,
	mathieu.desnoyers, jiayingz, mbligh, lizf

On Tue, Aug 11, 2009 at 02:17:11PM +0200, Ingo Molnar wrote:
> 
> * Frederic Weisbecker <fweisbec@gmail.com> wrote:
> 
> > On Mon, Aug 10, 2009 at 04:53:02PM -0400, Jason Baron wrote:
> > > Make 'perf stat -e syscalls:sys_enter_blah' work with syscall style tracepoints.
> > 
> > 
> > It would be nice to also be able to type:
> > 
> > perf stat -e syscalls:blah
> > 
> > and then have both enter/exit counters.
> 
> Plus wildcard/regex support would be nice as well in the listing of 
> events. That way one could do a shortcut of:
> 
>  perf stat -e syscalls:*
> 
> to trace all syscalls. Or:
> 
>  perf stat -e syscalls:*read*
> 
> to see all the read variants - etc.
> 
> 	Ingo


Plus filters on syscall fields. Because the arguments are wrapped and post-processed,
they are not treated like the usual event fields.
I must confess this part is not trivial, though.


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 05/12] update FTRACE_SYSCALL_MAX
  2009-08-11 11:00   ` Frederic Weisbecker
@ 2009-08-11 19:39     ` Matt Fleming
  2009-08-24 13:41     ` Paul Mundt
  1 sibling, 0 replies; 88+ messages in thread
From: Matt Fleming @ 2009-08-11 19:39 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Jason Baron, linux-kernel, mingo, laijs, rostedt, peterz,
	mathieu.desnoyers, jiayingz, mbligh, lizf

On Tue, Aug 11, 2009 at 01:00:25PM +0200, Frederic Weisbecker wrote:
> On Mon, Aug 10, 2009 at 04:52:35PM -0400, Jason Baron wrote:
> > update FTRACE_SYSCALL_MAX to the current number of syscalls
> > 
> > Signed-off-by: Jason Baron <jbaron@redhat.com>
> > 
> > ---
> >  arch/x86/include/asm/ftrace.h |    4 ++--
> >  1 files changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/ftrace.h b/arch/x86/include/asm/ftrace.h
> > index bd2c651..7113654 100644
> > --- a/arch/x86/include/asm/ftrace.h
> > +++ b/arch/x86/include/asm/ftrace.h
> > @@ -30,9 +30,9 @@
> >  
> >  /* FIXME: I don't want to stay hardcoded */
> >  #ifdef CONFIG_X86_64
> > -# define FTRACE_SYSCALL_MAX     296
> > +# define FTRACE_SYSCALL_MAX     299
> >  #else
> > -# define FTRACE_SYSCALL_MAX     333
> > +# define FTRACE_SYSCALL_MAX     337
> >  #endif
> 
> 
> I don't remember why we had to use a hardcoded number.
> Is there no way to stay in sync with the current number of
> syscalls? We want to avoid patching the kernel each time we
> have a new syscall :-)
> 

On SH we're using (NR_syscalls - 1) to avoid that exact problem.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 07/12] add ftrace_event_call void * 'data' field
  2009-08-11 10:09   ` Frederic Weisbecker
@ 2009-08-17 22:19     ` Steven Rostedt
  2009-08-17 23:09       ` Frederic Weisbecker
  0 siblings, 1 reply; 88+ messages in thread
From: Steven Rostedt @ 2009-08-17 22:19 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Jason Baron, linux-kernel, mingo, laijs, peterz,
	mathieu.desnoyers, jiayingz, mbligh, lizf


On Tue, 11 Aug 2009, Frederic Weisbecker wrote:

> On Mon, Aug 10, 2009 at 04:52:44PM -0400, Jason Baron wrote:
> > add an optional 'void *' pointer to 'ftrace_event_call' that is
> > passed in for regfunc and unregfunc.
> > 
> > Signed-off-by: Jason Baron <jbaron@redhat.com>
> > 
> > ---
> >  include/linux/ftrace_event.h |    5 +++--
> >  include/trace/ftrace.h       |    4 ++--
> >  kernel/trace/trace_events.c  |    4 ++--
> >  3 files changed, 7 insertions(+), 6 deletions(-)
> > 
> > diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
> > index ac8c6f8..8544f12 100644
> > --- a/include/linux/ftrace_event.h
> > +++ b/include/linux/ftrace_event.h
> > @@ -112,8 +112,8 @@ struct ftrace_event_call {
> >  	struct dentry		*dir;
> >  	struct trace_event	*event;
> >  	int			enabled;
> > -	int			(*regfunc)(void);
> > -	void			(*unregfunc)(void);
> > +	int			(*regfunc)(void *);
> > +	void			(*unregfunc)(void *);
> >  	int			id;
> >  	int			(*raw_init)(void);
> >  	int			(*show_format)(struct trace_seq *s);
> > @@ -122,6 +122,7 @@ struct ftrace_event_call {
> >  	int			filter_active;
> >  	struct event_filter	*filter;
> >  	void			*mod;
> > +	void			*data;
> >  
> >  	atomic_t		profile_count;
> >  	int			(*profile_enable)(struct ftrace_event_call *);
> > diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h
> > index 80e5f6c..a0de384 100644
> > --- a/include/trace/ftrace.h
> > +++ b/include/trace/ftrace.h
> > @@ -568,7 +568,7 @@ static void ftrace_raw_event_##call(proto)				\
> >  		trace_nowake_buffer_unlock_commit(event, irq_flags, pc); \
> >  }									\
> >  									\
> > -static int ftrace_raw_reg_event_##call(void)				\
> > +static int ftrace_raw_reg_event_##call(void *ptr)			\
> 
> 
> Shouldn't it have a __used attribute here, or something?

Do function parameters need that? There's lots of places where the 
parameter of a function is not used by a function itself.

-- Steve

> 
> 
> >  {									\
> >  	int ret;							\
> >  									\
> > @@ -579,7 +579,7 @@ static int ftrace_raw_reg_event_##call(void)				\
> >  	return ret;							\
> >  }									\
> >  									\
> > -static void ftrace_raw_unreg_event_##call(void)				\
> > +static void ftrace_raw_unreg_event_##call(void *ptr)			\
> 
> 
> 
> Same here.
> 
> 
> 
> >  {									\
> >  	unregister_trace_##call(ftrace_raw_event_##call);		\
> >  }									\
> > diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
> > index f95f847..1d289e2 100644
> > --- a/kernel/trace/trace_events.c
> > +++ b/kernel/trace/trace_events.c
> > @@ -86,14 +86,14 @@ static void ftrace_event_enable_disable(struct ftrace_event_call *call,
> >  		if (call->enabled) {
> >  			call->enabled = 0;
> >  			tracing_stop_cmdline_record();
> > -			call->unregfunc();
> > +			call->unregfunc(call->data);
> >  		}
> >  		break;
> >  	case 1:
> >  		if (!call->enabled) {
> >  			call->enabled = 1;
> >  			tracing_start_cmdline_record();
> > -			call->regfunc();
> > +			call->regfunc(call->data);
> >  		}
> >  		break;
> >  	}
> > -- 
> > 1.6.2.5
> > 
> 
> 

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 07/12] add ftrace_event_call void * 'data' field
  2009-08-17 22:19     ` Steven Rostedt
@ 2009-08-17 23:09       ` Frederic Weisbecker
  2009-08-18  0:06         ` Steven Rostedt
  0 siblings, 1 reply; 88+ messages in thread
From: Frederic Weisbecker @ 2009-08-17 23:09 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Jason Baron, linux-kernel, mingo, laijs, peterz,
	mathieu.desnoyers, jiayingz, mbligh, lizf

On Mon, Aug 17, 2009 at 06:19:33PM -0400, Steven Rostedt wrote:
> 
> On Tue, 11 Aug 2009, Frederic Weisbecker wrote:
> 
> > On Mon, Aug 10, 2009 at 04:52:44PM -0400, Jason Baron wrote:
> > > add an optional 'void *' pointer to 'ftrace_event_call' that is
> > > passed in for regfunc and unregfunc.
> > > 
> > > Signed-off-by: Jason Baron <jbaron@redhat.com>
> > > 
> > > ---
> > >  include/linux/ftrace_event.h |    5 +++--
> > >  include/trace/ftrace.h       |    4 ++--
> > >  kernel/trace/trace_events.c  |    4 ++--
> > >  3 files changed, 7 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
> > > index ac8c6f8..8544f12 100644
> > > --- a/include/linux/ftrace_event.h
> > > +++ b/include/linux/ftrace_event.h
> > > @@ -112,8 +112,8 @@ struct ftrace_event_call {
> > >  	struct dentry		*dir;
> > >  	struct trace_event	*event;
> > >  	int			enabled;
> > > -	int			(*regfunc)(void);
> > > -	void			(*unregfunc)(void);
> > > +	int			(*regfunc)(void *);
> > > +	void			(*unregfunc)(void *);
> > >  	int			id;
> > >  	int			(*raw_init)(void);
> > >  	int			(*show_format)(struct trace_seq *s);
> > > @@ -122,6 +122,7 @@ struct ftrace_event_call {
> > >  	int			filter_active;
> > >  	struct event_filter	*filter;
> > >  	void			*mod;
> > > +	void			*data;
> > >  
> > >  	atomic_t		profile_count;
> > >  	int			(*profile_enable)(struct ftrace_event_call *);
> > > diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h
> > > index 80e5f6c..a0de384 100644
> > > --- a/include/trace/ftrace.h
> > > +++ b/include/trace/ftrace.h
> > > @@ -568,7 +568,7 @@ static void ftrace_raw_event_##call(proto)				\
> > >  		trace_nowake_buffer_unlock_commit(event, irq_flags, pc); \
> > >  }									\
> > >  									\
> > > -static int ftrace_raw_reg_event_##call(void)				\
> > > +static int ftrace_raw_reg_event_##call(void *ptr)			\
> > 
> > 
> > Shouldn't it have a __used attribute here, or something?
> 
> Do function parameters need that? There's lots of places where the 
> parameter of a function is not used by a function itself.
> 
> -- Steve


No actually, I thought gcc would warn, but it didn't :-)



^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 07/12] add ftrace_event_call void * 'data' field
  2009-08-17 23:09       ` Frederic Weisbecker
@ 2009-08-18  0:06         ` Steven Rostedt
  0 siblings, 0 replies; 88+ messages in thread
From: Steven Rostedt @ 2009-08-18  0:06 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Jason Baron, linux-kernel, mingo, laijs, peterz,
	mathieu.desnoyers, jiayingz, mbligh, lizf


On Tue, 18 Aug 2009, Frederic Weisbecker wrote:
> > > > @@ -568,7 +568,7 @@ static void ftrace_raw_event_##call(proto)				\
> > > >  		trace_nowake_buffer_unlock_commit(event, irq_flags, pc); \
> > > >  }									\
> > > >  									\
> > > > -static int ftrace_raw_reg_event_##call(void)				\
> > > > +static int ftrace_raw_reg_event_##call(void *ptr)			\
> > > 
> > > 
> > > Shouldn't it have a __used attribute here, or something?
> > 
> > Do function parameters need that? There's lots of places where the 
> > parameter of a function is not used by a function itself.
> > 
> > -- Steve
> 
> 
> No actually, I thought gcc would warn, but it didn't :-)

Yeah, that is the right thing too. Because functions can be passed as 
parameters (like this one) and every "stub function" we have will then 
need this attribute. It is OK to ignore parameters of functions without 
telling gcc that you plan on ignoring them.

-- Steve
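
A minimal user-space sketch of the point above, assuming hypothetical names
(reg_fn, stub_reg, counting_reg, enable_event): every callback must share one
signature, so a stub simply ignores the argument it has no use for. As far as I
know, gcc only reports this under -Wunused-parameter, which -Wall alone does not
enable, and the attribute that would silence it is 'unused', not '__used'.

```c
#include <stddef.h>

/* Callback type taking a context pointer, like regfunc(void *) in the patch. */
typedef int (*reg_fn)(void *data);

/* Counts how often the stub ran, so the effect is observable. */
static int reg_calls;

/* Stub callback: the parameter is deliberately ignored, no attribute needed. */
static int stub_reg(void *ptr)
{
	reg_calls++;
	return 0;
}

/* Hypothetical callback that does use the pointer it is handed. */
static int counting_reg(void *ptr)
{
	int *counter = ptr;

	(*counter)++;
	return 0;
}

/* The caller always passes the data pointer, whether or not the callee wants it. */
static int enable_event(reg_fn fn, void *data)
{
	return fn(data);
}
```

Both callbacks can be handed to enable_event() interchangeably; only
counting_reg() touches the data.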


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 05/12] update FTRACE_SYSCALL_MAX
  2009-08-11 11:00   ` Frederic Weisbecker
  2009-08-11 19:39     ` Matt Fleming
@ 2009-08-24 13:41     ` Paul Mundt
  2009-08-24 14:06       ` Jason Baron
  1 sibling, 1 reply; 88+ messages in thread
From: Paul Mundt @ 2009-08-24 13:41 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Jason Baron, linux-kernel, mingo, laijs, rostedt, peterz,
	mathieu.desnoyers, jiayingz, mbligh, lizf

On Tue, Aug 11, 2009 at 01:00:25PM +0200, Frederic Weisbecker wrote:
> On Mon, Aug 10, 2009 at 04:52:35PM -0400, Jason Baron wrote:
> > update FTRACE_SYSCALL_MAX to the current number of syscalls
> > 
> > Signed-off-by: Jason Baron <jbaron@redhat.com>
> > 
> > ---
> >  arch/x86/include/asm/ftrace.h |    4 ++--
> >  1 files changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/ftrace.h b/arch/x86/include/asm/ftrace.h
> > index bd2c651..7113654 100644
> > --- a/arch/x86/include/asm/ftrace.h
> > +++ b/arch/x86/include/asm/ftrace.h
> > @@ -30,9 +30,9 @@
> >  
> >  /* FIXME: I don't want to stay hardcoded */
> >  #ifdef CONFIG_X86_64
> > -# define FTRACE_SYSCALL_MAX     296
> > +# define FTRACE_SYSCALL_MAX     299
> >  #else
> > -# define FTRACE_SYSCALL_MAX     333
> > +# define FTRACE_SYSCALL_MAX     337
> >  #endif
> 
> 
> I don't remember why we had to use a hardcoded number.
> Is there no way to stay in sync with the current number of
> syscalls? We want to avoid patching the kernel each time we
> have a new syscall :-)
> 
I hope you can clarify what the meaning of this is supposed to be
exactly. Is this number supposed to be the last usable syscall, or is it
supposed to be the equivalent of NR_syscalls?

Presently on SH we have this as NR_syscalls - 1, while on s390 I see it
is treated as NR_syscalls directly. s390 opencodes the NR_syscalls
directly and so presently blows up in -next due to a missing
FTRACE_SYSCALL_MAX definition:

	http://kisskb.ellerman.id.au/kisskb/buildresult/1120523/

I was in the process of fixing that up when I noticed this difference.
x86 seems to also treat this as NR_syscalls - 1, but that looks to me
like there is an off-by-1 in arch_init_ftrace_syscalls() causing the last
syscall to be skipped?

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 05/12] update FTRACE_SYSCALL_MAX
  2009-08-24 13:41     ` Paul Mundt
@ 2009-08-24 14:06       ` Jason Baron
  2009-08-24 14:15         ` Paul Mundt
  0 siblings, 1 reply; 88+ messages in thread
From: Jason Baron @ 2009-08-24 14:06 UTC (permalink / raw)
  To: Paul Mundt, Frederic Weisbecker, linux-kernel, mingo, laijs,
	rostedt, peterz, mathieu.desnoyers, jiayingz, mbligh, lizf

On Mon, Aug 24, 2009 at 10:41:52PM +0900, Paul Mundt wrote:
> > On Mon, Aug 10, 2009 at 04:52:35PM -0400, Jason Baron wrote:
> > > update FTRACE_SYSCALL_MAX to the current number of syscalls
> > > 
> > > Signed-off-by: Jason Baron <jbaron@redhat.com>
> > > 
> > > ---
> > >  arch/x86/include/asm/ftrace.h |    4 ++--
> > >  1 files changed, 2 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/arch/x86/include/asm/ftrace.h b/arch/x86/include/asm/ftrace.h
> > > index bd2c651..7113654 100644
> > > --- a/arch/x86/include/asm/ftrace.h
> > > +++ b/arch/x86/include/asm/ftrace.h
> > > @@ -30,9 +30,9 @@
> > >  
> > >  /* FIXME: I don't want to stay hardcoded */
> > >  #ifdef CONFIG_X86_64
> > > -# define FTRACE_SYSCALL_MAX     296
> > > +# define FTRACE_SYSCALL_MAX     299
> > >  #else
> > > -# define FTRACE_SYSCALL_MAX     333
> > > +# define FTRACE_SYSCALL_MAX     337
> > >  #endif
> > 
> > 
> > I don't remember why we had to use a hardcoded number.
> > Is there no way to stay in sync with the current number of
> > syscalls? We want to avoid patching the kernel each time we
> > have a new syscall :-)
> > 
> I hope you can clarify what the meaning of this is supposed to be
> exactly. Is this number supposed to be the last usable syscall, or is it
> supposed to be the equivalent of NR_syscalls?
> 

I am using it as the equivalent of NR_syscalls.

> Presently on SH we have this as NR_syscalls - 1, while on s390 I see it
> is treated as NR_syscalls directly. s390 opencodes the NR_syscalls
> directly and so presently blows up in -next due to a missing
> FTRACE_SYSCALL_MAX definition:
> 
> 	http://kisskb.ellerman.id.au/kisskb/buildresult/1120523/
> 
> I was in the process of fixing that up when I noticed this difference.
> x86 seems to also treat this as NR_syscalls - 1, but that looks to me
> like there is an off-by-1 in arch_init_ftrace_syscalls() causing the last
> syscall to be skipped?

I don't see how it's used as 'NR_syscalls - 1' on x86,
arch_init_ftrace_syscalls() does:

        for (i = 0; i < FTRACE_SYSCALL_MAX; i++) {
                meta = find_syscall_meta(psys_syscall_table[i]);
                syscalls_metadata[i] = meta;
        }

So the last syscall should not be skipped.

We should probably convert *all* the arches to be using NR_syscalls
directly.

thanks,

-Jason




^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 05/12] update FTRACE_SYSCALL_MAX
  2009-08-24 14:06       ` Jason Baron
@ 2009-08-24 14:15         ` Paul Mundt
  2009-08-24 14:34           ` Frederic Weisbecker
  2009-08-24 14:42           ` Jason Baron
  0 siblings, 2 replies; 88+ messages in thread
From: Paul Mundt @ 2009-08-24 14:15 UTC (permalink / raw)
  To: Jason Baron
  Cc: Frederic Weisbecker, linux-kernel, mingo, laijs, rostedt, peterz,
	mathieu.desnoyers, jiayingz, mbligh, lizf

On Mon, Aug 24, 2009 at 10:06:29AM -0400, Jason Baron wrote:
> On Mon, Aug 24, 2009 at 10:41:52PM +0900, Paul Mundt wrote:
> > I hope you can clarify what the meaning of this is supposed to be
> > exactly. Is this number supposed to be the last usable syscall, or is it
> > supposed to be the equivalent of NR_syscalls?
> > 
> 
> I am using it as the equivalent of NR_syscalls.
> 
NR_syscalls has always been the total number of system calls, not the
last one.

> > Presently on SH we have this as NR_syscalls - 1, while on s390 I see it
> > is treated as NR_syscalls directly. s390 opencodes the NR_syscalls
> > directly and so presently blows up in -next due to a missing
> > FTRACE_SYSCALL_MAX definition:
> > 
> > 	http://kisskb.ellerman.id.au/kisskb/buildresult/1120523/
> > 
> > I was in the process of fixing that up when I noticed this difference.
> > x86 seems to also treat this as NR_syscalls - 1, but that looks to me
> > like there is an off-by-1 in arch_init_ftrace_syscalls() causing the last
> > syscall to be skipped?
> 
> I don't see how it's used as 'NR_syscalls - 1' on x86,
> arch_init_ftrace_syscalls() does:
> 
>         for (i = 0; i < FTRACE_SYSCALL_MAX; i++) {
>                 meta = find_syscall_meta(psys_syscall_table[i]);
>                 syscalls_metadata[i] = meta;
>         }
> 
> So the last syscall should not be skipped.
> 

In today's -next:

#ifdef CONFIG_X86_64
# define FTRACE_SYSCALL_MAX     299
#else
# define FTRACE_SYSCALL_MAX     337
#endif

unistd_32.h:

#define __NR_reflinkat          337

unistd_64.h:

#define __NR_reflinkat          299

The first syscall starts at 0, but I don't see how this last syscall is
handled. If there were a __NR_syscalls 300 and 338 respectively, that
would seem to do the right thing. Or am I missing something?
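
The arithmetic behind this can be sketched in user space. The names
LAST_SYSCALL_NR, TABLE_SIZE and walk_table() are made up for the illustration;
only the value 299 is taken from the x86_64 numbers quoted above.

```c
#include <stddef.h>

/* Syscall numbers run 0..LAST_SYSCALL_NR inclusive (299 as in the x86_64 case above). */
#define LAST_SYSCALL_NR	299
/* A table covering all of them needs one more slot: last valid number + 1. */
#define TABLE_SIZE	(LAST_SYSCALL_NR + 1)

static int visited[TABLE_SIZE];

/*
 * Walk the table with an exclusive upper bound, the way the
 * 'for (i = 0; i < FTRACE_SYSCALL_MAX; i++)' loop does, and return
 * how many entries were visited.
 */
static int walk_table(int bound)
{
	int i, count = 0;

	if (bound > TABLE_SIZE)
		bound = TABLE_SIZE;	/* never run past the array */

	for (i = 0; i < TABLE_SIZE; i++)
		visited[i] = 0;

	for (i = 0; i < bound; i++) {
		visited[i] = 1;
		count++;
	}
	return count;
}
```

Calling walk_table(LAST_SYSCALL_NR) leaves visited[299] untouched, while
walk_table(TABLE_SIZE) reaches it. A bound equal to the last syscall number
therefore skips exactly that syscall.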

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 05/12] update FTRACE_SYSCALL_MAX
  2009-08-24 14:15         ` Paul Mundt
@ 2009-08-24 14:34           ` Frederic Weisbecker
  2009-08-24 14:37             ` Paul Mundt
  2009-08-24 14:42           ` Jason Baron
  1 sibling, 1 reply; 88+ messages in thread
From: Frederic Weisbecker @ 2009-08-24 14:34 UTC (permalink / raw)
  To: Paul Mundt, Jason Baron, linux-kernel, mingo, laijs, rostedt,
	peterz, mathieu.desnoyers, jiayingz, mbligh, lizf

On Mon, Aug 24, 2009 at 11:15:39PM +0900, Paul Mundt wrote:
> On Mon, Aug 24, 2009 at 10:06:29AM -0400, Jason Baron wrote:
> > On Mon, Aug 24, 2009 at 10:41:52PM +0900, Paul Mundt wrote:
> > > I hope you can clarify what the meaning of this is supposed to be
> > > exactly. Is this number supposed to be the last usable syscall, or is it
> > > supposed to be the equivalent of NR_syscalls?
> > > 
> > 
> > I am using it as the equivalent of NR_syscalls.
> > 
> NR_syscalls has always been the total number of system calls, not the
> last one.
> 
> > > Presently on SH we have this as NR_syscalls - 1, while on s390 I see it
> > > is treated as NR_syscalls directly. s390 opencodes the NR_syscalls
> > > directly and so presently blows up in -next due to a missing
> > > FTRACE_SYSCALL_MAX definition:
> > > 
> > > 	http://kisskb.ellerman.id.au/kisskb/buildresult/1120523/
> > > 
> > > I was in the process of fixing that up when I noticed this difference.
> > > x86 seems to also treat this as NR_syscalls - 1, but that looks to me
> > > like there is an off-by-1 in arch_init_ftrace_syscalls() causing the last
> > > syscall to be skipped?
> > 
> > I don't see how it's used as 'NR_syscalls - 1' on x86,
> > arch_init_ftrace_syscalls() does:
> > 
> >         for (i = 0; i < FTRACE_SYSCALL_MAX; i++) {
> >                 meta = find_syscall_meta(psys_syscall_table[i]);
> >                 syscalls_metadata[i] = meta;
> >         }
> > 
> > So the last syscall should not be skipped.
> > 
> 
> In today's -next:
> 
> #ifdef CONFIG_X86_64
> # define FTRACE_SYSCALL_MAX     299
> #else
> # define FTRACE_SYSCALL_MAX     337
> #endif
> 
> unistd_32.h:
> 
> #define __NR_reflinkat          337
> 
> unistd_64.h:
> 
> #define __NR_reflinkat          299
> 
> The first syscall starts at 0, but I don't see how this last syscall is
> handled. If there were a __NR_syscalls 300 and 338 respectively, that
> would seem to do the right thing. Or am I missing something?


Yeah, I guess what we need here is NR_syscalls + 1.


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 05/12] update FTRACE_SYSCALL_MAX
  2009-08-24 14:34           ` Frederic Weisbecker
@ 2009-08-24 14:37             ` Paul Mundt
  0 siblings, 0 replies; 88+ messages in thread
From: Paul Mundt @ 2009-08-24 14:37 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Jason Baron, linux-kernel, mingo, laijs, rostedt, peterz,
	mathieu.desnoyers, jiayingz, mbligh, lizf

On Mon, Aug 24, 2009 at 04:34:20PM +0200, Frederic Weisbecker wrote:
> On Mon, Aug 24, 2009 at 11:15:39PM +0900, Paul Mundt wrote:
> > On Mon, Aug 24, 2009 at 10:06:29AM -0400, Jason Baron wrote:
> > > I don't see how it's used as 'NR_syscalls - 1' on x86,
> > > arch_init_ftrace_syscalls() does:
> > > 
> > >         for (i = 0; i < FTRACE_SYSCALL_MAX; i++) {
> > >                 meta = find_syscall_meta(psys_syscall_table[i]);
> > >                 syscalls_metadata[i] = meta;
> > >         }
> > > 
> > > So the last syscall should not be skipped.
> > > 
> > 
> > In today's -next:
> > 
> > #ifdef CONFIG_X86_64
> > # define FTRACE_SYSCALL_MAX     299
> > #else
> > # define FTRACE_SYSCALL_MAX     337
> > #endif
> > 
> > unistd_32.h:
> > 
> > #define __NR_reflinkat          337
> > 
> > unistd_64.h:
> > 
> > #define __NR_reflinkat          299
> > 
> > The first syscall starts at 0, but I don't see how this last syscall is
> > handled. If there were a __NR_syscalls 300 and 338 respectively, that
> > would seem to do the right thing. Or am I missing something?
> 
> 
> Yeah, I guess what we need here is NR_syscalls + 1.
> 
No, just NR_syscalls. NR_syscalls has always been last valid + 1. At
least this is how all architectures are using it, it just seems to have
gone missing from x86.

So having said that, it looks like s390 got it right, while x86 has an
off-by-1, and sh foolishly followed x86 ;-)

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 05/12] update FTRACE_SYSCALL_MAX
  2009-08-24 14:15         ` Paul Mundt
  2009-08-24 14:34           ` Frederic Weisbecker
@ 2009-08-24 14:42           ` Jason Baron
  2009-08-24 14:50             ` Paul Mundt
  1 sibling, 1 reply; 88+ messages in thread
From: Jason Baron @ 2009-08-24 14:42 UTC (permalink / raw)
  To: Paul Mundt, Frederic Weisbecker, linux-kernel, mingo, laijs,
	rostedt, peterz, mathieu.desnoyers, jiayingz, mbligh, lizf

On Mon, Aug 24, 2009 at 11:15:39PM +0900, Paul Mundt wrote:
> On Mon, Aug 24, 2009 at 10:06:29AM -0400, Jason Baron wrote:
> > On Mon, Aug 24, 2009 at 10:41:52PM +0900, Paul Mundt wrote:
> > > I hope you can clarify what the meaning of this is supposed to be
> > > exactly. Is this number supposed to be the last usable syscall, or is it
> > > supposed to be the equivalent of NR_syscalls?
> > > 
> > 
> > I am using it as the equivalent of NR_syscalls.
> > 
> NR_syscalls has always been the total number of system calls, not the
> last one.
> 
> > > Presently on SH we have this as NR_syscalls - 1, while on s390 I see it
> > > is treated as NR_syscalls directly. s390 opencodes the NR_syscalls
> > > directly and so presently blows up in -next due to a missing
> > > FTRACE_SYSCALL_MAX definition:
> > > 
> > > 	http://kisskb.ellerman.id.au/kisskb/buildresult/1120523/
> > > 
> > > I was in the process of fixing that up when I noticed this difference.
> > > x86 seems to also treat this as NR_syscalls - 1, but that looks to me
> > > like there is an off-by-1 in arch_init_ftrace_syscalls() causing the last
> > > syscall to be skipped?
> > 
> > I don't see how it's used as 'NR_syscalls - 1' on x86,
> > arch_init_ftrace_syscalls() does:
> > 
> >         for (i = 0; i < FTRACE_SYSCALL_MAX; i++) {
> >                 meta = find_syscall_meta(psys_syscall_table[i]);
> >                 syscalls_metadata[i] = meta;
> >         }
> > 
> > So the last syscall should not be skipped.
> > 
> 
> In today's -next:
> 
> #ifdef CONFIG_X86_64
> # define FTRACE_SYSCALL_MAX     299
> #else
> # define FTRACE_SYSCALL_MAX     337
> #endif
> 
> unistd_32.h:
> 
> #define __NR_reflinkat          337
> 
> unistd_64.h:
> 
> #define __NR_reflinkat          299
> 
> The first syscall starts at 0, but I don't see how this last syscall is
> handled. If there were a __NR_syscalls 300 and 338 respectively, that
> would seem to do the right thing. Or am I missing something?

No, you are right. When I changed the FTRACE_SYSCALL_MAX to 299, and
337, there was no reflinkat syscall in the tree. So, it was equivalent
to NR_syscalls at that point in time. So that's where the confusion is.

Clearly, all the more reason to drop FTRACE_SYSCALL_MAX and change to
NR_syscalls...

thanks,

-Jason


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 05/12] update FTRACE_SYSCALL_MAX
  2009-08-24 14:42           ` Jason Baron
@ 2009-08-24 14:50             ` Paul Mundt
  2009-08-24 18:34               ` Ingo Molnar
  0 siblings, 1 reply; 88+ messages in thread
From: Paul Mundt @ 2009-08-24 14:50 UTC (permalink / raw)
  To: Jason Baron
  Cc: Frederic Weisbecker, linux-kernel, mingo, laijs, rostedt, peterz,
	mathieu.desnoyers, jiayingz, mbligh, lizf

On Mon, Aug 24, 2009 at 10:42:28AM -0400, Jason Baron wrote:
> On Mon, Aug 24, 2009 at 11:15:39PM +0900, Paul Mundt wrote:
> > On Mon, Aug 24, 2009 at 10:06:29AM -0400, Jason Baron wrote:
> > > On Mon, Aug 24, 2009 at 10:41:52PM +0900, Paul Mundt wrote:
> > > > I hope you can clarify what the meaning of this is supposed to be
> > > > exactly. Is this number supposed to be the last usable syscall, or is it
> > > > supposed to be the equivalent of NR_syscalls?
> > > > 
> > > 
> > > I am using it as the equivalent of NR_syscalls.
> > > 
> > NR_syscalls has always been the total number of system calls, not the
> > last one.
> > 
> > > > Presently on SH we have this as NR_syscalls - 1, while on s390 I see it
> > > > is treated as NR_syscalls directly. s390 opencodes the NR_syscalls
> > > > directly and so presently blows up in -next due to a missing
> > > > FTRACE_SYSCALL_MAX definition:
> > > > 
> > > > 	http://kisskb.ellerman.id.au/kisskb/buildresult/1120523/
> > > > 
> > > > I was in the process of fixing that up when I noticed this difference.
> > > > x86 seems to also treat this as NR_syscalls - 1, but that looks to me
> > > > like there is an off-by-1 in arch_init_ftrace_syscalls() causing the last
> > > > syscall to be skipped?
> > > 
> > > I don't see how it's used as 'NR_syscalls - 1' on x86,
> > > arch_init_ftrace_syscalls() does:
> > > 
> > >         for (i = 0; i < FTRACE_SYSCALL_MAX; i++) {
> > >                 meta = find_syscall_meta(psys_syscall_table[i]);
> > >                 syscalls_metadata[i] = meta;
> > >         }
> > > 
> > > So the last syscall should not be skipped.
> > > 
> > 
> > In today's -next:
> > 
> > #ifdef CONFIG_X86_64
> > # define FTRACE_SYSCALL_MAX     299
> > #else
> > # define FTRACE_SYSCALL_MAX     337
> > #endif
> > 
> > unistd_32.h:
> > 
> > #define __NR_reflinkat          337
> > 
> > unistd_64.h:
> > 
> > #define __NR_reflinkat          299
> > 
> > The first syscall starts at 0, but I don't see how this last syscall is
> > handled. If there were a __NR_syscalls 300 and 338 respectively, that
> > would seem to do the right thing. Or am I missing something?
> 
> No, you are right. When I changed the FTRACE_SYSCALL_MAX to 299, and
> 337, there was no reflinkat syscall in the tree. So, it was equivalent
> to NR_syscalls at that point in time. So that's where the confusion is.
> 
> Clearly, all the more reason to drop FTRACE_SYSCALL_MAX and change to
> NR_syscalls...
> 
If FTRACE_SYSCALL_MAX is dropped then s390 will be fixed, and I'll take
care of the sh update. If you want to hold off on adding NR_syscalls back
to x86, then s390 will need a #define FTRACE_SYSCALL_MAX __NR_syscalls in
arch/s390/include/asm/ftrace.h. Keeping FTRACE_SYSCALL_MAX around seems
to be asking for trouble, though (although I don't know what the original
rationale behind adding it was).

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 05/12] update FTRACE_SYSCALL_MAX
  2009-08-24 14:50             ` Paul Mundt
@ 2009-08-24 18:34               ` Ingo Molnar
  0 siblings, 0 replies; 88+ messages in thread
From: Ingo Molnar @ 2009-08-24 18:34 UTC (permalink / raw)
  To: Paul Mundt, Jason Baron, Frederic Weisbecker, linux-kernel,
	laijs, rostedt, peterz, mathieu.desnoyers, jiayingz, mbligh,
	lizf


* Paul Mundt <lethal@linux-sh.org> wrote:

> On Mon, Aug 24, 2009 at 10:42:28AM -0400, Jason Baron wrote:
> > On Mon, Aug 24, 2009 at 11:15:39PM +0900, Paul Mundt wrote:
> > > On Mon, Aug 24, 2009 at 10:06:29AM -0400, Jason Baron wrote:
> > > > On Mon, Aug 24, 2009 at 10:41:52PM +0900, Paul Mundt wrote:
> > > > > I hope you can clarify what the meaning of this is supposed to be
> > > > > exactly. Is this number supposed to be the last usable syscall, or is it
> > > > > supposed to be the equivalent of NR_syscalls?
> > > > > 
> > > > 
> > > > I am using it as the equivalent of NR_syscalls.
> > > > 
> > > NR_syscalls has always been the total number of system calls, not the
> > > last one.
> > > 
> > > > > Presently on SH we have this as NR_syscalls - 1, while on s390 I see it
> > > > > is treated as NR_syscalls directly. s390 opencodes the NR_syscalls
> > > > > directly and so presently blows up in -next due to a missing
> > > > > FTRACE_SYSCALL_MAX definition:
> > > > > 
> > > > > 	http://kisskb.ellerman.id.au/kisskb/buildresult/1120523/
> > > > > 
> > > > > I was in the process of fixing that up when I noticed this difference.
> > > > > x86 seems to also treat this as NR_syscalls - 1, but that looks to me
> > > > > like there is an off-by-1 in arch_init_ftrace_syscalls() causing the last
> > > > > syscall to be skipped?
> > > > 
> > > > I don't see how it's used as 'NR_syscalls - 1' on x86,
> > > > arch_init_ftrace_syscalls() does:
> > > > 
> > > >         for (i = 0; i < FTRACE_SYSCALL_MAX; i++) {
> > > >                 meta = find_syscall_meta(psys_syscall_table[i]);
> > > >                 syscalls_metadata[i] = meta;
> > > >         }
> > > > 
> > > > So the last syscall should not be skipped.
> > > > 
> > > 
> > > In today's -next:
> > > 
> > > #ifdef CONFIG_X86_64
> > > # define FTRACE_SYSCALL_MAX     299
> > > #else
> > > # define FTRACE_SYSCALL_MAX     337
> > > #endif
> > > 
> > > unistd_32.h:
> > > 
> > > #define __NR_reflinkat          337
> > > 
> > > unistd_64.h:
> > > 
> > > #define __NR_reflinkat          299
> > > 
> > > The first syscall starts at 0, but I don't see how this last syscall is
> > > handled. If there were a __NR_syscalls 300 and 338 respectively, that
> > > would seem to do the right thing. Or am I missing something?
> > 
> > No, you are right. When I changed the FTRACE_SYSCALL_MAX to 299, and
> > 337, there was no reflinkat syscall in the tree. So, it was equivalent
> > to NR_syscalls at that point in time. So that's where the confusion is.
> > 
> > Clearly, all the more reason to drop FTRACE_SYSCALL_MAX and change to
> > NR_syscalls...
> 
>
> If FTRACE_SYSCALL_MAX is dropped then s390 will be fixed, and I'll 
> take care of the sh update. If you want to hold off on adding 
> NR_syscalls back to x86, then s390 will need a #define 
> FTRACE_SYSCALL_MAX __NR_syscalls in 
> arch/s390/include/asm/ftrace.h. Keeping FTRACE_SYSCALL_MAX around 
> seems to be asking for trouble, though (although I don't know what 
> the original rationale behind adding it was).

I agree with you - we should certainly add a clean and arch-generic 
way and drop the FTRACE_SYSCALL_MAX hack which really just tried to 
hide the arch differences for no strong reason.

At the same time the compat syscall space should be solved too, and 
a synonymous compat_NR_syscalls value introduced. (perhaps defined 
to 0 on non-compat kernels)

	Ingo
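
The off-by-one discussed above can be illustrated outside the kernel. What
follows is a minimal userspace sketch, not code from this series: the
five-entry table and every demo_* name are hypothetical. It mirrors the
shape of the arch_init_ftrace_syscalls() loop and shows that the last
syscall is only covered when the bound is the total count (last number
plus one), not the last syscall number itself.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical numbers: the highest syscall number is 4, so there are 5. */
#define DEMO_LAST_NR		4
#define DEMO_NR_SYSCALLS	(DEMO_LAST_NR + 1)

/* Stand-in for the arch syscall table (names only, for illustration). */
static const char *demo_table[DEMO_NR_SYSCALLS] = {
	"read", "write", "open", "close", "reflinkat",
};

/* Stand-in for syscalls_metadata[]. */
static const char *demo_meta[DEMO_NR_SYSCALLS];

/* Same loop shape as arch_init_ftrace_syscalls(), with a chosen bound. */
static void demo_init(int bound)
{
	int i;

	memset(demo_meta, 0, sizeof(demo_meta));
	for (i = 0; i < bound; i++)
		demo_meta[i] = demo_table[i];
}

/* True if the highest-numbered syscall received its metadata. */
static bool demo_last_covered(void)
{
	return demo_meta[DEMO_LAST_NR] != NULL;
}
```

Calling demo_init(DEMO_LAST_NR) leaves the final slot NULL, which matches
the situation described for x86 after reflinkat was added; calling
demo_init(DEMO_NR_SYSCALLS) fills it.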

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 00/12] add syscall tracepoints V3 - s390 arch update
  2009-08-10 20:52 [PATCH 00/12] add syscall tracepoints V3 Jason Baron
                   ` (11 preceding siblings ...)
  2009-08-10 20:53 ` [PATCH 12/12] convert x86_64 mmap and uname to use DEFINE_SYSCALL Jason Baron
@ 2009-08-25 12:31 ` Hendrik Brueckner
  2009-08-25 13:52   ` Frederic Weisbecker
                     ` (2 more replies)
  12 siblings, 3 replies; 88+ messages in thread
From: Hendrik Brueckner @ 2009-08-25 12:31 UTC (permalink / raw)
  To: Jason Baron
  Cc: linux-kernel, fweisbec, mingo, laijs, rostedt, peterz,
	mathieu.desnoyers, jiayingz, mbligh, lizf, Heiko Carstens,
	Martin Schwidefsky

Hi,

I looked at your recent syscall tracepoint patches and I have a few
more s390 arch updates.

This patch includes s390 arch updates for:
- tracing: Map syscall name to number (syscall_name_to_nr())
- tracing: Call arch_init_ftrace_syscalls at boot
- tracing: add support tracepoint ids (set_syscall_{enter,exit}_id())

The patch already uses "NR_syscalls" instead of FTRACE_SYSCALL_MAX.

The patch is based on today's linux-next (20090825).
Since a few of your patches already include s390 changes,
I would appreciate it if you could add the patch to your patch set.

If you have any remarks, please let me know. 
  
Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
---
 arch/s390/include/asm/ftrace.h |    4 ++++
 arch/s390/kernel/ftrace.c      |   36 +++++++++++++++++++++++++++---------
 2 files changed, 31 insertions(+), 9 deletions(-)

--- a/arch/s390/kernel/ftrace.c
+++ b/arch/s390/kernel/ftrace.c
@@ -220,6 +220,29 @@ struct syscall_metadata *syscall_nr_to_m
 	return syscalls_metadata[nr];
 }
 
+int syscall_name_to_nr(char *name)
+{
+	int i;
+
+	if (!syscalls_metadata)
+		return -1;
+	for (i = 0; i < NR_syscalls; i++)
+		if (syscalls_metadata[i])
+			if (!strcmp(syscalls_metadata[i]->name, name))
+				return i;
+	return -1;
+}
+
+void set_syscall_enter_id(int num, int id)
+{
+	syscalls_metadata[num]->enter_id = id;
+}
+
+void set_syscall_exit_id(int num, int id)
+{
+	syscalls_metadata[num]->exit_id = id;
+}
+
 static struct syscall_metadata *find_syscall_meta(unsigned long syscall)
 {
 	struct syscall_metadata *start;
@@ -237,24 +260,19 @@ static struct syscall_metadata *find_sys
 	return NULL;
 }
 
-void arch_init_ftrace_syscalls(void)
+static int __init arch_init_ftrace_syscalls(void)
 {
 	struct syscall_metadata *meta;
 	int i;
-	static atomic_t refs;
-
-	if (atomic_inc_return(&refs) != 1)
-		goto out;
 	syscalls_metadata = kzalloc(sizeof(*syscalls_metadata) * NR_syscalls,
 				    GFP_KERNEL);
 	if (!syscalls_metadata)
-		goto out;
+		return -ENOMEM;
 	for (i = 0; i < NR_syscalls; i++) {
 		meta = find_syscall_meta((unsigned long)sys_call_table[i]);
 		syscalls_metadata[i] = meta;
 	}
-	return;
-out:
-	atomic_dec(&refs);
+	return 0;
 }
+arch_initcall(arch_init_ftrace_syscalls);
 #endif

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 08/12] add trace events for each syscall entry/exit
  2009-08-10 20:52 ` [PATCH 08/12] add trace events for each syscall entry/exit Jason Baron
  2009-08-11 10:50   ` Frederic Weisbecker
@ 2009-08-25 12:50   ` Hendrik Brueckner
  2009-08-25 14:15     ` Frederic Weisbecker
                       ` (2 more replies)
  1 sibling, 3 replies; 88+ messages in thread
From: Hendrik Brueckner @ 2009-08-25 12:50 UTC (permalink / raw)
  To: Jason Baron
  Cc: linux-kernel, fweisbec, mingo, laijs, rostedt, peterz,
	mathieu.desnoyers, jiayingz, mbligh, lizf, Heiko Carstens,
	Martin Schwidefsky

On Mon, Aug 10, 2009 at 04:52:47PM -0400, Jason Baron wrote:
> +void ftrace_syscall_enter(struct pt_regs *regs, long id)
>  {
>  	struct syscall_trace_enter *entry;
>  	struct syscall_metadata *sys_data;
> @@ -150,6 +105,8 @@ void ftrace_syscall_enter(struct pt_regs *regs)
>  	int syscall_nr;
> 
>  	syscall_nr = syscall_get_nr(current, regs);
> +	if (!test_bit(syscall_nr, enabled_enter_syscalls))
> +		return;
> 
>  	sys_data = syscall_nr_to_meta(syscall_nr);
>  	if (!sys_data)

> +void ftrace_syscall_exit(struct pt_regs *regs, long ret)
>  {
>  	struct syscall_trace_exit *entry;
>  	struct syscall_metadata *sys_data;
> @@ -178,6 +135,8 @@ void ftrace_syscall_exit(struct pt_regs *regs)
>  	int syscall_nr;
> 
>  	syscall_nr = syscall_get_nr(current, regs);
> +	if (!test_bit(syscall_nr, enabled_exit_syscalls))
> +		return;
Most arch syscall_get_nr() implementations return -1 if the syscall
number is not valid.  Accessing the bit field without a check might
result in a kernel oops (at least I saw it on s390 in the ftrace selftest).

Before this change, this problem did not occur, because the invalid
syscall number (-1) caused syscall_nr_to_meta() to return NULL.

There are at least two scenarios where syscall_get_nr() can return -1:

1. For example, ptrace stores an invalid syscall number, and thus,
   tracing code resets it.
   (see do_syscall_trace_enter in arch/s390/kernel/ptrace.c)

2. The syscall_regfunc() (kernel/tracepoint.c) sets the TIF_SYSCALL_FTRACE
   (now: TIF_SYSCALL_TRACEPOINT) flag for all threads which includes
   kernel threads.
   However, the ftrace selftest triggers a kernel oops when testing syscall
   trace points:
      - The kernel thread is started as usual (do_fork()),
      - tracing code sets TIF_SYSCALL_FTRACE,
      - the ret_from_fork() function is triggered and starts
	ftrace_syscall_exit() with an invalid syscall number.

To guard against these scenarios, I suggest checking syscall_nr before the
test_bit() call.
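
As a rough userspace illustration of the check proposed here (a sketch
only: the bitmap, its size and all demo_* names are assumptions, and the
upper-bound test goes beyond what the patch below adds), the number is
validated before it is ever used as a bit index:

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical table size; the kernel would use NR_syscalls. */
#define DEMO_NR_SYSCALLS	64
#define DEMO_BITS_PER_LONG	((int)(sizeof(unsigned long) * CHAR_BIT))
#define DEMO_LONGS \
	((DEMO_NR_SYSCALLS + DEMO_BITS_PER_LONG - 1) / DEMO_BITS_PER_LONG)

/* Stand-in for enabled_enter_syscalls / enabled_exit_syscalls. */
static unsigned long demo_enabled[DEMO_LONGS];

/* Mark one syscall number as traced; out-of-range numbers are ignored. */
static void demo_enable(int nr)
{
	if (nr < 0 || nr >= DEMO_NR_SYSCALLS)
		return;
	demo_enabled[nr / DEMO_BITS_PER_LONG] |= 1UL << (nr % DEMO_BITS_PER_LONG);
}

/*
 * Reject invalid numbers (such as the -1 returned for "no syscall")
 * before indexing the bitmap, then test the bit.
 */
static bool demo_should_trace(int nr)
{
	if (nr < 0 || nr >= DEMO_NR_SYSCALLS)
		return false;
	return ((demo_enabled[nr / DEMO_BITS_PER_LONG] >>
		 (nr % DEMO_BITS_PER_LONG)) & 1UL) != 0;
}
```

Without the first test in demo_should_trace(), a value of -1 would be
turned into an index far outside the array, which is consistent with the
faulting address seen in the oops below.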

For instance, the ftrace selftest fails for s390 (with config option
CONFIG_FTRACE_SYSCALLS set) and produces the following kernel oops.

Unable to handle kernel pointer dereference at virtual kernel address 2000000000

Oops: 0038 [#1] PREEMPT SMP
Modules linked in:
CPU: 0 Not tainted 2.6.31-rc6-next-20090819-dirty #18
Process kthreadd (pid: 818, task: 000000003ea207e8, ksp: 000000003e813eb8)
Krnl PSW : 0704100180000000 00000000000ea54c (ftrace_syscall_exit+0x58/0xdc)
           R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:1 PM:0 EA:3
Krnl GPRS: 0000000000000000 00000000000e0000 ffffffffffffffff 20000000008c2650
           0000000000000007 0000000000000000 0000000000000000 0000000000000000
           0000000000000000 0000000000000000 ffffffffffffffff 000000003e813d78
           000000003e813f58 0000000000505ba8 000000003e813e18 000000003e813d78
Krnl Code: 00000000000ea540: e330d0000008       ag      %r3,0(%r13)
           00000000000ea546: a7480007           lhi     %r4,7
           00000000000ea54a: 1442               nr      %r4,%r2
          >00000000000ea54c: e31030000090       llgc    %r1,0(%r3)
           00000000000ea552: 5410d008           n       %r1,8(%r13)
           00000000000ea556: 8a104000           sra     %r1,0(%r4)
           00000000000ea55a: 5410d00c           n       %r1,12(%r13)
           00000000000ea55e: 1211               ltr     %r1,%r1
Call Trace:
([<0000000000000000>] 0x0)
 [<000000000001fa22>] do_syscall_trace_exit+0x132/0x18c
 [<000000000002d0c4>] sysc_return+0x0/0x8
 [<000000000001c738>] kernel_thread_starter+0x0/0xc
Last Breaking-Event-Address:
 [<00000000000ea51e>] ftrace_syscall_exit+0x2a/0xdc

Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
---
 kernel/trace/trace_syscalls.c |    4 ++++
 1 file changed, 4 insertions(+)

--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -224,6 +224,8 @@ void ftrace_syscall_enter(struct pt_regs
 	int syscall_nr;
 
 	syscall_nr = syscall_get_nr(current, regs);
+	if (syscall_nr < 0)
+		return;
 	if (!test_bit(syscall_nr, enabled_enter_syscalls))
 		return;
 
@@ -254,6 +256,8 @@ void ftrace_syscall_exit(struct pt_regs 
 	int syscall_nr;
 
 	syscall_nr = syscall_get_nr(current, regs);
+	if (syscall_nr < 0)
+		return;
 	if (!test_bit(syscall_nr, enabled_exit_syscalls))
 		return;
 

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 00/12] add syscall tracepoints V3 - s390 arch update
  2009-08-25 12:31 ` [PATCH 00/12] add syscall tracepoints V3 - s390 arch update Hendrik Brueckner
@ 2009-08-25 13:52   ` Frederic Weisbecker
  2009-08-25 14:39     ` Heiko Carstens
  2009-08-25 15:38     ` Hendrik Brueckner
  2009-08-26 16:53   ` Frederic Weisbecker
  2009-08-28 12:27   ` [tip:tracing/core] tracing: Add syscall tracepoints - s390 arch update tip-bot for Hendrik Brueckner
  2 siblings, 2 replies; 88+ messages in thread
From: Frederic Weisbecker @ 2009-08-25 13:52 UTC (permalink / raw)
  To: Hendrik Brueckner, Jason Baron, linux-kernel, mingo, laijs,
	rostedt, peterz, mathieu.desnoyers, jiayingz, mbligh, lizf,
	Heiko Carstens, Martin Schwidefsky

On Tue, Aug 25, 2009 at 02:31:11PM +0200, Hendrik Brueckner wrote:
> Hi,
> 
> I looked at your recent syscall tracepoint patches and I have few
> more s390 arch updates.
> 
> This patch includes s390 arch updates for:
> - tracing: Map syscall name to number (syscall_name_to_nr())
> - tracing: Call arch_init_ftrace_syscalls at boot
> - tracing: add support tracepoint ids (set_syscall_{enter,exit}_id())
> 
> The patch already uses "NR_syscalls" instead of FTRACE_SYSCALL_MAX.
> 
> The patch is based on today's linux-next (20090825).
> Since few of your patches already include s390 changes,
> I would appreciate if you could add the patch to your patch set.
> 
> If you have any remarks, please let me know. 
>   
> Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>


Looks good at a first glance.



> ---
>  arch/s390/include/asm/ftrace.h |    4 ++++
>  arch/s390/kernel/ftrace.c      |   36 +++++++++++++++++++++++++++---------
>  2 files changed, 31 insertions(+), 9 deletions(-)
> 
> --- a/arch/s390/kernel/ftrace.c
> +++ b/arch/s390/kernel/ftrace.c
> @@ -220,6 +220,29 @@ struct syscall_metadata *syscall_nr_to_m
>  	return syscalls_metadata[nr];
>  }
>  
> +int syscall_name_to_nr(char *name)
> +{
> +	int i;
> +
> +	if (!syscalls_metadata)
> +		return -1;
> +	for (i = 0; i < NR_syscalls; i++)
> +		if (syscalls_metadata[i])
> +			if (!strcmp(syscalls_metadata[i]->name, name))
> +				return i;
> +	return -1;
> +}
> +void set_syscall_enter_id(int num, int id)
> +{
> +	syscalls_metadata[num]->enter_id = id;
> +}
> +
> +void set_syscall_exit_id(int num, int id)
> +{
> +	syscalls_metadata[num]->exit_id = id;
> +}



The three helpers above seem very common between archs, I guess
we can move them to the core: kernel/trace/trace_syscalls.c
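
A sketch of what arch-independent versions of these helpers might look
like, written as plain userspace C. This is an assumption-laden
illustration, not the code that was eventually merged: the demo_* names,
the explicit table and size parameters, and the added bounds and NULL
checks in the setter are all hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Reduced stand-in for struct syscall_metadata. */
struct demo_syscall_metadata {
	const char *name;
	int enter_id;
	int exit_id;
};

/*
 * Look up a syscall number by name. The table and its size are passed
 * in, so nothing here depends on a particular architecture.
 */
static int demo_name_to_nr(struct demo_syscall_metadata **table,
			   int nr_syscalls, const char *name)
{
	int i;

	if (!table)
		return -1;
	for (i = 0; i < nr_syscalls; i++)
		if (table[i] && !strcmp(table[i]->name, name))
			return i;
	return -1;
}

/* Record the enter event id; returns -1 if the slot cannot be used. */
static int demo_set_enter_id(struct demo_syscall_metadata **table,
			     int nr_syscalls, int num, int id)
{
	if (!table || num < 0 || num >= nr_syscalls || !table[num])
		return -1;
	table[num]->enter_id = id;
	return 0;
}
```

An arch would then only need to provide the table contents; the lookup
and the id setters would live in one place.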


>  static struct syscall_metadata *find_syscall_meta(unsigned long syscall)
>  {
>  	struct syscall_metadata *start;
> @@ -237,24 +260,19 @@ static struct syscall_metadata *find_sys
>  	return NULL;
>  }
>  
> -void arch_init_ftrace_syscalls(void)
> +static int __init arch_init_ftrace_syscalls(void)
>  {
>  	struct syscall_metadata *meta;
>  	int i;
> -	static atomic_t refs;
> -
> -	if (atomic_inc_return(&refs) != 1)
> -		goto out;
>  	syscalls_metadata = kzalloc(sizeof(*syscalls_metadata) * NR_syscalls,
>  				    GFP_KERNEL);
>  	if (!syscalls_metadata)
> -		goto out;
> +		return -ENOMEM;
>  	for (i = 0; i < NR_syscalls; i++) {
>  		meta = find_syscall_meta((unsigned long)sys_call_table[i]);
>  		syscalls_metadata[i] = meta;
>  	}
> -	return;
> -out:
> -	atomic_dec(&refs);
> +	return 0;
>  }
> +arch_initcall(arch_init_ftrace_syscalls);
>  #endif



We can even probably move most of this code to the core, except the tiny parts
that rely on the arch syscall table.

BTW, perhaps a silly question: would it be hard to have a generic syscall table
common to all archs?

Thanks,
Frederic.


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 08/12] add trace events for each syscall entry/exit
  2009-08-25 12:50   ` Hendrik Brueckner
@ 2009-08-25 14:15     ` Frederic Weisbecker
  2009-08-25 16:02       ` Hendrik Brueckner
  2009-08-25 21:40     ` [PATCH 08/12] add trace events for each syscall entry/exit Frederic Weisbecker
  2009-08-28 12:27     ` [tip:tracing/core] tracing: Check invalid syscall nr while tracing syscalls tip-bot for Hendrik Brueckner
  2 siblings, 1 reply; 88+ messages in thread
From: Frederic Weisbecker @ 2009-08-25 14:15 UTC (permalink / raw)
  To: Hendrik Brueckner, Jason Baron, linux-kernel, mingo, laijs,
	rostedt, peterz, mathieu.desnoyers, jiayingz, mbligh, lizf,
	Heiko Carstens, Martin Schwidefsky

On Tue, Aug 25, 2009 at 02:50:27PM +0200, Hendrik Brueckner wrote:
> On Mon, Aug 10, 2009 at 04:52:47PM -0400, Jason Baron wrote:
> > +void ftrace_syscall_enter(struct pt_regs *regs, long id)
> >  {
> >  	struct syscall_trace_enter *entry;
> >  	struct syscall_metadata *sys_data;
> > @@ -150,6 +105,8 @@ void ftrace_syscall_enter(struct pt_regs *regs)
> >  	int syscall_nr;
> > 
> >  	syscall_nr = syscall_get_nr(current, regs);
> > +	if (!test_bit(syscall_nr, enabled_enter_syscalls))
> > +		return;
> > 
> >  	sys_data = syscall_nr_to_meta(syscall_nr);
> >  	if (!sys_data)
> 
> > +void ftrace_syscall_exit(struct pt_regs *regs, long ret)
> >  {
> >  	struct syscall_trace_exit *entry;
> >  	struct syscall_metadata *sys_data;
> > @@ -178,6 +135,8 @@ void ftrace_syscall_exit(struct pt_regs *regs)
> >  	int syscall_nr;
> > 
> >  	syscall_nr = syscall_get_nr(current, regs);
> > +	if (!test_bit(syscall_nr, enabled_exit_syscalls))
> > +		return;
> Most arch syscall_get_nr() implementations returns -1 if the syscall
> number is not valid.  Accessing the bit field without a check might
> result in a kernel oops (at least I saw it on s390 for ftrace selftest).
> 
> Before this change, this problem did not occur, because the invalid
> syscall number (-1) caused syscall_nr_to_meta() to return NULL.
> 
> There are at least two scenarios where syscall_get_nr() can return -1:
> 
> 1. For example, ptrace stores an invalid syscall number, and thus,
>    tracing code resets it.
>    (see do_syscall_trace_enter in arch/s390/kernel/ptrace.c)
> 
> 2. The syscall_regfunc() (kernel/tracepoint.c) sets the TIF_SYSCALL_FTRACE
>    (now: TIF_SYSCALL_TRACEPOINT) flag for all threads which includes
>    kernel threads.
>    However, the ftrace selftest triggers a kernel oops when testing syscall
>    trace points:
>       - The kernel thread is started as usual (do_fork()),
>       - tracing code sets TIF_SYSCALL_FTRACE,
>       - the ret_from_fork() function is triggered and starts
> 	ftrace_syscall_exit() with an invalid syscall number.



I wonder if there is any way to identify such a situation...?


> 
> To avoid these scenarios, I suggest to check the syscall_nr.
> 
> For instance, the ftrace selftest fails for s390 (with config option
> CONFIG_FTRACE_SYSCALLS set) and produces the following kernel oops.
> 
> Unable to handle kernel pointer dereference at virtual kernel address 2000000000
> 
> Oops: 0038 [#1] PREEMPT SMP
> Modules linked in:
> CPU: 0 Not tainted 2.6.31-rc6-next-20090819-dirty #18
> Process kthreadd (pid: 818, task: 000000003ea207e8, ksp: 000000003e813eb8)
> Krnl PSW : 0704100180000000 00000000000ea54c (ftrace_syscall_exit+0x58/0xdc)
>            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:1 PM:0 EA:3
> Krnl GPRS: 0000000000000000 00000000000e0000 ffffffffffffffff 20000000008c2650
>            0000000000000007 0000000000000000 0000000000000000 0000000000000000
>            0000000000000000 0000000000000000 ffffffffffffffff 000000003e813d78
>            000000003e813f58 0000000000505ba8 000000003e813e18 000000003e813d78
> Krnl Code: 00000000000ea540: e330d0000008       ag      %r3,0(%r13)
>            00000000000ea546: a7480007           lhi     %r4,7
>            00000000000ea54a: 1442               nr      %r4,%r2
>           >00000000000ea54c: e31030000090       llgc    %r1,0(%r3)
>            00000000000ea552: 5410d008           n       %r1,8(%r13)
>            00000000000ea556: 8a104000           sra     %r1,0(%r4)
>            00000000000ea55a: 5410d00c           n       %r1,12(%r13)
>            00000000000ea55e: 1211               ltr     %r1,%r1
> Call Trace:
> ([<0000000000000000>] 0x0)
>  [<000000000001fa22>] do_syscall_trace_exit+0x132/0x18c
>  [<000000000002d0c4>] sysc_return+0x0/0x8
>  [<000000000001c738>] kernel_thread_starter+0x0/0xc
> Last Breaking-Event-Address:
>  [<00000000000ea51e>] ftrace_syscall_exit+0x2a/0xdc
> 
> Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>



Yeah, makes sense.
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 00/12] add syscall tracepoints V3 - s390 arch update
  2009-08-25 13:52   ` Frederic Weisbecker
@ 2009-08-25 14:39     ` Heiko Carstens
  2009-08-25 19:52       ` Frederic Weisbecker
  2009-08-25 15:38     ` Hendrik Brueckner
  1 sibling, 1 reply; 88+ messages in thread
From: Heiko Carstens @ 2009-08-25 14:39 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Hendrik Brueckner, Jason Baron, linux-kernel, mingo, laijs,
	rostedt, peterz, mathieu.desnoyers, jiayingz, mbligh, lizf,
	Martin Schwidefsky

On Tue, Aug 25, 2009 at 03:52:32PM +0200, Frederic Weisbecker wrote:
> On Tue, Aug 25, 2009 at 02:31:11PM +0200, Hendrik Brueckner wrote:
> >  		meta = find_syscall_meta((unsigned long)sys_call_table[i]);
> >  		syscalls_metadata[i] = meta;
> >  	}
> We can even probably move most of this code to the core, except the tiny parts
> that rely on the arch syscall table.
> 
> BTW, perhaps a silly question: would it be hard to have a generic syscall table
> common to every archs?

That would cause a lot of churn. Every architecture initializes the syscall
table (two tables if CONFIG_COMPAT is enabled) differently.
s390 also only uses 32 bit pointers in the system call table for 64 bit
kernels, since we know that the functions are within the first 4GB.
I don't think it's worth the effort.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 00/12] add syscall tracepoints V3 - s390 arch update
  2009-08-25 13:52   ` Frederic Weisbecker
  2009-08-25 14:39     ` Heiko Carstens
@ 2009-08-25 15:38     ` Hendrik Brueckner
  1 sibling, 0 replies; 88+ messages in thread
From: Hendrik Brueckner @ 2009-08-25 15:38 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Hendrik Brueckner, Jason Baron, linux-kernel, mingo, laijs,
	rostedt, peterz, mathieu.desnoyers, jiayingz, mbligh, lizf,
	Heiko Carstens, Martin Schwidefsky

On Tue, Aug 25, 2009 at 03:52:32PM +0200, Frederic Weisbecker wrote:
> On Tue, Aug 25, 2009 at 02:31:11PM +0200, Hendrik Brueckner wrote:
> > This patch includes s390 arch updates for:
> > - tracing: Map syscall name to number (syscall_name_to_nr())
> > - tracing: Call arch_init_ftrace_syscalls at boot
> > - tracing: add support tracepoint ids (set_syscall_{enter,exit}_id())
> > 
> > The patch already uses "NR_syscalls" instead of FTRACE_SYSCALL_MAX.
> > 
> > The patch is based on today's linux-next (20090825).
> > Since few of your patches already include s390 changes,
> > I would appreciate if you could add the patch to your patch set.
> > 
> > If you have any remarks, please let me know. 
> >   
> > Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
> 
> 
> Looks good at a first glance.
Thanks

> > +int syscall_name_to_nr(char *name)
> > +{
> > +	int i;
> > +
> > +	if (!syscalls_metadata)
> > +		return -1;
> > +	for (i = 0; i < NR_syscalls; i++)
> > +		if (syscalls_metadata[i])
> > +			if (!strcmp(syscalls_metadata[i]->name, name))
> > +				return i;
> > +	return -1;
> > +}
> > +void set_syscall_enter_id(int num, int id)
> > +{
> > +	syscalls_metadata[num]->enter_id = id;
> > +}
> > +
> > +void set_syscall_exit_id(int num, int id)
> > +{
> > +	syscalls_metadata[num]->exit_id = id;
> > +}
> 
> The three helpers above seem very common between archs, I guess
> we can move them to the core: kernel/trace/trace_syscalls.c
I think it is a good idea to move the helper routines to
kernel/trace/trace_syscalls.c.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 08/12] add trace events for each syscall entry/exit
  2009-08-25 14:15     ` Frederic Weisbecker
@ 2009-08-25 16:02       ` Hendrik Brueckner
  2009-08-25 16:20         ` Mathieu Desnoyers
                           ` (2 more replies)
  0 siblings, 3 replies; 88+ messages in thread
From: Hendrik Brueckner @ 2009-08-25 16:02 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Hendrik Brueckner, Jason Baron, linux-kernel, mingo, laijs,
	rostedt, peterz, mathieu.desnoyers, jiayingz, mbligh, lizf,
	Heiko Carstens, Martin Schwidefsky

On Tue, Aug 25, 2009 at 04:15:49PM +0200, Frederic Weisbecker wrote:
> On Tue, Aug 25, 2009 at 02:50:27PM +0200, Hendrik Brueckner wrote:
> > There are at least two scenarios where syscall_get_nr() can return -1:
> > 
> > 1. For example, ptrace stores an invalid syscall number, and thus,
> >    tracing code resets it.
> >    (see do_syscall_trace_enter in arch/s390/kernel/ptrace.c)
> > 
> > 2. The syscall_regfunc() (kernel/tracepoint.c) sets the TIF_SYSCALL_FTRACE
> >    (now: TIF_SYSCALL_TRACEPOINT) flag for all threads which includes
> >    kernel threads.
> >    However, the ftrace selftest triggers a kernel oops when testing syscall
> >    trace points:
> >       - The kernel thread is started as usual (do_fork()),
> >       - tracing code sets TIF_SYSCALL_FTRACE,
> >       - the ret_from_fork() function is triggered and starts
> > 	ftrace_syscall_exit() with an invalid syscall number.
> 
> 
> 
> I wonder if there is any way to identify such situation...?
For the second case, it might be an option to avoid setting the
TIF_SYSCALL_FTRACE flag for kernel threads.

Kernel threads have task_struct->mm set to NULL.
(Thanks to Heiko for that hint ;-)

The idea is then to check the mm field in syscall_regfunc() and
set the flag accordingly.

However, I think the patch is an optional add-on because checking
the syscall number is still required for case 1).

---
 kernel/tracepoint.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

--- a/kernel/tracepoint.c
+++ b/kernel/tracepoint.c
@@ -593,7 +593,9 @@ void syscall_regfunc(void)
 	if (!sys_tracepoint_refcount) {
 		read_lock_irqsave(&tasklist_lock, flags);
 		do_each_thread(g, t) {
-			set_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
+			/* Skip kernel threads. */
+			if (t->mm)
+				set_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
 		} while_each_thread(g, t);
 		read_unlock_irqrestore(&tasklist_lock, flags);
 	}


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 08/12] add trace events for each syscall entry/exit
  2009-08-25 16:02       ` Hendrik Brueckner
@ 2009-08-25 16:20         ` Mathieu Desnoyers
  2009-08-25 16:59           ` Frederic Weisbecker
  2009-08-25 17:04           ` Jason Baron
  2009-08-26 12:35         ` Frederic Weisbecker
  2009-08-28 12:28         ` [tip:tracing/core] tracing: Don't trace kernel thread syscalls tip-bot for Hendrik Brueckner
  2 siblings, 2 replies; 88+ messages in thread
From: Mathieu Desnoyers @ 2009-08-25 16:20 UTC (permalink / raw)
  To: Hendrik Brueckner, Frederic Weisbecker, Jason Baron,
	linux-kernel, mingo, laijs, rostedt, peterz, jiayingz, mbligh,
	lizf, Heiko Carstens, Martin Schwidefsky

* Hendrik Brueckner (brueckner@linux.vnet.ibm.com) wrote:
> On Tue, Aug 25, 2009 at 04:15:49PM +0200, Frederic Weisbecker wrote:
> > On Tue, Aug 25, 2009 at 02:50:27PM +0200, Hendrik Brueckner wrote:
> > > There are at least two scenarios where syscall_get_nr() can return -1:
> > > 
> > > 1. For example, ptrace stores an invalid syscall number, and thus,
> > >    tracing code resets it.
> > >    (see do_syscall_trace_enter in arch/s390/kernel/ptrace.c)
> > > 
> > > 2. The syscall_regfunc() (kernel/tracepoint.c) sets the TIF_SYSCALL_FTRACE
> > >    (now: TIF_SYSCALL_TRACEPOINT) flag for all threads which includes
> > >    kernel threads.
> > >    However, the ftrace selftest triggers a kernel oops when testing syscall
> > >    trace points:
> > >       - The kernel thread is started as usual (do_fork()),
> > >       - tracing code sets TIF_SYSCALL_FTRACE,
> > >       - the ret_from_fork() function is triggered and starts
> > > 	ftrace_syscall_exit() with an invalid syscall number.
> > 
> > 
> > 
> > I wonder if there is any way to identify such situation...?
> For the second case, it might be an option to avoid setting the
> TIF_SYSCALL_FTRACE flag for kernel threads.
> 
> Kernel threads have task_struct->mm set to NULL.
> (Thanks to Heiko for that hint ;-)
> 
> The idea is then to check the mm field in syscall_regfunc() and
> set the flag accordingly.
> 
> However, I think the patch is an optional add-on because checking
> the syscall number is still required for case 1).
> 
> ---
>  kernel/tracepoint.c |    4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> --- a/kernel/tracepoint.c
> +++ b/kernel/tracepoint.c
> @@ -593,7 +593,9 @@ void syscall_regfunc(void)
>  	if (!sys_tracepoint_refcount) {
>  		read_lock_irqsave(&tasklist_lock, flags);
>  		do_each_thread(g, t) {
> -			set_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
> +			/* Skip kernel threads. */
> +			if (t->mm)
> +				set_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);

Uh? Kernel threads can invoke a system call. There are rare places
where kernel code actually invokes system calls. I don't see why we
should not deal with them.

Moreover, the problem you face is more general: if we set the
TIF_SYSCALL_FTRACE flag of a standard thread right in the middle of its
system call, x86_64 will cause the syscall exit to execute by re-reading
the thread flags and run a syscall trace exit.

We could simply initialize the "saved system calls id" number to
something like -1, so that if we happen to return from a syscall that
did not get its id recorded at syscall entry, we know it because it's
not initialized.

We would need to carefully put back the -1 value after clearing the
thread flag when we stop tracing too (while still holding a mutex).
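
The sentinel idea described above could be sketched roughly as follows.
This is a userspace illustration under stated assumptions: the thread does
not show a per-task "saved syscall id" field, so struct demo_task and all
demo_* functions are hypothetical, and the locking mentioned above is left
out.

```c
#include <stdbool.h>

/* Sentinel meaning "no syscall entry was recorded for this task". */
#define DEMO_NO_SYSCALL	(-1)

/* Hypothetical per-task tracing state. */
struct demo_task {
	int saved_nr;	/* id recorded at syscall entry, or the sentinel */
	bool tracing;	/* stand-in for the TIF tracing flag */
};

static void demo_task_init(struct demo_task *t)
{
	t->saved_nr = DEMO_NO_SYSCALL;
	t->tracing = false;
}

/* Entry hook: remember the syscall number only while tracing is on. */
static void demo_trace_enter(struct demo_task *t, int nr)
{
	if (t->tracing)
		t->saved_nr = nr;
}

/*
 * Exit hook: return the number to report, or DEMO_NO_SYSCALL when the
 * exit has no matching entry (for example, tracing was switched on in
 * the middle of the syscall). The saved id is always reset.
 */
static int demo_trace_exit(struct demo_task *t)
{
	int nr = t->saved_nr;

	t->saved_nr = DEMO_NO_SYSCALL;
	return t->tracing ? nr : DEMO_NO_SYSCALL;
}

/* Stopping tracing clears the flag and puts the sentinel back. */
static void demo_stop_tracing(struct demo_task *t)
{
	t->tracing = false;
	t->saved_nr = DEMO_NO_SYSCALL;
}
```

An exit handler built this way can simply skip the event whenever it sees
the sentinel, at the cost of keeping one extra integer per task.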

Mathieu

>  		} while_each_thread(g, t);
>  		read_unlock_irqrestore(&tasklist_lock, flags);
>  	}
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 08/12] add trace events for each syscall entry/exit
  2009-08-25 16:20         ` Mathieu Desnoyers
@ 2009-08-25 16:59           ` Frederic Weisbecker
  2009-08-25 17:31             ` Frederic Weisbecker
  2009-08-25 17:04           ` Jason Baron
  1 sibling, 1 reply; 88+ messages in thread
From: Frederic Weisbecker @ 2009-08-25 16:59 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Hendrik Brueckner, Jason Baron, linux-kernel, mingo, laijs,
	rostedt, peterz, jiayingz, mbligh, lizf, Heiko Carstens,
	Martin Schwidefsky

On Tue, Aug 25, 2009 at 12:20:04PM -0400, Mathieu Desnoyers wrote:
> * Hendrik Brueckner (brueckner@linux.vnet.ibm.com) wrote:
> > On Tue, Aug 25, 2009 at 04:15:49PM +0200, Frederic Weisbecker wrote:
> > > On Tue, Aug 25, 2009 at 02:50:27PM +0200, Hendrik Brueckner wrote:
> > > > There are at least two scenarios where syscall_get_nr() can return -1:
> > > > 
> > > > 1. For example, ptrace stores an invalid syscall number, and thus,
> > > >    tracing code resets it.
> > > >    (see do_syscall_trace_enter in arch/s390/kernel/ptrace.c)
> > > > 
> > > > 2. The syscall_regfunc() (kernel/tracepoint.c) sets the TIF_SYSCALL_FTRACE
> > > >    (now: TIF_SYSCALL_TRACEPOINT) flag for all threads which includes
> > > >    kernel threads.
> > > >    However, the ftrace selftest triggers a kernel oops when testing syscall
> > > >    trace points:
> > > >       - The kernel thread is started as usual (do_fork()),
> > > >       - tracing code sets TIF_SYSCALL_FTRACE,
> > > >       - the ret_from_fork() function is triggered and starts
> > > > 	ftrace_syscall_exit() with an invalid syscall number.
> > > 
> > > 
> > > 
> > > I wonder if there is any way to identify such situation...?
> > For the second case, it might be an option to avoid setting the
> > TIF_SYSCALL_FTRACE flag for kernel threads.
> > 
> > Kernel threads have task_struct->mm set to NULL.
> > (Thanks to Heiko for that hint ;-)
> > 
> > The idea is then to check the mm field in syscall_regfunc() and
> > set the flag accordingly.
> > 
> > However, I think the patch is an optional add-on because checking
> > the syscall number is still required for case 1).
> > 
> > ---
> >  kernel/tracepoint.c |    4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> > 
> > --- a/kernel/tracepoint.c
> > +++ b/kernel/tracepoint.c
> > @@ -593,7 +593,9 @@ void syscall_regfunc(void)
> >  	if (!sys_tracepoint_refcount) {
> >  		read_lock_irqsave(&tasklist_lock, flags);
> >  		do_each_thread(g, t) {
> > -			set_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
> > +			/* Skip kernel threads. */
> > +			if (t->mm)
> > +				set_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
> 
> Uh ? kernel threads can invoke a system call. There are rare places
> where kernel code actually invoke system calls. I don't see why we
> should not deal with them.



Yeah they do, but they don't use the sysenter path, they call the
syscall helpers directly, such as do_fork() or things like that.

The syscall tracepoints are set in the sysenter/sysexit path, so
it's no use tracing the kernel threads: it doesn't have any effect,
except random results in case of fork() calls, because we take
the ret_from_fork() path that also ends up in trace_sys_exit()
if the TIF_SYSCALL_TRACEPOINT thing is set, leading to such
asymmetric tracing.

Kernel threads use syscalls through wrappers such as create_thread().
So instead, statically defined tracepoints in create_thread() and other
such syscall wrappers for kernel threads seem more valuable, hmm?

 
> Moreover, the problem you face is more general: if we set the
> TIF_SYSCALL_FTRACE flag of a standard thread right in the middle of its
> system call, x86_64 will cause the syscall exit to execute by re-reading
> the thread flags and run a syscall trace exit.


Well, I don't think that's the problem. The issue here, if I understand
correctly, is that kernel threads don't take the sysenter path, so they never
hit the trace_sys_enter() call. And usually they won't ever hit any
trace_sys_exit() calls except in the fork() case, because we take
the ret_from_fork() path, which leads to syscall exit tracing because
the TIF flag is set.

At this stage, the syscall number is supposed to be stored in orig_eax,
but because the kernel thread hasn't called fork() through a syscall and
has called do_fork() directly, the regs values contain nothing that looks
like syscall parameters.

I guess we don't need to take the sys_enter tracing path to have a sane
orig_eax in the sys_exit tracing path (for non kernel threads).
Though I'm not sure about that, I should check to be sure.

> We could simply initialize the "saved system calls id" number to
> something like -1, so that if we happen to return from a syscall that
> did not get its id recorded at syscall entry, we know it because it's
> not initialized.
> 
> We would need to carefully put back the -1 value after clearing the
> thread flag when we stop tracing too (while still holding a mutex).
> 
> Mathieu
> 
> >  		} while_each_thread(g, t);
> >  		read_unlock_irqrestore(&tasklist_lock, flags);
> >  	}
> > 
> 
> -- 
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 08/12] add trace events for each syscall entry/exit
  2009-08-25 16:20         ` Mathieu Desnoyers
  2009-08-25 16:59           ` Frederic Weisbecker
@ 2009-08-25 17:04           ` Jason Baron
  2009-08-25 18:15             ` Mathieu Desnoyers
  1 sibling, 1 reply; 88+ messages in thread
From: Jason Baron @ 2009-08-25 17:04 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Hendrik Brueckner, Frederic Weisbecker, linux-kernel, mingo,
	laijs, rostedt, peterz, jiayingz, mbligh, lizf, Heiko Carstens,
	Martin Schwidefsky

On Tue, Aug 25, 2009 at 12:20:04PM -0400, Mathieu Desnoyers wrote:
> 
> Uh ? kernel threads can invoke a system call. There are rare places
> where kernel code actually invokes system calls. I don't see why we
> should not deal with them.
> 
> Moreover, the problem you face is more general: if we set the
> TIF_SYSCALL_FTRACE flag of a standard thread right in the middle of its
> system call, x86_64 will cause the syscall exit to execute by re-reading
> the thread flags and run a syscall trace exit.
> 
> We could simply initialize the "saved system calls id" number to
> something like -1, so that if we happen to return from a syscall that
> did not get its id recorded at syscall entry, we know it because it's
> not initialized.
> 
> We would need to carefully put back the -1 value after clearing the
> thread flag when we stop tracing too (while still holding a mutex).
> 
> Mathieu
> 

why can't we have a syscall exit that is unmatched? we calculate
the exit syscall number from the pt_regs structure at exit, so we
don't need to match it up with an entry to know which syscall it is.
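
A minimal user-space sketch of what I mean, assuming an x86-like layout
where the number saved at entry (orig_ax) lives in a different slot than
the return value (ax). The struct and helper names are made up for
illustration, they are not the real kernel definitions, and other arches
may well lay this out differently:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-in for x86's struct pt_regs. */
struct exit_regs {
	long ax;	/* holds the return value once the syscall is done */
	long orig_ax;	/* syscall number saved at entry */
};

/* Models syscall_get_nr() on x86: read the number saved at entry. */
static long exit_regs_get_nr(const struct exit_regs *regs)
{
	assert(regs != NULL);
	return regs->orig_ax;
}

/* Models the exit path storing the syscall return value. */
static void exit_regs_set_return(struct exit_regs *regs, long ret)
{
	assert(regs != NULL);
	regs->ax = ret;		/* orig_ax is left untouched */
}
```

In this model, writing the return value does not disturb the saved
number, so an exit event can be decoded without a matching entry event.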

thanks,

-Jason

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 08/12] add trace events for each syscall entry/exit
  2009-08-25 16:59           ` Frederic Weisbecker
@ 2009-08-25 17:31             ` Frederic Weisbecker
  2009-08-25 18:31               ` Mathieu Desnoyers
  0 siblings, 1 reply; 88+ messages in thread
From: Frederic Weisbecker @ 2009-08-25 17:31 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Hendrik Brueckner, Jason Baron, linux-kernel, mingo, laijs,
	rostedt, peterz, jiayingz, mbligh, lizf, Heiko Carstens,
	Martin Schwidefsky

On Tue, Aug 25, 2009 at 06:59:14PM +0200, Frederic Weisbecker wrote:
> On Tue, Aug 25, 2009 at 12:20:04PM -0400, Mathieu Desnoyers wrote:
> > * Hendrik Brueckner (brueckner@linux.vnet.ibm.com) wrote:
> > > On Tue, Aug 25, 2009 at 04:15:49PM +0200, Frederic Weisbecker wrote:
> > > > On Tue, Aug 25, 2009 at 02:50:27PM +0200, Hendrik Brueckner wrote:
> > > > > There are at least two scenarios where syscall_get_nr() can return -1:
> > > > > 
> > > > > 1. For example, ptrace stores an invalid syscall number, and thus,
> > > > >    tracing code resets it.
> > > > >    (see do_syscall_trace_enter in arch/s390/kernel/ptrace.c)
> > > > > 
> > > > > 2. The syscall_regfunc() (kernel/tracepoint.c) sets the TIF_SYSCALL_FTRACE
> > > > >    (now: TIF_SYSCALL_TRACEPOINT) flag for all threads which includes
> > > > >    kernel threads.
> > > > >    However, the ftrace selftest triggers a kernel oops when testing syscall
> > > > >    trace points:
> > > > >       - The kernel thread is started as usual (do_fork()),
> > > > >       - tracing code sets TIF_SYSCALL_FTRACE,
> > > > >       - the ret_from_fork() function is triggered and starts
> > > > > 	ftrace_syscall_exit() with an invalid syscall number.
> > > > 
> > > > 
> > > > 
> > > > I wonder if there is any way to identify such situation...?
> > > For the second case, it might be an option to avoid setting the
> > > TIF_SYSCALL_FTRACE flag for kernel threads.
> > > 
> > > Kernel threads have task_struct->mm set to NULL.
> > > (Thanks to Heiko for that hint ;-)
> > > 
> > > The idea is then to check the mm field in syscall_regfunc() and
> > > set the flag accordingly.
> > > 
> > > However, I think the patch is an optional add-on because checking
> > > the syscall number is still required for case 1).
> > > 
> > > ---
> > >  kernel/tracepoint.c |    4 +++-
> > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > > 
> > > --- a/kernel/tracepoint.c
> > > +++ b/kernel/tracepoint.c
> > > @@ -593,7 +593,9 @@ void syscall_regfunc(void)
> > >  	if (!sys_tracepoint_refcount) {
> > >  		read_lock_irqsave(&tasklist_lock, flags);
> > >  		do_each_thread(g, t) {
> > > -			set_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
> > > +			/* Skip kernel threads. */
> > > +			if (t->mm)
> > > +				set_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
> > 
> > Uh ? kernel threads can invoke a system call. There are rare places
> > where kernel code actually invokes system calls. I don't see why we
> > should not deal with them.
> 
> 
> 
> Yeah they do, but they don't use the sysenter path, they call the
> syscall helpers directly, such as do_fork() or things like that.
> 
> The syscall tracepoints are set in the sysenter/sysexit path, so
> there's no use in tracing the kernel threads. It doesn't have any effect,
> except random results in case of fork() calls, because we take
> the ret_from_fork() path that also ends up in trace_sys_exit()
> if the TIF_SYSCALL_TRACEPOINT flag is set, leading to such
> asymmetric tracing.
> 
> Kernel threads reach syscalls through wrappers such as create_thread().
> So instead, statically defined tracepoints in create_thread() and other
> such syscall wrappers for kernel threads seem more valuable, hmm?
> 
>  
> > Moreover, the problem you face is more general: if we set the
> > TIF_SYSCALL_FTRACE flag of a standard thread right in the middle of its
> > system call, x86_64 will cause the syscall exit to execute by re-reading
> > the thread flags and run a syscall trace exit.
> 
> 
> Well, I don't think that's the problem. The issue here, if I understand
> correctly, is that kernel threads don't take the sysenter path, so they
> never hit the trace_sys_enter() call. And usually they won't ever hit any
> trace_sys_exit() calls except in the fork() case, because we take
> the ret_from_fork() path, which leads to syscall exit tracing because
> the TIF flag is set.
> 
> At this stage, the syscall number is supposed to be stored in orig_eax,
> but because the kernel thread hasn't called fork() through a syscall and
> has called do_fork() directly, the regs values have nothing that looks
> like syscall parameters.



I mean, I don't know what orig_eax looks like at this stage.

Looking at arch/x86/kernel/process_32.c:copy_thread()

childregs = task_pt_regs(p);
*childregs = *regs;
childregs->ax = 0;
childregs->sp = sp;

p->thread.sp = (unsigned long) childregs;
p->thread.sp0 = (unsigned long) (childregs+1);

p->thread.ip = (unsigned long) ret_from_fork;


sp will be the struct pt_regs * passed to syscall_trace_leave()
later.

ax has the result of the fork syscall -> 0 for the child.
What about orig_eax, which has the syscall nr? It depends on the parent
and I don't know what it has at this stage.
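
To illustrate the question: the struct assignment in copy_thread() copies
every field of the parent's regs, so whatever the parent had in orig_eax
lands in the child unchanged. A user-space model, with hypothetical names
and a cut-down register set (not the kernel's struct pt_regs):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, cut-down register frame for illustration only. */
struct fork_regs {
	long ax;	/* return value slot */
	long orig_ax;	/* syscall number slot */
	long sp;	/* stack pointer */
};

/* Models the first lines of copy_thread() quoted above. */
static void fork_regs_copy(struct fork_regs *child,
			   const struct fork_regs *parent, long sp)
{
	assert(child != NULL && parent != NULL);
	*child = *parent;	/* copies orig_ax verbatim */
	child->ax = 0;		/* the child sees fork() return 0 */
	child->sp = sp;
}
```

So if the parent is a kernel thread that went straight to do_fork(), the
child inherits whatever happened to be in that slot, valid or not.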

I haven't seen crashes on x86 with kernel thread tracing, maybe because
orig_eax is set to a valid syscall nr (maybe even the fork nr).

Perhaps it's not the case on s390?

Anyway, tracing kernel thread syscalls only gives us the fork return,
so it's something we may want to drop and trace higher level kernel
thread syscall wrappers instead.

Moreover, every kernel thread is created through a kthreadd fork if
I'm not wrong, so it wouldn't be accurate for us to trace the
fork calls in kernel threads. Tracing higher level kernel thread management
sounds more interesting, we would then know who really created the thread,
etc...


> 
> I guess we don't need to take the sys_enter tracing path to have a sane
> orig_eax in the sys_exit tracing path (for non kernel threads).
> Though I'm not sure about that, I should check to be sure.
>
> > We could simply initialize the "saved system calls id" number to
> > something like -1, so that if we happen to return from a syscall that
> > did not get its id recorded at syscall entry, we know it because it's
> > not initialized.
> > 
> > We would need to carefully put back the -1 value after clearing the
> > thread flag when we stop tracing too (while still holding a mutex).
> > 
> > Mathieu
> > 
> > >  		} while_each_thread(g, t);
> > >  		read_unlock_irqrestore(&tasklist_lock, flags);
> > >  	}
> > > 
> > 
> > -- 
> > Mathieu Desnoyers
> > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> 


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 08/12] add trace events for each syscall entry/exit
  2009-08-25 17:04           ` Jason Baron
@ 2009-08-25 18:15             ` Mathieu Desnoyers
  0 siblings, 0 replies; 88+ messages in thread
From: Mathieu Desnoyers @ 2009-08-25 18:15 UTC (permalink / raw)
  To: Jason Baron
  Cc: Hendrik Brueckner, Frederic Weisbecker, linux-kernel, mingo,
	laijs, rostedt, peterz, jiayingz, mbligh, lizf, Heiko Carstens,
	Martin Schwidefsky

* Jason Baron (jbaron@redhat.com) wrote:
> On Tue, Aug 25, 2009 at 12:20:04PM -0400, Mathieu Desnoyers wrote:
> > 
> > Uh ? kernel threads can invoke a system call. There are rare places
> > where kernel code actually invokes system calls. I don't see why we
> > should not deal with them.
> > 
> > Moreover, the problem you face is more general: if we set the
> > TIF_SYSCALL_FTRACE flag of a standard thread right in the middle of its
> > system call, x86_64 will cause the syscall exit to execute by re-reading
> > the thread flags and run a syscall trace exit.
> > 
> > We could simply initialize the "saved system calls id" number to
> > something like -1, so that if we happen to return from a syscall that
> > did not get its id recorded at syscall entry, we know it because it's
> > not initialized.
> > 
> > We would need to carefully put back the -1 value after clearing the
> > thread flag when we stop tracing too (while still holding a mutex).
> > 
> > Mathieu
> > 
> 
> why can't we have a syscall exit that is unmatched? we calculate
> the exit syscall number from the pt_regs structure at exit, so we
> don't need to match it up with an entry to know which syscall it is.
> 

Are we certain that it will still be in the pt_regs at system call
exit and not overwritten by the syscall return value? For every arch?

Mathieu

> thanks,
> 
> -Jason

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 08/12] add trace events for each syscall entry/exit
  2009-08-25 17:31             ` Frederic Weisbecker
@ 2009-08-25 18:31               ` Mathieu Desnoyers
  2009-08-25 19:42                 ` Frederic Weisbecker
                                   ` (3 more replies)
  0 siblings, 4 replies; 88+ messages in thread
From: Mathieu Desnoyers @ 2009-08-25 18:31 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Hendrik Brueckner, Jason Baron, linux-kernel, mingo, laijs,
	rostedt, peterz, jiayingz, mbligh, lizf, Heiko Carstens,
	Martin Schwidefsky

* Frederic Weisbecker (fweisbec@gmail.com) wrote:
> On Tue, Aug 25, 2009 at 06:59:14PM +0200, Frederic Weisbecker wrote:
> > On Tue, Aug 25, 2009 at 12:20:04PM -0400, Mathieu Desnoyers wrote:
> > > * Hendrik Brueckner (brueckner@linux.vnet.ibm.com) wrote:
> > > > On Tue, Aug 25, 2009 at 04:15:49PM +0200, Frederic Weisbecker wrote:
> > > > > On Tue, Aug 25, 2009 at 02:50:27PM +0200, Hendrik Brueckner wrote:
> > > > > > There are at least two scenarios where syscall_get_nr() can return -1:
> > > > > > 
> > > > > > 1. For example, ptrace stores an invalid syscall number, and thus,
> > > > > >    tracing code resets it.
> > > > > >    (see do_syscall_trace_enter in arch/s390/kernel/ptrace.c)
> > > > > > 
> > > > > > 2. The syscall_regfunc() (kernel/tracepoint.c) sets the TIF_SYSCALL_FTRACE
> > > > > >    (now: TIF_SYSCALL_TRACEPOINT) flag for all threads which includes
> > > > > >    kernel threads.
> > > > > >    However, the ftrace selftest triggers a kernel oops when testing syscall
> > > > > >    trace points:
> > > > > >       - The kernel thread is started as usual (do_fork()),
> > > > > >       - tracing code sets TIF_SYSCALL_FTRACE,
> > > > > >       - the ret_from_fork() function is triggered and starts
> > > > > > 	ftrace_syscall_exit() with an invalid syscall number.
> > > > > 
> > > > > 
> > > > > 
> > > > > I wonder if there is any way to identify such situation...?
> > > > For the second case, it might be an option to avoid setting the
> > > > TIF_SYSCALL_FTRACE flag for kernel threads.
> > > > 
> > > > Kernel threads have task_struct->mm set to NULL.
> > > > (Thanks to Heiko for that hint ;-)
> > > > 
> > > > The idea is then to check the mm field in syscall_regfunc() and
> > > > set the flag accordingly.
> > > > 
> > > > However, I think the patch is an optional add-on because checking
> > > > the syscall number is still required for case 1).
> > > > 
> > > > ---
> > > >  kernel/tracepoint.c |    4 +++-
> > > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > > > 
> > > > --- a/kernel/tracepoint.c
> > > > +++ b/kernel/tracepoint.c
> > > > @@ -593,7 +593,9 @@ void syscall_regfunc(void)
> > > >  	if (!sys_tracepoint_refcount) {
> > > >  		read_lock_irqsave(&tasklist_lock, flags);
> > > >  		do_each_thread(g, t) {
> > > > -			set_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
> > > > +			/* Skip kernel threads. */
> > > > +			if (t->mm)
> > > > +				set_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
> > > 
> > > Uh ? kernel threads can invoke a system call. There are rare places
> > > where kernel code actually invokes system calls. I don't see why we
> > > should not deal with them.
> > 
> > 
> > 
> > Yeah they do, but they don't use the sysenter path, they call the
> > syscall helpers directly, such as do_fork() or things like that.
> > 
> > The syscall tracepoints are set in the sysenter/sysexit path, so
> > there's no use in tracing the kernel threads. It doesn't have any effect,
> > except random results in case of fork() calls, because we take
> > the ret_from_fork() path that also ends up in trace_sys_exit()
> > if the TIF_SYSCALL_TRACEPOINT flag is set, leading to such
> > asymmetric tracing.
> > 
> > Kernel threads reach syscalls through wrappers such as create_thread().
> > So instead, statically defined tracepoints in create_thread() and other
> > such syscall wrappers for kernel threads seem more valuable, hmm?
> > 
> >  
> > > Moreover, the problem you face is more general: if we set the
> > > TIF_SYSCALL_FTRACE flag of a standard thread right in the middle of its
> > > system call, x86_64 will cause the syscall exit to execute by re-reading
> > > the thread flags and run a syscall trace exit.
> > 
> > 
> > Well, I don't think that's the problem. The issue here, if I understand
> > correctly, is that kernel threads don't take the sysenter path, so they
> > never hit the trace_sys_enter() call. And usually they won't ever hit any
> > trace_sys_exit() calls except in the fork() case, because we take
> > the ret_from_fork() path, which leads to syscall exit tracing because
> > the TIF flag is set.
> > 
> > At this stage, the syscall number is supposed to be stored in orig_eax,
> > but because the kernel thread hasn't called fork() through a syscall and
> > has called do_fork() directly, the regs values have nothing that looks
> > like syscall parameters.
> 
> 
> 
> I mean, I don't know what orig_eax looks like at this stage.
> 
> Looking at arch/x86/kernel/process_32.c:copy_thread()
> 
> childregs = task_pt_regs(p);
> *childregs = *regs;
> childregs->ax = 0;
> childregs->sp = sp;
> 
> p->thread.sp = (unsigned long) childregs;
> p->thread.sp0 = (unsigned long) (childregs+1);
> 
> p->thread.ip = (unsigned long) ret_from_fork;
> 
> 
> sp will be the struct pt_regs * passed to syscall_trace_leave()
> later.
> 
> ax has the result of the fork syscall -> 0 for the child.
> What about orig_eax, which has the syscall nr? It depends on the parent
> and I don't know what it has at this stage.
> 
> I haven't seen crashes on x86 with kernel thread tracing, maybe because
> orig_eax is set to a valid syscall nr (maybe even the fork nr).
> 
> Perhaps it's not the case on s390?
> 
> Anyway, tracing kernel thread syscalls only gives us the fork return,
> so it's something we may want to drop and trace higher level kernel
> thread syscall wrappers instead.
> 
> Moreover, every kernel thread is created through a kthreadd fork if
> I'm not wrong, so it wouldn't be accurate for us to trace the
> fork calls in kernel threads. Tracing higher level kernel thread management
> sounds more interesting, we would then know who really created the thread,
> etc...
> 

(Well, I do not have time currently to look into the gory details
(sorry), but let's try to take a step back from the problem.)

The design proposal for this kthread behavior wrt syscalls is based on a
very specific and current kernel behavior, that may happen to change and
that I have actually seen proven incorrect. For instance, some
proprietary Linux driver does very odd things with system calls within
kernel threads, like invoking them with int 0x80.

Yes, this is odd, but do we really want to tie the tracer that much to
the actual OS implementation specificities ?

That sounds like a recipe for endless breakages and missing bits of
instrumentation.

So my advice would be: if we want to trace the syscall entry/exit paths,
let's trace them for the _whole_ system, and find ways to make it work
for corner-cases rather than finding clever ways to diminish
instrumentation coverage.

Given the ret from fork example happens to be the first event fired
after the thread is created, we should be able to deal with this problem
by initializing the thread structure used by syscall exit tracing to an
initial "ret from fork" value.
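
A rough user-space sketch of that idea, assuming a per-thread field that
remembers the syscall id. All names and the two sentinel values below are
hypothetical; nothing like this exists in the patch set:

```c
#include <assert.h>
#include <stddef.h>

#define SAVED_NR_NONE		(-1L)	/* hypothetical: no entry recorded */
#define SAVED_NR_RET_FROM_FORK	(-2L)	/* hypothetical: fresh thread */

/* Hypothetical per-thread state used by syscall exit tracing. */
struct traced_thread {
	long saved_nr;
};

/* A new thread starts out as "returning from fork". */
static void traced_thread_init(struct traced_thread *t)
{
	assert(t != NULL);
	t->saved_nr = SAVED_NR_RET_FROM_FORK;
}

/* Entry tracing records which syscall is in flight. */
static void traced_sys_enter(struct traced_thread *t, long nr)
{
	assert(t != NULL);
	t->saved_nr = nr;
}

/* Exit tracing consumes the recorded id and resets the slot. */
static long traced_sys_exit(struct traced_thread *t)
{
	long nr;

	assert(t != NULL);
	nr = t->saved_nr;
	t->saved_nr = SAVED_NR_NONE;
	return nr;
}
```

With this, the first exit event of a new thread reports the "ret from
fork" marker, and an exit that had no recorded entry reports the "none"
marker instead of a random number.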

Mathieu

> 
> > 
> > I guess we don't need to take the sys_enter tracing path to have a sane
> > orig_eax in the sys_exit tracing path (for non kernel threads).
> > Though I'm not sure about that, I should check to be sure.
> >
> > > We could simply initialize the "saved system calls id" number to
> > > something like -1, so that if we happen to return from a syscall that
> > > did not get its id recorded at syscall entry, we know it because it's
> > > not initialized.
> > > 
> > > We would need to carefully put back the -1 value after clearing the
> > > thread flag when we stop tracing too (while still holding a mutex).
> > > 
> > > Mathieu
> > > 
> > > >  		} while_each_thread(g, t);
> > > >  		read_unlock_irqrestore(&tasklist_lock, flags);
> > > >  	}
> > > > 
> > > 
> > > -- 
> > > Mathieu Desnoyers
> > > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> > 
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 08/12] add trace events for each syscall entry/exit
  2009-08-25 18:31               ` Mathieu Desnoyers
@ 2009-08-25 19:42                 ` Frederic Weisbecker
  2009-08-25 19:51                   ` Mathieu Desnoyers
  2009-08-26  6:48                   ` Peter Zijlstra
  2009-08-25 22:04                 ` Martin Schwidefsky
                                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 88+ messages in thread
From: Frederic Weisbecker @ 2009-08-25 19:42 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Hendrik Brueckner, Jason Baron, linux-kernel, mingo, laijs,
	rostedt, peterz, jiayingz, mbligh, lizf, Heiko Carstens,
	Martin Schwidefsky

On Tue, Aug 25, 2009 at 02:31:19PM -0400, Mathieu Desnoyers wrote:
> (Well, I do not have time currently to look into the gory details
> (sorry), but let's try to take a step back from the problem.)
> 
> The design proposal for this kthread behavior wrt syscalls is based on a
> very specific and current kernel behavior, that may happen to change and
> that I have actually seen proven incorrect. For instance, some
> proprietary Linux driver does very odd things with system calls within
> kernel threads, like invoking them with int 0x80.
> 
> Yes, this is odd, but do we really want to tie the tracer that much to
> the actual OS implementation specificities ?


I really can't see the point in doing this. I don't expect the kernel
behaviour to change soon and start issuing explicit syscall interrupts
from within itself. It's not about a current kernel implementation fashion,
it's about kernel design sanity that is not likely to go backward.

Is it worth it to trace kernel threads and maintain their tracing
specifics (such as the ret_from_fork workarounds that implies)
just because we want to support tracing of some silly proprietary drivers?


> 
> That sounds like a recipe for endless breakages and missing bits of
> instrumentation.
>
> So my advice would be: if we want to trace the syscall entry/exit paths,
> let's trace them for the _whole_ system, and find ways to make it work
> for corner-cases rather than finding clever ways to diminish
> instrumentation coverage.


If developers of out-of-tree drivers want to implement buggy things
that would never be accepted after a minimal review here, and then instrument
their bugs, then I would suggest they implement their own ad hoc instrumentation,
really :-/

What's the point in supporting out-of-tree bugs?

Well, the only advantage of doing this would be to support reverse engineering
in tiny and rare corner cases. That's not really worth the effort.

 
> Given the ret from fork example happens to be the first event fired
> after the thread is created, we should be able to deal with this problem
> by initializing the thread structure used by syscall exit tracing to an
> initial "ret from fork" value.
> 
> Mathieu


It means we have to support and check this corner case in every arch
that supports syscall tracing, deal with crashes because we omitted it, etc...

For all the things I've explained above I don't think it's worth the effort.

But it's just my opinion...


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 08/12] add trace events for each syscall entry/exit
  2009-08-25 19:42                 ` Frederic Weisbecker
@ 2009-08-25 19:51                   ` Mathieu Desnoyers
  2009-08-26  0:19                     ` Frederic Weisbecker
  2009-08-26  6:48                   ` Peter Zijlstra
  1 sibling, 1 reply; 88+ messages in thread
From: Mathieu Desnoyers @ 2009-08-25 19:51 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Hendrik Brueckner, Jason Baron, linux-kernel, mingo, laijs,
	rostedt, peterz, jiayingz, mbligh, lizf, Heiko Carstens,
	Martin Schwidefsky

* Frederic Weisbecker (fweisbec@gmail.com) wrote:
> On Tue, Aug 25, 2009 at 02:31:19PM -0400, Mathieu Desnoyers wrote:
> > (Well, I do not have time currently to look into the gory details
> > (sorry), but let's try to take a step back from the problem.)
> > 
> > The design proposal for this kthread behavior wrt syscalls is based on a
> > very specific and current kernel behavior, that may happen to change and
> > that I have actually seen proven incorrect. For instance, some
> > proprietary Linux driver does very odd things with system calls within
> > kernel threads, like invoking them with int 0x80.
> > 
> > Yes, this is odd, but do we really want to tie the tracer that much to
> > the actual OS implementation specificities ?
> 
> 
> I really can't see the point in doing this. I don't expect the kernel
> behaviour to change soon and start issuing explicit syscall interrupts
> from within itself. It's not about a current kernel implementation fashion,
> it's about kernel design sanity that is not likely to go backward.
> 
> Is it worth it to trace kernel threads and maintain their tracing
> specifics (such as the ret_from_fork workarounds that implies)
> just because we want to support tracing of some silly proprietary drivers?
> 
> 
> > 
> > That sounds like a recipe for endless breakages and missing bits of
> > instrumentation.
> >
> > So my advice would be: if we want to trace the syscall entry/exit paths,
> > let's trace them for the _whole_ system, and find ways to make it work
> > for corner-cases rather than finding clever ways to diminish
> > instrumentation coverage.
> 
> 
> If developers of out-of-tree drivers want to implement buggy things
> that would never be accepted after a minimal review here, and then instrument
> their bugs, then I would suggest they implement their own ad hoc instrumentation,
> really :-/
> 
> What's the point in supporting out-of-tree bugs?
> 
> Well, the only advantage of doing this would be to support reverse engineering
> in tiny and rare corner cases. That's not really worth the effort.
> 
>  
> > Given the ret from fork example happens to be the first event fired
> > after the thread is created, we should be able to deal with this problem
> > by initializing the thread structure used by syscall exit tracing to an
> > initial "ret from fork" value.
> > 
> > Mathieu
> 
> 
> It means we have to support and check this corner case in every arch
> that supports syscall tracing, deal with crashes because we omitted it, etc...
> 
> For all the things I've explained above I don't think it's worth the effort.
> 
> But it's just my opinion...
> 

Then we might want to explicitly require that calls to sys_*() system
calls made from within the kernel pass through another instrumentation
mechanism. IMHO, that would make sense. It would cover both system calls
made from kernel threads and system calls made from within a system call
or trap.

Mathieu


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 00/12] add syscall tracepoints V3 - s390 arch update
  2009-08-25 14:39     ` Heiko Carstens
@ 2009-08-25 19:52       ` Frederic Weisbecker
  0 siblings, 0 replies; 88+ messages in thread
From: Frederic Weisbecker @ 2009-08-25 19:52 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Hendrik Brueckner, Jason Baron, linux-kernel, mingo, laijs,
	rostedt, peterz, mathieu.desnoyers, jiayingz, mbligh, lizf,
	Martin Schwidefsky

On Tue, Aug 25, 2009 at 04:39:51PM +0200, Heiko Carstens wrote:
> On Tue, Aug 25, 2009 at 03:52:32PM +0200, Frederic Weisbecker wrote:
> > On Tue, Aug 25, 2009 at 02:31:11PM +0200, Hendrik Brueckner wrote:
> > >  		meta = find_syscall_meta((unsigned long)sys_call_table[i]);
> > >  		syscalls_metadata[i] = meta;
> > >  	}
> > We can even probably move most of this code to the core, except the tiny parts
> > that rely on the arch syscall table.
> > 
> > BTW, perhaps a silly question: would it be hard to have a generic syscall table
> > common to every arch?
> 
> That would cause a lot of churn. Every architecture initializes the syscall
> table (two tables if CONFIG_COMPAT is enabled) differently.
> s390 also only uses 32 bit pointers in the system call table for 64 bit
> kernels, since we know that the functions are within the first 4GB.
> I don't think it's worth the effort.


Ok.
Well, I remembered some project for a unified syscall table, but maybe
it has been given up for such reasons.


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 08/12] add trace events for each syscall entry/exit
  2009-08-25 12:50   ` Hendrik Brueckner
  2009-08-25 14:15     ` Frederic Weisbecker
@ 2009-08-25 21:40     ` Frederic Weisbecker
  2009-08-25 22:09       ` Frederic Weisbecker
  2009-08-28 12:27     ` [tip:tracing/core] tracing: Check invalid syscall nr while tracing syscalls tip-bot for Hendrik Brueckner
  2 siblings, 1 reply; 88+ messages in thread
From: Frederic Weisbecker @ 2009-08-25 21:40 UTC (permalink / raw)
  To: Hendrik Brueckner, Jason Baron, linux-kernel, mingo, laijs,
	rostedt, peterz, mathieu.desnoyers, jiayingz, mbligh, lizf,
	Heiko Carstens, Martin Schwidefsky

On Tue, Aug 25, 2009 at 02:50:27PM +0200, Hendrik Brueckner wrote:
> Most arch syscall_get_nr() implementations return -1 if the syscall
> number is not valid.  Accessing the bit field without a check might
> result in a kernel oops (at least I saw it on s390 for ftrace selftest).
> 
> Before this change, this problem did not occur, because the invalid
> syscall number (-1) caused syscall_nr_to_meta() to return NULL.
> 
> There are at least two scenarios where syscall_get_nr() can return -1:
> 
> 1. For example, ptrace stores an invalid syscall number, and thus,
>    tracing code resets it.
>    (see do_syscall_trace_enter in arch/s390/kernel/ptrace.c)
> 
> 2. The syscall_regfunc() (kernel/tracepoint.c) sets the TIF_SYSCALL_FTRACE
>    (now: TIF_SYSCALL_TRACEPOINT) flag for all threads which includes
>    kernel threads.
>    However, the ftrace selftest triggers a kernel oops when testing syscall
>    trace points:
>       - The kernel thread is started as usual (do_fork()),
>       - tracing code sets TIF_SYSCALL_FTRACE,
>       - the ret_from_fork() function is triggered and starts
> 	ftrace_syscall_exit() with an invalid syscall number.
> 
> To avoid these scenarios, I suggest checking the syscall_nr.
> 
> For instance, the ftrace selftest fails for s390 (with config option
> CONFIG_FTRACE_SYSCALLS set) and produces the following kernel oops.
> 
> Unable to handle kernel pointer dereference at virtual kernel address 2000000000
> 
> Oops: 0038 [#1] PREEMPT SMP
> Modules linked in:
> CPU: 0 Not tainted 2.6.31-rc6-next-20090819-dirty #18
> Process kthreadd (pid: 818, task: 000000003ea207e8, ksp: 000000003e813eb8)
> Krnl PSW : 0704100180000000 00000000000ea54c (ftrace_syscall_exit+0x58/0xdc)
>            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:1 PM:0 EA:3
> Krnl GPRS: 0000000000000000 00000000000e0000 ffffffffffffffff 20000000008c2650
>            0000000000000007 0000000000000000 0000000000000000 0000000000000000
>            0000000000000000 0000000000000000 ffffffffffffffff 000000003e813d78
>            000000003e813f58 0000000000505ba8 000000003e813e18 000000003e813d78
> Krnl Code: 00000000000ea540: e330d0000008       ag      %r3,0(%r13)
>            00000000000ea546: a7480007           lhi     %r4,7
>            00000000000ea54a: 1442               nr      %r4,%r2
>           >00000000000ea54c: e31030000090       llgc    %r1,0(%r3)
>            00000000000ea552: 5410d008           n       %r1,8(%r13)
>            00000000000ea556: 8a104000           sra     %r1,0(%r4)
>            00000000000ea55a: 5410d00c           n       %r1,12(%r13)
>            00000000000ea55e: 1211               ltr     %r1,%r1
> Call Trace:
> ([<0000000000000000>] 0x0)
>  [<000000000001fa22>] do_syscall_trace_exit+0x132/0x18c
>  [<000000000002d0c4>] sysc_return+0x0/0x8
>  [<000000000001c738>] kernel_thread_starter+0x0/0xc
> Last Breaking-Event-Address:
>  [<00000000000ea51e>] ftrace_syscall_exit+0x2a/0xdc
> 
> Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>



I'm queueing this one for .32

Thanks.
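
For the archive, a small user-space model of why the check in the patch
quoted below matters: indexing a bitmap with -1 reads far outside the
array, which is presumably where the bad pointer in the oops comes from.
The names are hypothetical, and the upper-bound test is my addition for
the sketch; the patch itself only rejects negative numbers.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NR_SYSCALLS	64	/* hypothetical table size */
#define MODEL_WORD_BITS		(sizeof(unsigned long) * CHAR_BIT)
#define MODEL_WORDS \
	((MODEL_NR_SYSCALLS + MODEL_WORD_BITS - 1) / MODEL_WORD_BITS)

/* Stand-in for enabled_enter_syscalls / enabled_exit_syscalls. */
static unsigned long model_enabled[MODEL_WORDS];

/* Mark one syscall number as traced; callers must pass a valid nr. */
static void model_enable(int nr)
{
	assert(nr >= 0 && nr < MODEL_NR_SYSCALLS);
	model_enabled[(size_t)nr / MODEL_WORD_BITS] |=
		1UL << ((size_t)nr % MODEL_WORD_BITS);
}

/* Guarded lookup: never touch the bitmap with an out-of-range nr. */
static bool model_is_enabled(int nr)
{
	if (nr < 0 || nr >= MODEL_NR_SYSCALLS)
		return false;
	return (model_enabled[(size_t)nr / MODEL_WORD_BITS] >>
		((size_t)nr % MODEL_WORD_BITS)) & 1UL;
}
```

Without the early return, a syscall_nr of -1 would be used directly as a
bit index into the bitmap.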



> ---
>  kernel/trace/trace_syscalls.c |    4 ++++
>  1 file changed, 4 insertions(+)
> 
> --- a/kernel/trace/trace_syscalls.c
> +++ b/kernel/trace/trace_syscalls.c
> @@ -224,6 +224,8 @@ void ftrace_syscall_enter(struct pt_regs
>  	int syscall_nr;
>  
>  	syscall_nr = syscall_get_nr(current, regs);
> +	if (syscall_nr < 0)
> +		return;
>  	if (!test_bit(syscall_nr, enabled_enter_syscalls))
>  		return;
>  
> @@ -254,6 +256,8 @@ void ftrace_syscall_exit(struct pt_regs 
>  	int syscall_nr;
>  
>  	syscall_nr = syscall_get_nr(current, regs);
> +	if (syscall_nr < 0)
> +		return;
>  	if (!test_bit(syscall_nr, enabled_exit_syscalls))
>  		return;
>  


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 08/12] add trace events for each syscall entry/exit
  2009-08-25 18:31               ` Mathieu Desnoyers
  2009-08-25 19:42                 ` Frederic Weisbecker
@ 2009-08-25 22:04                 ` Martin Schwidefsky
  2009-08-26  7:38                   ` Heiko Carstens
  2009-08-26  6:21                 ` Peter Zijlstra
  2009-08-26  7:10                 ` Peter Zijlstra
  3 siblings, 1 reply; 88+ messages in thread
From: Martin Schwidefsky @ 2009-08-25 22:04 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Frederic Weisbecker, Hendrik Brueckner, Jason Baron,
	linux-kernel, mingo, laijs, rostedt, peterz, jiayingz, mbligh,
	lizf, Heiko Carstens

On Tue, 25 Aug 2009 14:31:19 -0400
Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:

> * Frederic Weisbecker (fweisbec@gmail.com) wrote:
> > On Tue, Aug 25, 2009 at 06:59:14PM +0200, Frederic Weisbecker wrote:
> > > On Tue, Aug 25, 2009 at 12:20:04PM -0400, Mathieu Desnoyers wrote:
> > > > * Hendrik Brueckner (brueckner@linux.vnet.ibm.com) wrote:
> > > > > On Tue, Aug 25, 2009 at 04:15:49PM +0200, Frederic Weisbecker wrote:
> > > > > > On Tue, Aug 25, 2009 at 02:50:27PM +0200, Hendrik Brueckner wrote:
> > > > > > > There are at least two scenarios where syscall_get_nr() can return -1:
> > > > > > > 
> > > > > > > 1. For example, ptrace stores an invalid syscall number, and thus,
> > > > > > >    tracing code resets it.
> > > > > > >    (see do_syscall_trace_enter in arch/s390/kernel/ptrace.c)
> > > > > > > 
> > > > > > > 2. The syscall_regfunc() (kernel/tracepoint.c) sets the TIF_SYSCALL_FTRACE
> > > > > > >    (now: TIF_SYSCALL_TRACEPOINT) flag for all threads which includes
> > > > > > >    kernel threads.
> > > > > > >    However, the ftrace selftest triggers a kernel oops when testing syscall
> > > > > > >    trace points:
> > > > > > >       - The kernel thread is started as usual (do_fork()),
> > > > > > >       - tracing code sets TIF_SYSCALL_FTRACE,
> > > > > > >       - the ret_from_fork() function is triggered and starts
> > > > > > > 	ftrace_syscall_exit() with an invalid syscall number.
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > I wonder if there is any way to identify such situation...?
> > > > > For the second case, it might be an option to avoid setting the
> > > > > TIF_SYSCALL_FTRACE flag for kernel threads.
> > > > > 
> > > > > Kernel threads have task_struct->mm set to NULL.
> > > > > (Thanks to Heiko for that hint ;-)
> > > > > 
> > > > > The idea is then to check the mm field in syscall_regfunc() and
> > > > > set the flag accordingly.
> > > > > 
> > > > > However, I think the patch is an optional add-on because checking
> > > > > the syscall number is still required for case 1).
> > > > > 
> > > > > ---
> > > > >  kernel/tracepoint.c |    4 +++-
> > > > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > > > > 
> > > > > --- a/kernel/tracepoint.c
> > > > > +++ b/kernel/tracepoint.c
> > > > > @@ -593,7 +593,9 @@ void syscall_regfunc(void)
> > > > >  	if (!sys_tracepoint_refcount) {
> > > > >  		read_lock_irqsave(&tasklist_lock, flags);
> > > > >  		do_each_thread(g, t) {
> > > > > -			set_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
> > > > > +			/* Skip kernel threads. */
> > > > > +			if (t->mm)
> > > > > +				set_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
> > > > 
> > > > Uh ? kernel threads can invoke a system call. There are rare places
> > > > where kernel code actually invoke system calls. I don't see why we
> > > > should not deal with them.
> > > 
> > > 
> > > 
> > > Yeah they do, but they don't use the sysenter path, they call the
> > > syscall helpers directly, such as do_fork() or things like that.
> > > 
> > > The syscall tracepoints are set in the sysenter/sysexit path, so
> > > it's no use tracing the kernel threads; it doesn't have any effect,
> > > except random results in case of fork() calls, because we take
> > > the ret_from_fork() path that also ends up in trace_sys_exit()
> > > if the TIF_SYSCALL_TRACEPOINT thing is set, leading to such
> > > asymmetric tracing.
> > > 
> > > Kernel threads use syscalls through wrappers such as create_thread().
> > > So instead, statically defined tracepoints in create_thread() and such
> > > other syscall wrappers for kernel threads seem more valuable, hmm?
> > > 
> > >  
> > > > Moreover, the problem you face is more general: if we set the
> > > > TIF_SYSCALL_FTRACE flag of a standard thread right in the middle of its
> > > > system call, x86_64 will cause the syscall exit to execute by re-reading
> > > > the thread flags and run a syscall trace exit.
> > > 
> > > 
> > > Well, I don't think that's the problem. The issue here, if I understand
> > > correctly, is that kernel threads don't take the sysenter path, so they
> > > never hit the trace_sys_enter() call. And usually they won't ever hit any
> > > trace_sys_exit() calls except in the fork() case, because we take
> > > the ret_from_fork() path, which leads to syscall exit tracing due
> > > to the TIF flags set.
> > > 
> > > At this stage, the syscall number is supposed to be stored in orig_eax,
> > > but because the kernel thread hasn't called fork() through a syscall and
> > > has called do_fork() directly, the regs values have nothing that look
> > > like syscall parameters.
> > 
> > 
> > 
> > I mean, I don't know what orig_eax looks like at this stage.
> > 
> > Looking at arch/x86/kernel/process_32.c:copy_thread()
> > 
> > childregs = task_pt_regs(p);
> > *childregs = *regs;
> > childregs->ax = 0;
> > childregs->sp = sp;
> > 
> > p->thread.sp = (unsigned long) childregs;
> > p->thread.sp0 = (unsigned long) (childregs+1);
> > 
> > p->thread.ip = (unsigned long) ret_from_fork;
> > 
> > 
> > sp will be the struct pt_regs * passed to syscall_trace_leave()
> > later.
> > 
> > ax has the result of the fork syscall -> 0 for the child.
> > What about orig_eax which has the syscall nr? It depends on the parent
> > and I don't know what it has at this stage.
> > 
> > I haven't seen crashes in x86 with kernel threads tracing, maybe because
> > orig_eax is set to a valid syscall nr (maybe even the fork nr).
> > 
> > Perhaps it's not the case in s390 ?
> > 
> > Anyway, tracing kernel threads syscalls only gives us the fork return,
> > so it's something we may want to drop and trace higher level kernel
> > thread syscall wrappers instead.
> > 
> > Moreover every kernel thread is created through a kthreadd fork if
> > I'm not wrong, so it wouldn't be an accurate thing for us to trace the
> > fork calls in kernel threads. Tracing higher level kernel thread management
> > sounds more interesting, we would then know who really created the thread,
> > etc...
> > 
> 
> (Well, I do not have time currently to look into the gory details
> (sorry), but let's try to take a step back from the problem.)
> 
> The design proposal for this kthread behavior wrt syscalls is based on a
> very specific and current kernel behavior, that may happen to change and
> that I have actually seen proven incorrect. For instance, some
> proprietary Linux driver does very odd things with system calls within
> kernel threads, like invoking them with int 0x80.

On s390, kernel code is not allowed to use the system call instruction
svc to invoke a system call function. You need to call the system call
function by name. The reason is hidden in the critical section cleanup
in entry.S. There is a good reason why the inline assemblies that
execute an inline system call have been removed from the kernel code.
 
> Yes, this is odd, but do we really want to tie the tracer that much to
> the actual OS implementation specificities ?
> 
> That sounds like a recipe for endless breakages and missing bits of
> instrumentation.
> 
> So my advice would be: if we want to trace the syscall entry/exit paths,
> let's trace them for the _whole_ system, and find ways to make it work
> for corner-cases rather than finding clever ways to diminish
> instrumentation coverage.

I guess that the real reason for the crash is hidden in the initialization
of the pt_regs structure of the kernel thread.

> Given the ret from fork example happens to be the first event fired
> after the thread is created, we should be able to deal with this problem
> by initializing the thread structure used by syscall exit tracing to an
> initial "ret from fork" value.

That is my best guess as well.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 08/12] add trace events for each syscall entry/exit
  2009-08-25 21:40     ` [PATCH 08/12] add trace events for each syscall entry/exit Frederic Weisbecker
@ 2009-08-25 22:09       ` Frederic Weisbecker
  2009-08-26  7:47         ` Heiko Carstens
  0 siblings, 1 reply; 88+ messages in thread
From: Frederic Weisbecker @ 2009-08-25 22:09 UTC (permalink / raw)
  To: Heiko Carstens, Martin Schwidefsky, Hendrik Brueckner
  Cc: Jason Baron, linux-kernel, mingo, laijs, rostedt, peterz,
	mathieu.desnoyers, jiayingz, mbligh, lizf

On Tue, Aug 25, 2009 at 11:40:21PM +0200, Frederic Weisbecker wrote:
> On Tue, Aug 25, 2009 at 02:50:27PM +0200, Hendrik Brueckner wrote:
> > Most arch syscall_get_nr() implementations return -1 if the syscall
> > number is not valid.  Accessing the bit field without a check might
> > result in a kernel oops (at least I saw it on s390 for ftrace selftest).
> > 
> > Before this change, this problem did not occur, because the invalid
> > syscall number (-1) caused syscall_nr_to_meta() to return NULL.
> > 
> > There are at least two scenarios where syscall_get_nr() can return -1:
> > 
> > 1. For example, ptrace stores an invalid syscall number, and thus,
> >    tracing code resets it.
> >    (see do_syscall_trace_enter in arch/s390/kernel/ptrace.c)
> > 
> > 2. The syscall_regfunc() (kernel/tracepoint.c) sets the TIF_SYSCALL_FTRACE
> >    (now: TIF_SYSCALL_TRACEPOINT) flag for all threads which includes
> >    kernel threads.
> >    However, the ftrace selftest triggers a kernel oops when testing syscall
> >    trace points:
> > >       - The kernel thread is started as usual (do_fork()),
> >       - tracing code sets TIF_SYSCALL_FTRACE,
> >       - the ret_from_fork() function is triggered and starts
> > 	ftrace_syscall_exit() with an invalid syscall number.
> > 
> > To avoid these scenarios, I suggest to check the syscall_nr.
> > 
> > For instance, the ftrace selftest fails for s390 (with config option
> > CONFIG_FTRACE_SYSCALLS set) and produces the following kernel oops.
> > 
> > Unable to handle kernel pointer dereference at virtual kernel address 2000000000
> > 
> > Oops: 0038 [#1] PREEMPT SMP
> > Modules linked in:
> > CPU: 0 Not tainted 2.6.31-rc6-next-20090819-dirty #18
> > Process kthreadd (pid: 818, task: 000000003ea207e8, ksp: 000000003e813eb8)
> > Krnl PSW : 0704100180000000 00000000000ea54c (ftrace_syscall_exit+0x58/0xdc)
> >            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:1 PM:0 EA:3
> > Krnl GPRS: 0000000000000000 00000000000e0000 ffffffffffffffff 20000000008c2650
> >            0000000000000007 0000000000000000 0000000000000000 0000000000000000
> >            0000000000000000 0000000000000000 ffffffffffffffff 000000003e813d78
> >            000000003e813f58 0000000000505ba8 000000003e813e18 000000003e813d78
> > Krnl Code: 00000000000ea540: e330d0000008       ag      %r3,0(%r13)
> >            00000000000ea546: a7480007           lhi     %r4,7
> >            00000000000ea54a: 1442               nr      %r4,%r2
> >           >00000000000ea54c: e31030000090       llgc    %r1,0(%r3)
> >            00000000000ea552: 5410d008           n       %r1,8(%r13)
> >            00000000000ea556: 8a104000           sra     %r1,0(%r4)
> >            00000000000ea55a: 5410d00c           n       %r1,12(%r13)
> >            00000000000ea55e: 1211               ltr     %r1,%r1
> > Call Trace:
> > ([<0000000000000000>] 0x0)
> >  [<000000000001fa22>] do_syscall_trace_exit+0x132/0x18c
> >  [<000000000002d0c4>] sysc_return+0x0/0x8
> >  [<000000000001c738>] kernel_thread_starter+0x0/0xc
> > Last Breaking-Event-Address:
> >  [<00000000000ea51e>] ftrace_syscall_exit+0x2a/0xdc
> > 
> > Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
> 
> 
> 
> I'm queueing this one for .32
> 
> Thanks.
> 


Btw it would be nice to have an ack from s390 maintainers.
Martin, Heiko, no problem with this patch?


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 08/12] add trace events for each syscall entry/exit
  2009-08-25 19:51                   ` Mathieu Desnoyers
@ 2009-08-26  0:19                     ` Frederic Weisbecker
  2009-08-26  0:42                       ` Mathieu Desnoyers
  0 siblings, 1 reply; 88+ messages in thread
From: Frederic Weisbecker @ 2009-08-26  0:19 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Hendrik Brueckner, Jason Baron, linux-kernel, mingo, laijs,
	rostedt, peterz, jiayingz, mbligh, lizf, Heiko Carstens,
	Martin Schwidefsky

On Tue, Aug 25, 2009 at 03:51:11PM -0400, Mathieu Desnoyers wrote:
> * Frederic Weisbecker (fweisbec@gmail.com) wrote:
> > On Tue, Aug 25, 2009 at 02:31:19PM -0400, Mathieu Desnoyers wrote:
> > > (Well, I do not have time currently to look into the gory details
> > > (sorry), but let's try to take a step back from the problem.)
> > > 
> > > The design proposal for this kthread behavior wrt syscalls is based on a
> > > very specific and current kernel behavior, that may happen to change and
> > > that I have actually seen proven incorrect. For instance, some
> > > proprietary Linux driver does very odd things with system calls within
> > > kernel threads, like invoking them with int 0x80.
> > > 
> > > Yes, this is odd, but do we really want to tie the tracer that much to
> > > the actual OS implementation specificities ?
> > 
> > 
> > I really can't see the point in doing this. I don't expect the kernel
> > behaviour to change soon and have explicit syscalls interrupts done
> > from it. It's not about a current kernel implementation fashion,
> > it's about kernel design sanity that is not likely to go backward.
> > 
> > Is it worth it to trace kernel threads, maintain their tracing
> > specificities (such as workarounds with ret_from_fork that implies)
> > just because we want to support tracing on some silly proprietary drivers?
> > 
> > 
> > > 
> > > That sounds like a recipe for endless breakages and missing bits of
> > > instrumentation.
> > >
> > > So my advice would be: if we want to trace the syscall entry/exit paths,
> > > let's trace them for the _whole_ system, and find ways to make it work
> > > for corner-cases rather than finding clever ways to diminish
> > > instrumentation coverage.
> > 
> > 
> > If developers of out of tree drivers want to implement buggy things
> > that would never be accepted after a minimal review here, and then instrument
> > their bugs, then I would suggest them to implement their own ad hoc instrumentation,
> > really :-/
> > 
> > What's the point in supporting out of tree bugs?
> > 
> > Well, the only advantage of doing this would be to support reverse engineering
> > in tiny and rare corner cases. Not that worth the effort.
> > 
> >  
> > > Given the ret from fork example happens to be the first event fired
> > > after the thread is created, we should be able to deal with this problem
> > > by initializing the thread structure used by syscall exit tracing to an
> > > initial "ret from fork" value.
> > > 
> > > Mathieu
> > 
> > 
> > It means we have to support and check this corner case in every archs
> > that support syscall tracing, deal with crashes because we omitted it, etc...
> > 
> > For all the things I've explained above I don't think it's worth the effort.
> > 
> > But it's just my opinion...
> > 
> 
> Then we might want to explicitly require that calls to sys_*() system
> calls made from within the kernel pass through another instrumentation
> mechanism. IMHO, that would make sense. It would cover both system calls
> made from kernel threads and system calls made from within a system call
> or trap.
> 
> Mathieu


Well, we can't really set a tracepoint per sys_*() function. Or more
precisely, we already have them, automagically generated and relying on
the sysenter ptrace path.

But if we want to check which syscalls are called from kernel threads, we have:

- kthread() -> do_exit()


The entry point of every kernel thread (except "kthreadd") is
kthread(). It calls do_exit() in the end.

If we want to trace the exit of a kernel thread, we can put
a tracepoint there instead of in do_exit(), whose results would
be intermixed with sys_exit() tracing.


- kthreadd :: create_kthread() -> kernel_thread() -> do_fork()


The creation of a thread is the result of a fork() by the kthreadd thread.
If we want to trace the creation of kernel threads, we can again do that
at the upper level: kernel_thread().

But does that tell us who created the thread? All we would see
is kthreadd forking. That is very poor information compared
to a userspace fork(), which tells us who really created the new process.

Instead, what we probably want is to trace kthread_create(), which queues
the thread creation job to the kthreadd thread, so that we know
_who_ asked for this thread creation (the requesting process and the callsite).
And that's much richer in information.

Well, you can even climb to an upper layer and check whether this is a
workqueue, a kernel/async.c thread, a slow work, etc...


- kernel_execve() -> sys_execve()

We can execute user apps from the kernel through call_usermodehelper().
And we can trace kernel_execve(), or again an upper layer
like call_usermodehelper().

- ... I guess there are other examples

The kernel calls syscalls through wrappers, and tracing these wrappers,
at whichever layer gives the desired level of information (choose your
layer), is much more verbose / rich in information.
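
To make the kthread_create() vs. fork point concrete, here is a small
userspace model. It is only a sketch and not kernel code: all *_sketch
names, the queue size and the event layout are hypothetical, and the
"kthreadd" loop merely imitates the hand-off described above. The event
recorded at the request layer carries the real requester, while the event
recorded at the fork layer can only ever name kthreadd.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_REQ_SKETCH	8
#define MAX_EVT_SKETCH	32

/* A pending creation request, queued by whoever wants a new thread. */
struct create_req_sketch {
	const char *requester;
	const char *thread_name;
};

/* One recorded trace event: which layer saw it and who was running. */
struct trace_event_sketch {
	char layer[24];
	char actor[24];
	char thread_name[24];
};

static struct create_req_sketch queue_sketch[MAX_REQ_SKETCH];
static int queue_len_sketch;
static struct trace_event_sketch events_sketch[MAX_EVT_SKETCH];
static int events_len_sketch;

static void trace_sketch(const char *layer, const char *actor, const char *name)
{
	struct trace_event_sketch *e;

	if (events_len_sketch == MAX_EVT_SKETCH)
		return;		/* log full: drop the event */
	e = &events_sketch[events_len_sketch++];
	snprintf(e->layer, sizeof(e->layer), "%s", layer);
	snprintf(e->actor, sizeof(e->actor), "%s", actor);
	snprintf(e->thread_name, sizeof(e->thread_name), "%s", name);
}

/* Upper layer: runs in the requester's context, so the requester is known. */
static int kthread_create_sketch(const char *requester, const char *name)
{
	if (queue_len_sketch == MAX_REQ_SKETCH)
		return -1;
	trace_sketch("kthread_create", requester, name);
	queue_sketch[queue_len_sketch].requester = requester;
	queue_sketch[queue_len_sketch].thread_name = name;
	queue_len_sketch++;
	return 0;
}

/* Lower layer: runs in kthreadd's context, so every fork looks the same. */
static void kthreadd_run_sketch(void)
{
	int i;

	for (i = 0; i < queue_len_sketch; i++)
		trace_sketch("fork", "kthreadd", queue_sketch[i].thread_name);
	queue_len_sketch = 0;
}
```

After one request followed by one kthreadd_run_sketch() pass, the log
holds two events for the same thread: the first names the requester, the
second names only kthreadd.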


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 08/12] add trace events for each syscall entry/exit
  2009-08-26  0:19                     ` Frederic Weisbecker
@ 2009-08-26  0:42                       ` Mathieu Desnoyers
  2009-08-26  7:28                         ` Ingo Molnar
  0 siblings, 1 reply; 88+ messages in thread
From: Mathieu Desnoyers @ 2009-08-26  0:42 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Hendrik Brueckner, Jason Baron, linux-kernel, mingo, laijs,
	rostedt, peterz, jiayingz, mbligh, lizf, Heiko Carstens,
	Martin Schwidefsky

* Frederic Weisbecker (fweisbec@gmail.com) wrote:
> On Tue, Aug 25, 2009 at 03:51:11PM -0400, Mathieu Desnoyers wrote:
> > * Frederic Weisbecker (fweisbec@gmail.com) wrote:
> > > On Tue, Aug 25, 2009 at 02:31:19PM -0400, Mathieu Desnoyers wrote:
> > > > (Well, I do not have time currently to look into the gory details
> > > > (sorry), but let's try to take a step back from the problem.)
> > > > 
> > > > The design proposal for this kthread behavior wrt syscalls is based on a
> > > > very specific and current kernel behavior, that may happen to change and
> > > > that I have actually seen proven incorrect. For instance, some
> > > > proprietary Linux driver does very odd things with system calls within
> > > > kernel threads, like invoking them with int 0x80.
> > > > 
> > > > Yes, this is odd, but do we really want to tie the tracer that much to
> > > > the actual OS implementation specificities ?
> > > 
> > > 
> > > I really can't see the point in doing this. I don't expect the kernel
> > > behaviour to change soon and have explicit syscalls interrupts done
> > > from it. It's not about a current kernel implementation fashion,
> > > it's about kernel design sanity that is not likely to go backward.
> > > 
> > > Is it worth it to trace kernel threads, maintain their tracing
> > > specificities (such as workarounds with ret_from_fork that implies)
> > > just because we want to support tracing on some silly proprietary drivers?
> > > 
> > > 
> > > > 
> > > > That sounds like a recipe for endless breakages and missing bits of
> > > > instrumentation.
> > > >
> > > > So my advice would be: if we want to trace the syscall entry/exit paths,
> > > > let's trace them for the _whole_ system, and find ways to make it work
> > > > for corner-cases rather than finding clever ways to diminish
> > > > instrumentation coverage.
> > > 
> > > 
> > > If developers of out of tree drivers want to implement buggy things
> > > that would never be accepted after a minimal review here, and then instrument
> > > their bugs, then I would suggest them to implement their own ad hoc instrumentation,
> > > really :-/
> > > 
> > > What's the point in supporting out of tree bugs?
> > > 
> > > Well, the only advantage of doing this would be to support reverse engineering
> > > in tiny and rare corner cases. Not that worth the effort.
> > > 
> > >  
> > > > Given the ret from fork example happens to be the first event fired
> > > > after the thread is created, we should be able to deal with this problem
> > > > by initializing the thread structure used by syscall exit tracing to an
> > > > initial "ret from fork" value.
> > > > 
> > > > Mathieu
> > > 
> > > 
> > > It means we have to support and check this corner case in every archs
> > > that support syscall tracing, deal with crashes because we omitted it, etc...
> > > 
> > > For all the things I've explained above I don't think it's worth the effort.
> > > 
> > > But it's just my opinion...
> > > 
> > 
> > Then we might want to explicitly require that calls to sys_*() system
> > calls made from within the kernel pass through another instrumentation
> > mechanism. IMHO, that would make sense. It would cover both system calls
> > made from kernel threads and system calls made from within a system call
> > or trap.
> > 
> > Mathieu
> 
> 
> Well, we can't really set a tracepoint per sys_*() function. Or more
> precisely, we already have them, automagically generated and relying on
> the sysenter ptrace path.
> 
> But if we want to check which syscalls are called from kernel threads, we have:
> 
> - kthread() -> do_exit()
> 
> 
> The entry point of every kernel thread (except "kthreadd") is
> kthread(). It calls do_exit() in the end.
> 
> If we want to trace the exit of a kernel thread, we can put
> a tracepoint there instead of in do_exit(), whose results would
> be intermixed with sys_exit() tracing.
> 
> 
> - kthreadd :: create_kthread() -> kernel_thread() -> do_fork()
> 
> 
> The creation of a thread is the result of a fork() by the kthreadd thread.
> If we want to trace the creation of kernel threads, we can again do that
> at the upper level: kernel_thread().
> 
> But does that tell us who created the thread? All we would see
> is kthreadd forking. That is very poor information compared
> to a userspace fork(), which tells us who really created the new process.
> 
> Instead, what we probably want is to trace kthread_create(), which queues
> the thread creation job to the kthreadd thread, so that we know
> _who_ asked for this thread creation (the requesting process and the callsite).
> And that's much richer in information.
> 
> Well, you can even climb to an upper layer and check whether this is a
> workqueue, a kernel/async.c thread, a slow work, etc...
> 
> 
> - kernel_execve() -> sys_execve()
> 
> We can execute user apps from the kernel through call_usermodehelper().
> And we can trace kernel_execve(), or again an upper layer
> like call_usermodehelper().
> 
> - ... I guess there are other examples
> 
> The kernel calls syscalls through wrappers, and tracing these wrappers,
> at whichever layer gives the desired level of information (choose your
> layer), is much more verbose / rich in information.
> 

What you describe looks a lot like the approach I use in the LTTng tree.
Actually, the main point I am trying to make here is: if we rely only on
tracing at the syscall entry/exit level for, say, monitoring all uses of
sys_open(), we might be caught off guard by internal sys_open() uses
within the kernel.

sys_open is just an example (and possibly a bad one), but I am just
saying that syscall entry/exit tracing should not be seen as a complete
replacement of tracepoints added within the most important system call
sites if we plan to keep track of the overall kernel activity.

But we can do that incrementally, and it's only partially related to
syscall entry/exit instrumentation. Actually, if we find out that we
have to add instrumentation within the kernel code for a relatively
large quantity of system calls, going through the current effort to
extract the system call arguments might be unnecessary if we eventually
end up extracting those arguments from tracepoints placed in the sys_*()
implementation.

Mathieu



-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 08/12] add trace events for each syscall entry/exit
  2009-08-25 18:31               ` Mathieu Desnoyers
  2009-08-25 19:42                 ` Frederic Weisbecker
  2009-08-25 22:04                 ` Martin Schwidefsky
@ 2009-08-26  6:21                 ` Peter Zijlstra
  2009-08-26 17:08                   ` Mathieu Desnoyers
  2009-08-26  7:10                 ` Peter Zijlstra
  3 siblings, 1 reply; 88+ messages in thread
From: Peter Zijlstra @ 2009-08-26  6:21 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Frederic Weisbecker, Hendrik Brueckner, Jason Baron,
	linux-kernel, mingo, laijs, rostedt, jiayingz, mbligh, lizf,
	Heiko Carstens, Martin Schwidefsky

On Tue, 2009-08-25 at 14:31 -0400, Mathieu Desnoyers wrote:

> (Well, I do not have time currently to look into the gory details
> (sorry), but let's try to take a step back from the problem.)
> 
> The design proposal for this kthread behavior wrt syscalls is based on a
> very specific and current kernel behavior, that may happen to change and
> that I have actually seen proven incorrect. For instance, some
> proprietary Linux driver does very odd things with system calls within
> kernel threads, like invoking them with int 0x80.
> 
> Yes, this is odd, but do we really want to tie the tracer that much to
> the actual OS implementation specificities ?
> 
> That sounds like a recipe for endless breakages and missing bits of
> instrumentation.
> 
> So my advice would be: if we want to trace the syscall entry/exit paths,
> let's trace them for the _whole_ system, and find ways to make it work
> for corner-cases rather than finding clever ways to diminish
> instrumentation coverage.
> 
> Given the ret from fork example happens to be the first event fired
> after the thread is created, we should be able to deal with this problem
> by initializing the thread structure used by syscall exit tracing to an
> initial "ret from fork" value.

So you're saying we should let proprietary crap influence the design of
the kernel in any way?

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 08/12] add trace events for each syscall entry/exit
  2009-08-25 19:42                 ` Frederic Weisbecker
  2009-08-25 19:51                   ` Mathieu Desnoyers
@ 2009-08-26  6:48                   ` Peter Zijlstra
  1 sibling, 0 replies; 88+ messages in thread
From: Peter Zijlstra @ 2009-08-26  6:48 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Mathieu Desnoyers, Hendrik Brueckner, Jason Baron, linux-kernel,
	mingo, laijs, rostedt, jiayingz, mbligh, lizf, Heiko Carstens,
	Martin Schwidefsky

On Tue, 2009-08-25 at 21:42 +0200, Frederic Weisbecker wrote:
> On Tue, Aug 25, 2009 at 02:31:19PM -0400, Mathieu Desnoyers wrote:
> > (Well, I do not have time currently to look into the gory details
> > (sorry), but let's try to take a step back from the problem.)
> > 
> > The design proposal for this kthread behavior wrt syscalls is based on a
> > very specific and current kernel behavior, that may happen to change and
> > that I have actually seen proven incorrect. For instance, some
> > proprietary Linux driver does very odd things with system calls within
> > kernel threads, like invoking them with int 0x80.
> > 
> > Yes, this is odd, but do we really want to tie the tracer that much to
> > the actual OS implementation specificities ?
> 
> 
> I really can't see the point in doing this. I don't expect the kernel
> behaviour to change soon and have explicit syscalls interrupts done
> from it. It's not about a current kernel implementation fashion,
> it's about kernel design sanity that is not likely to go backward.
> 
> Is it worth it to trace kernel threads, maintain their tracing
> specificities (such as workarounds with ret_from_fork that implies)
> just because we want to support tracing on some silly proprietary drivers?
> 
> 
> > 
> > That sounds like a recipe for endless breakages and missing bits of
> > instrumentation.
> >
> > So my advice would be: if we want to trace the syscall entry/exit paths,
> > let's trace them for the _whole_ system, and find ways to make it work
> > for corner-cases rather than finding clever ways to diminish
> > instrumentation coverage.
> 
> 
> If developers of out of tree drivers want to implement buggy things
> that would never be accepted after a minimal review here, and then instrument
> their bugs, then I would suggest them to implement their own ad hoc instrumentation,
> really :-/
> 
> What's the point in supporting out of tree bugs?
> 
> Well, the only advantage of doing this would be to support reverse engineering
> in tiny and rare corner cases. Not that worth the effort.
> 
>  
> > Given the ret from fork example happens to be the first event fired
> > after the thread is created, we should be able to deal with this problem
> > by initializing the thread structure used by syscall exit tracing to an
> > initial "ret from fork" value.
> > 
> > Mathieu
> 
> 
> It means we have to support and check this corner case in every archs
> that support syscall tracing, deal with crashes because we omitted it, etc...
> 
> For all the things I've explained above I don't think it's worth the effort.
> 
> But it's just my opinion...

I fully agree, let out of tree people deal with their own crap.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 08/12] add trace events for each syscall entry/exit
  2009-08-25 18:31               ` Mathieu Desnoyers
                                   ` (2 preceding siblings ...)
  2009-08-26  6:21                 ` Peter Zijlstra
@ 2009-08-26  7:10                 ` Peter Zijlstra
  2009-08-26 17:10                   ` Mathieu Desnoyers
  2009-08-26 17:24                   ` H. Peter Anvin
  3 siblings, 2 replies; 88+ messages in thread
From: Peter Zijlstra @ 2009-08-26  7:10 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Frederic Weisbecker, Hendrik Brueckner, Jason Baron,
	linux-kernel, mingo, laijs, rostedt, jiayingz, mbligh, lizf,
	Heiko Carstens, Martin Schwidefsky, Ingo Molnar, Thomas Gleixner,
	hpa

On Tue, 2009-08-25 at 14:31 -0400, Mathieu Desnoyers wrote:
> For instance, some
> proprietary Linux driver does very odd things with system calls within
> kernel threads, like invoking them with int 0x80.

So who is going to send the x86 patch to make int 0x80 from kernel space
panic the machine? :-)

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 08/12] add trace events for each syscall entry/exit
  2009-08-26  0:42                       ` Mathieu Desnoyers
@ 2009-08-26  7:28                         ` Ingo Molnar
  2009-08-26 17:11                           ` Mathieu Desnoyers
  0 siblings, 1 reply; 88+ messages in thread
From: Ingo Molnar @ 2009-08-26  7:28 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Frederic Weisbecker, Hendrik Brueckner, Jason Baron,
	linux-kernel, laijs, rostedt, peterz, jiayingz, mbligh, lizf,
	Heiko Carstens, Martin Schwidefsky


* Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:

> * Frederic Weisbecker (fweisbec@gmail.com) wrote:
> > On Tue, Aug 25, 2009 at 03:51:11PM -0400, Mathieu Desnoyers wrote:
> > > * Frederic Weisbecker (fweisbec@gmail.com) wrote:
> > > > On Tue, Aug 25, 2009 at 02:31:19PM -0400, Mathieu Desnoyers wrote:
> > > > > (Well, I do not have time currently to look into the gory details
> > > > > (sorry), but let's try to take a step back from the problem.)
> > > > > 
> > > > > The design proposal for this kthread behavior wrt syscalls is based on a
> > > > > very specific and current kernel behavior, that may happen to change and
> > > > > that I have actually seen proven incorrect. For instance, some
> > > > > proprietary Linux driver does very odd things with system calls within
> > > > > kernel threads, like invoking them with int 0x80.
> > > > > 
> > > > > Yes, this is odd, but do we really want to tie the tracer that much to
> > > > > the actual OS implementation specificities ?
> > > > 
> > > > 
> > > > I really can't see the point in doing this. I don't expect the kernel
> > > > behaviour to change soon and have explicit syscalls interrupts done
> > > > from it. It's not about a current kernel implementation fashion,
> > > > it's about kernel design sanity that is not likely to go backward.
> > > > 
> > > > Is it worth it to trace kernel threads, maintain their tracing
> > > > specificities (such as workarounds with ret_from_fork that implies)
> > > > just because we want to support tracing on some silly proprietary drivers?
> > > > 
> > > > 
> > > > > 
> > > > > That sounds like a recipe for endless breakages and missing bits of
> > > > > instrumentation.
> > > > >
> > > > > So my advice would be: if we want to trace the syscall entry/exit paths,
> > > > > let's trace them for the _whole_ system, and find ways to make it work
> > > > > for corner-cases rather than finding clever ways to diminish
> > > > > instrumentation coverage.
> > > > 
> > > > 
> > > > If developers of out of tree drivers want to implement buggy things
> > > > that would never be accepted after a minimal review here, and then instrument
> > > > their bugs, then I would suggest them to implement their own ad hoc instrumentation,
> > > > really :-/
> > > > 
> > > > What's the point in supporting out of tree bugs?
> > > > 
> > > > Well, the only advantage of doing this would be to support reverse engineering
> > > > in tiny and rare corner cases. Not that worth the effort.
> > > > 
> > > >  
> > > > > Given the ret from fork example happens to be the first event fired
> > > > > after the thread is created, we should be able to deal with this problem
> > > > > by initializing the thread structure used by syscall exit tracing to an
> > > > > initial "ret from fork" value.
> > > > > 
> > > > > Mathieu
> > > > 
> > > > 
> > > > It means we have to support and check this corner case in every archs
> > > > that support syscall tracing, deal with crashes because we omitted it, etc...
> > > > 
> > > > For all the things I've explained above I don't think it's worth the effort.
> > > > 
> > > > But it's just my opinion...
> > > > 
> > > 
> > > Then we might want to explicitly require that calls to sys_*() system
> > > calls made from within the kernel pass through another instrumentation
> > > mechanism. IMHO, that would make sense. It would cover both system calls
> > > made from kernel threads and system calls made from within a system call
> > > or trap.
> > > 
> > > Mathieu
> > 
> > 
> > Well, we can't really set a tracepoint per sys_*() function. Or more
> > precisely we already have them, automagically generated and relying on
> > sysenter ptrace path.
> > 
> > But if we want to check which syscalls are called from kernel threads, we have:
> > 
> > - kthread() -> do_exit()
> > 
> > 
> > The entry point of every kernel threads (except "kthreadd") is
> > kthread(). It calls do_exit() in the end.
> > 
> > If we want to trace the exit of a kernel thread, we can put
> > a tracepoint there instead of do_exit() which results would
> > be intermixed with sys_exit() tracing.
> > 
> > 
> > - kthreadd :: create_kthread() -> kernel_thread() -> do_fork()
> > 
> > 
> > A creation of a thread is the result of the kthreadd thread fork().
> > If we want to trace the creation of kernel threads, we can again do that
> > in the upper level: kernel_thread().
> > 
> > But does that inform us about who created the thread? All we would see
> > is kthreadd that forks. This is a very poor information compared
> > to a userspace fork() that tells us who really created the new process.
> > 
> > Instead what we want is probably to trace kthread_create() which inserts the
> > job of a thread creation in the kthreadd thread, so that we know
> > _who_ asked for this thread creation (process that requested it and callsite).
> > And that's much more rich in information.
> > 
> > Well, you can even climb in an upper layer and look if this is a workqueue,
> > a kernel/async.c thread, a slow work, etc...
> > 
> > 
> > - kernel_execve() -> sys_execve()
> > 
> > We can execute user apps from kernel through call_usermodehelper().
> > And we can trace kernel_execve() or again in an upper layer
> > like call_usermodehelper()
> > 
> > - ... I guess there are other examples
> > 
> > The kernel calls syscalls through wrappers, and tracing these 
> > wrappers, depending of the desired level of informations we want 
> > (choose your layer), are much more verbose / rich in 
> > informations.
> 
> What you describe looks a lot like the approach I use in the LTTng 
> tree. Actually, the main point I am trying to make here is: if we 
> rely only on tracing at the syscall entry/exit level for, say, 
> monitoring all uses of e.g. sys_open(), we might be caught 
> offguard by internal sys_open() uses within the kernel.

There's a lot of 'internal' file opening going on within the kernel 
that ptrace does not notice - see all the filp_open() calls.

Let's worry about this only if it's a true issue.

	Ingo


* Re: [PATCH 08/12] add trace events for each syscall entry/exit
  2009-08-25 22:04                 ` Martin Schwidefsky
@ 2009-08-26  7:38                   ` Heiko Carstens
  2009-08-26 12:32                     ` Frederic Weisbecker
  0 siblings, 1 reply; 88+ messages in thread
From: Heiko Carstens @ 2009-08-26  7:38 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Mathieu Desnoyers, Frederic Weisbecker, Hendrik Brueckner,
	Jason Baron, linux-kernel, mingo, laijs, rostedt, peterz,
	jiayingz, mbligh, lizf

On Wed, Aug 26, 2009 at 12:04:26AM +0200, Martin Schwidefsky wrote:
> On Tue, 25 Aug 2009 14:31:19 -0400
> Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> > The design proposal for this kthread behavior wrt syscalls is based on a
> > very specific and current kernel behavior, that may happen to change and
> > that I have actually seen proven incorrect. For instance, some
> > proprietary Linux driver does very odd things with system calls within
> > kernel threads, like invoking them with int 0x80.

That's broken. Some proprietary drivers even change the system call table.
Do you want to be able to deal with that as well?

> > Yes, this is odd, but do we really want to tie the tracer that much to
> > the actual OS implementation specificities ?
> > 
> > That sounds like a recipe for endless breakages and missing bits of
> > instrumentation.
> > 
> > So my advice would be: if we want to trace the syscall entry/exit paths,
> > let's trace them for the _whole_ system, and find ways to make it work
> > for corner-cases rather than finding clever ways to diminish
> > instrumentation coverage.
> 
> I guess that the real reason for the crash is hidden in the initialization
> of the pt_regs structure of the kernel thread.

On s390 the reason is that the svcnr field in the pt_regs structure of the initial
kernel thread is initialized to 0. svcnr contains the system call number,
and system call number 0 does not exist.
That's why we have

static inline long syscall_get_nr(struct task_struct *task,
				  struct pt_regs *regs)
{
	return regs->svcnr ? regs->svcnr : -1;
}

Now, if you fork a kernel thread from the initial task the pt_regs structure
gets copied. Upon ret_from_fork the trace exit path will get -1 for
syscall_get_nr().
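
A minimal user-space sketch of the copy described above, not kernel code.
struct model_regs, model_syscall_get_nr() and model_copy_regs() are
hypothetical names, and the svcnr field width is an assumption made for
illustration. It only shows that a register frame copied from a task whose
svcnr is 0 yields -1 on the exit path:

```c
/* Hypothetical stand-in for the single pt_regs field that matters here. */
struct model_regs {
	unsigned short svcnr;	/* 0 means "no system call in progress" */
};

/* Same logic as the s390 helper quoted above, without the task argument. */
static long model_syscall_get_nr(const struct model_regs *regs)
{
	/* The cast keeps -1 from being converted to an unsigned type. */
	return regs->svcnr ? (long)regs->svcnr : -1;
}

/* Models fork copying the parent's register frame into the child. */
static struct model_regs model_copy_regs(const struct model_regs *parent)
{
	return *parent;
}
```

Any caller that uses the result as a bitmap index has to reject -1 first,
which is the check being discussed in this thread.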
 
> > Given the ret from fork example happens to be the first event fired
> > after the thread is created, we should be able to deal with this problem
> > by initializing the thread structure used by syscall exit tracing to an
> > initial "ret from fork" value.
> 
> That is my best guess as well.

What would that value be? __NR_fork?

Syscall tracing of kernel threads seems to be wrong. If somebody did a
"modprobe" and the init function of the module created a kernel thread,
then syscall_get_nr() at the ret_from_fork path of the kernel thread would
return __NR_init_module. That is of course only true if the old kernel_thread()
API were used. For kthread_create() it would return the syscall of the
thread from which the kthread daemon was forked (the initial process I would
guess, which was initialized to 0).

So skipping kernel threads at the exit path seems to be the best fix, IMHO ;)

---
 kernel/trace/trace_syscalls.c |    2 ++
 1 file changed, 2 insertions(+)

Index: linux-next/kernel/trace/trace_syscalls.c
===================================================================
--- linux-next.orig/kernel/trace/trace_syscalls.c
+++ linux-next/kernel/trace/trace_syscalls.c
@@ -253,6 +253,8 @@ void ftrace_syscall_exit(struct pt_regs 
 	struct ring_buffer_event *event;
 	int syscall_nr;
 
+	if (!current->mm)
+		return;
 	syscall_nr = syscall_get_nr(current, regs);
 	if (!test_bit(syscall_nr, enabled_exit_syscalls))
 		return;
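
The effect of the guard in the patch above can be modeled in user space as
follows. This is only a sketch under stated assumptions: struct model_task,
MODEL_NR_SYSCALLS and the model_* helpers are hypothetical, and the range
check is written in the spirit of the syscall_nr check suggested earlier in
this thread, with bounds chosen here for illustration.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NR_SYSCALLS	64	/* hypothetical table size */
#define MODEL_BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)
#define MODEL_WORDS \
	((MODEL_NR_SYSCALLS + MODEL_BITS_PER_LONG - 1) / MODEL_BITS_PER_LONG)

/* Hypothetical stand-in for the parts of task_struct used here. */
struct model_task {
	void *mm;		/* NULL for a kernel thread */
	long syscall_nr;	/* what syscall_get_nr() would return */
};

/* One bit per syscall whose exit event is enabled. */
static unsigned long model_enabled_exit[MODEL_WORDS];

static void model_enable_exit(unsigned int nr)
{
	model_enabled_exit[nr / MODEL_BITS_PER_LONG] |=
		1UL << (nr % MODEL_BITS_PER_LONG);
}

static bool model_test_bit(unsigned long nr, const unsigned long *map)
{
	return (map[nr / MODEL_BITS_PER_LONG] >> (nr % MODEL_BITS_PER_LONG)) & 1UL;
}

/* Returns true when an exit event would be recorded for this task. */
static bool model_syscall_exit(const struct model_task *t)
{
	if (!t->mm)		/* the guard added by the patch above */
		return false;
	if (t->syscall_nr < 0 || t->syscall_nr >= MODEL_NR_SYSCALLS)
		return false;	/* never index the bitmap with -1 */
	return model_test_bit((unsigned long)t->syscall_nr, model_enabled_exit);
}
```

With both checks in place, a task without an mm and a task whose syscall
number was reset to -1 are dropped before the bitmap is touched.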


* Re: [PATCH 08/12] add trace events for each syscall entry/exit
  2009-08-25 22:09       ` Frederic Weisbecker
@ 2009-08-26  7:47         ` Heiko Carstens
  0 siblings, 0 replies; 88+ messages in thread
From: Heiko Carstens @ 2009-08-26  7:47 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Martin Schwidefsky, Hendrik Brueckner, Jason Baron, linux-kernel,
	mingo, laijs, rostedt, peterz, mathieu.desnoyers, jiayingz,
	mbligh, lizf

On Wed, Aug 26, 2009 at 12:09:01AM +0200, Frederic Weisbecker wrote:
> On Tue, Aug 25, 2009 at 11:40:21PM +0200, Frederic Weisbecker wrote:
> > On Tue, Aug 25, 2009 at 02:50:27PM +0200, Hendrik Brueckner wrote:
> > > Most arch syscall_get_nr() implementations return -1 if the syscall
> > > number is not valid.  Accessing the bit field without a check might
> > > result in a kernel oops (at least I saw it on s390 for ftrace selftest).
> > > 
> > > Before this change, this problem did not occur, because the invalid
> > > syscall number (-1) caused syscall_nr_to_meta() to return NULL.
> > > 
> > > There are at least two scenarios where syscall_get_nr() can return -1:
> > > 
> > > 1. For example, ptrace stores an invalid syscall number, and thus,
> > >    tracing code resets it.
> > >    (see do_syscall_trace_enter in arch/s390/kernel/ptrace.c)
> > > 
> > > 2. The syscall_regfunc() (kernel/tracepoint.c) sets the TIF_SYSCALL_FTRACE
> > >    (now: TIF_SYSCALL_TRACEPOINT) flag for all threads which includes
> > >    kernel threads.
> > >    However, the ftrace selftest triggers a kernel oops when testing syscall
> > >    trace points:
> > > >       - The kernel thread is started as usual (do_fork()),
> > >       - tracing code sets TIF_SYSCALL_FTRACE,
> > >       - the ret_from_fork() function is triggered and starts
> > > 	ftrace_syscall_exit() with an invalid syscall number.
> > > 
> > > To avoid these scenarios, I suggest to check the syscall_nr.
> > > 
> > > [...]
> > > 
> > > Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
> > 
> > 
> > 
> > I'm queueing this one for .32
> > 
> > Thanks.
> 
> Btw it would be nice to have an ack from s390 maintainers.
> Martin, Heiko, no problem with this patch?

Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>


* Re: [PATCH 08/12] add trace events for each syscall entry/exit
  2009-08-26  7:38                   ` Heiko Carstens
@ 2009-08-26 12:32                     ` Frederic Weisbecker
  0 siblings, 0 replies; 88+ messages in thread
From: Frederic Weisbecker @ 2009-08-26 12:32 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Martin Schwidefsky, Mathieu Desnoyers, Hendrik Brueckner,
	Jason Baron, linux-kernel, mingo, laijs, rostedt, peterz,
	jiayingz, mbligh, lizf

On Wed, Aug 26, 2009 at 09:38:20AM +0200, Heiko Carstens wrote:
> On Wed, Aug 26, 2009 at 12:04:26AM +0200, Martin Schwidefsky wrote:
> > On Tue, 25 Aug 2009 14:31:19 -0400
> > Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> > > The design proposal for this kthread behavior wrt syscalls is based on a
> > > very specific and current kernel behavior, that may happen to change and
> > > that I have actually seen proven incorrect. For instance, some
> > > proprietary Linux driver does very odd things with system calls within
> > > kernel threads, like invoking them with int 0x80.
> 
> That's broken. Some proprietary drivers even change the system call table.
> Do you want to be able to deal with that as well?
> 
> > > Yes, this is odd, but do we really want to tie the tracer that much to
> > > the actual OS implementation specificities ?
> > > 
> > > That sounds like a recipe for endless breakages and missing bits of
> > > instrumentation.
> > > 
> > > So my advice would be: if we want to trace the syscall entry/exit paths,
> > > let's trace them for the _whole_ system, and find ways to make it work
> > > for corner-cases rather than finding clever ways to diminish
> > > instrumentation coverage.
> > 
> > I guess that the real reason for the crash is hidden in the initialization
> > of the pt_regs structure of the kernel thread.
> 
> On s390 the reason is that the svcnr field in the pt_regs structure of the initial
> kernel thread is initialized to 0. svcnr contains the system call number,
> and system call number 0 does not exist.
> That's why we have
> 
> static inline long syscall_get_nr(struct task_struct *task,
> 				  struct pt_regs *regs)
> {
> 	return regs->svcnr ? regs->svcnr : -1;
> }
> 
> Now, if you fork a kernel thread from the initial task the pt_regs structure
> gets copied. Upon ret_from_fork the trace exit path will get -1 for
> syscall_get_nr().
>  
> > > Given the ret from fork example happens to be the first event fired
> > > after the thread is created, we should be able to deal with this problem
> > > by initializing the thread structure used by syscall exit tracing to an
> > > initial "ret from fork" value.
> > 
> > That is my best guess as well.
> 
> What would that value be? __NR_fork?
> 
> Syscall tracing of kernel threads seems to be wrong. If somebody did a
> "modprobe" and the init function of the module created a kernel thread,
> then syscall_get_nr() at the ret_from_fork path of the kernel thread would
> return __NR_init_module. That is of course only true if the old kernel_thread()
> API were used. For kthread_create() it would return the syscall of the
> thread from which the kthread daemon was forked (the initial process I would
> guess, which was initialized to 0).
> 
> So skipping kernel threads at the exit path seems to be the best fix, IMHO ;)


Yeah, we can decide to trace syscalls from kernel, but doing so through
the current syscalls tracepoints is broken.


 
> ---
>  kernel/trace/trace_syscalls.c |    2 ++
>  1 file changed, 2 insertions(+)
> 
> Index: linux-next/kernel/trace/trace_syscalls.c
> ===================================================================
> --- linux-next.orig/kernel/trace/trace_syscalls.c
> +++ linux-next/kernel/trace/trace_syscalls.c
> @@ -253,6 +253,8 @@ void ftrace_syscall_exit(struct pt_regs 
>  	struct ring_buffer_event *event;
>  	int syscall_nr;
>  
> +	if (!current->mm)
> +		return;


Hendrik Brueckner already beat you to it and sent
a patch that skips setting TIF_SYSCALL_TRACEPOINT for
kernel threads.

I'll add your Acked-by to it, thanks!


>  	syscall_nr = syscall_get_nr(current, regs);
>  	if (!test_bit(syscall_nr, enabled_exit_syscalls))
>  		return;



* Re: [PATCH 08/12] add trace events for each syscall entry/exit
  2009-08-25 16:02       ` Hendrik Brueckner
  2009-08-25 16:20         ` Mathieu Desnoyers
@ 2009-08-26 12:35         ` Frederic Weisbecker
  2009-08-26 12:59           ` Heiko Carstens
  2009-08-26 14:41           ` Hendrik Brueckner
  2009-08-28 12:28         ` [tip:tracing/core] tracing: Don't trace kernel thread syscalls tip-bot for Hendrik Brueckner
  2 siblings, 2 replies; 88+ messages in thread
From: Frederic Weisbecker @ 2009-08-26 12:35 UTC (permalink / raw)
  To: Hendrik Brueckner, Jason Baron, linux-kernel, mingo, laijs,
	rostedt, peterz, mathieu.desnoyers, jiayingz, mbligh, lizf,
	Heiko Carstens, Martin Schwidefsky

On Tue, Aug 25, 2009 at 06:02:37PM +0200, Hendrik Brueckner wrote:
> On Tue, Aug 25, 2009 at 04:15:49PM +0200, Frederic Weisbecker wrote:
> > On Tue, Aug 25, 2009 at 02:50:27PM +0200, Hendrik Brueckner wrote:
> > > There are at least two scenarios where syscall_get_nr() can return -1:
> > > 
> > > 1. For example, ptrace stores an invalid syscall number, and thus,
> > >    tracing code resets it.
> > >    (see do_syscall_trace_enter in arch/s390/kernel/ptrace.c)
> > > 
> > > 2. The syscall_regfunc() (kernel/tracepoint.c) sets the TIF_SYSCALL_FTRACE
> > >    (now: TIF_SYSCALL_TRACEPOINT) flag for all threads which includes
> > >    kernel threads.
> > >    However, the ftrace selftest triggers a kernel oops when testing syscall
> > >    trace points:
> > > >       - The kernel thread is started as usual (do_fork()),
> > >       - tracing code sets TIF_SYSCALL_FTRACE,
> > >       - the ret_from_fork() function is triggered and starts
> > > 	ftrace_syscall_exit() with an invalid syscall number.
> > 
> > 
> > 
> > I wonder if there is any way to identify such situation...?
> For the second case, it might be an option to avoid setting the
> TIF_SYSCALL_FTRACE flag for kernel threads.
> 
> Kernel threads have task_struct->mm set to NULL.
> (Thanks to Heiko for that hint ;-)
> 
> The idea is then to check the mm field in syscall_regfunc() and
> set the flag accordingly.
> 
> However, I think the patch is an optional add-on because checking
> the syscall number is still required for case 1).
> 
> ---
>  kernel/tracepoint.c |    4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> --- a/kernel/tracepoint.c
> +++ b/kernel/tracepoint.c
> @@ -593,7 +593,9 @@ void syscall_regfunc(void)
>  	if (!sys_tracepoint_refcount) {
>  		read_lock_irqsave(&tasklist_lock, flags);
>  		do_each_thread(g, t) {
> -			set_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
> +			/* Skip kernel threads. */
> +			if (t->mm)
> +				set_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
>  		} while_each_thread(g, t);
>  		read_unlock_irqrestore(&tasklist_lock, flags);
>  	}
> 


Yeah, and as said before, tracing syscalls from kernel threads is
an interesting point but we can't do it that way.

I'm queuing this patch for .32, but I need your Signed-off-by to apply it :)

Thanks.



* Re: [PATCH 08/12] add trace events for each syscall entry/exit
  2009-08-26 12:35         ` Frederic Weisbecker
@ 2009-08-26 12:59           ` Heiko Carstens
  2009-08-26 13:30             ` Frederic Weisbecker
  2009-08-26 14:41           ` Hendrik Brueckner
  1 sibling, 1 reply; 88+ messages in thread
From: Heiko Carstens @ 2009-08-26 12:59 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Hendrik Brueckner, Jason Baron, linux-kernel, mingo, laijs,
	rostedt, peterz, mathieu.desnoyers, jiayingz, mbligh, lizf,
	Martin Schwidefsky

On Wed, Aug 26, 2009 at 02:35:52PM +0200, Frederic Weisbecker wrote:
> On Tue, Aug 25, 2009 at 06:02:37PM +0200, Hendrik Brueckner wrote:
> > On Tue, Aug 25, 2009 at 04:15:49PM +0200, Frederic Weisbecker wrote:
> > > On Tue, Aug 25, 2009 at 02:50:27PM +0200, Hendrik Brueckner wrote:
> > > > There are at least two scenarios where syscall_get_nr() can return -1:
> > > > 
> > > > 1. For example, ptrace stores an invalid syscall number, and thus,
> > > >    tracing code resets it.
> > > >    (see do_syscall_trace_enter in arch/s390/kernel/ptrace.c)
> > > > 
> > > > 2. The syscall_regfunc() (kernel/tracepoint.c) sets the TIF_SYSCALL_FTRACE
> > > >    (now: TIF_SYSCALL_TRACEPOINT) flag for all threads which includes
> > > >    kernel threads.
> > > >    However, the ftrace selftest triggers a kernel oops when testing syscall
> > > >    trace points:
> > > > >       - The kernel thread is started as usual (do_fork()),
> > > >       - tracing code sets TIF_SYSCALL_FTRACE,
> > > >       - the ret_from_fork() function is triggered and starts
> > > > 	ftrace_syscall_exit() with an invalid syscall number.
> > > 
> > > 
> > > 
> > > I wonder if there is any way to identify such situation...?
> > For the second case, it might be an option to avoid setting the
> > TIF_SYSCALL_FTRACE flag for kernel threads.
> > 
> > Kernel threads have task_struct->mm set to NULL.
> > (Thanks to Heiko for that hint ;-)
> > 
> > The idea is then to check the mm field in syscall_regfunc() and
> > set the flag accordingly.
> > 
> > However, I think the patch is an optional add-on because checking
> > the syscall number is still required for case 1).
> > 
> > ---
> >  kernel/tracepoint.c |    4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> > 
> > --- a/kernel/tracepoint.c
> > +++ b/kernel/tracepoint.c
> > @@ -593,7 +593,9 @@ void syscall_regfunc(void)
> >  	if (!sys_tracepoint_refcount) {
> >  		read_lock_irqsave(&tasklist_lock, flags);
> >  		do_each_thread(g, t) {
> > -			set_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
> > +			/* Skip kernel threads. */
> > +			if (t->mm)
> > +				set_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
> >  		} while_each_thread(g, t);
> >  		read_unlock_irqrestore(&tasklist_lock, flags);
> >  	}
> 
> Yeah, and as said before, tracing syscalls from kernel threads is
> an interesting point but we can't do it that way.
> 
> I'm queuing this patch for .32, but I need your Signed-off-by to apply it :)

That won't always work as pointed out in the other example:
- Process doing sys_init_module then scheduled away
- User enables syscall tracing -> TIF_SYSCALL_FTRACE gets set
- init function of the module gets called and is doing kernel_thread()
  (old API) -> kernel thread inherits TIF_SYSCALL_FTRACE.

I don't think that's what you want. You might want to clear the flag for
new processes during fork (only for kernel threads I would guess).

At least the current patch leaves a hole.
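
The hole can be illustrated with a small user-space model. It is a sketch
only: struct hole_task, model_regfunc() and model_kernel_thread() are
hypothetical names, the flag is reduced to a bool, and the child is assumed
to start as a plain copy of its parent. The clear_on_fork parameter models
the "clear the flag during fork" idea suggested above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the task fields involved. */
struct hole_task {
	void *mm;			/* NULL for a kernel thread */
	bool tif_syscall_trace;		/* models TIF_SYSCALL_FTRACE */
};

/* Models syscall_regfunc() with the "skip kernel threads" patch applied. */
static void model_regfunc(struct hole_task *tasks, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (tasks[i].mm)
			tasks[i].tif_syscall_trace = true;
	}
}

/*
 * Models a thread created through the old kernel_thread() API from a
 * user process: the child starts out as a copy of its parent, so it
 * inherits the flag unless the fork path clears it.
 */
static struct hole_task model_kernel_thread(const struct hole_task *parent,
					    bool clear_on_fork)
{
	struct hole_task child = *parent;

	if (clear_on_fork)
		child.tif_syscall_trace = false;
	return child;
}
```

In this model the check in model_regfunc() never sees the child, because
the child does not exist yet when tracing is enabled; only clearing the
flag on the fork path closes the gap.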


* Re: [PATCH 08/12] add trace events for each syscall entry/exit
  2009-08-26 12:59           ` Heiko Carstens
@ 2009-08-26 13:30             ` Frederic Weisbecker
  2009-08-26 13:48               ` Steven Rostedt
  2009-08-26 14:10               ` Heiko Carstens
  0 siblings, 2 replies; 88+ messages in thread
From: Frederic Weisbecker @ 2009-08-26 13:30 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Hendrik Brueckner, Jason Baron, linux-kernel, mingo, laijs,
	rostedt, peterz, mathieu.desnoyers, jiayingz, mbligh, lizf,
	Martin Schwidefsky

On Wed, Aug 26, 2009 at 02:59:43PM +0200, Heiko Carstens wrote:
> On Wed, Aug 26, 2009 at 02:35:52PM +0200, Frederic Weisbecker wrote:
> > On Tue, Aug 25, 2009 at 06:02:37PM +0200, Hendrik Brueckner wrote:
> > > On Tue, Aug 25, 2009 at 04:15:49PM +0200, Frederic Weisbecker wrote:
> > > > On Tue, Aug 25, 2009 at 02:50:27PM +0200, Hendrik Brueckner wrote:
> > > > > There are at least two scenarios where syscall_get_nr() can return -1:
> > > > > 
> > > > > 1. For example, ptrace stores an invalid syscall number, and thus,
> > > > >    tracing code resets it.
> > > > >    (see do_syscall_trace_enter in arch/s390/kernel/ptrace.c)
> > > > > 
> > > > > 2. The syscall_regfunc() (kernel/tracepoint.c) sets the TIF_SYSCALL_FTRACE
> > > > >    (now: TIF_SYSCALL_TRACEPOINT) flag for all threads which includes
> > > > >    kernel threads.
> > > > >    However, the ftrace selftest triggers a kernel oops when testing syscall
> > > > >    trace points:
> > > > > >       - The kernel thread is started as usual (do_fork()),
> > > > >       - tracing code sets TIF_SYSCALL_FTRACE,
> > > > >       - the ret_from_fork() function is triggered and starts
> > > > > 	ftrace_syscall_exit() with an invalid syscall number.
> > > > 
> > > > 
> > > > 
> > > > I wonder if there is any way to identify such situation...?
> > > For the second case, it might be an option to avoid setting the
> > > TIF_SYSCALL_FTRACE flag for kernel threads.
> > > 
> > > Kernel threads have task_struct->mm set to NULL.
> > > (Thanks to Heiko for that hint ;-)
> > > 
> > > The idea is then to check the mm field in syscall_regfunc() and
> > > set the flag accordingly.
> > > 
> > > However, I think the patch is an optional add-on because checking
> > > the syscall number is still required for case 1).
> > > 
> > > ---
> > >  kernel/tracepoint.c |    4 +++-
> > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > > 
> > > --- a/kernel/tracepoint.c
> > > +++ b/kernel/tracepoint.c
> > > @@ -593,7 +593,9 @@ void syscall_regfunc(void)
> > >  	if (!sys_tracepoint_refcount) {
> > >  		read_lock_irqsave(&tasklist_lock, flags);
> > >  		do_each_thread(g, t) {
> > > -			set_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
> > > +			/* Skip kernel threads. */
> > > +			if (t->mm)
> > > +				set_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
> > >  		} while_each_thread(g, t);
> > >  		read_unlock_irqrestore(&tasklist_lock, flags);
> > >  	}
> > 
> > Yeah, and as said before, tracing syscalls from kernel threads is
> > an interesting point but we can't do it that way.
> > 
> > I'm queuing this patch for .32, but I need your Signed-off-by to apply it :)
> 
> That won't always work as pointed out in the other example:
> - Process doing sys_init_module then scheduled away
> - User enables syscall tracing -> TIF_SYSCALL_FTRACE gets set
> - init function of the module gets called and is doing kernel_thread()
>   (old API) -> kernel thread inherits TIF_SYSCALL_FTRACE.
> 
> I don't think that's what you want. You might want to clear the flag for
> new processes during fork (only for kernel threads I would guess).
> 
> At least the current patch leaves a hole.


Ah, there are callsites that use kernel_thread() directly?
Does that mean t->mm could be non-NULL for the resulting
kernel threads? In that case it would be hard to hook into
do_fork() to check that.



* Re: [PATCH 08/12] add trace events for each syscall entry/exit
  2009-08-26 13:30             ` Frederic Weisbecker
@ 2009-08-26 13:48               ` Steven Rostedt
  2009-08-26 13:53                 ` Frederic Weisbecker
  2009-08-26 13:56                 ` Peter Zijlstra
  2009-08-26 14:10               ` Heiko Carstens
  1 sibling, 2 replies; 88+ messages in thread
From: Steven Rostedt @ 2009-08-26 13:48 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Heiko Carstens, Hendrik Brueckner, Jason Baron, LKML,
	Ingo Molnar, Lai Jiangshan, Peter Zijlstra, Mathieu Desnoyers,
	jiayingz, mbligh, lizf, Martin Schwidefsky


On Wed, 26 Aug 2009, Frederic Weisbecker wrote:
> > 
> > That won't always work as pointed out in the other example:
> > - Process doing sys_init_module then scheduled away
> > - User enables syscall tracing -> TIF_SYSCALL_FTRACE gets set
> > - init function of the module gets called and is doing kernel_thread()
> >   (old API) -> kernel thread inherits TIF_SYSCALL_FTRACE.
> > 
> > I don't think that's what you want. You might want to clear the flag for
> > new processes during fork (only for kernel threads I would guess).
> > 
> > At least the current patch leaves a hole.
> 
> 
> Ah, there are callsites that use kernel_thread() directly?
> Does that mean t->mm could be non-NULL for the resulting
> kernel threads? In that case it would be hard to hook into
> do_fork() to check that.

All kernel threads have a NULL t->mm. Since do_fork is called by kthreadd 
and not by kthread_create, the caller of do_fork will also have a
t->mm = NULL.

-- Steve



* Re: [PATCH 08/12] add trace events for each syscall entry/exit
  2009-08-26 13:48               ` Steven Rostedt
@ 2009-08-26 13:53                 ` Frederic Weisbecker
  2009-08-26 14:44                   ` Steven Rostedt
  2009-08-26 13:56                 ` Peter Zijlstra
  1 sibling, 1 reply; 88+ messages in thread
From: Frederic Weisbecker @ 2009-08-26 13:53 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Heiko Carstens, Hendrik Brueckner, Jason Baron, LKML,
	Ingo Molnar, Lai Jiangshan, Peter Zijlstra, Mathieu Desnoyers,
	jiayingz, mbligh, lizf, Martin Schwidefsky

On Wed, Aug 26, 2009 at 09:48:48AM -0400, Steven Rostedt wrote:
> 
> On Wed, 26 Aug 2009, Frederic Weisbecker wrote:
> > > 
> > > That won't always work as pointed out in the other example:
> > > - Process doing sys_init_module then scheduled away
> > > - User enables syscall tracing -> TIF_SYSCALL_FTRACE gets set
> > > - init function of the module gets called and is doing kernel_thread()
> > >   (old API) -> kernel thread inherits TIF_SYSCALL_FTRACE.
> > > 
> > > I don't think that's what you want. You might want to clear the flag for
> > > new processes during fork (only for kernel threads I would guess).
> > > 
> > > At least the current patch leaves a hole.
> > 
> > 
> > Ah, there are callsites that use kernel_thread() directly?
> > Does that mean t->mm could be non-NULL for the resulting
> > kernel threads? In that case it would be hard to hook into
> > do_fork() to check that.
> 
> All kernel threads have a NULL t->mm. Since do_fork is called by kthreadd 
> and not by kthread_create, the caller of do_fork will also have a
> t->mm = NULL.
> 
> -- Steve
> 

Yeah, that's the case when threads are created through kthread_create(),
but what if you create a kernel thread using the low-level
kernel_thread() directly (i.e. without relying on the kthreadd queue)?

Especially in Heiko's example, it seems to be a duplication of a user
task.

I wonder what obvious thing I'm missing here...



* Re: [PATCH 08/12] add trace events for each syscall entry/exit
  2009-08-26 13:48               ` Steven Rostedt
  2009-08-26 13:53                 ` Frederic Weisbecker
@ 2009-08-26 13:56                 ` Peter Zijlstra
  2009-08-26 14:41                   ` Steven Rostedt
  1 sibling, 1 reply; 88+ messages in thread
From: Peter Zijlstra @ 2009-08-26 13:56 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Frederic Weisbecker, Heiko Carstens, Hendrik Brueckner,
	Jason Baron, LKML, Ingo Molnar, Lai Jiangshan, Mathieu Desnoyers,
	jiayingz, mbligh, lizf, Martin Schwidefsky

On Wed, 2009-08-26 at 09:48 -0400, Steven Rostedt wrote:
> On Wed, 26 Aug 2009, Frederic Weisbecker wrote:
> > > 
> > > That won't always work as pointed out in the other example:
> > > - Process doing sys_init_module then scheduled away
> > > - User enables syscall tracing -> TIF_SYSCALL_FTRACE gets set
> > > - init function of the module gets called and is doing kernel_thread()
> > >   (old API) -> kernel thread inherits TIF_SYSCALL_FTRACE.
> > > 
> > > I don't think that's what you want. You might want to clear the flag for
> > > new processes during fork (only for kernel threads I would guess).
> > > 
> > > At least the current patch leaves a hole.
> > 
> > 
> > Ah, there are callsites that use kernel_thread() directly?
> > Does that mean t->mm could be non-NULL for the resulting
> > kernel threads? In that case it would be hard to hook into
> > do_fork() to check that.
> 
> All kernel threads have a NULL t->mm. Since do_fork is called by kthreadd 
> and not by kthread_create, the caller of do_fork will also have a
> t->mm = NULL.

Weren't there a few sites in the kernel where kernel threads temporarily
borrow the mm from someone?

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 08/12] add trace events for each syscall entry/exit
  2009-08-26 13:30             ` Frederic Weisbecker
  2009-08-26 13:48               ` Steven Rostedt
@ 2009-08-26 14:10               ` Heiko Carstens
  2009-08-26 14:27                 ` Frederic Weisbecker
  2009-08-26 14:43                 ` Steven Rostedt
  1 sibling, 2 replies; 88+ messages in thread
From: Heiko Carstens @ 2009-08-26 14:10 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Hendrik Brueckner, Jason Baron, linux-kernel, mingo, laijs,
	rostedt, peterz, mathieu.desnoyers, jiayingz, mbligh, lizf,
	Martin Schwidefsky

On Wed, Aug 26, 2009 at 03:30:22PM +0200, Frederic Weisbecker wrote:
> On Wed, Aug 26, 2009 at 02:59:43PM +0200, Heiko Carstens wrote:
> > > Yeah, and as told before, syscalls tracing from kernel thread is
> > > an interesting point but we can't do it that way.
> > > 
> > > I'm queuing this patch for .32, but I need you Signed-off-by to apply it :)
> > 
> > That won't always work as pointed out in the other example:
> > - Process doing sys_init_module then scheduled away
> > - User enables syscall tracing -> TIF_SYSCALL_FTRACE gets set
> > - init function of the module gets called and is doing kernel_thread()
> >   (old API) -> kernel thread inherits TIF_SYSCALL_FTRACE.
> > 
> > I don't think that's what you want. You might want to clear the flag for
> > new processes during fork (only for kernel threads I would guess).
> > 
> > At least the current patch leaves a hole.
> 
> Ah, there are callsites that use kernel_thread() directly?
> Does it means that t->mm could be non NULL for such resulting
> kernel threads, in that case it would be hard to hook on
> do_fork() to check that.

Oh yes, you are right. kernel threads created with kernel_thread()
have t->mm != NULL if the forking process has an mm too.

There are very few callsites left which still use kernel_thread().
(the last one in s390 driver code will be gone after the next
merge window).

As far as I can see there are only four callsites left
(excluding staging): jffs2 and three in net/bluetooth/*

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 08/12] add trace events for each syscall entry/exit
  2009-08-26 14:10               ` Heiko Carstens
@ 2009-08-26 14:27                 ` Frederic Weisbecker
  2009-08-26 14:43                   ` Steven Rostedt
  2009-08-26 14:43                 ` Steven Rostedt
  1 sibling, 1 reply; 88+ messages in thread
From: Frederic Weisbecker @ 2009-08-26 14:27 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Hendrik Brueckner, Jason Baron, linux-kernel, mingo, laijs,
	rostedt, peterz, mathieu.desnoyers, jiayingz, mbligh, lizf,
	Martin Schwidefsky

On Wed, Aug 26, 2009 at 04:10:52PM +0200, Heiko Carstens wrote:
> On Wed, Aug 26, 2009 at 03:30:22PM +0200, Frederic Weisbecker wrote:
> > On Wed, Aug 26, 2009 at 02:59:43PM +0200, Heiko Carstens wrote:
> > > > Yeah, and as told before, syscalls tracing from kernel thread is
> > > > an interesting point but we can't do it that way.
> > > > 
> > > > I'm queuing this patch for .32, but I need you Signed-off-by to apply it :)
> > > 
> > > That won't always work as pointed out in the other example:
> > > - Process doing sys_init_module then scheduled away
> > > - User enables syscall tracing -> TIF_SYSCALL_FTRACE gets set
> > > - init function of the module gets called and is doing kernel_thread()
> > >   (old API) -> kernel thread inherits TIF_SYSCALL_FTRACE.
> > > 
> > > I don't think that's what you want. You might want to clear the flag for
> > > new processes during fork (only for kernel threads I would guess).
> > > 
> > > At least the current patch leaves a hole.
> > 
> > Ah, there are callsites that use kernel_thread() directly?
> > Does it means that t->mm could be non NULL for such resulting
> > kernel threads, in that case it would be hard to hook on
> > do_fork() to check that.
> 
> Oh yes, you are right. kernel threads created with kernel_thread()
> have t->mm != NULL if the forking process has an mm too.
> 
> There are very few callsites left which still use kernel_thread().
> (the last one in s390 driver code will be gone after the next
> merge window).
> 
> As far as I can see there are only four callsites left
> (excluding staging): jffs2 and three in net/bluetooth/*


In that case, I'd suggest to pick the patch that checks for
kernel threads while setting the TIF_FLAGS.

The check for invalid syscall numbers in another patch should be
sufficient to not crash.

We might get rare inaccurate results because of the remaining
buggy callsites, but instead of working around them, these
should be fixed.
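
For illustration only, the invalid-syscall-number check mentioned above
could look roughly like the userspace sketch below. It is not the code
from the posted patches: every model_* name and MODEL_NR_SYSCALLS is a
hypothetical stand-in, and the metadata struct is cut down to a name.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical table size; the patches use FTRACE_SYSCALL_MAX or NR_syscalls. */
#define MODEL_NR_SYSCALLS 4

/* Cut-down stand-in for struct syscall_metadata. */
struct model_syscall_metadata {
	const char *name;
};

static struct model_syscall_metadata model_read  = { "sys_read" };
static struct model_syscall_metadata model_write = { "sys_write" };

/* Entries may be NULL when no metadata was found for a syscall. */
static struct model_syscall_metadata *model_metadata[MODEL_NR_SYSCALLS] = {
	&model_read, &model_write, NULL, NULL,
};

/* Range-checked lookup: -1 (or any out-of-range nr) yields NULL. */
static struct model_syscall_metadata *model_nr_to_meta(int nr)
{
	if (nr < 0 || nr >= MODEL_NR_SYSCALLS)
		return NULL;
	return model_metadata[nr];
}

/* Returns 1 if an exit event would be recorded, 0 if it is dropped. */
static int model_trace_sys_exit(int syscall_nr)
{
	struct model_syscall_metadata *meta = model_nr_to_meta(syscall_nr);

	if (!meta)
		return 0;
	assert(meta->name != NULL);
	return 1;
}
```

With such a check, a thread that reaches the exit path with a syscall
number of -1 is silently skipped instead of indexing outside the table.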


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 08/12] add trace events for each syscall entry/exit
  2009-08-26 12:35         ` Frederic Weisbecker
  2009-08-26 12:59           ` Heiko Carstens
@ 2009-08-26 14:41           ` Hendrik Brueckner
  1 sibling, 0 replies; 88+ messages in thread
From: Hendrik Brueckner @ 2009-08-26 14:41 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Hendrik Brueckner, Jason Baron, linux-kernel, mingo, laijs,
	rostedt, peterz, mathieu.desnoyers, jiayingz, mbligh, lizf,
	Heiko Carstens, Martin Schwidefsky

On Wed, Aug 26, 2009 at 02:35:52PM +0200, Frederic Weisbecker wrote:
> On Tue, Aug 25, 2009 at 06:02:37PM +0200, Hendrik Brueckner wrote:
> > On Tue, Aug 25, 2009 at 04:15:49PM +0200, Frederic Weisbecker wrote:
> > > On Tue, Aug 25, 2009 at 02:50:27PM +0200, Hendrik Brueckner wrote:
> > > > There are at least two scenarios where syscall_get_nr() can return -1:
> > > > 
> > > > 1. For example, ptrace stores an invalid syscall number, and thus,
> > > >    tracing code resets it.
> > > >    (see do_syscall_trace_enter in arch/s390/kernel/ptrace.c)
> > > > 
> > > > 2. The syscall_regfunc() (kernel/tracepoint.c) sets the TIF_SYSCALL_FTRACE
> > > >    (now: TIF_SYSCALL_TRACEPOINT) flag for all threads which includes
> > > >    kernel threads.
> > > >    However, the ftrace selftest triggers a kernel oops when testing syscall
> > > >    trace points:
> > > > >       - The kernel thread is started as usual (do_fork()),
> > > >       - tracing code sets TIF_SYSCALL_FTRACE,
> > > >       - the ret_from_fork() function is triggered and starts
> > > > 	ftrace_syscall_exit() with an invalid syscall number.
> > > 
> > > 
> > > 
> > > I wonder if there is any way to identify such situation...?
> > For the second case, it might be an option to avoid setting the
> > TIF_SYSCALL_FTRACE flag for kernel threads.
> > 
> > Kernel threads have task_struct->mm set to NULL.
> > (Thanks to Heiko for that hint ;-)
> > 
> > The idea is then to check the mm field in syscall_regfunc() and
> > set the flag accordingly.
> > 
> > > However, I think the patch is an optional add-on because checking
> > the syscall number is still required for case 1).
> > 
> > ---
> >  kernel/tracepoint.c |    4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> > 
> > --- a/kernel/tracepoint.c
> > +++ b/kernel/tracepoint.c
> > @@ -593,7 +593,9 @@ void syscall_regfunc(void)
> >  	if (!sys_tracepoint_refcount) {
> >  		read_lock_irqsave(&tasklist_lock, flags);
> >  		do_each_thread(g, t) {
> > -			set_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
> > +			/* Skip kernel threads. */
> > +			if (t->mm)
> > +				set_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
> >  		} while_each_thread(g, t);
> >  		read_unlock_irqrestore(&tasklist_lock, flags);
> >  	}
> > 
> 
> 
> Yeah, and as told before, syscalls tracing from kernel thread is
> an interesting point but we can't do it that way.
> 
> I'm queuing this patch for .32, but I need you Signed-off-by to apply it :)

Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 08/12] add trace events for each syscall entry/exit
  2009-08-26 13:56                 ` Peter Zijlstra
@ 2009-08-26 14:41                   ` Steven Rostedt
  0 siblings, 0 replies; 88+ messages in thread
From: Steven Rostedt @ 2009-08-26 14:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, Heiko Carstens, Hendrik Brueckner,
	Jason Baron, LKML, Ingo Molnar, Lai Jiangshan, Mathieu Desnoyers,
	jiayingz, mbligh, lizf, Martin Schwidefsky


On Wed, 26 Aug 2009, Peter Zijlstra wrote:

> On Wed, 2009-08-26 at 09:48 -0400, Steven Rostedt wrote:
> > On Wed, 26 Aug 2009, Frederic Weisbecker wrote:
> > > > 
> > > > That won't always work as pointed out in the other example:
> > > > - Process doing sys_init_module then scheduled away
> > > > - User enables syscall tracing -> TIF_SYSCALL_FTRACE gets set
> > > > - init function of the module gets called and is doing kernel_thread()
> > > >   (old API) -> kernel thread inherits TIF_SYSCALL_FTRACE.
> > > > 
> > > > I don't think that's what you want. You might want to clear the flag for
> > > > new processes during fork (only for kernel threads I would guess).
> > > > 
> > > > At least the current patch leaves a hole.
> > > 
> > > 
> > > Ah, there are callsites that use kernel_thread() directly?
> > > Does it means that t->mm could be non NULL for such resulting
> > > kernel threads, in that case it would be hard to hook on
> > > do_fork() to check that.
> > 
> > All kernel threads have a NULL t->mm. Since do_fork is called by kthreadd 
> > and not by kthread_create, the caller of do_fork will also have a
> > t->mm = NULL.
> 
> Weren't there a few sites in the kernel where kernel threads temporarily
> borrow the mm from someone?

All kernel threads borrow a mm, but they don't set it as their own. That's 
what t->active_mm is for.
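
As a rough illustration of the mm / active_mm split, here is a simplified
userspace model. The model_* names are hypothetical, and the real context
switch code does considerably more (reference counting, lazy TLB handling);
this only shows why t->mm stays NULL while active_mm is borrowed.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for struct mm_struct. */
struct model_mm {
	int id;
};

/* Stand-in for the two task_struct fields discussed here. */
struct model_task {
	struct model_mm *mm;        /* owned address space, NULL for kernel threads */
	struct model_mm *active_mm; /* address space currently in use, maybe borrowed */
};

/* Model of switching from prev to next. */
static void model_switch_mm(const struct model_task *prev,
			    struct model_task *next)
{
	assert(prev->active_mm != NULL);
	if (next->mm)
		next->active_mm = next->mm;         /* user task: use its own mm */
	else
		next->active_mm = prev->active_mm;  /* kernel thread: borrow */
}

/* The test proposed in this thread: no owned mm means kernel thread. */
static int model_is_kernel_thread(const struct model_task *t)
{
	return t->mm == NULL;
}
```

The borrowing never touches ->mm, so a check on t->mm still identifies
kernel threads created through kthreadd.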

-- Steve


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 08/12] add trace events for each syscall entry/exit
  2009-08-26 14:27                 ` Frederic Weisbecker
@ 2009-08-26 14:43                   ` Steven Rostedt
  2009-08-26 16:14                     ` Frederic Weisbecker
  0 siblings, 1 reply; 88+ messages in thread
From: Steven Rostedt @ 2009-08-26 14:43 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Heiko Carstens, Hendrik Brueckner, Jason Baron, linux-kernel,
	mingo, laijs, peterz, mathieu.desnoyers, jiayingz, mbligh, lizf,
	Martin Schwidefsky


On Wed, 26 Aug 2009, Frederic Weisbecker wrote:
> > 
> > Oh yes, you are right. kernel threads created with kernel_thread()
> > have t->mm != NULL if the forking process has an mm too.
> > 
> > There are very few callsites left which still use kernel_thread().
> > (the last one in s390 driver code will be gone after the next
> > merge window).
> > 
> > As far as I can see there are only four callsites left
> > (excluding staging): jffs2 and three in net/bluetooth/*
> 
> 
> In that case, I'd suggest to pick the patch that checks for
> kernel threads while setting the TIF_FLAGS.
> 
> The check for invalid syscall numbers in another patch should be
> sufficient to not crash.
> 
> We might get rare inaccurate results because of the remaining
> buggy callsites, but instead of working around them, these
> should be fixed.

Well, do these (buggy threads) call syscalls?

-- Steve


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 08/12] add trace events for each syscall entry/exit
  2009-08-26 14:10               ` Heiko Carstens
  2009-08-26 14:27                 ` Frederic Weisbecker
@ 2009-08-26 14:43                 ` Steven Rostedt
  1 sibling, 0 replies; 88+ messages in thread
From: Steven Rostedt @ 2009-08-26 14:43 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Frederic Weisbecker, Hendrik Brueckner, Jason Baron,
	linux-kernel, mingo, laijs, peterz, mathieu.desnoyers, jiayingz,
	mbligh, lizf, Martin Schwidefsky


On Wed, 26 Aug 2009, Heiko Carstens wrote:
> > Ah, there are callsites that use kernel_thread() directly?
> > Does it means that t->mm could be non NULL for such resulting
> > kernel threads, in that case it would be hard to hook on
> > do_fork() to check that.
> 
> Oh yes, you are right. kernel threads created with kernel_thread()
> have t->mm != NULL if the forking process has an mm too.
> 
> There are very few callsites left which still use kernel_thread().
> (the last one in s390 driver code will be gone after the next
> merge window).
> 
> As far as I can see there are only four callsites left
> (excluding staging): jffs2 and three in net/bluetooth/*

Ouch! Those need to be fixed.

Thanks,

-- Steve


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 08/12] add trace events for each syscall entry/exit
  2009-08-26 13:53                 ` Frederic Weisbecker
@ 2009-08-26 14:44                   ` Steven Rostedt
  0 siblings, 0 replies; 88+ messages in thread
From: Steven Rostedt @ 2009-08-26 14:44 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Heiko Carstens, Hendrik Brueckner, Jason Baron, LKML,
	Ingo Molnar, Lai Jiangshan, Peter Zijlstra, Mathieu Desnoyers,
	jiayingz, mbligh, lizf, Martin Schwidefsky


On Wed, 26 Aug 2009, Frederic Weisbecker wrote:
> > 
> > All kernel threads have a NULL t->mm. Since do_fork is called by kthreadd 
> > and not by kthread_create, the caller of do_fork will also have a
> > t->mm = NULL.
> > 
> > -- Steve
> > 
> 
> Yeah, that's the case with kthread_create() creation fashion,
> but what if you create a kernel thread using the low-level
> kernel_thread() directly (i.e. without relying on the kthreadd queue)?
> 
> Especially in Heiko's example, it seems to be a duplication of a user
> task.
> 
> I wonder what obvious thing I'm missing here...

The obvious is that those calls are buggy ;-)

-- Steve


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 08/12] add trace events for each syscall entry/exit
  2009-08-26 14:43                   ` Steven Rostedt
@ 2009-08-26 16:14                     ` Frederic Weisbecker
  0 siblings, 0 replies; 88+ messages in thread
From: Frederic Weisbecker @ 2009-08-26 16:14 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Heiko Carstens, Hendrik Brueckner, Jason Baron, linux-kernel,
	mingo, laijs, peterz, mathieu.desnoyers, jiayingz, mbligh, lizf,
	Martin Schwidefsky

On Wed, Aug 26, 2009 at 10:43:01AM -0400, Steven Rostedt wrote:
> 
> On Wed, 26 Aug 2009, Frederic Weisbecker wrote:
> > > 
> > > Oh yes, you are right. kernel threads created with kernel_thread()
> > > have t->mm != NULL if the forking process has an mm too.
> > > 
> > > There are very few callsites left which still use kernel_thread().
> > > (the last one in s390 driver code will be gone after the next
> > > merge window).
> > > 
> > > As far as I can see there are only four callsites left
> > > (excluding staging): jffs2 and three in net/bluetooth/*
> > 
> > 
> > In that case, I'd suggest to pick the patch that checks for
> > kernel threads while setting the TIF_FLAGS.
> > 
> > The check for invalid syscall numbers in another patch should be
> > sufficient to not crash.
> > 
> > We might get rare inaccurate results because of the remaining
> > buggy callsites, but instead of working around them, these
> > should be fixed.
> 
> Well, do these (buggy threads) call syscalls?
> 
> -- Steve
> 


That's not exactly what causes the bugs, because their syscalls
won't be traced. The problem occurs when the freshly created
kernel thread reaches ret_from_fork with its ->mm != NULL.

It will then take the trace_sys_exit() call and be traced
as a normal user thread.
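
To make the hole concrete, here is a small userspace model of the sequence,
offered only as a sketch. The model_* names and the flag value are
hypothetical; the only behaviour taken from this thread is that
syscall_regfunc() skips tasks with a NULL mm and that an old-style
kernel_thread() child inherits the caller's mm and thread flags.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bit standing in for TIF_SYSCALL_FTRACE. */
#define MODEL_TIF_SYSCALL_FTRACE 0x1u

struct model_task {
	void *mm;            /* NULL for kernel threads made via kthreadd */
	unsigned int flags;  /* thread flags */
};

/* Mirrors the proposed syscall_regfunc() change: skip tasks without an mm. */
static void model_syscall_regfunc(struct model_task *tasks, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (tasks[i].mm)
			tasks[i].flags |= MODEL_TIF_SYSCALL_FTRACE;
}

/* Old-style kernel_thread(): the child starts with the caller's mm and flags. */
static struct model_task model_kernel_thread(const struct model_task *parent)
{
	assert(parent != NULL);
	return *parent;
}

/* Would the child take the syscall exit tracing path on ret_from_fork? */
static int model_traced_at_ret_from_fork(const struct model_task *t)
{
	return (t->flags & MODEL_TIF_SYSCALL_FTRACE) != 0;
}
```

In this model a child forked from kthreadd never carries the flag, while a
child forked from a flagged user process (the sys_init_module case) does,
which is the remaining inaccuracy described above.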



^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 00/12] add syscall tracepoints V3 - s390 arch update
  2009-08-25 12:31 ` [PATCH 00/12] add syscall tracepoints V3 - s390 arch update Hendrik Brueckner
  2009-08-25 13:52   ` Frederic Weisbecker
@ 2009-08-26 16:53   ` Frederic Weisbecker
  2009-08-27  7:27     ` [PATCH]: tracing: s390 arch updates for tracing syscalls Hendrik Brueckner
  2009-08-28 12:27   ` [tip:tracing/core] tracing: Add syscall tracepoints - s390 arch update tip-bot for Hendrik Brueckner
  2 siblings, 1 reply; 88+ messages in thread
From: Frederic Weisbecker @ 2009-08-26 16:53 UTC (permalink / raw)
  To: Hendrik Brueckner
  Cc: Jason Baron, linux-kernel, mingo, laijs, rostedt, peterz,
	mathieu.desnoyers, jiayingz, mbligh, lizf, Heiko Carstens,
	Martin Schwidefsky

On Tue, Aug 25, 2009 at 02:31:11PM +0200, Hendrik Brueckner wrote:
> Hi,
> 
> I looked at your recent syscall tracepoint patches and I have few
> more s390 arch updates.
> 
> This patch includes s390 arch updates for:
> - tracing: Map syscall name to number (syscall_name_to_nr())
> - tracing: Call arch_init_ftrace_syscalls at boot
> - tracing: add support for tracepoint ids (set_syscall_{enter,exit}_id())
> 
> The patch already uses "NR_syscalls" instead of FTRACE_SYSCALL_MAX.
> 
> The patch is based on today's linux-next (20090825).
> Since few of your patches already include s390 changes,
> I would appreciate if you could add the patch to your patch set.
> 
> If you have any remarks, please let me know. 
>   
> Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
> ---
>  arch/s390/include/asm/ftrace.h |    4 ++++



Btw, the diffstat lists changes in ftrace.h but your patch doesn't contain any.
Is there something missing?



>  arch/s390/kernel/ftrace.c      |   36 +++++++++++++++++++++++++++---------
>  2 files changed, 31 insertions(+), 9 deletions(-)
> 
> --- a/arch/s390/kernel/ftrace.c
> +++ b/arch/s390/kernel/ftrace.c
> @@ -220,6 +220,29 @@ struct syscall_metadata *syscall_nr_to_m
>  	return syscalls_metadata[nr];
>  }
>  
> +int syscall_name_to_nr(char *name)
> +{
> +	int i;
> +
> +	if (!syscalls_metadata)
> +		return -1;
> +	for (i = 0; i < NR_syscalls; i++)
> +		if (syscalls_metadata[i])
> +			if (!strcmp(syscalls_metadata[i]->name, name))
> +				return i;
> +	return -1;
> +}
> +
> +void set_syscall_enter_id(int num, int id)
> +{
> +	syscalls_metadata[num]->enter_id = id;
> +}
> +
> +void set_syscall_exit_id(int num, int id)
> +{
> +	syscalls_metadata[num]->exit_id = id;
> +}
> +
>  static struct syscall_metadata *find_syscall_meta(unsigned long syscall)
>  {
>  	struct syscall_metadata *start;
> @@ -237,24 +260,19 @@ static struct syscall_metadata *find_sys
>  	return NULL;
>  }
>  
> -void arch_init_ftrace_syscalls(void)
> +static int __init arch_init_ftrace_syscalls(void)
>  {
>  	struct syscall_metadata *meta;
>  	int i;
> -	static atomic_t refs;
> -
> -	if (atomic_inc_return(&refs) != 1)
> -		goto out;
>  	syscalls_metadata = kzalloc(sizeof(*syscalls_metadata) * NR_syscalls,
>  				    GFP_KERNEL);
>  	if (!syscalls_metadata)
> -		goto out;
> +		return -ENOMEM;
>  	for (i = 0; i < NR_syscalls; i++) {
>  		meta = find_syscall_meta((unsigned long)sys_call_table[i]);
>  		syscalls_metadata[i] = meta;
>  	}
> -	return;
> -out:
> -	atomic_dec(&refs);
> +	return 0;
>  }
> +arch_initcall(arch_init_ftrace_syscalls);
>  #endif


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 08/12] add trace events for each syscall entry/exit
  2009-08-26  6:21                 ` Peter Zijlstra
@ 2009-08-26 17:08                   ` Mathieu Desnoyers
  2009-08-26 18:41                     ` Christoph Hellwig
  0 siblings, 1 reply; 88+ messages in thread
From: Mathieu Desnoyers @ 2009-08-26 17:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, Hendrik Brueckner, Jason Baron,
	linux-kernel, mingo, laijs, rostedt, jiayingz, mbligh, lizf,
	Heiko Carstens, Martin Schwidefsky

* Peter Zijlstra (peterz@infradead.org) wrote:
> On Tue, 2009-08-25 at 14:31 -0400, Mathieu Desnoyers wrote:
> 
> > (Well, I do not have time currently to look into the gory details
> > (sorry), but let's try to take a step back from the problem.)
> > 
> > The design proposal for this kthread behavior wrt syscalls is based on a
> > very specific and current kernel behavior, that may happen to change and
> > that I have actually seen proven incorrect. For instance, some
> > proprietary Linux driver does very odd things with system calls within
> > kernel threads, like invoking them with int 0x80.
> > 
> > Yes, this is odd, but do we really want to tie the tracer that much to
> > the actual OS implementation specificities ?
> > 
> > That sounds like a recipe for endless breakages and missing bits of
> > instrumentation.
> > 
> > So my advice would be: if we want to trace the syscall entry/exit paths,
> > let's trace them for the _whole_ system, and find ways to make it work
> > for corner-cases rather than finding clever ways to diminish
> > instrumentation coverage.
> > 
> > Given the ret from fork example happens to be the first event fired
> > after the thread is created, we should be able to deal with this problem
> > by initializing the thread structure used by syscall exit tracing to an
> > initial "ret from fork" value.
> 
> So you're saying we should let proprietary crap influence the design of
> the kernel in any way?

Nah. And I start to feel comfortable with syscall entry/exit only
being traced for userspace threads. But as I pointed out in a follow-up
email, the lack of sys_*() tracing for invocation from within the kernel
might be problematic. This is actually my main point.

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 08/12] add trace events for each syscall entry/exit
  2009-08-26  7:10                 ` Peter Zijlstra
@ 2009-08-26 17:10                   ` Mathieu Desnoyers
  2009-08-26 17:24                   ` H. Peter Anvin
  1 sibling, 0 replies; 88+ messages in thread
From: Mathieu Desnoyers @ 2009-08-26 17:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, Hendrik Brueckner, Jason Baron,
	linux-kernel, mingo, laijs, rostedt, jiayingz, mbligh, lizf,
	Heiko Carstens, Martin Schwidefsky, Thomas Gleixner, hpa

* Peter Zijlstra (peterz@infradead.org) wrote:
> On Tue, 2009-08-25 at 14:31 -0400, Mathieu Desnoyers wrote:
> > For instance, some
> > proprietary Linux driver does very odd things with system calls within
> > kernel threads, like invoking them with int 0x80.
> 
> So who is going to send the x86 patch to make int 0x80 from kernel space
> panic the machine? :-)

I'm pretty sure ATI or Nvidia already cooked something like this in the
past. Let's not bother too much with the proprietary aspect. Tracing
internal kernel invocations of sys_*() is actually the main point I was
trying to come to.

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 08/12] add trace events for each syscall entry/exit
  2009-08-26  7:28                         ` Ingo Molnar
@ 2009-08-26 17:11                           ` Mathieu Desnoyers
  0 siblings, 0 replies; 88+ messages in thread
From: Mathieu Desnoyers @ 2009-08-26 17:11 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Frederic Weisbecker, Hendrik Brueckner, Jason Baron,
	linux-kernel, laijs, rostedt, peterz, jiayingz, mbligh, lizf,
	Heiko Carstens, Martin Schwidefsky

* Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> 
> > * Frederic Weisbecker (fweisbec@gmail.com) wrote:
> > > On Tue, Aug 25, 2009 at 03:51:11PM -0400, Mathieu Desnoyers wrote:
> > > > * Frederic Weisbecker (fweisbec@gmail.com) wrote:
> > > > > On Tue, Aug 25, 2009 at 02:31:19PM -0400, Mathieu Desnoyers wrote:
> > > > > > (Well, I do not have time currently to look into the gory details
> > > > > > (sorry), but let's try to take a step back from the problem.)
> > > > > > 
> > > > > > The design proposal for this kthread behavior wrt syscalls is based on a
> > > > > > very specific and current kernel behavior, that may happen to change and
> > > > > > that I have actually seen proven incorrect. For instance, some
> > > > > > proprietary Linux driver does very odd things with system calls within
> > > > > > kernel threads, like invoking them with int 0x80.
> > > > > > 
> > > > > > Yes, this is odd, but do we really want to tie the tracer that much to
> > > > > > the actual OS implementation specificities ?
> > > > > 
> > > > > 
> > > > > I really can't see the point in doing this. I don't expect the kernel
> > > > > behaviour to change soon and have explicit syscalls interrupts done
> > > > > from it. It's not about a current kernel implementation fashion,
> > > > > it's about kernel design sanity that is not likely to go backward.
> > > > > 
> > > > > Is it worth it to trace kernel threads, maintain their tracing
> > > > > specificities (such as workarounds with ret_from_fork that implies)
> > > > > just because we want to support tracing on some silly proprietary drivers?
> > > > > 
> > > > > 
> > > > > > 
> > > > > > That sounds like a recipe for endless breakages and missing bits of
> > > > > > instrumentation.
> > > > > >
> > > > > > So my advice would be: if we want to trace the syscall entry/exit paths,
> > > > > > let's trace them for the _whole_ system, and find ways to make it work
> > > > > > for corner-cases rather than finding clever ways to diminish
> > > > > > instrumentation coverage.
> > > > > 
> > > > > 
> > > > > If developers of out of tree drivers want to implement buggy things
> > > > > that would never be accepted after a minimal review here, and then instrument
> > > > > their bugs, then I would suggest them to implement their own ad hoc instrumentation,
> > > > > really :-/
> > > > > 
> > > > > What's the point in supporting out of tree bugs?
> > > > > 
> > > > > Well, the only advantage of doing this would be to support reverse engineering
> > > > > in tiny and rare corner cases. Not that worth the effort.
> > > > > 
> > > > >  
> > > > > > Given the ret from fork example happens to be the first event fired
> > > > > > after the thread is created, we should be able to deal with this problem
> > > > > > by initializing the thread structure used by syscall exit tracing to an
> > > > > > initial "ret from fork" value.
> > > > > > 
> > > > > > Mathieu
> > > > > 
> > > > > 
> > > > > It means we have to support and check this corner case in every archs
> > > > > that support syscall tracing, deal with crashes because we omitted it, etc...
> > > > > 
> > > > > For all the things I've explained above I don't think it's worth the effort.
> > > > > 
> > > > > But it's just my opinion...
> > > > > 
> > > > 
> > > > Then we might want to explicitly require that calls to sys_*() system
> > > > calls made from within the kernel pass through another instrumentation
> > > > mechanism. IMHO, that would make sense. It would cover both system calls
> > > > made from kernel threads and system calls made from within a system call
> > > > or trap.
> > > > 
> > > > Mathieu
> > > 
> > > 
> > > Well, we can't really set a tracepoint per sys_*() function. Or more
> > > precisely we already have them, automagically generated and relying on
> > > sysenter ptrace path.
> > > 
> > > But if we want to check which syscalls are called from kernel threads, we have:
> > > 
> > > - kthread() -> do_exit()
> > > 
> > > 
> > > The entry point of every kernel threads (except "kthreadd") is
> > > kthread(). It calls do_exit() in the end.
> > > 
> > > If we want to trace the exit of a kernel thread, we can put
> > > a tracepoint there instead of do_exit() which results would
> > > be intermixed with sys_exit() tracing.
> > > 
> > > 
> > > - kthreadd :: create_kthread() -> kernel_thread() -> do_fork()
> > > 
> > > 
> > > A creation of a thread is the result of the kthreadd thread fork().
> > > If we want to trace the creation of kernel threads, we can again do that
> > > in the upper level: kernel_thread().
> > > 
> > > But does that inform us about who created the thread? All we would see
> > > is kthreadd that forks. This is a very poor information compared
> > > to a userspace fork() that tells us who really created the new process.
> > > 
> > > Instead what we want is probably to trace kthread_create() which inserts the
> > > job of a thread creation in the kthreadd thread, so that we know
> > > _who_ asked for this thread creation (process that requested it and callsite).
> > > And that's much more rich in information.
> > > 
> > > Well, you can even climb in an upper layer and look if this is a workqueue,
> > > a kernel/async.c thread, a slow work, etc...
> > > 
> > > 
> > > - kernel_execve() -> sys_execve()
> > > 
> > > We can execute user apps from kernel through call_usermodehelper().
> > > And we can trace kernel_execve() or again in an upper layer
> > > like call_usermodehelper()
> > > 
> > > - ... I guess there are other examples
> > > 
> > > The kernel calls syscalls through wrappers, and tracing these 
> > > wrappers, depending of the desired level of informations we want 
> > > (choose your layer), are much more verbose / rich in 
> > > informations.
> > 
> > What you describe looks a lot like the approach I use in the LTTng 
> > tree. Actually, the main point I am trying to make here is: if we 
> > rely only on tracing at the syscall entry/exit level for, say, 
> > monitoring all uses of e.g. sys_open(), we might be caught 
> > offguard by internal sys_open() uses within the kernel.
> 
> There's a lot of 'internal' file opening going on within the kernel 
> that ptrace does not notice - see all the filp_open() calls.
> 
> Lets worry about this only if it's a true issue.
> 

We're already using open/close calls to map the read/write operations to
the actual files they affect in the LTTV analysis. So yes, it matters
from our side.

Mathieu

> 	Ingo

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 08/12] add trace events for each syscall entry/exit
  2009-08-26  7:10                 ` Peter Zijlstra
  2009-08-26 17:10                   ` Mathieu Desnoyers
@ 2009-08-26 17:24                   ` H. Peter Anvin
  1 sibling, 0 replies; 88+ messages in thread
From: H. Peter Anvin @ 2009-08-26 17:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mathieu Desnoyers, Frederic Weisbecker, Hendrik Brueckner,
	Jason Baron, linux-kernel, mingo, laijs, rostedt, jiayingz,
	mbligh, lizf, Heiko Carstens, Martin Schwidefsky,
	Thomas Gleixner

On 08/26/2009 12:10 AM, Peter Zijlstra wrote:
> On Tue, 2009-08-25 at 14:31 -0400, Mathieu Desnoyers wrote:
>> For instance, some
>> proprietary Linux driver does very odd things with system calls within
>> kernel threads, like invoking them with int 0x80.
> 
> So who is going to send the x86 patch to make int 0x80 from kernel space
> panic the machine? :-)

Panicking might be a pretty darn good idea, since $DEITY knows what state
your stacks are in at that point, and if it is about to overflow.

... besides all the other evilness that entails.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 08/12] add trace events for each syscall entry/exit
  2009-08-26 17:08                   ` Mathieu Desnoyers
@ 2009-08-26 18:41                     ` Christoph Hellwig
  2009-08-26 18:42                       ` Christoph Hellwig
  0 siblings, 1 reply; 88+ messages in thread
From: Christoph Hellwig @ 2009-08-26 18:41 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Frederic Weisbecker, Hendrik Brueckner,
	Jason Baron, linux-kernel, mingo, laijs, rostedt, jiayingz,
	mbligh, lizf, Heiko Carstens, Martin Schwidefsky

On Wed, Aug 26, 2009 at 01:08:12PM -0400, Mathieu Desnoyers wrote:
> Nah. And I start to feel comfortable with syscall entry/exit only
> being traced for userspace threads. But as I pointed out in a follow-up
> email, the lack of sys_*() tracing for invocation from within the kernel
> might be problematic. This is actually my main point.

We do not support system calls from kernelspace anymore.  All the
macros to do real system calls are gone, and there are very very few
places left calling sys_foo as normal function calls.

And how exactly is calling sys_foo as a normal function call different
from calling do_foo or vfs_foo?


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 08/12] add trace events for each syscall entry/exit
  2009-08-26 18:41                     ` Christoph Hellwig
@ 2009-08-26 18:42                       ` Christoph Hellwig
  2009-08-26 19:01                         ` Mathieu Desnoyers
  0 siblings, 1 reply; 88+ messages in thread
From: Christoph Hellwig @ 2009-08-26 18:42 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Frederic Weisbecker, Hendrik Brueckner,
	Jason Baron, linux-kernel, mingo, laijs, rostedt, jiayingz,
	mbligh, lizf, Heiko Carstens, Martin Schwidefsky

On Wed, Aug 26, 2009 at 02:41:16PM -0400, Christoph Hellwig wrote:
> We do not support system calls from kernelspace anymore.  All the
> macros to do real system calls are gone, and there are very very few
> places left calling sys_foo as normal function calls.
> 
> And how exactly is calling sys_foo as a normal function call different
> from calling do_foo or vfs_foo?

And if you really need to trace direct callers of sys_foo just put
a jprobe on it.


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 08/12] add trace events for each syscall entry/exit
  2009-08-26 18:42                       ` Christoph Hellwig
@ 2009-08-26 19:01                         ` Mathieu Desnoyers
  0 siblings, 0 replies; 88+ messages in thread
From: Mathieu Desnoyers @ 2009-08-26 19:01 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Peter Zijlstra, Frederic Weisbecker, Hendrik Brueckner,
	Jason Baron, linux-kernel, mingo, laijs, rostedt, jiayingz,
	mbligh, lizf, Heiko Carstens, Martin Schwidefsky

* Christoph Hellwig (hch@infradead.org) wrote:
> On Wed, Aug 26, 2009 at 02:41:16PM -0400, Christoph Hellwig wrote:
> > We do not support system calls from kernelspace anymore.  All the
> > macros to do real system calls are gone, and there are very very few
> > places left calling sys_foo as normal function calls.
> > 
> > And how exactly is calling sys_foo as a normal function call different
> > from calling do_foo or vfs_foo?
> 

Not very different, no. But the fact is that we would not be
instrumenting do_foo nor vfs_foo because we would somehow expect all
callers to go through a system call, which ain't always true.

> And if you really need to trace direct callers of sys_foo just put
> a jprobe on it.
> 

Yep, that would do it. Getting the arguments of the function upon entry
and the return value on exit is pretty much all we need.

But I would like to ensure that we do not duplicate the instrumentation
done by the generic syscall instrumentation either, so we don't end up
having:

syscall_entry X (args)
do_X (args)
do_X return (return value)
syscall_exit X (return value)

If I am not mistaken, the current execution paths are:

From userland:

syscall -> sys_X() -> do_X()

From the kernel:

do_X()

Adding a trampoline taken only when the kernel is doing the call might
be useful there to selectively trace calls made by the kernel.
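
As a rough user-space illustration of the two paths above (not kernel code),
the sketch below counts one event per call without ever instrumenting do_X
itself, so a syscall is not reported twice. All names ending in _sketch are
hypothetical, and the "trampoline" is just a plain wrapper function standing
in for whatever mechanism would actually be used.

```c
/* Event counters standing in for emitted trace records. */
int syscall_events_sketch;
int kernel_call_events_sketch;

/* The shared implementation: deliberately left uninstrumented. */
static long do_x_sketch(long arg)
{
	return arg * 2;
}

/* Userland path: syscall -> sys_X() -> do_X(), traced at the syscall layer. */
long sys_x_sketch(long arg)
{
	syscall_events_sketch++;
	return do_x_sketch(arg);
}

/* Kernel path: a wrapper taken only by in-kernel callers of do_X(). */
long kernel_do_x_sketch(long arg)
{
	kernel_call_events_sketch++;
	return do_x_sketch(arg);
}
```

Each path bumps exactly one counter per call, which is the property the
paragraph above asks for.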

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 88+ messages in thread

* [PATCH]: tracing: s390 arch updates for tracing syscalls
  2009-08-26 16:53   ` Frederic Weisbecker
@ 2009-08-27  7:27     ` Hendrik Brueckner
  0 siblings, 0 replies; 88+ messages in thread
From: Hendrik Brueckner @ 2009-08-27  7:27 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Hendrik Brueckner, Jason Baron, linux-kernel, mingo, laijs,
	rostedt, peterz, mathieu.desnoyers, jiayingz, mbligh, lizf,
	Heiko Carstens, Martin Schwidefsky

On Wed, Aug 26, 2009 at 06:53:25PM +0200, Frederic Weisbecker wrote:
> > Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
> > ---
> >  arch/s390/include/asm/ftrace.h |    4 ++++
> 
> Btw, the diffstat shows changes in ftrace.h but your patch doesn't.
> Is there something missing?
No. The diffstat was not refreshed after removing the
FTRACE_SYSCALL_MAX constant.
Here is the patch again with the refreshed diffstat:

Syscall tracing updates for s390 architecture.

This patch includes s390 arch updates for:
- tracing: Map syscall name to number (syscall_name_to_nr())
- tracing: Call arch_init_ftrace_syscalls at boot
- tracing: add support tracepoint ids (set_syscall_{enter,exit}_id())
  
Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
---
 arch/s390/kernel/ftrace.c |   36 +++++++++++++++++++++++++++---------
 1 file changed, 27 insertions(+), 9 deletions(-)

--- a/arch/s390/kernel/ftrace.c
+++ b/arch/s390/kernel/ftrace.c
@@ -220,6 +220,29 @@ struct syscall_metadata *syscall_nr_to_m
 	return syscalls_metadata[nr];
 }
 
+int syscall_name_to_nr(char *name)
+{
+	int i;
+
+	if (!syscalls_metadata)
+		return -1;
+	for (i = 0; i < NR_syscalls; i++)
+		if (syscalls_metadata[i])
+			if (!strcmp(syscalls_metadata[i]->name, name))
+				return i;
+	return -1;
+}
+
+void set_syscall_enter_id(int num, int id)
+{
+	syscalls_metadata[num]->enter_id = id;
+}
+
+void set_syscall_exit_id(int num, int id)
+{
+	syscalls_metadata[num]->exit_id = id;
+}
+
 static struct syscall_metadata *find_syscall_meta(unsigned long syscall)
 {
 	struct syscall_metadata *start;
@@ -237,24 +260,19 @@ static struct syscall_metadata *find_sys
 	return NULL;
 }
 
-void arch_init_ftrace_syscalls(void)
+static int __init arch_init_ftrace_syscalls(void)
 {
 	struct syscall_metadata *meta;
 	int i;
-	static atomic_t refs;
-
-	if (atomic_inc_return(&refs) != 1)
-		goto out;
 	syscalls_metadata = kzalloc(sizeof(*syscalls_metadata) * NR_syscalls,
 				    GFP_KERNEL);
 	if (!syscalls_metadata)
-		goto out;
+		return -ENOMEM;
 	for (i = 0; i < NR_syscalls; i++) {
 		meta = find_syscall_meta((unsigned long)sys_call_table[i]);
 		syscalls_metadata[i] = meta;
 	}
-	return;
-out:
-	atomic_dec(&refs);
+	return 0;
 }
+arch_initcall(arch_init_ftrace_syscalls);
 #endif

^ permalink raw reply	[flat|nested] 88+ messages in thread

* [tip:tracing/core] tracing: Add syscall tracepoints - s390 arch update
  2009-08-25 12:31 ` [PATCH 00/12] add syscall tracepoints V3 - s390 arch update Hendrik Brueckner
  2009-08-25 13:52   ` Frederic Weisbecker
  2009-08-26 16:53   ` Frederic Weisbecker
@ 2009-08-28 12:27   ` tip-bot for Hendrik Brueckner
  2 siblings, 0 replies; 88+ messages in thread
From: tip-bot for Hendrik Brueckner @ 2009-08-28 12:27 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mathieu.desnoyers, brueckner, mingo, schwidefsky, peterz,
	fweisbec, rostedt, tglx, jbaron, laijs, hpa, jiayingz,
	linux-kernel, lizf, lethal, mingo, mbligh

Commit-ID:  7515bf59f87f19b2a17972b74230d2f91756fe3c
Gitweb:     http://git.kernel.org/tip/7515bf59f87f19b2a17972b74230d2f91756fe3c
Author:     Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
AuthorDate: Tue, 25 Aug 2009 14:31:11 +0200
Committer:  Frederic Weisbecker <fweisbec@gmail.com>
CommitDate: Wed, 26 Aug 2009 21:29:44 +0200

tracing: Add syscall tracepoints - s390 arch update

This patch includes s390 arch updates to synchronize with latest
core changes in the syscalls tracing area.

- tracing: Map syscall name to number (syscall_name_to_nr())
- tracing: Call arch_init_ftrace_syscalls at boot
- tracing: add support tracepoint ids (set_syscall_{enter,exit}_id())
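
For readers who want to try the name-to-number lookup outside the kernel, here
is a minimal user-space sketch of the same linear scan. The struct, the table
contents, and the table size are made up for illustration; only the overall
shape (skip empty slots, strcmp on the name, return -1 when nothing matches)
follows the patch below.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's struct syscall_metadata. */
struct meta_sketch {
	const char *name;
};

#define NR_SKETCH 4	/* made-up table size, not the real NR_syscalls */

static struct meta_sketch meta_read  = { "sys_read" };
static struct meta_sketch meta_write = { "sys_write" };
static struct meta_sketch meta_close = { "sys_close" };

/* Slot 2 is left NULL to model a syscall that has no metadata. */
static struct meta_sketch *table_sketch[NR_SKETCH] = {
	&meta_read, &meta_write, NULL, &meta_close,
};

/* Linear scan over the table; returns the index, or -1 if not found. */
int name_to_nr_sketch(const char *name)
{
	int i;

	assert(name != NULL);
	for (i = 0; i < NR_SKETCH; i++) {
		if (table_sketch[i] && !strcmp(table_sketch[i]->name, name))
			return i;
	}
	return -1;
}
```

The scan is O(n) per lookup, which is presumably acceptable here because the
lookup happens when events are set up rather than on every syscall.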

Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Jiaying Zhang <jiayingz@google.com>
Cc: Martin Bligh <mbligh@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
LKML-Reference: <20090825123111.GD4639@cetus.boeblingen.de.ibm.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>


---
 arch/s390/kernel/ftrace.c |   36 +++++++++++++++++++++++++++---------
 1 files changed, 27 insertions(+), 9 deletions(-)

diff --git a/arch/s390/kernel/ftrace.c b/arch/s390/kernel/ftrace.c
index 3e298e6..57bdcb1 100644
--- a/arch/s390/kernel/ftrace.c
+++ b/arch/s390/kernel/ftrace.c
@@ -220,6 +220,29 @@ struct syscall_metadata *syscall_nr_to_meta(int nr)
 	return syscalls_metadata[nr];
 }
 
+int syscall_name_to_nr(char *name)
+{
+	int i;
+
+	if (!syscalls_metadata)
+		return -1;
+	for (i = 0; i < NR_syscalls; i++)
+		if (syscalls_metadata[i])
+			if (!strcmp(syscalls_metadata[i]->name, name))
+				return i;
+	return -1;
+}
+
+void set_syscall_enter_id(int num, int id)
+{
+	syscalls_metadata[num]->enter_id = id;
+}
+
+void set_syscall_exit_id(int num, int id)
+{
+	syscalls_metadata[num]->exit_id = id;
+}
+
 static struct syscall_metadata *find_syscall_meta(unsigned long syscall)
 {
 	struct syscall_metadata *start;
@@ -237,24 +260,19 @@ static struct syscall_metadata *find_syscall_meta(unsigned long syscall)
 	return NULL;
 }
 
-void arch_init_ftrace_syscalls(void)
+static int __init arch_init_ftrace_syscalls(void)
 {
 	struct syscall_metadata *meta;
 	int i;
-	static atomic_t refs;
-
-	if (atomic_inc_return(&refs) != 1)
-		goto out;
 	syscalls_metadata = kzalloc(sizeof(*syscalls_metadata) * NR_syscalls,
 				    GFP_KERNEL);
 	if (!syscalls_metadata)
-		goto out;
+		return -ENOMEM;
 	for (i = 0; i < NR_syscalls; i++) {
 		meta = find_syscall_meta((unsigned long)sys_call_table[i]);
 		syscalls_metadata[i] = meta;
 	}
-	return;
-out:
-	atomic_dec(&refs);
+	return 0;
 }
+arch_initcall(arch_init_ftrace_syscalls);
 #endif

^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [tip:tracing/core] tracing: Check invalid syscall nr while tracing syscalls
  2009-08-25 12:50   ` Hendrik Brueckner
  2009-08-25 14:15     ` Frederic Weisbecker
  2009-08-25 21:40     ` [PATCH 08/12] add trace events for each syscall entry/exit Frederic Weisbecker
@ 2009-08-28 12:27     ` tip-bot for Hendrik Brueckner
  2 siblings, 0 replies; 88+ messages in thread
From: tip-bot for Hendrik Brueckner @ 2009-08-28 12:27 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mathieu.desnoyers, brueckner, mingo, schwidefsky, peterz,
	fweisbec, rostedt, heiko.carstens, tglx, jbaron, laijs,
	linux-kernel, hpa, jiayingz, lizf, lethal, mingo, mbligh

Commit-ID:  cd0980fc8add25e8ab12fcf1051c0f20cbc7c0c0
Gitweb:     http://git.kernel.org/tip/cd0980fc8add25e8ab12fcf1051c0f20cbc7c0c0
Author:     Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
AuthorDate: Tue, 25 Aug 2009 14:50:27 +0200
Committer:  Frederic Weisbecker <fweisbec@gmail.com>
CommitDate: Wed, 26 Aug 2009 21:29:48 +0200

tracing: Check invalid syscall nr while tracing syscalls

Most arch syscall_get_nr() implementations return -1 if the syscall
number is not valid.  Accessing the bit field without a check might
result in a kernel oops (at least I saw it on s390 for ftrace selftest).

Before this change, this problem did not occur, because the invalid
syscall number (-1) caused syscall_nr_to_meta() to return NULL.

There are at least two scenarios where syscall_get_nr() can return -1:

1. For example, ptrace stores an invalid syscall number, and thus,
   tracing code resets it.
   (see do_syscall_trace_enter in arch/s390/kernel/ptrace.c)

2. The syscall_regfunc() (kernel/tracepoint.c) sets the
   TIF_SYSCALL_FTRACE (now: TIF_SYSCALL_TRACEPOINT) flag for all threads
   which include kernel threads.
   However, the ftrace selftest triggers a kernel oops when testing
   syscall trace points:
      - The kernel thread is started as usual (do_fork()),
      - tracing code sets TIF_SYSCALL_FTRACE,
      - the ret_from_fork() function is triggered and starts
	ftrace_syscall_exit() with an invalid syscall number.

To avoid these scenarios, I suggest checking the syscall_nr.
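
The effect of the check can be shown with a small user-space sketch: a
negative index must be rejected before it is used to address a bitmap,
because it would otherwise refer to memory outside the array. The bitmap
helper below is a hand-written stand-in, not the kernel's test_bit(), and the
upper-bound test is an extra assumption that the patch itself does not add.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define NR_SKETCH 64	/* made-up number of syscalls */
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define WORDS ((NR_SKETCH + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* One bit per syscall: set means "tracing enabled". */
static unsigned long enabled_sketch[WORDS];

void enable_sketch(int nr)
{
	assert(nr >= 0 && nr < NR_SKETCH);
	enabled_sketch[nr / BITS_PER_WORD] |= 1UL << (nr % BITS_PER_WORD);
}

/* Reject invalid numbers first, then consult the bitmap. */
bool should_trace_sketch(int nr)
{
	if (nr < 0 || nr >= NR_SKETCH)
		return false;
	return (enabled_sketch[nr / BITS_PER_WORD] >> (nr % BITS_PER_WORD)) & 1UL;
}
```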

For instance, the ftrace selftest fails for s390 (with config option
CONFIG_FTRACE_SYSCALLS set) and produces the following kernel oops.

Unable to handle kernel pointer dereference at virtual kernel address 2000000000

Oops: 0038 [#1] PREEMPT SMP
Modules linked in:
CPU: 0 Not tainted 2.6.31-rc6-next-20090819-dirty #18
Process kthreadd (pid: 818, task: 000000003ea207e8, ksp: 000000003e813eb8)
Krnl PSW : 0704100180000000 00000000000ea54c (ftrace_syscall_exit+0x58/0xdc)
           R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:1 PM:0 EA:3
Krnl GPRS: 0000000000000000 00000000000e0000 ffffffffffffffff 20000000008c2650
           0000000000000007 0000000000000000 0000000000000000 0000000000000000
           0000000000000000 0000000000000000 ffffffffffffffff 000000003e813d78
           000000003e813f58 0000000000505ba8 000000003e813e18 000000003e813d78
Krnl Code: 00000000000ea540: e330d0000008       ag      %r3,0(%r13)
           00000000000ea546: a7480007           lhi     %r4,7
           00000000000ea54a: 1442               nr      %r4,%r2
          >00000000000ea54c: e31030000090       llgc    %r1,0(%r3)
           00000000000ea552: 5410d008           n       %r1,8(%r13)
           00000000000ea556: 8a104000           sra     %r1,0(%r4)
           00000000000ea55a: 5410d00c           n       %r1,12(%r13)
           00000000000ea55e: 1211               ltr     %r1,%r1
Call Trace:
([<0000000000000000>] 0x0)
 [<000000000001fa22>] do_syscall_trace_exit+0x132/0x18c
 [<000000000002d0c4>] sysc_return+0x0/0x8
 [<000000000001c738>] kernel_thread_starter+0x0/0xc
Last Breaking-Event-Address:
 [<00000000000ea51e>] ftrace_syscall_exit+0x2a/0xdc

Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Jiaying Zhang <jiayingz@google.com>
Cc: Martin Bligh <mbligh@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
LKML-Reference: <20090825125027.GE4639@cetus.boeblingen.de.ibm.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>


---
 kernel/trace/trace_syscalls.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index 85291c4..cb7f600 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -227,6 +227,8 @@ void ftrace_syscall_enter(struct pt_regs *regs, long id)
 	int syscall_nr;
 
 	syscall_nr = syscall_get_nr(current, regs);
+	if (syscall_nr < 0)
+		return;
 	if (!test_bit(syscall_nr, enabled_enter_syscalls))
 		return;
 
@@ -257,6 +259,8 @@ void ftrace_syscall_exit(struct pt_regs *regs, long ret)
 	int syscall_nr;
 
 	syscall_nr = syscall_get_nr(current, regs);
+	if (syscall_nr < 0)
+		return;
 	if (!test_bit(syscall_nr, enabled_exit_syscalls))
 		return;
 

^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [tip:tracing/core] tracing: Don't trace kernel thread syscalls
  2009-08-25 16:02       ` Hendrik Brueckner
  2009-08-25 16:20         ` Mathieu Desnoyers
  2009-08-26 12:35         ` Frederic Weisbecker
@ 2009-08-28 12:28         ` tip-bot for Hendrik Brueckner
  2 siblings, 0 replies; 88+ messages in thread
From: tip-bot for Hendrik Brueckner @ 2009-08-28 12:28 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mathieu.desnoyers, brueckner, mingo, schwidefsky, peterz,
	fweisbec, rostedt, heiko.carstens, tglx, jbaron, laijs, hpa,
	jiayingz, linux-kernel, lizf, lethal, mingo, mbligh

Commit-ID:  cc3b13c11c567c69a6356be98d0c03ff11541d5c
Gitweb:     http://git.kernel.org/tip/cc3b13c11c567c69a6356be98d0c03ff11541d5c
Author:     Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
AuthorDate: Tue, 25 Aug 2009 18:02:37 +0200
Committer:  Frederic Weisbecker <fweisbec@gmail.com>
CommitDate: Wed, 26 Aug 2009 21:29:52 +0200

tracing: Don't trace kernel thread syscalls

Kernel threads don't call syscalls using the sysenter/sysexit
path. Instead they directly call the sys_* or do_* functions
that implement the syscalls inside the kernel.

The current syscall tracepoints only hook the sysenter/sysexit
path, so they cannot trace the calls that kernel threads make to
the syscall implementations.
Setting the TIF_SYSCALL_TRACEPOINT flag is therefore useless for these
threads.

Actually there is only one case when a kernel thread can reach the
usual syscall exit tracing path: when we create a kernel thread, the
child comes to ret_from_fork and the fork() return is then traced.
But this information alone is useless, so we don't want to set the
TIF flags for these threads.

Kernel threads have task_struct->mm set to NULL.
(Thanks to Heiko for that hint ;-)
The idea is then to check the mm field in syscall_regfunc() and
set the flag accordingly.
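
The same idea in a self-contained user-space sketch: walk a list of tasks and
flag only those with a non-NULL mm. The task and mm structures here are
hypothetical miniatures, and the boolean field merely stands in for
TIF_SYSCALL_TRACEPOINT; locking and the real thread iteration are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct mm_sketch {
	int unused;
};

struct task_sketch {
	struct mm_sketch *mm;	/* NULL for kernel threads */
	bool trace_flag;	/* stands in for TIF_SYSCALL_TRACEPOINT */
};

/* Flags every task that has an mm; returns how many were flagged. */
size_t flag_user_tasks_sketch(struct task_sketch *tasks, size_t n)
{
	size_t i, flagged = 0;

	assert(tasks != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (!tasks[i].mm)
			continue;	/* skip kernel threads */
		tasks[i].trace_flag = true;
		flagged++;
	}
	return flagged;
}
```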

Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Jiaying Zhang <jiayingz@google.com>
Cc: Martin Bligh <mbligh@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
LKML-Reference: <20090825160237.GG4639@cetus.boeblingen.de.ibm.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>


---
 kernel/tracepoint.c |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
index 1a6a453..9489a0a 100644
--- a/kernel/tracepoint.c
+++ b/kernel/tracepoint.c
@@ -597,7 +597,9 @@ void syscall_regfunc(void)
 	if (!sys_tracepoint_refcount) {
 		read_lock_irqsave(&tasklist_lock, flags);
 		do_each_thread(g, t) {
-			set_tsk_thread_flag(t, TIF_SYSCALL_TRACEPOINT);
+			/* Skip kernel threads. */
+			if (t->mm)
+				set_tsk_thread_flag(t, TIF_SYSCALL_TRACEPOINT);
 		} while_each_thread(g, t);
 		read_unlock_irqrestore(&tasklist_lock, flags);
 	}

^ permalink raw reply related	[flat|nested] 88+ messages in thread

end of thread, other threads:[~2009-08-28 12:29 UTC | newest]

Thread overview: 88+ messages (download: mbox.gz / follow: Atom feed)
2009-08-10 20:52 [PATCH 00/12] add syscall tracepoints V3 Jason Baron
2009-08-10 20:52 ` [PATCH 01/12] map syscall name to number Jason Baron
2009-08-10 20:52 ` [PATCH 02/12] call arch_init_ftrace_syscalls at boot Jason Baron
2009-08-10 20:52 ` [PATCH 03/12] add DECLARE_TRACE_WITH_CALLBACK() macro Jason Baron
2009-08-10 20:52 ` [PATCH 04/12] add syscall tracepoints Jason Baron
2009-08-10 20:52 ` [PATCH 05/12] update FTRACE_SYSCALL_MAX Jason Baron
2009-08-11 11:00   ` Frederic Weisbecker
2009-08-11 19:39     ` Matt Fleming
2009-08-24 13:41     ` Paul Mundt
2009-08-24 14:06       ` Jason Baron
2009-08-24 14:15         ` Paul Mundt
2009-08-24 14:34           ` Frederic Weisbecker
2009-08-24 14:37             ` Paul Mundt
2009-08-24 14:42           ` Jason Baron
2009-08-24 14:50             ` Paul Mundt
2009-08-24 18:34               ` Ingo Molnar
2009-08-10 20:52 ` [PATCH 06/12] trace_event - raw_init bailout Jason Baron
2009-08-10 20:52 ` [PATCH 07/12] add ftrace_event_call void * 'data' field Jason Baron
2009-08-11 10:09   ` Frederic Weisbecker
2009-08-17 22:19     ` Steven Rostedt
2009-08-17 23:09       ` Frederic Weisbecker
2009-08-18  0:06         ` Steven Rostedt
2009-08-10 20:52 ` [PATCH 08/12] add trace events for each syscall entry/exit Jason Baron
2009-08-11 10:50   ` Frederic Weisbecker
2009-08-11 11:45     ` Ingo Molnar
2009-08-11 12:01       ` Frederic Weisbecker
2009-08-25 12:50   ` Hendrik Brueckner
2009-08-25 14:15     ` Frederic Weisbecker
2009-08-25 16:02       ` Hendrik Brueckner
2009-08-25 16:20         ` Mathieu Desnoyers
2009-08-25 16:59           ` Frederic Weisbecker
2009-08-25 17:31             ` Frederic Weisbecker
2009-08-25 18:31               ` Mathieu Desnoyers
2009-08-25 19:42                 ` Frederic Weisbecker
2009-08-25 19:51                   ` Mathieu Desnoyers
2009-08-26  0:19                     ` Frederic Weisbecker
2009-08-26  0:42                       ` Mathieu Desnoyers
2009-08-26  7:28                         ` Ingo Molnar
2009-08-26 17:11                           ` Mathieu Desnoyers
2009-08-26  6:48                   ` Peter Zijlstra
2009-08-25 22:04                 ` Martin Schwidefsky
2009-08-26  7:38                   ` Heiko Carstens
2009-08-26 12:32                     ` Frederic Weisbecker
2009-08-26  6:21                 ` Peter Zijlstra
2009-08-26 17:08                   ` Mathieu Desnoyers
2009-08-26 18:41                     ` Christoph Hellwig
2009-08-26 18:42                       ` Christoph Hellwig
2009-08-26 19:01                         ` Mathieu Desnoyers
2009-08-26  7:10                 ` Peter Zijlstra
2009-08-26 17:10                   ` Mathieu Desnoyers
2009-08-26 17:24                   ` H. Peter Anvin
2009-08-25 17:04           ` Jason Baron
2009-08-25 18:15             ` Mathieu Desnoyers
2009-08-26 12:35         ` Frederic Weisbecker
2009-08-26 12:59           ` Heiko Carstens
2009-08-26 13:30             ` Frederic Weisbecker
2009-08-26 13:48               ` Steven Rostedt
2009-08-26 13:53                 ` Frederic Weisbecker
2009-08-26 14:44                   ` Steven Rostedt
2009-08-26 13:56                 ` Peter Zijlstra
2009-08-26 14:41                   ` Steven Rostedt
2009-08-26 14:10               ` Heiko Carstens
2009-08-26 14:27                 ` Frederic Weisbecker
2009-08-26 14:43                   ` Steven Rostedt
2009-08-26 16:14                     ` Frederic Weisbecker
2009-08-26 14:43                 ` Steven Rostedt
2009-08-26 14:41           ` Hendrik Brueckner
2009-08-28 12:28         ` [tip:tracing/core] tracing: Don't trace kernel thread syscalls tip-bot for Hendrik Brueckner
2009-08-25 21:40     ` [PATCH 08/12] add trace events for each syscall entry/exit Frederic Weisbecker
2009-08-25 22:09       ` Frederic Weisbecker
2009-08-26  7:47         ` Heiko Carstens
2009-08-28 12:27     ` [tip:tracing/core] tracing: Check invalid syscall nr while tracing syscalls tip-bot for Hendrik Brueckner
2009-08-10 20:52 ` [PATCH 09/12] add support traceopint ids Jason Baron
2009-08-11 11:28   ` Frederic Weisbecker
2009-08-10 20:53 ` [PATCH 10/12] add perf counter support Jason Baron
2009-08-11 12:12   ` Frederic Weisbecker
2009-08-11 12:17     ` Ingo Molnar
2009-08-11 12:25       ` Frederic Weisbecker
2009-08-10 20:53 ` [PATCH 11/12] add more namespace area to 'perf list' output Jason Baron
2009-08-10 20:53 ` [PATCH 12/12] convert x86_64 mmap and uname to use DEFINE_SYSCALL Jason Baron
2009-08-25 12:31 ` [PATCH 00/12] add syscall tracepoints V3 - s390 arch update Hendrik Brueckner
2009-08-25 13:52   ` Frederic Weisbecker
2009-08-25 14:39     ` Heiko Carstens
2009-08-25 19:52       ` Frederic Weisbecker
2009-08-25 15:38     ` Hendrik Brueckner
2009-08-26 16:53   ` Frederic Weisbecker
2009-08-27  7:27     ` [PATCH]: tracing: s390 arch updates for tracing syscalls Hendrik Brueckner
2009-08-28 12:27   ` [tip:tracing/core] tracing: Add syscall tracepoints - s390 arch update tip-bot for Hendrik Brueckner
