* [PATCH 0/6] Enhance and speed up syscall tracing
@ 2012-03-26 18:39 Vaibhav Nagarnaik
  2012-03-26 18:39 ` [PATCH 1/6] trace: syscalls.h - cleanup and simplify SYSCALL_METADATA() Vaibhav Nagarnaik
                   ` (5 more replies)
  0 siblings, 6 replies; 29+ messages in thread
From: Vaibhav Nagarnaik @ 2012-03-26 18:39 UTC
  To: Steven Rostedt, Frederic Weisbecker, Thomas Gleixner, Ingo Molnar
  Cc: David Sharp, Justin Teravest, Laurent Chavey, x86, linux-kernel,
	Vaibhav Nagarnaik

Tracing syscalls in the kernel adds its own complexity. This patchset
attempts to simplify the code for syscall tracing and speed it up.

It also adds tracing for ia32 compat syscalls.


These patches simplify the syscall tracing macros:
* trace: syscalls.h - cleanup and simplify SYSCALL_METADATA()
* trace: Refactor ftrace syscall macros to make them more readable


These patches add tracing support for compat syscalls:
* trace: add support for 32 bit compat syscalls on x86_64
* trace: raw_syscalls: Mark compat syscalls in the MSB of the syscall
number
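
To illustrate the idea named in the last patch title, marking a compat
syscall in the most significant bit of the syscall number could look
roughly like the sketch below. It is a userspace model, not code from
the series: the macro and helper names are hypothetical, and it assumes
the number is carried in an unsigned long.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical flag: the top bit of an unsigned long syscall number. */
#define COMPAT_SYSCALL_BIT (1UL << (sizeof(unsigned long) * CHAR_BIT - 1))

/* Non-zero if the number carries the compat tag. */
static int is_compat(unsigned long nr)
{
	return (nr & COMPAT_SYSCALL_BIT) != 0;
}

/* Tag a syscall number as coming from the compat (32 bit) entry path. */
static unsigned long mark_compat(unsigned long nr)
{
	assert(!is_compat(nr));	/* the top bit must be free to begin with */
	return nr | COMPAT_SYSCALL_BIT;
}

/* Strip the tag to recover the plain syscall number. */
static unsigned long syscall_index(unsigned long nr)
{
	return nr & ~COMPAT_SYSCALL_BIT;
}
```

A consumer of the raw event would test is_compat() first and then use
syscall_index() to pick the entry in the matching table.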


This patch reduces syscall tracing latency:
* trace: trace syscall in its handler not from ptrace handler

It changes the way syscall tracing is plumbed. The syscall tracepoints
were previously called from the ptrace handler, which added latency to
the syscall tracing code path because of the stack manipulation that
path requires. Tracing can instead be handled by installing a small
indirection function as the syscall handler and calling the tracepoints
just before the actual syscall handler.

It decreases latency for all syscall tracepoints.
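
The indirection described above can be modelled in userspace roughly as
follows. This is only a sketch under stated assumptions: every name in
it is hypothetical, the real patch works on the kernel's syscall entry
path rather than on function pointers like these, and the exit hook is
added here for symmetry.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a syscall handler taking two arguments. */
typedef long (*syscall_fn)(long a0, long a1);

struct traced_syscall {
	const char *name;
	syscall_fn real;	/* the actual syscall handler */
};

static int trace_enabled;		/* stands in for "tracepoint is on" */
static int enter_hits, exit_hits;	/* counters replace real trace output */

static void trace_enter(const struct traced_syscall *sc)
{
	(void)sc;
	enter_hits++;
}

static void trace_exit(const struct traced_syscall *sc, long ret)
{
	(void)sc;
	(void)ret;
	exit_hits++;
}

/* Indirection function: fire the tracepoints around the real handler. */
static long traced_call(const struct traced_syscall *sc, long a0, long a1)
{
	long ret;

	assert(sc != NULL && sc->real != NULL);
	if (!trace_enabled)
		return sc->real(a0, a1);	/* untraced path: direct call */

	trace_enter(sc);
	ret = sc->real(a0, a1);
	trace_exit(sc, ret);
	return ret;
}

/* Toy handler used to exercise the wrapper. */
static long demo_add(long a, long b)
{
	return a + b;
}

static const struct traced_syscall demo = { "demo_add", demo_add };
```

Calling traced_call(&demo, 2, 3) returns the handler's result either
way; the counters only move while trace_enabled is set.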


David Sharp (1):
  trace: raw_syscalls: Mark compat syscalls in the MSB of the syscall
    number

Michael Davidson (3):
  trace: syscalls.h - cleanup and simplify SYSCALL_METADATA()
  trace: add support for 32 bit compat syscalls on x86_64
  trace: get rid of the enabled_*_syscalls bitmaps

Vaibhav Nagarnaik (2):
  trace: Refactor ftrace syscall macros to make them more readable
  trace: trace syscall in its handler not from ptrace handler

 arch/openrisc/include/asm/thread_info.h |    1 -
 arch/powerpc/Kconfig                    |    1 -
 arch/powerpc/include/asm/thread_info.h  |    4 +-
 arch/powerpc/kernel/ptrace.c            |    6 -
 arch/s390/Kconfig                       |    1 -
 arch/s390/include/asm/thread_info.h     |    2 -
 arch/s390/kernel/entry.S                |    3 +-
 arch/s390/kernel/entry64.S              |    3 +-
 arch/s390/kernel/ptrace.c               |    9 -
 arch/sh/Kconfig                         |    1 -
 arch/sh/include/asm/thread_info.h       |    8 +-
 arch/sh/kernel/ptrace_32.c              |    9 -
 arch/sh/kernel/ptrace_64.c              |    9 -
 arch/sparc/Kconfig                      |    1 -
 arch/sparc/include/asm/thread_info_64.h |    2 -
 arch/sparc/kernel/ptrace_64.c           |    9 -
 arch/sparc/kernel/syscalls.S            |   10 +-
 arch/x86/Kconfig                        |    1 -
 arch/x86/ia32/Makefile                  |    2 +
 arch/x86/ia32/ia32_syscall_metadata.c   |  443 +++++++++++++++++++++++++++++++
 arch/x86/include/asm/thread_info.h      |   10 +-
 arch/x86/kernel/ptrace.c                |    9 -
 fs/dcookies.c                           |    2 +-
 fs/open.c                               |    2 +-
 fs/read_write.c                         |    8 +-
 fs/sync.c                               |    8 +-
 include/linux/syscalls.h                |  153 +++++++----
 include/trace/events/syscalls.h         |   38 ++--
 include/trace/syscall.h                 |   18 ++-
 kernel/trace/Kconfig                    |   12 +-
 kernel/trace/trace_syscalls.c           |  153 +++++------
 kernel/tracepoint.c                     |   38 ---
 mm/fadvise.c                            |    5 +-
 mm/filemap.c                            |    2 +-
 34 files changed, 681 insertions(+), 302 deletions(-)
 create mode 100644 arch/x86/ia32/ia32_syscall_metadata.c

-- 
1.7.7.3



* [PATCH 1/6] trace: syscalls.h - cleanup and simplify SYSCALL_METADATA()
  2012-03-26 18:39 [PATCH 0/6] Enhance and speed up syscall tracing Vaibhav Nagarnaik
@ 2012-03-26 18:39 ` Vaibhav Nagarnaik
  2012-03-26 18:39 ` [PATCH 2/6] trace: add support for 32 bit compat syscalls on x86_64 Vaibhav Nagarnaik
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 29+ messages in thread
From: Vaibhav Nagarnaik @ 2012-03-26 18:39 UTC
  To: Steven Rostedt, Frederic Weisbecker, Thomas Gleixner, Ingo Molnar
  Cc: David Sharp, Justin Teravest, Laurent Chavey, x86, linux-kernel,
	Michael Davidson, Vaibhav Nagarnaik

From: Michael Davidson <md@google.com>

Add 0 argument versions of the __SC_* macros so that system calls
with 0 arguments are no longer a special case.

Change SYSCALL_DEFINE0() to use SYSCALL_DEFINEx() like everything else.

Move the declarations of types_##sname and args_##sname into the
SYSCALL_METADATA() macro so that the changes to SYSCALL_DEFINEx()
when CONFIG_FTRACE_SYSCALLS is defined are all hidden inside of
SYSCALL_METADATA().
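
A reduced userspace sketch of why the 0 argument helpers remove the
special case: with a per-arity macro defined for 0 as well, one
METADATAx() body serves every arity. The names below are shortened,
hypothetical stand-ins for the kernel macros, only arities 0 to 2 are
shown, and NULL is used as the placeholder array element.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Per-arity helpers; the 0 case yields a single NULL placeholder. */
#define STR_TDECL0()		NULL
#define STR_TDECL1(t, a)	#t
#define STR_TDECL2(t, a, ...)	#t, STR_TDECL1(__VA_ARGS__)

#define STR_ADECL0()		NULL
#define STR_ADECL1(t, a)	#a
#define STR_ADECL2(t, a, ...)	#a, STR_ADECL1(__VA_ARGS__)

/* Simplified stand-in for struct syscall_metadata. */
struct meta {
	const char *name;
	int nb_args;
	const char **types;
	const char **args;
};

/* One definition for all arities: x selects the helper by token pasting. */
#define METADATAx(x, sname, ...)					\
	static const char *types_##sname[] = {				\
		STR_TDECL##x(__VA_ARGS__)				\
	};								\
	static const char *args_##sname[] = {				\
		STR_ADECL##x(__VA_ARGS__)				\
	};								\
	static const struct meta meta_##sname = {			\
		"sys_" #sname, x, types_##sname, args_##sname		\
	};

/* The trailing comma passes an empty variadic argument (strict C99). */
METADATAx(0, getpid, )
METADATAx(2, kill, int, pid, int, sig)

/* Non-zero if the metadata lists an argument with the given name. */
static int meta_has_arg(const struct meta *m, const char *arg)
{
	int i;

	assert(m != NULL);
	for (i = 0; i < m->nb_args; i++)
		if (strcmp(m->args[i], arg) == 0)
			return 1;
	return 0;
}
```

With this shape, meta_getpid and meta_kill come from the same macro
body, which is the property the patch is after.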

Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
---
 include/linux/syscalls.h |   51 ++++++++++++++++-----------------------------
 1 files changed, 18 insertions(+), 33 deletions(-)

diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 8ec1153..ed0003c 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -76,6 +76,7 @@ struct file_handle;
 #include <linux/key.h>
 #include <trace/syscall.h>
 
+#define __SC_DECL0()
 #define __SC_DECL1(t1, a1)	t1 a1
 #define __SC_DECL2(t2, a2, ...) t2 a2, __SC_DECL1(__VA_ARGS__)
 #define __SC_DECL3(t3, a3, ...) t3 a3, __SC_DECL2(__VA_ARGS__)
@@ -83,6 +84,7 @@ struct file_handle;
 #define __SC_DECL5(t5, a5, ...) t5 a5, __SC_DECL4(__VA_ARGS__)
 #define __SC_DECL6(t6, a6, ...) t6 a6, __SC_DECL5(__VA_ARGS__)
 
+#define __SC_LONG0()
 #define __SC_LONG1(t1, a1) 	long a1
 #define __SC_LONG2(t2, a2, ...) long a2, __SC_LONG1(__VA_ARGS__)
 #define __SC_LONG3(t3, a3, ...) long a3, __SC_LONG2(__VA_ARGS__)
@@ -90,6 +92,7 @@ struct file_handle;
 #define __SC_LONG5(t5, a5, ...) long a5, __SC_LONG4(__VA_ARGS__)
 #define __SC_LONG6(t6, a6, ...) long a6, __SC_LONG5(__VA_ARGS__)
 
+#define __SC_CAST0()
 #define __SC_CAST1(t1, a1)	(t1) a1
 #define __SC_CAST2(t2, a2, ...) (t2) a2, __SC_CAST1(__VA_ARGS__)
 #define __SC_CAST3(t3, a3, ...) (t3) a3, __SC_CAST2(__VA_ARGS__)
@@ -98,6 +101,7 @@ struct file_handle;
 #define __SC_CAST6(t6, a6, ...) (t6) a6, __SC_CAST5(__VA_ARGS__)
 
 #define __SC_TEST(type)		BUILD_BUG_ON(sizeof(type) > sizeof(long))
+#define __SC_TEST0()
 #define __SC_TEST1(t1, a1)	__SC_TEST(t1)
 #define __SC_TEST2(t2, a2, ...)	__SC_TEST(t2); __SC_TEST1(__VA_ARGS__)
 #define __SC_TEST3(t3, a3, ...)	__SC_TEST(t3); __SC_TEST2(__VA_ARGS__)
@@ -106,6 +110,7 @@ struct file_handle;
 #define __SC_TEST6(t6, a6, ...)	__SC_TEST(t6); __SC_TEST5(__VA_ARGS__)
 
 #ifdef CONFIG_FTRACE_SYSCALLS
+#define __SC_STR_ADECL0()		(0)
 #define __SC_STR_ADECL1(t, a)		#a
 #define __SC_STR_ADECL2(t, a, ...)	#a, __SC_STR_ADECL1(__VA_ARGS__)
 #define __SC_STR_ADECL3(t, a, ...)	#a, __SC_STR_ADECL2(__VA_ARGS__)
@@ -113,6 +118,7 @@ struct file_handle;
 #define __SC_STR_ADECL5(t, a, ...)	#a, __SC_STR_ADECL4(__VA_ARGS__)
 #define __SC_STR_ADECL6(t, a, ...)	#a, __SC_STR_ADECL5(__VA_ARGS__)
 
+#define __SC_STR_TDECL0()		(0)
 #define __SC_STR_TDECL1(t, a)		#t
 #define __SC_STR_TDECL2(t, a, ...)	#t, __SC_STR_TDECL1(__VA_ARGS__)
 #define __SC_STR_TDECL3(t, a, ...)	#t, __SC_STR_TDECL2(__VA_ARGS__)
@@ -153,14 +159,20 @@ extern struct trace_event_functions exit_syscall_print_funcs;
 	  __attribute__((section("_ftrace_events")))			\
 	*__event_exit_##sname = &event_exit_##sname;
 
-#define SYSCALL_METADATA(sname, nb)				\
+#define SYSCALL_METADATAx(x, sname, ...)			\
+	static const char *types_##sname[] = {			\
+		__SC_STR_TDECL##x(__VA_ARGS__)			\
+	};							\
+	static const char *args_##sname[] = {			\
+		__SC_STR_ADECL##x(__VA_ARGS__)			\
+	};							\
 	SYSCALL_TRACE_ENTER_EVENT(sname);			\
 	SYSCALL_TRACE_EXIT_EVENT(sname);			\
 	static struct syscall_metadata __used			\
 	  __syscall_meta_##sname = {				\
 		.name 		= "sys"#sname,			\
 		.syscall_nr	= -1,	/* Filled in at boot */	\
-		.nb_args 	= nb,				\
+		.nb_args 	= x,				\
 		.types		= types_##sname,		\
 		.args		= args_##sname,			\
 		.enter_event	= &event_enter_##sname,		\
@@ -170,27 +182,11 @@ extern struct trace_event_functions exit_syscall_print_funcs;
 	static struct syscall_metadata __used			\
 	  __attribute__((section("__syscalls_metadata")))	\
 	 *__p_syscall_meta_##sname = &__syscall_meta_##sname;
-
-#define SYSCALL_DEFINE0(sname)					\
-	SYSCALL_TRACE_ENTER_EVENT(_##sname);			\
-	SYSCALL_TRACE_EXIT_EVENT(_##sname);			\
-	static struct syscall_metadata __used			\
-	  __syscall_meta__##sname = {				\
-		.name 		= "sys_"#sname,			\
-		.syscall_nr	= -1,	/* Filled in at boot */	\
-		.nb_args 	= 0,				\
-		.enter_event	= &event_enter__##sname,	\
-		.exit_event	= &event_exit__##sname,		\
-		.enter_fields	= LIST_HEAD_INIT(__syscall_meta__##sname.enter_fields), \
-	};							\
-	static struct syscall_metadata __used			\
-	  __attribute__((section("__syscalls_metadata")))	\
-	 *__p_syscall_meta_##sname = &__syscall_meta__##sname;	\
-	asmlinkage long sys_##sname(void)
 #else
-#define SYSCALL_DEFINE0(name)	   asmlinkage long sys_##name(void)
-#endif
+#define SYSCALL_METADATAx(x, name, ...)
+#endif /* CONFIG_FTRACE_SYSCALLS */
 
+#define SYSCALL_DEFINE0(name, ...) SYSCALL_DEFINEx(0, _##name, __VA_ARGS__)
 #define SYSCALL_DEFINE1(name, ...) SYSCALL_DEFINEx(1, _##name, __VA_ARGS__)
 #define SYSCALL_DEFINE2(name, ...) SYSCALL_DEFINEx(2, _##name, __VA_ARGS__)
 #define SYSCALL_DEFINE3(name, ...) SYSCALL_DEFINEx(3, _##name, __VA_ARGS__)
@@ -212,20 +208,9 @@ extern struct trace_event_functions exit_syscall_print_funcs;
 #endif
 #endif
 
-#ifdef CONFIG_FTRACE_SYSCALLS
-#define SYSCALL_DEFINEx(x, sname, ...)				\
-	static const char *types_##sname[] = {			\
-		__SC_STR_TDECL##x(__VA_ARGS__)			\
-	};							\
-	static const char *args_##sname[] = {			\
-		__SC_STR_ADECL##x(__VA_ARGS__)			\
-	};							\
-	SYSCALL_METADATA(sname, x);				\
-	__SYSCALL_DEFINEx(x, sname, __VA_ARGS__)
-#else
 #define SYSCALL_DEFINEx(x, sname, ...)				\
+	SYSCALL_METADATAx(x, sname, __VA_ARGS__)			\
 	__SYSCALL_DEFINEx(x, sname, __VA_ARGS__)
-#endif
 
 #ifdef CONFIG_HAVE_SYSCALL_WRAPPERS
 
-- 
1.7.7.3



* [PATCH 2/6] trace: add support for 32 bit compat syscalls on x86_64
  2012-03-26 18:39 [PATCH 0/6] Enhance and speed up syscall tracing Vaibhav Nagarnaik
  2012-03-26 18:39 ` [PATCH 1/6] trace: syscalls.h - cleanup and simplify SYSCALL_METADATA() Vaibhav Nagarnaik
@ 2012-03-26 18:39 ` Vaibhav Nagarnaik
  2012-03-27  4:49   ` H. Peter Anvin
  2012-03-26 18:39 ` [PATCH 3/6] trace: Refactor ftrace syscall macros to make them more readable Vaibhav Nagarnaik
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 29+ messages in thread
From: Vaibhav Nagarnaik @ 2012-03-26 18:39 UTC
  To: Steven Rostedt, Frederic Weisbecker, Thomas Gleixner, Ingo Molnar
  Cc: David Sharp, Justin Teravest, Laurent Chavey, x86, linux-kernel,
	Michael Davidson, Vaibhav Nagarnaik

From: Michael Davidson <md@google.com>

Add support for a set of events to trace 32 bit compat system calls
in addition to the native 64 bit system calls.

Events for compat system calls have event names of the form:
  syscalls:sys_enter_compat_<name>
  syscalls:sys_exit_compat_<name>

The ASCII-formatted trace events that can be read from
the tracing/trace file report compat system calls as:
  compat_<name>(...)

- add CONFIG_FTRACE_COMPAT_SYSCALLS

- add a "compat" flag to the syscall_metadata struct so that we can
  distinguish between "native" and "compat" metadata at init time
  when building the system call # to metadata mapping tables

- add a COMPAT_SYSCALL_METADATAx() macro to define system call
  metadata for compat system calls

- define a set of COMPAT_SYSCALL_METADATA[0-6] macros

- modify syscall_nr_to_meta() to know about compat system calls
  and return a pointer to the correct metadata

- modify print_syscall_{enter|exit}() to find the system call metadata
  by looking in the containing ftrace_event_call struct

- add system call metadata definitions for the x86_64 32 bit compat
  system calls
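
The role of the "compat" flag in the list above can be pictured with a
small userspace sketch. It is an assumption-laden model, not the patch's
code: the struct, table size, and function names are hypothetical, and
the entry names are purely illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct syscall_metadata. */
struct syscall_meta {
	const char *name;
	int nr;		/* syscall number within its own table */
	int compat;	/* non-zero for 32 bit compat metadata */
};

#define NR_SLOTS 8	/* arbitrary table size for the sketch */

static const struct syscall_meta *native_meta[NR_SLOTS];
static const struct syscall_meta *compat_meta[NR_SLOTS];

/* Init-time step: file each entry in the table its flag selects. */
static void register_meta(const struct syscall_meta *m)
{
	assert(m != NULL && m->nr >= 0 && m->nr < NR_SLOTS);
	if (m->compat)
		compat_meta[m->nr] = m;
	else
		native_meta[m->nr] = m;
}

/* Lookup that knows about compat calls; NULL if nothing is registered. */
static const struct syscall_meta *nr_to_meta(int nr, int compat)
{
	if (nr < 0 || nr >= NR_SLOTS)
		return NULL;
	return compat ? compat_meta[nr] : native_meta[nr];
}

/* read is number 0 natively on x86_64 and number 3 in the ia32 table. */
static const struct syscall_meta native_read = { "sys_read", 0, 0 };
static const struct syscall_meta compat_read = { "compat_read", 3, 1 };
```

After both entries are registered, nr_to_meta(0, 0) and nr_to_meta(3, 1)
resolve to different metadata even though they describe the same call.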

Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
---
 arch/x86/ia32/Makefile                |    2 +
 arch/x86/ia32/ia32_syscall_metadata.c |  443 +++++++++++++++++++++++++++++++++
 fs/dcookies.c                         |    2 +-
 fs/open.c                             |    2 +-
 fs/read_write.c                       |    8 +-
 fs/sync.c                             |    8 +-
 include/linux/syscalls.h              |   24 ++-
 include/trace/syscall.h               |   18 ++-
 kernel/trace/Kconfig                  |    6 +
 kernel/trace/trace_syscalls.c         |   24 ++-
 mm/fadvise.c                          |    5 +-
 mm/filemap.c                          |    2 +-
 12 files changed, 521 insertions(+), 23 deletions(-)
 create mode 100644 arch/x86/ia32/ia32_syscall_metadata.c

diff --git a/arch/x86/ia32/Makefile b/arch/x86/ia32/Makefile
index 455646e..ba6d3c8 100644
--- a/arch/x86/ia32/Makefile
+++ b/arch/x86/ia32/Makefile
@@ -12,3 +12,5 @@ obj-$(CONFIG_IA32_AOUT) += ia32_aout.o
 
 audit-class-$(CONFIG_AUDIT) := audit.o
 obj-$(CONFIG_IA32_EMULATION) += $(audit-class-y)
+
+obj-$(CONFIG_FTRACE_COMPAT_SYSCALLS) += ia32_syscall_metadata.o
diff --git a/arch/x86/ia32/ia32_syscall_metadata.c b/arch/x86/ia32/ia32_syscall_metadata.c
new file mode 100644
index 0000000..8001794
--- /dev/null
+++ b/arch/x86/ia32/ia32_syscall_metadata.c
@@ -0,0 +1,443 @@
+#include <linux/slab.h>
+#include <linux/syscalls.h>
+#include <linux/module.h>
+#include <asm/asm-offsets.h>
+
+/*
+ * syscall metadata for 32 bit compatible system calls
+ *
+ * The metadata entries are in the same order as the system call table
+ * but this is just to make it easier to check them for completeness
+ * and correctness.
+ */
+
+COMPAT_SYSCALL_METADATA0(restart_syscall)
+COMPAT_SYSCALL_METADATA1(exit, int, error_code)
+/* fork */
+COMPAT_SYSCALL_METADATA3(read, unsigned int, fd, char __user *, buf, size_t, count)
+COMPAT_SYSCALL_METADATA3(write, unsigned int, fd, const char __user *, buf, size_t, count)
+COMPAT_SYSCALL_METADATA3(open, const char __user *, filename, int, flags, int, mode)
+COMPAT_SYSCALL_METADATA1(close, unsigned int, fd)
+COMPAT_SYSCALL_METADATA3(waitpid, compat_pid_t, pid, int __user *, stat_addr, int, options)
+COMPAT_SYSCALL_METADATA2(creat, const char __user *, pathname, int, mode)
+COMPAT_SYSCALL_METADATA2(link, const char __user *, oldname, const char __user *, newname)
+COMPAT_SYSCALL_METADATA1(unlink, const char __user *, pathname)
+/* execve */
+COMPAT_SYSCALL_METADATA1(chdir, const char __user *, filename)
+COMPAT_SYSCALL_METADATA1(time, compat_time_t __user *, tloc)
+COMPAT_SYSCALL_METADATA3(mknod, const char __user *, filename, int, mode, unsigned, dev)
+COMPAT_SYSCALL_METADATA2(chmod, const char __user *, filename, mode_t, mode)
+COMPAT_SYSCALL_METADATA3(lchown16, const char __user *, filename, old_uid_t, user, old_gid_t, group)
+/* ni */
+COMPAT_SYSCALL_METADATA2(stat, char __user *, filename, struct __old_kernel_stat __user *, statbuf)
+COMPAT_SYSCALL_METADATA3(lseek, unsigned int, fd, int, offset, unsigned int, whence)
+COMPAT_SYSCALL_METADATA0(getpid)
+COMPAT_SYSCALL_METADATA5(mount, char __user *, dev_name, char __user *, dir_name, char __user *, type, unsigned long, flags, void __user *, data)
+COMPAT_SYSCALL_METADATA1(oldumount, char __user *, name)
+COMPAT_SYSCALL_METADATA1(setuid16, old_uid_t, uid)
+COMPAT_SYSCALL_METADATA0(getuid16)
+COMPAT_SYSCALL_METADATA1(stime, compat_time_t __user *, tptr)
+COMPAT_SYSCALL_METADATA4(ptrace, compat_long_t, request, compat_long_t, pid, compat_long_t, addr, compat_long_t, data)
+COMPAT_SYSCALL_METADATA1(alarm, unsigned int, seconds)
+COMPAT_SYSCALL_METADATA2(fstat, unsigned int, fd, struct __old_kernel_stat __user *, statbuf)
+COMPAT_SYSCALL_METADATA0(pause)
+COMPAT_SYSCALL_METADATA2(utime, char __user *, filename, struct compat_utimbuf __user *, times)
+/* ni */
+/* ni */
+COMPAT_SYSCALL_METADATA2(access, const char __user *, filename, int, mode)
+COMPAT_SYSCALL_METADATA1(nice, int, increment)
+/* ni */
+COMPAT_SYSCALL_METADATA0(sync)
+COMPAT_SYSCALL_METADATA2(kill, int, pid, int, sig)
+COMPAT_SYSCALL_METADATA2(rename, const char __user *, oldname, const char __user *, newname)
+COMPAT_SYSCALL_METADATA2(mkdir, const char __user *, pathname, int, mode)
+COMPAT_SYSCALL_METADATA1(rmdir, const char __user *, pathname)
+COMPAT_SYSCALL_METADATA1(dup, unsigned int, fildes)
+COMPAT_SYSCALL_METADATA1(pipe, int __user *, fildes)
+COMPAT_SYSCALL_METADATA1(times, struct compat_tms __user *, tbuf)
+/* ni */
+COMPAT_SYSCALL_METADATA1(brk, unsigned long, brk)
+COMPAT_SYSCALL_METADATA1(setgid16, old_gid_t, gid)
+COMPAT_SYSCALL_METADATA0(getgid16)
+COMPAT_SYSCALL_METADATA2(signal, int, sig, __sighandler_t, handler)
+COMPAT_SYSCALL_METADATA0(geteuid16)
+COMPAT_SYSCALL_METADATA0(getegid16)
+COMPAT_SYSCALL_METADATA1(acct, const char __user *, name)
+COMPAT_SYSCALL_METADATA2(umount, char __user *, name, int, flags)
+/* ni */
+COMPAT_SYSCALL_METADATA3(ioctl, unsigned int, fd, unsigned int, cmd, unsigned long, arg)
+/* ni */
+COMPAT_SYSCALL_METADATA2(setpgid, pid_t, pid, pid_t, pgid)
+/* ni */
+COMPAT_SYSCALL_METADATA1(olduname, struct oldold_utsname __user *, name)
+COMPAT_SYSCALL_METADATA1(umask, int, mask)
+COMPAT_SYSCALL_METADATA1(chroot, const char __user *, filename)
+COMPAT_SYSCALL_METADATA2(ustat, unsigned, dev, struct compat_ustat __user *, ubuf)
+COMPAT_SYSCALL_METADATA2(dup2, unsigned int, oldfd, unsigned int, newfd)
+COMPAT_SYSCALL_METADATA0(getppid)
+COMPAT_SYSCALL_METADATA0(getpgrp)
+COMPAT_SYSCALL_METADATA0(setsid)
+COMPAT_SYSCALL_METADATA3(sigaction, int, sig, struct old_sigaction32 __user *, act, struct old_sigaction32 __user *, oact)
+COMPAT_SYSCALL_METADATA0(sgetmask)
+COMPAT_SYSCALL_METADATA1(ssetmask, int, newmask)
+COMPAT_SYSCALL_METADATA2(setreuid16, old_uid_t, ruid, old_uid_t, euid)
+COMPAT_SYSCALL_METADATA2(setregid16, old_gid_t, rgid, old_gid_t, egid)
+COMPAT_SYSCALL_METADATA3(sigsuspend, int, history0, int, history1, old_sigset_t, mask)
+COMPAT_SYSCALL_METADATA1(sigpending, compat_old_sigset_t __user *, set)
+COMPAT_SYSCALL_METADATA2(sethostname, char __user *, name, int, len)
+COMPAT_SYSCALL_METADATA2(setrlimit, unsigned int, resource, struct compat_rlimit __user *, rlim)
+COMPAT_SYSCALL_METADATA2(old_getrlimit, unsigned int, resource, struct compat_rlimit __user *, rlim)
+COMPAT_SYSCALL_METADATA2(getrusage, int, who, struct compat_rusage __user *, ru)
+COMPAT_SYSCALL_METADATA2(gettimeofday, struct compat_timeval __user *, tv, struct timezone __user *, tz)
+COMPAT_SYSCALL_METADATA2(settimeofday, struct compat_timeval __user *, tv, struct timezone __user *, tz)
+COMPAT_SYSCALL_METADATA2(getgroups16, int, gidsetsize, old_gid_t __user *, grouplist)
+COMPAT_SYSCALL_METADATA2(setgroups16, int, gidsetsize, old_gid_t __user *, grouplist)
+COMPAT_SYSCALL_METADATA1(old_select, struct compat_sel_arg_struct __user *, arg)
+COMPAT_SYSCALL_METADATA2(symlink, const char __user *, oldname, const char __user *, newname)
+COMPAT_SYSCALL_METADATA2(lstat, char __user *, filename, struct __old_kernel_stat __user *, statbuf)
+COMPAT_SYSCALL_METADATA3(readlink, const char __user *, path, char __user *, buf, int, bufsiz)
+COMPAT_SYSCALL_METADATA1(uselib, const char __user *, library)
+COMPAT_SYSCALL_METADATA2(swapon, const char __user *, specialfile, int, swap_flags)
+COMPAT_SYSCALL_METADATA4(reboot, int, magic1, int, magic2, unsigned int, cmd, void __user *, arg)
+COMPAT_SYSCALL_METADATA3(old_readdir, unsigned int, fd, struct compat_old_linux_dirent __user *, dirent, unsigned int, count)
+COMPAT_SYSCALL_METADATA1(mmap, struct mmap_arg_struct32 __user *, arg)
+COMPAT_SYSCALL_METADATA2(munmap, unsigned long, addr, size_t, len)
+COMPAT_SYSCALL_METADATA2(truncate, const char __user *, path, long, length)
+COMPAT_SYSCALL_METADATA2(ftruncate, unsigned int, fd, unsigned long, length)
+COMPAT_SYSCALL_METADATA2(fchmod, unsigned int, fd, mode_t, mode)
+COMPAT_SYSCALL_METADATA3(fchown16, unsigned int, fd, old_uid_t, user, old_gid_t, group)
+COMPAT_SYSCALL_METADATA2(getpriority, int, which, int, who)
+COMPAT_SYSCALL_METADATA3(setpriority, int, which, int, who, int, niceval)
+/* ni */
+COMPAT_SYSCALL_METADATA2(statfs, const char __user *, pathname, struct compat_statfs __user *, buf)
+COMPAT_SYSCALL_METADATA2(fstatfs, unsigned int, fd, struct compat_statfs __user *, buf)
+COMPAT_SYSCALL_METADATA3(ioperm, unsigned long, from, unsigned long, num, int, turn_on)
+COMPAT_SYSCALL_METADATA2(socketcall, int, call, u32 __user *, args)
+COMPAT_SYSCALL_METADATA3(syslog, int, type, char __user *, buf, int, len)
+COMPAT_SYSCALL_METADATA3(setitimer, int, which, struct compat_itimerval __user *, in, struct compat_itimerval __user *, out)
+COMPAT_SYSCALL_METADATA2(getitimer, int, which, struct compat_itimerval __user *, it)
+COMPAT_SYSCALL_METADATA2(newstat, char __user *, filename, struct compat_stat __user *, statbuf)
+COMPAT_SYSCALL_METADATA2(newlstat, char __user *, filename, struct compat_stat __user *, statbuf)
+COMPAT_SYSCALL_METADATA2(newfstat, unsigned int, fd, struct compat_stat __user *, statbuf)
+COMPAT_SYSCALL_METADATA1(uname, struct old_utsname __user *, name)
+COMPAT_SYSCALL_METADATA1(iopl, unsigned int, level)
+COMPAT_SYSCALL_METADATA0(vhangup)
+/* ni */
+/* sys32_vm86_warning */
+COMPAT_SYSCALL_METADATA4(wait4, compat_pid_t, pid, compat_uint_t __user *, stat_addr, int, options, struct compat_rusage __user *, ru)
+COMPAT_SYSCALL_METADATA1(swapoff, const char __user *, specialfile)
+COMPAT_SYSCALL_METADATA1(sysinfo, struct compat_sysinfo __user *, info)
+COMPAT_SYSCALL_METADATA6(ipc, u32, call, int, first, int, second, int, third, compat_uptr_t, ptr, u32, fifth)
+COMPAT_SYSCALL_METADATA1(fsync, unsigned int, fd)
+/* stub32_sigreturn */
+/* stub32_clone */
+COMPAT_SYSCALL_METADATA2(setdomainname, char __user *, name, int, len)
+COMPAT_SYSCALL_METADATA1(newuname, struct new_utsname __user *, name)
+/* sys_modify_ldt */
+COMPAT_SYSCALL_METADATA1(adjtimex, struct compat_timex __user *, utp)
+COMPAT_SYSCALL_METADATA3(mprotect, unsigned long, start, size_t, len, unsigned long, prot)
+COMPAT_SYSCALL_METADATA3(sigprocmask, int, how, compat_old_sigset_t __user *, set, compat_old_sigset_t __user *, oset)
+/* ni */
+COMPAT_SYSCALL_METADATA3(init_module, void __user *, umod, unsigned long, len, const char __user *, uargs)
+COMPAT_SYSCALL_METADATA2(delete_module, const char __user *, name_user, unsigned int, flags)
+/* ni */
+COMPAT_SYSCALL_METADATA4(quotactl, unsigned int, cmd, const char __user *, special, qid_t, id, void __user *, addr)
+COMPAT_SYSCALL_METADATA1(getpgid, pid_t, pid)
+COMPAT_SYSCALL_METADATA1(fchdir, unsigned int, fd)
+/* ni */
+COMPAT_SYSCALL_METADATA3(sysfs, int, option, unsigned long, arg1, unsigned long, arg2)
+COMPAT_SYSCALL_METADATA1(personality, u_long, personality)
+/* ni */
+COMPAT_SYSCALL_METADATA1(setfsuid16, old_uid_t, uid)
+COMPAT_SYSCALL_METADATA1(setfsgid16, old_gid_t, gid)
+COMPAT_SYSCALL_METADATA5(llseek, unsigned int, fd, unsigned long, offset_high, unsigned long, offset_low, loff_t __user *, result, unsigned int, origin)
+COMPAT_SYSCALL_METADATA3(getdents, unsigned int, fd, struct compat_linux_dirent __user *, dirent, unsigned int, count)
+COMPAT_SYSCALL_METADATA5(select, int, n, compat_ulong_t __user *, inp, compat_ulong_t __user *, outp, compat_ulong_t __user *, exp, struct compat_timeval __user *, tvp)
+COMPAT_SYSCALL_METADATA2(flock, unsigned int, fd, unsigned int, cmd)
+COMPAT_SYSCALL_METADATA3(msync, unsigned long, start, size_t, len, int, flags)
+COMPAT_SYSCALL_METADATA3(readv, unsigned long, fd, const struct compat_iovec __user *, vec, unsigned long, vlen)
+COMPAT_SYSCALL_METADATA3(writev, unsigned long, fd, const struct compat_iovec __user *, vec, unsigned long, vlen)
+COMPAT_SYSCALL_METADATA1(getsid, pid_t, pid)
+COMPAT_SYSCALL_METADATA1(fdatasync, unsigned int, fd)
+COMPAT_SYSCALL_METADATA1(sysctl, struct compat_sysctl_args __user *, args)
+COMPAT_SYSCALL_METADATA2(mlock, unsigned long, start, size_t, len)
+COMPAT_SYSCALL_METADATA1(mlockall, int, flags)
+COMPAT_SYSCALL_METADATA2(munlock, unsigned long, start, size_t, len)
+COMPAT_SYSCALL_METADATA0(munlockall)
+COMPAT_SYSCALL_METADATA2(sched_setparam, pid_t, pid, struct sched_param __user *, param)
+COMPAT_SYSCALL_METADATA2(sched_getparam, pid_t, pid, struct sched_param __user *, param)
+COMPAT_SYSCALL_METADATA3(sched_setscheduler, pid_t, pid, int, policy, struct sched_param __user *, param)
+COMPAT_SYSCALL_METADATA1(sched_getscheduler, pid_t, pid)
+COMPAT_SYSCALL_METADATA0(sched_yield)
+COMPAT_SYSCALL_METADATA1(sched_get_priority_max, int, policy)
+COMPAT_SYSCALL_METADATA1(sched_get_priority_min, int, policy)
+COMPAT_SYSCALL_METADATA2(sched_rr_get_interval, compat_pid_t, pid, struct compat_timespec __user *, interval)
+COMPAT_SYSCALL_METADATA2(nanosleep, struct compat_timespec __user *, rqtp, struct compat_timespec __user *, rmtp)
+COMPAT_SYSCALL_METADATA5(mremap, unsigned long, addr, unsigned long, old_len, unsigned long, new_len, unsigned long, flags, unsigned long, new_addr)
+COMPAT_SYSCALL_METADATA3(setresuid16, old_uid_t, ruid, old_uid_t, euid, old_uid_t, suid)
+COMPAT_SYSCALL_METADATA3(getresuid16, old_uid_t __user *, ruid, old_uid_t __user *, euid, old_uid_t __user *, suid)
+/* sys32_vm86_warning */
+/* ni */
+COMPAT_SYSCALL_METADATA3(poll, struct pollfd __user *, ufds, unsigned int, nfds, long, timeout_msecs)
+COMPAT_SYSCALL_METADATA3(nfsservctl, int, cmd, struct compat_nfsctl_arg __user *, arg, struct compat_nfsctl_res  __user *, res)
+COMPAT_SYSCALL_METADATA3(setresgid16, old_gid_t, rgid, old_gid_t, egid, old_gid_t, sgid)
+COMPAT_SYSCALL_METADATA3(getresgid16, old_gid_t __user *, rgid, old_gid_t __user *, egid, old_gid_t __user *, sgid)
+COMPAT_SYSCALL_METADATA5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, unsigned long, arg4, unsigned long, arg5)
+/* stub32_rt_sigreturn */
+COMPAT_SYSCALL_METADATA4(rt_sigaction, int, sig, const struct sigaction32 __user *, act, struct sigaction32 __user *, oact, unsigned int , sigsetsize)
+COMPAT_SYSCALL_METADATA4(rt_sigprocmask, int, how, compat_sigset_t __user *, set, compat_sigset_t __user *, oset, unsigned int, sigsetsize)
+COMPAT_SYSCALL_METADATA2(rt_sigpending, compat_sigset_t __user *, set, unsigned int, sigsetsize)
+COMPAT_SYSCALL_METADATA4(rt_sigtimedwait, const compat_sigset_t __user *, uthese, compat_siginfo_t __user *, uinfo, const struct compat_timespec __user *, uts, compat_size_t, sigsetsize)
+COMPAT_SYSCALL_METADATA3(rt_sigqueueinfo, int, pid, int, sig, compat_siginfo_t __user *, uinfo)
+COMPAT_SYSCALL_METADATA2(rt_sigsuspend, sigset_t __user *, unewset, size_t, sigsetsize)
+COMPAT_SYSCALL_METADATA5(pread, unsigned int,  fd, char __user *, ubuf, u32, count, u32,  poslo, u32,  poshi)
+COMPAT_SYSCALL_METADATA5(pwrite, unsigned int,  fd, char __user *, ubuf, u32, count, u32,  poslo, u32,  poshi)
+COMPAT_SYSCALL_METADATA3(chown16, const char __user *, filename, old_uid_t, user, old_gid_t, group)
+COMPAT_SYSCALL_METADATA2(getcwd, char __user *, buf, unsigned long, size)
+COMPAT_SYSCALL_METADATA2(capget, cap_user_header_t, header, cap_user_data_t, dataptr)
+COMPAT_SYSCALL_METADATA2(capset, cap_user_header_t, header, const cap_user_data_t, data)
+COMPAT_SYSCALL_METADATA2(sigaltstack, const stack_ia32_t __user *, uss_ptr, stack_ia32_t __user *, uoss_ptr)
+COMPAT_SYSCALL_METADATA4(sendfile, int, out_fd, int, in_fd, compat_off_t __user *, offset, s32, count)
+/* ni */
+/* ni */
+/* stub32_vfork */
+COMPAT_SYSCALL_METADATA2(getrlimit, unsigned int, resource, struct rlimit __user *, rlim)
+COMPAT_SYSCALL_METADATA6(mmap_pgoff, unsigned long, addr, unsigned long, len, unsigned long, prot, unsigned long, flags, unsigned long, fd, unsigned long, pgoff)
+COMPAT_SYSCALL_METADATA3(truncate64, const char __user *,  path, unsigned long, offset_lo, unsigned long, offset_high)
+COMPAT_SYSCALL_METADATA3(ftruncate64, unsigned  int, fd, unsigned long, offset_lo, unsigned long, offset_high)
+COMPAT_SYSCALL_METADATA2(stat64, char __user *, filename, struct stat64 __user *, statbuf)
+COMPAT_SYSCALL_METADATA2(lstat64, char __user *, filename, struct stat64 __user *, statbuf)
+COMPAT_SYSCALL_METADATA2(fstat64, unsigned long, fd, struct stat64 __user *, statbuf)
+COMPAT_SYSCALL_METADATA3(lchown, const char __user *, filename, uid_t, user, gid_t, group)
+COMPAT_SYSCALL_METADATA0(getuid)
+COMPAT_SYSCALL_METADATA0(getgid)
+COMPAT_SYSCALL_METADATA0(geteuid)
+COMPAT_SYSCALL_METADATA0(getegid)
+COMPAT_SYSCALL_METADATA2(setreuid, uid_t, ruid, uid_t, euid)
+COMPAT_SYSCALL_METADATA2(setregid, gid_t, rgid, gid_t, egid)
+COMPAT_SYSCALL_METADATA2(getgroups, int, gidsetsize, gid_t __user *, grouplist)
+COMPAT_SYSCALL_METADATA2(setgroups, int, gidsetsize, gid_t __user *, grouplist)
+COMPAT_SYSCALL_METADATA3(fchown, unsigned int, fd, uid_t, user, gid_t, group)
+COMPAT_SYSCALL_METADATA3(setresuid, uid_t, ruid, uid_t, euid, uid_t, suid)
+COMPAT_SYSCALL_METADATA3(getresuid, uid_t __user *, ruid, uid_t __user *, euid, uid_t __user *, suid)
+COMPAT_SYSCALL_METADATA3(setresgid, gid_t, rgid, gid_t, egid, gid_t, sgid)
+COMPAT_SYSCALL_METADATA3(getresgid, gid_t __user *, rgid, gid_t __user *, egid, gid_t __user *, sgid)
+COMPAT_SYSCALL_METADATA3(chown, const char __user *, filename, uid_t, user, gid_t, group)
+COMPAT_SYSCALL_METADATA1(setuid, uid_t, uid)
+COMPAT_SYSCALL_METADATA1(setgid, gid_t, gid)
+COMPAT_SYSCALL_METADATA1(setfsuid, uid_t, uid)
+COMPAT_SYSCALL_METADATA1(setfsgid, gid_t, gid)
+COMPAT_SYSCALL_METADATA2(pivot_root, const char __user *, new_root, const char __user *, put_old)
+COMPAT_SYSCALL_METADATA3(mincore, unsigned long, start, size_t, len, unsigned char __user *, vec)
+COMPAT_SYSCALL_METADATA3(madvise, unsigned long, start, size_t, len_in, int, behavior)
+COMPAT_SYSCALL_METADATA3(getdents64, unsigned int, fd, struct linux_dirent64 __user *, dirent, unsigned int, count)
+COMPAT_SYSCALL_METADATA3(fcntl64, unsigned int, fd, unsigned int, cmd, unsigned long, arg)
+/* ni */
+/* ni */
+COMPAT_SYSCALL_METADATA0(gettid)
+COMPAT_SYSCALL_METADATA4(readahead, int, fd, unsigned, off_lo, unsigned, off_hi, size_t, count)
+COMPAT_SYSCALL_METADATA5(setxattr, const char __user *, pathname, const char __user *, name, const void __user *, value, size_t, size, int, flags)
+COMPAT_SYSCALL_METADATA5(lsetxattr, const char __user *, pathname, const char __user *, name, const void __user *, value, size_t, size, int, flags)
+COMPAT_SYSCALL_METADATA5(fsetxattr, int, fd, const char __user *, name, const void __user *,value, size_t, size, int, flags)
+COMPAT_SYSCALL_METADATA4(getxattr, const char __user *, pathname, const char __user *, name, void __user *, value, size_t, size)
+COMPAT_SYSCALL_METADATA4(fgetxattr, int, fd, const char __user *, name, void __user *, value, size_t, size)
+COMPAT_SYSCALL_METADATA4(lgetxattr, const char __user *, pathname, const char __user *, name, void __user *, value, size_t, size)
+COMPAT_SYSCALL_METADATA3(listxattr, const char __user *, pathname, char __user *, list, size_t, size)
+COMPAT_SYSCALL_METADATA3(llistxattr, const char __user *, pathname, char __user *, list, size_t, size)
+COMPAT_SYSCALL_METADATA3(flistxattr, int, fd, char __user *, list, size_t, size)
+COMPAT_SYSCALL_METADATA2(removexattr, const char __user *, pathname, const char __user *, name)
+COMPAT_SYSCALL_METADATA2(lremovexattr, const char __user *, pathname, const char __user *, name)
+COMPAT_SYSCALL_METADATA2(fremovexattr, int, fd, const char __user *, name)
+COMPAT_SYSCALL_METADATA2(tkill, pid_t, pid, int, sig)
+COMPAT_SYSCALL_METADATA4(sendfile64, int, out_fd, int, in_fd, loff_t __user *, offset, size_t, count)
+COMPAT_SYSCALL_METADATA6(futex, u32 __user *, uaddr, int, op, u32, val, struct compat_timespec __user *, utime, u32 __user *, uaddr2, u32, val3)
+COMPAT_SYSCALL_METADATA3(sched_setaffinity, compat_pid_t, pid, unsigned int, len, compat_ulong_t __user *, user_mask_ptr)
+COMPAT_SYSCALL_METADATA3(sched_getaffinity, compat_pid_t, pid, unsigned int, len, compat_ulong_t __user *, user_mask_ptr)
+COMPAT_SYSCALL_METADATA1(set_thread_area, struct user_desc __user *, u_info)
+COMPAT_SYSCALL_METADATA1(get_thread_area, struct user_desc __user *, u_info)
+COMPAT_SYSCALL_METADATA2(io_setup, unsigned, nr_reqs, u32 __user *, ctxp)
+COMPAT_SYSCALL_METADATA1(io_destroy, aio_context_t, ctx)
+COMPAT_SYSCALL_METADATA5(io_getevents, aio_context_t, ctx_id, long, min_nr, long, nr, struct io_event __user *, events, struct timespec __user *, timeout)
+COMPAT_SYSCALL_METADATA3(io_submit, aio_context_t, ctx_id, long, nr, struct iocb __user * __user *, iocbpp)
+COMPAT_SYSCALL_METADATA3(io_cancel, aio_context_t, ctx_id, struct iocb __user *, iocb, struct io_event __user *, result)
+COMPAT_SYSCALL_METADATA6(fadvise64, int, fd, __u32, offset_low, __u32, offset_high, __u32, len_low, __u32, len_high, int, advice)
+/* ni */
+COMPAT_SYSCALL_METADATA1(exit_group, int, error_code)
+COMPAT_SYSCALL_METADATA3(lookup_dcookie, u64,  cookie64, char __user *,  buf, size_t, len)
+COMPAT_SYSCALL_METADATA1(epoll_create, int, size)
+COMPAT_SYSCALL_METADATA4(epoll_ctl, int, epfd, int, op, int, fd, struct epoll_event __user *, event)
+COMPAT_SYSCALL_METADATA4(epoll_wait, int, epfd, struct epoll_event __user *, events, int, maxevents, int, timeout)
+COMPAT_SYSCALL_METADATA5(remap_file_pages, unsigned long, start, unsigned long, size, unsigned long, prot, unsigned long, pgoff, unsigned long, flags)
+COMPAT_SYSCALL_METADATA1(set_tid_address, int __user *, tidptr)
+COMPAT_SYSCALL_METADATA3(timer_create, const clockid_t, which_clock, struct sigevent __user *, timer_event_spec, timer_t __user *, created_timer_id)
+COMPAT_SYSCALL_METADATA4(timer_settime, timer_t, timer_id, int, flags, const struct itimerspec __user *, new_setting, struct itimerspec __user *, old_setting)
+COMPAT_SYSCALL_METADATA2(timer_gettime, timer_t, timer_id, struct itimerspec __user *, setting)
+COMPAT_SYSCALL_METADATA1(timer_getoverrun, timer_t, timer_id)
+COMPAT_SYSCALL_METADATA1(timer_delete, timer_t, timer_id)
+COMPAT_SYSCALL_METADATA2(clock_settime, const clockid_t, which_clock, const struct timespec __user *, tp)
+COMPAT_SYSCALL_METADATA2(clock_gettime, const clockid_t, which_clock, struct timespec __user *,tp)
+COMPAT_SYSCALL_METADATA2(clock_getres, const clockid_t, which_clock, struct timespec __user *, tp)
+COMPAT_SYSCALL_METADATA4(clock_nanosleep, const clockid_t, which_clock, int, flags, const struct timespec __user *, rqtp, struct timespec __user *, rmtp)
+COMPAT_SYSCALL_METADATA3(statfs64, const char __user *, pathname, size_t, sz, struct statfs64 __user *, buf)
+COMPAT_SYSCALL_METADATA3(fstatfs64, unsigned int, fd, size_t, sz, struct statfs64 __user *, buf)
+COMPAT_SYSCALL_METADATA3(tgkill, pid_t, tgid, pid_t, pid, int, sig)
+COMPAT_SYSCALL_METADATA2(utimes, char __user *, filename, struct timeval __user *, utimes)
+COMPAT_SYSCALL_METADATA4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice)
+/* ni */
+COMPAT_SYSCALL_METADATA6(mbind, unsigned long, start, unsigned long, len, unsigned long, mode, unsigned long __user *, nmask, unsigned long, maxnode, unsigned, flags)
+COMPAT_SYSCALL_METADATA5(get_mempolicy, int __user *, policy, unsigned long __user *, nmask, unsigned long, maxnode, unsigned long, addr, unsigned long, flags)
+COMPAT_SYSCALL_METADATA3(set_mempolicy, int, mode, unsigned long __user *, nmask, unsigned long, maxnode)
+COMPAT_SYSCALL_METADATA4(mq_open, const char __user *, u_name, int, oflag, mode_t, mode, struct mq_attr __user *, u_attr)
+COMPAT_SYSCALL_METADATA1(mq_unlink, const char __user *, u_name)
+COMPAT_SYSCALL_METADATA5(mq_timedsend, mqd_t, mqdes, const char __user *, u_msg_ptr, size_t, msg_len, unsigned int, msg_prio, const struct timespec __user *, u_abs_timeout)
+COMPAT_SYSCALL_METADATA5(mq_timedreceive, mqd_t, mqdes, char __user *, u_msg_ptr, size_t, msg_len, unsigned int __user *, u_msg_prio, const struct timespec __user *, u_abs_timeout)
+COMPAT_SYSCALL_METADATA2(mq_notify, mqd_t, mqdes, const struct sigevent __user *, u_notification)
+COMPAT_SYSCALL_METADATA3(mq_getsetattr, mqd_t, mqdes, const struct mq_attr __user *, u_mqstat, struct mq_attr __user *, u_omqstat)
+COMPAT_SYSCALL_METADATA4(kexec_load, unsigned long, entry, unsigned long, nr_segments, struct kexec_segment __user *, segments, unsigned long, flags)
+COMPAT_SYSCALL_METADATA5(waitid, int, which, pid_t, upid, struct siginfo __user *, infop, int, options, struct rusage __user *, ru)
+/* ni */
+COMPAT_SYSCALL_METADATA5(add_key, const char __user *, _type, const char __user *, _description, const void __user *, _payload, size_t, plen, key_serial_t, ringid)
+COMPAT_SYSCALL_METADATA4(request_key, const char __user *, _type, const char __user *, _description, const char __user *, _callout_info,  key_serial_t, destringid)
+COMPAT_SYSCALL_METADATA5(keyctl, int, option, unsigned long, arg2, unsigned long, arg3, unsigned long, arg4, unsigned long, arg5)
+COMPAT_SYSCALL_METADATA3(ioprio_set, int, which, int, who, int, ioprio)
+COMPAT_SYSCALL_METADATA2(ioprio_get, int, which, int, who)
+COMPAT_SYSCALL_METADATA0(inotify_init)
+COMPAT_SYSCALL_METADATA3(inotify_add_watch, int, fd, const char __user *, pathname, u32, mask)
+COMPAT_SYSCALL_METADATA2(inotify_rm_watch, int, fd, __s32, wd)
+COMPAT_SYSCALL_METADATA4(migrate_pages, pid_t, pid, unsigned long, maxnode, const unsigned long __user *, old_nodes, const unsigned long __user *, new_nodes)
+COMPAT_SYSCALL_METADATA4(openat, int, dfd, const char __user *, filename, int, flags, int, mode)
+COMPAT_SYSCALL_METADATA3(mkdirat, int, dfd, const char __user *, pathname, int, mode)
+COMPAT_SYSCALL_METADATA4(mknodat, int, dfd, const char __user *, filename, int, mode, unsigned, dev)
+COMPAT_SYSCALL_METADATA5(fchownat, int, dfd, const char __user *, filename, uid_t, user, gid_t, group, int, flag)
+COMPAT_SYSCALL_METADATA3(futimesat, int, dfd, char __user *, filename, struct timeval __user *, utimes)
+COMPAT_SYSCALL_METADATA4(fstatat, int, dfd, char __user *, filename, struct stat64 __user *, statbuf, int, flag)
+COMPAT_SYSCALL_METADATA3(unlinkat, int, dfd, const char __user *, pathname, int, flag)
+COMPAT_SYSCALL_METADATA4(renameat, int, olddfd, const char __user *, oldname, int, newdfd, const char __user *, newname)
+COMPAT_SYSCALL_METADATA5(linkat, int, olddfd, const char __user *, oldname, int, newdfd, const char __user *, newname, int, flags)
+COMPAT_SYSCALL_METADATA3(symlinkat, const char __user *, oldname, int, newdfd, const char __user *, newname)
+COMPAT_SYSCALL_METADATA4(readlinkat, int, dfd, const char __user *, pathname, char __user *, buf, int, bufsiz)
+COMPAT_SYSCALL_METADATA3(fchmodat, int, dfd, const char __user *, filename, mode_t, mode)
+COMPAT_SYSCALL_METADATA3(faccessat, int, dfd, const char __user *, filename, int, mode)
+COMPAT_SYSCALL_METADATA6(pselect6, int, n, fd_set __user *, inp, fd_set __user *, outp, fd_set __user *, exp, struct timespec __user *, tsp, void __user *, sig)
+COMPAT_SYSCALL_METADATA5(ppoll, struct pollfd __user *, ufds, unsigned int, nfds, struct timespec __user *, tsp, const sigset_t __user *, sigmask, size_t, sigsetsize)
+COMPAT_SYSCALL_METADATA1(unshare, unsigned long, unshare_flags)
+COMPAT_SYSCALL_METADATA2(set_robust_list, struct robust_list_head __user *, head, compat_size_t, len)
+COMPAT_SYSCALL_METADATA3(get_robust_list, int, pid, struct robust_list_head __user * __user *, head_ptr, compat_size_t __user *, len_ptr)
+COMPAT_SYSCALL_METADATA6(splice, int, fd_in, loff_t __user *, off_in, int, fd_out, loff_t __user *, off_out, size_t, len, unsigned int, flags)
+COMPAT_SYSCALL_METADATA4(sync_file_range, int, fd, loff_t, offset, loff_t, nbytes, unsigned int, flags)
+COMPAT_SYSCALL_METADATA4(tee, int, fdin, int, fdout, size_t, len, unsigned int, flags)
+COMPAT_SYSCALL_METADATA4(vmsplice, int, fd, const struct compat_iovec __user *, iov, unsigned long, nr_segs, unsigned int, flags)
+COMPAT_SYSCALL_METADATA6(move_pages, pid_t, pid, unsigned long, nr_pages, const void __user * __user *, pages, const int __user *, nodes, int __user *, status, int, flags)
+COMPAT_SYSCALL_METADATA3(getcpu, unsigned __user *, cpup, unsigned __user *, nodep, struct getcpu_cache __user *, unused)
+COMPAT_SYSCALL_METADATA6(epoll_pwait, int, epfd, struct epoll_event __user *, events, int, maxevents, int, timeout, const sigset_t __user *, sigmask, size_t, sigsetsize)
+COMPAT_SYSCALL_METADATA4(utimensat, int, dfd, char __user *, filename, struct timespec __user *, utimes, int, flags)
+COMPAT_SYSCALL_METADATA3(signalfd, int, ufd, sigset_t __user *, user_mask, size_t, sizemask)
+COMPAT_SYSCALL_METADATA2(timerfd_create, int, clockid, int, flags)
+COMPAT_SYSCALL_METADATA1(eventfd, unsigned int, count)
+COMPAT_SYSCALL_METADATA4(fallocate, int, fd, int, mode, loff_t, offset, loff_t, len)
+COMPAT_SYSCALL_METADATA4(timerfd_settime, int, ufd, int, flags, const struct itimerspec __user *, utmr, struct itimerspec __user *, otmr)
+COMPAT_SYSCALL_METADATA2(timerfd_gettime, int, ufd, struct itimerspec __user *, otmr)
+COMPAT_SYSCALL_METADATA4(signalfd4, int, ufd, sigset_t __user *, user_mask, size_t, sizemask, int, flags)
+COMPAT_SYSCALL_METADATA2(eventfd2, unsigned int, count, int, flags)
+COMPAT_SYSCALL_METADATA1(epoll_create1, int, flags)
+COMPAT_SYSCALL_METADATA3(dup3, unsigned int, oldfd, unsigned int, newfd, int, flags)
+COMPAT_SYSCALL_METADATA2(pipe2, int __user *, fildes, int, flags)
+COMPAT_SYSCALL_METADATA1(inotify_init1, int, flags)
+COMPAT_SYSCALL_METADATA5(preadv, unsigned long, fd, const struct compat_iovec __user *, vec, unsigned long, vlen, u32, pos_low, u32, pos_high)
+COMPAT_SYSCALL_METADATA5(pwritev, unsigned long, fd, const struct compat_iovec __user *, vec, unsigned long, vlen, u32, pos_low, u32, pos_high)
+COMPAT_SYSCALL_METADATA4(rt_tgsigqueueinfo, pid_t, tgid, pid_t, pid, int, sig, siginfo_t __user *, uinfo)
+COMPAT_SYSCALL_METADATA5(perf_event_open, struct perf_event_attr __user *, attr_uptr, pid_t, pid, int, cpu, int, group_fd, unsigned long, flags)
+COMPAT_SYSCALL_METADATA5(recvmmsg, int, fd, struct mmsghdr __user *, mmsg, unsigned int, vlen, unsigned int, flags, struct timespec __user *, timeout)
+
+
+extern long ia32_sys_call_table[];
+
+int nr_compat_syscalls;
+struct syscall_metadata **compat_syscalls_metadata;
+
+static const char *prefixes[] = { "sys32", "stub32", "compat_sys", "sys", NULL };
+
+/*
+ * This is truly horrible.
+ *
+ * There is no header file that defines a *complete* set of 32 bit system
+ * call numbers (unistd_32.h only defines ones that are currently exported
+ * to user space and omits lots of old system calls that are still implemented
+ * by the kernel).
+ *
+ * There is also no table that can be used to map a system call number into
+ * the canonical name of that system call.
+ *
+ * So we are forced to use an even more unpleasant version of the technique
+ * used by ftrace to handle the normal system call metadata ...
+ *
+ * For each entry in the 32 bit system call table:
+ *	Look up the address in the kernel symbol table
+ *	Strip off any "sys32|stub32|sys|compat" prefix
+ *	Search through all of the compat metadata entries for a matching name
+ *
+ */
+static struct syscall_metadata __init *find_compat_syscall_meta(unsigned long addr)
+{
+	struct syscall_metadata **start;
+	struct syscall_metadata **stop;
+	char str[KSYM_SYMBOL_LEN];
+	const char *name;
+	const char **p;
+	extern struct syscall_metadata *__start_syscalls_metadata[];
+	extern struct syscall_metadata *__stop_syscalls_metadata[];
+
+	start = __start_syscalls_metadata;
+	stop = __stop_syscalls_metadata;
+	kallsyms_lookup(addr, NULL, NULL, NULL, str);
+
+	/*
+	 * If there is a {sys|compat|sys32|stub32} prefix strip it off
+	 */
+	for (p = prefixes, name = str; *p; p++) {
+		int len = strlen(*p);
+		if (strncmp(name, *p, len) == 0) {
+			name += len;
+			break;
+		}
+	}
+
+	for ( ; start < stop; start++) {
+		if (!(*start)->compat)
+			continue;
+
+		/*
+		 * ignore the "compat_" prefix on the metadata name
+		 * when doing the comparison
+		 */
+		if ((*start)->name && !strcmp((*start)->name + 6, name))
+			return *start;
+	}
+	return NULL;
+}
+
+static int __init init_compat_syscall_metadata(void)
+{
+	struct syscall_metadata *meta;
+	int	i;
+
+	nr_compat_syscalls = __NR_ia32_syscall_max;
+	compat_syscalls_metadata = kzalloc(sizeof(*compat_syscalls_metadata) *
+					nr_compat_syscalls, GFP_KERNEL);
+	if (!compat_syscalls_metadata) {
+		nr_compat_syscalls = -1;
+		WARN_ON(1);
+		return -ENOMEM;
+	}
+
+	for (i = 0; i < nr_compat_syscalls; i++) {
+		if ((meta = find_compat_syscall_meta(ia32_sys_call_table[i]))) {
+			meta->syscall_nr = i;
+			compat_syscalls_metadata[i] = meta;
+		}
+	}
+
+	return 0;
+}
+
+core_initcall(init_compat_syscall_metadata);
diff --git a/fs/dcookies.c b/fs/dcookies.c
index dda0dc7..9a448c8 100644
--- a/fs/dcookies.c
+++ b/fs/dcookies.c
@@ -145,7 +145,7 @@ out:
 /* And here is where the userspace process can look up the cookie value
  * to retrieve the path.
  */
-SYSCALL_DEFINE(lookup_dcookie)(u64 cookie64, char __user * buf, size_t len)
+SYSCALL_DEFINE3(lookup_dcookie, u64, cookie64, char __user *, buf, size_t, len)
 {
 	unsigned long cookie = (unsigned long)cookie64;
 	int err = -EINVAL;
diff --git a/fs/open.c b/fs/open.c
index 77becc0..cc5edeb 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -269,7 +269,7 @@ int do_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 	return file->f_op->fallocate(file, mode, offset, len);
 }
 
-SYSCALL_DEFINE(fallocate)(int fd, int mode, loff_t offset, loff_t len)
+SYSCALL_DEFINE4(fallocate, int, fd, int, mode, loff_t, offset, loff_t, len)
 {
 	struct file *file;
 	int error = -EBADF;
diff --git a/fs/read_write.c b/fs/read_write.c
index 5ad4248..72b675b 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -492,8 +492,8 @@ SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
 	return ret;
 }
 
-SYSCALL_DEFINE(pread64)(unsigned int fd, char __user *buf,
-			size_t count, loff_t pos)
+SYSCALL_DEFINE4(pread64, unsigned int, fd, char __user *, buf,
+			size_t, count, loff_t, pos)
 {
 	struct file *file;
 	ssize_t ret = -EBADF;
@@ -521,8 +521,8 @@ asmlinkage long SyS_pread64(long fd, long buf, long count, loff_t pos)
 SYSCALL_ALIAS(sys_pread64, SyS_pread64);
 #endif
 
-SYSCALL_DEFINE(pwrite64)(unsigned int fd, const char __user *buf,
-			 size_t count, loff_t pos)
+SYSCALL_DEFINE4(pwrite64, unsigned int, fd, const char __user *, buf,
+			 size_t, count, loff_t, pos)
 {
 	struct file *file;
 	ssize_t ret = -EBADF;
diff --git a/fs/sync.c b/fs/sync.c
index f3501ef..584cbd0 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -271,8 +271,8 @@ EXPORT_SYMBOL(generic_write_sync);
  * already-instantiated disk blocks, there are no guarantees here that the data
  * will be available after a crash.
  */
-SYSCALL_DEFINE(sync_file_range)(int fd, loff_t offset, loff_t nbytes,
-				unsigned int flags)
+SYSCALL_DEFINE4(sync_file_range, int, fd, loff_t, offset, loff_t, nbytes,
+				unsigned int, flags)
 {
 	int ret;
 	struct file *file;
@@ -366,8 +366,8 @@ SYSCALL_ALIAS(sys_sync_file_range, SyS_sync_file_range);
 
 /* It would be nice if people remember that not all the world's an i386
    when they introduce new system calls */
-SYSCALL_DEFINE(sync_file_range2)(int fd, unsigned int flags,
-				 loff_t offset, loff_t nbytes)
+SYSCALL_DEFINE4(sync_file_range2, int, fd, unsigned int, flags,
+				 loff_t, offset, loff_t, nbytes)
 {
 	return sys_sync_file_range(fd, offset, nbytes, flags);
 }
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index ed0003c..a5b0cae 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -159,7 +159,7 @@ extern struct trace_event_functions exit_syscall_print_funcs;
 	  __attribute__((section("_ftrace_events")))			\
 	*__event_exit_##sname = &event_exit_##sname;
 
-#define SYSCALL_METADATAx(x, sname, ...)			\
+#define _SYSCALL_METADATAx(x, mname, sname, _compat, ...)	\
 	static const char *types_##sname[] = {			\
 		__SC_STR_TDECL##x(__VA_ARGS__)			\
 	};							\
@@ -170,11 +170,12 @@ extern struct trace_event_functions exit_syscall_print_funcs;
 	SYSCALL_TRACE_EXIT_EVENT(sname);			\
 	static struct syscall_metadata __used			\
 	  __syscall_meta_##sname = {				\
-		.name 		= "sys"#sname,			\
+		.name 		= mname,			\
 		.syscall_nr	= -1,	/* Filled in at boot */	\
 		.nb_args 	= x,				\
 		.types		= types_##sname,		\
 		.args		= args_##sname,			\
+		.compat		= _compat,			\
 		.enter_event	= &event_enter_##sname,		\
 		.exit_event	= &event_exit_##sname,		\
 		.enter_fields	= LIST_HEAD_INIT(__syscall_meta_##sname.enter_fields), \
@@ -182,6 +183,25 @@ extern struct trace_event_functions exit_syscall_print_funcs;
 	static struct syscall_metadata __used			\
 	  __attribute__((section("__syscalls_metadata")))	\
 	 *__p_syscall_meta_##sname = &__syscall_meta_##sname;
+
+#define SYSCALL_METADATAx(x, sname, ...)		\
+	_SYSCALL_METADATAx(x, "sys"#sname, sname, 0, __VA_ARGS__)
+
+#ifdef CONFIG_FTRACE_COMPAT_SYSCALLS
+
+#define COMPAT_SYSCALL_METADATAx(x, sname, ...)			\
+	_SYSCALL_METADATAx(x, "compat"#sname, _compat##sname, 1, __VA_ARGS__)
+
+#define COMPAT_SYSCALL_METADATA0(name, ...) COMPAT_SYSCALL_METADATAx(0, _##name, __VA_ARGS__)
+#define COMPAT_SYSCALL_METADATA1(name, ...) COMPAT_SYSCALL_METADATAx(1, _##name, __VA_ARGS__)
+#define COMPAT_SYSCALL_METADATA2(name, ...) COMPAT_SYSCALL_METADATAx(2, _##name, __VA_ARGS__)
+#define COMPAT_SYSCALL_METADATA3(name, ...) COMPAT_SYSCALL_METADATAx(3, _##name, __VA_ARGS__)
+#define COMPAT_SYSCALL_METADATA4(name, ...) COMPAT_SYSCALL_METADATAx(4, _##name, __VA_ARGS__)
+#define COMPAT_SYSCALL_METADATA5(name, ...) COMPAT_SYSCALL_METADATAx(5, _##name, __VA_ARGS__)
+#define COMPAT_SYSCALL_METADATA6(name, ...) COMPAT_SYSCALL_METADATAx(6, _##name, __VA_ARGS__)
+
+#endif /* CONFIG_FTRACE_COMPAT_SYSCALLS */
+
 #else
 #define SYSCALL_METADATAx(x, name, ...)
 #endif /* CONFIG_FTRACE_SYSCALLS */
diff --git a/include/trace/syscall.h b/include/trace/syscall.h
index 31966a4..f495e55 100644
--- a/include/trace/syscall.h
+++ b/include/trace/syscall.h
@@ -21,8 +21,9 @@
  */
 struct syscall_metadata {
 	const char	*name;
-	int		syscall_nr;
-	int		nb_args;
+	u16		syscall_nr;
+	u8		nb_args;
+	u8		compat;
 	const char	**types;
 	const char	**args;
 	struct list_head enter_fields;
@@ -32,6 +33,19 @@ struct syscall_metadata {
 };
 
 #ifdef CONFIG_FTRACE_SYSCALLS
+
+#ifdef CONFIG_FTRACE_COMPAT_SYSCALLS
+
+extern int nr_compat_syscalls;
+extern struct syscall_metadata **compat_syscalls_metadata;
+
+static inline struct syscall_metadata *compat_syscall_nr_to_meta(int nr)
+{
+	return (nr < nr_compat_syscalls) ? compat_syscalls_metadata[nr]
+						  : NULL;
+}
+#endif /* CONFIG_FTRACE_COMPAT_SYSCALLS */
+
 extern unsigned long arch_syscall_addr(int nr);
 extern int init_syscall_trace(struct ftrace_event_call *call);
 
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index cd31345..cd0954b 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -240,6 +240,12 @@ config FTRACE_SYSCALLS
 	help
 	  Basic tracer to catch the syscall entry and exit events.
 
+config FTRACE_COMPAT_SYSCALLS
+	bool "Trace 32 bit compat syscalls"
+	depends on FTRACE_SYSCALLS
+	help
+	  Trace syscall entry and exit events for 32 bit compat syscalls.
+
 config TRACE_BRANCH_PROFILING
 	bool
 	select GENERIC_TRACER
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index cb65454..43a8685 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -5,6 +5,7 @@
 #include <linux/module.h>	/* for MODULE_NAME_LEN via KSYM_SYMBOL_LEN */
 #include <linux/ftrace.h>
 #include <linux/perf_event.h>
+#include <linux/compat.h>
 #include <asm/syscall.h>
 
 #include "trace_output.h"
@@ -90,6 +91,10 @@ find_syscall_meta(unsigned long syscall)
 		return NULL;
 
 	for ( ; start < stop; start++) {
+#ifdef CONFIG_FTRACE_COMPAT_SYSCALLS
+		if ((*start)->compat)	/* skip compat syscalls */
+			continue;
+#endif
 		if ((*start)->name && arch_syscall_match_sym_name(str, (*start)->name))
 			return *start;
 	}
@@ -98,6 +103,10 @@ find_syscall_meta(unsigned long syscall)
 
 static struct syscall_metadata *syscall_nr_to_meta(int nr)
 {
+#ifdef CONFIG_FTRACE_COMPAT_SYSCALLS
+	if (is_compat_task())
+		return compat_syscall_nr_to_meta(nr);
+#endif
 	if (!syscalls_metadata || nr >= NR_syscalls || nr < 0)
 		return NULL;
 
@@ -112,11 +121,13 @@ print_syscall_enter(struct trace_iterator *iter, int flags,
 	struct trace_entry *ent = iter->ent;
 	struct syscall_trace_enter *trace;
 	struct syscall_metadata *entry;
-	int i, ret, syscall;
+	int i, ret;
+	struct ftrace_event_call *call;
 
 	trace = (typeof(trace))ent;
-	syscall = trace->nr;
-	entry = syscall_nr_to_meta(syscall);
+	event = ftrace_find_event(ent->type);
+	call = container_of(event, struct ftrace_event_call, event);
+	entry = call->data;
 
 	if (!entry)
 		goto end;
@@ -164,13 +175,14 @@ print_syscall_exit(struct trace_iterator *iter, int flags,
 	struct trace_seq *s = &iter->seq;
 	struct trace_entry *ent = iter->ent;
 	struct syscall_trace_exit *trace;
-	int syscall;
 	struct syscall_metadata *entry;
 	int ret;
+	struct ftrace_event_call *call;
 
 	trace = (typeof(trace))ent;
-	syscall = trace->nr;
-	entry = syscall_nr_to_meta(syscall);
+	event = ftrace_find_event(ent->type);
+	call = container_of(event, struct ftrace_event_call, event);
+	entry = call->data;
 
 	if (!entry) {
 		trace_seq_printf(s, "\n");
diff --git a/mm/fadvise.c b/mm/fadvise.c
index 469491e..60d83eb 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -24,7 +24,8 @@
  * POSIX_FADV_WILLNEED could set PG_Referenced, and POSIX_FADV_NOREUSE could
  * deactivate the pages and clear PG_Referenced.
  */
-SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
+SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset,
+			loff_t, len, int, advice)
 {
 	struct file *file = fget(fd);
 	struct address_space *mapping;
@@ -145,7 +146,7 @@ SYSCALL_ALIAS(sys_fadvise64_64, SyS_fadvise64_64);
 
 #ifdef __ARCH_WANT_SYS_FADVISE64
 
-SYSCALL_DEFINE(fadvise64)(int fd, loff_t offset, size_t len, int advice)
+SYSCALL_DEFINE4(fadvise64, int, fd, loff_t, offset, size_t, len, int, advice)
 {
 	return sys_fadvise64_64(fd, offset, len, advice);
 }
diff --git a/mm/filemap.c b/mm/filemap.c
index cb8ee3e..6e06d27 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1499,7 +1499,7 @@ do_readahead(struct address_space *mapping, struct file *filp,
 	return 0;
 }
 
-SYSCALL_DEFINE(readahead)(int fd, loff_t offset, size_t count)
+SYSCALL_DEFINE3(readahead, int, fd, loff_t, offset, size_t, count)
 {
 	ssize_t ret;
 	struct file *file;
-- 
1.7.7.3



* [PATCH 3/6] trace: Refactor ftrace syscall macros to make them more readable
  2012-03-26 18:39 [PATCH 0/6] Enhance and speed up syscall tracing Vaibhav Nagarnaik
  2012-03-26 18:39 ` [PATCH 1/6] trace: syscalls.h - cleanup and simplify SYSCALL_METADATA() Vaibhav Nagarnaik
  2012-03-26 18:39 ` [PATCH 2/6] trace: add support for 32 bit compat syscalls on x86_64 Vaibhav Nagarnaik
@ 2012-03-26 18:39 ` Vaibhav Nagarnaik
  2012-03-26 18:39 ` [PATCH 4/6] trace: trace syscall in its handler not from ptrace handler Vaibhav Nagarnaik
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 29+ messages in thread
From: Vaibhav Nagarnaik @ 2012-03-26 18:39 UTC (permalink / raw)
  To: Steven Rostedt, Frederic Weisbecker, Thomas Gleixner, Ingo Molnar
  Cc: David Sharp, Justin Teravest, Laurent Chavey, x86, linux-kernel,
	Vaibhav Nagarnaik

The macros that define the functions and structures for syscall
tracepoints add underscores at different points in the expansion. This
makes it hard to see where an underscore is added and what the final
name of each element will be.

This patch moves all of the added underscores to the point where the
variables and functions are declared. The macros become more readable,
and the final name of each element is clear from its declaration.

Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
---
 include/linux/syscalls.h |   84 +++++++++++++++++++++++-----------------------
 1 files changed, 42 insertions(+), 42 deletions(-)

diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index a5b0cae..fd4d37d 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -132,73 +132,73 @@ extern struct trace_event_functions enter_syscall_print_funcs;
 extern struct trace_event_functions exit_syscall_print_funcs;
 
 #define SYSCALL_TRACE_ENTER_EVENT(sname)				\
-	static struct syscall_metadata __syscall_meta_##sname;		\
+	static struct syscall_metadata __syscall_meta__##sname;		\
 	static struct ftrace_event_call __used				\
-	  event_enter_##sname = {					\
-		.name                   = "sys_enter"#sname,		\
+	  event_enter__##sname = {					\
+		.name                   = "sys_enter_"#sname,		\
 		.class			= &event_class_syscall_enter,	\
 		.event.funcs            = &enter_syscall_print_funcs,	\
-		.data			= (void *)&__syscall_meta_##sname,\
+		.data			= (void *)&__syscall_meta__##sname,\
 		.flags			= TRACE_EVENT_FL_CAP_ANY,	\
 	};								\
 	static struct ftrace_event_call __used				\
 	  __attribute__((section("_ftrace_events")))			\
-	 *__event_enter_##sname = &event_enter_##sname;
+	 *__event_enter__##sname = &event_enter__##sname;
 
 #define SYSCALL_TRACE_EXIT_EVENT(sname)					\
-	static struct syscall_metadata __syscall_meta_##sname;		\
+	static struct syscall_metadata __syscall_meta__##sname;		\
 	static struct ftrace_event_call __used				\
-	  event_exit_##sname = {					\
-		.name                   = "sys_exit"#sname,		\
+	  event_exit__##sname = {					\
+		.name                   = "sys_exit_"#sname,		\
 		.class			= &event_class_syscall_exit,	\
 		.event.funcs		= &exit_syscall_print_funcs,	\
-		.data			= (void *)&__syscall_meta_##sname,\
+		.data			= (void *)&__syscall_meta__##sname,\
 		.flags			= TRACE_EVENT_FL_CAP_ANY,	\
 	};								\
 	static struct ftrace_event_call __used				\
 	  __attribute__((section("_ftrace_events")))			\
-	*__event_exit_##sname = &event_exit_##sname;
+	*__event_exit__##sname = &event_exit__##sname;
 
 #define _SYSCALL_METADATAx(x, mname, sname, _compat, ...)	\
-	static const char *types_##sname[] = {			\
+	static const char *types__##sname[] = {			\
 		__SC_STR_TDECL##x(__VA_ARGS__)			\
 	};							\
-	static const char *args_##sname[] = {			\
+	static const char *args__##sname[] = {			\
 		__SC_STR_ADECL##x(__VA_ARGS__)			\
 	};							\
 	SYSCALL_TRACE_ENTER_EVENT(sname);			\
 	SYSCALL_TRACE_EXIT_EVENT(sname);			\
 	static struct syscall_metadata __used			\
-	  __syscall_meta_##sname = {				\
+	  __syscall_meta__##sname = {				\
 		.name 		= mname,			\
 		.syscall_nr	= -1,	/* Filled in at boot */	\
 		.nb_args 	= x,				\
-		.types		= types_##sname,		\
-		.args		= args_##sname,			\
+		.types		= types__##sname,		\
+		.args		= args__##sname,		\
 		.compat		= _compat,			\
-		.enter_event	= &event_enter_##sname,		\
-		.exit_event	= &event_exit_##sname,		\
-		.enter_fields	= LIST_HEAD_INIT(__syscall_meta_##sname.enter_fields), \
+		.enter_event	= &event_enter__##sname,	\
+		.exit_event	= &event_exit__##sname,		\
+		.enter_fields	= LIST_HEAD_INIT(__syscall_meta__##sname.enter_fields), \
 	};							\
 	static struct syscall_metadata __used			\
 	  __attribute__((section("__syscalls_metadata")))	\
-	 *__p_syscall_meta_##sname = &__syscall_meta_##sname;
+	 *__p_syscall_meta__##sname = &__syscall_meta__##sname;
 
 #define SYSCALL_METADATAx(x, sname, ...)		\
-	_SYSCALL_METADATAx(x, "sys"#sname, sname, 0, __VA_ARGS__)
+	_SYSCALL_METADATAx(x, "sys_"#sname, sname, 0, __VA_ARGS__)
 
 #ifdef CONFIG_FTRACE_COMPAT_SYSCALLS
 
 #define COMPAT_SYSCALL_METADATAx(x, sname, ...)			\
-	_SYSCALL_METADATAx(x, "compat"#sname, _compat##sname, 1, __VA_ARGS__)
+	_SYSCALL_METADATAx(x, "compat_"#sname, compat_##sname, 1, __VA_ARGS__)
 
-#define COMPAT_SYSCALL_METADATA0(name, ...) COMPAT_SYSCALL_METADATAx(0, _##name, __VA_ARGS__)
-#define COMPAT_SYSCALL_METADATA1(name, ...) COMPAT_SYSCALL_METADATAx(1, _##name, __VA_ARGS__)
-#define COMPAT_SYSCALL_METADATA2(name, ...) COMPAT_SYSCALL_METADATAx(2, _##name, __VA_ARGS__)
-#define COMPAT_SYSCALL_METADATA3(name, ...) COMPAT_SYSCALL_METADATAx(3, _##name, __VA_ARGS__)
-#define COMPAT_SYSCALL_METADATA4(name, ...) COMPAT_SYSCALL_METADATAx(4, _##name, __VA_ARGS__)
-#define COMPAT_SYSCALL_METADATA5(name, ...) COMPAT_SYSCALL_METADATAx(5, _##name, __VA_ARGS__)
-#define COMPAT_SYSCALL_METADATA6(name, ...) COMPAT_SYSCALL_METADATAx(6, _##name, __VA_ARGS__)
+#define COMPAT_SYSCALL_METADATA0(name, ...) COMPAT_SYSCALL_METADATAx(0, name, __VA_ARGS__)
+#define COMPAT_SYSCALL_METADATA1(name, ...) COMPAT_SYSCALL_METADATAx(1, name, __VA_ARGS__)
+#define COMPAT_SYSCALL_METADATA2(name, ...) COMPAT_SYSCALL_METADATAx(2, name, __VA_ARGS__)
+#define COMPAT_SYSCALL_METADATA3(name, ...) COMPAT_SYSCALL_METADATAx(3, name, __VA_ARGS__)
+#define COMPAT_SYSCALL_METADATA4(name, ...) COMPAT_SYSCALL_METADATAx(4, name, __VA_ARGS__)
+#define COMPAT_SYSCALL_METADATA5(name, ...) COMPAT_SYSCALL_METADATAx(5, name, __VA_ARGS__)
+#define COMPAT_SYSCALL_METADATA6(name, ...) COMPAT_SYSCALL_METADATAx(6, name, __VA_ARGS__)
 
 #endif /* CONFIG_FTRACE_COMPAT_SYSCALLS */
 
@@ -206,13 +206,13 @@ extern struct trace_event_functions exit_syscall_print_funcs;
 #define SYSCALL_METADATAx(x, name, ...)
 #endif /* CONFIG_FTRACE_SYSCALLS */
 
-#define SYSCALL_DEFINE0(name, ...) SYSCALL_DEFINEx(0, _##name, __VA_ARGS__)
-#define SYSCALL_DEFINE1(name, ...) SYSCALL_DEFINEx(1, _##name, __VA_ARGS__)
-#define SYSCALL_DEFINE2(name, ...) SYSCALL_DEFINEx(2, _##name, __VA_ARGS__)
-#define SYSCALL_DEFINE3(name, ...) SYSCALL_DEFINEx(3, _##name, __VA_ARGS__)
-#define SYSCALL_DEFINE4(name, ...) SYSCALL_DEFINEx(4, _##name, __VA_ARGS__)
-#define SYSCALL_DEFINE5(name, ...) SYSCALL_DEFINEx(5, _##name, __VA_ARGS__)
-#define SYSCALL_DEFINE6(name, ...) SYSCALL_DEFINEx(6, _##name, __VA_ARGS__)
+#define SYSCALL_DEFINE0(name, ...) SYSCALL_DEFINEx(0, name, __VA_ARGS__)
+#define SYSCALL_DEFINE1(name, ...) SYSCALL_DEFINEx(1, name, __VA_ARGS__)
+#define SYSCALL_DEFINE2(name, ...) SYSCALL_DEFINEx(2, name, __VA_ARGS__)
+#define SYSCALL_DEFINE3(name, ...) SYSCALL_DEFINEx(3, name, __VA_ARGS__)
+#define SYSCALL_DEFINE4(name, ...) SYSCALL_DEFINEx(4, name, __VA_ARGS__)
+#define SYSCALL_DEFINE5(name, ...) SYSCALL_DEFINEx(5, name, __VA_ARGS__)
+#define SYSCALL_DEFINE6(name, ...) SYSCALL_DEFINEx(6, name, __VA_ARGS__)
 
 #ifdef CONFIG_PPC64
 #define SYSCALL_ALIAS(alias, name)					\
@@ -237,21 +237,21 @@ extern struct trace_event_functions exit_syscall_print_funcs;
 #define SYSCALL_DEFINE(name) static inline long SYSC_##name
 
 #define __SYSCALL_DEFINEx(x, name, ...)					\
-	asmlinkage long sys##name(__SC_DECL##x(__VA_ARGS__));		\
-	static inline long SYSC##name(__SC_DECL##x(__VA_ARGS__));	\
-	asmlinkage long SyS##name(__SC_LONG##x(__VA_ARGS__))		\
+	asmlinkage long sys_##name(__SC_DECL##x(__VA_ARGS__));		\
+	static inline long SYSC_##name(__SC_DECL##x(__VA_ARGS__));	\
+	asmlinkage long SyS_##name(__SC_LONG##x(__VA_ARGS__))		\
 	{								\
 		__SC_TEST##x(__VA_ARGS__);				\
-		return (long) SYSC##name(__SC_CAST##x(__VA_ARGS__));	\
+		return (long) SYSC_##name(__SC_CAST##x(__VA_ARGS__));	\
 	}								\
-	SYSCALL_ALIAS(sys##name, SyS##name);				\
-	static inline long SYSC##name(__SC_DECL##x(__VA_ARGS__))
+	SYSCALL_ALIAS(sys_##name, SyS_##name);				\
+	static inline long SYSC_##name(__SC_DECL##x(__VA_ARGS__))
 
 #else /* CONFIG_HAVE_SYSCALL_WRAPPERS */
 
 #define SYSCALL_DEFINE(name) asmlinkage long sys_##name
 #define __SYSCALL_DEFINEx(x, name, ...)					\
-	asmlinkage long sys##name(__SC_DECL##x(__VA_ARGS__))
+	asmlinkage long sys_##name(__SC_DECL##x(__VA_ARGS__))
 
 #endif /* CONFIG_HAVE_SYSCALL_WRAPPERS */
 
-- 
1.7.7.3



* [PATCH 4/6] trace: trace syscall in its handler not from ptrace handler
  2012-03-26 18:39 [PATCH 0/6] Enhance and speed up syscall tracing Vaibhav Nagarnaik
                   ` (2 preceding siblings ...)
  2012-03-26 18:39 ` [PATCH 3/6] trace: Refactor ftrace syscall macros to make them more readable Vaibhav Nagarnaik
@ 2012-03-26 18:39 ` Vaibhav Nagarnaik
  2012-03-27  5:00   ` H. Peter Anvin
  2012-03-26 18:39 ` [PATCH 5/6] trace: raw_syscalls: Mark compat syscalls in the MSB of the syscall number Vaibhav Nagarnaik
  2012-03-26 18:39 ` [PATCH 6/6] trace: get rid of the enabled_*_syscalls bitmaps Vaibhav Nagarnaik
  5 siblings, 1 reply; 29+ messages in thread
From: Vaibhav Nagarnaik @ 2012-03-26 18:39 UTC (permalink / raw)
  To: Steven Rostedt, Frederic Weisbecker, Thomas Gleixner, Ingo Molnar
  Cc: David Sharp, Justin Teravest, Laurent Chavey, x86, linux-kernel,
	Vaibhav Nagarnaik

Syscalls are tricky to trace because there are so many of them and the
list is dynamic. To handle this, a macro wraps each syscall handler
definition and expands to set up the metadata for the syscall event. A
handler hooked into the ptrace syscall tracer then checked whether an
invoked syscall was supposed to be traced.

This added latency to every invoked syscall, since each one had to be
checked for tracing, and it also increased the latency of the syscalls
that were actually being traced. For example, a simple program that
invokes getuid() in a loop and calculates the average time per
invocation showed a latency of 570 - 117 = 453 ns added to every traced
syscall.

This patch changes the syscall macro expansion to create a function
that calls the entry and exit tracepoints for the given syscall, so that
this latency can be avoided. This was suggested by Mathieu Desnoyers in
https://lkml.org/lkml/2010/10/13/337
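
The shape of the new expansion can be sketched in user space as follows.
Every name here (DEMO_SYSCALL_DEFINE1, demo_trace_enter(),
demo_trace_exit(), demo_twice) is hypothetical and exists only for
illustration. In the patch itself this is done by __SYSCALL_DEFINEx()
and SYSCALL_TRACE_HANDLERx(), and the syscall number comes from
syscall_get_nr() rather than from a macro argument.

```c
/* Stand-ins for trace_sys_enter_handler()/trace_sys_exit_handler(). */
static int demo_enter_hits, demo_exit_hits;
static long demo_last_id, demo_last_ret;

static void demo_trace_enter(long id)
{
	demo_enter_hits++;
	demo_last_id = id;
}

static void demo_trace_exit(long ret)
{
	demo_exit_hits++;
	demo_last_ret = ret;
}

/*
 * Hypothetical one-argument analogue of __SYSCALL_DEFINEx(): emit the
 * public entry point, which calls the enter hook, the real handler and
 * the exit hook in that order, then leave the real handler's body to
 * the braces that follow the macro invocation.
 */
#define DEMO_SYSCALL_DEFINE1(nr, name, t1, a1)				\
	static long _demo_##name(t1 a1);				\
	long demo_##name(t1 a1)						\
	{								\
		long ret;						\
		demo_trace_enter(nr);					\
		ret = _demo_##name(a1);					\
		demo_trace_exit(ret);					\
		return ret;						\
	}								\
	static long _demo_##name(t1 a1)

/* Expands to demo_twice() (the wrapper) and _demo_twice() (the body). */
DEMO_SYSCALL_DEFINE1(7, twice, long, x)
{
	return 2 * x;
}
```

A caller only ever sees demo_twice(); the hooks run on every call
without any per-task flag having to be checked on the syscall entry
path.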

After this patch, the latency added is 370 - 117 = 253 ns per invocation
of a traced syscall. This is on par with a simple tracepoint added to
any kernel code path.

This patch also makes syscall tracing architecture independent as there
is no need to have a hook into the architecture specific syscall tracer
functions.

Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
---
 arch/openrisc/include/asm/thread_info.h |    1 -
 arch/powerpc/Kconfig                    |    1 -
 arch/powerpc/include/asm/thread_info.h  |    4 +--
 arch/powerpc/kernel/ptrace.c            |    6 -----
 arch/s390/Kconfig                       |    1 -
 arch/s390/include/asm/thread_info.h     |    2 -
 arch/s390/kernel/entry.S                |    3 +-
 arch/s390/kernel/entry64.S              |    3 +-
 arch/s390/kernel/ptrace.c               |    9 -------
 arch/sh/Kconfig                         |    1 -
 arch/sh/include/asm/thread_info.h       |    8 +-----
 arch/sh/kernel/ptrace_32.c              |    9 -------
 arch/sh/kernel/ptrace_64.c              |    9 -------
 arch/sparc/Kconfig                      |    1 -
 arch/sparc/include/asm/thread_info_64.h |    2 -
 arch/sparc/kernel/ptrace_64.c           |    9 -------
 arch/sparc/kernel/syscalls.S            |   10 ++++----
 arch/x86/Kconfig                        |    1 -
 arch/x86/include/asm/thread_info.h      |   10 ++-----
 arch/x86/kernel/ptrace.c                |    9 -------
 include/linux/syscalls.h                |   38 ++++++++++++++++++++++++++++--
 include/trace/events/syscalls.h         |   19 +++------------
 kernel/trace/Kconfig                    |    6 -----
 kernel/trace/trace_syscalls.c           |   33 ++++++++++++++++++++++++--
 kernel/tracepoint.c                     |   38 -------------------------------
 25 files changed, 82 insertions(+), 151 deletions(-)

diff --git a/arch/openrisc/include/asm/thread_info.h b/arch/openrisc/include/asm/thread_info.h
index 07a8bc0..f39aa73 100644
--- a/arch/openrisc/include/asm/thread_info.h
+++ b/arch/openrisc/include/asm/thread_info.h
@@ -110,7 +110,6 @@ register struct thread_info *current_thread_info_reg asm("r10");
 #define TIF_SINGLESTEP		4	/* restore singlestep on return to user
 					 * mode
 					 */
-#define TIF_SYSCALL_TRACEPOINT  8       /* for ftrace syscall instrumentation */
 #define TIF_RESTORE_SIGMASK     9
 #define TIF_POLLING_NRFLAG	16	/* true if poll_idle() is polling						 * TIF_NEED_RESCHED
 					 */
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 1919634..ab6b8f5 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -139,7 +139,6 @@ config PPC
 	select GENERIC_IRQ_SHOW_LEVEL
 	select IRQ_FORCED_THREADING
 	select HAVE_RCU_TABLE_FREE if SMP
-	select HAVE_SYSCALL_TRACEPOINTS
 	select HAVE_BPF_JIT if (PPC64 && NET)
 	select HAVE_ARCH_JUMP_LABEL
 	select ARCH_HAVE_NMI_SAFE_CMPXCHG
diff --git a/arch/powerpc/include/asm/thread_info.h b/arch/powerpc/include/asm/thread_info.h
index 96471494..b0721f2 100644
--- a/arch/powerpc/include/asm/thread_info.h
+++ b/arch/powerpc/include/asm/thread_info.h
@@ -109,7 +109,6 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_RESTOREALL		11	/* Restore all regs (implies NOERROR) */
 #define TIF_NOERROR		12	/* Force successful syscall return */
 #define TIF_NOTIFY_RESUME	13	/* callback before returning to user */
-#define TIF_SYSCALL_TRACEPOINT	15	/* syscall tracepoint instrumentation */
 #define TIF_RUNLATCH		16	/* Is the runlatch enabled? */
 
 /* as above, but as bit values */
@@ -126,10 +125,9 @@ static inline struct thread_info *current_thread_info(void)
 #define _TIF_RESTOREALL		(1<<TIF_RESTOREALL)
 #define _TIF_NOERROR		(1<<TIF_NOERROR)
 #define _TIF_NOTIFY_RESUME	(1<<TIF_NOTIFY_RESUME)
-#define _TIF_SYSCALL_TRACEPOINT	(1<<TIF_SYSCALL_TRACEPOINT)
 #define _TIF_RUNLATCH		(1<<TIF_RUNLATCH)
 #define _TIF_SYSCALL_T_OR_A	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \
-				 _TIF_SECCOMP | _TIF_SYSCALL_TRACEPOINT)
+				 _TIF_SECCOMP)
 
 #define _TIF_USER_WORK_MASK	(_TIF_SIGPENDING | _TIF_NEED_RESCHED | \
 				 _TIF_NOTIFY_RESUME)
diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c
index 5b43325..46a917c 100644
--- a/arch/powerpc/kernel/ptrace.c
+++ b/arch/powerpc/kernel/ptrace.c
@@ -1721,9 +1721,6 @@ long do_syscall_trace_enter(struct pt_regs *regs)
 		 */
 		ret = -1L;
 
-	if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
-		trace_sys_enter(regs, regs->gpr[0]);
-
 #ifdef CONFIG_PPC64
 	if (!is_32bit_task())
 		audit_syscall_entry(AUDIT_ARCH_PPC64,
@@ -1748,9 +1745,6 @@ void do_syscall_trace_leave(struct pt_regs *regs)
 
 	audit_syscall_exit(regs);
 
-	if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
-		trace_sys_exit(regs, regs->result);
-
 	step = test_thread_flag(TIF_SINGLESTEP);
 	if (step || test_thread_flag(TIF_SYSCALL_TRACE))
 		tracehook_report_syscall_exit(regs, step);
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 6d99a5f..3065759 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -69,7 +69,6 @@ config S390
 	select HAVE_FUNCTION_TRACE_MCOUNT_TEST
 	select HAVE_FTRACE_MCOUNT_RECORD
 	select HAVE_C_RECORDMCOUNT
-	select HAVE_SYSCALL_TRACEPOINTS
 	select HAVE_DYNAMIC_FTRACE
 	select HAVE_FUNCTION_GRAPH_TRACER
 	select HAVE_REGS_AND_STACK_ACCESS_API
diff --git a/arch/s390/include/asm/thread_info.h b/arch/s390/include/asm/thread_info.h
index a730381..4207a43 100644
--- a/arch/s390/include/asm/thread_info.h
+++ b/arch/s390/include/asm/thread_info.h
@@ -94,7 +94,6 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_SYSCALL_TRACE	8	/* syscall trace active */
 #define TIF_SYSCALL_AUDIT	9	/* syscall auditing active */
 #define TIF_SECCOMP		10	/* secure computing */
-#define TIF_SYSCALL_TRACEPOINT	11	/* syscall tracepoint instrumentation */
 #define TIF_SIE			12	/* guest execution active */
 #define TIF_POLLING_NRFLAG	16	/* true if poll_idle() is polling
 					   TIF_NEED_RESCHED */
@@ -113,7 +112,6 @@ static inline struct thread_info *current_thread_info(void)
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
 #define _TIF_SYSCALL_AUDIT	(1<<TIF_SYSCALL_AUDIT)
 #define _TIF_SECCOMP		(1<<TIF_SECCOMP)
-#define _TIF_SYSCALL_TRACEPOINT	(1<<TIF_SYSCALL_TRACEPOINT)
 #define _TIF_SIE		(1<<TIF_SIE)
 #define _TIF_POLLING_NRFLAG	(1<<TIF_POLLING_NRFLAG)
 #define _TIF_31BIT		(1<<TIF_31BIT)
diff --git a/arch/s390/kernel/entry.S b/arch/s390/kernel/entry.S
index 3705700..7ad7929 100644
--- a/arch/s390/kernel/entry.S
+++ b/arch/s390/kernel/entry.S
@@ -40,8 +40,7 @@ _TIF_WORK_SVC = (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_NEED_RESCHED | \
 		 _TIF_MCCK_PENDING | _TIF_PER_TRAP )
 _TIF_WORK_INT = (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_NEED_RESCHED | \
 		 _TIF_MCCK_PENDING)
-_TIF_TRACE    = (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | _TIF_SECCOMP | \
-		 _TIF_SYSCALL_TRACEPOINT)
+_TIF_TRACE    = (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | _TIF_SECCOMP)
 
 STACK_SHIFT = PAGE_SHIFT + THREAD_ORDER
 STACK_SIZE  = 1 << STACK_SHIFT
diff --git a/arch/s390/kernel/entry64.S b/arch/s390/kernel/entry64.S
index 412a7b8..1459a5b 100644
--- a/arch/s390/kernel/entry64.S
+++ b/arch/s390/kernel/entry64.S
@@ -43,8 +43,7 @@ _TIF_WORK_SVC = (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_NEED_RESCHED | \
 		 _TIF_MCCK_PENDING | _TIF_PER_TRAP )
 _TIF_WORK_INT = (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_NEED_RESCHED | \
 		 _TIF_MCCK_PENDING)
-_TIF_TRACE    = (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | _TIF_SECCOMP | \
-		 _TIF_SYSCALL_TRACEPOINT)
+_TIF_TRACE    = (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | _TIF_SECCOMP)
 _TIF_EXIT_SIE = (_TIF_SIGPENDING | _TIF_NEED_RESCHED | _TIF_MCCK_PENDING)
 
 #define BASED(name) name-system_call(%r13)
diff --git a/arch/s390/kernel/ptrace.c b/arch/s390/kernel/ptrace.c
index 61f9548..2151616 100644
--- a/arch/s390/kernel/ptrace.c
+++ b/arch/s390/kernel/ptrace.c
@@ -35,9 +35,6 @@
 #include "compat_ptrace.h"
 #endif
 
-#define CREATE_TRACE_POINTS
-#include <trace/events/syscalls.h>
-
 enum s390_regset {
 	REGSET_GENERAL,
 	REGSET_FP,
@@ -737,9 +734,6 @@ asmlinkage long do_syscall_trace_enter(struct pt_regs *regs)
 		ret = -1;
 	}
 
-	if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
-		trace_sys_enter(regs, regs->gprs[2]);
-
 	audit_syscall_entry(is_compat_task() ?
 				AUDIT_ARCH_S390 : AUDIT_ARCH_S390X,
 			    regs->gprs[2], regs->orig_gpr2,
@@ -752,9 +746,6 @@ asmlinkage void do_syscall_trace_exit(struct pt_regs *regs)
 {
 	audit_syscall_exit(regs);
 
-	if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
-		trace_sys_exit(regs, regs->gprs[2]);
-
 	if (test_thread_flag(TIF_SYSCALL_TRACE))
 		tracehook_report_syscall_exit(regs, 0);
 }
diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig
index 713fb58..e64c779 100644
--- a/arch/sh/Kconfig
+++ b/arch/sh/Kconfig
@@ -19,7 +19,6 @@ config SUPERH
 	select HAVE_KERNEL_LZMA
 	select HAVE_KERNEL_XZ
 	select HAVE_KERNEL_LZO
-	select HAVE_SYSCALL_TRACEPOINTS
 	select HAVE_REGS_AND_STACK_ACCESS_API
 	select HAVE_GENERIC_HARDIRQS
 	select HAVE_SPARSE_IRQ
diff --git a/arch/sh/include/asm/thread_info.h b/arch/sh/include/asm/thread_info.h
index 20ee40a..34abf14 100644
--- a/arch/sh/include/asm/thread_info.h
+++ b/arch/sh/include/asm/thread_info.h
@@ -119,7 +119,6 @@ extern void init_thread_xstate(void);
 #define TIF_SYSCALL_AUDIT	5	/* syscall auditing active */
 #define TIF_SECCOMP		6	/* secure computing */
 #define TIF_NOTIFY_RESUME	7	/* callback before returning to user */
-#define TIF_SYSCALL_TRACEPOINT	8	/* for ftrace syscall instrumentation */
 #define TIF_POLLING_NRFLAG	17	/* true if poll_idle() is polling TIF_NEED_RESCHED */
 #define TIF_MEMDIE		18	/* is terminating due to OOM killer */
 
@@ -130,7 +129,6 @@ extern void init_thread_xstate(void);
 #define _TIF_SYSCALL_AUDIT	(1 << TIF_SYSCALL_AUDIT)
 #define _TIF_SECCOMP		(1 << TIF_SECCOMP)
 #define _TIF_NOTIFY_RESUME	(1 << TIF_NOTIFY_RESUME)
-#define _TIF_SYSCALL_TRACEPOINT	(1 << TIF_SYSCALL_TRACEPOINT)
 #define _TIF_POLLING_NRFLAG	(1 << TIF_POLLING_NRFLAG)
 
 /*
@@ -141,14 +139,12 @@ extern void init_thread_xstate(void);
 
 /* work to do in syscall trace */
 #define _TIF_WORK_SYSCALL_MASK	(_TIF_SYSCALL_TRACE | _TIF_SINGLESTEP | \
-				 _TIF_SYSCALL_AUDIT | _TIF_SECCOMP    | \
-				 _TIF_SYSCALL_TRACEPOINT)
+				 _TIF_SYSCALL_AUDIT | _TIF_SECCOMP)
 
 /* work to do on any return to u-space */
 #define _TIF_ALLWORK_MASK	(_TIF_SYSCALL_TRACE | _TIF_SIGPENDING      | \
 				 _TIF_NEED_RESCHED  | _TIF_SYSCALL_AUDIT   | \
-				 _TIF_SINGLESTEP    | _TIF_NOTIFY_RESUME   | \
-				 _TIF_SYSCALL_TRACEPOINT)
+				 _TIF_SINGLESTEP    | _TIF_NOTIFY_RESUME)
 
 /* work to do on interrupt/exception return */
 #define _TIF_WORK_MASK		(_TIF_ALLWORK_MASK & ~(_TIF_SYSCALL_TRACE | \
diff --git a/arch/sh/kernel/ptrace_32.c b/arch/sh/kernel/ptrace_32.c
index a3e6515..a8c2aa2 100644
--- a/arch/sh/kernel/ptrace_32.c
+++ b/arch/sh/kernel/ptrace_32.c
@@ -34,9 +34,6 @@
 #include <asm/syscalls.h>
 #include <asm/fpu.h>
 
-#define CREATE_TRACE_POINTS
-#include <trace/events/syscalls.h>
-
 /*
  * This routine will get a word off of the process kernel stack.
  */
@@ -515,9 +512,6 @@ asmlinkage long do_syscall_trace_enter(struct pt_regs *regs)
 		 */
 		ret = -1L;
 
-	if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
-		trace_sys_enter(regs, regs->regs[0]);
-
 	audit_syscall_entry(audit_arch(), regs->regs[3],
 			    regs->regs[4], regs->regs[5],
 			    regs->regs[6], regs->regs[7]);
@@ -531,9 +525,6 @@ asmlinkage void do_syscall_trace_leave(struct pt_regs *regs)
 
 	audit_syscall_exit(regs);
 
-	if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
-		trace_sys_exit(regs, regs->regs[0]);
-
 	step = test_thread_flag(TIF_SINGLESTEP);
 	if (step || test_thread_flag(TIF_SYSCALL_TRACE))
 		tracehook_report_syscall_exit(regs, step);
diff --git a/arch/sh/kernel/ptrace_64.c b/arch/sh/kernel/ptrace_64.c
index 3d0080b..7cf8212 100644
--- a/arch/sh/kernel/ptrace_64.c
+++ b/arch/sh/kernel/ptrace_64.c
@@ -40,9 +40,6 @@
 #include <asm/syscalls.h>
 #include <asm/fpu.h>
 
-#define CREATE_TRACE_POINTS
-#include <trace/events/syscalls.h>
-
 /* This mask defines the bits of the SR which the user is not allowed to
    change, which are everything except S, Q, M, PR, SZ, FR. */
 #define SR_MASK      (0xffff8cfd)
@@ -533,9 +530,6 @@ asmlinkage long long do_syscall_trace_enter(struct pt_regs *regs)
 		 */
 		ret = -1LL;
 
-	if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
-		trace_sys_enter(regs, regs->regs[9]);
-
 	audit_syscall_entry(audit_arch(), regs->regs[1],
 			    regs->regs[2], regs->regs[3],
 			    regs->regs[4], regs->regs[5]);
@@ -549,9 +543,6 @@ asmlinkage void do_syscall_trace_leave(struct pt_regs *regs)
 
 	audit_syscall_exit(regs);
 
-	if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
-		trace_sys_exit(regs, regs->regs[9]);
-
 	step = test_thread_flag(TIF_SINGLESTEP);
 	if (step || test_thread_flag(TIF_SYSCALL_TRACE))
 		tracehook_report_syscall_exit(regs, step);
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index ca5580e..df3ba69 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -50,7 +50,6 @@ config SPARC64
 	select HAVE_SYSCALL_WRAPPERS
 	select HAVE_DYNAMIC_FTRACE
 	select HAVE_FTRACE_MCOUNT_RECORD
-	select HAVE_SYSCALL_TRACEPOINTS
 	select RTC_DRV_CMOS
 	select RTC_DRV_BQ4802
 	select RTC_DRV_SUN4V
diff --git a/arch/sparc/include/asm/thread_info_64.h b/arch/sparc/include/asm/thread_info_64.h
index 01d057f..2afad03 100644
--- a/arch/sparc/include/asm/thread_info_64.h
+++ b/arch/sparc/include/asm/thread_info_64.h
@@ -217,7 +217,6 @@ register struct thread_info *current_thread_info_reg asm("g6");
 /* flag bit 8 is available */
 #define TIF_SECCOMP		9	/* secure computing */
 #define TIF_SYSCALL_AUDIT	10	/* syscall auditing active */
-#define TIF_SYSCALL_TRACEPOINT	11	/* syscall tracepoint instrumentation */
 /* NOTE: Thread flags >= 12 should be ones we have no interest
  *       in using in assembly, else we can't use the mask as
  *       an immediate value in instructions such as andcc.
@@ -234,7 +233,6 @@ register struct thread_info *current_thread_info_reg asm("g6");
 #define _TIF_32BIT		(1<<TIF_32BIT)
 #define _TIF_SECCOMP		(1<<TIF_SECCOMP)
 #define _TIF_SYSCALL_AUDIT	(1<<TIF_SYSCALL_AUDIT)
-#define _TIF_SYSCALL_TRACEPOINT	(1<<TIF_SYSCALL_TRACEPOINT)
 #define _TIF_POLLING_NRFLAG	(1<<TIF_POLLING_NRFLAG)
 
 #define _TIF_USER_WORK_MASK	((0xff << TI_FLAG_WSAVED_SHIFT) | \
diff --git a/arch/sparc/kernel/ptrace_64.c b/arch/sparc/kernel/ptrace_64.c
index 9388844..6f3ba31 100644
--- a/arch/sparc/kernel/ptrace_64.c
+++ b/arch/sparc/kernel/ptrace_64.c
@@ -38,9 +38,6 @@
 #include <asm/cpudata.h>
 #include <asm/cacheflush.h>
 
-#define CREATE_TRACE_POINTS
-#include <trace/events/syscalls.h>
-
 #include "entry.h"
 
 /* #define ALLOW_INIT_TRACING */
@@ -1068,9 +1065,6 @@ asmlinkage int syscall_trace_enter(struct pt_regs *regs)
 	if (test_thread_flag(TIF_SYSCALL_TRACE))
 		ret = tracehook_report_syscall_entry(regs);
 
-	if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
-		trace_sys_enter(regs, regs->u_regs[UREG_G1]);
-
 	audit_syscall_entry((test_thread_flag(TIF_32BIT) ?
 			     AUDIT_ARCH_SPARC :
 			     AUDIT_ARCH_SPARC64),
@@ -1087,9 +1081,6 @@ asmlinkage void syscall_trace_leave(struct pt_regs *regs)
 {
 	audit_syscall_exit(regs);
 
-	if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
-		trace_sys_exit(regs, regs->u_regs[UREG_G1]);
-
 	if (test_thread_flag(TIF_SYSCALL_TRACE))
 		tracehook_report_syscall_exit(regs, 0);
 }
diff --git a/arch/sparc/kernel/syscalls.S b/arch/sparc/kernel/syscalls.S
index 1d7e274..c8b3bb2 100644
--- a/arch/sparc/kernel/syscalls.S
+++ b/arch/sparc/kernel/syscalls.S
@@ -62,7 +62,7 @@ sys32_rt_sigreturn:
 #endif
 	.align	32
 1:	ldx	[%g6 + TI_FLAGS], %l5
-	andcc	%l5, (_TIF_SYSCALL_TRACE|_TIF_SECCOMP|_TIF_SYSCALL_AUDIT|_TIF_SYSCALL_TRACEPOINT), %g0
+	andcc	%l5, (_TIF_SYSCALL_TRACE|_TIF_SECCOMP|_TIF_SYSCALL_AUDIT), %g0
 	be,pt	%icc, rtrap
 	 nop
 	call	syscall_trace_leave
@@ -179,7 +179,7 @@ linux_sparc_syscall32:
 
 	srl	%i5, 0, %o5				! IEU1
 	srl	%i2, 0, %o2				! IEU0	Group
-	andcc	%l0, (_TIF_SYSCALL_TRACE|_TIF_SECCOMP|_TIF_SYSCALL_AUDIT|_TIF_SYSCALL_TRACEPOINT), %g0
+	andcc	%l0, (_TIF_SYSCALL_TRACE|_TIF_SECCOMP|_TIF_SYSCALL_AUDIT), %g0
 	bne,pn	%icc, linux_syscall_trace32		! CTI
 	 mov	%i0, %l5				! IEU1
 	call	%l7					! CTI	Group brk forced
@@ -202,7 +202,7 @@ linux_sparc_syscall:
 
 	mov	%i3, %o3				! IEU1
 	mov	%i4, %o4				! IEU0	Group
-	andcc	%l0, (_TIF_SYSCALL_TRACE|_TIF_SECCOMP|_TIF_SYSCALL_AUDIT|_TIF_SYSCALL_TRACEPOINT), %g0
+	andcc	%l0, (_TIF_SYSCALL_TRACE|_TIF_SECCOMP|_TIF_SYSCALL_AUDIT), %g0
 	bne,pn	%icc, linux_syscall_trace		! CTI	Group
 	 mov	%i0, %l5				! IEU0
 2:	call	%l7					! CTI	Group brk forced
@@ -226,7 +226,7 @@ ret_sys_call:
 
 	cmp	%o0, -ERESTART_RESTARTBLOCK
 	bgeu,pn	%xcc, 1f
-	 andcc	%l0, (_TIF_SYSCALL_TRACE|_TIF_SECCOMP|_TIF_SYSCALL_AUDIT|_TIF_SYSCALL_TRACEPOINT), %l6
+	 andcc	%l0, (_TIF_SYSCALL_TRACE|_TIF_SECCOMP|_TIF_SYSCALL_AUDIT), %l6
 80:
 	/* System call success, clear Carry condition code. */
 	andn	%g3, %g2, %g3
@@ -241,7 +241,7 @@ ret_sys_call:
 	/* System call failure, set Carry condition code.
 	 * Also, get abs(errno) to return to the process.
 	 */
-	andcc	%l0, (_TIF_SYSCALL_TRACE|_TIF_SECCOMP|_TIF_SYSCALL_AUDIT|_TIF_SYSCALL_TRACEPOINT), %l6	
+	andcc	%l0, (_TIF_SYSCALL_TRACE|_TIF_SECCOMP|_TIF_SYSCALL_AUDIT), %l6
 	sub	%g0, %o0, %o0
 	or	%g3, %g2, %g3
 	stx	%o0, [%sp + PTREGS_OFF + PT_V9_I0]
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 5bed94e..1f19cf6 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -41,7 +41,6 @@ config X86
 	select HAVE_FUNCTION_GRAPH_FP_TEST
 	select HAVE_FUNCTION_TRACE_MCOUNT_TEST
 	select HAVE_FTRACE_NMI_ENTER if DYNAMIC_FTRACE
-	select HAVE_SYSCALL_TRACEPOINTS
 	select HAVE_KVM
 	select HAVE_ARCH_KGDB
 	select HAVE_ARCH_TRACEHOOK
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index cfd8144..192b7a3 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -94,7 +94,6 @@ struct thread_info {
 #define TIF_FORCED_TF		24	/* true if TF in eflags artificially */
 #define TIF_BLOCKSTEP		25	/* set when we want DEBUGCTLMSR_BTF */
 #define TIF_LAZY_MMU_UPDATES	27	/* task is updating the mmu lazily */
-#define TIF_SYSCALL_TRACEPOINT	28	/* syscall tracepoint instrumentation */
 
 #define _TIF_SYSCALL_TRACE	(1 << TIF_SYSCALL_TRACE)
 #define _TIF_NOTIFY_RESUME	(1 << TIF_NOTIFY_RESUME)
@@ -115,17 +114,15 @@ struct thread_info {
 #define _TIF_FORCED_TF		(1 << TIF_FORCED_TF)
 #define _TIF_BLOCKSTEP		(1 << TIF_BLOCKSTEP)
 #define _TIF_LAZY_MMU_UPDATES	(1 << TIF_LAZY_MMU_UPDATES)
-#define _TIF_SYSCALL_TRACEPOINT	(1 << TIF_SYSCALL_TRACEPOINT)
 
 /* work to do in syscall_trace_enter() */
 #define _TIF_WORK_SYSCALL_ENTRY	\
 	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU | _TIF_SYSCALL_AUDIT |	\
-	 _TIF_SECCOMP | _TIF_SINGLESTEP | _TIF_SYSCALL_TRACEPOINT)
+	 _TIF_SECCOMP | _TIF_SINGLESTEP)
 
 /* work to do in syscall_trace_leave() */
 #define _TIF_WORK_SYSCALL_EXIT	\
-	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | _TIF_SINGLESTEP |	\
-	 _TIF_SYSCALL_TRACEPOINT)
+	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | _TIF_SINGLESTEP)
 
 /* work to do on interrupt/exception return */
 #define _TIF_WORK_MASK							\
@@ -134,8 +131,7 @@ struct thread_info {
 	   _TIF_SINGLESTEP|_TIF_SECCOMP|_TIF_SYSCALL_EMU))
 
 /* work to do on any return to user space */
-#define _TIF_ALLWORK_MASK						\
-	((0x0000FFFF & ~_TIF_SECCOMP) | _TIF_SYSCALL_TRACEPOINT)
+#define _TIF_ALLWORK_MASK		(0x0000FFFF & ~_TIF_SECCOMP)
 
 /* Only used for 64 bit */
 #define _TIF_DO_NOTIFY_MASK						\
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 5026738..3f1bab2 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -36,9 +36,6 @@
 
 #include "tls.h"
 
-#define CREATE_TRACE_POINTS
-#include <trace/events/syscalls.h>
-
 enum x86_regset {
 	REGSET_GENERAL,
 	REGSET_FP,
@@ -1389,9 +1386,6 @@ long syscall_trace_enter(struct pt_regs *regs)
 	    tracehook_report_syscall_entry(regs))
 		ret = -1L;
 
-	if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
-		trace_sys_enter(regs, regs->orig_ax);
-
 	if (IS_IA32)
 		audit_syscall_entry(AUDIT_ARCH_I386,
 				    regs->orig_ax,
@@ -1414,9 +1408,6 @@ void syscall_trace_leave(struct pt_regs *regs)
 
 	audit_syscall_exit(regs);
 
-	if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
-		trace_sys_exit(regs, regs->ax);
-
 	/*
 	 * If TIF_SYSCALL_EMU is set, we only get here because of
 	 * TIF_SINGLESTEP (i.e. this is PTRACE_SYSEMU_SINGLESTEP).
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index fd4d37d..9f3e5cf 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -75,8 +75,9 @@ struct file_handle;
 #include <linux/quota.h>
 #include <linux/key.h>
 #include <trace/syscall.h>
+#include <asm/syscall.h>
 
-#define __SC_DECL0()
+#define __SC_DECL0() void
 #define __SC_DECL1(t1, a1)	t1 a1
 #define __SC_DECL2(t2, a2, ...) t2 a2, __SC_DECL1(__VA_ARGS__)
 #define __SC_DECL3(t3, a3, ...) t3 a3, __SC_DECL2(__VA_ARGS__)
@@ -232,6 +233,27 @@ extern struct trace_event_functions exit_syscall_print_funcs;
 	SYSCALL_METADATAx(x, sname, __VA_ARGS__)			\
 	__SYSCALL_DEFINEx(x, sname, __VA_ARGS__)
 
+extern void trace_sys_enter_handler(struct pt_regs *regs, long id);
+extern void trace_sys_exit_handler(struct pt_regs *regs, long ret);
+
+#ifdef CONFIG_FTRACE_SYSCALLS
+#define SYSCALL_TRACE_HANDLERx(x, name, ...)				\
+	({								\
+		long ret;						\
+		struct pt_regs *regs = task_pt_regs(current);		\
+		long syscall_nr = syscall_get_nr(current, regs);	\
+		trace_sys_enter_handler(regs, syscall_nr);		\
+		ret = _sys_##name(__SC_CAST##x(__VA_ARGS__));		\
+		trace_sys_exit_handler(regs, ret);			\
+		ret;							\
+	 })
+#else
+#define SYSCALL_TRACE_HANDLERx(x, name, ...)				\
+	({								\
+		_sys_##name(__SC_CAST##x(__VA_ARGS__));			\
+	 })
+#endif
+
 #ifdef CONFIG_HAVE_SYSCALL_WRAPPERS
 
 #define SYSCALL_DEFINE(name) static inline long SYSC_##name
@@ -245,13 +267,23 @@ extern struct trace_event_functions exit_syscall_print_funcs;
 		return (long) SYSC_##name(__SC_CAST##x(__VA_ARGS__));	\
 	}								\
 	SYSCALL_ALIAS(sys_##name, SyS_##name);				\
-	static inline long SYSC_##name(__SC_DECL##x(__VA_ARGS__))
+	static inline long _sys_##name(__SC_DECL##x(__VA_ARGS__));	\
+	static inline long SYSC_##name(__SC_DECL##x(__VA_ARGS__))	\
+	{								\
+		return SYSCALL_TRACE_HANDLERx(x, name, __VA_ARGS__);	\
+	}								\
+	static inline long _sys_##name(__SC_DECL##x(__VA_ARGS__))
 
 #else /* CONFIG_HAVE_SYSCALL_WRAPPERS */
 
 #define SYSCALL_DEFINE(name) asmlinkage long sys_##name
 #define __SYSCALL_DEFINEx(x, name, ...)					\
-	asmlinkage long sys_##name(__SC_DECL##x(__VA_ARGS__))
+	static inline long _sys_##name(__SC_DECL##x(__VA_ARGS__));	\
+	asmlinkage long sys_##name(__SC_DECL##x(__VA_ARGS__))		\
+	{								\
+		return SYSCALL_TRACE_HANDLERx(x, name, __VA_ARGS__);	\
+	}								\
+	static inline long _sys_##name(__SC_DECL##x(__VA_ARGS__))
 
 #endif /* CONFIG_HAVE_SYSCALL_WRAPPERS */
 
diff --git a/include/trace/events/syscalls.h b/include/trace/events/syscalls.h
index 5a4c04a..aeaa536 100644
--- a/include/trace/events/syscalls.h
+++ b/include/trace/events/syscalls.h
@@ -11,12 +11,7 @@
 #include <asm/syscall.h>
 
 
-#ifdef CONFIG_HAVE_SYSCALL_TRACEPOINTS
-
-extern void syscall_regfunc(void);
-extern void syscall_unregfunc(void);
-
-TRACE_EVENT_FN(sys_enter,
+TRACE_EVENT(sys_enter,
 
 	TP_PROTO(struct pt_regs *regs, long id),
 
@@ -35,14 +30,12 @@ TRACE_EVENT_FN(sys_enter,
 	TP_printk("NR %ld (%lx, %lx, %lx, %lx, %lx, %lx)",
 		  __entry->id,
 		  __entry->args[0], __entry->args[1], __entry->args[2],
-		  __entry->args[3], __entry->args[4], __entry->args[5]),
-
-	syscall_regfunc, syscall_unregfunc
+		  __entry->args[3], __entry->args[4], __entry->args[5])
 );
 
 TRACE_EVENT_FLAGS(sys_enter, TRACE_EVENT_FL_CAP_ANY)
 
-TRACE_EVENT_FN(sys_exit,
+TRACE_EVENT(sys_exit,
 
 	TP_PROTO(struct pt_regs *regs, long ret),
 
@@ -59,15 +52,11 @@ TRACE_EVENT_FN(sys_exit,
 	),
 
 	TP_printk("NR %ld = %ld",
-		  __entry->id, __entry->ret),
-
-	syscall_regfunc, syscall_unregfunc
+		  __entry->id, __entry->ret)
 );
 
 TRACE_EVENT_FLAGS(sys_exit, TRACE_EVENT_FL_CAP_ANY)
 
-#endif /* CONFIG_HAVE_SYSCALL_TRACEPOINTS */
-
 #endif /* _TRACE_EVENTS_SYSCALLS_H */
 
 /* This part must be outside protection */
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index cd0954b..5afa3f5 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -44,11 +44,6 @@ config HAVE_FTRACE_MCOUNT_RECORD
 	help
 	  See Documentation/trace/ftrace-design.txt
 
-config HAVE_SYSCALL_TRACEPOINTS
-	bool
-	help
-	  See Documentation/trace/ftrace-design.txt
-
 config HAVE_C_RECORDMCOUNT
 	bool
 	help
@@ -234,7 +229,6 @@ config ENABLE_DEFAULT_TRACERS
 
 config FTRACE_SYSCALLS
 	bool "Trace syscalls"
-	depends on HAVE_SYSCALL_TRACEPOINTS
 	select GENERIC_TRACER
 	select KALLSYMS
 	help
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index 43a8685..b757eba 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -1,5 +1,4 @@
 #include <trace/syscall.h>
-#include <trace/events/syscalls.h>
 #include <linux/slab.h>
 #include <linux/kernel.h>
 #include <linux/module.h>	/* for MODULE_NAME_LEN via KSYM_SYMBOL_LEN */
@@ -11,6 +10,9 @@
 #include "trace_output.h"
 #include "trace.h"
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/syscalls.h>
+
 static DEFINE_MUTEX(syscall_trace_lock);
 static int sys_refcount_enter;
 static int sys_refcount_exit;
@@ -369,7 +371,7 @@ void ftrace_syscall_exit(void *ignore, struct pt_regs *regs, long ret)
 
 	entry = ring_buffer_event_data(event);
 	entry->nr = syscall_nr;
-	entry->ret = syscall_get_return_value(current, regs);
+	entry->ret = ret;
 
 	if (!filter_current_check_discard(buffer, sys_data->exit_event,
 					  entry, event))
@@ -501,6 +503,31 @@ int __init init_ftrace_syscalls(void)
 }
 core_initcall(init_ftrace_syscalls);
 
+/*
+ * trace_sys_(enter|exit)_handler
+ *
+ * These functions provide a way to add tracepoints to every syscall.
+ * The macros that define syscall handlers using SYSCALL_DEFINE in
+ * include/linux/syscalls.h conflict with the recursively included ftrace
+ * macro magic defined in include/linux/tracepoint.h.
+ *
+ * So it is not feasible to include trace/events/syscalls.h to provide the
+ * definition for trace_sys_(enter|exit) probes.
+ *
+ * Hence the need for these functions, which are forward declared in
+ * include/linux/syscalls.h and are called from the SYSCALL_DEFINE section
+ * for each syscall.
+ */
+void trace_sys_enter_handler(struct pt_regs *regs, long id)
+{
+	trace_sys_enter(regs, id);
+}
+
+void trace_sys_exit_handler(struct pt_regs *regs, long ret)
+{
+	trace_sys_exit(regs, ret);
+}
+
 #ifdef CONFIG_PERF_EVENTS
 
 static DECLARE_BITMAP(enabled_perf_enter_syscalls, NR_syscalls);
@@ -617,7 +644,7 @@ static void perf_syscall_exit(void *ignore, struct pt_regs *regs, long ret)
 		return;
 
 	rec->nr = syscall_nr;
-	rec->ret = syscall_get_return_value(current, regs);
+	rec->ret = ret;
 
 	head = this_cpu_ptr(sys_data->exit_event->perf_events);
 	perf_trace_buf_submit(rec, size, rctx, 0, 1, regs, head);
diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
index f1539de..3b41d3e 100644
--- a/kernel/tracepoint.c
+++ b/kernel/tracepoint.c
@@ -726,41 +726,3 @@ static int init_tracepoints(void)
 }
 __initcall(init_tracepoints);
 #endif /* CONFIG_MODULES */
-
-#ifdef CONFIG_HAVE_SYSCALL_TRACEPOINTS
-
-/* NB: reg/unreg are called while guarded with the tracepoints_mutex */
-static int sys_tracepoint_refcount;
-
-void syscall_regfunc(void)
-{
-	unsigned long flags;
-	struct task_struct *g, *t;
-
-	if (!sys_tracepoint_refcount) {
-		read_lock_irqsave(&tasklist_lock, flags);
-		do_each_thread(g, t) {
-			/* Skip kernel threads. */
-			if (t->mm)
-				set_tsk_thread_flag(t, TIF_SYSCALL_TRACEPOINT);
-		} while_each_thread(g, t);
-		read_unlock_irqrestore(&tasklist_lock, flags);
-	}
-	sys_tracepoint_refcount++;
-}
-
-void syscall_unregfunc(void)
-{
-	unsigned long flags;
-	struct task_struct *g, *t;
-
-	sys_tracepoint_refcount--;
-	if (!sys_tracepoint_refcount) {
-		read_lock_irqsave(&tasklist_lock, flags);
-		do_each_thread(g, t) {
-			clear_tsk_thread_flag(t, TIF_SYSCALL_TRACEPOINT);
-		} while_each_thread(g, t);
-		read_unlock_irqrestore(&tasklist_lock, flags);
-	}
-}
-#endif
-- 
1.7.7.3



* [PATCH 5/6] trace: raw_syscalls: Mark compat syscalls in the MSB of the syscall number
  2012-03-26 18:39 [PATCH 0/6] Enhance and speed up syscall tracing Vaibhav Nagarnaik
                   ` (3 preceding siblings ...)
  2012-03-26 18:39 ` [PATCH 4/6] trace: trace syscall in its handler not from ptrace handler Vaibhav Nagarnaik
@ 2012-03-26 18:39 ` Vaibhav Nagarnaik
  2012-03-26 18:39 ` [PATCH 6/6] trace: get rid of the enabled_*_syscalls bitmaps Vaibhav Nagarnaik
  5 siblings, 0 replies; 29+ messages in thread
From: Vaibhav Nagarnaik @ 2012-03-26 18:39 UTC (permalink / raw)
  To: Steven Rostedt, Frederic Weisbecker, Thomas Gleixner, Ingo Molnar
  Cc: David Sharp, Justin Teravest, Laurent Chavey, x86, linux-kernel,
	Vaibhav Nagarnaik

From: David Sharp <dhsharp@google.com>

Compat syscalls are indistinguishable from native syscalls in the trace,
making correct interpretation of 'id' impossible on systems where compat
tasks are running. Set the MSB of 'id' if the traced syscall is compat,
and output this bit in the print format.
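
The encoding can be exercised outside the kernel with a small sketch.
COMPAT_MASK and COMPAT_BIT are the values defined by this patch, with
parentheses added around COMPAT_BIT. The helper names mark_compat(),
syscall_nr_of() and is_compat_id() are hypothetical and only
illustrate how a consumer might set and decode the bit.

```c
#include <stdbool.h>

#define COMPAT_MASK (~0UL >> 1)		/* every bit except the MSB */
#define COMPAT_BIT  (~(~0UL >> 1))	/* only the MSB */

/* Tag an id as compat by setting the MSB; native ids pass through. */
unsigned long mark_compat(unsigned long id, bool is_compat)
{
	return is_compat ? (id | COMPAT_BIT) : id;
}

/* Recover the plain syscall number by masking off the MSB. */
unsigned long syscall_nr_of(unsigned long id)
{
	return id & COMPAT_MASK;
}

/* Report whether the id was recorded for a compat task. */
bool is_compat_id(unsigned long id)
{
	return (id & COMPAT_BIT) != 0;
}
```

Because the marker lives in the top bit, any valid syscall number
survives the round trip unchanged, and the same two macros work for
both 32-bit and 64-bit unsigned long.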

Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
---
 include/trace/events/syscalls.h |   23 ++++++++++++++++++-----
 1 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/include/trace/events/syscalls.h b/include/trace/events/syscalls.h
index aeaa536..4f42fdc 100644
--- a/include/trace/events/syscalls.h
+++ b/include/trace/events/syscalls.h
@@ -6,10 +6,13 @@
 #define _TRACE_EVENTS_SYSCALLS_H
 
 #include <linux/tracepoint.h>
+#include <linux/compat.h>
 
 #include <asm/ptrace.h>
 #include <asm/syscall.h>
 
+#define COMPAT_MASK (~0UL>>1)
+#define COMPAT_BIT ~(~0UL>>1)
 
 TRACE_EVENT(sys_enter,
 
@@ -23,14 +26,19 @@ TRACE_EVENT(sys_enter,
 	),
 
 	TP_fast_assign(
+#ifdef CONFIG_COMPAT
+		if (is_compat_task())
+			id |= COMPAT_BIT;
+#endif
 		__entry->id	= id;
 		syscall_get_arguments(current, regs, 0, 6, __entry->args);
 	),
 
-	TP_printk("NR %ld (%lx, %lx, %lx, %lx, %lx, %lx)",
-		  __entry->id,
+	TP_printk("NR %ld (%lx, %lx, %lx, %lx, %lx, %lx) isCompat: %d",
+		  __entry->id & COMPAT_MASK,
 		  __entry->args[0], __entry->args[1], __entry->args[2],
-		  __entry->args[3], __entry->args[4], __entry->args[5])
+		  __entry->args[3], __entry->args[4], __entry->args[5],
+		  !!(__entry->id & COMPAT_BIT))
 );
 
 TRACE_EVENT_FLAGS(sys_enter, TRACE_EVENT_FL_CAP_ANY)
@@ -48,11 +56,16 @@ TRACE_EVENT(sys_exit,
 
 	TP_fast_assign(
 		__entry->id	= syscall_get_nr(current, regs);
+#ifdef CONFIG_COMPAT
+		if (is_compat_task())
+			__entry->id |= COMPAT_BIT;
+#endif
 		__entry->ret	= ret;
 	),
 
-	TP_printk("NR %ld = %ld",
-		  __entry->id, __entry->ret)
+	TP_printk("NR %ld = %ld isCompat: %d",
+		  __entry->id & COMPAT_MASK, __entry->ret,
+		  !!(__entry->id & COMPAT_BIT))
 );
 
 TRACE_EVENT_FLAGS(sys_exit, TRACE_EVENT_FL_CAP_ANY)
-- 
1.7.7.3
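
The bit manipulation in the patch above can be tried out in userspace. The following is a minimal sketch, not the kernel code: the helper names encode_id(), decode_nr() and decode_is_compat() are hypothetical, and COMPAT_BIT is written with outer parentheses, which the patch omits.

```c
#include <assert.h>
#include <stdbool.h>

/* Same values as in the patch; COMPAT_BIT is fully parenthesized here. */
#define COMPAT_MASK (~0UL >> 1)
#define COMPAT_BIT  (~(~0UL >> 1))

/* Hypothetical helper: tag a syscall number the way TP_fast_assign() does. */
static unsigned long encode_id(unsigned long nr, bool compat)
{
	/* The scheme only works if real syscall numbers never use the MSB. */
	assert((nr & COMPAT_BIT) == 0);
	return compat ? (nr | COMPAT_BIT) : nr;
}

/* Hypothetical helpers: recover what TP_printk() prints. */
static unsigned long decode_nr(unsigned long id)
{
	return id & COMPAT_MASK;
}

static bool decode_is_compat(unsigned long id)
{
	return (id & COMPAT_BIT) != 0;
}
```

A consumer of the raw event would call decode_nr() and decode_is_compat() on the recorded 'id'. One thing the sketch makes visible: the tracepoint carries 'id' as a signed long, so a negative value such as -1 already has its top bit set and would decode as compat.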



* [PATCH 6/6] trace: get rid of the enabled_*_syscalls bitmaps
  2012-03-26 18:39 [PATCH 0/6] Enhance and speed up syscall tracing Vaibhav Nagarnaik
                   ` (4 preceding siblings ...)
  2012-03-26 18:39 ` [PATCH 5/6] trace: raw_syscalls: Mark compat syscalls in the MSB of the syscall number Vaibhav Nagarnaik
@ 2012-03-26 18:39 ` Vaibhav Nagarnaik
  5 siblings, 0 replies; 29+ messages in thread
From: Vaibhav Nagarnaik @ 2012-03-26 18:39 UTC (permalink / raw)
  To: Steven Rostedt, Frederic Weisbecker, Thomas Gleixner, Ingo Molnar
  Cc: David Sharp, Justin Teravest, Laurent Chavey, x86, linux-kernel,
	Michael Davidson, Vaibhav Nagarnaik

From: Michael Davidson <md@google.com>

Get rid of the enabled_*_syscalls bitmaps.

Since there is a separate event for each possible system call entry
and exit, the bitmaps are unnecessary because the information that
we need already exists in the ftrace_event_call struct.

The "enabled" field indicates that the event is enabled for regular
system call tracing and a "perf_refcount" value greater than zero
indicates that the perf_event is enabled.

The motivation for this change is to avoid the need to create yet
another set of bitmaps for 32 bit system call numbers when support
for tracing those system calls is added.

Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
---
 kernel/trace/trace_syscalls.c |   96 ++++++++++------------------------------
 1 files changed, 24 insertions(+), 72 deletions(-)

diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index b757eba..f3fcd13 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -16,8 +16,6 @@
 static DEFINE_MUTEX(syscall_trace_lock);
 static int sys_refcount_enter;
 static int sys_refcount_exit;
-static DECLARE_BITMAP(enabled_enter_syscalls, NR_syscalls);
-static DECLARE_BITMAP(enabled_exit_syscalls, NR_syscalls);
 
 static int syscall_enter_register(struct ftrace_event_call *event,
 				 enum trace_reg type);
@@ -323,13 +321,14 @@ void ftrace_syscall_enter(void *ignore, struct pt_regs *regs, long id)
 	syscall_nr = syscall_get_nr(current, regs);
 	if (syscall_nr < 0)
 		return;
-	if (!test_bit(syscall_nr, enabled_enter_syscalls))
-		return;
 
 	sys_data = syscall_nr_to_meta(syscall_nr);
 	if (!sys_data)
 		return;
 
+	if (!(sys_data->enter_event->flags & TRACE_EVENT_FL_ENABLED))
+		return;
+
 	size = sizeof(*entry) + sizeof(unsigned long) * sys_data->nb_args;
 
 	event = trace_current_buffer_lock_reserve(&buffer,
@@ -357,13 +356,14 @@ void ftrace_syscall_exit(void *ignore, struct pt_regs *regs, long ret)
 	syscall_nr = syscall_get_nr(current, regs);
 	if (syscall_nr < 0)
 		return;
-	if (!test_bit(syscall_nr, enabled_exit_syscalls))
-		return;
 
 	sys_data = syscall_nr_to_meta(syscall_nr);
 	if (!sys_data)
 		return;
 
+	if (!(sys_data->exit_event->flags & TRACE_EVENT_FL_ENABLED))
+		return;
+
 	event = trace_current_buffer_lock_reserve(&buffer,
 			sys_data->exit_event->event.type, sizeof(*entry), 0, 0);
 	if (!event)
@@ -381,32 +381,19 @@ void ftrace_syscall_exit(void *ignore, struct pt_regs *regs, long ret)
 int reg_event_syscall_enter(struct ftrace_event_call *call)
 {
 	int ret = 0;
-	int num;
 
-	num = ((struct syscall_metadata *)call->data)->syscall_nr;
-	if (WARN_ON_ONCE(num < 0 || num >= NR_syscalls))
-		return -ENOSYS;
 	mutex_lock(&syscall_trace_lock);
-	if (!sys_refcount_enter)
-		ret = register_trace_sys_enter(ftrace_syscall_enter, NULL);
-	if (!ret) {
-		set_bit(num, enabled_enter_syscalls);
+	if (sys_refcount_enter ||
+	    (ret = register_trace_sys_enter(ftrace_syscall_enter, NULL)) == 0)
 		sys_refcount_enter++;
-	}
 	mutex_unlock(&syscall_trace_lock);
 	return ret;
 }
 
 void unreg_event_syscall_enter(struct ftrace_event_call *call)
 {
-	int num;
-
-	num = ((struct syscall_metadata *)call->data)->syscall_nr;
-	if (WARN_ON_ONCE(num < 0 || num >= NR_syscalls))
-		return;
 	mutex_lock(&syscall_trace_lock);
 	sys_refcount_enter--;
-	clear_bit(num, enabled_enter_syscalls);
 	if (!sys_refcount_enter)
 		unregister_trace_sys_enter(ftrace_syscall_enter, NULL);
 	mutex_unlock(&syscall_trace_lock);
@@ -415,32 +402,19 @@ void unreg_event_syscall_enter(struct ftrace_event_call *call)
 int reg_event_syscall_exit(struct ftrace_event_call *call)
 {
 	int ret = 0;
-	int num;
 
-	num = ((struct syscall_metadata *)call->data)->syscall_nr;
-	if (WARN_ON_ONCE(num < 0 || num >= NR_syscalls))
-		return -ENOSYS;
 	mutex_lock(&syscall_trace_lock);
-	if (!sys_refcount_exit)
-		ret = register_trace_sys_exit(ftrace_syscall_exit, NULL);
-	if (!ret) {
-		set_bit(num, enabled_exit_syscalls);
+	if (sys_refcount_exit ||
+	    (ret = register_trace_sys_exit(ftrace_syscall_exit, NULL)) == 0)
 		sys_refcount_exit++;
-	}
 	mutex_unlock(&syscall_trace_lock);
 	return ret;
 }
 
 void unreg_event_syscall_exit(struct ftrace_event_call *call)
 {
-	int num;
-
-	num = ((struct syscall_metadata *)call->data)->syscall_nr;
-	if (WARN_ON_ONCE(num < 0 || num >= NR_syscalls))
-		return;
 	mutex_lock(&syscall_trace_lock);
 	sys_refcount_exit--;
-	clear_bit(num, enabled_exit_syscalls);
 	if (!sys_refcount_exit)
 		unregister_trace_sys_exit(ftrace_syscall_exit, NULL);
 	mutex_unlock(&syscall_trace_lock);
@@ -530,8 +504,6 @@ void trace_sys_exit_handler(struct pt_regs *regs, long ret)
 
 #ifdef CONFIG_PERF_EVENTS
 
-static DECLARE_BITMAP(enabled_perf_enter_syscalls, NR_syscalls);
-static DECLARE_BITMAP(enabled_perf_exit_syscalls, NR_syscalls);
 static int sys_perf_refcount_enter;
 static int sys_perf_refcount_exit;
 
@@ -545,13 +517,13 @@ static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
 	int size;
 
 	syscall_nr = syscall_get_nr(current, regs);
-	if (!test_bit(syscall_nr, enabled_perf_enter_syscalls))
-		return;
-
 	sys_data = syscall_nr_to_meta(syscall_nr);
 	if (!sys_data)
 		return;
 
+	if (sys_data->enter_event->perf_refcount < 1)
+		return;
+
 	/* get the size after alignment with the u32 buffer size field */
 	size = sizeof(unsigned long) * sys_data->nb_args + sizeof(*rec);
 	size = ALIGN(size + sizeof(u32), sizeof(u64));
@@ -577,33 +549,23 @@ static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
 int perf_sysenter_enable(struct ftrace_event_call *call)
 {
 	int ret = 0;
-	int num;
-
-	num = ((struct syscall_metadata *)call->data)->syscall_nr;
 
 	mutex_lock(&syscall_trace_lock);
-	if (!sys_perf_refcount_enter)
-		ret = register_trace_sys_enter(perf_syscall_enter, NULL);
+	if (sys_perf_refcount_enter ||
+	    (ret = register_trace_sys_enter(perf_syscall_enter, NULL)) == 0)
+		sys_perf_refcount_enter++;
+	mutex_unlock(&syscall_trace_lock);
 	if (ret) {
 		pr_info("event trace: Could not activate"
 				"syscall entry trace point");
-	} else {
-		set_bit(num, enabled_perf_enter_syscalls);
-		sys_perf_refcount_enter++;
 	}
-	mutex_unlock(&syscall_trace_lock);
 	return ret;
 }
 
 void perf_sysenter_disable(struct ftrace_event_call *call)
 {
-	int num;
-
-	num = ((struct syscall_metadata *)call->data)->syscall_nr;
-
 	mutex_lock(&syscall_trace_lock);
 	sys_perf_refcount_enter--;
-	clear_bit(num, enabled_perf_enter_syscalls);
 	if (!sys_perf_refcount_enter)
 		unregister_trace_sys_enter(perf_syscall_enter, NULL);
 	mutex_unlock(&syscall_trace_lock);
@@ -619,13 +581,13 @@ static void perf_syscall_exit(void *ignore, struct pt_regs *regs, long ret)
 	int size;
 
 	syscall_nr = syscall_get_nr(current, regs);
-	if (!test_bit(syscall_nr, enabled_perf_exit_syscalls))
-		return;
-
 	sys_data = syscall_nr_to_meta(syscall_nr);
 	if (!sys_data)
 		return;
 
+	if (sys_data->exit_event->perf_refcount < 1)
+		return;
+
 	/* We can probably do that at build time */
 	size = ALIGN(sizeof(*rec) + sizeof(u32), sizeof(u64));
 	size -= sizeof(u32);
@@ -653,33 +615,23 @@ static void perf_syscall_exit(void *ignore, struct pt_regs *regs, long ret)
 int perf_sysexit_enable(struct ftrace_event_call *call)
 {
 	int ret = 0;
-	int num;
-
-	num = ((struct syscall_metadata *)call->data)->syscall_nr;
 
 	mutex_lock(&syscall_trace_lock);
-	if (!sys_perf_refcount_exit)
-		ret = register_trace_sys_exit(perf_syscall_exit, NULL);
+	if (sys_perf_refcount_exit ||
+	    (ret = register_trace_sys_exit(perf_syscall_exit, NULL)) == 0)
+		sys_perf_refcount_exit++;
+	mutex_unlock(&syscall_trace_lock);
 	if (ret) {
 		pr_info("event trace: Could not activate"
 				"syscall exit trace point");
-	} else {
-		set_bit(num, enabled_perf_exit_syscalls);
-		sys_perf_refcount_exit++;
 	}
-	mutex_unlock(&syscall_trace_lock);
 	return ret;
 }
 
 void perf_sysexit_disable(struct ftrace_event_call *call)
 {
-	int num;
-
-	num = ((struct syscall_metadata *)call->data)->syscall_nr;
-
 	mutex_lock(&syscall_trace_lock);
 	sys_perf_refcount_exit--;
-	clear_bit(num, enabled_perf_exit_syscalls);
 	if (!sys_perf_refcount_exit)
 		unregister_trace_sys_exit(perf_syscall_exit, NULL);
 	mutex_unlock(&syscall_trace_lock);
-- 
1.7.7.3
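
The register/unregister functions above all share one pattern: attach the probe on the first enable, count every successful enable, and detach on the last disable. Below is a minimal userspace sketch of that pattern. register_probe(), unregister_probe() and fail_next_register are hypothetical stand-ins for the tracepoint registration calls, and the syscall_trace_lock mutex is left out for brevity.

```c
#include <assert.h>

/* Hypothetical stand-ins for register_trace_sys_enter() and friends. */
static int probe_registered;
static int fail_next_register;

static int register_probe(void)
{
	if (fail_next_register) {
		fail_next_register = 0;
		return -1;
	}
	probe_registered = 1;
	return 0;
}

static void unregister_probe(void)
{
	probe_registered = 0;
}

static int refcount;

/* Same shape as reg_event_syscall_enter() after the patch. */
static int event_enable(void)
{
	int ret = 0;

	/*
	 * Register only on the 0 -> 1 transition; bump the count if the
	 * probe was already attached or the registration just succeeded.
	 */
	if (refcount || (ret = register_probe()) == 0)
		refcount++;
	return ret;
}

/* Same shape as unreg_event_syscall_enter() after the patch. */
static void event_disable(void)
{
	assert(refcount > 0);
	refcount--;
	if (!refcount)
		unregister_probe();
}
```

Because of the short-circuit, a failed registration leaves the count at zero and returns the error, which is what the longer if/else version did before the bitmaps were removed.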



* Re: [PATCH 2/6] trace: add support for 32 bit compat syscalls on x86_64
  2012-03-26 18:39 ` [PATCH 2/6] trace: add support for 32 bit compat syscalls on x86_64 Vaibhav Nagarnaik
@ 2012-03-27  4:49   ` H. Peter Anvin
  2012-03-28 21:10     ` Vaibhav Nagarnaik
  0 siblings, 1 reply; 29+ messages in thread
From: H. Peter Anvin @ 2012-03-27  4:49 UTC (permalink / raw)
  To: Vaibhav Nagarnaik
  Cc: Steven Rostedt, Frederic Weisbecker, Thomas Gleixner,
	Ingo Molnar, David Sharp, Justin Teravest, Laurent Chavey, x86,
	linux-kernel, Michael Davidson

On 03/26/2012 11:39 AM, Vaibhav Nagarnaik wrote:
> +/*
> + * syscall metadata for 32 bit compatible system calls
> + *
> + * The metadata entries are in the same order as the system call table
> + * but this is just to make it easier to check them for completeness
> + * and correctness.
> + */
> +
> +COMPAT_SYSCALL_METADATA0(restart_syscall)
> +COMPAT_SYSCALL_METADATA1(exit, int, error_code)
   [...]
> +COMPAT_SYSCALL_METADATA5(recvmmsg, int, fd, struct mmsghdr __user *, mmsg, unsigned int, vlen, unsigned int, flags, struct timespec __user *, timeout)

> +/*
> + * This is truly horrible.

Yes, it is.  How on Earth do you expect the above to ever be maintained?

> + *
> + * There is no header file that defines a *complete* set of 32 bit system
> + * call numbers (unistd_32.h only defines ones that are currently exported
> + * to user space and omits lots of old system calls that are still implemented
> + * by the kernel.

> + * There is also no table that can be used to map a system call number into
> + * the canonical name of that system call.

arch/x86/syscalls has all of those.  If it's not in there, it doesn't
exist, because that's where the system call table comes from.

	-hpa


* Re: [PATCH 4/6] trace: trace syscall in its handler not from ptrace handler
  2012-03-26 18:39 ` [PATCH 4/6] trace: trace syscall in its handler not from ptrace handler Vaibhav Nagarnaik
@ 2012-03-27  5:00   ` H. Peter Anvin
  2012-03-28 18:23     ` Vaibhav Nagarnaik
  0 siblings, 1 reply; 29+ messages in thread
From: H. Peter Anvin @ 2012-03-27  5:00 UTC (permalink / raw)
  To: Vaibhav Nagarnaik
  Cc: Steven Rostedt, Frederic Weisbecker, Thomas Gleixner,
	Ingo Molnar, David Sharp, Justin Teravest, Laurent Chavey, x86,
	linux-kernel

On 03/26/2012 11:39 AM, Vaibhav Nagarnaik wrote:
> The syscalls are a tricky bunch to trace, because of their multitude and
> the dynamic nature of the list. In order to solve this, a macro handled the
> syscall handler definition and it was expanded into setting up the
> metadata for the syscall event. A handler hooked into the ptrace syscall
> tracer to check whether an invoked syscall was supposed to be traced.
> 
> This added latency to all the invoked syscalls, since they had to be
> checked for tracing, and also affected the latency of a syscall that was
> actually getting traced. For example, a simple program which invokes
> getuid() in a repeated loop and calculates the average time per syscall
> invocation found a latency of 570 - 117 = 453 ns added to every traced
> syscall.
> 
> This patch changes the syscall macro expansion, to create a function
> that adds the entry and exit tracepoints for the given syscall so that
> the latency can be avoided. This was suggested by Mathieu Desnoyers in
> https://lkml.org/lkml/2010/10/13/337
> 
> After this patch, the latency added is 370 - 117 = 253 ns per invocation
> of a traced syscall. This is on par with a simple tracepoint added to
> any kernel code path.
> 
> This patch also makes syscall tracing architecture independent as there
> is no need to have a hook into the architecture specific syscall tracer
> functions.
> 

I am officially confused here.  You have a single, common, dispatch
point for all system calls -- why don't you use it?  That is of course
the system call table.  If you want to trace a system call, override the
entry point in the syscall table to point to a hook function which can
provide entry and exit hooks.  It's not even code, it's data, so you
don't even have to play the code patching song and dance routine
(although you may have to map it read/write which is normally not the
case for security reasons.)

The best part is that the cost for an untraced system call is *zero*.

	-hpa
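
The table-override idea described above can be illustrated in userspace. This is only a sketch under simplifying assumptions: the table, the demo handlers, trace_on()/trace_off() and the current_nr variable (standing in for the syscall number a real hook would read from pt_regs) are all hypothetical, and nothing here deals with the read-only mapping or concurrency concerns a kernel implementation would face.

```c
#include <assert.h>
#include <stddef.h>

typedef long (*syscall_fn)(long a0);

#define NR_DEMO 2

/* Hypothetical handlers standing in for real system calls. */
static long demo_getuid(long a0) { (void)a0; return 1000; }
static long demo_double(long a0) { return 2 * a0; }

/* The dispatch table is plain data: an array of function pointers. */
static syscall_fn table[NR_DEMO] = { demo_getuid, demo_double };
static syscall_fn saved[NR_DEMO];

static int current_nr;          /* stands in for the number held in pt_regs */
static int enter_hits, exit_hits;

/* Hook installed in place of a traced entry. */
static long traced_hook(long a0)
{
	long ret;

	enter_hits++;                   /* entry hook */
	ret = saved[current_nr](a0);    /* call the real handler */
	exit_hits++;                    /* exit hook */
	return ret;
}

static void trace_on(int nr)
{
	if (!saved[nr]) {
		saved[nr] = table[nr];
		table[nr] = traced_hook;
	}
}

static void trace_off(int nr)
{
	if (saved[nr]) {
		table[nr] = saved[nr];
		saved[nr] = NULL;
	}
}

/* Untraced entries are called directly, with no extra check. */
static long dispatch(int nr, long a0)
{
	current_nr = nr;
	return table[nr](a0);
}
```

The point of the design is visible in dispatch(): it contains no "is tracing enabled" test at all, so an untraced entry pays nothing beyond the indirect call it already makes.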



* Re: [PATCH 4/6] trace: trace syscall in its handler not from ptrace handler
  2012-03-27  5:00   ` H. Peter Anvin
@ 2012-03-28 18:23     ` Vaibhav Nagarnaik
  2012-03-29  2:43       ` H. Peter Anvin
  0 siblings, 1 reply; 29+ messages in thread
From: Vaibhav Nagarnaik @ 2012-03-28 18:23 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Steven Rostedt, Frederic Weisbecker, Thomas Gleixner,
	Ingo Molnar, David Sharp, Justin Teravest, Laurent Chavey, x86,
	linux-kernel

On Mon, Mar 26, 2012 at 10:00 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 03/26/2012 11:39 AM, Vaibhav Nagarnaik wrote:
>> The syscalls are a tricky bunch to trace, because of their multitude and
>> the dynamic nature of the list. In order to solve this, a macro handled the
>> syscall handler definition and it was expanded into setting up the
>> metadata for the syscall event. A handler hooked into the ptrace syscall
>> tracer to check whether an invoked syscall was supposed to be traced.
>>
>> This added latency to all the invoked syscalls, since they had to be
>> checked for tracing, and also affected the latency of a syscall that was
>> actually getting traced. For example, a simple program which invokes
>> getuid() in a repeated loop and calculates the average time per syscall
>> invocation found a latency of 570 - 117 = 453 ns added to every traced
>> syscall.
>>
>> This patch changes the syscall macro expansion, to create a function
>> that adds the entry and exit tracepoints for the given syscall so that
>> the latency can be avoided. This was suggested by Mathieu Desnoyers in
>> https://lkml.org/lkml/2010/10/13/337
>>
>> After this patch, the latency added is 370 - 117 = 253 ns per invocation
>> of a traced syscall. This is on par with a simple tracepoint added to
>> any kernel code path.
>>
>> This patch also makes syscall tracing architecture independent as there
>> is no need to have a hook into the architecture specific syscall tracer
>> functions.
>>
>
> I am officially confused here.  You have a single, common, dispatch
> point for all system calls -- why don't you use it?  That is of course
> the system call table.  If you want to trace a system call, override the
> entry point in the syscall table to point to a hook function which can
> provide entry and exit hooks.  It's not even code, it's data, so you
> don't even have to play the code patching song and dance routine
> (although you may have to map it read/write which is normally not the
> case for security reasons.)

I am sorry, but I don't see how that would be possible without some
sort of architecture-dependent change. Also, as you mentioned, it will
have some security considerations.

If you can suggest a better way without going through this macro
magic, I will be glad to implement it. The two main reasons I made this
patch were to remove the added latency in syscall tracing and to remove
the penalty for syscalls that are not traced.



Thanks

Vaibhav Nagarnaik


* Re: [PATCH 2/6] trace: add support for 32 bit compat syscalls on x86_64
  2012-03-27  4:49   ` H. Peter Anvin
@ 2012-03-28 21:10     ` Vaibhav Nagarnaik
  2012-03-28 21:11       ` Vaibhav Nagarnaik
  0 siblings, 1 reply; 29+ messages in thread
From: Vaibhav Nagarnaik @ 2012-03-28 21:10 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Steven Rostedt, Frederic Weisbecker, Thomas Gleixner,
	Ingo Molnar, David Sharp, Justin Teravest, Laurent Chavey, x86,
	linux-kernel, Michael Davidson

On Mon, Mar 26, 2012 at 9:49 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 03/26/2012 11:39 AM, Vaibhav Nagarnaik wrote:
>> +/*
>> + * syscall metadata for 32 bit compatible system calls
>> + *
>> + * The metadata entries are in the same order as the system call table
>> + * but this is just to make it easier to check them for completeness
>> + * and correctness.
>> + */
>> +
>> +COMPAT_SYSCALL_METADATA0(restart_syscall)
>> +COMPAT_SYSCALL_METADATA1(exit, int, error_code)
>   [...]
>> +COMPAT_SYSCALL_METADATA5(recvmmsg, int, fd, struct mmsghdr __user *, mmsg, unsigned int, vlen, unsigned int, flags, struct timespec __user *, timeout)
>
>> +/*
>> + * This is truly horrible.
>
> Yes, it is.  How on Earth do you expect the above to ever be maintained?

You are right. I found that I can just reuse the SYSCALL_DEFINEx macro
to define the compat syscall metadata. I am sending the new patch; can
you take another look?


Thanks


Vaibhav Nagarnaik


* [PATCH 2/6] trace: add support for 32 bit compat syscalls on x86_64
  2012-03-28 21:10     ` Vaibhav Nagarnaik
@ 2012-03-28 21:11       ` Vaibhav Nagarnaik
  2012-03-28 23:00         ` Vaibhav Nagarnaik
  0 siblings, 1 reply; 29+ messages in thread
From: Vaibhav Nagarnaik @ 2012-03-28 21:11 UTC (permalink / raw)
  To: Steven Rostedt, Frederic Weisbecker, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin
  Cc: David Sharp, Justin Teravest, Laurent Chavey, x86, linux-kernel,
	Michael Davidson, Vaibhav Nagarnaik

From: Michael Davidson <md@google.com>

Add support for a set of events to trace 32 bit compat system calls
in addition to the native 64 bit system calls.

Events for compat system calls have event names of the form:
  syscalls:sys_enter_compat_<name>
  syscalls:sys_exit_compat_<name>

The ascii formatted version of trace events that can be read from
the tracing/trace file reports compat system calls as:
  compat_<name>(...)

- add CONFIG_FTRACE_COMPAT_SYSCALLS

- add a "compat" flag to the syscall_metadata struct so that we can
  distinguish between "native" and "compat" metadata at init time
  when building the system call # to metadata mapping tables

- add a COMPAT_SYSCALL_METADATAx() macro to define system call
  metadata for compat system calls

- modify syscall_nr_to_meta() to know about compat system calls
  and return a pointer to the correct metadata

- modify print_syscall_{enter|exit}() to find the system call metadata
  by looking in the containing ftrace_event_call struct

Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
---
Changelog:
* Remove unmaintainable list of syscalls and use SYSCALL_DEFINEx macro
  to define the metadata for equivalent compat syscall

 arch/x86/ia32/Makefile                |    2 +
 arch/x86/ia32/ia32_syscall_metadata.c |   84 +++++++++++++++++++++++++++++++++
 include/linux/syscalls.h              |   17 ++++++-
 include/trace/syscall.h               |   23 ++++++++-
 kernel/trace/Kconfig                  |    6 ++
 kernel/trace/trace_syscalls.c         |   20 ++++++--
 6 files changed, 142 insertions(+), 10 deletions(-)
 create mode 100644 arch/x86/ia32/ia32_syscall_metadata.c

diff --git a/arch/x86/ia32/Makefile b/arch/x86/ia32/Makefile
index 455646e..ba6d3c8 100644
--- a/arch/x86/ia32/Makefile
+++ b/arch/x86/ia32/Makefile
@@ -12,3 +12,5 @@ obj-$(CONFIG_IA32_AOUT) += ia32_aout.o
 
 audit-class-$(CONFIG_AUDIT) := audit.o
 obj-$(CONFIG_IA32_EMULATION) += $(audit-class-y)
+
+obj-$(CONFIG_FTRACE_COMPAT_SYSCALLS) += ia32_syscall_metadata.o
diff --git a/arch/x86/ia32/ia32_syscall_metadata.c b/arch/x86/ia32/ia32_syscall_metadata.c
new file mode 100644
index 0000000..f3f554a
--- /dev/null
+++ b/arch/x86/ia32/ia32_syscall_metadata.c
@@ -0,0 +1,84 @@
+#include <linux/slab.h>
+#include <linux/syscalls.h>
+#include <linux/module.h>
+#include <asm/asm-offsets.h>
+
+extern long ia32_sys_call_table[];
+
+int nr_compat_syscalls;
+struct syscall_metadata **compat_syscalls_metadata;
+
+static const char *prefixes[] = { "sys32", "stub32", "compat_sys",
+					"sys", NULL };
+
+/*
+ * For each entry in the 32 bit system call table:
+ *	Look up the address in the kernel symbol table
+ *	Strip off any "sys32|stub32|sys|compat" prefix
+ *	Search through all of the compat metadata entries for a matching name
+ */
+static struct syscall_metadata __init *
+find_compat_syscall_meta(unsigned long addr)
+{
+	struct syscall_metadata **start;
+	struct syscall_metadata **stop;
+	char str[KSYM_SYMBOL_LEN];
+	const char *name;
+	const char **p;
+	extern struct syscall_metadata *__start_syscalls_metadata[];
+	extern struct syscall_metadata *__stop_syscalls_metadata[];
+
+	start = __start_syscalls_metadata;
+	stop = __stop_syscalls_metadata;
+	kallsyms_lookup(addr, NULL, NULL, NULL, str);
+
+	/*
+	 * If there is a {sys|compat|sys32|stub32} prefix strip it off
+	 */
+	for (p = prefixes, name = str; *p; p++) {
+		int len = strlen(*p);
+		if (strncmp(name, *p, len) == 0) {
+			name += len;
+			break;
+		}
+	}
+
+	for ( ; start < stop; start++) {
+		if (!(*start)->compat)
+			continue;
+
+		/*
+		 * ignore the "compat_" prefix on the metadata name
+		 * when doing the comparison
+		 */
+		if ((*start)->name && !strcmp((*start)->name + 6, name))
+			return *start;
+	}
+	return NULL;
+}
+
+static int __init init_compat_syscall_metadata(void)
+{
+	struct syscall_metadata *meta;
+	int	i;
+
+	nr_compat_syscalls = __NR_ia32_syscall_max;
+	compat_syscalls_metadata = kzalloc(sizeof(*compat_syscalls_metadata) *
+					nr_compat_syscalls, GFP_KERNEL);
+	if (!compat_syscalls_metadata) {
+		nr_compat_syscalls = -1;
+		WARN_ON(1);
+		return -ENOMEM;
+	}
+
+	for (i = 0; i < nr_compat_syscalls; i++) {
+		if ((meta = find_compat_syscall_meta(ia32_sys_call_table[i]))) {
+			meta->syscall_nr = i;
+			compat_syscalls_metadata[i] = meta;
+		}
+	}
+
+	return 0;
+}
+
+core_initcall(init_compat_syscall_metadata);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index ed0003c..f2e4106 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -159,7 +159,7 @@ extern struct trace_event_functions exit_syscall_print_funcs;
 	  __attribute__((section("_ftrace_events")))			\
 	*__event_exit_##sname = &event_exit_##sname;
 
-#define SYSCALL_METADATAx(x, sname, ...)			\
+#define _SYSCALL_METADATAx(x, mname, sname, _compat, ...)	\
 	static const char *types_##sname[] = {			\
 		__SC_STR_TDECL##x(__VA_ARGS__)			\
 	};							\
@@ -170,11 +170,12 @@ extern struct trace_event_functions exit_syscall_print_funcs;
 	SYSCALL_TRACE_EXIT_EVENT(sname);			\
 	static struct syscall_metadata __used			\
 	  __syscall_meta_##sname = {				\
-		.name 		= "sys"#sname,			\
+		.name 		= mname,			\
 		.syscall_nr	= -1,	/* Filled in at boot */	\
 		.nb_args 	= x,				\
 		.types		= types_##sname,		\
 		.args		= args_##sname,			\
+		.compat		= _compat,			\
 		.enter_event	= &event_enter_##sname,		\
 		.exit_event	= &event_exit_##sname,		\
 		.enter_fields	= LIST_HEAD_INIT(__syscall_meta_##sname.enter_fields), \
@@ -182,10 +183,21 @@ extern struct trace_event_functions exit_syscall_print_funcs;
 	static struct syscall_metadata __used			\
 	  __attribute__((section("__syscalls_metadata")))	\
 	 *__p_syscall_meta_##sname = &__syscall_meta_##sname;
+
+#define SYSCALL_METADATAx(x, sname, ...)		\
+	_SYSCALL_METADATAx(x, "sys"#sname, sname, 0, __VA_ARGS__)
+
 #else
 #define SYSCALL_METADATAx(x, name, ...)
 #endif /* CONFIG_FTRACE_SYSCALLS */
 
+#if defined(CONFIG_FTRACE_COMPAT_SYSCALLS)
+#define COMPAT_SYSCALL_METADATAx(x, sname, ...)			\
+	_SYSCALL_METADATAx(x, "compat_"#sname, compat_##sname, 1, __VA_ARGS__)
+#else
+#define COMPAT_SYSCALL_METADATAx(x, sname, ...)
+#endif /* CONFIG_FTRACE_COMPAT_SYSCALLS */
+
 #define SYSCALL_DEFINE0(name, ...) SYSCALL_DEFINEx(0, _##name, __VA_ARGS__)
 #define SYSCALL_DEFINE1(name, ...) SYSCALL_DEFINEx(1, _##name, __VA_ARGS__)
 #define SYSCALL_DEFINE2(name, ...) SYSCALL_DEFINEx(2, _##name, __VA_ARGS__)
@@ -210,6 +222,7 @@ extern struct trace_event_functions exit_syscall_print_funcs;
 
 #define SYSCALL_DEFINEx(x, sname, ...)				\
 	SYSCALL_METADATAx(x, sname, __VA_ARGS__)			\
+	COMPAT_SYSCALL_METADATAx(x, sname, __VA_ARGS__)			\
 	__SYSCALL_DEFINEx(x, sname, __VA_ARGS__)
 
 #ifdef CONFIG_HAVE_SYSCALL_WRAPPERS
diff --git a/include/trace/syscall.h b/include/trace/syscall.h
index 31966a4..29169e6 100644
--- a/include/trace/syscall.h
+++ b/include/trace/syscall.h
@@ -21,8 +21,9 @@
  */
 struct syscall_metadata {
 	const char	*name;
-	int		syscall_nr;
-	int		nb_args;
+	u16		syscall_nr;
+	u8		nb_args;
+	u8		compat;
 	const char	**types;
 	const char	**args;
 	struct list_head enter_fields;
@@ -32,6 +33,24 @@ struct syscall_metadata {
 };
 
 #ifdef CONFIG_FTRACE_SYSCALLS
+
+#ifdef CONFIG_FTRACE_COMPAT_SYSCALLS
+
+extern int nr_compat_syscalls;
+extern struct syscall_metadata **compat_syscalls_metadata;
+
+static inline struct syscall_metadata *compat_syscall_nr_to_meta(int nr)
+{
+	return (nr < nr_compat_syscalls) ? compat_syscalls_metadata[nr]
+						  : NULL;
+}
+#else
+static inline struct syscall_metadata *compat_syscall_nr_to_meta(int nr)
+{
+	return NULL;
+}
+#endif /* CONFIG_FTRACE_COMPAT_SYSCALLS */
+
 extern unsigned long arch_syscall_addr(int nr);
 extern int init_syscall_trace(struct ftrace_event_call *call);
 
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index cd31345..cd0954b 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -240,6 +240,12 @@ config FTRACE_SYSCALLS
 	help
 	  Basic tracer to catch the syscall entry and exit events.
 
+config FTRACE_COMPAT_SYSCALLS
+	bool "Trace 32 bit compat syscalls"
+	depends on FTRACE_SYSCALLS
+	help
+	  Trace syscall entry and exit events for 32 bit compat syscalls.
+
 config TRACE_BRANCH_PROFILING
 	bool
 	select GENERIC_TRACER
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index cb65454..8bc89c5 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -5,6 +5,7 @@
 #include <linux/module.h>	/* for MODULE_NAME_LEN via KSYM_SYMBOL_LEN */
 #include <linux/ftrace.h>
 #include <linux/perf_event.h>
+#include <linux/compat.h>
 #include <asm/syscall.h>
 
 #include "trace_output.h"
@@ -90,6 +91,8 @@ find_syscall_meta(unsigned long syscall)
 		return NULL;
 
 	for ( ; start < stop; start++) {
+		if ((*start)->compat)	/* skip compat syscalls */
+			continue;
 		if ((*start)->name && arch_syscall_match_sym_name(str, (*start)->name))
 			return *start;
 	}
@@ -98,6 +101,8 @@ find_syscall_meta(unsigned long syscall)
 
 static struct syscall_metadata *syscall_nr_to_meta(int nr)
 {
+	if (is_compat_task())
+		return compat_syscall_nr_to_meta(nr);
 	if (!syscalls_metadata || nr >= NR_syscalls || nr < 0)
 		return NULL;
 
@@ -112,11 +117,13 @@ print_syscall_enter(struct trace_iterator *iter, int flags,
 	struct trace_entry *ent = iter->ent;
 	struct syscall_trace_enter *trace;
 	struct syscall_metadata *entry;
-	int i, ret, syscall;
+	int i, ret;
+	struct ftrace_event_call *call;
 
 	trace = (typeof(trace))ent;
-	syscall = trace->nr;
-	entry = syscall_nr_to_meta(syscall);
+	event = ftrace_find_event(ent->type);
+	call = container_of(event, struct ftrace_event_call, event);
+	entry = call->data;
 
 	if (!entry)
 		goto end;
@@ -164,13 +171,14 @@ print_syscall_exit(struct trace_iterator *iter, int flags,
 	struct trace_seq *s = &iter->seq;
 	struct trace_entry *ent = iter->ent;
 	struct syscall_trace_exit *trace;
-	int syscall;
 	struct syscall_metadata *entry;
 	int ret;
+	struct ftrace_event_call *call;
 
 	trace = (typeof(trace))ent;
-	syscall = trace->nr;
-	entry = syscall_nr_to_meta(syscall);
+	event = ftrace_find_event(ent->type);
+	call = container_of(event, struct ftrace_event_call, event);
+	entry = call->data;
 
 	if (!entry) {
 		trace_seq_printf(s, "\n");
-- 
1.7.7.3



* Re: [PATCH 2/6] trace: add support for 32 bit compat syscalls on x86_64
  2012-03-28 21:11       ` Vaibhav Nagarnaik
@ 2012-03-28 23:00         ` Vaibhav Nagarnaik
  0 siblings, 0 replies; 29+ messages in thread
From: Vaibhav Nagarnaik @ 2012-03-28 23:00 UTC (permalink / raw)
  To: Steven Rostedt, Frederic Weisbecker, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin
  Cc: David Sharp, Justin Teravest, Laurent Chavey, x86, linux-kernel,
	Michael Davidson, Vaibhav Nagarnaik

On Wed, Mar 28, 2012 at 2:11 PM, Vaibhav Nagarnaik
<vnagarnaik@google.com> wrote:
> Changelog:
> * Remove unmaintainable list of syscalls and use SYSCALL_DEFINEx macro
>  to define the metadata for equivalent compat syscall

This simplifies the patch significantly, but there are problems with
this approach.
* This doesn't trace compat syscalls which don't call the 64-bit
handler (e.g. sys32_stat64). They need a COMPAT_SYSCALL_DEFINEx wrapper
macro where they are defined. I am planning to add them for x86.

* This will generate useless metadata for these syscalls. For example, it
will have metadata for compat_sys_lseek, which does not generate a
traceable event. Instead, there will be an event for sys32_lseek when
I add the corresponding metadata wrapper.

(BTW, I just found that I need a change to the check in
find_compat_syscall_meta(). Basically remove the prefixes "sys32_" and
"stub32_")



Vaibhav Nagarnaik
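
For reference, the prefix check mentioned above could look roughly like the following userspace sketch. It is one reading of the note, not the actual follow-up change: the exact prefix strings (with trailing underscores) and the name strip_syscall_prefix() are assumptions. Prefixes are listed most specific first so that "sys32_" is tried before "sys_".

```c
#include <assert.h>
#include <string.h>

/* Assumed prefix list, most specific entries first. */
static const char *const prefixes[] = {
	"compat_sys_", "stub32_", "sys32_", "sys_", NULL
};

/*
 * Return the symbol name with the first matching prefix removed,
 * or the unchanged name if no prefix matches.
 */
static const char *strip_syscall_prefix(const char *sym)
{
	const char *const *p;

	for (p = prefixes; *p; p++) {
		size_t len = strlen(*p);

		if (strncmp(sym, *p, len) == 0)
			return sym + len;
	}
	return sym;
}
```

With this list, sys32_lseek, compat_sys_lseek and sys_lseek all reduce to the same bare name, which is what a name comparison against the metadata would need.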


* Re: [PATCH 4/6] trace: trace syscall in its handler not from ptrace handler
  2012-03-28 18:23     ` Vaibhav Nagarnaik
@ 2012-03-29  2:43       ` H. Peter Anvin
  2012-03-29  2:59         ` Steven Rostedt
  2012-03-29  3:02         ` Vaibhav Nagarnaik
  0 siblings, 2 replies; 29+ messages in thread
From: H. Peter Anvin @ 2012-03-29  2:43 UTC (permalink / raw)
  To: Vaibhav Nagarnaik
  Cc: Steven Rostedt, Frederic Weisbecker, Thomas Gleixner,
	Ingo Molnar, David Sharp, Justin Teravest, Laurent Chavey, x86,
	linux-kernel

On 03/28/2012 11:23 AM, Vaibhav Nagarnaik wrote:
> 
> I am sorry I don't see how that would be possible without having some
> sort of architecture dependent changes.

Tough, that's sometimes the way it goes.  On most architectures it's
just a simple table.

> Also as you mentioned, it will have some security considerations.

Not any more than your little scheme.

> If you can suggest a better way without going through this macro
> magic, I will be glad to implement it. The 2 main reasons I made this
> patch was to remove the added latency in syscall tracing and to remove
> penalty for syscalls that are not traced.

But instead you add a penalty for every syscall, even if tracing is
disabled.  Not cool.

> If you can suggest a better way without going through this macro
> magic, I will be glad to implement it.

The more I look at this stuff the more I think it is not just crazy, but
batsh*t crazy... we produce *how* much "metadata" which is stored in
non-pageable kernel memory, and all it seems to be *actually* doing is
store a variable number of parameters in a buffer.

This is insane.  Not just a little insane, but utterly bonkers.

The syscall interface is the single most stable interface in the kernel.
 Just plunk down the system call number and the six arguments in the
buffer, and be done with it.  On the way out, there is a single return
argument, *by design*.  No need to burden the kernel in this way! That
this information can be perfectly well decoded in userspace is already
shown by strace, although it would be highly beneficial if the kernel
build could export information to strace and other tools.  There is
absolutely no need for it to live in kernel memory, though.

	-hpa


* Re: [PATCH 4/6] trace: trace syscall in its handler not from ptrace handler
  2012-03-29  2:43       ` H. Peter Anvin
@ 2012-03-29  2:59         ` Steven Rostedt
  2012-03-29  3:15           ` H. Peter Anvin
  2012-03-29  3:02         ` Vaibhav Nagarnaik
  1 sibling, 1 reply; 29+ messages in thread
From: Steven Rostedt @ 2012-03-29  2:59 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Vaibhav Nagarnaik, Frederic Weisbecker, Thomas Gleixner,
	Ingo Molnar, David Sharp, Justin Teravest, Laurent Chavey, x86,
	linux-kernel

On Wed, 2012-03-28 at 19:43 -0700, H. Peter Anvin wrote:

> The syscall interface is the single most stable interface in the kernel.
>  Just plunk down the system call number and the six arguments in the
> buffer, and be done with it.  On the way out, there is a single return
> argument, *by design*.  No need to burden the kernel in this way! That
> this information can be perfectly well decoded in userspace is already
> shown by strace, although it would be highly beneficial if the kernel
> build could export information to strace and other tools.  There is
> absolutely no need for it to live in kernel memory, though.

Even if it has to live in kernel memory (which it does now, and I'm not
sure we can change that because of the *don't break existing tools* law),
we should at least be able to compress it so that it doesn't waste as
much memory.

-- Steve




* Re: [PATCH 4/6] trace: trace syscall in its handler not from ptrace handler
  2012-03-29  2:43       ` H. Peter Anvin
  2012-03-29  2:59         ` Steven Rostedt
@ 2012-03-29  3:02         ` Vaibhav Nagarnaik
  2012-03-29  3:16           ` H. Peter Anvin
  2012-03-29  6:20           ` Ingo Molnar
  1 sibling, 2 replies; 29+ messages in thread
From: Vaibhav Nagarnaik @ 2012-03-29  3:02 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Steven Rostedt, Frederic Weisbecker, Thomas Gleixner,
	Ingo Molnar, David Sharp, Justin Teravest, Laurent Chavey, x86,
	linux-kernel

On Wed, Mar 28, 2012 at 7:43 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> But instead you add a penalty for every syscall, even if tracing is
> disabled.  Not cool.

I just ran a small test binary which calls syscall(SYS_getuid) in a
tight loop and calculates the latency per syscall.

Without my patch: it is 70 ns/call
With my patch: it is 83 ns/call

So yes, it does add a bit of latency to the syscall even if tracing is
disabled. I wonder if I can change the redirection function so that it
doesn't add so much latency.

But if it doesn't seem to help, then I will not push this patch.
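
For reference, a minimal sketch of the kind of test described above. The
actual test binary was not posted, so the helper name, the iteration
count, and the choice of CLOCK_MONOTONIC are all assumptions:

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the original test binary): returns the
 * average wall-clock cost of one getuid syscall in nanoseconds.
 */
static double getuid_ns_per_call(long iterations)
{
	struct timespec start, end;
	long i;

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < iterations; i++)
		syscall(SYS_getuid);	/* raw syscall, no libc shortcut */
	clock_gettime(CLOCK_MONOTONIC, &end);

	return ((end.tv_sec - start.tv_sec) * 1e9 +
		(end.tv_nsec - start.tv_nsec)) / iterations;
}
```

Calling getuid_ns_per_call() with a few million iterations on kernels
with and without the patch would give figures comparable to the ones
quoted, though the absolute values depend on the machine.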



Vaibhav Nagarnaik


* Re: [PATCH 4/6] trace: trace syscall in its handler not from ptrace handler
  2012-03-29  2:59         ` Steven Rostedt
@ 2012-03-29  3:15           ` H. Peter Anvin
  0 siblings, 0 replies; 29+ messages in thread
From: H. Peter Anvin @ 2012-03-29  3:15 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Vaibhav Nagarnaik, Frederic Weisbecker, Thomas Gleixner,
	Ingo Molnar, David Sharp, Justin Teravest, Laurent Chavey, x86,
	linux-kernel

On 03/28/2012 07:59 PM, Steven Rostedt wrote:
> On Wed, 2012-03-28 at 19:43 -0700, H. Peter Anvin wrote:
> 
>> The syscall interface is the single most stable interface in the kernel.
>>  Just plunk down the system call number and the six arguments in the
>> buffer, and be done with it.  On the way out, there is a single return
>> argument, *by design*.  No need to burden the kernel in this way! That
>> this information can be perfectly well decoded in userspace is already
>> shown by strace, although it would be highly beneficial if the kernel
>> build could export information to strace and other tools.  There is
>> absolutely no need for it to live in kernel memory, though.
> 
> Even if it has to live in kernel memory (which it does now, and I'm not
> sure we can change that because of the *don't break existing tools* law),
> we should at least be able to compress it so that it doesn't waste as
> much memory.
> 

This whole facility is the logical equivalent of doing binary-to-ascii
conversion with a switch statement:

switch (foo)
{
	case 0:
		printf("0");
		break;

	case 1:
		printf("1");
		break;

	case 2:
		printf("2");
		break;

	/* ... */
}

We see that kind of code on The Daily WTF all the time, but it has no
excuse being seen anywhere close to the Linux kernel.

Furthermore, if we can't even fix grotesque brokenness like this in
*debugging tools*, then we might as well go home, as there is absolutely
no hope to ever make forward progress.  This is worse than "let's pick
up a bunch of random kernel internals and make them stable ABIs" Xen.

	-hpa



* Re: [PATCH 4/6] trace: trace syscall in its handler not from ptrace handler
  2012-03-29  3:02         ` Vaibhav Nagarnaik
@ 2012-03-29  3:16           ` H. Peter Anvin
  2012-03-29  6:20           ` Ingo Molnar
  1 sibling, 0 replies; 29+ messages in thread
From: H. Peter Anvin @ 2012-03-29  3:16 UTC (permalink / raw)
  To: Vaibhav Nagarnaik
  Cc: Steven Rostedt, Frederic Weisbecker, Thomas Gleixner,
	Ingo Molnar, David Sharp, Justin Teravest, Laurent Chavey, x86,
	linux-kernel

On 03/28/2012 08:02 PM, Vaibhav Nagarnaik wrote:
> 
> But if it doesn't seem to help, then I will not push this patch.
> 

Consider this patch and anything even remotely like it NAKed with
extreme prejudice.

	-hpa



* Re: [PATCH 4/6] trace: trace syscall in its handler not from ptrace handler
  2012-03-29  3:02         ` Vaibhav Nagarnaik
  2012-03-29  3:16           ` H. Peter Anvin
@ 2012-03-29  6:20           ` Ingo Molnar
  2012-03-29 19:02             ` Vaibhav Nagarnaik
  1 sibling, 1 reply; 29+ messages in thread
From: Ingo Molnar @ 2012-03-29  6:20 UTC (permalink / raw)
  To: Vaibhav Nagarnaik
  Cc: H. Peter Anvin, Steven Rostedt, Frederic Weisbecker,
	Thomas Gleixner, Ingo Molnar, David Sharp, Justin Teravest,
	Laurent Chavey, x86, linux-kernel


* Vaibhav Nagarnaik <vnagarnaik@google.com> wrote:

> On Wed, Mar 28, 2012 at 7:43 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> > But instead you add a penalty for every syscall, even if tracing is
> > disabled.  Not cool.
> 
> I just ran a small test binary which calls syscall(SYS_getuid) in a
> tight loop and calculates the latency per syscall.
> 
> Without my patch: it is 70 ns/call
> With my patch: it is 83 ns/call
> 
> So yes, it does add a bit of latency to the syscall even if 
> tracing is disabled. I wonder if I can change the redirection 
> function so that it doesn't add so much latency.

There's a really simple rule for anything tracing/debugging
related, and for syscalls in particular: don't add *ANY* kind of
latency to the non-tracing case. That is true of the current
syscall tracing bits: they work via a TIF flag and don't add any
latency.

Thanks,

	Ingo


* Re: [PATCH 4/6] trace: trace syscall in its handler not from ptrace handler
  2012-03-29  6:20           ` Ingo Molnar
@ 2012-03-29 19:02             ` Vaibhav Nagarnaik
  2012-03-29 19:12               ` H. Peter Anvin
  0 siblings, 1 reply; 29+ messages in thread
From: Vaibhav Nagarnaik @ 2012-03-29 19:02 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: H. Peter Anvin, Steven Rostedt, Frederic Weisbecker,
	Thomas Gleixner, Ingo Molnar, David Sharp, Justin Teravest,
	Laurent Chavey, x86, linux-kernel

On Wed, Mar 28, 2012 at 11:20 PM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Vaibhav Nagarnaik <vnagarnaik@google.com> wrote:
>
>> On Wed, Mar 28, 2012 at 7:43 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>> > But instead you add a penalty for every syscall, even if tracing is
>> > disabled.  Not cool.
>>
>> I just ran a small test binary which calls syscall(SYS_getuid) in a
>> tight loop and calculates the latency per syscall.
>>
>> Without my patch: it is 70 ns/call
>> With my patch: it is 83 ns/call
>>
>> So yes, it does add a bit of latency to the syscall even if
>> tracing is disabled. I wonder if I can change the redirection
>> function so that it doesn't add so much latency.
>
> There's a really simple rule for anything tracing/debugging
> related, and for syscalls in particular: don't add *ANY* kind of
> latency to the non-tracing case. That is true of the current
> syscall tracing bits: they work via a TIF flag and don't add any
> latency.
>

Thanks for the feedback. I had missed this added latency due to this
patch when tracing is disabled.

To fix that, instead of a TIF flag, I am using a flag in the
current->trace bitmap. I check that flag before jumping to the tracing
function. That reduces the latency from 83 ns/call to 74 ns/call.


Thanks

Vaibhav Nagarnaik


* Re: [PATCH 4/6] trace: trace syscall in its handler not from ptrace handler
  2012-03-29 19:02             ` Vaibhav Nagarnaik
@ 2012-03-29 19:12               ` H. Peter Anvin
  2012-03-29 19:43                 ` Vaibhav Nagarnaik
  2012-03-29 22:44                 ` David Sharp
  0 siblings, 2 replies; 29+ messages in thread
From: H. Peter Anvin @ 2012-03-29 19:12 UTC (permalink / raw)
  To: Vaibhav Nagarnaik
  Cc: Ingo Molnar, Steven Rostedt, Frederic Weisbecker,
	Thomas Gleixner, Ingo Molnar, David Sharp, Justin Teravest,
	Laurent Chavey, x86, linux-kernel

On 03/29/2012 12:02 PM, Vaibhav Nagarnaik wrote:
> 
> Thanks for the feedback. I had missed this added latency due to this
> patch when tracing is disabled.
> 
> To fix that, instead of a TIF flag, I am using a flag in the
> current->trace bitmap. I check that flag before jumping to the tracing
> function. That reduces the latency from 83 ns/call to 74 ns/call.
> 

ANY increase to the fastpath is unacceptable, period.

Furthermore, as I have discussed with some people over the last few
days, I think we should consider the whole syscall tracing interface set
to be a mistake and deprecate it.  There are much better ways to
accomplish something that will work more reliably without all these thunks.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



* Re: [PATCH 4/6] trace: trace syscall in its handler not from ptrace handler
  2012-03-29 19:12               ` H. Peter Anvin
@ 2012-03-29 19:43                 ` Vaibhav Nagarnaik
  2012-03-29 20:06                   ` H. Peter Anvin
  2012-03-29 22:44                 ` David Sharp
  1 sibling, 1 reply; 29+ messages in thread
From: Vaibhav Nagarnaik @ 2012-03-29 19:43 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Ingo Molnar, Steven Rostedt, Frederic Weisbecker,
	Thomas Gleixner, Ingo Molnar, David Sharp, Justin Teravest,
	Laurent Chavey, Michael Davidson, x86, linux-kernel

On Thu, Mar 29, 2012 at 12:12 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> ANY increase to the fastpath is unacceptable, period.

I agree.

I know that this or any similar solutions won't be acceptable
upstream, but it works for us within the current syscall tracing
framework.

> Furthermore, as I have discussed with some people over the last few
> days, I think we should consider the whole syscall tracing interface set
> to be a mistake and deprecate it.  There are much better ways to
> accomplish something that will work more reliably without all these thunks.


We rely heavily on a system-wide tracing framework and having the
capability of syscall tracing in the kernel helps with debugging
performance issues. ftrace is the best tool for us in this respect.

However, we agree that the syscall tracing as implemented currently is
a bit unwieldy. We would like to be part of the redesign effort
if there is momentum in the community towards that goal. We would be
happy to contribute towards this effort.


Thanks

Vaibhav Nagarnaik


* Re: [PATCH 4/6] trace: trace syscall in its handler not from ptrace handler
  2012-03-29 19:43                 ` Vaibhav Nagarnaik
@ 2012-03-29 20:06                   ` H. Peter Anvin
  2012-03-29 22:40                     ` David Sharp
  2012-03-30 11:57                     ` Frederic Weisbecker
  0 siblings, 2 replies; 29+ messages in thread
From: H. Peter Anvin @ 2012-03-29 20:06 UTC (permalink / raw)
  To: Vaibhav Nagarnaik
  Cc: Ingo Molnar, Steven Rostedt, Frederic Weisbecker,
	Thomas Gleixner, Ingo Molnar, David Sharp, Justin Teravest,
	Laurent Chavey, Michael Davidson, x86, linux-kernel

On 03/29/2012 12:43 PM, Vaibhav Nagarnaik wrote:
> 
> However, we agree that the syscall tracing as implemented currently is
> a bit unwieldy. We would like to be part of the redesign effort
> if there is momentum in the community towards that goal. We would be
> happy to contribute towards this effort.
> 

I had a long discussion with Frederic over IRC earlier today.  We came
up with the following strawman:

1. A system call thunk (which could be enabled/disabled by patching the
syscall table.)  This provides an entry and exit hook, and also sets a
per-thread flag to capture userspace traffic.

2. Instrumenting get_user/put_user/copy_from_user/copy_to_user to
capture traffic to userspace.  This captures the *full* set of system
call arguments, including things addressed via pointers.  Furthermore,
it captures the exact versions fed to or returned from the kernel, and
deals with data-dependent collection like ioctl().

This has to be done with extreme care to avoid introducing overhead in
the no-tracing case, however, as these functions are extraordinarily
performance sensitive.  This probably will require careful patching in
the first enable/last disable case.

3. There will need to be userspace tools written to decode the resulting
trace buffer.  This is pretty much needed anyway, but once you throw in
complex data structures it becomes even more so.  A trace will basically
consist of:

SYSCALL_ENTRY <syscall number> <arg1..6>
COPY_FROM_USER <address> <data>
  ...
COPY_TO_USER <address> <data>
  ...
SYSCALL_EXIT <return value>

Outputting this in human-readable format requires some reasonably
sophisticated logic, but the *HUGE* advantage is that not only is all
the information there, it is *correct by construction*.
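
As a rough illustration of what the userspace side might start from,
here is a sketch of a decoder for such a record stream. No record layout
has been defined anywhere yet, so struct trace_rec, its field names, and
format_rec() are purely hypothetical:

```c
#include <stddef.h>
#include <stdio.h>

/* Record kinds from the strawman trace format. */
enum rec_type { SYSCALL_ENTRY, COPY_FROM_USER, COPY_TO_USER, SYSCALL_EXIT };

/* Hypothetical in-memory form of one decoded record. */
struct trace_rec {
	enum rec_type type;
	long nr;		/* SYSCALL_ENTRY: syscall number */
	unsigned long args[6];	/* SYSCALL_ENTRY: raw arguments */
	unsigned long addr;	/* COPY_*: user address */
	size_t len;		/* COPY_*: number of bytes captured */
	long ret;		/* SYSCALL_EXIT: return value */
};

/* Render one record as text; returns what snprintf() returns. */
static int format_rec(const struct trace_rec *r, char *buf, size_t size)
{
	switch (r->type) {
	case SYSCALL_ENTRY:
		return snprintf(buf, size,
				"enter nr=%ld (%lx, %lx, %lx, %lx, %lx, %lx)",
				r->nr, r->args[0], r->args[1], r->args[2],
				r->args[3], r->args[4], r->args[5]);
	case COPY_FROM_USER:
		return snprintf(buf, size, "  from user %lx, %zu bytes",
				r->addr, r->len);
	case COPY_TO_USER:
		return snprintf(buf, size, "  to user %lx, %zu bytes",
				r->addr, r->len);
	case SYSCALL_EXIT:
		return snprintf(buf, size, "exit = %ld", r->ret);
	}
	return -1;	/* unknown record type */
}
```

A real decoder would additionally use the syscall number to interpret
the copied data as the right structure, which is where most of the
sophistication would go.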

	-hpa


* Re: [PATCH 4/6] trace: trace syscall in its handler not from ptrace handler
  2012-03-29 20:06                   ` H. Peter Anvin
@ 2012-03-29 22:40                     ` David Sharp
  2012-03-29 22:44                       ` H. Peter Anvin
  2012-03-30 12:06                       ` Frederic Weisbecker
  2012-03-30 11:57                     ` Frederic Weisbecker
  1 sibling, 2 replies; 29+ messages in thread
From: David Sharp @ 2012-03-29 22:40 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Vaibhav Nagarnaik, Ingo Molnar, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Ingo Molnar,
	Justin Teravest, Laurent Chavey, Michael Davidson, x86,
	linux-kernel

On Thu, Mar 29, 2012 at 1:06 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> I had a long discussion with Frederic over IRC earlier today.  We came
> up with the following strawman:
>
> 1. A system call thunk (which could be enabled/disabled by patching the
> syscall table.)  This provides an entry and exit hook, and also sets a
> per-thread flag to capture userspace traffic.

Our goal is for syscall traces to be as fast as regular tracepoints.
IIRC, what we've found is that much of the extra overhead of syscall
tracepoints as compared to regular tracepoints is because the code
path for syscall tracing is bundled with checks for ptrace and other
stuff (Vaibhav did all this characterization, he can jump in with
details if wanted). How much work would this "thunk" have to do that
is not either recording the trace or calling the syscall?

>
> 2. Instrumenting get_user/put_user/copy_from_user/copy_to_user to
> capture traffic to userspace.  This captures the *full* set of system
> call arguments, including things addressed via pointers.  Furthermore,
> it captures the exact versions fed to or returned from the kernel, and
> deals with data-dependent collection like ioctl().

Do I understand correctly that you are thinking of copying the contents
of those buffers into the ring buffer? This sounds useful. However I
think it should be optional and the number of bytes copied should be
limited (tunable). On highly utilized systems, we don't always have a
lot of memory to dedicate to the ring buffer, so filling it with the
contents of, e.g., the payload of "read" or "write" would not be
acceptable under those circumstances. And since events in the ring
buffer can't cross page boundaries, at some threshold this will cause
an unacceptable level of unutilized space in the ring buffer.
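
To put a number on the page-boundary concern, a back-of-the-envelope
sketch. It assumes 4096-byte pages, fixed-size events, and ignores the
per-page and per-event headers of the real ring buffer, so it only
shows the trend:

```c
#include <stddef.h>

#define PAGE_BYTES 4096u	/* assumed page size */

/*
 * Bytes left unused at the end of each page when every event has the
 * same size and no event may cross a page boundary.
 */
static size_t wasted_bytes_per_page(size_t event_size)
{
	if (event_size == 0 || event_size > PAGE_BYTES)
		return PAGE_BYTES;	/* no event fits at all */
	return PAGE_BYTES % event_size;
}
```

Under these assumptions a 72-byte event loses only 64 bytes per page,
while an event of 2049 bytes loses 2047 bytes, i.e. almost half of
every page.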

(For context, this is coming from the folks that added "tiny" versions
of syscall tracepoints that only put 16 bits of arg0 into the ring
buffer so we can get longer trace durations.)

>
> This has to be done with extreme care to avoid introducing overhead in
> the no-tracing case, however, as these functions are extraordinarily
> performance sensitive.  This probably will require careful patching in
> the first enable/last disable case.
>
> 3. There will need to be userspace tools written to decode the resulting
> trace buffer.  This is pretty much needed anyway, but once you throw in
> complex data structures it becomes even more so.  A trace will basically
> consist of:
>
> SYSCALL_ENTRY <syscall number> <arg1..6>
> COPY_FROM_USER <address> <data>
>  ...
> COPY_TO_USER <address> <data>
>  ...
> SYSCALL_EXIT <return value>
>
> Outputting this in human-readable format requires some reasonably
> sophisticated logic, but the *HUGE* advantage is that not only is all
> the information there, it is *correct by construction*.
>
>        -hpa


* Re: [PATCH 4/6] trace: trace syscall in its handler not from ptrace handler
  2012-03-29 22:40                     ` David Sharp
@ 2012-03-29 22:44                       ` H. Peter Anvin
  2012-03-30 12:06                       ` Frederic Weisbecker
  1 sibling, 0 replies; 29+ messages in thread
From: H. Peter Anvin @ 2012-03-29 22:44 UTC (permalink / raw)
  To: David Sharp
  Cc: Vaibhav Nagarnaik, Ingo Molnar, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Ingo Molnar,
	Justin Teravest, Laurent Chavey, Michael Davidson, x86,
	linux-kernel

On 03/29/2012 03:40 PM, David Sharp wrote:
> On Thu, Mar 29, 2012 at 1:06 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>> I had a long discussion with Frederic over IRC earlier today.  We came
>> up with the following strawman:
>>
>> 1. A system call thunk (which could be enabled/disabled by patching the
>> syscall table.)  This provides an entry and exit hook, and also sets a
>> per-thread flag to capture userspace traffic.
> 
> Our goal is for syscall traces to be as fast as regular tracepoints.
> IIRC, what we've found is that much of the extra overhead of syscall
> tracepoints as compared to regular tracepoints is because the code
> path for syscall tracing is bundled with checks for ptrace and other
> stuff (Vaibhav did all this characterization, he can jump in with
> details if wanted). How much work would this "thunk" have to do that
> is not either recording the trace or calling the syscall?

Nothing.  That IS what the thunk would do:

thunk:
	<record syscall entry>
	call real_syscall_table(syscall_number)
	<record syscall exit>
	ret
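
The same idea as a userspace C sketch, under heavy assumptions: the
two-entry table, the stand-in handlers, and every name below are
invented for illustration, and the real thing would live in arch entry
code rather than in C:

```c
#define NR_CALLS 2

typedef long (*syscall_fn)(long nr, long arg);

/* Stand-in handlers; real syscalls take up to six arguments. */
static long sys_double(long nr, long arg) { (void)nr; return 2 * arg; }
static long sys_negate(long nr, long arg) { (void)nr; return -arg; }

/* Pristine table, and the live table that dispatch goes through. */
static syscall_fn real_table[NR_CALLS] = { sys_double, sys_negate };
static syscall_fn live_table[NR_CALLS] = { sys_double, sys_negate };

static long entries, exits;	/* stand-ins for recording trace events */

static long trace_thunk(long nr, long arg)
{
	long ret;

	entries++;			/* <record syscall entry> */
	ret = real_table[nr](nr, arg);	/* call the real handler */
	exits++;			/* <record syscall exit> */
	return ret;
}

/* Enable or disable tracing of one syscall by patching its table slot. */
static void set_tracing(long nr, int on)
{
	live_table[nr] = on ? trace_thunk : real_table[nr];
}

/* Dispatch is one indirect call whether or not tracing is enabled. */
static long dispatch(long nr, long arg)
{
	return live_table[nr](nr, arg);
}
```

With tracing off, dispatch() makes exactly the indirect call it would
make anyway, which is the point of patching the table instead of
testing a flag on every syscall.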

	-hpa


* Re: [PATCH 4/6] trace: trace syscall in its handler not from ptrace handler
  2012-03-29 19:12               ` H. Peter Anvin
  2012-03-29 19:43                 ` Vaibhav Nagarnaik
@ 2012-03-29 22:44                 ` David Sharp
  2012-03-29 22:48                   ` H. Peter Anvin
  1 sibling, 1 reply; 29+ messages in thread
From: David Sharp @ 2012-03-29 22:44 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Vaibhav Nagarnaik, Ingo Molnar, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Ingo Molnar,
	Justin Teravest, Laurent Chavey, x86, linux-kernel

On Thu, Mar 29, 2012 at 12:12 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 03/29/2012 12:02 PM, Vaibhav Nagarnaik wrote:
>>
>> Thanks for the feedback. I had missed this added latency due to this
>> patch when tracing is disabled.
>>
>> To fix that, instead of a TIF flag, I am using a flag in the
>> current->trace bitmap. I check that flag before jumping to the tracing
>> function. That reduces the latency from 83 ns/call to 74 ns/call.
>>
>
> ANY increase to the fastpath is unacceptable, period.

I think the last 4 ns would probably be eliminated if we could figure
out how to call the inline function trace_sys_enter directly instead
of through the out-of-line wrapper we had to add to work around the
preprocessor magic.


* Re: [PATCH 4/6] trace: trace syscall in its handler not from ptrace handler
  2012-03-29 22:44                 ` David Sharp
@ 2012-03-29 22:48                   ` H. Peter Anvin
  0 siblings, 0 replies; 29+ messages in thread
From: H. Peter Anvin @ 2012-03-29 22:48 UTC (permalink / raw)
  To: David Sharp
  Cc: Vaibhav Nagarnaik, Ingo Molnar, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Ingo Molnar,
	Justin Teravest, Laurent Chavey, x86, linux-kernel

On 03/29/2012 03:44 PM, David Sharp wrote:
>>
>> ANY increase to the fastpath is unacceptable, period.
> 
> I think the last 4 ns would probably be eliminated if we could figure
> out how to call the inline function trace_sys_enter directly instead
> of through the out-of-line wrapper we had to add to work around the
> preprocessor magic.

I already told you how to do it... override the system call table entry.
 There is a nice, juicy FUNCTION POINTER that we indirect through.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



* Re: [PATCH 4/6] trace: trace syscall in its handler not from ptrace handler
  2012-03-29 20:06                   ` H. Peter Anvin
  2012-03-29 22:40                     ` David Sharp
@ 2012-03-30 11:57                     ` Frederic Weisbecker
  1 sibling, 0 replies; 29+ messages in thread
From: Frederic Weisbecker @ 2012-03-30 11:57 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Vaibhav Nagarnaik, Ingo Molnar, Steven Rostedt, Thomas Gleixner,
	Ingo Molnar, David Sharp, Justin Teravest, Laurent Chavey,
	Michael Davidson, x86, linux-kernel

On Thu, Mar 29, 2012 at 01:06:10PM -0700, H. Peter Anvin wrote:
> On 03/29/2012 12:43 PM, Vaibhav Nagarnaik wrote:
> > 
> > However, we agree that the syscall tracing as implemented currently is
> > a bit unwieldy. We would like to be part of the redesign effort
> > if there is momentum in the community towards that goal. We would be
> > happy to contribute towards this effort.
> > 
> 
> I had a long discussion with Frederic over IRC earlier today.  We came
> up with the following strawman:
> 
> 1. A system call thunk (which could be enabled/disabled by patching the
> syscall table.)  This provides an entry and exit hook, and also sets a
> per-thread flag to capture userspace traffic.
> 
> 2. Instrumenting get_user/put_user/copy_from_user/copy_to_user to
> capture traffic to userspace.  This captures the *full* set of system
> call arguments, including things addressed via pointers.  Furthermore,
> it captures the exact versions fed to or returned from the kernel, and
> deals with data-dependent collection like ioctl().
> 
> This has to be done with extreme care to avoid introducing overhead in
> the no-tracing case, however, as these functions are extraordinarily
> performance sensitive.  This probably will require careful patching in
> the first enable/last disable case.
> 
> 3. There will need to be userspace tools written to decode the resulting
> trace buffer.  This is pretty much needed anyway, but once you throw in
> complex data structures it becomes even more so.  A trace will basically
> consist of:
> 
> SYSCALL_ENTRY <syscall number> <arg1..6>
> COPY_FROM_USER <address> <data>
>   ...
> COPY_TO_USER <address> <data>
>   ...
> SYSCALL_EXIT <return value>
> 
> Outputting this in human-readable format requires some reasonably
> sophisticated logic, but the *HUGE* advantage is that not only is all
> the information there, it is *correct by construction*.
> 
> 	-hpa


Note we have the relevant tracepoints in place with the "raw_syscalls"
events subsystem. They are generic with only two tracepoints sys_enter
and sys_exit and they blindly dump the syscall number/arg/return value:

$ cat /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/format 
name: sys_enter
ID: 53
format:
        field:unsigned short common_type;       offset:0;       size:2; signed:0;
        field:unsigned char common_flags;       offset:2;       size:1; signed:0;
        field:unsigned char common_preempt_count;       offset:3;       size:1; signed:0;
        field:int common_pid;   offset:4;       size:4; signed:1;
        field:int common_padding;       offset:8;       size:4; signed:1;

        field:long id;  offset:16;      size:8; signed:1;
        field:unsigned long args[6];    offset:24;      size:48;        signed:0;

print fmt: "NR %ld (%lx, %lx, %lx, %lx, %lx, %lx)", REC->id, REC->args[0], REC->args[1], REC->args[2], REC->args[3], REC->args[4], REC->args[5]

$ cat /sys/kernel/debug/tracing/events/raw_syscalls/sys_exit/format 
name: sys_exit
ID: 52
format:
        field:unsigned short common_type;       offset:0;       size:2; signed:0;
        field:unsigned char common_flags;       offset:2;       size:1; signed:0;
        field:unsigned char common_preempt_count;       offset:3;       size:1; signed:0;
        field:int common_pid;   offset:4;       size:4; signed:1;
        field:int common_padding;       offset:8;       size:4; signed:1;

        field:long id;  offset:16;      size:8; signed:1;
        field:long ret; offset:24;      size:8; signed:1;

print fmt: "NR %ld = %ld", REC->id, REC->ret

Now we have yet to do the syscall table patching and the copy_*_user() tracepoints.
But other than these details the bulk of the remaining work is in userspace.
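
For the userspace side, the sys_enter format above maps onto a plain C
struct. This is only a sketch: it assumes an LP64 target, the struct
and helper names are made up, and since the layout can differ between
kernels a real tool should parse the format file instead of hardcoding
offsets:

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors the sys_enter format shown above (LP64 assumed). */
struct sys_enter_rec {
	unsigned short common_type;
	unsigned char  common_flags;
	unsigned char  common_preempt_count;
	int            common_pid;
	int            common_padding;
	long           id;	/* compiler pads from offset 12 to 16 */
	unsigned long  args[6];
};

/* Check the layout against the offsets and sizes in the format file. */
static_assert(offsetof(struct sys_enter_rec, common_pid) == 4, "pid");
static_assert(offsetof(struct sys_enter_rec, id) == 16, "id");
static_assert(offsetof(struct sys_enter_rec, args) == 24, "args");
static_assert(sizeof(struct sys_enter_rec) == 72, "record size");

/* Hypothetical accessor: syscall number of a raw sys_enter record. */
static long sys_enter_nr(const void *raw)
{
	return ((const struct sys_enter_rec *)raw)->id;
}
```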


* Re: [PATCH 4/6] trace: trace syscall in its handler not from ptrace handler
  2012-03-29 22:40                     ` David Sharp
  2012-03-29 22:44                       ` H. Peter Anvin
@ 2012-03-30 12:06                       ` Frederic Weisbecker
  1 sibling, 0 replies; 29+ messages in thread
From: Frederic Weisbecker @ 2012-03-30 12:06 UTC (permalink / raw)
  To: David Sharp
  Cc: H. Peter Anvin, Vaibhav Nagarnaik, Ingo Molnar, Steven Rostedt,
	Thomas Gleixner, Ingo Molnar, Justin Teravest, Laurent Chavey,
	Michael Davidson, x86, linux-kernel

On Thu, Mar 29, 2012 at 03:40:17PM -0700, David Sharp wrote:
> On Thu, Mar 29, 2012 at 1:06 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> > I had a long discussion with Frederic over IRC earlier today.  We came
> > up with the following strawman:
> >
> > 1. A system call thunk (which could be enabled/disabled by patching the
> > syscall table.)  This provides an entry and exit hook, and also sets a
> > per-thread flag to capture userspace traffic.
> 
> Our goal is for syscall traces to be as fast as regular tracepoints.
> > IIRC, what we've found is that much of the extra overhead of syscall
> > tracepoints as compared to regular tracepoints is because the code
> path for syscall tracing is bundled with checks for ptrace and other
> stuff (Vaibhav did all this characterization, he can jump in with
> details if wanted). How much work would this "thunk" have to do that
> is not either recording the trace or calling the syscall?
> 
> >
> > 2. Instrumenting get_user/put_user/copy_from_user/copy_to_user to
> > capture traffic to userspace.  This captures the *full* set of system
> > call arguments, including things addressed via pointers.  Furthermore,
> > it captures the exact versions fed to or returned from the kernel, and
> > deals with data-dependent collection like ioctl().
> 
> > Do I understand correctly that you are thinking of copying the contents
> > of those buffers into the ring buffer? This sounds useful. However I
> > think it should be optional and the number of bytes copied should be
> > limited (tunable). On highly utilized systems, we don't always have a
> > lot of memory to dedicate to the ring buffer, so filling it with the
> > contents of, e.g., the payload of "read" or "write" would not be
> acceptable under those circumstances. And since events in the ring
> buffer can't cross page boundaries, at some threshold this will cause
> an unacceptable level of unutilized space in the ring buffer.
> 
> (For context, this is coming from the folks that added "tiny" versions
> of syscall tracepoints that only put 16 bits of arg0 into the ring
> buffer so we can get longer trace durations.)

BTW, since tracing overhead (in terms of volume and throughput) is
important for you guys, have you considered adding some option to ftrace
to ignore the "common" fields on the trace record:

        field:unsigned short common_type;       offset:0;       size:2; signed:0;
        field:unsigned char common_flags;       offset:2;       size:1; signed:0;
        field:unsigned char common_preempt_count;       offset:3;       size:1; signed:0;
        field:int common_pid;   offset:4;       size:4; signed:1;
        field:int common_padding;       offset:8;       size:4; signed:1;

I think you talked about that on the last kernel summit. This would be
interesting for everyone.

The pid can be recovered from the sched switch events. The rest is
probably useless most of the time.
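
As a rough illustration of the listing above, the five common fields add
up to a 12-byte header on a typical ABI with a 2-byte short and a 4-byte
int. The struct and function names below are hypothetical, chosen for
this sketch only, and the offsets are simply the ones printed in the
format listing.

```c
#include <stddef.h>

/* Hypothetical mirror of the "common" fields shown in the format listing. */
struct common_hdr {
	unsigned short type;          /* offset 0, size 2 */
	unsigned char  flags;         /* offset 2, size 1 */
	unsigned char  preempt_count; /* offset 3, size 1 */
	int            pid;           /* offset 4, size 4 */
	int            padding;       /* offset 8, size 4 */
};

/* Check the layout against the offsets in the listing (assumes a common ABI). */
_Static_assert(offsetof(struct common_hdr, flags) == 2, "flags offset");
_Static_assert(offsetof(struct common_hdr, preempt_count) == 3, "preempt_count offset");
_Static_assert(offsetof(struct common_hdr, pid) == 4, "pid offset");
_Static_assert(offsetof(struct common_hdr, padding) == 8, "padding offset");
_Static_assert(sizeof(struct common_hdr) == 12, "header size");

/*
 * Fraction of a record of record_size bytes that is taken up by the
 * common header. Returns 0.0 if the record is too small to hold one.
 */
double common_hdr_fraction(size_t record_size)
{
	if (record_size < sizeof(struct common_hdr))
		return 0.0;
	return (double)sizeof(struct common_hdr) / (double)record_size;
}
```

For a 24-byte record, common_hdr_fraction(24) gives 0.5, so half of the
record would be common header.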


end of thread, other threads:[~2012-03-30 12:07 UTC | newest]

Thread overview: 29+ messages
2012-03-26 18:39 [PATCH 0/6] Enhance and speed up syscall tracing Vaibhav Nagarnaik
2012-03-26 18:39 ` [PATCH 1/6] trace: syscalls.h - cleanup and simplify SYSCALL_METADATA() Vaibhav Nagarnaik
2012-03-26 18:39 ` [PATCH 2/6] trace: add support for 32 bit compat syscalls on x86_64 Vaibhav Nagarnaik
2012-03-27  4:49   ` H. Peter Anvin
2012-03-28 21:10     ` Vaibhav Nagarnaik
2012-03-28 21:11       ` Vaibhav Nagarnaik
2012-03-28 23:00         ` Vaibhav Nagarnaik
2012-03-26 18:39 ` [PATCH 3/6] trace: Refactor ftrace syscall macros to make them more readable Vaibhav Nagarnaik
2012-03-26 18:39 ` [PATCH 4/6] trace: trace syscall in its handler not from ptrace handler Vaibhav Nagarnaik
2012-03-27  5:00   ` H. Peter Anvin
2012-03-28 18:23     ` Vaibhav Nagarnaik
2012-03-29  2:43       ` H. Peter Anvin
2012-03-29  2:59         ` Steven Rostedt
2012-03-29  3:15           ` H. Peter Anvin
2012-03-29  3:02         ` Vaibhav Nagarnaik
2012-03-29  3:16           ` H. Peter Anvin
2012-03-29  6:20           ` Ingo Molnar
2012-03-29 19:02             ` Vaibhav Nagarnaik
2012-03-29 19:12               ` H. Peter Anvin
2012-03-29 19:43                 ` Vaibhav Nagarnaik
2012-03-29 20:06                   ` H. Peter Anvin
2012-03-29 22:40                     ` David Sharp
2012-03-29 22:44                       ` H. Peter Anvin
2012-03-30 12:06                       ` Frederic Weisbecker
2012-03-30 11:57                     ` Frederic Weisbecker
2012-03-29 22:44                 ` David Sharp
2012-03-29 22:48                   ` H. Peter Anvin
2012-03-26 18:39 ` [PATCH 5/6] trace: raw_syscalls: Mark compat syscalls in the MSB of the syscall number Vaibhav Nagarnaik
2012-03-26 18:39 ` [PATCH 6/6] trace: get rid of the enabled_*_syscalls bitmaps Vaibhav Nagarnaik
