linux-kernel.vger.kernel.org archive mirror
* [patch for 2.6.26 0/7] Architecture Independent Markers
@ 2008-03-27 13:20 Mathieu Desnoyers
  2008-03-27 13:20 ` [patch for 2.6.26 1/7] Markers - define non optimized marker Mathieu Desnoyers
                   ` (8 more replies)
  0 siblings, 9 replies; 29+ messages in thread
From: Mathieu Desnoyers @ 2008-03-27 13:20 UTC (permalink / raw)
  To: akpm, linux-kernel

Hi Andrew,

After a few RFC rounds, I propose these markers for 2.6.26. They include
work done after comments from the memory management community. Most of them have
been used by the LTTng project for about 2 years.

The first patch in the patchset makes a small addition to the markers API: it
allows marker sites to declare a _trace_mark marker, which forces use of a
marker that does not rely on an instruction modification mechanism to enable
itself. This is required in some kernel code paths (lockdep, printk, some
traps, __init and __exit code). I would prefer to get this in before the
immediate values, since the immediate values optimization, which depends on
the rework of x86 alternatives, paravirt and kprobes currently being merged,
is taking longer than expected.
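
For example, with this addition a marker in __init code would use the
non-optimized variant. This is only a sketch (the function and its
argument are hypothetical); the actual API is defined in patch 1/7:

#include <linux/init.h>
#include <linux/marker.h>

static int __init my_subsys_init(void)
{
	/*
	 * Enabled through a plain variable read rather than instruction
	 * modification, hence safe in __init/__exit, lockdep, printk and
	 * trap code paths.
	 */
	_trace_mark(my_subsys_init, "node %d", 0);
	return 0;
}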

I do not expect this marker set to cover every bit of the kernel (and this is
not its purpose). It's just a good start that has proven to be very useful to
the LTTng community in the past 2 years.

This patchset applies over 2.6.25-rc7 in this order:

markers-define-non-optimized-marker.patch
lttng-instrumentation-fs.patch
lttng-instrumentation-ipc.patch
lttng-instrumentation-kernel.patch
lttng-instrumentation-mm.patch
lttng-instrumentation-net.patch
lttng-instrumentation-lib.patch

Mathieu

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* [patch for 2.6.26 1/7] Markers - define non optimized marker
  2008-03-27 13:20 [patch for 2.6.26 0/7] Architecture Independent Markers Mathieu Desnoyers
@ 2008-03-27 13:20 ` Mathieu Desnoyers
  2008-03-27 13:20 ` [patch for 2.6.26 2/7] LTTng instrumentation fs Mathieu Desnoyers
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 29+ messages in thread
From: Mathieu Desnoyers @ 2008-03-27 13:20 UTC (permalink / raw)
  To: akpm, linux-kernel; +Cc: Mathieu Desnoyers

[-- Attachment #1: markers-define-non-optimized-marker.patch --]
[-- Type: text/plain, Size: 3128 bytes --]

To support the forthcoming "immediate values" marker optimization, we must
have a way to declare markers in a few code paths that do not use
instruction-modification-based enabling. This will be the case for printk(),
some traps and eventually the lockdep instrumentation.
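
Seen from the probe side, nothing changes: a _trace_mark() site is hooked
with the same register/unregister calls as a trace_mark() site; only the
enabling mechanism at the instrumentation site differs. Minimal sketch of a
consumer module (the probe body is illustrative, and the exact
marker_probe_register()/unregister() prototypes should be checked against
include/linux/marker.h):

#include <linux/module.h>
#include <linux/marker.h>

static void probe_noop(void *probe_private, void *call_private,
		       const char *fmt, va_list *args)
{
	/* Arguments would be fetched here with va_arg(*args, ...). */
}

static int __init probe_init(void)
{
	/* Hooks the kernel_vprintk marker added later in this series. */
	return marker_probe_register("kernel_vprintk",
			"loglevel %c string %s ip %p", probe_noop, NULL);
}

static void __exit probe_fini(void)
{
	marker_probe_unregister("kernel_vprintk", probe_noop, NULL);
}

module_init(probe_init);
module_exit(probe_fini);
MODULE_LICENSE("GPL");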

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
---
 include/linux/marker.h |   29 ++++++++++++++++++++++++-----
 1 file changed, 24 insertions(+), 5 deletions(-)

Index: linux-2.6-lttng/include/linux/marker.h
===================================================================
--- linux-2.6-lttng.orig/include/linux/marker.h	2008-03-27 07:52:45.000000000 -0400
+++ linux-2.6-lttng/include/linux/marker.h	2008-03-27 08:10:38.000000000 -0400
@@ -58,8 +58,12 @@ struct marker {
  * Make sure the alignment of the structure in the __markers section will
  * not add unwanted padding between the beginning of the section and the
  * structure. Force alignment to the same alignment as the section start.
+ *
+ * The "generic" argument controls which marker enabling mechanism must be used.
+ * If generic is true, a variable read is used.
+ * If generic is false, immediate values are used.
  */
-#define __trace_mark(name, call_private, format, args...)		\
+#define __trace_mark(generic, name, call_private, format, args...)	\
 	do {								\
 		static const char __mstrtab_##name[]			\
 		__attribute__((section("__markers_strings")))		\
@@ -80,7 +84,7 @@ struct marker {
 extern void marker_update_probe_range(struct marker *begin,
 	struct marker *end);
 #else /* !CONFIG_MARKERS */
-#define __trace_mark(name, call_private, format, args...) \
+#define __trace_mark(generic, name, call_private, format, args...) \
 		__mark_check_format(format, ## args)
 static inline void marker_update_probe_range(struct marker *begin,
 	struct marker *end)
@@ -88,15 +92,30 @@ static inline void marker_update_probe_r
 #endif /* CONFIG_MARKERS */
 
 /**
- * trace_mark - Marker
+ * trace_mark - Marker using code patching
  * @name: marker name, not quoted.
  * @format: format string
  * @args...: variable argument list
  *
- * Places a marker.
+ * Places a marker using optimized code patching technique (imv_read())
+ * to be enabled when immediate values are present.
  */
 #define trace_mark(name, format, args...) \
-	__trace_mark(name, NULL, format, ## args)
+	__trace_mark(1, name, NULL, format, ## args)
+
+/**
+ * _trace_mark - Marker using variable read
+ * @name: marker name, not quoted.
+ * @format: format string
+ * @args...: variable argument list
+ *
+ * Places a marker using a standard memory read (_imv_read()) to be
+ * enabled. Should be used for markers in code paths where instruction
+ * modification based enabling is not welcome. (__init and __exit functions,
+ * lockdep, some traps, printk).
+ */
+#define _trace_mark(name, format, args...) \
+	__trace_mark(0, name, NULL, format, ## args)
 
 /**
  * MARK_NOARGS - Format string for a marker with no argument.

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* [patch for 2.6.26 2/7] LTTng instrumentation fs
  2008-03-27 13:20 [patch for 2.6.26 0/7] Architecture Independent Markers Mathieu Desnoyers
  2008-03-27 13:20 ` [patch for 2.6.26 1/7] Markers - define non optimized marker Mathieu Desnoyers
@ 2008-03-27 13:20 ` Mathieu Desnoyers
  2008-03-27 13:21 ` [patch for 2.6.26 3/7] LTTng instrumentation ipc Mathieu Desnoyers
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 29+ messages in thread
From: Mathieu Desnoyers @ 2008-03-27 13:20 UTC (permalink / raw)
  To: akpm, linux-kernel; +Cc: Mathieu Desnoyers, Alexander Viro

[-- Attachment #1: lttng-instrumentation-fs.patch --]
[-- Type: text/plain, Size: 8576 bytes --]

Core filesystem event markers.

Markers added:

fs_buffer_wait_end
fs_buffer_wait_start
fs_close
fs_exec
fs_ioctl
fs_llseek
fs_lseek
fs_open
fs_pollfd
fs_pread64
fs_pwrite64
fs_read
fs_readv
fs_select
fs_write
fs_writev
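
As with any marker, these are consumed by registering a probe on the marker
name and decoding the arguments according to the format string. A rough
sketch (the probe signature follows the current markers API as I understand
it, and printk() merely stands in for a real tracer):

#include <linux/kernel.h>
#include <linux/marker.h>

/* Probe for the fs_open marker, format "fd %d filename %s". */
static void probe_fs_open(void *probe_private, void *call_private,
			  const char *fmt, va_list *args)
{
	int fd = va_arg(*args, int);
	const char *filename = va_arg(*args, const char *);

	printk(KERN_DEBUG "fs_open fd %d filename %s\n", fd, filename);
}

/* Registered with:
 *	marker_probe_register("fs_open", "fd %d filename %s",
 *			      probe_fs_open, NULL);
 */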

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Alexander Viro <viro@zeniv.linux.org.uk>
---
 fs/buffer.c     |    3 +++
 fs/compat.c     |    2 ++
 fs/exec.c       |    2 ++
 fs/ioctl.c      |    3 +++
 fs/open.c       |    3 +++
 fs/read_write.c |   23 +++++++++++++++++++++--
 fs/select.c     |    5 +++++
 7 files changed, 39 insertions(+), 2 deletions(-)

Index: linux-2.6-lttng/fs/buffer.c
===================================================================
--- linux-2.6-lttng.orig/fs/buffer.c	2008-03-27 07:39:56.000000000 -0400
+++ linux-2.6-lttng/fs/buffer.c	2008-03-27 07:39:57.000000000 -0400
@@ -41,6 +41,7 @@
 #include <linux/bitops.h>
 #include <linux/mpage.h>
 #include <linux/bit_spinlock.h>
+#include <linux/marker.h>
 
 static int fsync_buffers_list(spinlock_t *lock, struct list_head *list);
 
@@ -89,7 +90,9 @@ void unlock_buffer(struct buffer_head *b
  */
 void __wait_on_buffer(struct buffer_head * bh)
 {
+	trace_mark(fs_buffer_wait_start, "bh %p", bh);
 	wait_on_bit(&bh->b_state, BH_Lock, sync_buffer, TASK_UNINTERRUPTIBLE);
+	trace_mark(fs_buffer_wait_end, "bh %p", bh);
 }
 
 static void
Index: linux-2.6-lttng/fs/compat.c
===================================================================
--- linux-2.6-lttng.orig/fs/compat.c	2008-03-27 07:39:56.000000000 -0400
+++ linux-2.6-lttng/fs/compat.c	2008-03-27 07:39:57.000000000 -0400
@@ -50,6 +50,7 @@
 #include <linux/poll.h>
 #include <linux/mm.h>
 #include <linux/eventpoll.h>
+#include <linux/marker.h>
 
 #include <asm/uaccess.h>
 #include <asm/mmu_context.h>
@@ -1401,6 +1402,7 @@ int compat_do_execve(char * filename,
 
 	retval = search_binary_handler(bprm, regs);
 	if (retval >= 0) {
+		trace_mark(fs_exec, "filename %s", filename);
 		/* execve success */
 		security_bprm_free(bprm);
 		acct_update_integrals(current);
Index: linux-2.6-lttng/fs/ioctl.c
===================================================================
--- linux-2.6-lttng.orig/fs/ioctl.c	2008-03-27 07:39:56.000000000 -0400
+++ linux-2.6-lttng/fs/ioctl.c	2008-03-27 07:39:57.000000000 -0400
@@ -13,6 +13,7 @@
 #include <linux/security.h>
 #include <linux/module.h>
 #include <linux/uaccess.h>
+#include <linux/marker.h>
 
 #include <asm/ioctls.h>
 
@@ -201,6 +202,8 @@ asmlinkage long sys_ioctl(unsigned int f
 	if (!filp)
 		goto out;
 
+	trace_mark(fs_ioctl, "fd %u cmd %u arg %lu", fd, cmd, arg);
+
 	error = security_file_ioctl(filp, cmd, arg);
 	if (error)
 		goto out_fput;
Index: linux-2.6-lttng/fs/open.c
===================================================================
--- linux-2.6-lttng.orig/fs/open.c	2008-03-27 07:39:56.000000000 -0400
+++ linux-2.6-lttng/fs/open.c	2008-03-27 07:39:57.000000000 -0400
@@ -27,6 +27,7 @@
 #include <linux/rcupdate.h>
 #include <linux/audit.h>
 #include <linux/falloc.h>
+#include <linux/marker.h>
 
 int vfs_statfs(struct dentry *dentry, struct kstatfs *buf)
 {
@@ -1052,6 +1053,7 @@ long do_sys_open(int dfd, const char __u
 				fsnotify_open(f->f_path.dentry);
 				fd_install(fd, f);
 			}
+			trace_mark(fs_open, "fd %d filename %s", fd, tmp);
 		}
 		putname(tmp);
 	}
@@ -1141,6 +1143,7 @@ asmlinkage long sys_close(unsigned int f
 	filp = fdt->fd[fd];
 	if (!filp)
 		goto out_unlock;
+	trace_mark(fs_close, "fd %u", fd);
 	rcu_assign_pointer(fdt->fd[fd], NULL);
 	FD_CLR(fd, fdt->close_on_exec);
 	__put_unused_fd(files, fd);
Index: linux-2.6-lttng/fs/read_write.c
===================================================================
--- linux-2.6-lttng.orig/fs/read_write.c	2008-03-27 07:39:56.000000000 -0400
+++ linux-2.6-lttng/fs/read_write.c	2008-03-27 07:39:57.000000000 -0400
@@ -16,6 +16,7 @@
 #include <linux/syscalls.h>
 #include <linux/pagemap.h>
 #include <linux/splice.h>
+#include <linux/marker.h>
 #include "read_write.h"
 
 #include <asm/uaccess.h>
@@ -146,6 +147,9 @@ asmlinkage off_t sys_lseek(unsigned int 
 		if (res != (loff_t)retval)
 			retval = -EOVERFLOW;	/* LFS: should only happen on 32 bit platforms */
 	}
+
+	trace_mark(fs_lseek, "fd %u offset %ld origin %u", fd, offset, origin);
+
 	fput_light(file, fput_needed);
 bad:
 	return retval;
@@ -173,6 +177,10 @@ asmlinkage long sys_llseek(unsigned int 
 	offset = vfs_llseek(file, ((loff_t) offset_high << 32) | offset_low,
 			origin);
 
+	trace_mark(fs_llseek, "fd %u offset %llu origin %u", fd,
+			(unsigned long long)offset,
+			origin);
+
 	retval = (int)offset;
 	if (offset >= 0) {
 		retval = -EFAULT;
@@ -359,6 +367,7 @@ asmlinkage ssize_t sys_read(unsigned int
 	file = fget_light(fd, &fput_needed);
 	if (file) {
 		loff_t pos = file_pos_read(file);
+		trace_mark(fs_read, "fd %u count %zu", fd, count);
 		ret = vfs_read(file, buf, count, &pos);
 		file_pos_write(file, pos);
 		fput_light(file, fput_needed);
@@ -376,6 +385,7 @@ asmlinkage ssize_t sys_write(unsigned in
 	file = fget_light(fd, &fput_needed);
 	if (file) {
 		loff_t pos = file_pos_read(file);
+		trace_mark(fs_write, "fd %u count %zu", fd, count);
 		ret = vfs_write(file, buf, count, &pos);
 		file_pos_write(file, pos);
 		fput_light(file, fput_needed);
@@ -397,8 +407,12 @@ asmlinkage ssize_t sys_pread64(unsigned 
 	file = fget_light(fd, &fput_needed);
 	if (file) {
 		ret = -ESPIPE;
-		if (file->f_mode & FMODE_PREAD)
+		if (file->f_mode & FMODE_PREAD) {
+			trace_mark(fs_pread64, "fd %u count %zu pos %llu",
+				fd, count, (unsigned long long)pos);
 			ret = vfs_read(file, buf, count, &pos);
+		}
+
 		fput_light(file, fput_needed);
 	}
 
@@ -418,8 +432,11 @@ asmlinkage ssize_t sys_pwrite64(unsigned
 	file = fget_light(fd, &fput_needed);
 	if (file) {
 		ret = -ESPIPE;
-		if (file->f_mode & FMODE_PWRITE)  
+		if (file->f_mode & FMODE_PWRITE) {
+			trace_mark(fs_pwrite64, "fd %u count %zu pos %llu",
+				fd, count, (unsigned long long)pos);
 			ret = vfs_write(file, buf, count, &pos);
+		}
 		fput_light(file, fput_needed);
 	}
 
@@ -663,6 +680,7 @@ sys_readv(unsigned long fd, const struct
 	file = fget_light(fd, &fput_needed);
 	if (file) {
 		loff_t pos = file_pos_read(file);
+		trace_mark(fs_readv, "fd %lu vlen %lu", fd, vlen);
 		ret = vfs_readv(file, vec, vlen, &pos);
 		file_pos_write(file, pos);
 		fput_light(file, fput_needed);
@@ -684,6 +702,7 @@ sys_writev(unsigned long fd, const struc
 	file = fget_light(fd, &fput_needed);
 	if (file) {
 		loff_t pos = file_pos_read(file);
+		trace_mark(fs_writev, "fd %lu vlen %lu", fd, vlen);
 		ret = vfs_writev(file, vec, vlen, &pos);
 		file_pos_write(file, pos);
 		fput_light(file, fput_needed);
Index: linux-2.6-lttng/fs/select.c
===================================================================
--- linux-2.6-lttng.orig/fs/select.c	2008-03-27 07:39:56.000000000 -0400
+++ linux-2.6-lttng/fs/select.c	2008-03-27 07:39:57.000000000 -0400
@@ -23,6 +23,7 @@
 #include <linux/file.h>
 #include <linux/fs.h>
 #include <linux/rcupdate.h>
+#include <linux/marker.h>
 
 #include <asm/uaccess.h>
 
@@ -231,6 +232,9 @@ int do_select(int n, fd_set_bits *fds, s
 				file = fget_light(i, &fput_needed);
 				if (file) {
 					f_op = file->f_op;
+					trace_mark(fs_select,
+							"fd %d timeout #8d%lld",
+							i, (long long)*timeout);
 					mask = DEFAULT_POLLMASK;
 					if (f_op && f_op->poll)
 						mask = (*f_op->poll)(file, retval ? NULL : wait);
@@ -559,6 +563,7 @@ static inline unsigned int do_pollfd(str
 		file = fget_light(fd, &fput_needed);
 		mask = POLLNVAL;
 		if (file != NULL) {
+			trace_mark(fs_pollfd, "fd %d", fd);
 			mask = DEFAULT_POLLMASK;
 			if (file->f_op && file->f_op->poll)
 				mask = file->f_op->poll(file, pwait);
Index: linux-2.6-lttng/fs/exec.c
===================================================================
--- linux-2.6-lttng.orig/fs/exec.c	2008-03-27 07:39:56.000000000 -0400
+++ linux-2.6-lttng/fs/exec.c	2008-03-27 07:39:57.000000000 -0400
@@ -51,6 +51,7 @@
 #include <linux/tsacct_kern.h>
 #include <linux/cn_proc.h>
 #include <linux/audit.h>
+#include <linux/marker.h>
 
 #include <asm/uaccess.h>
 #include <asm/mmu_context.h>
@@ -1338,6 +1339,7 @@ int do_execve(char * filename,
 
 	retval = search_binary_handler(bprm,regs);
 	if (retval >= 0) {
+		trace_mark(fs_exec, "filename %s", filename);
 		/* execve success */
 		free_arg_pages(bprm);
 		security_bprm_free(bprm);

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* [patch for 2.6.26 3/7] LTTng instrumentation ipc
  2008-03-27 13:20 [patch for 2.6.26 0/7] Architecture Independent Markers Mathieu Desnoyers
  2008-03-27 13:20 ` [patch for 2.6.26 1/7] Markers - define non optimized marker Mathieu Desnoyers
  2008-03-27 13:20 ` [patch for 2.6.26 2/7] LTTng instrumentation fs Mathieu Desnoyers
@ 2008-03-27 13:21 ` Mathieu Desnoyers
  2008-03-27 13:21 ` [patch for 2.6.26 4/7] LTTng instrumentation kernel Mathieu Desnoyers
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 29+ messages in thread
From: Mathieu Desnoyers @ 2008-03-27 13:21 UTC (permalink / raw)
  To: akpm, linux-kernel; +Cc: Mathieu Desnoyers

[-- Attachment #1: lttng-instrumentation-ipc.patch --]
[-- Type: text/plain, Size: 3432 bytes --]

Interprocess communication, core events.

Added markers:

ipc_msg_create
ipc_sem_create
ipc_shm_create

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
---
 ipc/msg.c |    6 +++++-
 ipc/sem.c |    6 +++++-
 ipc/shm.c |    6 +++++-
 3 files changed, 15 insertions(+), 3 deletions(-)

Index: linux-2.6-lttng/ipc/msg.c
===================================================================
--- linux-2.6-lttng.orig/ipc/msg.c	2008-03-27 07:40:09.000000000 -0400
+++ linux-2.6-lttng/ipc/msg.c	2008-03-27 07:40:12.000000000 -0400
@@ -37,6 +37,7 @@
 #include <linux/rwsem.h>
 #include <linux/nsproxy.h>
 #include <linux/ipc_namespace.h>
+#include <linux/marker.h>
 
 #include <asm/current.h>
 #include <asm/uaccess.h>
@@ -293,6 +294,7 @@ asmlinkage long sys_msgget(key_t key, in
 	struct ipc_namespace *ns;
 	struct ipc_ops msg_ops;
 	struct ipc_params msg_params;
+	long ret;
 
 	ns = current->nsproxy->ipc_ns;
 
@@ -303,7 +305,9 @@ asmlinkage long sys_msgget(key_t key, in
 	msg_params.key = key;
 	msg_params.flg = msgflg;
 
-	return ipcget(ns, &msg_ids(ns), &msg_ops, &msg_params);
+	ret = ipcget(ns, &msg_ids(ns), &msg_ops, &msg_params);
+	trace_mark(ipc_msg_create, "id %ld flags %d", ret, msgflg);
+	return ret;
 }
 
 static inline unsigned long
Index: linux-2.6-lttng/ipc/sem.c
===================================================================
--- linux-2.6-lttng.orig/ipc/sem.c	2008-03-27 07:40:09.000000000 -0400
+++ linux-2.6-lttng/ipc/sem.c	2008-03-27 07:40:12.000000000 -0400
@@ -83,6 +83,7 @@
 #include <linux/rwsem.h>
 #include <linux/nsproxy.h>
 #include <linux/ipc_namespace.h>
+#include <linux/marker.h>
 
 #include <asm/uaccess.h>
 #include "util.h"
@@ -312,6 +313,7 @@ asmlinkage long sys_semget(key_t key, in
 	struct ipc_namespace *ns;
 	struct ipc_ops sem_ops;
 	struct ipc_params sem_params;
+	long err;
 
 	ns = current->nsproxy->ipc_ns;
 
@@ -326,7 +328,9 @@ asmlinkage long sys_semget(key_t key, in
 	sem_params.flg = semflg;
 	sem_params.u.nsems = nsems;
 
-	return ipcget(ns, &sem_ids(ns), &sem_ops, &sem_params);
+	err = ipcget(ns, &sem_ids(ns), &sem_ops, &sem_params);
+	trace_mark(ipc_sem_create, "id %ld flags %d", err, semflg);
+	return err;
 }
 
 /* Manage the doubly linked list sma->sem_pending as a FIFO:
Index: linux-2.6-lttng/ipc/shm.c
===================================================================
--- linux-2.6-lttng.orig/ipc/shm.c	2008-03-27 07:40:09.000000000 -0400
+++ linux-2.6-lttng/ipc/shm.c	2008-03-27 07:40:12.000000000 -0400
@@ -39,6 +39,7 @@
 #include <linux/nsproxy.h>
 #include <linux/mount.h>
 #include <linux/ipc_namespace.h>
+#include <linux/marker.h>
 
 #include <asm/uaccess.h>
 
@@ -482,6 +483,7 @@ asmlinkage long sys_shmget (key_t key, s
 	struct ipc_namespace *ns;
 	struct ipc_ops shm_ops;
 	struct ipc_params shm_params;
+	long err;
 
 	ns = current->nsproxy->ipc_ns;
 
@@ -493,7 +495,9 @@ asmlinkage long sys_shmget (key_t key, s
 	shm_params.flg = shmflg;
 	shm_params.u.size = size;
 
-	return ipcget(ns, &shm_ids(ns), &shm_ops, &shm_params);
+	err = ipcget(ns, &shm_ids(ns), &shm_ops, &shm_params);
+	trace_mark(ipc_shm_create, "id %ld flags %d", err, shmflg);
+	return err;
 }
 
 static inline unsigned long copy_shmid_to_user(void __user *buf, struct shmid64_ds *in, int version)

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* [patch for 2.6.26 4/7] LTTng instrumentation kernel
  2008-03-27 13:20 [patch for 2.6.26 0/7] Architecture Independent Markers Mathieu Desnoyers
                   ` (2 preceding siblings ...)
  2008-03-27 13:21 ` [patch for 2.6.26 3/7] LTTng instrumentation ipc Mathieu Desnoyers
@ 2008-03-27 13:21 ` Mathieu Desnoyers
  2008-03-27 13:21 ` [patch for 2.6.26 5/7] LTTng instrumentation mm Mathieu Desnoyers
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 29+ messages in thread
From: Mathieu Desnoyers @ 2008-03-27 13:21 UTC (permalink / raw)
  To: akpm, linux-kernel; +Cc: Mathieu Desnoyers

[-- Attachment #1: lttng-instrumentation-kernel.patch --]
[-- Type: text/plain, Size: 17370 bytes --]

Core kernel events.

*Not* present in this patch because they are architecture specific:
- syscall entry/exit
- traps
- kernel thread creation

Added markers:

kernel_irq_entry
kernel_irq_exit
kernel_kthread_stop
kernel_kthread_stop_ret
kernel_module_free
kernel_module_load
kernel_printk
kernel_process_exit
kernel_process_fork
kernel_process_free
kernel_process_wait
kernel_sched_migrate_task
kernel_sched_schedule
kernel_sched_try_wakeup
kernel_sched_wait_task
kernel_sched_wakeup_new_task
kernel_send_signal
kernel_softirq_entry
kernel_softirq_exit
kernel_softirq_raise
kernel_tasklet_high_entry
kernel_tasklet_high_exit
kernel_tasklet_low_entry
kernel_tasklet_low_exit
kernel_timer_itimer_expired
kernel_timer_itimer_set
kernel_timer_set
kernel_timer_timeout
kernel_timer_update_time
kernel_vprintk
locking_hardirqs_off
locking_hardirqs_on
locking_lock_acquire
locking_lock_release
locking_softirqs_off
locking_softirqs_on

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
---
 kernel/exit.c       |    7 +++++++
 kernel/fork.c       |    5 +++++
 kernel/irq/handle.c |    7 +++++++
 kernel/itimer.c     |   13 +++++++++++++
 kernel/kthread.c    |    5 +++++
 kernel/lockdep.c    |   20 ++++++++++++++++++++
 kernel/module.c     |    5 +++++
 kernel/printk.c     |   27 +++++++++++++++++++++++++++
 kernel/sched.c      |   12 ++++++++++++
 kernel/signal.c     |    3 +++
 kernel/softirq.c    |   23 +++++++++++++++++++++++
 kernel/timer.c      |   13 ++++++++++++-
 12 files changed, 139 insertions(+), 1 deletion(-)

Index: linux-2.6-lttng/kernel/irq/handle.c
===================================================================
--- linux-2.6-lttng.orig/kernel/irq/handle.c	2008-03-27 07:40:09.000000000 -0400
+++ linux-2.6-lttng/kernel/irq/handle.c	2008-03-27 07:40:15.000000000 -0400
@@ -15,6 +15,7 @@
 #include <linux/random.h>
 #include <linux/interrupt.h>
 #include <linux/kernel_stat.h>
+#include <linux/marker.h>
 
 #include "internals.h"
 
@@ -130,6 +131,10 @@ irqreturn_t handle_IRQ_event(unsigned in
 {
 	irqreturn_t ret, retval = IRQ_NONE;
 	unsigned int status = 0;
+	struct pt_regs *regs = get_irq_regs();
+
+	trace_mark(kernel_irq_entry, "irq_id %u kernel_mode %u", irq,
+		(regs)?(!user_mode(regs)):(1));
 
 	handle_dynamic_tick(action);
 
@@ -148,6 +153,8 @@ irqreturn_t handle_IRQ_event(unsigned in
 		add_interrupt_randomness(irq);
 	local_irq_disable();
 
+	trace_mark(kernel_irq_exit, MARK_NOARGS);
+
 	return retval;
 }
 
Index: linux-2.6-lttng/kernel/itimer.c
===================================================================
--- linux-2.6-lttng.orig/kernel/itimer.c	2008-03-27 07:40:09.000000000 -0400
+++ linux-2.6-lttng/kernel/itimer.c	2008-03-27 07:40:15.000000000 -0400
@@ -12,6 +12,7 @@
 #include <linux/time.h>
 #include <linux/posix-timers.h>
 #include <linux/hrtimer.h>
+#include <linux/marker.h>
 
 #include <asm/uaccess.h>
 
@@ -132,6 +133,9 @@ enum hrtimer_restart it_real_fn(struct h
 	struct signal_struct *sig =
 		container_of(timer, struct signal_struct, real_timer);
 
+	trace_mark(kernel_timer_itimer_expired, "pid %d",
+		pid_nr(sig->leader_pid));
+
 	kill_pid_info(SIGALRM, SEND_SIG_PRIV, sig->leader_pid);
 
 	return HRTIMER_NORESTART;
@@ -157,6 +161,15 @@ int do_setitimer(int which, struct itime
 	    !timeval_valid(&value->it_interval))
 		return -EINVAL;
 
+	trace_mark(kernel_timer_itimer_set,
+			"which %d interval_sec %ld interval_usec %ld "
+			"value_sec %ld value_usec %ld",
+			which,
+			value->it_interval.tv_sec,
+			value->it_interval.tv_usec,
+			value->it_value.tv_sec,
+			value->it_value.tv_usec);
+
 	switch (which) {
 	case ITIMER_REAL:
 again:
Index: linux-2.6-lttng/kernel/kthread.c
===================================================================
--- linux-2.6-lttng.orig/kernel/kthread.c	2008-03-27 07:40:09.000000000 -0400
+++ linux-2.6-lttng/kernel/kthread.c	2008-03-27 07:40:15.000000000 -0400
@@ -13,6 +13,7 @@
 #include <linux/file.h>
 #include <linux/module.h>
 #include <linux/mutex.h>
+#include <linux/marker.h>
 #include <asm/semaphore.h>
 
 #define KTHREAD_NICE_LEVEL (-5)
@@ -205,6 +206,8 @@ int kthread_stop(struct task_struct *k)
 	/* It could exit after stop_info.k set, but before wake_up_process. */
 	get_task_struct(k);
 
+	trace_mark(kernel_kthread_stop, "pid %d", k->pid);
+
 	/* Must init completion *before* thread sees kthread_stop_info.k */
 	init_completion(&kthread_stop_info.done);
 	smp_wmb();
@@ -220,6 +223,8 @@ int kthread_stop(struct task_struct *k)
 	ret = kthread_stop_info.err;
 	mutex_unlock(&kthread_stop_lock);
 
+	trace_mark(kernel_kthread_stop_ret, "ret %d", ret);
+
 	return ret;
 }
 EXPORT_SYMBOL(kthread_stop);
Index: linux-2.6-lttng/kernel/lockdep.c
===================================================================
--- linux-2.6-lttng.orig/kernel/lockdep.c	2008-03-27 07:40:09.000000000 -0400
+++ linux-2.6-lttng/kernel/lockdep.c	2008-03-27 07:40:15.000000000 -0400
@@ -39,6 +39,7 @@
 #include <linux/irqflags.h>
 #include <linux/utsname.h>
 #include <linux/hash.h>
+#include <linux/marker.h>
 
 #include <asm/sections.h>
 
@@ -2018,6 +2019,9 @@ void trace_hardirqs_on(void)
 	struct task_struct *curr = current;
 	unsigned long ip;
 
+	_trace_mark(locking_hardirqs_on, "ip #p%lu",
+		(unsigned long) __builtin_return_address(0));
+
 	if (unlikely(!debug_locks || current->lockdep_recursion))
 		return;
 
@@ -2065,6 +2069,9 @@ void trace_hardirqs_off(void)
 {
 	struct task_struct *curr = current;
 
+	_trace_mark(locking_hardirqs_off, "ip #p%lu",
+		(unsigned long) __builtin_return_address(0));
+
 	if (unlikely(!debug_locks || current->lockdep_recursion))
 		return;
 
@@ -2092,6 +2099,9 @@ void trace_softirqs_on(unsigned long ip)
 {
 	struct task_struct *curr = current;
 
+	_trace_mark(locking_softirqs_on, "ip #p%lu",
+		(unsigned long) __builtin_return_address(0));
+
 	if (unlikely(!debug_locks))
 		return;
 
@@ -2126,6 +2136,9 @@ void trace_softirqs_off(unsigned long ip
 {
 	struct task_struct *curr = current;
 
+	_trace_mark(locking_softirqs_off, "ip #p%lu",
+		(unsigned long) __builtin_return_address(0));
+
 	if (unlikely(!debug_locks))
 		return;
 
@@ -2358,6 +2371,10 @@ static int __lock_acquire(struct lockdep
 	int chain_head = 0;
 	u64 chain_key;
 
+	_trace_mark(locking_lock_acquire,
+		"ip #p%lu subclass %u lock %p trylock %d",
+		ip, subclass, lock, trylock);
+
 	if (!prove_locking)
 		check = 1;
 
@@ -2631,6 +2648,9 @@ __lock_release(struct lockdep_map *lock,
 {
 	struct task_struct *curr = current;
 
+	_trace_mark(locking_lock_release, "ip #p%lu lock %p nested %d",
+		ip, lock, nested);
+
 	if (!check_unlock(curr, lock, ip))
 		return;
 
Index: linux-2.6-lttng/kernel/printk.c
===================================================================
--- linux-2.6-lttng.orig/kernel/printk.c	2008-03-27 07:40:09.000000000 -0400
+++ linux-2.6-lttng/kernel/printk.c	2008-03-27 07:40:15.000000000 -0400
@@ -32,6 +32,7 @@
 #include <linux/security.h>
 #include <linux/bootmem.h>
 #include <linux/syscalls.h>
+#include <linux/marker.h>
 
 #include <asm/uaccess.h>
 
@@ -607,6 +608,7 @@ asmlinkage int printk(const char *fmt, .
 	int r;
 
 	va_start(args, fmt);
+	trace_mark(kernel_printk, "ip %p", __builtin_return_address(0));
 	r = vprintk(fmt, args);
 	va_end(args);
 
@@ -671,6 +673,31 @@ asmlinkage int vprintk(const char *fmt, 
 	raw_local_irq_save(flags);
 	this_cpu = smp_processor_id();
 
+	if (printed_len > 0) {
+		unsigned int loglevel;
+		int mark_len;
+		char *mark_buf;
+		char saved_char;
+
+		if (printk_buf[0] == '<' && printk_buf[1] >= '0' &&
+		   printk_buf[1] <= '7' && printk_buf[2] == '>') {
+			loglevel = printk_buf[1] - '0';
+			mark_buf = &printk_buf[3];
+			mark_len = printed_len - 3;
+		} else {
+			loglevel = default_message_loglevel;
+			mark_buf = printk_buf;
+			mark_len = printed_len;
+		}
+		if (mark_buf[mark_len - 1] == '\n')
+			mark_len--;
+		saved_char = mark_buf[mark_len];
+		mark_buf[mark_len] = '\0';
+		_trace_mark(kernel_vprintk, "loglevel %c string %s ip %p",
+			loglevel, mark_buf, __builtin_return_address(0));
+		mark_buf[mark_len] = saved_char;
+	}
+
 	/*
 	 * Ouch, printk recursed into itself!
 	 */
Index: linux-2.6-lttng/kernel/sched.c
===================================================================
--- linux-2.6-lttng.orig/kernel/sched.c	2008-03-27 07:40:09.000000000 -0400
+++ linux-2.6-lttng/kernel/sched.c	2008-03-27 07:40:15.000000000 -0400
@@ -66,6 +66,7 @@
 #include <linux/unistd.h>
 #include <linux/pagemap.h>
 #include <linux/hrtimer.h>
+#include <linux/marker.h>
 
 #include <asm/tlb.h>
 #include <asm/irq_regs.h>
@@ -1563,6 +1564,8 @@ void wait_task_inactive(struct task_stru
 		 * just go back and repeat.
 		 */
 		rq = task_rq_lock(p, &flags);
+		trace_mark(kernel_sched_wait_task, "pid %d state %ld",
+			p->pid, p->state);
 		running = task_running(rq, p);
 		on_rq = p->se.on_rq;
 		task_rq_unlock(rq, &flags);
@@ -1841,6 +1844,8 @@ static int try_to_wake_up(struct task_st
 
 	smp_wmb();
 	rq = task_rq_lock(p, &flags);
+	trace_mark(kernel_sched_try_wakeup, "pid %d state %ld",
+		p->pid, p->state);
 	old_state = p->state;
 	if (!(old_state & state))
 		goto out;
@@ -2017,6 +2022,8 @@ void wake_up_new_task(struct task_struct
 	struct rq *rq;
 
 	rq = task_rq_lock(p, &flags);
+	trace_mark(kernel_sched_wakeup_new_task, "pid %d state %ld",
+		p->pid, p->state);
 	BUG_ON(p->state != TASK_RUNNING);
 	update_rq_clock(rq);
 
@@ -2204,6 +2211,9 @@ context_switch(struct rq *rq, struct tas
 	struct mm_struct *mm, *oldmm;
 
 	prepare_task_switch(rq, prev, next);
+	trace_mark(kernel_sched_schedule,
+		"prev_pid %d next_pid %d prev_state %ld",
+		prev->pid, next->pid, prev->state);
 	mm = next->mm;
 	oldmm = prev->active_mm;
 	/*
@@ -2436,6 +2446,8 @@ static void sched_migrate_task(struct ta
 	    || unlikely(cpu_is_offline(dest_cpu)))
 		goto out;
 
+	trace_mark(kernel_sched_migrate_task, "pid %d state %ld dest_cpu %d",
+		p->pid, p->state, dest_cpu);
 	/* force the process onto the specified CPU */
 	if (migrate_task(p, dest_cpu, &req)) {
 		/* Need to wait for migration thread (might exit: take ref). */
Index: linux-2.6-lttng/kernel/signal.c
===================================================================
--- linux-2.6-lttng.orig/kernel/signal.c	2008-03-27 07:40:09.000000000 -0400
+++ linux-2.6-lttng/kernel/signal.c	2008-03-27 07:40:15.000000000 -0400
@@ -26,6 +26,7 @@
 #include <linux/freezer.h>
 #include <linux/pid_namespace.h>
 #include <linux/nsproxy.h>
+#include <linux/marker.h>
 
 #include <asm/param.h>
 #include <asm/uaccess.h>
@@ -663,6 +664,8 @@ static int send_signal(int sig, struct s
 	struct sigqueue * q = NULL;
 	int ret = 0;
 
+	trace_mark(kernel_send_signal, "pid %d signal %d", t->pid, sig);
+
 	/*
 	 * Deliver the signal to listening signalfds. This must be called
 	 * with the sighand lock held.
Index: linux-2.6-lttng/kernel/softirq.c
===================================================================
--- linux-2.6-lttng.orig/kernel/softirq.c	2008-03-27 07:40:09.000000000 -0400
+++ linux-2.6-lttng/kernel/softirq.c	2008-03-27 07:40:15.000000000 -0400
@@ -21,6 +21,7 @@
 #include <linux/rcupdate.h>
 #include <linux/smp.h>
 #include <linux/tick.h>
+#include <linux/marker.h>
 
 #include <asm/irq.h>
 /*
@@ -231,7 +232,15 @@ restart:
 
 	do {
 		if (pending & 1) {
+			trace_mark(kernel_softirq_entry, "softirq_id %lu",
+				((unsigned long)h
+					- (unsigned long)softirq_vec)
+					/ sizeof(*h));
 			h->action(h);
+			trace_mark(kernel_softirq_exit, "softirq_id %lu",
+				((unsigned long)h
+					- (unsigned long)softirq_vec)
+					/ sizeof(*h));
 			rcu_bh_qsctr_inc(cpu);
 		}
 		h++;
@@ -323,6 +332,8 @@ void irq_exit(void)
  */
 inline void raise_softirq_irqoff(unsigned int nr)
 {
+	trace_mark(kernel_softirq_raise, "softirq_id %u", nr);
+
 	__raise_softirq_irqoff(nr);
 
 	/*
@@ -408,7 +419,13 @@ static void tasklet_action(struct softir
 			if (!atomic_read(&t->count)) {
 				if (!test_and_clear_bit(TASKLET_STATE_SCHED, &t->state))
 					BUG();
+				trace_mark(kernel_tasklet_low_entry,
+						"func %p data %lu",
+						t->func, t->data);
 				t->func(t->data);
+				trace_mark(kernel_tasklet_low_exit,
+						"func %p data %lu",
+						t->func, t->data);
 				tasklet_unlock(t);
 				continue;
 			}
@@ -441,7 +458,13 @@ static void tasklet_hi_action(struct sof
 			if (!atomic_read(&t->count)) {
 				if (!test_and_clear_bit(TASKLET_STATE_SCHED, &t->state))
 					BUG();
+				trace_mark(kernel_tasklet_high_entry,
+						"func %p data %lu",
+						t->func, t->data);
 				t->func(t->data);
+				trace_mark(kernel_tasklet_high_exit,
+						"func %p data %lu",
+						t->func, t->data);
 				tasklet_unlock(t);
 				continue;
 			}
Index: linux-2.6-lttng/kernel/timer.c
===================================================================
--- linux-2.6-lttng.orig/kernel/timer.c	2008-03-27 07:40:09.000000000 -0400
+++ linux-2.6-lttng/kernel/timer.c	2008-03-27 07:40:15.000000000 -0400
@@ -37,12 +37,14 @@
 #include <linux/delay.h>
 #include <linux/tick.h>
 #include <linux/kallsyms.h>
+#include <linux/marker.h>
 
 #include <asm/uaccess.h>
 #include <asm/unistd.h>
 #include <asm/div64.h>
 #include <asm/timex.h>
 #include <asm/io.h>
+#include <asm/irq_regs.h>
 
 u64 jiffies_64 __cacheline_aligned_in_smp = INITIAL_JIFFIES;
 
@@ -288,6 +290,8 @@ static void internal_add_timer(struct tv
 		i = (expires >> (TVR_BITS + 3 * TVN_BITS)) & TVN_MASK;
 		vec = base->tv5.vec + i;
 	}
+	trace_mark(kernel_timer_set, "expires %lu function %p data %lu",
+		expires, timer->function, timer->data);
 	/*
 	 * Timers are FIFO:
 	 */
@@ -940,6 +944,11 @@ void do_timer(unsigned long ticks)
 {
 	jiffies_64 += ticks;
 	update_times(ticks);
+	trace_mark(kernel_timer_update_time,
+		"jiffies #8u%llu xtime_sec %ld xtime_nsec %ld "
+		"walltomonotonic_sec %ld walltomonotonic_nsec %ld",
+		(unsigned long long)jiffies_64, xtime.tv_sec, xtime.tv_nsec,
+		wall_to_monotonic.tv_sec, wall_to_monotonic.tv_nsec);
 }
 
 #ifdef __ARCH_WANT_SYS_ALARM
@@ -1021,7 +1030,9 @@ asmlinkage long sys_getegid(void)
 
 static void process_timeout(unsigned long __data)
 {
-	wake_up_process((struct task_struct *)__data);
+	struct task_struct *task = (struct task_struct *)__data;
+	trace_mark(kernel_timer_timeout, "pid %d", task->pid);
+	wake_up_process(task);
 }
 
 /**
Index: linux-2.6-lttng/kernel/exit.c
===================================================================
--- linux-2.6-lttng.orig/kernel/exit.c	2008-03-27 07:40:09.000000000 -0400
+++ linux-2.6-lttng/kernel/exit.c	2008-03-27 07:40:15.000000000 -0400
@@ -44,6 +44,7 @@
 #include <linux/resource.h>
 #include <linux/blkdev.h>
 #include <linux/task_io_accounting_ops.h>
+#include <linux/marker.h>
 
 #include <asm/uaccess.h>
 #include <asm/unistd.h>
@@ -137,6 +138,8 @@ static void __exit_signal(struct task_st
 
 static void delayed_put_task_struct(struct rcu_head *rhp)
 {
+	trace_mark(kernel_process_free, "pid %d",
+		container_of(rhp, struct task_struct, rcu)->pid);
 	put_task_struct(container_of(rhp, struct task_struct, rcu));
 }
 
@@ -966,6 +969,8 @@ NORET_TYPE void do_exit(long code)
 
 	if (group_dead)
 		acct_process();
+	trace_mark(kernel_process_exit, "pid %d", tsk->pid);
+
 	exit_sem(tsk);
 	__exit_files(tsk);
 	__exit_fs(tsk);
@@ -1450,6 +1455,8 @@ static long do_wait(enum pid_type type, 
 	struct task_struct *tsk;
 	int flag, retval;
 
+	trace_mark(kernel_process_wait, "pid %d", pid_nr(pid));
+
 	add_wait_queue(&current->signal->wait_chldexit,&wait);
 repeat:
 	/* If there is nothing that can match our critier just get out */
Index: linux-2.6-lttng/kernel/fork.c
===================================================================
--- linux-2.6-lttng.orig/kernel/fork.c	2008-03-27 07:40:09.000000000 -0400
+++ linux-2.6-lttng/kernel/fork.c	2008-03-27 07:40:15.000000000 -0400
@@ -53,6 +53,7 @@
 #include <linux/tty.h>
 #include <linux/proc_fs.h>
 #include <linux/blkdev.h>
+#include <linux/marker.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -1488,6 +1489,10 @@ long do_fork(unsigned long clone_flags,
 	if (!IS_ERR(p)) {
 		struct completion vfork;
 
+		trace_mark(kernel_process_fork,
+			"parent_pid %d child_pid %d child_tgid %d",
+			current->pid, p->pid, p->tgid);
+
 		nr = task_pid_vnr(p);
 
 		if (clone_flags & CLONE_PARENT_SETTID)
Index: linux-2.6-lttng/kernel/module.c
===================================================================
--- linux-2.6-lttng.orig/kernel/module.c	2008-03-27 07:40:09.000000000 -0400
+++ linux-2.6-lttng/kernel/module.c	2008-03-27 07:40:15.000000000 -0400
@@ -47,6 +47,7 @@
 #include <asm/cacheflush.h>
 #include <linux/license.h>
 #include <asm/sections.h>
+#include <linux/marker.h>
 
 #if 0
 #define DEBUGP printk
@@ -1332,6 +1333,8 @@ static int __unlink_module(void *_mod)
 /* Free a module, remove from lists, etc (must hold module_mutex). */
 static void free_module(struct module *mod)
 {
+	trace_mark(kernel_module_free, "name %s", mod->name);
+
 	/* Delete from various lists */
 	stop_machine_run(__unlink_module, mod, NR_CPUS);
 	remove_notes_attrs(mod);
@@ -2104,6 +2107,8 @@ static struct module *load_module(void _
 	/* Get rid of temporary copy */
 	vfree(hdr);
 
+	trace_mark(kernel_module_load, "name %s", mod->name);
+
 	/* Done! */
 	return mod;
 

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* [patch for 2.6.26 5/7] LTTng instrumentation mm
  2008-03-27 13:20 [patch for 2.6.26 0/7] Architecture Independent Markers Mathieu Desnoyers
                   ` (3 preceding siblings ...)
  2008-03-27 13:21 ` [patch for 2.6.26 4/7] LTTng instrumentation kernel Mathieu Desnoyers
@ 2008-03-27 13:21 ` Mathieu Desnoyers
  2008-03-27 13:21 ` [patch for 2.6.26 6/7] LTTng instrumentation net Mathieu Desnoyers
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 29+ messages in thread
From: Mathieu Desnoyers @ 2008-03-27 13:21 UTC (permalink / raw)
  To: akpm, linux-kernel; +Cc: Mathieu Desnoyers, linux-mm, Dave Hansen

[-- Attachment #1: lttng-instrumentation-mm.patch --]
[-- Type: text/plain, Size: 9773 bytes --]

Memory management core events.

Added markers:

mm_filemap_wait_end
mm_filemap_wait_start
mm_handle_fault_entry
mm_handle_fault_exit
mm_huge_page_alloc
mm_huge_page_free
mm_page_alloc
mm_page_free
mm_swap_file_close
mm_swap_file_open
mm_swap_in
mm_swap_out
statedump_swap_files

Changelog:
- Use page_to_pfn for swap out instrumentation, wait_on_page_bit, do_swap_page,
  page alloc/free.
- add missing free_hot_cold_page instrumentation.
- add hugetlb page_alloc page_free instrumentation.
- Add write_access to mm fault.
- Add page bit_nr waited for by wait_on_page_bit.
- Move page alloc instrumentation to __alloc_pages so we cover the alloc zeroed
  page path.
- Add swap file used for swap in and swap out events.
- Dump the swap files, instrument swapon and swapoff.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: linux-mm@kvack.org
CC: Dave Hansen <haveblue@us.ibm.com>
---
 include/linux/swapops.h |    8 ++++++++
 mm/filemap.c            |    7 +++++++
 mm/hugetlb.c            |    3 +++
 mm/memory.c             |   39 ++++++++++++++++++++++++++++++---------
 mm/page_alloc.c         |    9 +++++++++
 mm/page_io.c            |    6 ++++++
 mm/swapfile.c           |   23 +++++++++++++++++++++++
 7 files changed, 86 insertions(+), 9 deletions(-)

Index: linux-2.6-lttng/mm/filemap.c
===================================================================
--- linux-2.6-lttng.orig/mm/filemap.c	2008-03-27 09:18:44.000000000 -0400
+++ linux-2.6-lttng/mm/filemap.c	2008-03-27 09:19:23.000000000 -0400
@@ -33,6 +33,7 @@
 #include <linux/cpuset.h>
 #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
 #include <linux/memcontrol.h>
+#include <linux/marker.h>
 #include "internal.h"
 
 /*
@@ -540,9 +541,15 @@ void wait_on_page_bit(struct page *page,
 {
 	DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);
 
+	trace_mark(mm_filemap_wait_start, "pfn %lu bit_nr %d",
+		page_to_pfn(page), bit_nr);
+
 	if (test_bit(bit_nr, &page->flags))
 		__wait_on_bit(page_waitqueue(page), &wait, sync_page,
 							TASK_UNINTERRUPTIBLE);
+
+	trace_mark(mm_filemap_wait_end, "pfn %lu bit_nr %d",
+		page_to_pfn(page), bit_nr);
 }
 EXPORT_SYMBOL(wait_on_page_bit);
 
Index: linux-2.6-lttng/mm/memory.c
===================================================================
--- linux-2.6-lttng.orig/mm/memory.c	2008-03-27 09:18:44.000000000 -0400
+++ linux-2.6-lttng/mm/memory.c	2008-03-27 09:19:23.000000000 -0400
@@ -45,6 +45,7 @@
 #include <linux/swap.h>
 #include <linux/highmem.h>
 #include <linux/pagemap.h>
+#include <linux/marker.h>
 #include <linux/rmap.h>
 #include <linux/module.h>
 #include <linux/delayacct.h>
@@ -2050,6 +2051,10 @@ static int do_swap_page(struct mm_struct
 		/* Had to read the page from swap area: Major fault */
 		ret = VM_FAULT_MAJOR;
 		count_vm_event(PGMAJFAULT);
+		trace_mark(mm_swap_in, "pfn %lu filp %p offset %lu",
+			page_to_pfn(page),
+			get_swap_info_struct(swp_type(entry))->swap_file,
+			swp_offset(entry));
 	}
 
 	if (mem_cgroup_charge(page, mm, GFP_KERNEL)) {
@@ -2509,30 +2514,46 @@ unlock:
 int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		unsigned long address, int write_access)
 {
+	int res;
 	pgd_t *pgd;
 	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
 
+	trace_mark(mm_handle_fault_entry,
+		"address %lu ip #p%ld write_access %d",
+		address, KSTK_EIP(current), write_access);
+
 	__set_current_state(TASK_RUNNING);
 
 	count_vm_event(PGFAULT);
 
-	if (unlikely(is_vm_hugetlb_page(vma)))
-		return hugetlb_fault(mm, vma, address, write_access);
+	if (unlikely(is_vm_hugetlb_page(vma))) {
+		res = hugetlb_fault(mm, vma, address, write_access);
+		goto end;
+	}
 
 	pgd = pgd_offset(mm, address);
 	pud = pud_alloc(mm, pgd, address);
-	if (!pud)
-		return VM_FAULT_OOM;
+	if (!pud) {
+		res = VM_FAULT_OOM;
+		goto end;
+	}
 	pmd = pmd_alloc(mm, pud, address);
-	if (!pmd)
-		return VM_FAULT_OOM;
+	if (!pmd) {
+		res = VM_FAULT_OOM;
+		goto end;
+	}
 	pte = pte_alloc_map(mm, pmd, address);
-	if (!pte)
-		return VM_FAULT_OOM;
+	if (!pte) {
+		res = VM_FAULT_OOM;
+		goto end;
+	}
 
-	return handle_pte_fault(mm, vma, address, pte, pmd, write_access);
+	res = handle_pte_fault(mm, vma, address, pte, pmd, write_access);
+end:
+	trace_mark(mm_handle_fault_exit, MARK_NOARGS);
+	return res;
 }
 
 #ifndef __PAGETABLE_PUD_FOLDED
Index: linux-2.6-lttng/mm/page_alloc.c
===================================================================
--- linux-2.6-lttng.orig/mm/page_alloc.c	2008-03-27 09:18:44.000000000 -0400
+++ linux-2.6-lttng/mm/page_alloc.c	2008-03-27 09:19:23.000000000 -0400
@@ -45,6 +45,7 @@
 #include <linux/fault-inject.h>
 #include <linux/page-isolation.h>
 #include <linux/memcontrol.h>
+#include <linux/marker.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -527,6 +528,9 @@ static void __free_pages_ok(struct page 
 	int i;
 	int reserved = 0;
 
+	trace_mark(mm_page_free, "order %u pfn %lu",
+		order, page_to_pfn(page));
+
 	for (i = 0 ; i < (1 << order) ; ++i)
 		reserved += free_pages_check(page + i);
 	if (reserved)
@@ -990,6 +994,8 @@ static void free_hot_cold_page(struct pa
 	struct per_cpu_pages *pcp;
 	unsigned long flags;
 
+	trace_mark(mm_page_free, "order %u pfn %lu", 0, page_to_pfn(page));
+
 	if (PageAnon(page))
 		page->mapping = NULL;
 	if (free_pages_check(page))
@@ -1643,6 +1649,9 @@ nopage:
 		show_mem();
 	}
 got_pg:
+	if (page)
+		trace_mark(mm_page_alloc, "order %u pfn %lu", order,
+			   page_to_pfn(page));
 	return page;
 }
 
Index: linux-2.6-lttng/mm/page_io.c
===================================================================
--- linux-2.6-lttng.orig/mm/page_io.c	2008-03-27 09:18:44.000000000 -0400
+++ linux-2.6-lttng/mm/page_io.c	2008-03-27 09:19:23.000000000 -0400
@@ -17,6 +17,7 @@
 #include <linux/bio.h>
 #include <linux/swapops.h>
 #include <linux/writeback.h>
+#include <linux/marker.h>
 #include <asm/pgtable.h>
 
 static struct bio *get_swap_bio(gfp_t gfp_flags, pgoff_t index,
@@ -114,6 +115,11 @@ int swap_writepage(struct page *page, st
 		rw |= (1 << BIO_RW_SYNC);
 	count_vm_event(PSWPOUT);
 	set_page_writeback(page);
+	trace_mark(mm_swap_out, "pfn %lu filp %p offset %lu",
+			page_to_pfn(page),
+			get_swap_info_struct(swp_type(
+				page_swp_entry(page)))->swap_file,
+			swp_offset(page_swp_entry(page)));
 	unlock_page(page);
 	submit_bio(rw, bio);
 out:
Index: linux-2.6-lttng/mm/hugetlb.c
===================================================================
--- linux-2.6-lttng.orig/mm/hugetlb.c	2008-03-27 09:18:44.000000000 -0400
+++ linux-2.6-lttng/mm/hugetlb.c	2008-03-27 09:19:23.000000000 -0400
@@ -14,6 +14,7 @@
 #include <linux/mempolicy.h>
 #include <linux/cpuset.h>
 #include <linux/mutex.h>
+#include <linux/marker.h>
 
 #include <asm/page.h>
 #include <asm/pgtable.h>
@@ -137,6 +138,7 @@ static void free_huge_page(struct page *
 	int nid = page_to_nid(page);
 	struct address_space *mapping;
 
+	trace_mark(mm_huge_page_free, "pfn %lu", page_to_pfn(page));
 	mapping = (struct address_space *) page_private(page);
 	set_page_private(page, 0);
 	BUG_ON(page_count(page));
@@ -485,6 +487,7 @@ static struct page *alloc_huge_page(stru
 	if (!IS_ERR(page)) {
 		set_page_refcounted(page);
 		set_page_private(page, (unsigned long) mapping);
+		trace_mark(mm_huge_page_alloc, "pfn %lu", page_to_pfn(page));
 	}
 	return page;
 }
Index: linux-2.6-lttng/include/linux/swapops.h
===================================================================
--- linux-2.6-lttng.orig/include/linux/swapops.h	2008-03-27 09:18:44.000000000 -0400
+++ linux-2.6-lttng/include/linux/swapops.h	2008-03-27 09:19:23.000000000 -0400
@@ -76,6 +76,14 @@ static inline pte_t swp_entry_to_pte(swp
 	return __swp_entry_to_pte(arch_entry);
 }
 
+static inline swp_entry_t page_swp_entry(struct page *page)
+{
+	swp_entry_t entry;
+	VM_BUG_ON(!PageSwapCache(page));
+	entry.val = page_private(page);
+	return entry;
+}
+
 #ifdef CONFIG_MIGRATION
 static inline swp_entry_t make_migration_entry(struct page *page, int write)
 {
Index: linux-2.6-lttng/mm/swapfile.c
===================================================================
--- linux-2.6-lttng.orig/mm/swapfile.c	2008-03-27 09:18:44.000000000 -0400
+++ linux-2.6-lttng/mm/swapfile.c	2008-03-27 09:19:43.000000000 -0400
@@ -28,6 +28,7 @@
 #include <linux/capability.h>
 #include <linux/syscalls.h>
 #include <linux/memcontrol.h>
+#include <linux/marker.h>
 
 #include <asm/pgtable.h>
 #include <asm/tlbflush.h>
@@ -1310,6 +1311,7 @@ asmlinkage long sys_swapoff(const char _
 	swap_map = p->swap_map;
 	p->swap_map = NULL;
 	p->flags = 0;
+	trace_mark(mm_swap_file_close, "filp %p", swap_file);
 	spin_unlock(&swap_lock);
 	mutex_unlock(&swapon_mutex);
 	vfree(swap_map);
@@ -1691,6 +1693,8 @@ asmlinkage long sys_swapon(const char __
 	} else {
 		swap_info[prev].next = p - swap_info;
 	}
+	trace_mark(mm_swap_file_open, "filp %p filename %s",
+		swap_file, name);
 	spin_unlock(&swap_lock);
 	mutex_unlock(&swapon_mutex);
 	error = 0;
@@ -1844,3 +1848,22 @@ int valid_swaphandles(swp_entry_t entry,
 	*offset = ++toff;
 	return nr_pages? ++nr_pages: 0;
 }
+
+void ltt_dump_swap_files(void *call_data)
+{
+	int type;
+	struct swap_info_struct *p = NULL;
+
+	mutex_lock(&swapon_mutex);
+	for (type = swap_list.head; type >= 0; type = swap_info[type].next) {
+		p = swap_info + type;
+		if ((p->flags & SWP_ACTIVE) != SWP_ACTIVE)
+			continue;
+		__trace_mark(0, statedump_swap_files, call_data,
+			"filp %p vfsmount %p dname %s",
+			p->swap_file, p->swap_file->f_vfsmnt,
+			p->swap_file->f_dentry->d_name.name);
+	}
+	mutex_unlock(&swapon_mutex);
+}
+EXPORT_SYMBOL_GPL(ltt_dump_swap_files);

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* [patch for 2.6.26 6/7] LTTng instrumentation net
  2008-03-27 13:20 [patch for 2.6.26 0/7] Architecture Independent Markers Mathieu Desnoyers
                   ` (4 preceding siblings ...)
  2008-03-27 13:21 ` [patch for 2.6.26 5/7] LTTng instrumentation mm Mathieu Desnoyers
@ 2008-03-27 13:21 ` Mathieu Desnoyers
  2008-03-27 13:21 ` [patch for 2.6.26 7/7] LTTng instrumentation - lib Mathieu Desnoyers
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 29+ messages in thread
From: Mathieu Desnoyers @ 2008-03-27 13:21 UTC (permalink / raw)
  To: akpm, linux-kernel; +Cc: Mathieu Desnoyers, netdev

[-- Attachment #1: lttng-instrumentation-net.patch --]
[-- Type: text/plain, Size: 4381 bytes --]

Network core events.

Added markers:

net_del_ifa_ipv4
net_dev_receive
net_dev_xmit
net_insert_ifa_ipv4
net_socket_call
net_socket_create
net_socket_recvmsg
net_socket_sendmsg

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: netdev@vger.kernel.org
---
 net/core/dev.c     |    6 ++++++
 net/ipv4/devinet.c |    6 ++++++
 net/socket.c       |   19 +++++++++++++++++++
 3 files changed, 31 insertions(+)

Index: linux-2.6-lttng/net/core/dev.c
===================================================================
--- linux-2.6-lttng.orig/net/core/dev.c	2008-03-27 07:26:26.000000000 -0400
+++ linux-2.6-lttng/net/core/dev.c	2008-03-27 07:31:44.000000000 -0400
@@ -119,6 +119,7 @@
 #include <linux/err.h>
 #include <linux/ctype.h>
 #include <linux/if_arp.h>
+#include <linux/marker.h>
 
 #include "net-sysfs.h"
 
@@ -1643,6 +1644,8 @@ int dev_queue_xmit(struct sk_buff *skb)
 	}
 
 gso:
+	trace_mark(net_dev_xmit, "skb %p protocol #2u%hu", skb, skb->protocol);
+
 	spin_lock_prefetch(&dev->queue_lock);
 
 	/* Disable soft irqs for various locks below. Also
@@ -2043,6 +2046,9 @@ int netif_receive_skb(struct sk_buff *sk
 
 	__get_cpu_var(netdev_rx_stat).total++;
 
+	trace_mark(net_dev_receive, "skb %p protocol #2u%hu",
+		skb, skb->protocol);
+
 	skb_reset_network_header(skb);
 	skb_reset_transport_header(skb);
 	skb->mac_len = skb->network_header - skb->mac_header;
Index: linux-2.6-lttng/net/ipv4/devinet.c
===================================================================
--- linux-2.6-lttng.orig/net/ipv4/devinet.c	2008-03-27 07:26:26.000000000 -0400
+++ linux-2.6-lttng/net/ipv4/devinet.c	2008-03-27 07:31:49.000000000 -0400
@@ -56,6 +56,7 @@
 #include <linux/sysctl.h>
 #endif
 #include <linux/kmod.h>
+#include <linux/marker.h>
 
 #include <net/arp.h>
 #include <net/ip.h>
@@ -258,6 +259,8 @@ static void __inet_del_ifa(struct in_dev
 		struct in_ifaddr **ifap1 = &ifa1->ifa_next;
 
 		while ((ifa = *ifap1) != NULL) {
+			trace_mark(net_del_ifa_ipv4, "label %s",
+				ifa->ifa_label);
 			if (!(ifa->ifa_flags & IFA_F_SECONDARY) &&
 			    ifa1->ifa_scope <= ifa->ifa_scope)
 				last_prim = ifa;
@@ -364,6 +367,9 @@ static int __inet_insert_ifa(struct in_i
 			}
 			ifa->ifa_flags |= IFA_F_SECONDARY;
 		}
+		trace_mark(net_insert_ifa_ipv4, "label %s address #4u%lu",
+			ifa->ifa_label,
+			(unsigned long)ifa->ifa_address);
 	}
 
 	if (!(ifa->ifa_flags & IFA_F_SECONDARY)) {
Index: linux-2.6-lttng/net/socket.c
===================================================================
--- linux-2.6-lttng.orig/net/socket.c	2008-03-27 07:26:26.000000000 -0400
+++ linux-2.6-lttng/net/socket.c	2008-03-27 07:31:57.000000000 -0400
@@ -85,6 +85,7 @@
 #include <linux/audit.h>
 #include <linux/wireless.h>
 #include <linux/nsproxy.h>
+#include <linux/marker.h>
 
 #include <asm/uaccess.h>
 #include <asm/unistd.h>
@@ -567,6 +568,11 @@ int sock_sendmsg(struct socket *sock, st
 	struct sock_iocb siocb;
 	int ret;
 
+	trace_mark(net_socket_sendmsg,
+		"sock %p family %d type %d protocol %d size %zu",
+		sock, sock->sk->sk_family, sock->sk->sk_type,
+		sock->sk->sk_protocol, size);
+
 	init_sync_kiocb(&iocb, NULL);
 	iocb.private = &siocb;
 	ret = __sock_sendmsg(&iocb, sock, msg, size);
@@ -650,7 +656,13 @@ int sock_recvmsg(struct socket *sock, st
 	struct sock_iocb siocb;
 	int ret;
 
+	trace_mark(net_socket_recvmsg,
+		"sock %p family %d type %d protocol %d size %zu",
+		sock, sock->sk->sk_family, sock->sk->sk_type,
+		sock->sk->sk_protocol, size);
+
 	init_sync_kiocb(&iocb, NULL);
+
 	iocb.private = &siocb;
 	ret = __sock_recvmsg(&iocb, sock, msg, size, flags);
 	if (-EIOCBQUEUED == ret)
@@ -1226,6 +1238,11 @@ asmlinkage long sys_socket(int family, i
 	if (retval < 0)
 		goto out_release;
 
+	trace_mark(net_socket_create,
+		"sock %p family %d type %d protocol %d fd %d",
+		sock, sock->sk->sk_family, sock->sk->sk_type,
+		sock->sk->sk_protocol, retval);
+
 out:
 	/* It may be already another descriptor 8) Not kernel problem. */
 	return retval;
@@ -2024,6 +2041,8 @@ asmlinkage long sys_socketcall(int call,
 	a0 = a[0];
 	a1 = a[1];
 
+	trace_mark(net_socket_call, "call %d a0 %lu", call, a0);
+
 	switch (call) {
 	case SYS_SOCKET:
 		err = sys_socket(a0, a1, a[2]);

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* [patch for 2.6.26 7/7] LTTng instrumentation - lib
  2008-03-27 13:20 [patch for 2.6.26 0/7] Architecture Independent Markers Mathieu Desnoyers
                   ` (5 preceding siblings ...)
  2008-03-27 13:21 ` [patch for 2.6.26 6/7] LTTng instrumentation net Mathieu Desnoyers
@ 2008-03-27 13:21 ` Mathieu Desnoyers
  2008-03-27 15:40 ` [patch for 2.6.26 0/7] Architecture Independent Markers Ingo Molnar
  2008-03-28  0:01 ` Denys Vlasenko
  8 siblings, 0 replies; 29+ messages in thread
From: Mathieu Desnoyers @ 2008-03-27 13:21 UTC (permalink / raw)
  To: akpm, linux-kernel; +Cc: Mike Mason, David Wilder, Mathieu Desnoyers

[-- Attachment #1: lttng-instrumentation-lib.patch --]
[-- Type: text/plain, Size: 3059 bytes --]

(Originally: Add Markers into Semaphore Primitives)

This patch adds several markers around semaphore primitives.
Along with a tracing application, this patch can be useful for measuring
kernel semaphore usage and contention.

Added markers:

sem_down
sem_down_intr
sem_down_intr_fail
sem_down_intr_resume
sem_down_intr_sched
sem_down_resume
sem_down_sched
sem_up
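
As a rough illustration of contention measurement, a probe on
sem_down_sched can count how often __down() actually had to block (sketch
only; a real tracer would record per-semaphore timestamps rather than a
global counter):

#include <linux/kernel.h>
#include <linux/marker.h>
#include <asm/atomic.h>

static atomic_t sem_sleep_count = ATOMIC_INIT(0);

/* Fires each time a task blocks in __down(); marker format is "%p". */
static void probe_sem_down_sched(void *probe_private, void *call_private,
				 const char *fmt, va_list *args)
{
	void *sem = va_arg(*args, void *);

	atomic_inc(&sem_sleep_count);
	(void)sem;	/* a real probe would key its statistics on this */
}

/* Registered with:
 *	marker_probe_register("sem_down_sched", "%p",
 *			      probe_sem_down_sched, NULL);
 */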

Signed-off-by: Mike Mason <mmlnx@us.ibm.com>
Signed-off-by: David Wilder <dwilder@us.ibm.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
---
 lib/semaphore-sleepers.c |    9 +++++++++
 1 file changed, 9 insertions(+)

Index: linux-2.6-lttng/lib/semaphore-sleepers.c
===================================================================
--- linux-2.6-lttng.orig/lib/semaphore-sleepers.c	2008-03-27 08:08:37.000000000 -0400
+++ linux-2.6-lttng/lib/semaphore-sleepers.c	2008-03-27 08:08:56.000000000 -0400
@@ -15,6 +15,7 @@
 #include <linux/sched.h>
 #include <linux/err.h>
 #include <linux/init.h>
+#include <linux/marker.h>
 #include <asm/semaphore.h>
 
 /*
@@ -50,6 +51,7 @@
 
 void __up(struct semaphore *sem)
 {
+	trace_mark(sem_up, "%p", sem);
 	wake_up(&sem->wait);
 }
 
@@ -59,6 +61,7 @@ void __sched __down(struct semaphore *se
 	DECLARE_WAITQUEUE(wait, tsk);
 	unsigned long flags;
 
+	trace_mark(sem_down, "%p", sem);
 	tsk->state = TASK_UNINTERRUPTIBLE;
 	spin_lock_irqsave(&sem->wait.lock, flags);
 	add_wait_queue_exclusive_locked(&sem->wait, &wait);
@@ -73,12 +76,14 @@ void __sched __down(struct semaphore *se
 		 * the wait_queue_head.
 		 */
 		if (!atomic_add_negative(sleepers - 1, &sem->count)) {
+			trace_mark(sem_down_resume, "%p", sem);
 			sem->sleepers = 0;
 			break;
 		}
 		sem->sleepers = 1;	/* us - see -1 above */
 		spin_unlock_irqrestore(&sem->wait.lock, flags);
 
+		trace_mark(sem_down_sched, "%p", sem);
 		schedule();
 
 		spin_lock_irqsave(&sem->wait.lock, flags);
@@ -97,6 +102,7 @@ int __sched __down_interruptible(struct 
 	DECLARE_WAITQUEUE(wait, tsk);
 	unsigned long flags;
 
+	trace_mark(sem_down_intr, "%p", sem);
 	tsk->state = TASK_INTERRUPTIBLE;
 	spin_lock_irqsave(&sem->wait.lock, flags);
 	add_wait_queue_exclusive_locked(&sem->wait, &wait);
@@ -113,6 +119,7 @@ int __sched __down_interruptible(struct 
 		 * and exit.
 		 */
 		if (signal_pending(current)) {
+			trace_mark(sem_down_intr_fail, "%p", sem);
 			retval = -EINTR;
 			sem->sleepers = 0;
 			atomic_add(sleepers, &sem->count);
@@ -126,12 +133,14 @@ int __sched __down_interruptible(struct 
 		 * still hoping to get the semaphore.
 		 */
 		if (!atomic_add_negative(sleepers - 1, &sem->count)) {
+			trace_mark(sem_down_intr_resume, "%p", sem);
 			sem->sleepers = 0;
 			break;
 		}
 		sem->sleepers = 1;	/* us - see -1 above */
 		spin_unlock_irqrestore(&sem->wait.lock, flags);
 
+		trace_mark(sem_down_intr_sched, "%p", sem);
 		schedule();
 
 		spin_lock_irqsave(&sem->wait.lock, flags);

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch for 2.6.26 0/7] Architecture Independent Markers
  2008-03-27 13:20 [patch for 2.6.26 0/7] Architecture Independent Markers Mathieu Desnoyers
                   ` (6 preceding siblings ...)
  2008-03-27 13:21 ` [patch for 2.6.26 7/7] LTTng instrumentation - lib Mathieu Desnoyers
@ 2008-03-27 15:40 ` Ingo Molnar
  2008-03-27 17:08   ` KOSAKI Motohiro
                     ` (2 more replies)
  2008-03-28  0:01 ` Denys Vlasenko
  8 siblings, 3 replies; 29+ messages in thread
From: Ingo Molnar @ 2008-03-27 15:40 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: akpm, linux-kernel, Linus Torvalds


* Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:

> Hi Andrew,
> 
> After a few RFC rounds, I propose these markers for 2.6.26. They 
> include work done after comments from the memory management community. 
> Most of them have been used by the LTTng project for about 2 years.

very strong NACK. When markers went into 2.6.24 i initially believed 
your claims that my earlier objections about markers have been resolved 
and that it's a lightweight, useful facility.

so we optimistically used markers in ftrace (see sched-devel.git) for 
the scheduler, and i was shocked about marker impact:

just 3 ftrace markers in the scheduler plus their support code bloated 
the kernel by 5k (!), 288 bytes for only 3 markers in the scheduler 
itself, the rest in support code to manage the markers - that's 96 bytes 
added per every marker (44 (!) bytes of that in the fastpath!).

44 bytes per marker in the fastpath is _NOT_ acceptable in any way, shape 
or form. Those 3 limited markers have the same cache cost as adding 
mcount callbacks for dyn-ftrace to the _whole_ scheduler ...

as i told you many, many moons ago, repeatedly: acceptable cost is a 5 
bytes callout that is patched to a NOP, and _maybe_ a little bit more to 
prepare parameters for the function calls. Not 44 bytes. Not 96 bytes. 
Not 5K total cost. Paravirt ops are super-lightweight in comparison.

and this stuff _can_ be done sanely and cheaply and in fact we have done 
it: see ftrace in sched-devel.git, and compare its cost.

see further details in the tongue-in-cheek commit below.

	Ingo

--------------------->
Subject: sched: reintroduce markers
From: Ingo Molnar <mingo@elte.hu>

Scheduler markers are seriously bloated - so lets add them:

      text    data     bss     dec     hex filename
   7209664  852020 1634872 9696556  93f52c vmlinux.before
   7214679  852188 1634872 9701739  94096b vmlinux.after

5K more bloat!

but that's not just bloat in the slowpath and tracer bloat, it's
also fastpath bloat:

     text    data     bss     dec     hex filename
    37642    7014     384   45040    aff0 sched.o.before
    37930    7134     384   45448    b188 sched.o.after

288 bytes for only 3 markers in the scheduler - that's 96 bytes added
per every marker (44 bytes of that in the fastpath).

and the tracer codepath is now slower and more complex as well.

A classic lose-lose situation, so let's apply this patch. Not!

NACK-ed-by: Ingo Molnar <mingo@elte.hu>
---
 include/linux/sched.h             |   36 --------
 kernel/sched.c                    |   11 +-
 kernel/trace/trace.h              |   12 --
 kernel/trace/trace_functions.c    |    2 
 kernel/trace/trace_sched_switch.c |  163 +++++++++++++++++++++++++++++++++++---
 kernel/trace/trace_sched_wakeup.c |  146 +++++++++++++++++++++++++++++++---
 6 files changed, 299 insertions(+), 71 deletions(-)

Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -2031,42 +2031,6 @@ extern int sched_mc_power_savings, sched
 
 extern void normalize_rt_tasks(void);
 
-#ifdef CONFIG_CONTEXT_SWITCH_TRACER
-extern void
-ftrace_ctx_switch(struct task_struct *prev, struct task_struct *next);
-#else
-static inline void
-ftrace_ctx_switch(struct task_struct *prev, struct task_struct *next)
-{
-}
-#endif
-
-#ifdef CONFIG_SCHED_TRACER
-extern void
-ftrace_wake_up_task(struct task_struct *wakee, struct task_struct *curr);
-extern void
-ftrace_wake_up_new_task(struct task_struct *wakee, struct task_struct *curr);
-#else
-static inline void
-ftrace_wake_up_task(struct task_struct *wakee, struct task_struct *curr)
-{
-}
-static inline void
-ftrace_wake_up_new_task(struct task_struct *wakee, struct task_struct *curr)
-{
-}
-#endif
-
-#ifdef CONFIG_CONTEXT_SWITCH_TRACER
-extern void
-ftrace_ctx_switch(struct task_struct *prev, struct task_struct *next);
-#else
-static inline void
-ftrace_ctx_switch(struct task_struct *prev, struct task_struct *next)
-{
-}
-#endif
-
 #ifdef CONFIG_FAIR_GROUP_SCHED
 
 extern struct task_group init_task_group;
Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -1886,7 +1886,9 @@ static int try_to_wake_up(struct task_st
 
 out_activate:
 #endif /* CONFIG_SMP */
-	ftrace_wake_up_task(p, rq->curr);
+	trace_mark(kernel_sched_wakeup,
+		   "p %p rq->curr %p",
+		   p, rq->curr);
 
 	schedstat_inc(p, se.nr_wakeups);
 	if (sync)
@@ -2028,7 +2030,9 @@ void fastcall wake_up_new_task(struct ta
 		p->sched_class->task_new(rq, p);
 		inc_nr_running(rq);
 	}
-	ftrace_wake_up_new_task(p, rq->curr);
+	trace_mark(kernel_sched_wakeup_new,
+		   "p %p rq->curr %p",
+		   p, rq->curr);
 
 	check_preempt_curr(rq, p);
 #ifdef CONFIG_SMP
@@ -2202,7 +2206,8 @@ context_switch(struct rq *rq, struct tas
 	struct mm_struct *mm, *oldmm;
 
 	prepare_task_switch(rq, prev, next);
-	ftrace_ctx_switch(prev, next);
+	trace_mark(kernel_sched_schedule,
+		   "prev %p next %p", prev, next);
 	mm = next->mm;
 	oldmm = prev->active_mm;
 	/*
Index: linux/kernel/trace/trace.h
===================================================================
--- linux.orig/kernel/trace/trace.h
+++ linux/kernel/trace/trace.h
@@ -134,6 +134,8 @@ void tracing_start_function_trace(void);
 void tracing_stop_function_trace(void);
 int register_tracer(struct tracer *type);
 void unregister_tracer(struct tracer *type);
+void tracing_start_sched_switch(void);
+void tracing_stop_sched_switch(void);
 
 extern int tracing_sched_switch_enabled;
 extern unsigned long tracing_max_latency;
@@ -148,16 +150,6 @@ static inline notrace cycle_t now(int cp
 	return cpu_clock(cpu);
 }
 
-#ifdef CONFIG_SCHED_TRACER
-extern void notrace
-wakeup_sched_switch(struct task_struct *prev, struct task_struct *next);
-#else
-static inline void
-wakeup_sched_switch(struct task_struct *prev, struct task_struct *next)
-{
-}
-#endif
-
 #ifdef CONFIG_CONTEXT_SWITCH_TRACER
 typedef void
 (*tracer_switch_func_t)(void *private,
Index: linux/kernel/trace/trace_functions.c
===================================================================
--- linux.orig/kernel/trace/trace_functions.c
+++ linux/kernel/trace/trace_functions.c
@@ -30,10 +30,12 @@ static notrace void start_function_trace
 {
 	function_reset(tr);
 	tracing_start_function_trace();
+	tracing_start_sched_switch();
 }
 
 static notrace void stop_function_trace(struct trace_array *tr)
 {
+	tracing_stop_sched_switch();
 	tracing_stop_function_trace();
 }
 
Index: linux/kernel/trace/trace_sched_switch.c
===================================================================
--- linux.orig/kernel/trace/trace_sched_switch.c
+++ linux/kernel/trace/trace_sched_switch.c
@@ -16,12 +16,17 @@
 
 static struct trace_array	*ctx_trace;
 static int __read_mostly	tracer_enabled;
+static atomic_t			sched_ref;
 int __read_mostly		tracing_sched_switch_enabled;
 
+static DEFINE_SPINLOCK(sched_switch_func_lock);
+
 static void notrace
-ctx_switch_func(struct task_struct *prev, struct task_struct *next)
+sched_switch_func(void *private, struct task_struct *prev,
+		  struct task_struct *next)
 {
-	struct trace_array *tr = ctx_trace;
+	struct trace_array **ptr = private;
+	struct trace_array *tr = *ptr;
 	struct trace_array_cpu *data;
 	unsigned long flags;
 	long disabled;
@@ -42,20 +47,115 @@ ctx_switch_func(struct task_struct *prev
 	raw_local_irq_restore(flags);
 }
 
-void ftrace_ctx_switch(struct task_struct *prev, struct task_struct *next)
+static struct tracer_switch_ops sched_switch_ops __read_mostly =
+{
+	.func		= sched_switch_func,
+	.private	= &ctx_trace,
+};
+
+static tracer_switch_func_t __read_mostly
+	tracer_switch_func = sched_switch_func;
+
+static struct tracer_switch_ops __read_mostly
+	*tracer_switch_func_ops = &sched_switch_ops;
+
+static void notrace
+sched_switch_func_loop(void *private, struct task_struct *prev,
+		       struct task_struct *next)
 {
-	tracing_record_cmdline(prev);
+	struct tracer_switch_ops *ops = tracer_switch_func_ops;
+
+	/* take care of alpha acting funny */
+	read_barrier_depends();
 
+	for (; ops != NULL; ops = ops->next) {
+		/* silly alpha */
+		read_barrier_depends();
+		ops->func(ops->private, prev, next);
+	}
+}
+
+notrace int register_tracer_switch(struct tracer_switch_ops *ops)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&sched_switch_func_lock, flags);
+	ops->next = tracer_switch_func_ops;
 	/*
-	 * If tracer_switch_func only points to the local
-	 * switch func, it still needs the ptr passed to it.
+	 * We are entering ops into the switch list but another
+	 * CPU might be walking that list. We need to make sure
+	 * the ops->next pointer is valid before another CPU sees
+	 * the ops pointer included into the switch list.
 	 */
-	ctx_switch_func(prev, next);
+	smp_wmb();
+	tracer_switch_func_ops = ops;
+
+	if (ops->next == &sched_switch_ops)
+		tracer_switch_func = sched_switch_func_loop;
+
+	spin_unlock_irqrestore(&sched_switch_func_lock, flags);
+
+	return 0;
+}
+
+notrace int unregister_tracer_switch(struct tracer_switch_ops *ops)
+{
+	unsigned long flags;
+	struct tracer_switch_ops **p = &tracer_switch_func_ops;
+	int ret;
+
+	spin_lock_irqsave(&sched_switch_func_lock, flags);
 
 	/*
-	 * Chain to the wakeup tracer (this is a NOP if disabled):
+	 * If the sched_switch is the only one left, then
+	 *  only call that function.
 	 */
-	wakeup_sched_switch(prev, next);
+	if (*p == ops && ops->next == &sched_switch_ops) {
+		tracer_switch_func = sched_switch_func;
+		tracer_switch_func_ops = &sched_switch_ops;
+		goto out;
+	}
+
+	for (; *p != &sched_switch_ops; p = &(*p)->next)
+		if (*p == ops)
+			break;
+
+	if (*p != ops) {
+		ret = -1;
+		goto out;
+	}
+
+	*p = (*p)->next;
+
+ out:
+	spin_unlock_irqrestore(&sched_switch_func_lock, flags);
+
+	return 0;
+}
+
+static notrace void
+sched_switch_callback(const struct marker *mdata, void *private_data,
+		      const char *format, ...)
+{
+	struct task_struct *prev;
+	struct task_struct *next;
+	va_list ap;
+
+	if (!atomic_read(&sched_ref))
+		return;
+
+	tracing_record_cmdline(current);
+
+	va_start(ap, format);
+	prev = va_arg(ap, typeof(prev));
+	next = va_arg(ap, typeof(next));
+	va_end(ap);
+
+	/*
+	 * If tracer_switch_func only points to the local
+	 * switch func, it still needs the ptr passed to it.
+	 */
+	tracer_switch_func(mdata->private, prev, next);
 }
 
 static notrace void sched_switch_reset(struct trace_array *tr)
@@ -72,10 +172,12 @@ static notrace void start_sched_trace(st
 {
 	sched_switch_reset(tr);
 	tracer_enabled = 1;
+	tracing_start_sched_switch();
 }
 
 static notrace void stop_sched_trace(struct trace_array *tr)
 {
+	tracing_stop_sched_switch();
 	tracer_enabled = 0;
 }
 
@@ -110,6 +212,35 @@ static struct tracer sched_switch_trace 
 	.ctrl_update	= sched_switch_trace_ctrl_update,
 };
 
+static int tracing_sched_arm(void)
+{
+	int ret;
+
+	ret = marker_arm("kernel_sched_schedule");
+	if (ret)
+		pr_info("sched trace: Couldn't arm probe switch_to\n");
+
+	return ret;
+}
+
+void tracing_start_sched_switch(void)
+{
+	long ref;
+
+	ref = atomic_inc_return(&sched_ref);
+	if (tracing_sched_switch_enabled && ref == 1)
+		tracing_sched_arm();
+}
+
+void tracing_stop_sched_switch(void)
+{
+	long ref;
+
+	ref = atomic_dec_and_test(&sched_ref);
+	if (tracing_sched_switch_enabled && ref)
+		marker_disarm("kernel_sched_schedule");
+}
+
 __init static int init_sched_switch_trace(void)
 {
 	int ret;
@@ -118,8 +249,22 @@ __init static int init_sched_switch_trac
 	if (ret)
 		return ret;
 
+	ret = marker_probe_register("kernel_sched_schedule",
+				    "prev %p next %p",
+				    sched_switch_callback,
+				    &ctx_trace);
+	if (ret) {
+		pr_info("sched trace: Couldn't add marker"
+			" probe to switch_to\n");
+		goto out;
+	}
+
+	if (atomic_read(&sched_ref))
+		ret = tracing_sched_arm();
+
 	tracing_sched_switch_enabled = 1;
 
+out:
 	return ret;
 }
 device_initcall(init_sched_switch_trace);
Index: linux/kernel/trace/trace_sched_wakeup.c
===================================================================
--- linux.orig/kernel/trace/trace_sched_wakeup.c
+++ linux/kernel/trace/trace_sched_wakeup.c
@@ -44,12 +44,66 @@ static int notrace report_latency(cycle_
 	return 1;
 }
 
-void notrace
-wakeup_sched_switch(struct task_struct *prev, struct task_struct *next)
+#ifdef CONFIG_FTRACE
+/* irqsoff uses its own function trace to keep the overhead down */
+static void notrace
+wakeup_tracer_call(unsigned long ip, unsigned long parent_ip)
 {
-	unsigned long latency = 0, t0 = 0, t1 = 0;
 	struct trace_array *tr = wakeup_trace;
 	struct trace_array_cpu *data;
+	unsigned long flags;
+	long disabled;
+	int resched;
+	int tcpu;
+	int cpu;
+
+	if (likely(!tracer_enabled) || !tr)
+		return;
+
+	resched = need_resched();
+	preempt_disable_notrace();
+
+	cpu = raw_smp_processor_id();
+
+	data = tr->data[cpu];
+	disabled = atomic_inc_return(&data->disabled);
+	if (likely(disabled != 1))
+		goto out;
+
+	spin_lock_irqsave(&wakeup_lock, flags);
+	if (wakeup_task)
+		tcpu = task_cpu(wakeup_task);
+	else
+		tcpu = -1;
+	spin_unlock_irqrestore(&wakeup_lock, flags);
+
+	if (cpu == tcpu)
+		ftrace(tr, data, ip, parent_ip, flags);
+
+out:
+	atomic_dec(&data->disabled);
+
+	/* prevent recursion with scheduler */
+	if (resched)
+		preempt_enable_no_resched_notrace();
+	else
+		preempt_enable_notrace();
+}
+
+static struct ftrace_ops trace_ops __read_mostly =
+{
+	.func		= wakeup_tracer_call,
+};
+#endif /* CONFIG_FTRACE */
+
+static void notrace
+wakeup_sched_switch(void *private, struct task_struct *prev,
+		    struct task_struct *next)
+{
+	unsigned long latency = 0, t0 = 0, t1 = 0;
+	struct trace_array **ptr = private;
+	struct trace_array *tr = *ptr;
+	struct trace_array_cpu *data;
 	cycle_t T0, T1, delta;
 	unsigned long flags;
 	long disabled;
@@ -130,8 +184,14 @@ out_unlock:
 	spin_unlock_irqrestore(&wakeup_lock, flags);
 out:
 	atomic_dec(&tr->data[cpu]->disabled);
+
 }
 
+static struct tracer_switch_ops switch_ops __read_mostly = {
+	.func		= wakeup_sched_switch,
+	.private	= &wakeup_trace,
+};
+
 static void notrace __wakeup_reset(struct trace_array *tr)
 {
 	struct trace_array_cpu *data;
@@ -206,26 +266,51 @@ out:
 	atomic_dec(&tr->data[cpu]->disabled);
 }
 
-notrace void
-ftrace_wake_up_task(struct task_struct *wakee, struct task_struct *curr)
+static notrace void
+wake_up_callback(const struct marker *mdata, void *private_data,
+		 const char *format, ...)
 {
+	struct trace_array **ptr = mdata->private;
+	struct trace_array *tr = *ptr;
+	struct task_struct *curr;
+	struct task_struct *p;
+	va_list ap;
+
 	if (likely(!tracer_enabled))
 		return;
 
-	wakeup_check_start(wakeup_trace, wakee, curr);
-}
+	va_start(ap, format);
 
-notrace void
-ftrace_wake_up_new_task(struct task_struct *wakee, struct task_struct *curr)
-{
-	if (likely(!tracer_enabled))
-		return;
+	/* now get the meat: "p %p rq->curr %p" */
+	p = va_arg(ap, typeof(p));
+	curr = va_arg(ap, typeof(curr));
+
+	va_end(ap);
 
-	wakeup_check_start(wakeup_trace, wakee, curr);
+	wakeup_check_start(tr, p, curr);
 }
 
 static notrace void start_wakeup_tracer(struct trace_array *tr)
 {
+	int ret;
+
+	ret = marker_arm("kernel_sched_wakeup");
+	if (ret) {
+		pr_info("wakeup trace: Couldn't arm probe"
+			" kernel_sched_wakeup\n");
+		return;
+	}
+
+	ret = marker_arm("kernel_sched_wakeup_new");
+	if (ret) {
+		pr_info("wakeup trace: Couldn't arm probe"
+			" kernel_sched_wakeup_new\n");
+		goto out;
+	}
+
+	register_tracer_switch(&switch_ops);
+	tracing_start_sched_switch();
+
 	wakeup_reset(tr);
 
 	/*
@@ -238,13 +323,22 @@ static notrace void start_wakeup_tracer(
 	smp_wmb();
 
 	tracer_enabled = 1;
+	register_ftrace_function(&trace_ops);
 
 	return;
+
+out:
+	marker_disarm("kernel_sched_wakeup");
 }
 
 static notrace void stop_wakeup_tracer(struct trace_array *tr)
 {
 	tracer_enabled = 0;
+	tracing_stop_sched_switch();
+	unregister_tracer_switch(&switch_ops);
+	marker_disarm("kernel_sched_wakeup");
+	marker_disarm("kernel_sched_wakeup_new");
+	unregister_ftrace_function(&trace_ops);
 }
 
 static notrace void wakeup_tracer_init(struct trace_array *tr)
@@ -305,6 +399,32 @@ __init static int init_wakeup_tracer(voi
 	if (ret)
 		return ret;
 
+	ret = marker_probe_register("kernel_sched_wakeup",
+				    "p %p rq->curr %p",
+				    wake_up_callback,
+				    &wakeup_trace);
+	if (ret) {
+		pr_info("wakeup trace: Couldn't add marker"
+			" probe to kernel_sched_wakeup\n");
+		goto fail;
+	}
+
+	ret = marker_probe_register("kernel_sched_wakeup_new",
+				    "p %p rq->curr %p",
+				    wake_up_callback,
+				    &wakeup_trace);
+	if (ret) {
+		pr_info("wakeup trace: Couldn't add marker"
+			" probe to kernel_sched_wakeup_new\n");
+		goto fail_deprobe;
+	}
+
+	return 0;
+
+fail_deprobe:
+	marker_probe_unregister("kernel_sched_wakeup");
+fail:
+	unregister_tracer(&wakeup_tracer);
 	return 0;
 }
 device_initcall(init_wakeup_tracer);

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch for 2.6.26 0/7] Architecture Independent Markers
  2008-03-27 15:40 ` [patch for 2.6.26 0/7] Architecture Independent Markers Ingo Molnar
@ 2008-03-27 17:08   ` KOSAKI Motohiro
  2008-03-28 10:15     ` Ingo Molnar
  2008-03-27 20:39   ` Mathieu Desnoyers
  2008-03-27 21:49   ` Frank Ch. Eigler
  2 siblings, 1 reply; 29+ messages in thread
From: KOSAKI Motohiro @ 2008-03-27 17:08 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Mathieu Desnoyers, akpm, linux-kernel, Linus Torvalds

Hi

>      text    data     bss     dec     hex filename
>     37642    7014     384   45040    aff0 sched.o.before
>     37930    7134     384   45448    b188 sched.o.after
>
>  288 bytes for only 3 markers in the scheduler - that's 96 bytes added
>  per every marker (44 bytes of that in the fastpath).
>
>  and the tracer codepath is now slower and more complex as well.
>
>  A classic lose-lose situation, so let's apply this patch. Not!
>
>  NACK-ed-by: Ingo Molnar <mingo@elte.hu>

I wonder why nobody has fixed this serious problem until now,
and I am interested in the differences between ftrace and markers.

If I have time next week, I will investigate the marker problem.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch for 2.6.26 0/7] Architecture Independent Markers
  2008-03-27 15:40 ` [patch for 2.6.26 0/7] Architecture Independent Markers Ingo Molnar
  2008-03-27 17:08   ` KOSAKI Motohiro
@ 2008-03-27 20:39   ` Mathieu Desnoyers
  2008-03-28  9:43     ` Ingo Molnar
  2008-03-28 13:33     ` Ingo Molnar
  2008-03-27 21:49   ` Frank Ch. Eigler
  2 siblings, 2 replies; 29+ messages in thread
From: Mathieu Desnoyers @ 2008-03-27 20:39 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: akpm, linux-kernel, Linus Torvalds

* Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> 
> > Hi Andrew,
> > 
> > After a few RFC rounds, I propose these markers for 2.6.26. They 
> > include work done after comments from the memory management community. 
> > Most of them have been used by the LTTng project for about 2 years.
> 
> very strong NACK. When markers went into 2.6.24 i initially believed 
> your claims that my earlier objections about markers have been resolved 
> and that it's a lightweight, useful facility.
> 
> so we optimistically used markers in ftrace (see sched-devel.git) for 
> the scheduler, and i was shocked about marker impact:
> 
> just 3 ftrace markers in the scheduler plus their support code bloated 
> the kernel by 5k (!), 288 bytes for only 3 markers in the scheduler 
> itself, the rest in support code to manage the markers - that's 96 bytes 
> added per every marker (44 (!) bytes of that in the fastpath!).
> 
> 44 bytes per marker in the fastpath is _NOT_ acceptable in any way, shape 
> or form. Those 3 limited markers have the same cache cost as adding 
> mcount callbacks for dyn-ftrace to the _whole_ scheduler ...
> 
> as i told you many, many moons ago, repeatedly: acceptable cost is a 5 
> bytes callout that is patched to a NOP, and _maybe_ a little bit more to 
> prepare parameters for the function calls. Not 44 bytes. Not 96 bytes. 
> Not 5K total cost. Paravirt ops are super-lightweight in comparison.
> 
> and this stuff _can_ be done sanely and cheaply and in fact we have done 
> it: see ftrace in sched-devel.git, and compare its cost.
> 
> see further details in the tongue-in-cheek commit below.
> 

Hi Ingo,

Let's compare one marker against one ftrace statement in sched.o on the
sched-dev tree on x86_32 and see where your "bloat" impression about markers
comes from. I think it's mostly due to the different metrics we use.


sched.o w/o CONFIG_CONTEXT_SWITCH_TRACER
   text    data     bss     dec     hex filename
  46564    2924     200   49688    c218 kernel/sched.o

Let's get an idea of CONFIG_CONTEXT_SWITCH_TRACER impact on sched.o :

sched.o with CONFIG_CONTEXT_SWITCH_TRACER

  text    data     bss     dec     hex filename
  46788    2924     200   49912    c2f8 kernel/sched.o

224 bytes added for 6 ftrace_*(). This is partly due to the helper function
ftrace_all_fair_tasks(). So let's be fair and not take it into account.

Now, the cost of a single ftrace_*(), with all the others commented out,
leaving this one :

static inline void
context_switch(struct rq *rq, struct task_struct *prev,
               struct task_struct *next)
{
        struct mm_struct *mm, *oldmm;

        prepare_task_switch(rq, prev, next);
        ftrace_ctx_switch(rq, prev, next);
...

   text    data     bss     dec     hex filename
  46644    2924     200   49768    c268 kernel/sched.o

Commenting this one out :

   text    data     bss     dec     hex filename
  46628    2924     200   49752    c258 kernel/sched.o

For an extra 16 bytes (13 + alignment).

Due to this addition to schedule fast path :

        movl    %ebx, %ecx
        movl    -48(%ebp), %edx
        movl    -40(%ebp), %eax
        call    ftrace_ctx_switch

corresponding to :

 38c:   89 d9                   mov    %ebx,%ecx
 38e:   8b 55 d0                mov    -0x30(%ebp),%edx
 391:   8b 45 d8                mov    -0x28(%ebp),%eax
 394:   e8 fc ff ff ff          call   395 <schedule+0x12c>

This adds 13 bytes to the fast path. It reads the stack to populate the
registers even when the code is dynamically disabled. The size of this code
depends directly on the number of parameters passed to the tracer, and it would
also have to dereference pointers from memory if some of the data were not
present on the stack. All of this runs even when tracing is disabled. I suppose
you patch in a no-op instead of the call to dynamically disable it.

Changing this to a trace_mark :

        trace_mark(ctx_switch, "rq %p prev %p next %p", rq, prev, next);

Adds this to schedule fast path :
(this is without immediate values)

        cmpb    $0, __mark_ctx_switch.33881+8
        jne     .L2164

corresponding to :

 38c:   80 3d 08 00 00 00 00    cmpb   $0x0,0x8
 393:   0f 85 0c 03 00 00       jne    6a5 <schedule+0x43c>

(13 bytes in the fast path, including a memory reference)

With immediate values optimization, we do better :

        mov $0,%al
        testb   %al, %al
        jne     .L2164

Corresponding to :

 389:   b0 00                   mov    $0x0,%al
 38b:   84 c0                   test   %al,%al
 38d:   0f 85 0c 03 00 00       jne    69f <schedule+0x436>

(10 bytes in the fast path instead of 13, and we remove any memory reference)


Near the end of schedule, we find the jump target :


.L2164:
        movl    %ebx, 20(%esp)
        movl    -48(%ebp), %edx
        movl    %edx, 16(%esp)
        movl    %ecx, 12(%esp)
        movl    $.LC108, 8(%esp)
        movl    $0, 4(%esp)
        movl    $__mark_ctx_switch.33881, (%esp)
        call    *__mark_ctx_switch.33881+12
        jmp     .L2126


 6a5:   89 5c 24 14             mov    %ebx,0x14(%esp)
 6a9:   8b 55 d0                mov    -0x30(%ebp),%edx
 6ac:   89 54 24 10             mov    %edx,0x10(%esp)
 6b0:   89 4c 24 0c             mov    %ecx,0xc(%esp)
 6b4:   c7 44 24 08 f7 04 00    movl   $0x4f7,0x8(%esp)
 6bb:   00 
 6bc:   c7 44 24 04 00 00 00    movl   $0x0,0x4(%esp)
 6c3:   00 
 6c4:   c7 04 24 00 00 00 00    movl   $0x0,(%esp)
 6cb:   ff 15 0c 00 00 00       call   *0xc
 6d1:   e9 c3 fc ff ff          jmp    399 <schedule+0x130>

Which adds an extra 50 bytes.

With immediate values optimization and the marker, we have a total size of

   text    data     bss     dec     hex filename
  46767    2956     200   49923    c303 kernel/sched.o

and, as a baseline, without the marker :

   text    data     bss     dec     hex filename
  46638    2924     200   49762    c262 kernel/sched.o

We add 129 bytes of text here, which does not balance: we should only be
adding about 60 bytes. I suspect some code alignment is the cause. Let's look
at the size of schedule() instead, since this is the only code I touch :

With immediate values optimization, with the marker :
00000269 <schedule>
...
0000086c <cond_resched_softirq>
1539 bytes

And without the marker :
00000269 <schedule>
...
0000082d <cond_resched_softirq>
1476 bytes

For an added 63 bytes to schedule, which balances modulo some alignment.

If we look at the surrounding of the added 50 bytes (label .L2164) at the end of
schedule(), we see the assembly :

....
.L2103:
        movl    -32(%ebp), %eax
        testl   %eax, %eax
        je      .L2101
        movl    $0, 68(%esi)
        jmp     .L2089
.L2106:
        movl    $0, -32(%ebp)
        .p2align 4,,3
        jmp     .L2089
.L2164:
        movl    %ebx, 20(%esp)
        movl    -48(%ebp), %edx
        movl    %edx, 16(%esp)
        movl    %ecx, 12(%esp)
        movl    $.LC108, 8(%esp)
        movl    $0, 4(%esp)
        movl    $__mark_ctx_switch.33909, (%esp)
        call    *__mark_ctx_switch.33909+12
        jmp     .L2126
.L2124:
        movl    -40(%ebp), %eax
        call    _spin_unlock_irq
        .p2align 4,,6
        jmp     .L2141
.L2161:
        movl    $1, %ecx
        movl    $2, %eax
        call    profile_hits
        .p2align 4,,4
        jmp     .L2069
.L2160:
        movl    -48(%ebp), %edx
        movl    192(%edx), %eax
        testl   %eax, %eax
        jne     .L2066
        movl    %edx, %eax
        call    __schedule_bug
        jmp     .L2066
....

These are all targets of "unlikely" branches. Therefore, the marker's
out-of-line code shares a cache line with these targets on architectures with
an associative L1 i-cache. I don't see how this could be considered "fast path".

So, for a 3-argument marker (with immediate values), the marker seems
to outperform ftrace on the following items :

- Adds 10 bytes to the fast path instead of 13.
- No memory read is required on the fast path when the marker is dynamically
  disabled.
- The added fast path code size does not depend on the number of parameters
  passed.
- The runtime cost, when dynamically disabled, does not depend on the number of
  parameters passed.

However, you are right in that the _total_ code size of the ftrace statement is
smaller, but since it is all located and executed in the fast path, even when
dynamically disabled, I don't see this as an overall improvement.

As for the code size required to handle the data afterward : it will be
amortized by a common infrastructure such as LTTng, where the same code will
translate the data received as parameters into a trace.

Regards,

Mathieu

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch for 2.6.26 0/7] Architecture Independent Markers
  2008-03-27 15:40 ` [patch for 2.6.26 0/7] Architecture Independent Markers Ingo Molnar
  2008-03-27 17:08   ` KOSAKI Motohiro
  2008-03-27 20:39   ` Mathieu Desnoyers
@ 2008-03-27 21:49   ` Frank Ch. Eigler
  2 siblings, 0 replies; 29+ messages in thread
From: Frank Ch. Eigler @ 2008-03-27 21:49 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Mathieu Desnoyers, akpm, linux-kernel, Linus Torvalds

Ingo Molnar <mingo@elte.hu> writes:

>> After a few RFC rounds, I propose these markers for 2.6.26. They 
>> include work done after comments from the memory management community. 
>> Most of them have been used by the LTTng project for about 2 years.
>
> [...]
> so we optimistically used markers in ftrace (see sched-devel.git) for 
> the scheduler, and i was shocked about marker impact:

Do you have any *time* measurements?


- FChE

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch for 2.6.26 0/7] Architecture Independent Markers
  2008-03-27 13:20 [patch for 2.6.26 0/7] Architecture Independent Markers Mathieu Desnoyers
                   ` (7 preceding siblings ...)
  2008-03-27 15:40 ` [patch for 2.6.26 0/7] Architecture Independent Markers Ingo Molnar
@ 2008-03-28  0:01 ` Denys Vlasenko
  2008-03-28  1:02   ` [PATCH] Markers - remove extra format argument Mathieu Desnoyers
  2008-03-28  1:04   ` [patch for 2.6.26 1/7] Markers - define non optimized marker (updated) Mathieu Desnoyers
  8 siblings, 2 replies; 29+ messages in thread
From: Denys Vlasenko @ 2008-03-28  0:01 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: akpm, linux-kernel

On Thursday 27 March 2008 14:20, Mathieu Desnoyers wrote:
> Hi Andrew,
> 
> After a few RFC rounds, I propose these markers for 2.6.26. They include
> work done after comments from the memory management community. Most of them have
> been used by the LTTng project for about 2 years.

Not in this patch, but I noticed:

#define __trace_mark(name, call_private, format, args...)               \
        do {                                                            \
                static const char __mstrtab_##name[]                    \
                __attribute__((section("__markers_strings")))           \
                = #name "\0" format;                                    \
                static struct marker __mark_##name                      \
                __attribute__((section("__markers"), aligned(8))) =     \
                { __mstrtab_##name, &__mstrtab_##name[sizeof(#name)],   \
                0, 0, marker_probe_cb,                                  \
                { __mark_empty_function, NULL}, NULL };                 \
                __mark_check_format(format, ## args);                   \
                if (unlikely(__mark_##name.state)) {                    \
                        (*__mark_##name.call)                           \
                                (&__mark_##name, call_private,          \
                                format, ## args);                       \
                }                                                       \
        } while (0)

In this call:

                        (*__mark_##name.call)                           \
                                (&__mark_##name, call_private,          \
                                format, ## args);                       \

you make gcc allocate duplicate format string. You can use
&__mstrtab_##name[sizeof(#name)] instead since it holds the same string,
or drop ", format," above and "const char *fmt" from here:

        void (*call)(const struct marker *mdata,        /* Probe wrapper */
                void *call_private, const char *fmt, ...);

since mdata->format is the same and all callees which need it can take it there.
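
For reference, the first alternative would amount to roughly this change to
the call site in the macro quoted above (a sketch of the idea only, not a
tested patch):

                if (unlikely(__mark_##name.state)) {                    \
                        (*__mark_##name.call)                           \
                                (&__mark_##name, call_private,          \
                                &__mstrtab_##name[sizeof(#name)],       \
                                ## args);                               \
                }                                                       \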
--
vda

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH] Markers - remove extra format argument
  2008-03-28  0:01 ` Denys Vlasenko
@ 2008-03-28  1:02   ` Mathieu Desnoyers
  2008-03-28  5:35     ` Masami Hiramatsu
  2008-03-28  1:04   ` [patch for 2.6.26 1/7] Markers - define non optimized marker (updated) Mathieu Desnoyers
  1 sibling, 1 reply; 29+ messages in thread
From: Mathieu Desnoyers @ 2008-03-28  1:02 UTC (permalink / raw)
  To: Denys Vlasenko; +Cc: akpm, linux-kernel

Denys Vlasenko <vda.linux@googlemail.com> :

> Not in this patch, but I noticed:
> 
> #define __trace_mark(name, call_private, format, args...)               \
>         do {                                                            \
>                 static const char __mstrtab_##name[]                    \
>                 __attribute__((section("__markers_strings")))           \
>                 = #name "\0" format;                                    \
>                 static struct marker __mark_##name                      \
>                 __attribute__((section("__markers"), aligned(8))) =     \
>                 { __mstrtab_##name, &__mstrtab_##name[sizeof(#name)],   \
>                 0, 0, marker_probe_cb,                                  \
>                 { __mark_empty_function, NULL}, NULL };                 \
>                 __mark_check_format(format, ## args);                   \
>                 if (unlikely(__mark_##name.state)) {                    \
>                         (*__mark_##name.call)                           \
>                                 (&__mark_##name, call_private,          \
>                                 format, ## args);                       \
>                 }                                                       \
>         } while (0)
> 
> In this call:
> 
>                         (*__mark_##name.call)                           \
>                                 (&__mark_##name, call_private,          \
>                                 format, ## args);                       \
> 
> you make gcc allocate duplicate format string. You can use
> &__mstrtab_##name[sizeof(#name)] instead since it holds the same string,
> or drop ", format," above and "const char *fmt" from here:
> 
>         void (*call)(const struct marker *mdata,        /* Probe wrapper */
>                 void *call_private, const char *fmt, ...);
> 
> since mdata->format is the same and all callees which need it can take it there.

Very good point. I actually thought about dropping it, since it would
remove an unnecessary argument from the stack. And actually, since I now
have the marker_probe_cb sitting between the marker site and the
callbacks, there is no API change required. Thanks :)

Mathieu

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Denys Vlasenko <vda.linux@googlemail.com>
---
 include/linux/marker.h |   11 +++++------
 kernel/marker.c        |   30 ++++++++++++++----------------
 2 files changed, 19 insertions(+), 22 deletions(-)

Index: linux-2.6-lttng/include/linux/marker.h
===================================================================
--- linux-2.6-lttng.orig/include/linux/marker.h	2008-03-27 20:51:34.000000000 -0400
+++ linux-2.6-lttng/include/linux/marker.h	2008-03-27 20:54:55.000000000 -0400
@@ -44,8 +44,8 @@ struct marker {
 				 */
 	char state;		/* Marker state. */
 	char ptype;		/* probe type : 0 : single, 1 : multi */
-	void (*call)(const struct marker *mdata,	/* Probe wrapper */
-		void *call_private, const char *fmt, ...);
+				/* Probe wrapper */
+	void (*call)(const struct marker *mdata, void *call_private, ...);
 	struct marker_probe_closure single;
 	struct marker_probe_closure *multi;
 } __attribute__((aligned(8)));
@@ -72,8 +72,7 @@ struct marker {
 		__mark_check_format(format, ## args);			\
 		if (unlikely(__mark_##name.state)) {			\
 			(*__mark_##name.call)				\
-				(&__mark_##name, call_private,		\
-				format, ## args);			\
+				(&__mark_##name, call_private, ## args);\
 		}							\
 	} while (0)
 
@@ -117,9 +116,9 @@ static inline void __printf(1, 2) ___mar
 extern marker_probe_func __mark_empty_function;
 
 extern void marker_probe_cb(const struct marker *mdata,
-	void *call_private, const char *fmt, ...);
+	void *call_private, ...);
 extern void marker_probe_cb_noarg(const struct marker *mdata,
-	void *call_private, const char *fmt, ...);
+	void *call_private, ...);
 
 /*
  * Connect a probe to a marker.
Index: linux-2.6-lttng/kernel/marker.c
===================================================================
--- linux-2.6-lttng.orig/kernel/marker.c	2008-03-27 20:52:09.000000000 -0400
+++ linux-2.6-lttng/kernel/marker.c	2008-03-27 20:56:13.000000000 -0400
@@ -54,8 +54,8 @@ static DEFINE_MUTEX(markers_mutex);
 struct marker_entry {
 	struct hlist_node hlist;
 	char *format;
-	void (*call)(const struct marker *mdata,	/* Probe wrapper */
-		void *call_private, const char *fmt, ...);
+			/* Probe wrapper */
+	void (*call)(const struct marker *mdata, void *call_private, ...);
 	struct marker_probe_closure single;
 	struct marker_probe_closure *multi;
 	int refcount;	/* Number of times armed. 0 if disarmed. */
@@ -90,15 +90,13 @@ EXPORT_SYMBOL_GPL(__mark_empty_function)
  * marker_probe_cb Callback that prepares the variable argument list for probes.
  * @mdata: pointer of type struct marker
  * @call_private: caller site private data
- * @fmt: format string
  * @...:  Variable argument list.
  *
  * Since we do not use "typical" pointer based RCU in the 1 argument case, we
  * need to put a full smp_rmb() in this branch. This is why we do not use
  * rcu_dereference() for the pointer read.
  */
-void marker_probe_cb(const struct marker *mdata, void *call_private,
-	const char *fmt, ...)
+void marker_probe_cb(const struct marker *mdata, void *call_private, ...)
 {
 	va_list args;
 	char ptype;
@@ -119,8 +117,9 @@ void marker_probe_cb(const struct marker
 		/* Must read the ptr before private data. They are not data
 		 * dependant, so we put an explicit smp_rmb() here. */
 		smp_rmb();
-		va_start(args, fmt);
-		func(mdata->single.probe_private, call_private, fmt, &args);
+		va_start(args, call_private);
+		func(mdata->single.probe_private, call_private, mdata->format,
+			&args);
 		va_end(args);
 	} else {
 		struct marker_probe_closure *multi;
@@ -135,9 +134,9 @@ void marker_probe_cb(const struct marker
 		smp_read_barrier_depends();
 		multi = mdata->multi;
 		for (i = 0; multi[i].func; i++) {
-			va_start(args, fmt);
-			multi[i].func(multi[i].probe_private, call_private, fmt,
-				&args);
+			va_start(args, call_private);
+			multi[i].func(multi[i].probe_private, call_private,
+				mdata->format, &args);
 			va_end(args);
 		}
 	}
@@ -149,13 +148,11 @@ EXPORT_SYMBOL_GPL(marker_probe_cb);
  * marker_probe_cb Callback that does not prepare the variable argument list.
  * @mdata: pointer of type struct marker
  * @call_private: caller site private data
- * @fmt: format string
  * @...:  Variable argument list.
  *
  * Should be connected to markers "MARK_NOARGS".
  */
-void marker_probe_cb_noarg(const struct marker *mdata,
-	void *call_private, const char *fmt, ...)
+void marker_probe_cb_noarg(const struct marker *mdata, void *call_private, ...)
 {
 	va_list args;	/* not initialized */
 	char ptype;
@@ -171,7 +168,8 @@ void marker_probe_cb_noarg(const struct 
 		/* Must read the ptr before private data. They are not data
 		 * dependant, so we put an explicit smp_rmb() here. */
 		smp_rmb();
-		func(mdata->single.probe_private, call_private, fmt, &args);
+		func(mdata->single.probe_private, call_private, mdata->format,
+			&args);
 	} else {
 		struct marker_probe_closure *multi;
 		int i;
@@ -185,8 +183,8 @@ void marker_probe_cb_noarg(const struct 
 		smp_read_barrier_depends();
 		multi = mdata->multi;
 		for (i = 0; multi[i].func; i++)
-			multi[i].func(multi[i].probe_private, call_private, fmt,
-				&args);
+			multi[i].func(multi[i].probe_private, call_private,
+				mdata->format, &args);
 	}
 	preempt_enable();
 }

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch for 2.6.26 1/7] Markers - define non optimized marker (updated)
  2008-03-28  0:01 ` Denys Vlasenko
  2008-03-28  1:02   ` [PATCH] Markers - remove extra format argument Mathieu Desnoyers
@ 2008-03-28  1:04   ` Mathieu Desnoyers
  1 sibling, 0 replies; 29+ messages in thread
From: Mathieu Desnoyers @ 2008-03-28  1:04 UTC (permalink / raw)
  To: Denys Vlasenko; +Cc: akpm, linux-kernel

Markers - define non optimized marker

To support the forthcoming "immediate values" marker optimization, we must have
a way to declare markers, in a few code paths, that does not rely on instruction
modification to enable itself. This will be the case for printk(), some traps and
eventually lockdep instrumentation.

Changelog :
- Fix reversed boolean logic of "generic".
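
As an aside (illustration only, not part of the patch), the two variants would
be used roughly like this. The wrapper function and the example_printk_site
marker name are made up for the sketch; the kernel_sched_schedule site mirrors
the usage discussed elsewhere in this thread:

#include <linux/marker.h>
#include <linux/sched.h>

static void marker_usage_sketch(struct task_struct *prev,
				struct task_struct *next, const char *fmt)
{
	/*
	 * Default form: eligible for the optimized, code-patching based
	 * enable (immediate values) once that infrastructure is merged.
	 */
	trace_mark(kernel_sched_schedule, "prev %p next %p", prev, next);

	/*
	 * Generic form: enabled through a plain memory read, for code paths
	 * where instruction modification is not welcome (printk(), some
	 * traps, __init/__exit code, lockdep).
	 */
	_trace_mark(example_printk_site, "fmt %s", fmt);
}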

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
---
 include/linux/marker.h |   29 ++++++++++++++++++++++++-----
 1 file changed, 24 insertions(+), 5 deletions(-)

Index: linux-2.6-lttng/include/linux/marker.h
===================================================================
--- linux-2.6-lttng.orig/include/linux/marker.h	2008-03-27 20:47:44.000000000 -0400
+++ linux-2.6-lttng/include/linux/marker.h	2008-03-27 20:49:04.000000000 -0400
@@ -58,8 +58,12 @@ struct marker {
  * Make sure the alignment of the structure in the __markers section will
  * not add unwanted padding between the beginning of the section and the
  * structure. Force alignment to the same alignment as the section start.
+ *
+ * The "generic" argument controls which marker enabling mechanism must be used.
+ * If generic is true, a variable read is used.
+ * If generic is false, immediate values are used.
  */
-#define __trace_mark(name, call_private, format, args...)		\
+#define __trace_mark(generic, name, call_private, format, args...)	\
 	do {								\
 		static const char __mstrtab_##name[]			\
 		__attribute__((section("__markers_strings")))		\
@@ -79,7 +83,7 @@ struct marker {
 extern void marker_update_probe_range(struct marker *begin,
 	struct marker *end);
 #else /* !CONFIG_MARKERS */
-#define __trace_mark(name, call_private, format, args...) \
+#define __trace_mark(generic, name, call_private, format, args...) \
 		__mark_check_format(format, ## args)
 static inline void marker_update_probe_range(struct marker *begin,
 	struct marker *end)
@@ -87,15 +91,30 @@ static inline void marker_update_probe_r
 #endif /* CONFIG_MARKERS */
 
 /**
- * trace_mark - Marker
+ * trace_mark - Marker using code patching
  * @name: marker name, not quoted.
  * @format: format string
  * @args...: variable argument list
  *
- * Places a marker.
+ * Places a marker using optimized code patching technique (imv_read())
+ * to be enabled when immediate values are present.
  */
 #define trace_mark(name, format, args...) \
-	__trace_mark(name, NULL, format, ## args)
+	__trace_mark(0, name, NULL, format, ## args)
+
+/**
+ * _trace_mark - Marker using variable read
+ * @name: marker name, not quoted.
+ * @format: format string
+ * @args...: variable argument list
+ *
+ * Places a marker using a standard memory read (_imv_read()) to be
+ * enabled. Should be used for markers in code paths where instruction
+ * modification based enabling is not welcome. (__init and __exit functions,
+ * lockdep, some traps, printk).
+ */
+#define _trace_mark(name, format, args...) \
+	__trace_mark(1, name, NULL, format, ## args)
 
 /**
  * MARK_NOARGS - Format string for a marker with no argument.
-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] Markers - remove extra format argument
  2008-03-28  1:02   ` [PATCH] Markers - remove extra format argument Mathieu Desnoyers
@ 2008-03-28  5:35     ` Masami Hiramatsu
  0 siblings, 0 replies; 29+ messages in thread
From: Masami Hiramatsu @ 2008-03-28  5:35 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: Denys Vlasenko, akpm, linux-kernel

Hi Mathieu,

Mathieu Desnoyers wrote:
> Denys Vlasenko <vda.linux@googlemail.com> :
>> In this call:
>>
>>                         (*__mark_##name.call)                           \
>>                                 (&__mark_##name, call_private,          \
>>                                 format, ## args);                       \
>>
>> you make gcc allocate duplicate format string. You can use
>> &__mstrtab_##name[sizeof(#name)] instead since it holds the same string,
>> or drop ", format," above and "const char *fmt" from here:
>>
>>         void (*call)(const struct marker *mdata,        /* Probe wrapper */
>>                 void *call_private, const char *fmt, ...);
>>
>> since mdata->format is the same and all callees which need it can take it there.
> 
> Very good point. I actually thought about dropping it, since it would
> remove an unnecessary argument from the stack. And actually, since I now
> have the marker_probe_cb sitting between the marker site and the
> callbacks, there is no API change required. Thanks :)

First of all, I'm very interested in markers, and your patches look
useful to me.

By the way, could you tell me what call_private is for?
In your patch, it is only used by the special function below.

> @@ -1844,3 +1848,22 @@ int valid_swaphandles(swp_entry_t entry,
>  	*offset = ++toff;
>  	return nr_pages? ++nr_pages: 0;
>  }
> +
> +void ltt_dump_swap_files(void *call_data)
> +{
> +	int type;
> +	struct swap_info_struct *p = NULL;
> +
> +	mutex_lock(&swapon_mutex);
> +	for (type = swap_list.head; type >= 0; type = swap_info[type].next) {
> +		p = swap_info + type;
> +		if ((p->flags & SWP_ACTIVE) != SWP_ACTIVE)
> +			continue;
> +		__trace_mark(0, statedump_swap_files, call_data,
> +			"filp %p vfsmount %p dname %s",
> +			p->swap_file, p->swap_file->f_vfsmnt,
> +			p->swap_file->f_dentry->d_name.name);
> +	}
> +	mutex_unlock(&swapon_mutex);
> +}
> +EXPORT_SYMBOL_GPL(ltt_dump_swap_files);
> 

However, I think call_data can be passed as an argument.
If so, I think you can also drop it, like this.

 void (*call)(const struct marker *mdata, ...);


Best regards,

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch for 2.6.26 0/7] Architecture Independent Markers
  2008-03-27 20:39   ` Mathieu Desnoyers
@ 2008-03-28  9:43     ` Ingo Molnar
  2008-03-28 11:22       ` Ingo Molnar
  2008-03-28 11:38       ` Mathieu Desnoyers
  2008-03-28 13:33     ` Ingo Molnar
  1 sibling, 2 replies; 29+ messages in thread
From: Ingo Molnar @ 2008-03-28  9:43 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: akpm, linux-kernel, Linus Torvalds


* Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:

> Let's compare one marker against one ftrace statement in sched.o on 
> the sched-dev tree on x86_32 and see where your "bloat" impression 
> about markers comes from. I think it's mostly due to the different 
> metrics we use.
> 
> sched.o w/o CONFIG_CONTEXT_SWITCH_TRACER
>    text    data     bss     dec     hex filename
>   46564    2924     200   49688    c218 kernel/sched.o
> 
> Let's get an idea of CONFIG_CONTEXT_SWITCH_TRACER impact on sched.o :
> 
> sched.o with CONFIG_CONTEXT_SWITCH_TRACER
> 
>   text    data     bss     dec     hex filename
>   46788    2924     200   49912    c2f8 kernel/sched.o
> 
> 224 bytes added for 6 ftrace_*(). This is partly due to the helper function
> ftrace_all_fair_tasks(). So let's be fair and not take it into account.

it's not 6 ftrace calls, you forgot about kernel/sched_fair.c, so it's 9 
tracepoints.

note that all but the 2 core trace hooks are temporary, i used them to 
debug a specific scheduler problem. Especially one trace point: 
ftrace_all_fair_tasks() is a totally ad-hoc trace-all-tasks-in-the-rq 
heavy function.

if you want to compare apples to apples, try the patch below, which 
removes the ad-hoc tracepoints.

	Ingo

------------------------>
Subject: no: ad hoc ftrace points
From: Ingo Molnar <mingo@elte.hu>
Date: Fri Mar 28 10:30:37 CET 2008

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched.c      |   47 -----------------------------------------------
 kernel/sched_fair.c |    3 ---
 2 files changed, 50 deletions(-)

Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -2005,53 +2005,6 @@ static int sched_balance_self(int cpu, i
 
 #endif /* CONFIG_SMP */
 
-#ifdef CONFIG_CONTEXT_SWITCH_TRACER
-
-void ftrace_task(struct task_struct *p, void *__tr, void *__data)
-{
-#if 0
-	/*  
-	 * trace timeline tree
-	 */
-	__trace_special(__tr, __data,
-			p->pid, p->se.vruntime, p->se.sum_exec_runtime);
-#else
-	/*
-	 * trace balance metrics
-	 */
-	__trace_special(__tr, __data,
-			p->pid, p->se.avg_overlap, 0);
-#endif
-}
-
-void ftrace_all_fair_tasks(void *__rq, void *__tr, void *__data)
-{
-	struct task_struct *p;
-	struct sched_entity *se;
-	struct rb_node *curr;
-	struct rq *rq = __rq;
-
-	if (rq->cfs.curr) {
-		p = task_of(rq->cfs.curr);
-		ftrace_task(p, __tr, __data);
-	}
-	if (rq->cfs.next) {
-		p = task_of(rq->cfs.next);
-		ftrace_task(p, __tr, __data);
-	}
-
-	for (curr = first_fair(&rq->cfs); curr; curr = rb_next(curr)) {
-		se = rb_entry(curr, struct sched_entity, run_node);
-		if (!entity_is_task(se))
-			continue;
-
-		p = task_of(se);
-		ftrace_task(p, __tr, __data);
-	}
-}
-
-#endif
-
 /***
  * try_to_wake_up - wake up a thread
  * @p: the to-be-woken-up thread
Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -991,8 +991,6 @@ wake_affine(struct rq *rq, struct sched_
 	if (!(this_sd->flags & SD_WAKE_AFFINE))
 		return 0;
 
-	ftrace_special(__LINE__, curr->se.avg_overlap, sync);
-	ftrace_special(__LINE__, p->se.avg_overlap, -1);
 	/*
 	 * If the currently running task will sleep within
 	 * a reasonable amount of time then attract this newly
@@ -1118,7 +1116,6 @@ static void check_preempt_wakeup(struct 
 	if (unlikely(se == pse))
 		return;
 
-	ftrace_special(__LINE__, p->pid, se->last_wakeup);
 	cfs_rq_of(pse)->next = pse;
 
 	/*

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch for 2.6.26 0/7] Architecture Independent Markers
  2008-03-27 17:08   ` KOSAKI Motohiro
@ 2008-03-28 10:15     ` Ingo Molnar
  2008-03-28 13:34       ` [OT] " Masami Hiramatsu
  2008-03-28 13:40       ` Frank Ch. Eigler
  0 siblings, 2 replies; 29+ messages in thread
From: Ingo Molnar @ 2008-03-28 10:15 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Mathieu Desnoyers, akpm, linux-kernel, Linus Torvalds, Frank Ch. Eigler


* KOSAKI Motohiro <m-kosaki@ceres.dti.ne.jp> wrote:

> >      text    data     bss     dec     hex filename
> >     37642    7014     384   45040    aff0 sched.o.before
> >     37930    7134     384   45448    b188 sched.o.after
> >
> >  288 bytes for only 3 markers in the scheduler - that's 96 bytes added
> >  per every marker (44 bytes of that in the fastpath).
> >
> >  and the tracer codepath is now slower and more complex as well.
> >
> >  A classic lose-lose situation, so let's apply this patch. Not!
> >
> >  NACK-ed-by: Ingo Molnar <mingo@elte.hu>
> 
> I wonder why nobody has fixed this serious problem until now, and I 
> am interested in the differences between ftrace and markers.

i warned and moaned about it ad nauseam.

furthermore, and because it's Friday again, let me remind folks that 
SystemTap has an even more significant bloat problem: the fact that it 
needs a huge download:

 Installing:
   kernel-debuginfo         x86_64     2.6.25-0.163.rc7.git1.fc9  
   development-debuginfo  198 M
 Installing for dependencies:
   kernel-debuginfo-common  x86_64     2.6.25-0.163.rc7.git1.fc9  
   development-debuginfo   30 M

 Total download size: 229 M

for _every_ updated kernel. That 229 MB size reduces the user base of 
SystemTap (which is otherwise a totally brilliant and cool tool) to 1% 
of its potential userbase, to those truly desperate persons who really 
_need_ to get their problem debugged somehow. But it's nowhere near 
usable as an easy, ad-hoc kernel instrumentation tool, just due to the 
sheer size it brings.

for heaven's sake, we can have 3 years of _full Linux kernel history_, 
with all 87875 commits, with full changelog and dependencies and full 
source code included, packed into a rather tight, 180 MB git repository 
...

[ and then i haven't even begun to talk about the on-disk footprint of this 
  monstrum, after the packages have been installed: an additional 850 MB.
  Puh-lease ... ]

there's a huge disconnect with reality here.

	Ingo

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch for 2.6.26 0/7] Architecture Independent Markers
  2008-03-28  9:43     ` Ingo Molnar
@ 2008-03-28 11:22       ` Ingo Molnar
  2008-03-28 11:38       ` Mathieu Desnoyers
  1 sibling, 0 replies; 29+ messages in thread
From: Ingo Molnar @ 2008-03-28 11:22 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: akpm, linux-kernel, Linus Torvalds


* Ingo Molnar <mingo@elte.hu> wrote:

> if you want to compare apples to apples, try the patch below, which 
> removes the ad-hoc tracepoints.

... new version of that is in sched-devel.git.

	Ingo

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch for 2.6.26 0/7] Architecture Independent Markers
  2008-03-28  9:43     ` Ingo Molnar
  2008-03-28 11:22       ` Ingo Molnar
@ 2008-03-28 11:38       ` Mathieu Desnoyers
  1 sibling, 0 replies; 29+ messages in thread
From: Mathieu Desnoyers @ 2008-03-28 11:38 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: akpm, linux-kernel, Linus Torvalds

* Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> 
> > Let's compare one marker against one ftrace statement in sched.o on 
> > the sched-dev tree on x86_32 and see where your "bloat" impression 
> > about markers comes from. I think it's mostly due to the different 
> > metrics we use.
> > 
> > sched.o w/o CONFIG_CONTEXT_SWITCH_TRACER
> >    text    data     bss     dec     hex filename
> >   46564    2924     200   49688    c218 kernel/sched.o
> > 
> > Let's get an idea of CONFIG_CONTEXT_SWITCH_TRACER impact on sched.o :
> > 
> > sched.o with CONFIG_CONTEXT_SWITCH_TRACER
> > 
> >   text    data     bss     dec     hex filename
> >   46788    2924     200   49912    c2f8 kernel/sched.o
> > 
> > 224 bytes added for 6 ftrace_*(). This is partly due to the helper function
> > ftrace_all_fair_tasks(). So let's be fair and not take it into account.
> 
> it's not 6 ftrace calls, you forgot about kernel/sched_fair.c, so it's 9 
> tracepoints.
> 
> note that all but the 2 core trace hooks are temporary, i used them to 
> debug a specific scheduler problem. Especially one trace point: 
> ftrace_all_fair_tasks() is a totally ad-hoc trace-all-tasks-in-the-rq 
> heavy function.
> 
> if you want to compare apples to apples, try the patch below, which 
> removes the ad-hoc tracepoints.
> 

Hrm, you are only quoting my introduction, where I explain why I then do a
more in-depth analysis of a _single_ ftrace statement.

Ingo, if you care to read the rest of my email, you will discover that I
concentrated my effort on a single ftrace statement in context_switch().
Whether or not I removed the trace points from kernel/sched_fair.c does
not change the validity of the results that follow. I commented out your
ad-hoc tracepoints from sched.c by hand in my test cases, and
sched_fair.c trace points were there in every scenario, so they were
invariant and _not_ considered, except in this introduction you quoted.

Mathieu

> 	Ingo
> 
> ------------------------>
> Subject: no: ad hoc ftrace points
> From: Ingo Molnar <mingo@elte.hu>
> Date: Fri Mar 28 10:30:37 CET 2008
> 
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
> ---
>  kernel/sched.c      |   47 -----------------------------------------------
>  kernel/sched_fair.c |    3 ---
>  2 files changed, 50 deletions(-)
> 
> Index: linux/kernel/sched.c
> ===================================================================
> --- linux.orig/kernel/sched.c
> +++ linux/kernel/sched.c
> @@ -2005,53 +2005,6 @@ static int sched_balance_self(int cpu, i
>  
>  #endif /* CONFIG_SMP */
>  
> -#ifdef CONFIG_CONTEXT_SWITCH_TRACER
> -
> -void ftrace_task(struct task_struct *p, void *__tr, void *__data)
> -{
> -#if 0
> -	/*  
> -	 * trace timeline tree
> -	 */
> -	__trace_special(__tr, __data,
> -			p->pid, p->se.vruntime, p->se.sum_exec_runtime);
> -#else
> -	/*
> -	 * trace balance metrics
> -	 */
> -	__trace_special(__tr, __data,
> -			p->pid, p->se.avg_overlap, 0);
> -#endif
> -}
> -
> -void ftrace_all_fair_tasks(void *__rq, void *__tr, void *__data)
> -{
> -	struct task_struct *p;
> -	struct sched_entity *se;
> -	struct rb_node *curr;
> -	struct rq *rq = __rq;
> -
> -	if (rq->cfs.curr) {
> -		p = task_of(rq->cfs.curr);
> -		ftrace_task(p, __tr, __data);
> -	}
> -	if (rq->cfs.next) {
> -		p = task_of(rq->cfs.next);
> -		ftrace_task(p, __tr, __data);
> -	}
> -
> -	for (curr = first_fair(&rq->cfs); curr; curr = rb_next(curr)) {
> -		se = rb_entry(curr, struct sched_entity, run_node);
> -		if (!entity_is_task(se))
> -			continue;
> -
> -		p = task_of(se);
> -		ftrace_task(p, __tr, __data);
> -	}
> -}
> -
> -#endif
> -
>  /***
>   * try_to_wake_up - wake up a thread
>   * @p: the to-be-woken-up thread
> Index: linux/kernel/sched_fair.c
> ===================================================================
> --- linux.orig/kernel/sched_fair.c
> +++ linux/kernel/sched_fair.c
> @@ -991,8 +991,6 @@ wake_affine(struct rq *rq, struct sched_
>  	if (!(this_sd->flags & SD_WAKE_AFFINE))
>  		return 0;
>  
> -	ftrace_special(__LINE__, curr->se.avg_overlap, sync);
> -	ftrace_special(__LINE__, p->se.avg_overlap, -1);
>  	/*
>  	 * If the currently running task will sleep within
>  	 * a reasonable amount of time then attract this newly
> @@ -1118,7 +1116,6 @@ static void check_preempt_wakeup(struct 
>  	if (unlikely(se == pse))
>  		return;
>  
> -	ftrace_special(__LINE__, p->pid, se->last_wakeup);
>  	cfs_rq_of(pse)->next = pse;
>  
>  	/*

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch for 2.6.26 0/7] Architecture Independent Markers
  2008-03-27 20:39   ` Mathieu Desnoyers
  2008-03-28  9:43     ` Ingo Molnar
@ 2008-03-28 13:33     ` Ingo Molnar
  2008-03-29 17:16       ` Mathieu Desnoyers
  1 sibling, 1 reply; 29+ messages in thread
From: Ingo Molnar @ 2008-03-28 13:33 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: akpm, linux-kernel, Linus Torvalds


* Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:

>  6a5:   89 5c 24 14             mov    %ebx,0x14(%esp)
>  6a9:   8b 55 d0                mov    -0x30(%ebp),%edx
>  6ac:   89 54 24 10             mov    %edx,0x10(%esp)
>  6b0:   89 4c 24 0c             mov    %ecx,0xc(%esp)
>  6b4:   c7 44 24 08 f7 04 00    movl   $0x4f7,0x8(%esp)
>  6bb:   00 
>  6bc:   c7 44 24 04 00 00 00    movl   $0x0,0x4(%esp)
>  6c3:   00 
>  6c4:   c7 04 24 00 00 00 00    movl   $0x0,(%esp)
>  6cb:   ff 15 0c 00 00 00       call   *0xc
>  6d1:   e9 c3 fc ff ff          jmp    399 <schedule+0x130>
> 
> Which adds an extra 50 bytes.

you talk about 32-bit while i talk about 64-bit. All these costs go up 
on 64-bit and you should know that. I measured 44 bytes in the fastpath 
and 52 bytes in the slowpath, which gives 96 bytes. (with a distro 
.config and likely with a different gcc)

96 bytes _per marker_ sprinkled throughout the kernel. This blows up the 
cache footprint of the kernel quite substantially, because it's all 
fragmented - even if this is in the 'slowpath'.

so yes, that is the bloat i'm talking about.

dont just compare it to ftrace-sched-switch, compare it to dyn-ftrace 
which gives us more than 78,000 trace points in the kernel _here and 
today_ at no measurable runtime cost, with a 5 byte NOP per trace point 
and _zero_ instruction stream (register scheduling, etc.) intrusion. No 
slowpath cost.

and the basic API approach of markers is flawed as well - the coupling to 
the kernel is too strong. The correct and long-term maintainable 
coupling is via ASCII symbol names, not via any binding built into the 
kernel.

With dyn-ftrace (see sched-devel.git/latest) tracing filters can be 
installed trivially by users, via function _symbols_, via:

  /debugfs/tracing/available_filter_functions
  /debugfs/tracing/set_ftrace_filter

wildcards are recognized as well, so if you do:

  echo '*lock' > /debugfs/tracing/set_ftrace_filter

all functions that have 'lock' in their name will have their tracepoints 
activated transparently from that point on.

even multiple names can be passed in at once:

  echo 'schedule wake_up* *acpi*' > /debugfs/tracing/set_ftrace_filter

so it's trivial to use, very powerful, and we've only begun exposing it
to users. I see no good reason why we'd patch any marker into the
kernel - it's a maintenance cost from that point on.

so yes, my argument is: tens of thousands of lightweight tracepoints in 
the kernel here and today, which are configurable via function names, 
each of which can be turned on and off individually, and none of which 
needs any source code level changes - is an obviously superior approach.

	Ingo

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [OT] Re: [patch for 2.6.26 0/7] Architecture Independent Markers
  2008-03-28 10:15     ` Ingo Molnar
@ 2008-03-28 13:34       ` Masami Hiramatsu
  2008-04-01  1:43         ` Denys Vlasenko
  2008-03-28 13:40       ` Frank Ch. Eigler
  1 sibling, 1 reply; 29+ messages in thread
From: Masami Hiramatsu @ 2008-03-28 13:34 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: KOSAKI Motohiro, Mathieu Desnoyers, akpm, linux-kernel,
	Linus Torvalds, Frank Ch. Eigler, systemtap-ml, Jim Keniston

Ingo Molnar wrote:
> furthermore, and because it's Friday again, let me remind folks that 
> SystemTap has an even more significant bloat problem: the fact that it 
> needs a huge download:

Ingo, that seems very off-topic...

>  Installing:
>    kernel-debuginfo         x86_64     2.6.25-0.163.rc7.git1.fc9  
>    development-debuginfo  198 M
>  Installing for dependencies:
>    kernel-debuginfo-common  x86_64     2.6.25-0.163.rc7.git1.fc9  
>    development-debuginfo   30 M
> 
>  Total download size: 229 M
> 
> for _every_ updated kernel. That 229 MB size reduces the user base of 
> SystemTap (which is otherwise a totally brilliant and cool tool) to 1% 
> of its potential userbase, to those truly desperate persons who really 
> _need_ to get their problem debugged somehow. But it's nowhere near 
> usable as an easy, ad-hoc kernel instrumentation tool, just due to the 
> sheer size it brings.

Sure, it is a thorny issue, especially for users of a distribution that
updates its kernel daily.

However, when updating the kernel from a tarball, we can get a full set
of debuginfo just by building the kernel with CONFIG_DEBUG_INFO. I think
that is usable enough for kernel developers.

BTW, we have already started working on symbol-table based probing.
http://sources.redhat.com/bugzilla/show_bug.cgi?id=4311

> for heaven's sake, we can have 3 years of _full Linux kernel history_, 
> with all 87875 commits, with full changelog and dependencies and full 
> source code included, packed into a rather tight, 180 MB git repository 
> ...
> 
> [ and then i havent even begun about the on-disk footprint of this 
>   monstrum, after the packages have been installed: an additional 850 MB.
>   Puh-lease ... ]

Thanks to Moore's law and all the storage engineers, nowadays we can buy
a 1TB (=1000GB!) HDD for less than $300! (I was also surprised...)
Compressed filesystems can also be used to save disk space.

Best regards,

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch for 2.6.26 0/7] Architecture Independent Markers
  2008-03-28 10:15     ` Ingo Molnar
  2008-03-28 13:34       ` [OT] " Masami Hiramatsu
@ 2008-03-28 13:40       ` Frank Ch. Eigler
  2008-03-28 14:18         ` Ingo Molnar
  2008-03-28 14:41         ` Ingo Molnar
  1 sibling, 2 replies; 29+ messages in thread
From: Frank Ch. Eigler @ 2008-03-28 13:40 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: KOSAKI Motohiro, Mathieu Desnoyers, akpm, linux-kernel, Linus Torvalds


>> >  NACK-ed-by: Ingo Molnar <mingo@elte.hu>
>> [...]
>> I wonder why nobody has fixed this serious problem until now. [...]

Frankly, the "problem" has not yet been established.  Ingo has failed
to respond to Mathieu's analysis of the actual instruction streams,
which cast doubt on Ingo's technique of simply counting bytes.  So
far, neither Ingo nor Mathieu has published *time measurements* to
quantify the actual (as opposed to speculated) costs.


> [...]  furthermore, and because it's Friday again, let me remind
> folks that SystemTap has an even more significant bloat problem: the
> fact that it needs a huge download: [...the dwarf debugging
> information...]

It's a problem, and there are a few improvements under way or being
contemplated:

- compression of the DWARF data to eliminate more duplication
- support for inserting probes only on function entry/exit points
  without debugging data - relying on symbol tables and raw ABI
- a networked service to allow one machine to serve site-wide
  systemtap needs

At the end of the day, lots of information will still need to be
around for those who need to put probes approximately anywhere and to
pull approximately any data out.


> for heaven's sake, we can have 3 years of _full Linux kernel
> history_, with all 87875 commits, with full changelog and
> dependencies and full source code included, packed into a rather
> tight, 180 MB git repository

It may seem counterintuitive to you, but remember that object code is
recreated from scratch with every build, yet source code changes only
gradually.  Try checking three years' worth of builds of stripped object
files into a git repo.


- FChE

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch for 2.6.26 0/7] Architecture Independent Markers
  2008-03-28 13:40       ` Frank Ch. Eigler
@ 2008-03-28 14:18         ` Ingo Molnar
  2008-03-28 14:41         ` Ingo Molnar
  1 sibling, 0 replies; 29+ messages in thread
From: Ingo Molnar @ 2008-03-28 14:18 UTC (permalink / raw)
  To: Frank Ch. Eigler
  Cc: KOSAKI Motohiro, Mathieu Desnoyers, akpm, linux-kernel, Linus Torvalds


* Frank Ch. Eigler <fche@redhat.com> wrote:

> 
> >> >  NACK-ed-by: Ingo Molnar <mingo@elte.hu>
> >> [...]
> >> I wonder why nobody has fixed this serious problem until now. [...]
> 
> Frankly, the "problem" has not yet been established.  Ingo has failed 
> to respond to Mathieu's analysis of the actual instruction streams, 
> which cast doubt on Ingo's technique of simply counting bytes. [...]

uhm no. The "difference" is due to Mathieu counting on the platform 
that is most favorable to his argument: 32-bit, while i counted on 
64-bit where most servers operate (and where instrumentation micro-cost 
actually matters).

	Ingo

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch for 2.6.26 0/7] Architecture Independent Markers
  2008-03-28 13:40       ` Frank Ch. Eigler
  2008-03-28 14:18         ` Ingo Molnar
@ 2008-03-28 14:41         ` Ingo Molnar
  2008-03-28 15:31           ` Frank Ch. Eigler
  1 sibling, 1 reply; 29+ messages in thread
From: Ingo Molnar @ 2008-03-28 14:41 UTC (permalink / raw)
  To: Frank Ch. Eigler
  Cc: KOSAKI Motohiro, Mathieu Desnoyers, akpm, linux-kernel, Linus Torvalds


* Frank Ch. Eigler <fche@redhat.com> wrote:

> > [...]  furthermore, and because it's Friday again, let me remind 
> > folks that SystemTap has an even more significant bloat problem: the 
> > fact that it needs a huge download: [...the dwarf debugging 
> > information...]
> 
> It's a problem, and there are a few improvements under way or being 
> contemplated: [...]

it's been the primary usability problem ever since SystemTap's
inception 3 years ago. The 229+850 MB size numbers i cited were from a 
bleeding-edge (Fedora 9 beta) distro. So whatever is being contemplated, 
it's not here today.

	Ingo

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch for 2.6.26 0/7] Architecture Independent Markers
  2008-03-28 14:41         ` Ingo Molnar
@ 2008-03-28 15:31           ` Frank Ch. Eigler
  0 siblings, 0 replies; 29+ messages in thread
From: Frank Ch. Eigler @ 2008-03-28 15:31 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: KOSAKI Motohiro, Mathieu Desnoyers, akpm, linux-kernel, Linus Torvalds

Hi -

On Fri, Mar 28, 2008 at 03:41:16PM +0100, Ingo Molnar wrote:
> > > [...]  furthermore, and because it's Friday again, let me remind 
> > > folks that SystemTap has an even more significant bloat problem: the 
> > > fact that it needs a huge download: [...the dwarf debugging 
> > > information...]
> > 
> > It's a problem, and there are a few improvements under way or being 
> > contemplated: [...]
> 
> it's been the primary usability problem ever since SystemTap's 
> inception 3 years ago. [...]

This complaint is particularly rich, seeing how you constantly try to
shoot down patches that would reduce systemtap's need for this, such
as markers.  And no, dyn-ftrace is not a complete substitute.  See how
many of the markers in Mathieu's LTTng suite occur at places *other*
than function entry points.  See how many require parameters other
than what may already sit in registers.

We in systemtap land have always had to make do with whatever was in
the kernel.  There was very little other than kprobes 3 years ago.  Of
newer facilities such as markers, ftrace, perfmon2 (I hope), and a
bunch of other little tracer thingies, all those that provide a usable
callback style interface can be exposed to systemtap.  The more the
merrier, and the less we'd have to resort to the lowest-level
kprobes/dwarf stuff.  Please help rather than obstruct.


- FChE

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch for 2.6.26 0/7] Architecture Independent Markers
  2008-03-28 13:33     ` Ingo Molnar
@ 2008-03-29 17:16       ` Mathieu Desnoyers
  0 siblings, 0 replies; 29+ messages in thread
From: Mathieu Desnoyers @ 2008-03-29 17:16 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: akpm, linux-kernel, Linus Torvalds

* Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> 
> >  6a5:   89 5c 24 14             mov    %ebx,0x14(%esp)
> >  6a9:   8b 55 d0                mov    -0x30(%ebp),%edx
> >  6ac:   89 54 24 10             mov    %edx,0x10(%esp)
> >  6b0:   89 4c 24 0c             mov    %ecx,0xc(%esp)
> >  6b4:   c7 44 24 08 f7 04 00    movl   $0x4f7,0x8(%esp)
> >  6bb:   00 
> >  6bc:   c7 44 24 04 00 00 00    movl   $0x0,0x4(%esp)
> >  6c3:   00 
> >  6c4:   c7 04 24 00 00 00 00    movl   $0x0,(%esp)
> >  6cb:   ff 15 0c 00 00 00       call   *0xc
> >  6d1:   e9 c3 fc ff ff          jmp    399 <schedule+0x130>
> > 
> > Which adds an extra 50 bytes.
> 
> you talk about 32-bit while i talk about 64-bit. All these costs go up 
> on 64-bit and you should know that. I measured 44 bytes in the fastpath 
> and 52 bytes in the slowpath, which gives 96 bytes. (with a distro 
> .config and likely with a different gcc)
> 

I did some testing with gcc -Os and -O2 on x86_64 and noticed that -Os
behaves badly in that it does not use -freorder-blocks. That
optimization is required to move the unlikely branches out of the
critical path.

With -O2:
        mov $0,%al
        movq    %rsi, 1912(%rbx)
        movq    -96(%rbp), %rdi
        incq    (%rdi)
        testb   %al, %al
        jne     .L1785


 4de:   b0 00                   mov    $0x0,%al
 4e0:   48 89 b3 78 07 00 00    mov    %rsi,0x778(%rbx)
 4e7:   48 8b 7d a0             mov    0xffffffffffffffa0(%rbp),%rdi
 4eb:   48 ff 07                incq   (%rdi)
 4ee:   84 c0                   test   %al,%al
 4f0:   0f 85 5f 03 00 00       jne    855 <thread_return+0x2b4>

So, as far as the assembly for the markers in the fast path is
concerned, it adds 10 bytes to the fast path on x86_64. (I did not
count the %rdi stuff since I suppose it is unrelated to the markers and
was put there by the compiler, which reorders instructions.)

The block which contains the call is placed much lower, at the end of
thread_return:

 855:   49 89 f0                mov    %rsi,%r8
 858:   48 89 d1                mov    %rdx,%rcx
 85b:   31 f6                   xor    %esi,%esi
 85d:   48 89 da                mov    %rbx,%rdx
 860:   48 c7 c7 00 00 00 00    mov    $0x0,%rdi
 867:   31 c0                   xor    %eax,%eax
 869:   ff 15 00 00 00 00       callq  *0(%rip)        # 86f <thread_return+0x2ce>
 86f:   e9 82 fc ff ff          jmpq   4f6 <schedule+0x166>

For an added 31 bytes.

Total size added: 41 bytes, 10 of them in the fast path.
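
For reference, the code shape the compiler is optimizing here is roughly
the following (a simplified sketch of the idea only, not the actual
include/linux/marker.h macro, which also handles section placement and
format checking; all names are illustrative):

	/*
	 * A marker site boils down to a predicted-not-taken test in the
	 * fast path, plus the argument setup and indirect call that
	 * -freorder-blocks can move out of line.
	 */
	#define sketch_unlikely(x)	__builtin_expect(!!(x), 0)

	struct sketch_marker {
		const char *name;
		char state;			/* 0 = disabled */
		void (*call)(const char *fmt, ...);
	};

	#define sketch_trace_mark(m, fmt, args...)		\
		do {						\
			if (sketch_unlikely((m)->state))	\
				(m)->call(fmt, ## args);	\
		} while (0)

The 10 fast-path bytes above roughly correspond to the state test, and
the 31-byte block at the end of thread_return to the argument setup and
the indirect call.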


> 96 bytes _per marker_ sprinkled throughout the kernel. This blows up the 
> cache footprint of the kernel quite substantially, because it's all 
> fragmented - even if this is in the 'slowpath'.
> 
> so yes, that is the bloat i'm talking about.
> 

I think the very different compiler options we use change the picture
significantly.

> dont just compare it to ftrace-sched-switch, compare it to dyn-ftrace 
> which gives us more than 78,000 trace points in the kernel _here and 
> today_ at no measurable runtime cost, with a 5 byte NOP per trace point 
> and _zero_ instruction stream (register scheduling, etc.) intrusion. No 
> slowpath cost.
> 

Markers and dyn-ftrace do not fulfill the same purpose, so I don't see
why we should compare them. dyn-ftrace is good at tracing function
entry/exit, so let's keep it. However, it is not designed to extract
variables at specific locations in the kernel code.

Which slowpath cost are you talking about? When markers are disabled,
their unused function call instructions are carefully placed out of the
kernel's running code, along with the BUG_ONs and WARN_ONs which already
use some cache lines there. You are talking about no measurable runtime
cost: have you tried to measure the runtime cost of disabled markers? I
have not been able to measure any significant difference with the
complete LTTng marker set compiled into the kernel.

> and the basic API approach of markers is flawed as well - the coupling to 
> the kernel is too strong. The correct and long-term maintainable 
> coupling is via ASCII symbol names, not via any binding built into the 
> kernel.
> 
> With dyn-ftrace (see sched-devel.git/latest) tracing filters can be 
> installed trivially by users, via function _symbols_, via:
> 
>   /debugfs/tracing/available_filter_functions
>   /debugfs/tracing/set_ftrace_filter
> 
> wildcards are recognized as well, so if you do:
> 
>   echo '*lock' > /debugfs/tracing/set_ftrace_filter
> 
> all functions that have 'lock' in their name will have their tracepoints 
> activated transparently from that point on.
> 
> even multiple names can be passed in at once:
> 
>   echo 'schedule wake_up* *acpi*' > /debugfs/tracing/set_ftrace_filter
> 
> so it's trivial to use, very powerful, and we've only begun exposing it 
> to users. I see no good reason why we'd patch any marker into the 
> kernel - it's a maintenance cost from that point on.
> 

I did something similar with LTTng:

cat /proc/ltt  lists the available markers
echo "connect marker_name default dynamic channel_name" > /proc/ltt

This indicates:
- The type of callback to use
- Where the data must be sent (LTTng supports multiple buffers, called
  "channels")

So yes, making this easy to use has been done. It's just that the marker
is one building block of the tracing infrastructure, not its entirety.
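
To illustrate what "building block" means, here is a minimal sketch of
the tracer side. It is simplified: the exact marker_probe_register() /
probe callback prototypes in a given tree may differ slightly, and the
marker name and format string below are placeholders.

	#include <linux/module.h>
	#include <linux/init.h>
	#include <linux/marker.h>

	/*
	 * Called whenever the connected marker fires. A real tracer
	 * would decode 'args' according to 'fmt' and write the event
	 * into the selected channel's buffer instead of doing nothing.
	 */
	static void sketch_probe(void *probe_private, void *call_private,
				 const char *fmt, va_list *args)
	{
	}

	static int __init sketch_init(void)
	{
		return marker_probe_register("subsystem_event",
					     "integer %d",
					     sketch_probe, NULL);
	}

	static void __exit sketch_exit(void)
	{
		marker_probe_unregister("subsystem_event",
					sketch_probe, NULL);
	}

	module_init(sketch_init);
	module_exit(sketch_exit);
	MODULE_LICENSE("GPL");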

By the way, I like your tracing filters interface. It seems rather more
polished than my /proc interface. And personally I don't care whether we
use /proc, debugfs... as long as there is an interface to userspace.


> so yes, my argument is: tens of thousands of lightweight tracepoints in 
> the kernel here and today, which are configurable via function names, 
> each of which can be turned on and off individually, and none of which 
> needs any source code level changes - is an obviously superior approach.
> 

It's absolutely good to have that in the kernel, but it does not
_replace_ the markers, as I explained above.

Mathieu

> 	Ingo

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [OT] Re: [patch for 2.6.26 0/7] Architecture Independent Markers
  2008-03-28 13:34       ` [OT] " Masami Hiramatsu
@ 2008-04-01  1:43         ` Denys Vlasenko
  2008-04-01 14:30           ` Masami Hiramatsu
  0 siblings, 1 reply; 29+ messages in thread
From: Denys Vlasenko @ 2008-04-01  1:43 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Ingo Molnar, KOSAKI Motohiro, Mathieu Desnoyers, akpm,
	linux-kernel, Linus Torvalds, Frank Ch. Eigler, systemtap-ml,
	Jim Keniston

On Friday 28 March 2008 14:34, Masami Hiramatsu wrote:
> > for heaven's sake, we can have 3 years of _full Linux kernel history_, 
> > with all 87875 commits, with full changelog and dependencies and full 
> > source code included, packed into a rather tight, 180 MB git repository 
> > ...
> > 
> > [ and then i havent even begun about the on-disk footprint of this 
> >   monstrum, after the packages have been installed: an additional 850 MB.
> >   Puh-lease ... ]
> 
> Thanks to Moore's law and all the storage engineers, nowadays we can buy
> a 1TB (=1000GB!) HDD for less than $300! (I was also surprised...)
> Compressed filesystems can also be used to save disk space.

This is not a valid technical reason for creating bloatware.

Bloatware's main problem is not the cost of storing it or
downloading it. The main problem is that over time it becomes
an unmaintainable monstrosity nobody is willing to deal with.

I would rather try to find and fix a bug in 2000 lines of C code
than in 200 000 lines.
--
vda

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [OT] Re: [patch for 2.6.26 0/7] Architecture Independent Markers
  2008-04-01  1:43         ` Denys Vlasenko
@ 2008-04-01 14:30           ` Masami Hiramatsu
  0 siblings, 0 replies; 29+ messages in thread
From: Masami Hiramatsu @ 2008-04-01 14:30 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Ingo Molnar, KOSAKI Motohiro, Mathieu Desnoyers, akpm,
	linux-kernel, Linus Torvalds, Frank Ch. Eigler, systemtap-ml,
	Jim Keniston

Hi Denys,

Denys Vlasenko wrote:
> On Friday 28 March 2008 14:34, Masami Hiramatsu wrote:
>>> for heaven's sake, we can have 3 years of _full Linux kernel history_, 
>>> with all 87875 commits, with full changelog and dependencies and full 
>>> source code included, packed into a rather tight, 180 MB git repository 
>>> ...
>>>
>>> [ and then i havent even begun about the on-disk footprint of this 
>>>   monstrum, after the packages have been installed: an additional 850 MB.
>>>   Puh-lease ... ]
>> Thanks to Moore's law and all the storage engineers, nowadays we can buy
>> a 1TB (=1000GB!) HDD for less than $300! (I was also surprised...)
>> Compressed filesystems can also be used to save disk space.
> 
> This is not a valid technical reason for creating bloatware.
> 
> Bloatware's main problem is not the cost of storing it or
> downloading it. The main problem is that over time it becomes
> an unmaintainable monstrosity nobody is willing to deal with.
>
> I would rather try to find and fix a bug in 2000 lines of C code
> than in 200 000 lines.

If it were program code, you would be right.
However, the debuginfo is just a set of data files generated from the
C source code by the compiler, so you don't need to maintain it.

Thank you,

> --
> vda

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com


^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2008-04-01 14:43 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-03-27 13:20 [patch for 2.6.26 0/7] Architecture Independent Markers Mathieu Desnoyers
2008-03-27 13:20 ` [patch for 2.6.26 1/7] Markers - define non optimized marker Mathieu Desnoyers
2008-03-27 13:20 ` [patch for 2.6.26 2/7] LTTng instrumentation fs Mathieu Desnoyers
2008-03-27 13:21 ` [patch for 2.6.26 3/7] LTTng instrumentation ipc Mathieu Desnoyers
2008-03-27 13:21 ` [patch for 2.6.26 4/7] LTTng instrumentation kernel Mathieu Desnoyers
2008-03-27 13:21 ` [patch for 2.6.26 5/7] LTTng instrumentation mm Mathieu Desnoyers
2008-03-27 13:21 ` [patch for 2.6.26 6/7] LTTng instrumentation net Mathieu Desnoyers
2008-03-27 13:21 ` [patch for 2.6.26 7/7] LTTng instrumentation - lib Mathieu Desnoyers
2008-03-27 15:40 ` [patch for 2.6.26 0/7] Architecture Independent Markers Ingo Molnar
2008-03-27 17:08   ` KOSAKI Motohiro
2008-03-28 10:15     ` Ingo Molnar
2008-03-28 13:34       ` [OT] " Masami Hiramatsu
2008-04-01  1:43         ` Denys Vlasenko
2008-04-01 14:30           ` Masami Hiramatsu
2008-03-28 13:40       ` Frank Ch. Eigler
2008-03-28 14:18         ` Ingo Molnar
2008-03-28 14:41         ` Ingo Molnar
2008-03-28 15:31           ` Frank Ch. Eigler
2008-03-27 20:39   ` Mathieu Desnoyers
2008-03-28  9:43     ` Ingo Molnar
2008-03-28 11:22       ` Ingo Molnar
2008-03-28 11:38       ` Mathieu Desnoyers
2008-03-28 13:33     ` Ingo Molnar
2008-03-29 17:16       ` Mathieu Desnoyers
2008-03-27 21:49   ` Frank Ch. Eigler
2008-03-28  0:01 ` Denys Vlasenko
2008-03-28  1:02   ` [PATCH] Markers - remove extra format argument Mathieu Desnoyers
2008-03-28  5:35     ` Masami Hiramatsu
2008-03-28  1:04   ` [patch for 2.6.26 1/7] Markers - define non optimized marker (updated) Mathieu Desnoyers

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).