linux-kernel.vger.kernel.org archive mirror
* [patch 00/37] Linux Kernel Markers instrumentation for sched-devel.git
@ 2008-04-24 15:03 Mathieu Desnoyers
  2008-04-24 15:03 ` [patch 01/37] Stringify support commas Mathieu Desnoyers
                   ` (37 more replies)
  0 siblings, 38 replies; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-04-24 15:03 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel

Hi Ingo,

Here is a rather large patchset applying kernel instrumentation to
sched-devel.git. It mainly includes:

- x86 NMI-safe traps.
- Optimized Markers, using Immediate Values, with nops/jump patching
  optimization.
- ftrace sched and wakeup tracer port to the markers.
- The architecture-independent instrumentation found in LTTng.

Please see the individual patches for more details. A diffstat is appended at
the bottom of this email.

They apply in the following order on top of the current sched-devel.git:

stringify-support-commas.patch
#
x86_64-page-fault-nmi-safe.patch
change-alpha-active-count-bit.patch
change-avr32-active-count-bit.patch
x86-nmi-safe-int3-and-page-fault.patch
#
kprobes-use-mutex-for-insn-pages.patch
kprobes-dont-use-kprobes-mutex-in-arch-code.patch
kprobes-declare-kprobes-mutex-static.patch
fix-sched-devel-text-poke.patch
text-edit-lock-architecture-independent-code.patch
text-edit-lock-kprobes-architecture-independent-support.patch
#
# Immediate Values (basic)
add-all-cpus-option-to-stop-machine-run.patch
immediate-values-architecture-independent-code.patch
immediate-values-kconfig-menu-in-embedded.patch
immediate-values-x86-optimization.patch
add-text-poke-and-sync-core-to-powerpc.patch
immediate-values-powerpc-optimization.patch
immediate-values-documentation.patch
immediate-values-support-init.patch
# Immediate Values (NMI-safe)
immediate-values-move-kprobes-x86-restore-interrupt-to-kdebug-h.patch
add-discard-section-to-x86.patch
immediate-values-x86-optimization-nmi-mce-support.patch
immediate-values-powerpc-optimization-nmi-mce-support.patch
immediate-values-use-arch-nmi-mce-support.patch
# Immediate Values (jump patching)
immediate-values-jump.patch
#
scheduler-profiling-use-immediate-values.patch
#
make-marker_debug-static.patch # in -mm
markers-remove-extra-format-argument.patch
markers-define-non-optimized-marker.patch
linux-kernel-markers-immediate-values.patch
markers-use-imv-jump.patch
#
port-ftrace-to-markers.patch
#
# Instrumentation, architecture independent
lttng-instrumentation-fs.patch
lttng-instrumentation-ipc.patch
lttng-instrumentation-kernel.patch
lttng-instrumentation-mm.patch
lttng-instrumentation-net.patch


The overall diffstat :

 a/kernel/marker.c                                         |    2 
 arch/x86/kernel/traps_32.c                                |    3 
 include/asm-generic/vmlinux.lds.h                         |    8 
 include/asm-powerpc/immediate.h                           |    4 
 include/asm-x86/immediate.h                               |   54 -
 include/linux/immediate.h                                 |   15 
 include/linux/marker.h                                    |   13 
 include/linux/module.h                                    |    2 
 init/main.c                                               |    1 
 kernel/immediate.c                                        |   37 
 kernel/kprobes.c                                          |   44 
 kernel/module.c                                           |   10 
 kernel/sched.c                                            |   14 
 linux-2.6-lttng/Documentation/immediate.txt               |  221 ++++
 linux-2.6-lttng/arch/powerpc/kernel/Makefile              |    1 
 linux-2.6-lttng/arch/powerpc/kernel/immediate.c           |   70 +
 linux-2.6-lttng/include/asm-powerpc/cacheflush.h          |    4 
 linux-2.6-lttng/include/asm-powerpc/immediate.h           |   18 
 linux-2.6-lttng/include/asm-x86/kdebug.h                  |   12 
 linux-2.6-lttng/include/asm-x86/kprobes.h                 |    9 
 linux-2.6-lttng/include/linux/immediate.h                 |   11 
 linux-2.6-lttng/include/linux/marker.h                    |   29 
 linux-2.6-lttng/include/linux/stringify.h                 |    5 
 linux-2.6-lttng/include/linux/swapops.h                   |    8 
 linux-2.6-lttng/ipc/msg.c                                 |    6 
 linux-2.6-lttng/ipc/sem.c                                 |    6 
 linux-2.6-lttng/ipc/shm.c                                 |    6 
 linux-2.6-lttng/kernel/immediate.c                        |   73 -
 linux-2.6-lttng/kernel/kprobes.c                          |    2 
 linux-2.6-lttng/kernel/marker.c                           |   30 
 linux-2.6-lttng/mm/filemap.c                              |    7 
 linux-2.6-lttng/mm/hugetlb.c                              |    3 
 linux-2.6-lttng/mm/memory.c                               |   41 
 linux-2.6-lttng/mm/page_alloc.c                           |    9 
 linux-2.6-lttng/mm/page_io.c                              |    6 
 linux-2.6-lttng/mm/swapfile.c                             |   23 
 linux-2.6-lttng/net/core/dev.c                            |    6 
 linux-2.6-lttng/net/ipv4/devinet.c                        |    6 
 linux-2.6-lttng/net/socket.c                              |   19 
 linux-2.6-sched-devel/Documentation/immediate.txt         |    8 
 linux-2.6-sched-devel/Documentation/markers.txt           |   17 
 linux-2.6-sched-devel/arch/ia64/kernel/kprobes.c          |    2 
 linux-2.6-sched-devel/arch/powerpc/Kconfig                |    1 
 linux-2.6-sched-devel/arch/powerpc/kernel/kprobes.c       |    2 
 linux-2.6-sched-devel/arch/s390/kernel/kprobes.c          |    2 
 linux-2.6-sched-devel/arch/x86/Kconfig                    |    1 
 linux-2.6-sched-devel/arch/x86/kernel/Makefile            |    1 
 linux-2.6-sched-devel/arch/x86/kernel/alternative.c       |   38 
 linux-2.6-sched-devel/arch/x86/kernel/asm-offsets_32.c    |    1 
 linux-2.6-sched-devel/arch/x86/kernel/asm-offsets_64.c    |    1 
 linux-2.6-sched-devel/arch/x86/kernel/entry_32.S          |   30 
 linux-2.6-sched-devel/arch/x86/kernel/entry_64.S          |   34 
 linux-2.6-sched-devel/arch/x86/kernel/immediate.c         |  697 ++++++++++++--
 linux-2.6-sched-devel/arch/x86/kernel/kprobes.c           |    2 
 linux-2.6-sched-devel/arch/x86/kernel/paravirt.c          |    3 
 linux-2.6-sched-devel/arch/x86/kernel/paravirt_patch_32.c |    6 
 linux-2.6-sched-devel/arch/x86/kernel/paravirt_patch_64.c |    6 
 linux-2.6-sched-devel/arch/x86/kernel/traps_32.c          |    8 
 linux-2.6-sched-devel/arch/x86/kernel/traps_64.c          |    4 
 linux-2.6-sched-devel/arch/x86/kernel/vmi_32.c            |    2 
 linux-2.6-sched-devel/arch/x86/kernel/vmlinux_32.lds.S    |    1 
 linux-2.6-sched-devel/arch/x86/kernel/vmlinux_64.lds.S    |    1 
 linux-2.6-sched-devel/arch/x86/kvm/x86.c                  |    2 
 linux-2.6-sched-devel/arch/x86/lguest/boot.c              |    1 
 linux-2.6-sched-devel/arch/x86/mm/fault.c                 |    4 
 linux-2.6-sched-devel/arch/x86/xen/enlighten.c            |    1 
 linux-2.6-sched-devel/fs/buffer.c                         |    3 
 linux-2.6-sched-devel/fs/compat.c                         |    2 
 linux-2.6-sched-devel/fs/exec.c                           |    2 
 linux-2.6-sched-devel/fs/ioctl.c                          |    3 
 linux-2.6-sched-devel/fs/open.c                           |    3 
 linux-2.6-sched-devel/fs/read_write.c                     |   23 
 linux-2.6-sched-devel/fs/select.c                         |    5 
 linux-2.6-sched-devel/include/asm-alpha/thread_info.h     |    2 
 linux-2.6-sched-devel/include/asm-avr32/thread_info.h     |    2 
 linux-2.6-sched-devel/include/asm-generic/vmlinux.lds.h   |    3 
 linux-2.6-sched-devel/include/asm-powerpc/immediate.h     |   57 +
 linux-2.6-sched-devel/include/asm-x86/immediate.h         |  111 ++
 linux-2.6-sched-devel/include/asm-x86/irqflags.h          |   56 +
 linux-2.6-sched-devel/include/asm-x86/paravirt.h          |    7 
 linux-2.6-sched-devel/include/linux/hardirq.h             |   27 
 linux-2.6-sched-devel/include/linux/immediate.h           |   94 +
 linux-2.6-sched-devel/include/linux/kprobes.h             |    2 
 linux-2.6-sched-devel/include/linux/marker.h              |   16 
 linux-2.6-sched-devel/include/linux/memory.h              |    7 
 linux-2.6-sched-devel/include/linux/module.h              |   16 
 linux-2.6-sched-devel/include/linux/profile.h             |    5 
 linux-2.6-sched-devel/include/linux/sched.h               |   32 
 linux-2.6-sched-devel/include/linux/stop_machine.h        |    8 
 linux-2.6-sched-devel/init/Kconfig                        |   18 
 linux-2.6-sched-devel/init/main.c                         |    8 
 linux-2.6-sched-devel/kernel/Makefile                     |    1 
 linux-2.6-sched-devel/kernel/exit.c                       |    8 
 linux-2.6-sched-devel/kernel/fork.c                       |    5 
 linux-2.6-sched-devel/kernel/immediate.c                  |  149 ++
 linux-2.6-sched-devel/kernel/irq/handle.c                 |    7 
 linux-2.6-sched-devel/kernel/itimer.c                     |   13 
 linux-2.6-sched-devel/kernel/kthread.c                    |    5 
 linux-2.6-sched-devel/kernel/lockdep.c                    |   20 
 linux-2.6-sched-devel/kernel/marker.c                     |    8 
 linux-2.6-sched-devel/kernel/module.c                     |   50 -
 linux-2.6-sched-devel/kernel/printk.c                     |   27 
 linux-2.6-sched-devel/kernel/profile.c                    |   22 
 linux-2.6-sched-devel/kernel/sched.c                      |    5 
 linux-2.6-sched-devel/kernel/sched_fair.c                 |    5 
 linux-2.6-sched-devel/kernel/signal.c                     |    3 
 linux-2.6-sched-devel/kernel/softirq.c                    |   23 
 linux-2.6-sched-devel/kernel/stop_machine.c               |   32 
 linux-2.6-sched-devel/kernel/timer.c                      |   13 
 linux-2.6-sched-devel/kernel/trace/trace.h                |   20 
 linux-2.6-sched-devel/kernel/trace/trace_sched_switch.c   |  173 ++-
 linux-2.6-sched-devel/kernel/trace/trace_sched_wakeup.c   |  108 ++
 linux-2.6-sched-devel/mm/memory.c                         |   34 
 113 files changed, 2542 insertions(+), 415 deletions(-)


Mathieu

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* [patch 01/37] Stringify support commas
  2008-04-24 15:03 [patch 00/37] Linux Kernel Markers instrumentation for sched-devel.git Mathieu Desnoyers
@ 2008-04-24 15:03 ` Mathieu Desnoyers
  2008-04-24 15:03 ` [patch 02/37] x86_64 page fault NMI-safe Mathieu Desnoyers
                   ` (36 subsequent siblings)
  37 siblings, 0 replies; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-04-24 15:03 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel; +Cc: Mathieu Desnoyers, akpm, Sam Ravnborg

[-- Attachment #1: stringify-support-commas.patch --]
[-- Type: text/plain, Size: 1499 bytes --]

#define MYDEF a, b, c

__stringify(MYDEF) should expand to the string "a, b, c", but compilation
fails because the __stringify macro expects only one argument. Fix it by
using variadic macro arguments in __stringify and __stringify_1.

Needed in my current NMI-safe iret paravirt support work so I can expand
a macro containing assembly code into a string.

Since some architectures still use -traditional, which does not support
variadic macros, keep the old stringify around. Test for __STDC__ to detect
whether -traditional is in use.
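
For illustration only (not part of the patch), a minimal usage sketch; the
variable name is hypothetical:

#define MYDEF a, b, c

/*
 * With the variadic __stringify below this expands to the string "a, b, c";
 * with the old single-argument version the expansion passes three arguments
 * to a one-argument macro and fails to compile.
 */
static const char *mydef_str = __stringify(MYDEF);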

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: akpm@osdl.org
CC: Sam Ravnborg <sam@ravnborg.org>
---
 include/linux/stringify.h |    5 +++++
 1 file changed, 5 insertions(+)

Index: linux-2.6-lttng/include/linux/stringify.h
===================================================================
--- linux-2.6-lttng.orig/include/linux/stringify.h	2008-04-22 14:16:21.000000000 -0400
+++ linux-2.6-lttng/include/linux/stringify.h	2008-04-22 14:19:26.000000000 -0400
@@ -6,7 +6,12 @@
  * converts to "bar".
  */
 
+#ifdef __STDC__
+#define __stringify_1(x...)	#x
+#define __stringify(x...)	__stringify_1(x)
+#else	/* Support gcc -traditional, without commas. */
 #define __stringify_1(x)	#x
 #define __stringify(x)		__stringify_1(x)
+#endif
 
 #endif	/* !__LINUX_STRINGIFY_H */

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* [patch 02/37] x86_64 page fault NMI-safe
  2008-04-24 15:03 [patch 00/37] Linux Kernel Markers instrumentation for sched-devel.git Mathieu Desnoyers
  2008-04-24 15:03 ` [patch 01/37] Stringify support commas Mathieu Desnoyers
@ 2008-04-24 15:03 ` Mathieu Desnoyers
  2008-04-24 15:03 ` [patch 03/37] Change Alpha active count bit Mathieu Desnoyers
                   ` (35 subsequent siblings)
  37 siblings, 0 replies; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-04-24 15:03 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel
  Cc: Mathieu Desnoyers, akpm, H. Peter Anvin, Jeremy Fitzhardinge,
	Steven Rostedt, Frank Ch. Eigler

[-- Attachment #1: x86_64-page-fault-nmi-safe.patch --]
[-- Type: text/plain, Size: 3462 bytes --]

> I think you're vastly overestimating what is sane to do from an NMI
> context.  It is utterly and totally insane to assume vmalloc is available
> in NMI.
>
>       -hpa
>

Ok, please tell me where I am wrong then. Looking into
arch/x86/mm/fault.c, I see that vmalloc_sync_all() touches pgd_list
entries while the pgd_lock spinlock is taken, with interrupts disabled.
So it's protected against concurrent pgd_list modification from

a - vmalloc_sync_all() on other CPUs
b - local interrupts

However, a completely normal interrupt can arrive on a remote CPU, run
vmalloc_fault() and issue a set_pgd concurrently. Therefore I conclude
this interrupt disabling is not there to ensure any kind of protection
against concurrent updates.

Also, we see that vmalloc_fault has comments such as :

(for x86_32)
         * Do _not_ use "current" here. We might be inside
         * an interrupt in the middle of a task switch..

So it takes the pgd address from cr3, not from current. Using only the
stack/registers makes this NMI-safe even if "current" is invalid when
the NMI comes in. This is because __switch_to() updates the registers
before updating current_task, without disabling interrupts.

You are right in that x86_64 does not seem to play as safely as x86_32
on this matter; it uses current->mm. It probably shouldn't assume
"current" is valid. Actually, I don't see where x86_64 disables
interrupts around __switch_to, so this would seem to be a race
condition. Or have I missed something?

(Ingo)
> > the scheduler disables interrupts around __switch_to(). (x86 does 
> > not set __ARCH_WANT_INTERRUPTS_ON_CTXSW)
>
(Mathieu)
> Ok, so I guess it's only useful to NMIs then. However, it makes me
> wonder why this comment was there in the first place on x86_32
> vmalloc_fault() and why it uses read_cr3() :
>
>         * Do _not_ use "current" here. We might be inside
>         * an interrupt in the middle of a task switch..
(Ingo)
hm, i guess it's still useful to keep the
__ARCH_WANT_INTERRUPTS_ON_CTXSW case working too. On -rt we used to
enable it to squeeze a tiny bit more latency out of the system.
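
For reference, an illustrative sketch (not part of the patch) of the
CR3-based lookup the x86_64 hunk below switches to; the helper name is
hypothetical:

/*
 * Resolve the pgd entry for a faulting address from CR3 instead of
 * current->mm, so the lookup stays valid even if "current" is inconsistent
 * because the fault nested over a task switch.
 */
static pgd_t *pgd_from_cr3(unsigned long address)
{
	unsigned long pgd_paddr = read_cr3();	/* physical address of the PGD */

	return (pgd_t *)__va(pgd_paddr) + pgd_index(address);
}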


Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: akpm@osdl.org
CC: mingo@elte.hu
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Jeremy Fitzhardinge <jeremy@goop.org>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: "Frank Ch. Eigler" <fche@redhat.com>
---
 arch/x86/mm/fault.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Index: linux-2.6-sched-devel/arch/x86/mm/fault.c
===================================================================
--- linux-2.6-sched-devel.orig/arch/x86/mm/fault.c	2008-04-22 20:04:02.000000000 -0400
+++ linux-2.6-sched-devel/arch/x86/mm/fault.c	2008-04-22 20:09:50.000000000 -0400
@@ -535,6 +535,7 @@ static int vmalloc_fault(struct pt_regs 
 	}
 	return 0;
 #else
+	unsigned long pgd_paddr;
 	pgd_t *pgd, *pgd_ref;
 	pud_t *pud, *pud_ref;
 	pmd_t *pmd, *pmd_ref;
@@ -548,7 +549,8 @@ static int vmalloc_fault(struct pt_regs 
 	   happen within a race in page table update. In the later
 	   case just flush. */
 
-	pgd = pgd_offset(current->mm ?: &init_mm, address);
+	pgd_paddr = read_cr3();
+	pgd = __va(pgd_paddr) + pgd_index(address);
 	pgd_ref = pgd_offset_k(address);
 	if (pgd_none(*pgd_ref))
 		return -1;

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* [patch 03/37] Change Alpha active count bit
  2008-04-24 15:03 [patch 00/37] Linux Kernel Markers instrumentation for sched-devel.git Mathieu Desnoyers
  2008-04-24 15:03 ` [patch 01/37] Stringify support commas Mathieu Desnoyers
  2008-04-24 15:03 ` [patch 02/37] x86_64 page fault NMI-safe Mathieu Desnoyers
@ 2008-04-24 15:03 ` Mathieu Desnoyers
  2008-04-24 15:03 ` [patch 04/37] Change avr32 " Mathieu Desnoyers
                   ` (34 subsequent siblings)
  37 siblings, 0 replies; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-04-24 15:03 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel; +Cc: Mathieu Desnoyers, rth, ink

[-- Attachment #1: change-alpha-active-count-bit.patch --]
[-- Type: text/plain, Size: 988 bytes --]

Alpha uses bit 30 (PREEMPT_ACTIVE == 0x40000000) as the preempt-active bit.
This patch moves it to bit 28 (0x10000000) so that bit 30 is freed for the
NMI count.
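
For reference, a sketch of the resulting preempt_count layout (the NMI bit
itself is only introduced by the x86 NMI-safe patch later in this series):

/*
 *   PREEMPT_MASK:   0x000000ff
 *   SOFTIRQ_MASK:   0x0000ff00
 *   HARDIRQ_MASK:   0x0fff0000
 *   PREEMPT_ACTIVE: 0x10000000   (bit 28, after this patch)
 *   HARDNMI_MASK:   0x40000000   (bit 30, now free for the NMI count)
 */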

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: rth@twiddle.net
CC: ink@jurassic.park.msu.ru
---
 include/asm-alpha/thread_info.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6-sched-devel/include/asm-alpha/thread_info.h
===================================================================
--- linux-2.6-sched-devel.orig/include/asm-alpha/thread_info.h	2008-04-24 10:28:41.000000000 -0400
+++ linux-2.6-sched-devel/include/asm-alpha/thread_info.h	2008-04-24 10:30:22.000000000 -0400
@@ -57,7 +57,7 @@ register struct thread_info *__current_t
 
 #endif /* __ASSEMBLY__ */
 
-#define PREEMPT_ACTIVE		0x40000000
+#define PREEMPT_ACTIVE		0x10000000
 
 /*
  * Thread information flags:

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* [patch 04/37] Change avr32 active count bit
  2008-04-24 15:03 [patch 00/37] Linux Kernel Markers instrumentation for sched-devel.git Mathieu Desnoyers
                   ` (2 preceding siblings ...)
  2008-04-24 15:03 ` [patch 03/37] Change Alpha active count bit Mathieu Desnoyers
@ 2008-04-24 15:03 ` Mathieu Desnoyers
  2008-04-24 15:03 ` [patch 05/37] x86 NMI-safe INT3 and Page Fault Mathieu Desnoyers
                   ` (33 subsequent siblings)
  37 siblings, 0 replies; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-04-24 15:03 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel; +Cc: Mathieu Desnoyers, hskinnemoen

[-- Attachment #1: change-avr32-active-count-bit.patch --]
[-- Type: text/plain, Size: 965 bytes --]

avr32 uses bit 30 (PREEMPT_ACTIVE == 0x40000000) as the preempt-active bit.
This patch moves it to bit 28 (0x10000000) so that bit 30 is freed for the
NMI count.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: hskinnemoen@atmel.com
---
 include/asm-avr32/thread_info.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6-sched-devel/include/asm-avr32/thread_info.h
===================================================================
--- linux-2.6-sched-devel.orig/include/asm-avr32/thread_info.h	2008-04-24 10:31:47.000000000 -0400
+++ linux-2.6-sched-devel/include/asm-avr32/thread_info.h	2008-04-24 10:32:16.000000000 -0400
@@ -70,7 +70,7 @@ static inline struct thread_info *curren
 
 #endif /* !__ASSEMBLY__ */
 
-#define PREEMPT_ACTIVE		0x40000000
+#define PREEMPT_ACTIVE		0x10000000
 
 /*
  * Thread information flags

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* [patch 05/37] x86 NMI-safe INT3 and Page Fault
  2008-04-24 15:03 [patch 00/37] Linux Kernel Markers instrumentation for sched-devel.git Mathieu Desnoyers
                   ` (3 preceding siblings ...)
  2008-04-24 15:03 ` [patch 04/37] Change avr32 " Mathieu Desnoyers
@ 2008-04-24 15:03 ` Mathieu Desnoyers
  2008-04-24 15:03 ` [patch 06/37] Kprobes - use a mutex to protect the instruction pages list Mathieu Desnoyers
                   ` (32 subsequent siblings)
  37 siblings, 0 replies; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-04-24 15:03 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel
  Cc: Mathieu Desnoyers, akpm, H. Peter Anvin, Jeremy Fitzhardinge,
	Steven Rostedt, Frank Ch. Eigler

[-- Attachment #1: x86-nmi-safe-int3-and-page-fault.patch --]
[-- Type: text/plain, Size: 23479 bytes --]

Implements an alternative to iret, using popf and a return instruction, so
trap and exception handlers can return to the NMI handler without issuing
iret; iret would cause NMIs to be re-enabled prematurely. x86_32 uses a popf
followed by a far return. x86_64 has to copy the return instruction pointer
to the top of the previous stack, issue a popf, load the previous esp and
issue a near return (ret).

This allows placing immediate values (and therefore optimized trace_marks) in
NMI code, since returning from a breakpoint becomes valid. Accessing vmalloc'd
memory from NMI context, which allows executing module code or touching
vmapped or vmalloc'd areas, also becomes valid. This is very useful to tracers
such as LTTng.

This patch makes all faults, traps and exceptions safe to be called from NMI
context *except* single-stepping, which requires iret to restore the TF (trap
flag) and jump to the return address in a single instruction. Sorry, no
kprobes support in NMI handlers because of this limitation: we cannot
single-step an NMI handler, because iret must set the TF flag and return to
the instruction being single-stepped in one instruction. This cannot be
emulated with popf/lret, because lret would itself be single-stepped. It does
not apply to immediate values because they do not use single-stepping. This
code detects whether the TF flag is set and uses the iret path for
single-stepping, even if it reactivates NMIs prematurely.

The test that detects nesting under an NMI handler is only done upon return
from a trap/exception to kernel space, which is infrequent. The other return
paths (return from trap/exception to userspace, return from interrupt) keep
exactly the same behavior (no slowdown).
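
As an illustration only (not part of the patch), code that may run nested
over an NMI handler can test the new HARDNMI accounting; the probe function
below is hypothetical:

#include <linux/hardirq.h>

static void my_trace_probe(void)
{
	if (in_nmi()) {
		/* nested over an NMI handler: stick to lock-free paths */
		return;
	}
	/* normal interrupt/process context: regular locking is allowed */
}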

Depends on :
change-alpha-active-count-bit.patch
change-avr32-active-count-bit.patch

TODO : test with lguest, xen, kvm.

** This patch depends on the "Stringify support commas" patchset **
** Also depends on fix-x86_64-page-fault-scheduler-race patch **

tested on x86_32 (tests implemented in a separate patch) :
- instrumented the return path to export the EIP, CS and EFLAGS values when
  taken so we know the return path code has been executed.
- trace_mark, using immediate values, with 10ms delay with the breakpoint
  activated. Runs well through the return path.
- tested vmalloc faults in NMI handler by placing a non-optimized marker in the
  NMI handler (so no breakpoint is executed) and connecting a probe which
  touches every page of a 20MB vmalloc'd buffer. It executes through the return
  path without problem.
- Tested with and without preemption

tested on x86_64
- instrumented the return path to export the EIP, CS and EFLAGS values when
  taken so we know the return path code has been executed.
- trace_mark, using immediate values, with 10ms delay with the breakpoint
  activated. Runs well through the return path.

To test on x86_64 :
- Test without preemption
- Test vmalloc faults
- Test on Intel 64-bit CPUs. (AMD64 was fine)

Changelog since v1 :
- x86_64 fixes.
Changelog since v2 :
- fix paravirt build
Changelog since v3 :
- Include modifications suggested by Jeremy
Changelog since v4 :
- including hardirq.h in entry_32/64.S is a bad idea (non ifndef'd C code),
  define HARDNMI_MASK in the .S files directly.
Changelog since v5 :
- Add HARDNMI_MASK to irq_count() and make die() more verbose for NMIs.
Changelog since v7 :
- Implement paravirtualized nmi_return.
Changelog since v8 :
- refreshed the patch for asm-offsets. Those were left out of v8.
- now depends on "Stringify support commas" patch.
Changelog since v9 :
- Only test the nmi nested preempt count flag upon return from exceptions, not
  on return from interrupts. Only the kernel return path has this test.
- Add Xen, VMI, lguest support. Use their iret paravirt ops in lieu of
  nmi_return.

-- Ported to sched-devel.git

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: akpm@osdl.org
CC: mingo@elte.hu
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Jeremy Fitzhardinge <jeremy@goop.org>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: "Frank Ch. Eigler" <fche@redhat.com>
---
 arch/x86/kernel/asm-offsets_32.c    |    1 
 arch/x86/kernel/asm-offsets_64.c    |    1 
 arch/x86/kernel/entry_32.S          |   30 +++++++++++++++++++
 arch/x86/kernel/entry_64.S          |   34 ++++++++++++++++++++-
 arch/x86/kernel/paravirt.c          |    3 +
 arch/x86/kernel/paravirt_patch_32.c |    6 +++
 arch/x86/kernel/paravirt_patch_64.c |    6 +++
 arch/x86/kernel/traps_32.c          |    3 +
 arch/x86/kernel/traps_64.c          |    4 ++
 arch/x86/kernel/vmi_32.c            |    2 +
 arch/x86/lguest/boot.c              |    1 
 arch/x86/xen/enlighten.c            |    1 
 include/asm-x86/irqflags.h          |   56 ++++++++++++++++++++++++++++++++++++
 include/asm-x86/paravirt.h          |    7 +++-
 include/linux/hardirq.h             |   27 +++++++++++++++--
 15 files changed, 174 insertions(+), 8 deletions(-)

Index: linux-2.6-sched-devel/include/linux/hardirq.h
===================================================================
--- linux-2.6-sched-devel.orig/include/linux/hardirq.h	2008-04-24 10:31:46.000000000 -0400
+++ linux-2.6-sched-devel/include/linux/hardirq.h	2008-04-24 10:36:43.000000000 -0400
@@ -22,10 +22,13 @@
  * PREEMPT_MASK: 0x000000ff
  * SOFTIRQ_MASK: 0x0000ff00
  * HARDIRQ_MASK: 0x0fff0000
+ * HARDNMI_MASK: 0x40000000
  */
 #define PREEMPT_BITS	8
 #define SOFTIRQ_BITS	8
 
+#define HARDNMI_BITS	1
+
 #ifndef HARDIRQ_BITS
 #define HARDIRQ_BITS	12
 
@@ -45,16 +48,19 @@
 #define PREEMPT_SHIFT	0
 #define SOFTIRQ_SHIFT	(PREEMPT_SHIFT + PREEMPT_BITS)
 #define HARDIRQ_SHIFT	(SOFTIRQ_SHIFT + SOFTIRQ_BITS)
+#define HARDNMI_SHIFT	(30)
 
 #define __IRQ_MASK(x)	((1UL << (x))-1)
 
 #define PREEMPT_MASK	(__IRQ_MASK(PREEMPT_BITS) << PREEMPT_SHIFT)
 #define SOFTIRQ_MASK	(__IRQ_MASK(SOFTIRQ_BITS) << SOFTIRQ_SHIFT)
 #define HARDIRQ_MASK	(__IRQ_MASK(HARDIRQ_BITS) << HARDIRQ_SHIFT)
+#define HARDNMI_MASK	(__IRQ_MASK(HARDNMI_BITS) << HARDNMI_SHIFT)
 
 #define PREEMPT_OFFSET	(1UL << PREEMPT_SHIFT)
 #define SOFTIRQ_OFFSET	(1UL << SOFTIRQ_SHIFT)
 #define HARDIRQ_OFFSET	(1UL << HARDIRQ_SHIFT)
+#define HARDNMI_OFFSET	(1UL << HARDNMI_SHIFT)
 
 #if PREEMPT_ACTIVE < (1 << (HARDIRQ_SHIFT + HARDIRQ_BITS))
 #error PREEMPT_ACTIVE is too low!
@@ -62,7 +68,9 @@
 
 #define hardirq_count()	(preempt_count() & HARDIRQ_MASK)
 #define softirq_count()	(preempt_count() & SOFTIRQ_MASK)
-#define irq_count()	(preempt_count() & (HARDIRQ_MASK | SOFTIRQ_MASK))
+#define irq_count() \
+	(preempt_count() & (HARDNMI_MASK | HARDIRQ_MASK | SOFTIRQ_MASK))
+#define hardnmi_count()	(preempt_count() & HARDNMI_MASK)
 
 /*
  * Are we doing bottom half or hardware interrupt processing?
@@ -71,6 +79,7 @@
 #define in_irq()		(hardirq_count())
 #define in_softirq()		(softirq_count())
 #define in_interrupt()		(irq_count())
+#define in_nmi()		(hardnmi_count())
 
 /*
  * Are we running in atomic context?  WARNING: this macro cannot
@@ -159,7 +168,19 @@ extern void irq_enter(void);
  */
 extern void irq_exit(void);
 
-#define nmi_enter()		do { lockdep_off(); __irq_enter(); } while (0)
-#define nmi_exit()		do { __irq_exit(); lockdep_on(); } while (0)
+#define nmi_enter()					\
+	do {						\
+		lockdep_off();				\
+		BUG_ON(hardnmi_count());		\
+		add_preempt_count(HARDNMI_OFFSET);	\
+		__irq_enter();				\
+	} while (0)
+
+#define nmi_exit()					\
+	do {						\
+		__irq_exit();				\
+		sub_preempt_count(HARDNMI_OFFSET);	\
+		lockdep_on();				\
+	} while (0)
 
 #endif /* LINUX_HARDIRQ_H */
Index: linux-2.6-sched-devel/arch/x86/kernel/entry_32.S
===================================================================
--- linux-2.6-sched-devel.orig/arch/x86/kernel/entry_32.S	2008-04-24 10:31:46.000000000 -0400
+++ linux-2.6-sched-devel/arch/x86/kernel/entry_32.S	2008-04-24 10:36:43.000000000 -0400
@@ -68,6 +68,8 @@
 
 #define nr_syscalls ((syscall_table_size)/4)
 
+#define HARDNMI_MASK 0x40000000
+
 #ifdef CONFIG_PREEMPT
 #define preempt_stop(clobbers)	DISABLE_INTERRUPTS(clobbers); TRACE_IRQS_OFF
 #else
@@ -232,8 +234,32 @@ END(ret_from_fork)
 	# userspace resumption stub bypassing syscall exit tracing
 	ALIGN
 	RING0_PTREGS_FRAME
+
 ret_from_exception:
 	preempt_stop(CLBR_ANY)
+	GET_THREAD_INFO(%ebp)
+	movl PT_EFLAGS(%esp), %eax	# mix EFLAGS and CS
+	movb PT_CS(%esp), %al
+	andl $(X86_EFLAGS_VM | SEGMENT_RPL_MASK), %eax
+	cmpl $USER_RPL, %eax
+	jae resume_userspace	# returning to v8086 or userspace
+	testl $HARDNMI_MASK,TI_preempt_count(%ebp)
+	jz resume_kernel		/* Not nested over NMI ? */
+	testw $X86_EFLAGS_TF, PT_EFLAGS(%esp)
+	jnz resume_kernel		/*
+					 * If single-stepping an NMI handler,
+					 * use the normal iret path instead of
+					 * the popf/lret because lret would be
+					 * single-stepped. It should not
+					 * happen : it will reactivate NMIs
+					 * prematurely.
+					 */
+	TRACE_IRQS_IRET
+	RESTORE_REGS
+	addl $4, %esp			# skip orig_eax/error_code
+	CFI_ADJUST_CFA_OFFSET -4
+	INTERRUPT_RETURN_NMI_SAFE
+
 ret_from_intr:
 	GET_THREAD_INFO(%ebp)
 check_userspace:
@@ -873,6 +899,10 @@ ENTRY(native_iret)
 .previous
 END(native_iret)
 
+ENTRY(native_nmi_return)
+	NATIVE_INTERRUPT_RETURN_NMI_SAFE # Should we deal with popf exception ?
+END(native_nmi_return)
+
 ENTRY(native_irq_enable_syscall_ret)
 	sti
 	sysexit
Index: linux-2.6-sched-devel/arch/x86/kernel/entry_64.S
===================================================================
--- linux-2.6-sched-devel.orig/arch/x86/kernel/entry_64.S	2008-04-24 10:31:46.000000000 -0400
+++ linux-2.6-sched-devel/arch/x86/kernel/entry_64.S	2008-04-24 10:36:43.000000000 -0400
@@ -156,6 +156,8 @@ END(mcount)
 #endif /* CONFIG_DYNAMIC_FTRACE */
 #endif /* CONFIG_FTRACE */
 
+#define HARDNMI_MASK 0x40000000
+
 #ifndef CONFIG_PREEMPT
 #define retint_kernel retint_restore_args
 #endif	
@@ -698,6 +700,9 @@ ENTRY(native_iret)
 	.section __ex_table,"a"
 	.quad native_iret, bad_iret
 	.previous
+
+ENTRY(native_nmi_return)
+	NATIVE_INTERRUPT_RETURN_NMI_SAFE
 #endif
 
 	.section .fixup,"ax"
@@ -753,6 +758,23 @@ retint_signal:
 	GET_THREAD_INFO(%rcx)
 	jmp retint_check
 
+	/* Returning to kernel space from exception. */
+	/* rcx:	 threadinfo. interrupts off. */
+ENTRY(retexc_kernel)
+	testl $HARDNMI_MASK,threadinfo_preempt_count(%rcx)
+	jz retint_kernel		/* Not nested over NMI ? */
+	testw $X86_EFLAGS_TF,EFLAGS-ARGOFFSET(%rsp)	/* trap flag? */
+	jnz retint_kernel		/*
+					 * If single-stepping an NMI handler,
+					 * use the normal iret path instead of
+					 * the popf/lret because lret would be
+					 * single-stepped. It should not
+					 * happen : it will reactivate NMIs
+					 * prematurely.
+					 */
+	RESTORE_ARGS 0,8,0
+	INTERRUPT_RETURN_NMI_SAFE
+
 #ifdef CONFIG_PREEMPT
 	/* Returning to kernel space. Check if we need preemption */
 	/* rcx:	 threadinfo. interrupts off. */
@@ -911,9 +933,17 @@ paranoid_swapgs\trace:
 	TRACE_IRQS_IRETQ 0
 	.endif
 	SWAPGS_UNSAFE_STACK
-paranoid_restore\trace:
+paranoid_restore_no_nmi\trace:
 	RESTORE_ALL 8
 	jmp irq_return
+paranoid_restore\trace:
+	GET_THREAD_INFO(%rcx)
+	testl $HARDNMI_MASK,threadinfo_preempt_count(%rcx)
+	jz paranoid_restore_no_nmi\trace	/* Nested over NMI ? */
+	testw $X86_EFLAGS_TF,EFLAGS-0(%rsp)	/* trap flag? */
+	jnz paranoid_restore_no_nmi\trace
+	RESTORE_ALL 8
+	INTERRUPT_RETURN_NMI_SAFE
 paranoid_userspace\trace:
 	GET_THREAD_INFO(%rcx)
 	movl threadinfo_flags(%rcx),%ebx
@@ -1012,7 +1042,7 @@ error_exit:
 	TRACE_IRQS_OFF
 	GET_THREAD_INFO(%rcx)	
 	testl %eax,%eax
-	jne  retint_kernel
+	jne  retexc_kernel
 	LOCKDEP_SYS_EXIT_IRQ
 	movl  threadinfo_flags(%rcx),%edx
 	movl  $_TIF_WORK_MASK,%edi
Index: linux-2.6-sched-devel/include/asm-x86/irqflags.h
===================================================================
--- linux-2.6-sched-devel.orig/include/asm-x86/irqflags.h	2008-04-24 10:31:46.000000000 -0400
+++ linux-2.6-sched-devel/include/asm-x86/irqflags.h	2008-04-24 10:36:43.000000000 -0400
@@ -51,6 +51,61 @@ static inline void native_halt(void)
 
 #endif
 
+#ifdef CONFIG_X86_64
+/*
+ * Only returns from a trap or exception to a NMI context (intra-privilege
+ * level near return) to the same SS and CS segments. Should be used
+ * upon trap or exception return when nested over a NMI context so no iret is
+ * issued. It takes care of modifying the eflags, rsp and returning to the
+ * previous function.
+ *
+ * The stack, at that point, looks like :
+ *
+ * 0(rsp)  RIP
+ * 8(rsp)  CS
+ * 16(rsp) EFLAGS
+ * 24(rsp) RSP
+ * 32(rsp) SS
+ *
+ * Upon execution :
+ * Copy EIP to the top of the return stack
+ * Update top of return stack address
+ * Pop eflags into the eflags register
+ * Make the return stack current
+ * Near return (popping the return address from the return stack)
+ */
+#define NATIVE_INTERRUPT_RETURN_NMI_SAFE	pushq %rax;		\
+						movq %rsp, %rax;	\
+						movq 24+8(%rax), %rsp;	\
+						pushq 0+8(%rax);	\
+						pushq 16+8(%rax);	\
+						movq (%rax), %rax;	\
+						popfq;			\
+						ret
+#else
+/*
+ * Protected mode only, no V8086. Implies that protected mode must
+ * be entered before NMIs or MCEs are enabled. Only returns from a trap or
+ * exception to a NMI context (intra-privilege level far return). Should be used
+ * upon trap or exception return when nested over a NMI context so no iret is
+ * issued.
+ *
+ * The stack, at that point, looks like :
+ *
+ * 0(esp) EIP
+ * 4(esp) CS
+ * 8(esp) EFLAGS
+ *
+ * Upon execution :
+ * Copy the stack eflags to top of stack
+ * Pop eflags into the eflags register
+ * Far return: pop EIP and CS into their register, and additionally pop EFLAGS.
+ */
+#define NATIVE_INTERRUPT_RETURN_NMI_SAFE	pushl 8(%esp);	\
+						popfl;		\
+						lret $4
+#endif
+
 #ifdef CONFIG_PARAVIRT
 #include <asm/paravirt.h>
 #else
@@ -109,6 +164,7 @@ static inline unsigned long __raw_local_
 
 #define ENABLE_INTERRUPTS(x)	sti
 #define DISABLE_INTERRUPTS(x)	cli
+#define INTERRUPT_RETURN_NMI_SAFE	NATIVE_INTERRUPT_RETURN_NMI_SAFE
 
 #ifdef CONFIG_X86_64
 #define INTERRUPT_RETURN	iretq
Index: linux-2.6-sched-devel/include/asm-x86/paravirt.h
===================================================================
--- linux-2.6-sched-devel.orig/include/asm-x86/paravirt.h	2008-04-24 10:31:46.000000000 -0400
+++ linux-2.6-sched-devel/include/asm-x86/paravirt.h	2008-04-24 10:36:43.000000000 -0400
@@ -141,9 +141,10 @@ struct pv_cpu_ops {
 	u64 (*read_pmc)(int counter);
 	unsigned long long (*read_tscp)(unsigned int *aux);
 
-	/* These two are jmp to, not actually called. */
+	/* These three are jmp to, not actually called. */
 	void (*irq_enable_syscall_ret)(void);
 	void (*iret)(void);
+	void (*nmi_return)(void);
 
 	void (*swapgs)(void);
 
@@ -1385,6 +1386,10 @@ static inline unsigned long __raw_local_
 	PARA_SITE(PARA_PATCH(pv_cpu_ops, PV_CPU_iret), CLBR_NONE,	\
 		  jmp *%cs:pv_cpu_ops+PV_CPU_iret)
 
+#define INTERRUPT_RETURN_NMI_SAFE					\
+	PARA_SITE(PARA_PATCH(pv_cpu_ops, PV_CPU_nmi_return), CLBR_NONE,	\
+		  jmp *%cs:pv_cpu_ops+PV_CPU_nmi_return)
+
 #define DISABLE_INTERRUPTS(clobbers)					\
 	PARA_SITE(PARA_PATCH(pv_irq_ops, PV_IRQ_irq_disable), clobbers, \
 		  PV_SAVE_REGS;			\
Index: linux-2.6-sched-devel/arch/x86/kernel/traps_32.c
===================================================================
--- linux-2.6-sched-devel.orig/arch/x86/kernel/traps_32.c	2008-04-24 10:31:46.000000000 -0400
+++ linux-2.6-sched-devel/arch/x86/kernel/traps_32.c	2008-04-24 10:36:43.000000000 -0400
@@ -476,6 +476,9 @@ void die(const char *str, struct pt_regs
 	if (kexec_should_crash(current))
 		crash_kexec(regs);
 
+	if (in_nmi())
+		panic("Fatal exception in non-maskable interrupt");
+
 	if (in_interrupt())
 		panic("Fatal exception in interrupt");
 
Index: linux-2.6-sched-devel/arch/x86/kernel/traps_64.c
===================================================================
--- linux-2.6-sched-devel.orig/arch/x86/kernel/traps_64.c	2008-04-24 10:31:46.000000000 -0400
+++ linux-2.6-sched-devel/arch/x86/kernel/traps_64.c	2008-04-24 10:36:43.000000000 -0400
@@ -556,6 +556,10 @@ void __kprobes oops_end(unsigned long fl
 		oops_exit();
 		return;
 	}
+	if (in_nmi())
+		panic("Fatal exception in non-maskable interrupt");
+	if (in_interrupt())
+		panic("Fatal exception in interrupt");
 	if (panic_on_oops)
 		panic("Fatal exception");
 	oops_exit();
Index: linux-2.6-sched-devel/arch/x86/kernel/paravirt.c
===================================================================
--- linux-2.6-sched-devel.orig/arch/x86/kernel/paravirt.c	2008-04-24 10:31:46.000000000 -0400
+++ linux-2.6-sched-devel/arch/x86/kernel/paravirt.c	2008-04-24 10:36:43.000000000 -0400
@@ -139,6 +139,7 @@ unsigned paravirt_patch_default(u8 type,
 		/* If the operation is a nop, then nop the callsite */
 		ret = paravirt_patch_nop();
 	else if (type == PARAVIRT_PATCH(pv_cpu_ops.iret) ||
+		 type == PARAVIRT_PATCH(pv_cpu_ops.nmi_return) ||
 		 type == PARAVIRT_PATCH(pv_cpu_ops.irq_enable_syscall_ret))
 		/* If operation requires a jmp, then jmp */
 		ret = paravirt_patch_jmp(insnbuf, opfunc, addr, len);
@@ -190,6 +191,7 @@ static void native_flush_tlb_single(unsi
 
 /* These are in entry.S */
 extern void native_iret(void);
+extern void native_nmi_return(void);
 extern void native_irq_enable_syscall_ret(void);
 
 static int __init print_banner(void)
@@ -328,6 +330,7 @@ struct pv_cpu_ops pv_cpu_ops = {
 
 	.irq_enable_syscall_ret = native_irq_enable_syscall_ret,
 	.iret = native_iret,
+	.nmi_return = native_nmi_return,
 	.swapgs = native_swapgs,
 
 	.set_iopl_mask = native_set_iopl_mask,
Index: linux-2.6-sched-devel/arch/x86/kernel/paravirt_patch_32.c
===================================================================
--- linux-2.6-sched-devel.orig/arch/x86/kernel/paravirt_patch_32.c	2008-04-24 10:31:46.000000000 -0400
+++ linux-2.6-sched-devel/arch/x86/kernel/paravirt_patch_32.c	2008-04-24 10:36:43.000000000 -0400
@@ -1,10 +1,13 @@
-#include <asm/paravirt.h>
+#include <linux/stringify.h>
+#include <linux/irqflags.h>
 
 DEF_NATIVE(pv_irq_ops, irq_disable, "cli");
 DEF_NATIVE(pv_irq_ops, irq_enable, "sti");
 DEF_NATIVE(pv_irq_ops, restore_fl, "push %eax; popf");
 DEF_NATIVE(pv_irq_ops, save_fl, "pushf; pop %eax");
 DEF_NATIVE(pv_cpu_ops, iret, "iret");
+DEF_NATIVE(pv_cpu_ops, nmi_return,
+	__stringify(NATIVE_INTERRUPT_RETURN_NMI_SAFE));
 DEF_NATIVE(pv_cpu_ops, irq_enable_syscall_ret, "sti; sysexit");
 DEF_NATIVE(pv_mmu_ops, read_cr2, "mov %cr2, %eax");
 DEF_NATIVE(pv_mmu_ops, write_cr3, "mov %eax, %cr3");
@@ -29,6 +32,7 @@ unsigned native_patch(u8 type, u16 clobb
 		PATCH_SITE(pv_irq_ops, restore_fl);
 		PATCH_SITE(pv_irq_ops, save_fl);
 		PATCH_SITE(pv_cpu_ops, iret);
+		PATCH_SITE(pv_cpu_ops, nmi_return);
 		PATCH_SITE(pv_cpu_ops, irq_enable_syscall_ret);
 		PATCH_SITE(pv_mmu_ops, read_cr2);
 		PATCH_SITE(pv_mmu_ops, read_cr3);
Index: linux-2.6-sched-devel/arch/x86/kernel/paravirt_patch_64.c
===================================================================
--- linux-2.6-sched-devel.orig/arch/x86/kernel/paravirt_patch_64.c	2008-04-24 10:31:46.000000000 -0400
+++ linux-2.6-sched-devel/arch/x86/kernel/paravirt_patch_64.c	2008-04-24 10:36:43.000000000 -0400
@@ -1,12 +1,15 @@
+#include <linux/irqflags.h>
+#include <linux/stringify.h>
 #include <asm/paravirt.h>
 #include <asm/asm-offsets.h>
-#include <linux/stringify.h>
 
 DEF_NATIVE(pv_irq_ops, irq_disable, "cli");
 DEF_NATIVE(pv_irq_ops, irq_enable, "sti");
 DEF_NATIVE(pv_irq_ops, restore_fl, "pushq %rdi; popfq");
 DEF_NATIVE(pv_irq_ops, save_fl, "pushfq; popq %rax");
 DEF_NATIVE(pv_cpu_ops, iret, "iretq");
+DEF_NATIVE(pv_cpu_ops, nmi_return,
+	__stringify(NATIVE_INTERRUPT_RETURN_NMI_SAFE));
 DEF_NATIVE(pv_mmu_ops, read_cr2, "movq %cr2, %rax");
 DEF_NATIVE(pv_mmu_ops, read_cr3, "movq %cr3, %rax");
 DEF_NATIVE(pv_mmu_ops, write_cr3, "movq %rdi, %cr3");
@@ -35,6 +38,7 @@ unsigned native_patch(u8 type, u16 clobb
 		PATCH_SITE(pv_irq_ops, irq_enable);
 		PATCH_SITE(pv_irq_ops, irq_disable);
 		PATCH_SITE(pv_cpu_ops, iret);
+		PATCH_SITE(pv_cpu_ops, nmi_return);
 		PATCH_SITE(pv_cpu_ops, irq_enable_syscall_ret);
 		PATCH_SITE(pv_cpu_ops, swapgs);
 		PATCH_SITE(pv_mmu_ops, read_cr2);
Index: linux-2.6-sched-devel/arch/x86/kernel/asm-offsets_32.c
===================================================================
--- linux-2.6-sched-devel.orig/arch/x86/kernel/asm-offsets_32.c	2008-04-24 10:31:46.000000000 -0400
+++ linux-2.6-sched-devel/arch/x86/kernel/asm-offsets_32.c	2008-04-24 10:36:43.000000000 -0400
@@ -118,6 +118,7 @@ void foo(void)
 	OFFSET(PV_IRQ_irq_disable, pv_irq_ops, irq_disable);
 	OFFSET(PV_IRQ_irq_enable, pv_irq_ops, irq_enable);
 	OFFSET(PV_CPU_iret, pv_cpu_ops, iret);
+	OFFSET(PV_CPU_nmi_return, pv_cpu_ops, nmi_return);
 	OFFSET(PV_CPU_irq_enable_syscall_ret, pv_cpu_ops, irq_enable_syscall_ret);
 	OFFSET(PV_CPU_read_cr0, pv_cpu_ops, read_cr0);
 #endif
Index: linux-2.6-sched-devel/arch/x86/kernel/asm-offsets_64.c
===================================================================
--- linux-2.6-sched-devel.orig/arch/x86/kernel/asm-offsets_64.c	2008-04-24 10:31:46.000000000 -0400
+++ linux-2.6-sched-devel/arch/x86/kernel/asm-offsets_64.c	2008-04-24 10:36:43.000000000 -0400
@@ -69,6 +69,7 @@ int main(void)
 	OFFSET(PV_IRQ_irq_disable, pv_irq_ops, irq_disable);
 	OFFSET(PV_IRQ_irq_enable, pv_irq_ops, irq_enable);
 	OFFSET(PV_CPU_iret, pv_cpu_ops, iret);
+	OFFSET(PV_CPU_nmi_return, pv_cpu_ops, nmi_return);
 	OFFSET(PV_CPU_irq_enable_syscall_ret, pv_cpu_ops, irq_enable_syscall_ret);
 	OFFSET(PV_CPU_swapgs, pv_cpu_ops, swapgs);
 	OFFSET(PV_MMU_read_cr2, pv_mmu_ops, read_cr2);
Index: linux-2.6-sched-devel/arch/x86/xen/enlighten.c
===================================================================
--- linux-2.6-sched-devel.orig/arch/x86/xen/enlighten.c	2008-04-24 10:31:46.000000000 -0400
+++ linux-2.6-sched-devel/arch/x86/xen/enlighten.c	2008-04-24 10:36:43.000000000 -0400
@@ -1008,6 +1008,7 @@ static const struct pv_cpu_ops xen_cpu_o
 	.read_pmc = native_read_pmc,
 
 	.iret = xen_iret,
+	.nmi_return = xen_iret,
 	.irq_enable_syscall_ret = xen_sysexit,
 
 	.load_tr_desc = paravirt_nop,
Index: linux-2.6-sched-devel/arch/x86/kernel/vmi_32.c
===================================================================
--- linux-2.6-sched-devel.orig/arch/x86/kernel/vmi_32.c	2008-04-24 10:31:46.000000000 -0400
+++ linux-2.6-sched-devel/arch/x86/kernel/vmi_32.c	2008-04-24 10:36:43.000000000 -0400
@@ -151,6 +151,8 @@ static unsigned vmi_patch(u8 type, u16 c
 					      insns, ip);
 		case PARAVIRT_PATCH(pv_cpu_ops.iret):
 			return patch_internal(VMI_CALL_IRET, len, insns, ip);
+		case PARAVIRT_PATCH(pv_cpu_ops.nmi_return):
+			return patch_internal(VMI_CALL_IRET, len, insns, ip);
 		case PARAVIRT_PATCH(pv_cpu_ops.irq_enable_syscall_ret):
 			return patch_internal(VMI_CALL_SYSEXIT, len, insns, ip);
 		default:
Index: linux-2.6-sched-devel/arch/x86/lguest/boot.c
===================================================================
--- linux-2.6-sched-devel.orig/arch/x86/lguest/boot.c	2008-04-24 10:31:46.000000000 -0400
+++ linux-2.6-sched-devel/arch/x86/lguest/boot.c	2008-04-24 10:36:43.000000000 -0400
@@ -958,6 +958,7 @@ __init void lguest_init(void)
 	pv_cpu_ops.cpuid = lguest_cpuid;
 	pv_cpu_ops.load_idt = lguest_load_idt;
 	pv_cpu_ops.iret = lguest_iret;
+	pv_cpu_ops.nmi_return = lguest_iret;
 	pv_cpu_ops.load_sp0 = lguest_load_sp0;
 	pv_cpu_ops.load_tr_desc = lguest_load_tr_desc;
 	pv_cpu_ops.set_ldt = lguest_set_ldt;

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* [patch 06/37] Kprobes - use a mutex to protect the instruction pages list.
  2008-04-24 15:03 [patch 00/37] Linux Kernel Markers instrumentation for sched-devel.git Mathieu Desnoyers
                   ` (4 preceding siblings ...)
  2008-04-24 15:03 ` [patch 05/37] x86 NMI-safe INT3 and Page Fault Mathieu Desnoyers
@ 2008-04-24 15:03 ` Mathieu Desnoyers
  2008-04-24 15:03 ` [patch 07/37] Kprobes - do not use kprobes mutex in arch code Mathieu Desnoyers
                   ` (31 subsequent siblings)
  37 siblings, 0 replies; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-04-24 15:03 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel
  Cc: Mathieu Desnoyers, Ananth N Mavinakayanahalli, Masami Hiramatsu,
	hch, anil.s.keshavamurthy, davem

[-- Attachment #1: kprobes-use-mutex-for-insn-pages.patch --]
[-- Type: text/plain, Size: 3650 bytes --]

Protect the instruction pages list with a specific insn pages mutex, taken in
get_insn_slot() and free_insn_slot(). This makes sure that architectures that
do not need to call arch_remove_kprobe() do not take an unneeded kprobe mutex.
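
For illustration only (not part of the patch), a sketch of how a caller sees
the new locking; the wrapper function is hypothetical:

#include <linux/kprobes.h>

static kprobe_opcode_t *grab_slot_example(void)
{
	kprobe_opcode_t *slot;

	slot = get_insn_slot();	/* serialized on kprobe_insn_mutex internally */
	if (!slot)
		return NULL;
	/* ... copy the probed instruction into the slot ... */
	return slot;		/* later freed with free_insn_slot(slot, dirty) */
}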

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
CC: hch@infradead.org
CC: anil.s.keshavamurthy@intel.com
CC: davem@davemloft.net
---
 kernel/kprobes.c |   27 +++++++++++++++++++++------
 1 file changed, 21 insertions(+), 6 deletions(-)

Index: linux-2.6-lttng/kernel/kprobes.c
===================================================================
--- linux-2.6-lttng.orig/kernel/kprobes.c	2007-08-27 11:48:56.000000000 -0400
+++ linux-2.6-lttng/kernel/kprobes.c	2007-08-27 11:48:58.000000000 -0400
@@ -95,6 +95,10 @@ enum kprobe_slot_state {
 	SLOT_USED = 2,
 };
 
+/*
+ * Protects the kprobe_insn_pages list. Can nest into kprobe_mutex.
+ */
+static DEFINE_MUTEX(kprobe_insn_mutex);
 static struct hlist_head kprobe_insn_pages;
 static int kprobe_garbage_slots;
 static int collect_garbage_slots(void);
@@ -131,7 +135,9 @@ kprobe_opcode_t __kprobes *get_insn_slot
 {
 	struct kprobe_insn_page *kip;
 	struct hlist_node *pos;
+	kprobe_opcode_t *ret;
 
+	mutex_lock(&kprobe_insn_mutex);
  retry:
 	hlist_for_each_entry(kip, pos, &kprobe_insn_pages, hlist) {
 		if (kip->nused < INSNS_PER_PAGE) {
@@ -140,7 +146,8 @@ kprobe_opcode_t __kprobes *get_insn_slot
 				if (kip->slot_used[i] == SLOT_CLEAN) {
 					kip->slot_used[i] = SLOT_USED;
 					kip->nused++;
-					return kip->insns + (i * MAX_INSN_SIZE);
+					ret = kip->insns + (i * MAX_INSN_SIZE);
+					goto end;
 				}
 			}
 			/* Surprise!  No unused slots.  Fix kip->nused. */
@@ -154,8 +161,10 @@ kprobe_opcode_t __kprobes *get_insn_slot
 	}
 	/* All out of space.  Need to allocate a new page. Use slot 0. */
 	kip = kmalloc(sizeof(struct kprobe_insn_page), GFP_KERNEL);
-	if (!kip)
-		return NULL;
+	if (!kip) {
+		ret = NULL;
+		goto end;
+	}
 
 	/*
 	 * Use module_alloc so this page is within +/- 2GB of where the
@@ -165,7 +174,8 @@ kprobe_opcode_t __kprobes *get_insn_slot
 	kip->insns = module_alloc(PAGE_SIZE);
 	if (!kip->insns) {
 		kfree(kip);
-		return NULL;
+		ret = NULL;
+		goto end;
 	}
 	INIT_HLIST_NODE(&kip->hlist);
 	hlist_add_head(&kip->hlist, &kprobe_insn_pages);
@@ -173,7 +183,10 @@ kprobe_opcode_t __kprobes *get_insn_slot
 	kip->slot_used[0] = SLOT_USED;
 	kip->nused = 1;
 	kip->ngarbage = 0;
-	return kip->insns;
+	ret = kip->insns;
+end:
+	mutex_unlock(&kprobe_insn_mutex);
+	return ret;
 }
 
 /* Return 1 if all garbages are collected, otherwise 0. */
@@ -207,7 +220,7 @@ static int __kprobes collect_garbage_slo
 	struct kprobe_insn_page *kip;
 	struct hlist_node *pos, *next;
 
-	/* Ensure no-one is preepmted on the garbages */
+	/* Ensure no-one is preempted on the garbages */
 	if (check_safety() != 0)
 		return -EAGAIN;
 
@@ -231,6 +244,7 @@ void __kprobes free_insn_slot(kprobe_opc
 	struct kprobe_insn_page *kip;
 	struct hlist_node *pos;
 
+	mutex_lock(&kprobe_insn_mutex);
 	hlist_for_each_entry(kip, pos, &kprobe_insn_pages, hlist) {
 		if (kip->insns <= slot &&
 		    slot < kip->insns + (INSNS_PER_PAGE * MAX_INSN_SIZE)) {
@@ -247,6 +261,7 @@ void __kprobes free_insn_slot(kprobe_opc
 
 	if (dirty && ++kprobe_garbage_slots > INSNS_PER_PAGE)
 		collect_garbage_slots();
+	mutex_unlock(&kprobe_insn_mutex);
 }
 #endif
 

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* [patch 07/37] Kprobes - do not use kprobes mutex in arch code
  2008-04-24 15:03 [patch 00/37] Linux Kernel Markers instrumentation for sched-devel.git Mathieu Desnoyers
                   ` (5 preceding siblings ...)
  2008-04-24 15:03 ` [patch 06/37] Kprobes - use a mutex to protect the instruction pages list Mathieu Desnoyers
@ 2008-04-24 15:03 ` Mathieu Desnoyers
  2008-04-24 15:03 ` [patch 08/37] Kprobes - declare kprobe_mutex static Mathieu Desnoyers
                   ` (30 subsequent siblings)
  37 siblings, 0 replies; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-04-24 15:03 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel
  Cc: Mathieu Desnoyers, Ananth N Mavinakayanahalli, Masami Hiramatsu,
	anil.s.keshavamurthy, davem

[-- Attachment #1: kprobes-dont-use-kprobes-mutex-in-arch-code.patch --]
[-- Type: text/plain, Size: 4222 bytes --]

Remove the kprobe_mutex declaration from kprobes.h, since it does not belong
there. Also remove all uses of this mutex in the architecture-specific code,
replacing them with proper mutex lock/unlock in the architecture-agnostic
code.

Changelog :
- remove unnecessary kprobe_mutex around arch_remove_kprobe()

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
CC: anil.s.keshavamurthy@intel.com
CC: davem@davemloft.net
---
 arch/ia64/kernel/kprobes.c    |    2 --
 arch/powerpc/kernel/kprobes.c |    2 --
 arch/s390/kernel/kprobes.c    |    2 --
 arch/x86/kernel/kprobes.c     |    2 --
 include/linux/kprobes.h       |    2 --
 5 files changed, 10 deletions(-)

Index: linux-2.6-sched-devel/include/linux/kprobes.h
===================================================================
--- linux-2.6-sched-devel.orig/include/linux/kprobes.h	2008-04-19 17:41:25.000000000 -0400
+++ linux-2.6-sched-devel/include/linux/kprobes.h	2008-04-22 20:12:38.000000000 -0400
@@ -35,7 +35,6 @@
 #include <linux/percpu.h>
 #include <linux/spinlock.h>
 #include <linux/rcupdate.h>
-#include <linux/mutex.h>
 
 #ifdef CONFIG_KPROBES
 #include <asm/kprobes.h>
@@ -195,7 +194,6 @@ static inline int init_test_probes(void)
 #endif /* CONFIG_KPROBES_SANITY_TEST */
 
 extern spinlock_t kretprobe_lock;
-extern struct mutex kprobe_mutex;
 extern int arch_prepare_kprobe(struct kprobe *p);
 extern void arch_arm_kprobe(struct kprobe *p);
 extern void arch_disarm_kprobe(struct kprobe *p);
Index: linux-2.6-sched-devel/arch/x86/kernel/kprobes.c
===================================================================
--- linux-2.6-sched-devel.orig/arch/x86/kernel/kprobes.c	2008-04-22 20:04:02.000000000 -0400
+++ linux-2.6-sched-devel/arch/x86/kernel/kprobes.c	2008-04-22 20:12:38.000000000 -0400
@@ -376,9 +376,7 @@ void __kprobes arch_disarm_kprobe(struct
 
 void __kprobes arch_remove_kprobe(struct kprobe *p)
 {
-	mutex_lock(&kprobe_mutex);
 	free_insn_slot(p->ainsn.insn, (p->ainsn.boostable == 1));
-	mutex_unlock(&kprobe_mutex);
 }
 
 static void __kprobes save_previous_kprobe(struct kprobe_ctlblk *kcb)
Index: linux-2.6-sched-devel/arch/ia64/kernel/kprobes.c
===================================================================
--- linux-2.6-sched-devel.orig/arch/ia64/kernel/kprobes.c	2008-04-22 20:04:00.000000000 -0400
+++ linux-2.6-sched-devel/arch/ia64/kernel/kprobes.c	2008-04-22 20:12:54.000000000 -0400
@@ -672,9 +672,7 @@ void __kprobes arch_disarm_kprobe(struct
 
 void __kprobes arch_remove_kprobe(struct kprobe *p)
 {
-	mutex_lock(&kprobe_mutex);
 	free_insn_slot(p->ainsn.insn, p->ainsn.inst_flag & INST_FLAG_BOOSTABLE);
-	mutex_unlock(&kprobe_mutex);
 }
 /*
  * We are resuming execution after a single step fault, so the pt_regs
Index: linux-2.6-sched-devel/arch/powerpc/kernel/kprobes.c
===================================================================
--- linux-2.6-sched-devel.orig/arch/powerpc/kernel/kprobes.c	2008-04-19 17:41:25.000000000 -0400
+++ linux-2.6-sched-devel/arch/powerpc/kernel/kprobes.c	2008-04-22 20:12:38.000000000 -0400
@@ -88,9 +88,7 @@ void __kprobes arch_disarm_kprobe(struct
 
 void __kprobes arch_remove_kprobe(struct kprobe *p)
 {
-	mutex_lock(&kprobe_mutex);
 	free_insn_slot(p->ainsn.insn, 0);
-	mutex_unlock(&kprobe_mutex);
 }
 
 static void __kprobes prepare_singlestep(struct kprobe *p, struct pt_regs *regs)
Index: linux-2.6-sched-devel/arch/s390/kernel/kprobes.c
===================================================================
--- linux-2.6-sched-devel.orig/arch/s390/kernel/kprobes.c	2008-04-22 20:04:02.000000000 -0400
+++ linux-2.6-sched-devel/arch/s390/kernel/kprobes.c	2008-04-22 20:12:38.000000000 -0400
@@ -220,9 +220,7 @@ void __kprobes arch_disarm_kprobe(struct
 
 void __kprobes arch_remove_kprobe(struct kprobe *p)
 {
-	mutex_lock(&kprobe_mutex);
 	free_insn_slot(p->ainsn.insn, 0);
-	mutex_unlock(&kprobe_mutex);
 }
 
 static void __kprobes prepare_singlestep(struct kprobe *p, struct pt_regs *regs)

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* [patch 08/37] Kprobes - declare kprobe_mutex static
  2008-04-24 15:03 [patch 00/37] Linux Kernel Markers instrumentation for sched-devel.git Mathieu Desnoyers
                   ` (6 preceding siblings ...)
  2008-04-24 15:03 ` [patch 07/37] Kprobes - do not use kprobes mutex in arch code Mathieu Desnoyers
@ 2008-04-24 15:03 ` Mathieu Desnoyers
  2008-04-24 15:03 ` [patch 09/37] Fix sched-devel text_poke Mathieu Desnoyers
                   ` (29 subsequent siblings)
  37 siblings, 0 replies; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-04-24 15:03 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel
  Cc: Mathieu Desnoyers, Ananth N Mavinakayanahalli, Masami Hiramatsu,
	hch, anil.s.keshavamurthy, davem

[-- Attachment #1: kprobes-declare-kprobes-mutex-static.patch --]
[-- Type: text/plain, Size: 1254 bytes --]

Since kprobe_mutex will not be used by any other kernel object, it makes
sense to declare it static.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
CC: hch@infradead.org
CC: anil.s.keshavamurthy@intel.com
CC: davem@davemloft.net
---
 kernel/kprobes.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6-lttng/kernel/kprobes.c
===================================================================
--- linux-2.6-lttng.orig/kernel/kprobes.c	2007-08-19 09:09:15.000000000 -0400
+++ linux-2.6-lttng/kernel/kprobes.c	2007-08-19 17:18:07.000000000 -0400
@@ -68,7 +68,7 @@ static struct hlist_head kretprobe_inst_
 /* NOTE: change this value only with kprobe_mutex held */
 static bool kprobe_enabled;
 
-DEFINE_MUTEX(kprobe_mutex);		/* Protects kprobe_table */
+static DEFINE_MUTEX(kprobe_mutex);	/* Protects kprobe_table */
 DEFINE_SPINLOCK(kretprobe_lock);	/* Protects kretprobe_inst_table */
 static DEFINE_PER_CPU(struct kprobe *, kprobe_instance) = NULL;
 

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* [patch 09/37] Fix sched-devel text_poke
  2008-04-24 15:03 [patch 00/37] Linux Kernel Markers instrumentation for sched-devel.git Mathieu Desnoyers
                   ` (7 preceding siblings ...)
  2008-04-24 15:03 ` [patch 08/37] Kprobes - declare kprobe_mutex static Mathieu Desnoyers
@ 2008-04-24 15:03 ` Mathieu Desnoyers
  2008-04-24 15:03 ` [patch 10/37] Text Edit Lock - Architecture Independent Code Mathieu Desnoyers
                   ` (28 subsequent siblings)
  37 siblings, 0 replies; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-04-24 15:03 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel; +Cc: Mathieu Desnoyers

[-- Attachment #1: fix-sched-devel-text-poke.patch --]
[-- Type: text/plain, Size: 2338 bytes --]

Use core_kernel_text() instead of kernel_text_address(), and write to module
text through the same vmap-based path used for the core kernel.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
---
 arch/x86/kernel/alternative.c |   38 ++++++++++++++++++--------------------
 1 file changed, 18 insertions(+), 20 deletions(-)

Index: linux-2.6-sched-devel/arch/x86/kernel/alternative.c
===================================================================
--- linux-2.6-sched-devel.orig/arch/x86/kernel/alternative.c	2008-04-22 20:15:41.000000000 -0400
+++ linux-2.6-sched-devel/arch/x86/kernel/alternative.c	2008-04-22 20:16:22.000000000 -0400
@@ -511,31 +511,29 @@ void *__kprobes text_poke(void *addr, co
 	unsigned long flags;
 	char *vaddr;
 	int nr_pages = 2;
+	struct page *pages[2];
+	int i;
 
-	BUG_ON(len > sizeof(long));
-	BUG_ON((((long)addr + len - 1) & ~(sizeof(long) - 1))
-		- ((long)addr & ~(sizeof(long) - 1)));
-	if (kernel_text_address((unsigned long)addr)) {
-		struct page *pages[2] = { virt_to_page(addr),
-			virt_to_page(addr + PAGE_SIZE) };
-		if (!pages[1])
-			nr_pages = 1;
-		vaddr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
-		BUG_ON(!vaddr);
-		local_irq_save(flags);
-		memcpy(&vaddr[(unsigned long)addr & ~PAGE_MASK], opcode, len);
-		local_irq_restore(flags);
-		vunmap(vaddr);
+	if (!core_kernel_text((unsigned long)addr)) {
+		pages[0] = vmalloc_to_page(addr);
+		pages[1] = vmalloc_to_page(addr + PAGE_SIZE);
 	} else {
-		/*
-		 * modules are in vmalloc'ed memory, always writable.
-		 */
-		local_irq_save(flags);
-		memcpy(addr, opcode, len);
-		local_irq_restore(flags);
+		pages[0] = virt_to_page(addr);
+		pages[1] = virt_to_page(addr + PAGE_SIZE);
 	}
+	BUG_ON(!pages[0]);
+	if (!pages[1])
+		nr_pages = 1;
+	vaddr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
+	BUG_ON(!vaddr);
+	local_irq_save(flags);
+	memcpy(&vaddr[(unsigned long)addr & ~PAGE_MASK], opcode, len);
+	local_irq_restore(flags);
+	vunmap(vaddr);
 	sync_core();
 	/* Could also do a CLFLUSH here to speed up CPU recovery; but
 	   that causes hangs on some VIA CPUs. */
+	for (i = 0; i < len; i++)
+		BUG_ON(((char *)addr)[i] != ((char *)opcode)[i]);
 	return addr;
 }

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [patch 10/37] Text Edit Lock - Architecture Independent Code
  2008-04-24 15:03 [patch 00/37] Linux Kernel Markers instrumentation for sched-devel.git Mathieu Desnoyers
                   ` (8 preceding siblings ...)
  2008-04-24 15:03 ` [patch 09/37] Fix sched-devel text_poke Mathieu Desnoyers
@ 2008-04-24 15:03 ` Mathieu Desnoyers
  2008-04-24 15:03 ` [patch 11/37] Text Edit Lock - kprobes architecture independent support Mathieu Desnoyers
                   ` (27 subsequent siblings)
  37 siblings, 0 replies; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-04-24 15:03 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel; +Cc: Mathieu Desnoyers, Andi Kleen

[-- Attachment #1: text-edit-lock-architecture-independent-code.patch --]
[-- Type: text/plain, Size: 3206 bytes --]

This is an architecture-independent synchronization around kernel text
modifications, implemented through a global mutex.

A mutex has been chosen so that kprobes, the main user of this, can sleep during
memory allocation between the memory read of the instructions it must replace
and the memory write of the breakpoint.

Another user of this interface: immediate values.

Paravirt and alternatives are always done when SMP is inactive, so there is no
need to use locks.
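
A minimal usage sketch (illustration only, not part of this patch; the address,
opcode and length variables are hypothetical) of how a code-patching site is
expected to take the lock:

	kernel_text_lock();
	/* may sleep here, e.g. to allocate memory for the replacement opcode */
	text_poke(addr, new_opcode, len);
	kernel_text_unlock();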

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Andi Kleen <andi@firstfloor.org>
CC: Ingo Molnar <mingo@elte.hu>
---
 include/linux/memory.h |    7 +++++++
 mm/memory.c            |   34 ++++++++++++++++++++++++++++++++++
 2 files changed, 41 insertions(+)

Index: linux-2.6-sched-devel/include/linux/memory.h
===================================================================
--- linux-2.6-sched-devel.orig/include/linux/memory.h	2008-04-22 20:04:13.000000000 -0400
+++ linux-2.6-sched-devel/include/linux/memory.h	2008-04-22 20:19:19.000000000 -0400
@@ -92,4 +92,11 @@ extern int memory_notify(unsigned long v
 #define hotplug_memory_notifier(fn, pri) do { } while (0)
 #endif
 
+/*
+ * Take and release the kernel text modification lock, used for code patching.
+ * Users of this lock can sleep.
+ */
+extern void kernel_text_lock(void);
+extern void kernel_text_unlock(void);
+
 #endif /* _LINUX_MEMORY_H_ */
Index: linux-2.6-sched-devel/mm/memory.c
===================================================================
--- linux-2.6-sched-devel.orig/mm/memory.c	2008-04-19 17:41:25.000000000 -0400
+++ linux-2.6-sched-devel/mm/memory.c	2008-04-22 20:19:19.000000000 -0400
@@ -51,6 +51,8 @@
 #include <linux/init.h>
 #include <linux/writeback.h>
 #include <linux/memcontrol.h>
+#include <linux/kprobes.h>
+#include <linux/mutex.h>
 
 #include <asm/pgalloc.h>
 #include <asm/uaccess.h>
@@ -96,6 +98,12 @@ int randomize_va_space __read_mostly =
 					2;
 #endif
 
+/*
+ * mutex protecting text section modification (dynamic code patching).
+ * some users need to sleep (allocating memory...) while they hold this lock.
+ */
+static DEFINE_MUTEX(text_mutex);
+
 static int __init disable_randmaps(char *s)
 {
 	randomize_va_space = 0;
@@ -2737,3 +2745,29 @@ void print_vma_addr(char *prefix, unsign
 	}
 	up_read(&current->mm->mmap_sem);
 }
+
+/**
+ * kernel_text_lock     -   Take the kernel text modification lock
+ *
+ * Insures mutual write exclusion of kernel and modules text live text
+ * modification. Should be used for code patching.
+ * Users of this lock can sleep.
+ */
+void __kprobes kernel_text_lock(void)
+{
+	mutex_lock(&text_mutex);
+}
+EXPORT_SYMBOL_GPL(kernel_text_lock);
+
+/**
+ * kernel_text_unlock   -   Release the kernel text modification lock
+ *
+ * Insures mutual write exclusion of kernel and modules text live text
+ * modification. Should be used for code patching.
+ * Users of this lock can sleep.
+ */
+void __kprobes kernel_text_unlock(void)
+{
+	mutex_unlock(&text_mutex);
+}
+EXPORT_SYMBOL_GPL(kernel_text_unlock);

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [patch 11/37] Text Edit Lock - kprobes architecture independent support
  2008-04-24 15:03 [patch 00/37] Linux Kernel Markers instrumentation for sched-devel.git Mathieu Desnoyers
                   ` (9 preceding siblings ...)
  2008-04-24 15:03 ` [patch 10/37] Text Edit Lock - Architecture Independent Code Mathieu Desnoyers
@ 2008-04-24 15:03 ` Mathieu Desnoyers
  2008-04-24 15:03 ` [patch 12/37] Add all cpus option to stop machine run Mathieu Desnoyers
                   ` (26 subsequent siblings)
  37 siblings, 0 replies; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-04-24 15:03 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel
  Cc: Mathieu Desnoyers, Ananth N Mavinakayanahalli,
	anil.s.keshavamurthy, davem, Roel Kluin

[-- Attachment #1: text-edit-lock-kprobes-architecture-independent-support.patch --]
[-- Type: text/plain, Size: 3077 bytes --]

Use the mutual exclusion provided by the text edit lock in the kprobes code.
This lets other subsystems manipulate the kernel code coherently.

Changelog:

Move the kernel_text_lock/unlock out of the for loops.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
CC: ananth@in.ibm.com
CC: anil.s.keshavamurthy@intel.com
CC: davem@davemloft.net
CC: Roel Kluin <12o3l@tiscali.nl>
---
 kernel/kprobes.c |   19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

Index: linux-2.6-lttng/kernel/kprobes.c
===================================================================
--- linux-2.6-lttng.orig/kernel/kprobes.c	2008-04-09 10:52:51.000000000 -0400
+++ linux-2.6-lttng/kernel/kprobes.c	2008-04-09 10:52:57.000000000 -0400
@@ -43,6 +43,7 @@
 #include <linux/seq_file.h>
 #include <linux/debugfs.h>
 #include <linux/kdebug.h>
+#include <linux/memory.h>
 
 #include <asm-generic/sections.h>
 #include <asm/cacheflush.h>
@@ -577,9 +578,10 @@ static int __kprobes __register_kprobe(s
 		goto out;
 	}
 
+	kernel_text_lock();
 	ret = arch_prepare_kprobe(p);
 	if (ret)
-		goto out;
+		goto out_unlock_text;
 
 	INIT_HLIST_NODE(&p->hlist);
 	hlist_add_head_rcu(&p->hlist,
@@ -587,7 +589,8 @@ static int __kprobes __register_kprobe(s
 
 	if (kprobe_enabled)
 		arch_arm_kprobe(p);
-
+out_unlock_text:
+	kernel_text_unlock();
 out:
 	mutex_unlock(&kprobe_mutex);
 
@@ -630,8 +633,11 @@ valid_p:
 		 * enabled - otherwise, the breakpoint would already have
 		 * been removed. We save on flushing icache.
 		 */
-		if (kprobe_enabled)
+		if (kprobe_enabled) {
+			kernel_text_lock();
 			arch_disarm_kprobe(p);
+			kernel_text_unlock();
+		}
 		hlist_del_rcu(&old_p->hlist);
 		cleanup_p = 1;
 	} else {
@@ -729,7 +735,6 @@ static int __kprobes pre_handler_kretpro
 		}
 
 		arch_prepare_kretprobe(ri, regs);
-
 		/* XXX(hch): why is there no hlist_move_head? */
 		hlist_del(&ri->uflist);
 		hlist_add_head(&ri->uflist, &ri->rp->used_instances);
@@ -951,11 +956,13 @@ static void __kprobes enable_all_kprobes
 	if (kprobe_enabled)
 		goto already_enabled;
 
+	kernel_text_lock();
 	for (i = 0; i < KPROBE_TABLE_SIZE; i++) {
 		head = &kprobe_table[i];
 		hlist_for_each_entry_rcu(p, node, head, hlist)
 			arch_arm_kprobe(p);
 	}
+	kernel_text_unlock();
 
 	kprobe_enabled = true;
 	printk(KERN_INFO "Kprobes globally enabled\n");
@@ -980,6 +987,7 @@ static void __kprobes disable_all_kprobe
 
 	kprobe_enabled = false;
 	printk(KERN_INFO "Kprobes globally disabled\n");
+	kernel_text_lock();
 	for (i = 0; i < KPROBE_TABLE_SIZE; i++) {
 		head = &kprobe_table[i];
 		hlist_for_each_entry_rcu(p, node, head, hlist) {
@@ -987,6 +995,7 @@ static void __kprobes disable_all_kprobe
 				arch_disarm_kprobe(p);
 		}
 	}
+	kernel_text_unlock();
 
 	mutex_unlock(&kprobe_mutex);
 	/* Allow all currently running kprobes to complete */

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [patch 12/37] Add all cpus option to stop machine run
  2008-04-24 15:03 [patch 00/37] Linux Kernel Markers instrumentation for sched-devel.git Mathieu Desnoyers
                   ` (10 preceding siblings ...)
  2008-04-24 15:03 ` [patch 11/37] Text Edit Lock - kprobes architecture independent support Mathieu Desnoyers
@ 2008-04-24 15:03 ` Mathieu Desnoyers
  2008-04-24 15:03 ` [patch 13/37] Immediate Values - Architecture Independent Code Mathieu Desnoyers
                   ` (25 subsequent siblings)
  37 siblings, 0 replies; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-04-24 15:03 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel
  Cc: Jason Baron, Mathieu Desnoyers, Rusty Russell, Adrian Bunk,
	Andi Kleen, Christoph Hellwig, akpm

[-- Attachment #1: add-all-cpus-option-to-stop-machine-run.patch --]
[-- Type: text/plain, Size: 4371 bytes --]

- allow stop_machine_run() to call a function on all cpus. Calling
  stop_machine_run() with 'ALL_CPUS' as the cpu argument invokes this new
  behavior: stop_machine_run() proceeds as normal until the calling cpu has
  invoked 'fn', and then all the other cpus are told to call 'fn'.
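
An illustrative sketch (function and variable names are made up, not from this
patch) of running a callback on every cpu with interrupts disabled:

	static int do_patch_step(void *data)
	{
		/* runs with irqs disabled; the calling cpu runs it first,
		 * then the remaining cpus run it concurrently */
		return 0;
	}

	int ret = stop_machine_run(do_patch_step, NULL, ALL_CPUS);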

Signed-off-by: Jason Baron <jbaron@redhat.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Rusty Russell <rusty@rustcorp.com.au>
CC: Adrian Bunk <bunk@stusta.de>
CC: Andi Kleen <andi@firstfloor.org>
CC: Christoph Hellwig <hch@infradead.org>
CC: mingo@elte.hu
CC: akpm@osdl.org
---

 include/linux/stop_machine.h |    8 +++++++-
 kernel/stop_machine.c        |   32 +++++++++++++++++++++++++-------
 2 files changed, 32 insertions(+), 8 deletions(-)


Index: linux-2.6-sched-devel/include/linux/stop_machine.h
===================================================================
--- linux-2.6-sched-devel.orig/include/linux/stop_machine.h	2008-04-19 17:41:24.000000000 -0400
+++ linux-2.6-sched-devel/include/linux/stop_machine.h	2008-04-22 20:19:41.000000000 -0400
@@ -8,11 +8,17 @@
 #include <asm/system.h>
 
 #if defined(CONFIG_STOP_MACHINE) && defined(CONFIG_SMP)
+
+#define ALL_CPUS ~0U
+
 /**
  * stop_machine_run: freeze the machine on all CPUs and run this function
  * @fn: the function to run
  * @data: the data ptr for the @fn()
- * @cpu: the cpu to run @fn() on (or any, if @cpu == NR_CPUS.
+ * @cpu: if @cpu == n, run @fn() on cpu n
+ *       if @cpu == NR_CPUS, run @fn() on any cpu
+ *       if @cpu == ALL_CPUS, run @fn() first on the calling cpu, and then
+ *       concurrently on all the other cpus
  *
  * Description: This causes a thread to be scheduled on every other cpu,
  * each of which disables interrupts, and finally interrupts are disabled
Index: linux-2.6-sched-devel/kernel/stop_machine.c
===================================================================
--- linux-2.6-sched-devel.orig/kernel/stop_machine.c	2008-04-22 20:04:13.000000000 -0400
+++ linux-2.6-sched-devel/kernel/stop_machine.c	2008-04-22 20:21:38.000000000 -0400
@@ -22,9 +22,17 @@ enum stopmachine_state {
 	STOPMACHINE_WAIT,
 	STOPMACHINE_PREPARE,
 	STOPMACHINE_DISABLE_IRQ,
+	STOPMACHINE_RUN,
 	STOPMACHINE_EXIT,
 };
 
+struct stop_machine_data {
+	int (*fn)(void *);
+	void *data;
+	struct completion done;
+	int run_all;
+} smdata;
+
 static enum stopmachine_state stopmachine_state;
 static unsigned int stopmachine_num_threads;
 static atomic_t stopmachine_thread_ack;
@@ -33,6 +41,7 @@ static int stopmachine(void *cpu)
 {
 	int irqs_disabled = 0;
 	int prepared = 0;
+	int ran = 0;
 
 	set_cpus_allowed_ptr(current, &cpumask_of_cpu((int)(long)cpu));
 
@@ -57,6 +66,11 @@ static int stopmachine(void *cpu)
 			prepared = 1;
 			smp_mb(); /* Must read state first. */
 			atomic_inc(&stopmachine_thread_ack);
+		} else if (stopmachine_state == STOPMACHINE_RUN && !ran) {
+			smdata.fn(smdata.data);
+			ran = 1;
+			smp_mb(); /* Must read state first. */
+			atomic_inc(&stopmachine_thread_ack);
 		}
 		/* Yield in first stage: migration threads need to
 		 * help our sisters onto their CPUs. */
@@ -134,11 +148,10 @@ static void restart_machine(void)
 	preempt_enable_no_resched();
 }
 
-struct stop_machine_data {
-	int (*fn)(void *);
-	void *data;
-	struct completion done;
-};
+static void run_other_cpus(void)
+{
+	stopmachine_set_state(STOPMACHINE_RUN);
+}
 
 static int do_stop(void *_smdata)
 {
@@ -148,6 +161,8 @@ static int do_stop(void *_smdata)
 	ret = stop_machine();
 	if (ret == 0) {
 		ret = smdata->fn(smdata->data);
+		if (smdata->run_all)
+			run_other_cpus();
 		restart_machine();
 	}
 
@@ -171,14 +186,17 @@ struct task_struct *__stop_machine_run(i
 	struct stop_machine_data smdata;
 	struct task_struct *p;
 
+	mutex_lock(&stopmachine_mutex);
+
 	smdata.fn = fn;
 	smdata.data = data;
+	smdata.run_all = (cpu == ALL_CPUS) ? 1 : 0;
 	init_completion(&smdata.done);
 
-	mutex_lock(&stopmachine_mutex);
+	smp_wmb(); /* make sure other cpus see smdata updates */
 
 	/* If they don't care which CPU fn runs on, bind to any online one. */
-	if (cpu == NR_CPUS)
+	if (cpu == NR_CPUS || cpu == ALL_CPUS)
 		cpu = raw_smp_processor_id();
 
 	p = kthread_create(do_stop, &smdata, "kstopmachine");

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [patch 13/37] Immediate Values - Architecture Independent Code
  2008-04-24 15:03 [patch 00/37] Linux Kernel Markers instrumentation for sched-devel.git Mathieu Desnoyers
                   ` (11 preceding siblings ...)
  2008-04-24 15:03 ` [patch 12/37] Add all cpus option to stop machine run Mathieu Desnoyers
@ 2008-04-24 15:03 ` Mathieu Desnoyers
  2008-04-25 14:55   ` Ingo Molnar
  2008-04-24 15:03 ` [patch 14/37] Immediate Values - Kconfig menu in EMBEDDED Mathieu Desnoyers
                   ` (24 subsequent siblings)
  37 siblings, 1 reply; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-04-24 15:03 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel
  Cc: Mathieu Desnoyers, Jason Baron, Rusty Russell, Adrian Bunk,
	Andi Kleen, Christoph Hellwig, akpm

[-- Attachment #1: immediate-values-architecture-independent-code.patch --]
[-- Type: text/plain, Size: 18625 bytes --]

Immediate values are used as read-mostly variables that are rarely updated.
They use code patching to modify the values inscribed in the instruction
stream. This provides a way to save precious cache lines that would otherwise
have to be used by these variables.

There is a generic _imv_read() version, which uses standard global variables,
and optimized per-architecture imv_read() implementations, which use a load
immediate to remove a data cache hit. When the immediate values functionality
is disabled in the kernel, it falls back to global variables.

It adds a new rodata section, "__imv", holding the pointers to the enable
variable and to the instruction operand. The immediate value activation
functions sit in kernel/immediate.c.

Immediate values refer to the memory address of a previously declared integer.
This integer holds the state of the associated immediate values and must be
accessed through the API found in linux/immediate.h.

At module load time, each immediate value is checked to see if it must be
enabled. This is the case if the variable it refers to is exported from
another module and already enabled.

In the early stages of start_kernel(), the immediate values are updated to
reflect the state of the variable they refer to.
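
A minimal usage sketch of the API introduced below (the "tracing_on" variable
and the slow path function are made up for illustration):

	DEFINE_IMV(char, tracing_on) = 0;
	EXPORT_IMV_SYMBOL_GPL(tracing_on);

	void some_hot_path(void)
	{
		if (unlikely(imv_read(tracing_on)))
			trace_the_event();	/* hypothetical slow path */
	}

	/* rare update path: patches every imv_read(tracing_on) site */
	imv_set(tracing_on, 1);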

* Why should this be merged *

It improves performance under heavy memory I/O workloads.

An interesting result shows the potential of this infrastructure: the slowdown
a simple system call such as getppid() suffers when it runs under heavy
user-space cache thrashing:

Random walk L1 and L2 thrashing surrounding a getppid() call:
(note: in this test, do_syscall_trace was taken at each system call, see
Documentation/immediate.txt in these patches for details)
- No memory pressure :   getppid() takes  1573 cycles
- With memory pressure : getppid() takes 15589 cycles

We therefore have a 10x slowdown just to get the kernel variables from memory.
Another test on the same architecture (Intel P4) measured the memory latency to
be 559 cycles. Therefore, each cache line removed from the hot path would
improve the syscall time by roughly 3.5% (559 / 15589 cycles) in these
conditions.

Changelog:

- section __imv is already SHF_ALLOC
- Because of the wonders of ELF, section 0 has sh_addr and sh_size 0.  So
  the if (immediateindex) is unnecessary here.
- Remove module_mutex usage: depend on functions implemented in module.c for
  that.
- Does not update tainted module's immediate values.
- remove imv_*_t types, add DECLARE_IMV() and DEFINE_IMV().
  - imv_read(&var) becomes imv_read(var) because of this.
- Adding a new EXPORT_IMV_SYMBOL(_GPL).
- remove imv_if(). Should use if (unlikely(imv_read(var))) instead.
  - Wait until we have gcc support before we add the imv_if macro, since
    its form may have to change.
- Don't declare the __imv section in vmlinux.lds.h, just put the content
  in the rodata section.
- Simplify interface : remove imv_set_early, keep track of kernel boot
  status internally.
- Remove the ALIGN(8) before the __imv section. It is packed now.
- Uses an IPI busy-loop on each CPU with interrupts disabled as a simple,
  architecture agnostic, update mechanism.
- Use imv_* instead of immediate_*.
- Updating immediate values, cannot rely on smp_call_function() b/c
  synchronizing cpus using IPIs leads to deadlocks. Process A held a read lock
  on tasklist_lock, then process B called apply_imv_update(). Process A received
  the IPI and begins executing ipi_busy_loop(). Then process C takes a write
  lock irq on the task list lock, before receiving the IPI. Thus, process A
  holds up process C, and C can't get an IPI b/c interrupts are disabled. Solve
  this problem by using a new 'ALL_CPUS' parameter to stop_machine_run(). Which
  runs a function on all cpus after they are busy looping and have disabled
  irqs. Since this is done in a new process context, we don't have to worry
  about interrupted spin_locks. Also, less lines of code. Has survived 24 hours+
  of testing...

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Signed-off-by: Jason Baron <jbaron@redhat.com>
CC: Rusty Russell <rusty@rustcorp.com.au>
CC: Adrian Bunk <bunk@stusta.de>
CC: Andi Kleen <andi@firstfloor.org>
CC: Christoph Hellwig <hch@infradead.org>
CC: mingo@elte.hu
CC: akpm@osdl.org
---
 include/asm-generic/vmlinux.lds.h |    3 
 include/linux/immediate.h         |   94 +++++++++++++++++++++++
 include/linux/module.h            |   16 ++++
 init/main.c                       |    8 ++
 kernel/Makefile                   |    1 
 kernel/immediate.c                |  149 ++++++++++++++++++++++++++++++++++++++
 kernel/module.c                   |   50 ++++++++++++
 7 files changed, 320 insertions(+), 1 deletion(-)

Index: linux-2.6-sched-devel/include/linux/immediate.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-sched-devel/include/linux/immediate.h	2008-04-24 09:16:54.000000000 -0400
@@ -0,0 +1,94 @@
+#ifndef _LINUX_IMMEDIATE_H
+#define _LINUX_IMMEDIATE_H
+
+/*
+ * Immediate values, can be updated at runtime and save cache lines.
+ *
+ * (C) Copyright 2007 Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
+ *
+ * This file is released under the GPLv2.
+ * See the file COPYING for more details.
+ */
+
+#ifdef CONFIG_IMMEDIATE
+
+struct __imv {
+	unsigned long var;	/* Pointer to the identifier variable of the
+				 * immediate value
+				 */
+	unsigned long imv;	/*
+				 * Pointer to the memory location of the
+				 * immediate value within the instruction.
+				 */
+	unsigned char size;	/* Type size. */
+} __attribute__ ((packed));
+
+#include <asm/immediate.h>
+
+/**
+ * imv_set - set immediate variable (with locking)
+ * @name: immediate value name
+ * @i: required value
+ *
+ * Sets the value of @name, taking the module_mutex if required by
+ * the architecture.
+ */
+#define imv_set(name, i)						\
+	do {								\
+		name##__imv = (i);					\
+		core_imv_update();					\
+		module_imv_update();					\
+	} while (0)
+
+/*
+ * Internal update functions.
+ */
+extern void core_imv_update(void);
+extern void imv_update_range(const struct __imv *begin,
+	const struct __imv *end);
+
+#else
+
+/*
+ * Generic immediate values: a simple, standard, memory load.
+ */
+
+/**
+ * imv_read - read immediate variable
+ * @name: immediate value name
+ *
+ * Reads the value of @name.
+ */
+#define imv_read(name)			_imv_read(name)
+
+/**
+ * imv_set - set immediate variable (with locking)
+ * @name: immediate value name
+ * @i: required value
+ *
+ * Sets the value of @name, taking the module_mutex if required by
+ * the architecture.
+ */
+#define imv_set(name, i)		(name##__imv = (i))
+
+static inline void core_imv_update(void) { }
+static inline void module_imv_update(void) { }
+
+#endif
+
+#define DECLARE_IMV(type, name) extern __typeof__(type) name##__imv
+#define DEFINE_IMV(type, name)  __typeof__(type) name##__imv
+
+#define EXPORT_IMV_SYMBOL(name) EXPORT_SYMBOL(name##__imv)
+#define EXPORT_IMV_SYMBOL_GPL(name) EXPORT_SYMBOL_GPL(name##__imv)
+
+/**
+ * _imv_read - Read immediate value with standard memory load.
+ * @name: immediate value name
+ *
+ * Force a data read of the immediate value instead of the immediate value
+ * based mechanism. Useful for __init and __exit section data read.
+ */
+#define _imv_read(name)		(name##__imv)
+
+#endif
Index: linux-2.6-sched-devel/include/linux/module.h
===================================================================
--- linux-2.6-sched-devel.orig/include/linux/module.h	2008-04-24 08:59:30.000000000 -0400
+++ linux-2.6-sched-devel/include/linux/module.h	2008-04-24 09:16:54.000000000 -0400
@@ -15,6 +15,7 @@
 #include <linux/stringify.h>
 #include <linux/kobject.h>
 #include <linux/moduleparam.h>
+#include <linux/immediate.h>
 #include <linux/marker.h>
 #include <asm/local.h>
 
@@ -355,6 +356,10 @@ struct module
 	/* The command line arguments (may be mangled).  People like
 	   keeping pointers to this stuff */
 	char *args;
+#ifdef CONFIG_IMMEDIATE
+	const struct __imv *immediate;
+	unsigned int num_immediate;
+#endif
 #ifdef CONFIG_MARKERS
 	struct marker *markers;
 	unsigned int num_markers;
@@ -467,6 +472,9 @@ extern void print_modules(void);
 
 extern void module_update_markers(void);
 
+extern void _module_imv_update(void);
+extern void module_imv_update(void);
+
 #else /* !CONFIG_MODULES... */
 #define EXPORT_SYMBOL(sym)
 #define EXPORT_SYMBOL_GPL(sym)
@@ -571,6 +579,14 @@ static inline void module_update_markers
 {
 }
 
+static inline void _module_imv_update(void)
+{
+}
+
+static inline void module_imv_update(void)
+{
+}
+
 #endif /* CONFIG_MODULES */
 
 struct device_driver;
Index: linux-2.6-sched-devel/kernel/module.c
===================================================================
--- linux-2.6-sched-devel.orig/kernel/module.c	2008-04-24 08:59:32.000000000 -0400
+++ linux-2.6-sched-devel/kernel/module.c	2008-04-24 09:17:25.000000000 -0400
@@ -33,6 +33,7 @@
 #include <linux/cpu.h>
 #include <linux/moduleparam.h>
 #include <linux/errno.h>
+#include <linux/immediate.h>
 #include <linux/err.h>
 #include <linux/vermagic.h>
 #include <linux/notifier.h>
@@ -1715,6 +1716,7 @@ static struct module *load_module(void _
 	unsigned int unusedcrcindex;
 	unsigned int unusedgplindex;
 	unsigned int unusedgplcrcindex;
+	unsigned int immediateindex;
 	unsigned int markersindex;
 	unsigned int markersstringsindex;
 	struct module *mod;
@@ -1813,6 +1815,7 @@ static struct module *load_module(void _
 #ifdef ARCH_UNWIND_SECTION_NAME
 	unwindex = find_sec(hdr, sechdrs, secstrings, ARCH_UNWIND_SECTION_NAME);
 #endif
+	immediateindex = find_sec(hdr, sechdrs, secstrings, "__imv");
 
 	/* Don't keep modinfo section */
 	sechdrs[infoindex].sh_flags &= ~(unsigned long)SHF_ALLOC;
@@ -1971,6 +1974,11 @@ static struct module *load_module(void _
 	mod->gpl_future_syms = (void *)sechdrs[gplfutureindex].sh_addr;
 	if (gplfuturecrcindex)
 		mod->gpl_future_crcs = (void *)sechdrs[gplfuturecrcindex].sh_addr;
+#ifdef CONFIG_IMMEDIATE
+	mod->immediate = (void *)sechdrs[immediateindex].sh_addr;
+	mod->num_immediate =
+		sechdrs[immediateindex].sh_size / sizeof(*mod->immediate);
+#endif
 
 	mod->unused_syms = (void *)sechdrs[unusedindex].sh_addr;
 	if (unusedcrcindex)
@@ -2038,11 +2046,16 @@ static struct module *load_module(void _
 
 	add_kallsyms(mod, sechdrs, symindex, strindex, secstrings);
 
+	if (!(mod->taints & TAINT_FORCED_MODULE)) {
 #ifdef CONFIG_MARKERS
-	if (!mod->taints)
 		marker_update_probe_range(mod->markers,
 			mod->markers + mod->num_markers);
 #endif
+#ifdef CONFIG_IMMEDIATE
+		imv_update_range(mod->immediate,
+			mod->immediate + mod->num_immediate);
+#endif
+}
 	err = module_finalize(hdr, sechdrs, mod);
 	if (err < 0)
 		goto cleanup;
@@ -2588,3 +2601,38 @@ void module_update_markers(void)
 	mutex_unlock(&module_mutex);
 }
 #endif
+
+#ifdef CONFIG_IMMEDIATE
+/**
+ * _module_imv_update - update all immediate values in the kernel
+ *
+ * Iterate on the kernel core and modules to update the immediate values.
+ * Module_mutex must be held be the caller.
+ */
+void _module_imv_update(void)
+{
+	struct module *mod;
+
+	list_for_each_entry(mod, &modules, list) {
+		if (mod->taints)
+			continue;
+		imv_update_range(mod->immediate,
+			mod->immediate + mod->num_immediate);
+	}
+}
+EXPORT_SYMBOL_GPL(_module_imv_update);
+
+/**
+ * module_imv_update - update all immediate values in the kernel
+ *
+ * Iterate on the kernel core and modules to update the immediate values.
+ * Takes module_mutex.
+ */
+void module_imv_update(void)
+{
+	mutex_lock(&module_mutex);
+	_module_imv_update();
+	mutex_unlock(&module_mutex);
+}
+EXPORT_SYMBOL_GPL(module_imv_update);
+#endif
Index: linux-2.6-sched-devel/kernel/immediate.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-sched-devel/kernel/immediate.c	2008-04-24 09:16:54.000000000 -0400
@@ -0,0 +1,149 @@
+/*
+ * Copyright (C) 2007 Mathieu Desnoyers
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ */
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/immediate.h>
+#include <linux/memory.h>
+#include <linux/cpu.h>
+#include <linux/stop_machine.h>
+
+#include <asm/cacheflush.h>
+
+/*
+ * Kernel ready to execute the SMP update that may depend on trap and ipi.
+ */
+static int imv_early_boot_complete;
+static int wrote_text;
+
+extern const struct __imv __start___imv[];
+extern const struct __imv __stop___imv[];
+
+static int stop_machine_imv_update(void *imv_ptr)
+{
+	struct __imv *imv = imv_ptr;
+
+	if (!wrote_text) {
+		text_poke((void *)imv->imv, (void *)imv->var, imv->size);
+		wrote_text = 1;
+		smp_wmb(); /* make sure other cpus see that this has run */
+	} else
+		sync_core();
+
+	flush_icache_range(imv->imv, imv->imv + imv->size);
+
+	return 0;
+}
+
+/*
+ * imv_mutex nests inside module_mutex. imv_mutex protects builtin
+ * immediates and module immediates.
+ */
+static DEFINE_MUTEX(imv_mutex);
+
+
+/**
+ * apply_imv_update - update one immediate value
+ * @imv: pointer of type const struct __imv to update
+ *
+ * Update one immediate value. Must be called with imv_mutex held.
+ * It makes sure all CPUs are not executing the modified code by having them
+ * busy looping with interrupts disabled.
+ * It does _not_ protect against NMI and MCE (could be a problem with Intel's
+ * errata if we use immediate values in their code path).
+ */
+static int apply_imv_update(const struct __imv *imv)
+{
+	/*
+	 * If the variable and the instruction have the same value, there is
+	 * nothing to do.
+	 */
+	switch (imv->size) {
+	case 1:	if (*(uint8_t *)imv->imv
+				== *(uint8_t *)imv->var)
+			return 0;
+		break;
+	case 2:	if (*(uint16_t *)imv->imv
+				== *(uint16_t *)imv->var)
+			return 0;
+		break;
+	case 4:	if (*(uint32_t *)imv->imv
+				== *(uint32_t *)imv->var)
+			return 0;
+		break;
+	case 8:	if (*(uint64_t *)imv->imv
+				== *(uint64_t *)imv->var)
+			return 0;
+		break;
+	default:return -EINVAL;
+	}
+
+	if (imv_early_boot_complete) {
+		kernel_text_lock();
+		wrote_text = 0;
+		stop_machine_run(stop_machine_imv_update, (void *)imv,
+					ALL_CPUS);
+		kernel_text_unlock();
+	} else
+		text_poke_early((void *)imv->imv, (void *)imv->var,
+				imv->size);
+	return 0;
+}
+
+/**
+ * imv_update_range - Update immediate values in a range
+ * @begin: pointer to the beginning of the range
+ * @end: pointer to the end of the range
+ *
+ * Updates a range of immediates.
+ */
+void imv_update_range(const struct __imv *begin,
+		const struct __imv *end)
+{
+	const struct __imv *iter;
+	int ret;
+	for (iter = begin; iter < end; iter++) {
+		mutex_lock(&imv_mutex);
+		ret = apply_imv_update(iter);
+		if (imv_early_boot_complete && ret)
+			printk(KERN_WARNING
+				"Invalid immediate value. "
+				"Variable at %p, "
+				"instruction at %p, size %hu\n",
+				(void *)iter->imv,
+				(void *)iter->var, iter->size);
+		mutex_unlock(&imv_mutex);
+	}
+}
+EXPORT_SYMBOL_GPL(imv_update_range);
+
+/**
+ * imv_update - update all immediate values in the kernel
+ *
+ * Iterate on the kernel core and modules to update the immediate values.
+ */
+void core_imv_update(void)
+{
+	/* Core kernel imvs */
+	imv_update_range(__start___imv, __stop___imv);
+}
+EXPORT_SYMBOL_GPL(core_imv_update);
+
+void __init imv_init_complete(void)
+{
+	imv_early_boot_complete = 1;
+}
Index: linux-2.6-sched-devel/init/main.c
===================================================================
--- linux-2.6-sched-devel.orig/init/main.c	2008-04-24 08:59:30.000000000 -0400
+++ linux-2.6-sched-devel/init/main.c	2008-04-24 09:16:54.000000000 -0400
@@ -60,6 +60,7 @@
 #include <linux/sched.h>
 #include <linux/signal.h>
 #include <linux/kmemcheck.h>
+#include <linux/immediate.h>
 
 #include <asm/io.h>
 #include <asm/bugs.h>
@@ -103,6 +104,11 @@ static inline void mark_rodata_ro(void) 
 #ifdef CONFIG_TC
 extern void tc_init(void);
 #endif
+#ifdef CONFIG_IMMEDIATE
+extern void imv_init_complete(void);
+#else
+static inline void imv_init_complete(void) { }
+#endif
 
 enum system_states system_state;
 EXPORT_SYMBOL(system_state);
@@ -547,6 +553,7 @@ asmlinkage void __init start_kernel(void
 	boot_init_stack_canary();
 
 	cgroup_init_early();
+	core_imv_update();
 
 	local_irq_disable();
 	early_boot_irqs_off();
@@ -671,6 +678,7 @@ asmlinkage void __init start_kernel(void
 	cpuset_init();
 	taskstats_init_early();
 	delayacct_init();
+	imv_init_complete();
 
 	check_bugs();
 
Index: linux-2.6-sched-devel/kernel/Makefile
===================================================================
--- linux-2.6-sched-devel.orig/kernel/Makefile	2008-04-24 08:59:30.000000000 -0400
+++ linux-2.6-sched-devel/kernel/Makefile	2008-04-24 09:14:39.000000000 -0400
@@ -75,6 +75,7 @@ obj-$(CONFIG_RELAY) += relay.o
 obj-$(CONFIG_SYSCTL) += utsname_sysctl.o
 obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
 obj-$(CONFIG_TASKSTATS) += taskstats.o tsacct.o
+obj-$(CONFIG_IMMEDIATE) += immediate.o
 obj-$(CONFIG_MARKERS) += marker.o
 obj-$(CONFIG_LATENCYTOP) += latencytop.o
 obj-$(CONFIG_FTRACE) += trace/
Index: linux-2.6-sched-devel/include/asm-generic/vmlinux.lds.h
===================================================================
--- linux-2.6-sched-devel.orig/include/asm-generic/vmlinux.lds.h	2008-04-24 08:59:30.000000000 -0400
+++ linux-2.6-sched-devel/include/asm-generic/vmlinux.lds.h	2008-04-24 09:16:54.000000000 -0400
@@ -61,6 +61,9 @@
 		*(.rodata) *(.rodata.*)					\
 		*(__vermagic)		/* Kernel version magic */	\
 		*(__markers_strings)	/* Markers: strings */		\
+		VMLINUX_SYMBOL(__start___imv) = .;			\
+		*(__imv)		/* Immediate values: pointers */ \
+		VMLINUX_SYMBOL(__stop___imv) = .;			\
 	}								\
 									\
 	.rodata1          : AT(ADDR(.rodata1) - LOAD_OFFSET) {		\

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [patch 14/37] Immediate Values - Kconfig menu in EMBEDDED
  2008-04-24 15:03 [patch 00/37] Linux Kernel Markers instrumentation for sched-devel.git Mathieu Desnoyers
                   ` (12 preceding siblings ...)
  2008-04-24 15:03 ` [patch 13/37] Immediate Values - Architecture Independent Code Mathieu Desnoyers
@ 2008-04-24 15:03 ` Mathieu Desnoyers
  2008-04-24 15:03 ` [patch 15/37] Immediate Values - x86 Optimization Mathieu Desnoyers
                   ` (23 subsequent siblings)
  37 siblings, 0 replies; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-04-24 15:03 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel
  Cc: Mathieu Desnoyers, Rusty Russell, Adrian Bunk, Andi Kleen,
	Christoph Hellwig, akpm

[-- Attachment #1: immediate-values-kconfig-menu-in-embedded.patch --]
[-- Type: text/plain, Size: 2494 bytes --]

Immediate values provide a way to use dynamic code patching to update variables
sitting within the instruction stream. This saves cache lines normally used by
static read-mostly variables. Enable the optimization by default, but let users
disable it through the EMBEDDED menu via the "Immediate value optimization"
entry.

Note: Since embedded systems developers using RO memory should have the option
to disable immediate values, I chose to leave this option in the EMBEDDED menu.
Also, CONFIG_IMMEDIATE makes sense because we want to compile out all the
immediate code when optimized immediate values are not used at all (it removes
otherwise unused code).

Changelog:
- Change ARCH_SUPPORTS_IMMEDIATE for HAS_IMMEDIATE
- Turn DISABLE_IMMEDIATE into positive logic

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Rusty Russell <rusty@rustcorp.com.au>
CC: Adrian Bunk <bunk@stusta.de>
CC: Andi Kleen <andi@firstfloor.org>
CC: Christoph Hellwig <hch@infradead.org>
CC: mingo@elte.hu
CC: akpm@osdl.org
---
 init/Kconfig |   18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

Index: linux-2.6-sched-devel/init/Kconfig
===================================================================
--- linux-2.6-sched-devel.orig/init/Kconfig	2008-04-22 20:04:13.000000000 -0400
+++ linux-2.6-sched-devel/init/Kconfig	2008-04-22 20:22:23.000000000 -0400
@@ -765,6 +765,24 @@ config PROC_PAGE_MONITOR
 	  /proc/kpagecount, and /proc/kpageflags. Disabling these
           interfaces will reduce the size of the kernel by approximately 4kb.
 
+config HAVE_IMMEDIATE
+	def_bool n
+
+config IMMEDIATE
+	default y
+	depends on HAVE_IMMEDIATE
+	bool "Immediate value optimization" if EMBEDDED
+	help
+	  Immediate values are used as read-mostly variables that are rarely
+	  updated. They use code patching to modify the values inscribed in the
+	  instruction stream. It provides a way to save precious cache lines
+	  that would otherwise have to be used by these variables. They can be
+	  disabled through the EMBEDDED menu.
+
+	  It consumes slightly more memory and modifies the instruction stream
+	  each time any specially-marked variable is updated. Should really be
+	  disabled for embedded systems with read-only text.
+
 endmenu		# General setup
 
 config SLABINFO

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [patch 15/37] Immediate Values - x86 Optimization
  2008-04-24 15:03 [patch 00/37] Linux Kernel Markers instrumentation for sched-devel.git Mathieu Desnoyers
                   ` (13 preceding siblings ...)
  2008-04-24 15:03 ` [patch 14/37] Immediate Values - Kconfig menu in EMBEDDED Mathieu Desnoyers
@ 2008-04-24 15:03 ` Mathieu Desnoyers
  2008-04-24 15:03 ` [patch 16/37] Add text_poke and sync_core to powerpc Mathieu Desnoyers
                   ` (22 subsequent siblings)
  37 siblings, 0 replies; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-04-24 15:03 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel
  Cc: Mathieu Desnoyers, Andi Kleen, H. Peter Anvin, Chuck Ebbert,
	Christoph Hellwig, Jeremy Fitzhardinge, Thomas Gleixner,
	Ingo Molnar, Rusty Russell, Adrian Bunk, akpm

[-- Attachment #1: immediate-values-x86-optimization.patch --]
[-- Type: text/plain, Size: 4893 bytes --]

x86 optimization of the immediate values, which uses a movl modified through
code patching to set the value that populates the register serving as the
variable source.

Note: a movb needs to get its value from a "=q" constraint.

Quoting "H. Peter Anvin" <hpa@zytor.com>

Using =r for single-byte values is incorrect for 32-bit code -- that would 
permit %spl, %bpl, %sil, %dil which are illegal in 32-bit mode.
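
A small illustration (not from this patch) of the constraint issue for
byte-sized values:

	unsigned char flag;

	/*
	 * "=q" restricts gcc to %al/%bl/%cl/%dl, which have byte encodings in
	 * 32-bit mode; "=r" could pick e.g. %esi, whose low byte (%sil) is
	 * only addressable in 64-bit mode.
	 */
	asm("movb $1, %0" : "=q" (flag));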

Changelog:
- Use text_poke_early with cr0 WP save/restore to patch the bypass. We are doing
  non atomic writes to a code region only touched by us (nobody can execute it
  since we are protected by the imv_mutex).
- Put imv_set and _imv_set in the architecture independent header.
- Use $0 instead of %2 with (0) operand.
- Add x86_64 support, ready for i386+x86_64 -> x86 merge.
- Use asm-x86/asm.h.
- Bugfix : 8 bytes 64 bits immediate value was declared as "4 bytes" in the
  immediate structure.
- Change the immediate.c update code to support variable length opcodes.
- Vastly simplified, using a busy looping IPI with interrupts disabled.
  Does not protect against NMI nor MCE.
- Pack the __imv section. Use smallest types required for size (char).
- Use imv_* instead of immediate_*.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Andi Kleen <ak@muc.de>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Chuck Ebbert <cebbert@redhat.com>
CC: Christoph Hellwig <hch@infradead.org>
CC: Jeremy Fitzhardinge <jeremy@goop.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
CC: Rusty Russell <rusty@rustcorp.com.au>
CC: Adrian Bunk <bunk@stusta.de>
CC: akpm@osdl.org
---
 arch/x86/Kconfig            |    1 
 include/asm-x86/immediate.h |   77 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 78 insertions(+)

Index: linux-2.6-sched-devel/include/asm-x86/immediate.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-sched-devel/include/asm-x86/immediate.h	2008-04-22 20:22:29.000000000 -0400
@@ -0,0 +1,77 @@
+#ifndef _ASM_X86_IMMEDIATE_H
+#define _ASM_X86_IMMEDIATE_H
+
+/*
+ * Immediate values. x86 architecture optimizations.
+ *
+ * (C) Copyright 2006 Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
+ *
+ * This file is released under the GPLv2.
+ * See the file COPYING for more details.
+ */
+
+#include <asm/asm.h>
+
+/**
+ * imv_read - read immediate variable
+ * @name: immediate value name
+ *
+ * Reads the value of @name.
+ * Optimized version of the immediate.
+ * Do not use in __init and __exit functions. Use _imv_read() instead.
+ * If size is bigger than the architecture long size, fall back on a memory
+ * read.
+ *
+ * Make sure to populate the initial static 64 bits opcode with a value
+ * what will generate an instruction with 8 bytes immediate value (not the REX.W
+ * prefixed one that loads a sign extended 32 bits immediate value in a r64
+ * register).
+ */
+#define imv_read(name)							\
+	({								\
+		__typeof__(name##__imv) value;				\
+		BUILD_BUG_ON(sizeof(value) > 8);			\
+		switch (sizeof(value)) {				\
+		case 1:							\
+			asm(".section __imv,\"a\",@progbits\n\t"	\
+				_ASM_PTR "%c1, (3f)-%c2\n\t"		\
+				".byte %c2\n\t"				\
+				".previous\n\t"				\
+				"mov $0,%0\n\t"				\
+				"3:\n\t"				\
+				: "=q" (value)				\
+				: "i" (&name##__imv),			\
+				  "i" (sizeof(value)));			\
+			break;						\
+		case 2:							\
+		case 4:							\
+			asm(".section __imv,\"a\",@progbits\n\t"	\
+				_ASM_PTR "%c1, (3f)-%c2\n\t"		\
+				".byte %c2\n\t"				\
+				".previous\n\t"				\
+				"mov $0,%0\n\t"				\
+				"3:\n\t"				\
+				: "=r" (value)				\
+				: "i" (&name##__imv),			\
+				  "i" (sizeof(value)));			\
+			break;						\
+		case 8:							\
+			if (sizeof(long) < 8) {				\
+				value = name##__imv;			\
+				break;					\
+			}						\
+			asm(".section __imv,\"a\",@progbits\n\t"	\
+				_ASM_PTR "%c1, (3f)-%c2\n\t"		\
+				".byte %c2\n\t"				\
+				".previous\n\t"				\
+				"mov $0xFEFEFEFE01010101,%0\n\t" 	\
+				"3:\n\t"				\
+				: "=r" (value)				\
+				: "i" (&name##__imv),			\
+				  "i" (sizeof(value)));			\
+			break;						\
+		};							\
+		value;							\
+	})
+
+#endif /* _ASM_X86_IMMEDIATE_H */
Index: linux-2.6-sched-devel/arch/x86/Kconfig
===================================================================
--- linux-2.6-sched-devel.orig/arch/x86/Kconfig	2008-04-22 20:04:02.000000000 -0400
+++ linux-2.6-sched-devel/arch/x86/Kconfig	2008-04-22 20:22:50.000000000 -0400
@@ -25,6 +25,7 @@ config X86
 	select HAVE_KRETPROBES
 	select HAVE_KVM if ((X86_32 && !X86_VOYAGER && !X86_VISWS && !X86_NUMAQ) || X86_64)
 	select HAVE_ARCH_KGDB if !X86_VOYAGER
+	select HAVE_IMMEDIATE
 
 
 config GENERIC_LOCKBREAK

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [patch 16/37] Add text_poke and sync_core to powerpc
  2008-04-24 15:03 [patch 00/37] Linux Kernel Markers instrumentation for sched-devel.git Mathieu Desnoyers
                   ` (14 preceding siblings ...)
  2008-04-24 15:03 ` [patch 15/37] Immediate Values - x86 Optimization Mathieu Desnoyers
@ 2008-04-24 15:03 ` Mathieu Desnoyers
  2008-04-24 15:03 ` [patch 17/37] Immediate Values - Powerpc Optimization Mathieu Desnoyers
                   ` (21 subsequent siblings)
  37 siblings, 0 replies; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-04-24 15:03 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel
  Cc: Mathieu Desnoyers, Rusty Russell, Christoph Hellwig,
	Paul Mackerras, Adrian Bunk, Andi Kleen, akpm

[-- Attachment #1: add-text-poke-and-sync-core-to-powerpc.patch --]
[-- Type: text/plain, Size: 1461 bytes --]

- Needed on architectures where we must surround live instruction modification
  with "WP flag disable".
- Turns into a memcpy on powerpc since there is no WP flag activated for
  instruction pages (yet..).
- Add an empty sync_core() to powerpc so it can be used in
  architecture-independent code.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Rusty Russell <rusty@rustcorp.com.au>
CC: Christoph Hellwig <hch@infradead.org>
CC: Paul Mackerras <paulus@samba.org>
CC: Adrian Bunk <bunk@stusta.de>
CC: Andi Kleen <andi@firstfloor.org>
CC: mingo@elte.hu
CC: akpm@osdl.org
---
 include/asm-powerpc/cacheflush.h |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Index: linux-2.6-lttng/include/asm-powerpc/cacheflush.h
===================================================================
--- linux-2.6-lttng.orig/include/asm-powerpc/cacheflush.h	2007-11-19 12:05:50.000000000 -0500
+++ linux-2.6-lttng/include/asm-powerpc/cacheflush.h	2007-11-19 13:27:36.000000000 -0500
@@ -63,7 +63,9 @@ extern void flush_dcache_phys_range(unsi
 #define copy_from_user_page(vma, page, vaddr, dst, src, len) \
 	memcpy(dst, src, len)
 
-
+#define text_poke	memcpy
+#define text_poke_early	text_poke
+#define sync_core()
 
 #ifdef CONFIG_DEBUG_PAGEALLOC
 /* internal debugging function */

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [patch 17/37] Immediate Values - Powerpc Optimization
  2008-04-24 15:03 [patch 00/37] Linux Kernel Markers instrumentation for sched-devel.git Mathieu Desnoyers
                   ` (15 preceding siblings ...)
  2008-04-24 15:03 ` [patch 16/37] Add text_poke and sync_core to powerpc Mathieu Desnoyers
@ 2008-04-24 15:03 ` Mathieu Desnoyers
  2008-04-24 15:03 ` [patch 18/37] Immediate Values - Documentation Mathieu Desnoyers
                   ` (20 subsequent siblings)
  37 siblings, 0 replies; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-04-24 15:03 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel
  Cc: Mathieu Desnoyers, Rusty Russell, Christoph Hellwig,
	Paul Mackerras, Adrian Bunk, Andi Kleen, akpm

[-- Attachment #1: immediate-values-powerpc-optimization.patch --]
[-- Type: text/plain, Size: 3149 bytes --]

PowerPC optimization of the immediate values which uses a li instruction,
patched with an immediate value.

Changelog:
- Put imv_set and _imv_set in the architecture independent header.
- Pack the __imv section. Use smallest types required for size (char).
- Remove architecture specific update code : now handled by architecture
  agnostic code.
- Use imv_* instead of immediate_*.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Rusty Russell <rusty@rustcorp.com.au>
CC: Christoph Hellwig <hch@infradead.org>
CC: Paul Mackerras <paulus@samba.org>
CC: Adrian Bunk <bunk@stusta.de>
CC: Andi Kleen <andi@firstfloor.org>
CC: mingo@elte.hu
CC: akpm@osdl.org
---
 arch/powerpc/Kconfig            |    1 
 include/asm-powerpc/immediate.h |   55 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 56 insertions(+)

Index: linux-2.6-sched-devel/include/asm-powerpc/immediate.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-sched-devel/include/asm-powerpc/immediate.h	2008-04-22 20:22:58.000000000 -0400
@@ -0,0 +1,55 @@
+#ifndef _ASM_POWERPC_IMMEDIATE_H
+#define _ASM_POWERPC_IMMEDIATE_H
+
+/*
+ * Immediate values. PowerPC architecture optimizations.
+ *
+ * (C) Copyright 2006 Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
+ *
+ * This file is released under the GPLv2.
+ * See the file COPYING for more details.
+ */
+
+#include <asm/asm-compat.h>
+
+/**
+ * imv_read - read immediate variable
+ * @name: immediate value name
+ *
+ * Reads the value of @name.
+ * Optimized version of the immediate.
+ * Do not use in __init and __exit functions. Use _imv_read() instead.
+ */
+#define imv_read(name)							\
+	({								\
+		__typeof__(name##__imv) value;				\
+		BUILD_BUG_ON(sizeof(value) > 8);			\
+		switch (sizeof(value)) {				\
+		case 1:							\
+			asm(".section __imv,\"a\",@progbits\n\t"	\
+					PPC_LONG "%c1, ((1f)-1)\n\t"	\
+					".byte 1\n\t"			\
+					".previous\n\t"			\
+					"li %0,0\n\t"			\
+					"1:\n\t"			\
+				: "=r" (value)				\
+				: "i" (&name##__imv));			\
+			break;						\
+		case 2:							\
+			asm(".section __imv,\"a\",@progbits\n\t"	\
+					PPC_LONG "%c1, ((1f)-2)\n\t"	\
+					".byte 2\n\t"			\
+					".previous\n\t"			\
+					"li %0,0\n\t"			\
+					"1:\n\t"			\
+				: "=r" (value)				\
+				: "i" (&name##__imv));			\
+			break;						\
+		case 4:							\
+		case 8:	value = name##__imv;				\
+			break;						\
+		};							\
+		value;							\
+	})
+
+#endif /* _ASM_POWERPC_IMMEDIATE_H */
Index: linux-2.6-sched-devel/arch/powerpc/Kconfig
===================================================================
--- linux-2.6-sched-devel.orig/arch/powerpc/Kconfig	2008-04-22 20:04:01.000000000 -0400
+++ linux-2.6-sched-devel/arch/powerpc/Kconfig	2008-04-22 20:23:18.000000000 -0400
@@ -110,6 +110,7 @@ config PPC
 	select HAVE_KPROBES
 	select HAVE_KRETPROBES
 	select HAVE_LMB
+	select HAVE_IMMEDIATE
 
 config EARLY_PRINTK
 	bool

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [patch 18/37] Immediate Values - Documentation
  2008-04-24 15:03 [patch 00/37] Linux Kernel Markers instrumentation for sched-devel.git Mathieu Desnoyers
                   ` (16 preceding siblings ...)
  2008-04-24 15:03 ` [patch 17/37] Immediate Values - Powerpc Optimization Mathieu Desnoyers
@ 2008-04-24 15:03 ` Mathieu Desnoyers
  2008-04-24 15:03 ` [patch 19/37] Immediate Values Support init Mathieu Desnoyers
                   ` (19 subsequent siblings)
  37 siblings, 0 replies; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-04-24 15:03 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel
  Cc: Mathieu Desnoyers, Rusty Russell, Adrian Bunk, Andi Kleen,
	Christoph Hellwig, akpm, KOSAKI Motohiro

[-- Attachment #1: immediate-values-documentation.patch --]
[-- Type: text/plain, Size: 9144 bytes --]

Changelog:
- Remove imv_set_early (removed from API).
- Use imv_* instead of immediate_*.
- Remove non-ascii characters.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Rusty Russell <rusty@rustcorp.com.au>
CC: Adrian Bunk <bunk@stusta.de>
CC: Andi Kleen <andi@firstfloor.org>
CC: Christoph Hellwig <hch@infradead.org>
CC: mingo@elte.hu
CC: akpm@osdl.org
CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 Documentation/immediate.txt |  221 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 221 insertions(+)

Index: linux-2.6-lttng/Documentation/immediate.txt
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/Documentation/immediate.txt	2008-04-17 08:48:52.000000000 -0400
@@ -0,0 +1,221 @@
+		        Using the Immediate Values
+
+			    Mathieu Desnoyers
+
+
+This document introduces Immediate Values and their use.
+
+
+* Purpose of immediate values
+
+An immediate value is used to compile into the kernel variables that sit within
+the instruction stream. They are meant to be rarely updated but read often.
+Using immediate values for these variables will save cache lines.
+
+This infrastructure is specialized in supporting dynamic patching of the values
+in the instruction stream when multiple CPUs are running without disturbing the
+normal system behavior.
+
+Compiling code meant to be rarely enabled at runtime can be done using
+if (unlikely(imv_read(var))) as condition surrounding the code. The
+smallest data type required for the test (an 8 bits char) is preferred, since
+some architectures, such as powerpc, only allow up to 16 bits immediate values.
+
+
+* Usage
+
+In order to use the "immediate" macros, you should include linux/immediate.h.
+
+#include <linux/immediate.h>
+
+DEFINE_IMV(char, this_immediate);
+EXPORT_IMV_SYMBOL(this_immediate);
+
+
+And use, in the body of a function:
+
+Use imv_set(this_immediate, value) to set the immediate value.
+
+Use imv_read(this_immediate) to read the immediate value.
+
+The immediate mechanism supports inserting multiple instances of the same
+immediate. Immediate values can be put in inline functions, inlined static
+functions, and unrolled loops.
+
+If you have to read the immediate values from a function declared as __init or
+__exit, you should explicitly use _imv_read(), which will fall back on a
+global variable read. Failing to do so will leave a reference to the __init
+section after it is freed (it would generate a modpost warning).
+
+You can choose to set an initial static value to the immediate by using, for
+instance:
+
+DEFINE_IMV(long, myptr) = 10;
+
+
+* Optimization for a given architecture
+
+One can implement optimized immediate values for a given architecture by
+replacing asm-$ARCH/immediate.h.
+
+
+* Performance improvement
+
+
+  * Memory hit for a data-based branch
+
+Here are the results on a 3GHz Pentium 4:
+
+number of tests: 100
+number of branches per test: 100000
+memory hit cycles per iteration (mean): 636.611
+L1 cache hit cycles per iteration (mean): 89.6413
+instruction stream based test, cycles per iteration (mean): 85.3438
+Just getting the pointer from a modulo on a pseudo-random value, doing
+  nothing with it, cycles per iteration (mean): 77.5044
+
+So:
+Base case:                      77.50 cycles
+instruction stream based test:  +7.8394 cycles
+L1 cache hit based test:        +12.1369 cycles
+Memory load based test:         +559.1066 cycles
+
+So let's say we have a ping flood coming at
+(14014 packets transmitted, 14014 received, 0% packet loss, time 1826ms)
+7674 packets per second. If we put 2 markers for irq entry/exit, it
+brings us to 15348 markers sites executed per second.
+
+(15348 exec/s) * (559 cycles/exec) / (3G cycles/s) = 0.0029
+We therefore have a 0.29% slowdown just on this case.
+
+Compared to this, the instruction stream based test will cause a
+slowdown of:
+
+(15348 exec/s) * (7.84 cycles/exec) / (3G cycles/s) = 0.00004
+For a 0.004% slowdown.
+
+If we plan to use this for memory allocation, spinlock, and all sorts of
+very high event rate tracing, we can assume it will execute 10 to 100
+times more sites per second, which brings us to 0.4% slowdown with the
+instruction stream based test compared to 29% slowdown with the memory
+load based test on a system with high memory pressure.
+
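+Worked out for the x100 case:
+
+(15348 * 100 exec/s) * (7.84 cycles/exec) / (3G cycles/s) = 0.004  -> 0.4%
+(15348 * 100 exec/s) * (559 cycles/exec)  / (3G cycles/s) = 0.286  -> 29%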
+
+
+  * Markers impact under heavy memory load
+
+I ran a kernel with my LTTng instrumentation set, in a test that
+generates memory pressure (from userspace) by thrashing the L1 and L2 caches
+between calls to getppid() (note: syscall_trace is active and calls
+a marker upon syscall entry and syscall exit; markers are disarmed).
+This test is done in user-space, so there are some delays due to IRQs
+coming in and to the scheduler. (UP 2.6.22-rc6-mm1 kernel, task at -20
+nice level)
+
+My first set of results, linear cache thrashing, turned out not to be
+very interesting: it seems the linearity of the memset on a full array is
+somehow detected, so it does not "really" thrash the caches.
+
+Now the most interesting result: random-walk L1 and L2 thrashing
+surrounding a getppid() call.
+
+- Markers compiled out (but syscall_trace execution forced)
+number of tests: 10000
+No memory pressure
+Reading timestamps takes 108.033 cycles
+getppid: 1681.4 cycles
+With memory pressure
+Reading timestamps takes 102.938 cycles
+getppid: 15691.6 cycles
+
+
+- With the immediate values based markers:
+number of tests: 10000
+No memory pressure
+Reading timestamps takes 108.006 cycles
+getppid: 1681.84 cycles
+With memory pressure
+Reading timestamps takes 100.291 cycles
+getppid: 11793 cycles
+
+
+- With global variables based markers:
+number of tests: 10000
+No memory pressure
+Reading timestamps takes 107.999 cycles
+getppid: 1669.06 cycles
+With memory pressure
+Reading timestamps takes 102.839 cycles
+getppid: 12535 cycles
+
+The result is quite interesting in that the kernel is slower without
+markers than with markers. I explain this by the fact that the data
+accessed is not laid out in the same manner in the cache lines when the
+markers are compiled in or out. It seems that compiling the markers in
+aligns the function's data better in this case.
+
+But since the interesting comparison is between the immediate values based
+and global variables based markers, and because they share the same memory
+layout, except for the movl being replaced by a movz, we see that the
+global variable based markers (2 markers) add 742 cycles (12535 - 11793)
+to each system call (syscall entry and exit are traced and the memory
+locations for both global variables lie on the same cache line).
+
+
+- Test redone with fewer iterations, but with error estimates
+
+10 runs of 100 iterations each, done on a 3GHz P4. Here I run getppid with
+syscall trace inactive, comparing the case with memory pressure and without
+memory pressure. (Sorry, my system is not set up to execute syscall_trace
+this time, but it will make the point anyway.)
+
+No memory pressure
+Reading timestamps:     150.92 cycles,     std dev.    1.01 cycles
+getppid:               1462.09 cycles,     std dev.   18.87 cycles
+
+With memory pressure
+Reading timestamps:     578.22 cycles,     std dev.  269.51 cycles
+getppid:              17113.33 cycles,     std dev. 1655.92 cycles
+
+
+Now for memory read timing: (10 runs, branches per test: 100000)
+Memory read based branch:
+                       644.09 cycles,      std dev.   11.39 cycles
+L1 cache hit based branch:
+                        88.16 cycles,      std dev.    1.35 cycles
+
+
+So, now that we have the raw results, let's calculate:
+
+Memory read:
+644.09 +/- 11.39 - 88.16 +/- 1.35 = 555.93 +/- 11.46 cycles
+
+Getppid without memory pressure:
+1462.09 +/- 18.87 - 150.92 +/- 1.01 = 1311.17 +/- 18.90 cycles
+
+Getppid with memory pressure:
+17113.33 +/- 1655.92 - 578.22 +/- 269.51 = 16535.11 +/- 1677.71 cycles
+
+Therefore, if we add 2 markers not based on immediate values to the getppid
+code, which would add 2 memory reads, we would add
+2 * 555.93 +/- 12.74 = 1111.86 +/- 25.48 cycles
+
+Therefore,
+
+(1111.86 +/- 25.48) / (16535.11 +/- 1677.71) = 0.0672
+ relative error: sqrt(((25.48/1111.86)^2)+((1677.71/16535.11)^2))
+                     = 0.1040
+ absolute error: 0.1040 * 0.0672 = 0.0070
+
+Therefore: (0.0672 +/- 0.0070) * 100% = 6.72 +/- 0.70 %
+
+We can therefore affirm that adding 2 markers to getppid, on a system with high
+memory pressure, would have a performance hit of at least 6.0% on the system
+call time, all within the uncertainty limits of these tests. The same applies to
+other kernel code paths. The smaller those code paths are, the higher the
+impact ratio will be.
+
+Therefore, not only is it interesting to use the immediate values to dynamically
+activate dormant code such as the markers, but I think they should also be
+considered as a replacement for many of the "read-mostly" static variables.

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [patch 19/37] Immediate Values Support init
  2008-04-24 15:03 [patch 00/37] Linux Kernel Markers instrumentation for sched-devel.git Mathieu Desnoyers
                   ` (17 preceding siblings ...)
  2008-04-24 15:03 ` [patch 18/37] Immediate Values - Documentation Mathieu Desnoyers
@ 2008-04-24 15:03 ` Mathieu Desnoyers
  2008-04-24 15:03 ` [patch 20/37] Immediate Values - Move Kprobes x86 restore_interrupt to kdebug.h Mathieu Desnoyers
                   ` (18 subsequent siblings)
  37 siblings, 0 replies; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-04-24 15:03 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel
  Cc: Mathieu Desnoyers, Rusty Russell, Frank Ch. Eigler, KOSAKI Motohiro

[-- Attachment #1: immediate-values-support-init.patch --]
[-- Type: text/plain, Size: 9956 bytes --]

Supports placing immediate values in init code

We need to put the immediate values in a RW data section so we can edit them
before the init sections are unloaded.

This code puts NULL pointers in lieu of the original pointers referencing init
code before the init sections are freed, both in the core kernel and in modules.

TODO: support the __exit section.

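For illustration, a sketch of the situation this handles (the flag and helper
names below are made up):

DEFINE_IMV(char, my_debug_flag);
extern void setup_extra_debugging(void);	/* made-up helper */

static int __init my_driver_init(void)
{
	/*
	 * This imv_read() site lives in init text, so its __imv table entry
	 * points into memory that is freed once boot (or module init) is done.
	 */
	if (imv_read(my_debug_flag))
		setup_extra_debugging();
	return 0;
}

Once the init text is freed, imv_unref() zeroes the ->imv pointer of such
entries, so imv_update_range() skips them instead of patching freed memory.
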
Changelog:
- Fix !CONFIG_IMMEDIATE

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Rusty Russell <rusty@rustcorp.com.au>
CC: "Frank Ch. Eigler" <fche@redhat.com>
CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 Documentation/immediate.txt       |    8 ++++----
 include/asm-generic/vmlinux.lds.h |    8 ++++----
 include/asm-powerpc/immediate.h   |    4 ++--
 include/asm-x86/immediate.h       |    6 +++---
 include/linux/immediate.h         |    4 ++++
 include/linux/module.h            |    2 +-
 init/main.c                       |    1 +
 kernel/immediate.c                |   31 +++++++++++++++++++++++++++++--
 kernel/module.c                   |    4 ++++
 9 files changed, 52 insertions(+), 16 deletions(-)

Index: linux-2.6-sched-devel/kernel/immediate.c
===================================================================
--- linux-2.6-sched-devel.orig/kernel/immediate.c	2008-04-22 20:21:52.000000000 -0400
+++ linux-2.6-sched-devel/kernel/immediate.c	2008-04-22 20:23:22.000000000 -0400
@@ -22,6 +22,7 @@
 #include <linux/cpu.h>
 #include <linux/stop_machine.h>
 
+#include <asm/sections.h>
 #include <asm/cacheflush.h>
 
 /*
@@ -30,8 +31,8 @@
 static int imv_early_boot_complete;
 static int wrote_text;
 
-extern const struct __imv __start___imv[];
-extern const struct __imv __stop___imv[];
+extern struct __imv __start___imv[];
+extern struct __imv __stop___imv[];
 
 static int stop_machine_imv_update(void *imv_ptr)
 {
@@ -118,6 +119,8 @@ void imv_update_range(const struct __imv
 	int ret;
 	for (iter = begin; iter < end; iter++) {
 		mutex_lock(&imv_mutex);
+		if (!iter->imv)	/* Skip removed __init immediate values */
+			goto skip;
 		ret = apply_imv_update(iter);
 		if (imv_early_boot_complete && ret)
 			printk(KERN_WARNING
@@ -126,6 +129,7 @@ void imv_update_range(const struct __imv
 				"instruction at %p, size %hu\n",
 				(void *)iter->imv,
 				(void *)iter->var, iter->size);
+skip:
 		mutex_unlock(&imv_mutex);
 	}
 }
@@ -143,6 +147,29 @@ void core_imv_update(void)
 }
 EXPORT_SYMBOL_GPL(core_imv_update);
 
+/**
+ * imv_unref
+ *
+ * Deactivate any immediate value reference pointing into the code region in the
+ * range start to start + size.
+ */
+void imv_unref(struct __imv *begin, struct __imv *end, void *start,
+		unsigned long size)
+{
+	struct __imv *iter;
+
+	for (iter = begin; iter < end; iter++)
+		if (iter->imv >= (unsigned long)start
+			&& iter->imv < (unsigned long)start + size)
+			iter->imv = 0UL;
+}
+
+void imv_unref_core_init(void)
+{
+	imv_unref(__start___imv, __stop___imv, __init_begin,
+		(unsigned long)__init_end - (unsigned long)__init_begin);
+}
+
 void __init imv_init_complete(void)
 {
 	imv_early_boot_complete = 1;
Index: linux-2.6-sched-devel/kernel/module.c
===================================================================
--- linux-2.6-sched-devel.orig/kernel/module.c	2008-04-22 20:21:52.000000000 -0400
+++ linux-2.6-sched-devel/kernel/module.c	2008-04-22 20:23:22.000000000 -0400
@@ -2207,6 +2207,10 @@ sys_init_module(void __user *umod,
 	/* Drop initial reference. */
 	module_put(mod);
 	unwind_remove_table(mod->unwind_info, 1);
+#ifdef CONFIG_IMMEDIATE
+	imv_unref(mod->immediate, mod->immediate + mod->num_immediate,
+		mod->module_init, mod->init_size);
+#endif
 	module_free(mod, mod->module_init);
 	mod->module_init = NULL;
 	mod->init_size = 0;
Index: linux-2.6-sched-devel/include/linux/module.h
===================================================================
--- linux-2.6-sched-devel.orig/include/linux/module.h	2008-04-22 20:21:52.000000000 -0400
+++ linux-2.6-sched-devel/include/linux/module.h	2008-04-22 20:23:22.000000000 -0400
@@ -357,7 +357,7 @@ struct module
 	   keeping pointers to this stuff */
 	char *args;
 #ifdef CONFIG_IMMEDIATE
-	const struct __imv *immediate;
+	struct __imv *immediate;
 	unsigned int num_immediate;
 #endif
 #ifdef CONFIG_MARKERS
Index: linux-2.6-sched-devel/include/asm-generic/vmlinux.lds.h
===================================================================
--- linux-2.6-sched-devel.orig/include/asm-generic/vmlinux.lds.h	2008-04-22 20:21:52.000000000 -0400
+++ linux-2.6-sched-devel/include/asm-generic/vmlinux.lds.h	2008-04-22 20:23:22.000000000 -0400
@@ -52,7 +52,10 @@
 	. = ALIGN(8);							\
 	VMLINUX_SYMBOL(__start___markers) = .;				\
 	*(__markers)							\
-	VMLINUX_SYMBOL(__stop___markers) = .;
+	VMLINUX_SYMBOL(__stop___markers) = .;				\
+	VMLINUX_SYMBOL(__start___imv) = .;				\
+	*(__imv)		/* Immediate values: pointers */ 	\
+	VMLINUX_SYMBOL(__stop___imv) = .;
 
 #define RO_DATA(align)							\
 	. = ALIGN((align));						\
@@ -61,9 +64,6 @@
 		*(.rodata) *(.rodata.*)					\
 		*(__vermagic)		/* Kernel version magic */	\
 		*(__markers_strings)	/* Markers: strings */		\
-		VMLINUX_SYMBOL(__start___imv) = .;			\
-		*(__imv)		/* Immediate values: pointers */ \
-		VMLINUX_SYMBOL(__stop___imv) = .;			\
 	}								\
 									\
 	.rodata1          : AT(ADDR(.rodata1) - LOAD_OFFSET) {		\
Index: linux-2.6-sched-devel/include/linux/immediate.h
===================================================================
--- linux-2.6-sched-devel.orig/include/linux/immediate.h	2008-04-22 20:21:52.000000000 -0400
+++ linux-2.6-sched-devel/include/linux/immediate.h	2008-04-22 20:23:22.000000000 -0400
@@ -46,6 +46,9 @@ struct __imv {
 extern void core_imv_update(void);
 extern void imv_update_range(const struct __imv *begin,
 	const struct __imv *end);
+extern void imv_unref_core_init(void);
+extern void imv_unref(struct __imv *begin, struct __imv *end, void *start,
+		unsigned long size);
 
 #else
 
@@ -73,6 +76,7 @@ extern void imv_update_range(const struc
 
 static inline void core_imv_update(void) { }
 static inline void module_imv_update(void) { }
+static inline void imv_unref_core_init(void) { }
 
 #endif
 
Index: linux-2.6-sched-devel/init/main.c
===================================================================
--- linux-2.6-sched-devel.orig/init/main.c	2008-04-22 20:22:16.000000000 -0400
+++ linux-2.6-sched-devel/init/main.c	2008-04-22 20:23:22.000000000 -0400
@@ -808,6 +808,7 @@ static void run_init_process(char *init_
  */
 static int noinline init_post(void)
 {
+	imv_unref_core_init();
 	free_initmem();
 	unlock_kernel();
 	mark_rodata_ro();
Index: linux-2.6-sched-devel/include/asm-x86/immediate.h
===================================================================
--- linux-2.6-sched-devel.orig/include/asm-x86/immediate.h	2008-04-22 20:22:29.000000000 -0400
+++ linux-2.6-sched-devel/include/asm-x86/immediate.h	2008-04-22 20:23:22.000000000 -0400
@@ -33,7 +33,7 @@
 		BUILD_BUG_ON(sizeof(value) > 8);			\
 		switch (sizeof(value)) {				\
 		case 1:							\
-			asm(".section __imv,\"a\",@progbits\n\t"	\
+			asm(".section __imv,\"aw\",@progbits\n\t"	\
 				_ASM_PTR "%c1, (3f)-%c2\n\t"		\
 				".byte %c2\n\t"				\
 				".previous\n\t"				\
@@ -45,7 +45,7 @@
 			break;						\
 		case 2:							\
 		case 4:							\
-			asm(".section __imv,\"a\",@progbits\n\t"	\
+			asm(".section __imv,\"aw\",@progbits\n\t"	\
 				_ASM_PTR "%c1, (3f)-%c2\n\t"		\
 				".byte %c2\n\t"				\
 				".previous\n\t"				\
@@ -60,7 +60,7 @@
 				value = name##__imv;			\
 				break;					\
 			}						\
-			asm(".section __imv,\"a\",@progbits\n\t"	\
+			asm(".section __imv,\"aw\",@progbits\n\t"	\
 				_ASM_PTR "%c1, (3f)-%c2\n\t"		\
 				".byte %c2\n\t"				\
 				".previous\n\t"				\
Index: linux-2.6-sched-devel/include/asm-powerpc/immediate.h
===================================================================
--- linux-2.6-sched-devel.orig/include/asm-powerpc/immediate.h	2008-04-22 20:22:58.000000000 -0400
+++ linux-2.6-sched-devel/include/asm-powerpc/immediate.h	2008-04-22 20:23:22.000000000 -0400
@@ -26,7 +26,7 @@
 		BUILD_BUG_ON(sizeof(value) > 8);			\
 		switch (sizeof(value)) {				\
 		case 1:							\
-			asm(".section __imv,\"a\",@progbits\n\t"	\
+			asm(".section __imv,\"aw\",@progbits\n\t"	\
 					PPC_LONG "%c1, ((1f)-1)\n\t"	\
 					".byte 1\n\t"			\
 					".previous\n\t"			\
@@ -36,7 +36,7 @@
 				: "i" (&name##__imv));			\
 			break;						\
 		case 2:							\
-			asm(".section __imv,\"a\",@progbits\n\t"	\
+			asm(".section __imv,\"aw\",@progbits\n\t"	\
 					PPC_LONG "%c1, ((1f)-2)\n\t"	\
 					".byte 2\n\t"			\
 					".previous\n\t"			\
Index: linux-2.6-sched-devel/Documentation/immediate.txt
===================================================================
--- linux-2.6-sched-devel.orig/Documentation/immediate.txt	2008-04-22 20:23:21.000000000 -0400
+++ linux-2.6-sched-devel/Documentation/immediate.txt	2008-04-22 20:23:22.000000000 -0400
@@ -42,10 +42,10 @@ The immediate mechanism supports inserti
 immediate. Immediate values can be put in inline functions, inlined static
 functions, and unrolled loops.
 
-If you have to read the immediate values from a function declared as __init or
-__exit, you should explicitly use _imv_read(), which will fall back on a
-global variable read. Failing to do so will leave a reference to the __init
-section after it is freed (it would generate a modpost warning).
+If you have to read the immediate values from a function declared as __exit, you
+should explicitly use _imv_read(), which will fall back on a global variable
+read. Failing to do so would leave a reference to the __exit section in a kernel
+built without module unload support. imv_read() in the __init section is supported.
 
 You can choose to set an initial static value to the immediate by using, for
 instance:

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [patch 20/37] Immediate Values - Move Kprobes x86 restore_interrupt to kdebug.h
  2008-04-24 15:03 [patch 00/37] Linux Kernel Markers instrumentation for sched-devel.git Mathieu Desnoyers
                   ` (18 preceding siblings ...)
  2008-04-24 15:03 ` [patch 19/37] Immediate Values Support init Mathieu Desnoyers
@ 2008-04-24 15:03 ` Mathieu Desnoyers
  2008-04-24 15:03 ` [patch 21/37] Add __discard section to x86 Mathieu Desnoyers
                   ` (17 subsequent siblings)
  37 siblings, 0 replies; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-04-24 15:03 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel
  Cc: Mathieu Desnoyers, Ananth N Mavinakayanahalli, Christoph Hellwig,
	anil.s.keshavamurthy, davem, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin

[-- Attachment #1: immediate-values-move-kprobes-x86-restore-interrupt-to-kdebug-h.patch --]
[-- Type: text/plain, Size: 2516 bytes --]

Since the breakpoint handler is useful both to kprobes and to immediate values,
it makes sense to make the required restore_interrupts() available through
asm-x86/kdebug.h.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
CC: Christoph Hellwig <hch@infradead.org>
CC: anil.s.keshavamurthy@intel.com
CC: davem@davemloft.net
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
CC: H. Peter Anvin <hpa@zytor.com>
---
 include/asm-x86/kdebug.h  |   12 ++++++++++++
 include/asm-x86/kprobes.h |    9 ---------
 2 files changed, 12 insertions(+), 9 deletions(-)

Index: linux-2.6-lttng/include/asm-x86/kdebug.h
===================================================================
--- linux-2.6-lttng.orig/include/asm-x86/kdebug.h	2008-03-25 08:56:54.000000000 -0400
+++ linux-2.6-lttng/include/asm-x86/kdebug.h	2008-03-25 09:00:17.000000000 -0400
@@ -3,6 +3,9 @@
 
 #include <linux/notifier.h>
 
+#include <linux/ptrace.h>
+#include <asm/system.h>
+
 struct pt_regs;
 
 /* Grossly misnamed. */
@@ -34,4 +37,13 @@ extern void show_regs(struct pt_regs *re
 extern unsigned long oops_begin(void);
 extern void oops_end(unsigned long, struct pt_regs *, int signr);
 
+/* trap3/1 are intr gates for kprobes.  So, restore the status of IF,
+ * if necessary, before executing the original int3/1 (trap) handler.
+ */
+static inline void restore_interrupts(struct pt_regs *regs)
+{
+	if (regs->flags & X86_EFLAGS_IF)
+		local_irq_enable();
+}
+
 #endif
Index: linux-2.6-lttng/include/asm-x86/kprobes.h
===================================================================
--- linux-2.6-lttng.orig/include/asm-x86/kprobes.h	2008-03-25 08:56:54.000000000 -0400
+++ linux-2.6-lttng/include/asm-x86/kprobes.h	2008-03-25 09:00:17.000000000 -0400
@@ -82,15 +82,6 @@ struct kprobe_ctlblk {
 	struct prev_kprobe prev_kprobe;
 };
 
-/* trap3/1 are intr gates for kprobes.  So, restore the status of IF,
- * if necessary, before executing the original int3/1 (trap) handler.
- */
-static inline void restore_interrupts(struct pt_regs *regs)
-{
-	if (regs->flags & X86_EFLAGS_IF)
-		local_irq_enable();
-}
-
 extern int kprobe_fault_handler(struct pt_regs *regs, int trapnr);
 extern int kprobe_exceptions_notify(struct notifier_block *self,
 				    unsigned long val, void *data);

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [patch 21/37] Add __discard section to x86
  2008-04-24 15:03 [patch 00/37] Linux Kernel Markers instrumentation for sched-devel.git Mathieu Desnoyers
                   ` (19 preceding siblings ...)
  2008-04-24 15:03 ` [patch 20/37] Immediate Values - Move Kprobes x86 restore_interrupt to kdebug.h Mathieu Desnoyers
@ 2008-04-24 15:03 ` Mathieu Desnoyers
  2008-04-24 15:03 ` [patch 22/37] Immediate Values - x86 Optimization NMI and MCE support Mathieu Desnoyers
                   ` (16 subsequent siblings)
  37 siblings, 0 replies; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-04-24 15:03 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel
  Cc: Mathieu Desnoyers, H. Peter Anvin, Andi Kleen, Chuck Ebbert,
	Christoph Hellwig, Jeremy Fitzhardinge, Thomas Gleixner,
	Ingo Molnar

[-- Attachment #1: add-discard-section-to-x86.patch --]
[-- Type: text/plain, Size: 1812 bytes --]

Add a __discard section to the linker script. Code emitted in this section will
not be put in the vmlinux file. This is useful when we have to calculate the
size of an instruction before actually declaring it (for alignment purposes, for
instance). This is used by the immediate values.

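For illustration, the kind of construct this enables (a stripped-down sketch;
the __sizes section name is made up, the real user being the __imv records
emitted by imv_read()):

static inline void probe_mov_size(void)
{
	/*
	 * Assemble a throw-away copy of the instruction in __discard and
	 * record only its encoded length; the copy never reaches vmlinux
	 * because the linker script discards that section.
	 */
	asm(".section __discard,\"\",@progbits\n"
	    "1:\n"
	    "mov $0,%eax\n"
	    "2:\n"
	    ".previous\n"
	    ".section __sizes,\"a\",@progbits\n"
	    ".byte (2b - 1b)\n"		/* size of the discarded mov */
	    ".previous\n");
}
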
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Acked-by: H. Peter Anvin <hpa@zytor.com>
CC: Andi Kleen <ak@muc.de>
CC: Chuck Ebbert <cebbert@redhat.com>
CC: Christoph Hellwig <hch@infradead.org>
CC: Jeremy Fitzhardinge <jeremy@goop.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
---
 arch/x86/kernel/vmlinux_32.lds.S |    1 +
 arch/x86/kernel/vmlinux_64.lds.S |    1 +
 2 files changed, 2 insertions(+)

Index: linux-2.6-sched-devel/arch/x86/kernel/vmlinux_32.lds.S
===================================================================
--- linux-2.6-sched-devel.orig/arch/x86/kernel/vmlinux_32.lds.S	2008-04-22 20:04:02.000000000 -0400
+++ linux-2.6-sched-devel/arch/x86/kernel/vmlinux_32.lds.S	2008-04-22 20:33:15.000000000 -0400
@@ -213,6 +213,7 @@ SECTIONS
   /* Sections to be discarded */
   /DISCARD/ : {
 	*(.exitcall.exit)
+	*(__discard)
 	}
 
   STABS_DEBUG
Index: linux-2.6-sched-devel/arch/x86/kernel/vmlinux_64.lds.S
===================================================================
--- linux-2.6-sched-devel.orig/arch/x86/kernel/vmlinux_64.lds.S	2008-04-22 20:04:02.000000000 -0400
+++ linux-2.6-sched-devel/arch/x86/kernel/vmlinux_64.lds.S	2008-04-22 20:33:15.000000000 -0400
@@ -246,6 +246,7 @@ SECTIONS
   /DISCARD/ : {
 	*(.exitcall.exit)
 	*(.eh_frame)
+	*(__discard)
 	}
 
   STABS_DEBUG

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [patch 22/37] Immediate Values - x86 Optimization NMI and MCE support
  2008-04-24 15:03 [patch 00/37] Linux Kernel Markers instrumentation for sched-devel.git Mathieu Desnoyers
                   ` (20 preceding siblings ...)
  2008-04-24 15:03 ` [patch 21/37] Add __discard section to x86 Mathieu Desnoyers
@ 2008-04-24 15:03 ` Mathieu Desnoyers
  2008-04-24 15:03 ` [patch 23/37] Immediate Values - Powerpc Optimization NMI " Mathieu Desnoyers
                   ` (15 subsequent siblings)
  37 siblings, 0 replies; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-04-24 15:03 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel
  Cc: Mathieu Desnoyers, Andi Kleen, H. Peter Anvin, Chuck Ebbert,
	Christoph Hellwig, Jeremy Fitzhardinge, Thomas Gleixner,
	Ingo Molnar

[-- Attachment #1: immediate-values-x86-optimization-nmi-mce-support.patch --]
[-- Type: text/plain, Size: 17257 bytes --]

x86 optimization of the immediate values: it uses a movl with code patching
to set/unset the value that populates the register used as the variable source.
It uses a breakpoint to bypass the instruction being changed, which lessens the
interrupt latency of the operation and protects against NMIs and MCEs.

- More reentrant immediate value: uses a breakpoint. Needs to know the
  instruction's first byte. This is why we keep the "instruction size"
  variable, so we can support REX-prefixed instructions too.

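As a reading aid, a hypothetical helper showing what the new record fields let
the update code recover (it mirrors the arithmetic at the top of
arch_imv_update() below):

/*
 * ->imv points at the immediate operand, ->size is the operand size and
 * ->insn_size the whole instruction size, so the opcode length and the
 * instruction start (where the int3 is placed) can be derived.
 */
static unsigned long imv_insn_start(const struct __imv *imv)
{
	unsigned char opcode_size = imv->insn_size - imv->size;

	return imv->imv - opcode_size;
}
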
Changelog:
- Change the immediate.c update code to support variable length opcodes.
- Use text_poke_early with cr0 WP save/restore to patch the bypass. We are doing
  non atomic writes to a code region only touched by us (nobody can execute it
  since we are protected by the imv_mutex).
- Add x86_64 support, ready for i386+x86_64 -> x86 merge.
- Use asm-x86/asm.h.
- Use imv_* instead of immediate_*.
- Use kernel_wp_disable/enable instead of save/restore.
- Fix 1 byte immediate value so it declares its instruction size.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Andi Kleen <ak@muc.de>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Chuck Ebbert <cebbert@redhat.com>
CC: Christoph Hellwig <hch@infradead.org>
CC: Jeremy Fitzhardinge <jeremy@goop.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
---
 arch/x86/kernel/Makefile    |    1 
 arch/x86/kernel/immediate.c |  291 ++++++++++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/traps_32.c  |    8 -
 include/asm-x86/immediate.h |   48 ++++++-
 4 files changed, 338 insertions(+), 10 deletions(-)

Index: linux-2.6-sched-devel/include/asm-x86/immediate.h
===================================================================
--- linux-2.6-sched-devel.orig/include/asm-x86/immediate.h	2008-04-22 20:23:22.000000000 -0400
+++ linux-2.6-sched-devel/include/asm-x86/immediate.h	2008-04-22 20:33:22.000000000 -0400
@@ -12,6 +12,18 @@
 
 #include <asm/asm.h>
 
+struct __imv {
+	unsigned long var;	/* Pointer to the identifier variable of the
+				 * immediate value
+				 */
+	unsigned long imv;	/*
+				 * Pointer to the memory location of the
+				 * immediate value within the instruction.
+				 */
+	unsigned char size;	/* Type size. */
+	unsigned char insn_size;/* Instruction size. */
+} __attribute__ ((packed));
+
 /**
  * imv_read - read immediate variable
  * @name: immediate value name
@@ -26,6 +38,11 @@
  * what will generate an instruction with 8 bytes immediate value (not the REX.W
  * prefixed one that loads a sign extended 32 bits immediate value in a r64
  * register).
+ *
+ * Create the instruction in a discarded section to calculate its size. This is
+ * how we can align the beginning of the instruction on an address that will
+ * permit atomic modification of the immediate value without knowing the size of
+ * the opcode used by the compiler. The operand size is known in advance.
  */
 #define imv_read(name)							\
 	({								\
@@ -33,9 +50,14 @@
 		BUILD_BUG_ON(sizeof(value) > 8);			\
 		switch (sizeof(value)) {				\
 		case 1:							\
-			asm(".section __imv,\"aw\",@progbits\n\t"	\
+			asm(".section __discard,\"\",@progbits\n\t"	\
+				"1:\n\t"				\
+				"mov $0,%0\n\t"				\
+				"2:\n\t"				\
+				".previous\n\t"				\
+				".section __imv,\"aw\",@progbits\n\t"	\
 				_ASM_PTR "%c1, (3f)-%c2\n\t"		\
-				".byte %c2\n\t"				\
+				".byte %c2, (2b-1b)\n\t"		\
 				".previous\n\t"				\
 				"mov $0,%0\n\t"				\
 				"3:\n\t"				\
@@ -45,10 +67,16 @@
 			break;						\
 		case 2:							\
 		case 4:							\
-			asm(".section __imv,\"aw\",@progbits\n\t"	\
+			asm(".section __discard,\"\",@progbits\n\t"	\
+				"1:\n\t"				\
+				"mov $0,%0\n\t"				\
+				"2:\n\t"				\
+				".previous\n\t"				\
+				".section __imv,\"aw\",@progbits\n\t"	\
 				_ASM_PTR "%c1, (3f)-%c2\n\t"		\
-				".byte %c2\n\t"				\
+				".byte %c2, (2b-1b)\n\t"		\
 				".previous\n\t"				\
+				".org . + ((-.-(2b-1b)) & (%c2-1)), 0x90\n\t" \
 				"mov $0,%0\n\t"				\
 				"3:\n\t"				\
 				: "=r" (value)				\
@@ -60,10 +88,16 @@
 				value = name##__imv;			\
 				break;					\
 			}						\
-			asm(".section __imv,\"aw\",@progbits\n\t"	\
+			asm(".section __discard,\"\",@progbits\n\t"	\
+				"1:\n\t"				\
+				"mov $0xFEFEFEFE01010101,%0\n\t"	\
+				"2:\n\t"				\
+				".previous\n\t"				\
+				".section __imv,\"aw\",@progbits\n\t"	\
 				_ASM_PTR "%c1, (3f)-%c2\n\t"		\
-				".byte %c2\n\t"				\
+				".byte %c2, (2b-1b)\n\t"		\
 				".previous\n\t"				\
+				".org . + ((-.-(2b-1b)) & (%c2-1)), 0x90\n\t" \
 				"mov $0xFEFEFEFE01010101,%0\n\t" 	\
 				"3:\n\t"				\
 				: "=r" (value)				\
@@ -74,4 +108,6 @@
 		value;							\
 	})
 
+extern int arch_imv_update(const struct __imv *imv, int early);
+
 #endif /* _ASM_X86_IMMEDIATE_H */
Index: linux-2.6-sched-devel/arch/x86/kernel/traps_32.c
===================================================================
--- linux-2.6-sched-devel.orig/arch/x86/kernel/traps_32.c	2008-04-22 20:10:45.000000000 -0400
+++ linux-2.6-sched-devel/arch/x86/kernel/traps_32.c	2008-04-22 20:34:47.000000000 -0400
@@ -595,7 +595,7 @@ void do_##name(struct pt_regs *regs, lon
 }
 
 DO_VM86_ERROR_INFO(0, SIGFPE,  "divide error", divide_error, FPE_INTDIV, regs->ip)
-#ifndef CONFIG_KPROBES
+#if !defined(CONFIG_KPROBES) && !defined(CONFIG_IMMEDIATE)
 DO_VM86_ERROR(3, SIGTRAP, "int3", int3)
 #endif
 DO_VM86_ERROR(4, SIGSEGV, "overflow", overflow)
@@ -860,7 +860,7 @@ void restart_nmi(void)
 	acpi_nmi_enable();
 }
 
-#ifdef CONFIG_KPROBES
+#if defined(CONFIG_KPROBES) || defined(CONFIG_IMMEDIATE)
 void __kprobes do_int3(struct pt_regs *regs, long error_code)
 {
 	trace_hardirqs_fixup();
@@ -869,8 +869,8 @@ void __kprobes do_int3(struct pt_regs *r
 			== NOTIFY_STOP)
 		return;
 	/*
-	 * This is an interrupt gate, because kprobes wants interrupts
-	 * disabled. Normal trap handlers don't.
+	 * This is an interrupt gate, because kprobes and immediate values want
+	 * interrupts disabled. Normal trap handlers don't.
 	 */
 	restore_interrupts(regs);
 
Index: linux-2.6-sched-devel/arch/x86/kernel/Makefile
===================================================================
--- linux-2.6-sched-devel.orig/arch/x86/kernel/Makefile	2008-04-22 20:04:02.000000000 -0400
+++ linux-2.6-sched-devel/arch/x86/kernel/Makefile	2008-04-22 20:33:22.000000000 -0400
@@ -68,6 +68,7 @@ obj-y				+= vsmp_64.o
 obj-$(CONFIG_KPROBES)		+= kprobes.o
 obj-$(CONFIG_MODULES)		+= module_$(BITS).o
 obj-$(CONFIG_ACPI_SRAT) 	+= srat_32.o
+obj-$(CONFIG_IMMEDIATE)		+= immediate.o
 obj-$(CONFIG_EFI) 		+= efi.o efi_$(BITS).o efi_stub_$(BITS).o
 obj-$(CONFIG_DOUBLEFAULT) 	+= doublefault_32.o
 obj-$(CONFIG_KGDB)		+= kgdb.o
Index: linux-2.6-sched-devel/arch/x86/kernel/immediate.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-sched-devel/arch/x86/kernel/immediate.c	2008-04-22 20:33:22.000000000 -0400
@@ -0,0 +1,291 @@
+/*
+ * Immediate Value - x86 architecture specific code.
+ *
+ * Rationale
+ *
+ * Required because of:
+ * - Erratum 49 fix for Intel PIII.
+ * - Still present on newer processors : Intel Core 2 Duo Processor for Intel
+ *   Centrino Duo Processor Technology Specification Update, AH33.
+ *   Unsynchronized Cross-Modifying Code Operations Can Cause Unexpected
+ *   Instruction Execution Results.
+ *
+ * Permits immediate value modification by XMC with correct serialization.
+ *
+ * Reentrant for NMI and trap handler instrumentation. Permits XMC to a
+ * location that has preemption enabled because it involves no temporary or
+ * reused data structure.
+ *
+ * Quoting Richard J Moore, source of the information motivating this
+ * implementation, which differs from the one proposed by Intel because theirs
+ * is not suitable for kernel context (it does not support NMI and would require
+ * disabling interrupts on every CPU for a long period):
+ *
+ * "There is another issue to consider when looking into using probes other
+ * then int3:
+ *
+ * Intel erratum 54 - Unsynchronized Cross-modifying code - refers to the
+ * practice of modifying code on one processor where another has prefetched
+ * the unmodified version of the code. Intel states that unpredictable general
+ * protection faults may result if a synchronizing instruction (iret, int,
+ * int3, cpuid, etc ) is not executed on the second processor before it
+ * executes the pre-fetched out-of-date copy of the instruction.
+ *
+ * When we became aware of this I had a long discussion with Intel's
+ * microarchitecture guys. It turns out that the reason for this erratum
+ * (which incidentally Intel does not intend to fix) is because the trace
+ * cache - the stream of micro-ops resulting from instruction interpretation -
+ * cannot be guaranteed to be valid. Reading between the lines I assume this
+ * issue arises because of optimization done in the trace cache, where it is
+ * no longer possible to identify the original instruction boundaries. If the
+ * CPU discovers that the trace cache has been invalidated because of
+ * unsynchronized cross-modification then instruction execution will be
+ * aborted with a GPF. Further discussion with Intel revealed that replacing
+ * the first opcode byte with an int3 would not be subject to this erratum.
+ *
+ * So, is cmpxchg reliable? One has to guarantee more than mere atomicity."
+ *
+ * Overall design
+ *
+ * The algorithm proposed by Intel does not apply well in a kernel context: it
+ * would imply disabling interrupts and looping on every CPU while modifying
+ * the code, and would not support instrumentation of code called from interrupt
+ * sources that cannot be disabled.
+ *
+ * Therefore, we use a different algorithm to respect Intel's erratum (see the
+ * quoted discussion above). We make sure that no CPU sees an out-of-date copy
+ * of a pre-fetched instruction by: 1 - using a breakpoint, which skips the
+ * instruction that is going to be modified, 2 - issuing an IPI to every CPU to
+ * execute a sync_core(), to make sure that even when the breakpoint is removed,
+ * no CPU could possibly still have an out-of-date copy of the instruction,
+ * 3 - modifying the now unused 2nd byte of the instruction, and then
+ * 4 - putting back the original 1st byte of the instruction.
+ *
+ * It has exactly the same intent as the algorithm proposed by Intel, but
+ * it has less side-effects, scales better and supports NMI, SMI and MCE.
+ *
+ * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
+ */
+
+#include <linux/preempt.h>
+#include <linux/smp.h>
+#include <linux/notifier.h>
+#include <linux/module.h>
+#include <linux/immediate.h>
+#include <linux/kdebug.h>
+#include <linux/rcupdate.h>
+#include <linux/kprobes.h>
+#include <linux/io.h>
+
+#include <asm/cacheflush.h>
+
+#define BREAKPOINT_INSTRUCTION  0xcc
+#define BREAKPOINT_INS_LEN	1
+#define NR_NOPS			10
+
+static unsigned long target_after_int3;	/* EIP of the target after the int3 */
+static unsigned long bypass_eip;	/* EIP of the bypass. */
+static unsigned long bypass_after_int3;	/* EIP after the end-of-bypass int3 */
+static unsigned long after_imv;	/*
+					 * EIP where to resume after the
+					 * single-stepping.
+					 */
+
+/*
+ * Internal bypass used during value update. The bypass is skipped by the
+ * function in which it is inserted.
+ * No need to be aligned for atomicity, because we exclude readers from the
+ * site during the update.
+ * Layout is:
+ * (10x nop) int3
+ * (maximum size is 2 bytes opcode + 8 bytes immediate value for long on x86_64)
+ * The nops are the target replaced by the instruction to single-step.
+ * Align on 16 bytes to make sure the nops fit within a single page so remapping
+ * it can be done easily.
+ */
+static inline void _imv_bypass(unsigned long *bypassaddr,
+	unsigned long *breaknextaddr)
+{
+		asm volatile("jmp 2f;\n\t"
+				".align 16;\n\t"
+				"0:\n\t"
+				".space 10, 0x90;\n\t"
+				"1:\n\t"
+				"int3;\n\t"
+				"2:\n\t"
+				"mov $(0b),%0;\n\t"
+				"mov $((1b)+1),%1;\n\t"
+				: "=r" (*bypassaddr),
+				  "=r" (*breaknextaddr));
+}
+
+static void imv_synchronize_core(void *info)
+{
+	sync_core();	/* use cpuid to stop speculative execution */
+}
+
+/*
+ * The eip value points right after the breakpoint instruction, in the second
+ * byte of the movl.
+ * Disable preemption in the bypass to make sure no thread will be preempted in
+ * it. We can then use synchronize_sched() to make sure every bypass user has
+ * ended.
+ */
+static int imv_notifier(struct notifier_block *nb,
+	unsigned long val, void *data)
+{
+	enum die_val die_val = (enum die_val) val;
+	struct die_args *args = data;
+
+	if (!args->regs || user_mode_vm(args->regs))
+		return NOTIFY_DONE;
+
+	if (die_val == DIE_INT3) {
+		if (args->regs->ip == target_after_int3) {
+			preempt_disable();
+			args->regs->ip = bypass_eip;
+			return NOTIFY_STOP;
+		} else if (args->regs->ip == bypass_after_int3) {
+			args->regs->ip = after_imv;
+			preempt_enable();
+			return NOTIFY_STOP;
+		}
+	}
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block imv_notify = {
+	.notifier_call = imv_notifier,
+	.priority = 0x7fffffff,	/* we need to be notified first */
+};
+
+/**
+ * arch_imv_update - update one immediate value
+ * @imv: pointer of type const struct __imv to update
+ * @early: early boot (1) or normal (0)
+ *
+ * Update one immediate value. Must be called with imv_mutex held.
+ */
+__kprobes int arch_imv_update(const struct __imv *imv, int early)
+{
+	int ret;
+	unsigned char opcode_size = imv->insn_size - imv->size;
+	unsigned long insn = imv->imv - opcode_size;
+	unsigned long len;
+	char *vaddr;
+	struct page *pages[1];
+
+#ifdef CONFIG_KPROBES
+	/*
+	 * Fail if a kprobe has been set on this instruction.
+	 * (TODO: we could eventually do better and modify all the (possibly
+	 * nested) kprobes for this site if kprobes had an API for this.
+	 */
+	if (unlikely(!early
+			&& *(unsigned char *)insn == BREAKPOINT_INSTRUCTION)) {
+		printk(KERN_WARNING "Immediate value in conflict with kprobe. "
+				    "Variable at %p, "
+				    "instruction at %p, size %hu\n",
+				    (void *)imv->imv,
+				    (void *)imv->var, imv->size);
+		return -EBUSY;
+	}
+#endif
+
+	/*
+	 * If the variable and the instruction have the same value, there is
+	 * nothing to do.
+	 */
+	switch (imv->size) {
+	case 1:	if (*(uint8_t *)imv->imv
+				== *(uint8_t *)imv->var)
+			return 0;
+		break;
+	case 2:	if (*(uint16_t *)imv->imv
+				== *(uint16_t *)imv->var)
+			return 0;
+		break;
+	case 4:	if (*(uint32_t *)imv->imv
+				== *(uint32_t *)imv->var)
+			return 0;
+		break;
+#ifdef CONFIG_X86_64
+	case 8:	if (*(uint64_t *)imv->imv
+				== *(uint64_t *)imv->var)
+			return 0;
+		break;
+#endif
+	default:return -EINVAL;
+	}
+
+	if (!early) {
+		/* bypass is 10 bytes long for x86_64 long */
+		WARN_ON(imv->insn_size > 10);
+		_imv_bypass(&bypass_eip, &bypass_after_int3);
+
+		after_imv = imv->imv + imv->size;
+
+		/*
+		 * Using the _early variants because nobody is executing the
+		 * bypass code while we patch it. It is protected by the
+		 * imv_mutex. Since we modify the instructions non atomically
+		 * (for nops), we have to use the _early variant.
+		 * We must however deal with RO pages.
+		 * Use a single page : 10 bytes are aligned on 16 bytes
+		 * boundaries.
+		 */
+		pages[0] = virt_to_page((void *)bypass_eip);
+		vaddr = vmap(pages, 1, VM_MAP, PAGE_KERNEL);
+		BUG_ON(!vaddr);
+		text_poke_early(&vaddr[bypass_eip & ~PAGE_MASK],
+			(void *)insn, imv->insn_size);
+		/*
+		 * Fill the rest with nops.
+		 */
+		len = NR_NOPS - imv->insn_size;
+		add_nops((void *)
+			&vaddr[(bypass_eip & ~PAGE_MASK) + imv->insn_size],
+			len);
+		vunmap(vaddr);
+
+		target_after_int3 = insn + BREAKPOINT_INS_LEN;
+		/* register_die_notifier has memory barriers */
+		register_die_notifier(&imv_notify);
+		/* The breakpoint will single-step the bypass */
+		text_poke((void *)insn,
+			((unsigned char[]){BREAKPOINT_INSTRUCTION}), 1);
+		/*
+		 * Make sure the breakpoint is set before we continue (visible
+		 * to other CPUs and interrupts).
+		 */
+		wmb();
+		/*
+		 * Execute serializing instruction on each CPU.
+		 */
+		ret = on_each_cpu(imv_synchronize_core, NULL, 1, 1);
+		BUG_ON(ret != 0);
+
+		text_poke((void *)(insn + opcode_size), (void *)imv->var,
+				imv->size);
+		/*
+		 * Make sure the value can be seen from other CPUs and
+		 * interrupts.
+		 */
+		wmb();
+		text_poke((void *)insn, (unsigned char *)bypass_eip, 1);
+		/*
+		 * Wait for all int3 handlers to end (interrupts are disabled in
+		 * int3). This CPU is clearly not in a int3 handler, because
+		 * int3 handler is not preemptible and there cannot be any more
+		 * int3 handler called for this site, because we placed the
+		 * original instruction back.  synchronize_sched has memory
+		 * barriers.
+		 */
+		synchronize_sched();
+		unregister_die_notifier(&imv_notify);
+		/* unregister_die_notifier has memory barriers */
+	} else
+		text_poke_early((void *)imv->imv, (void *)imv->var,
+			imv->size);
+	return 0;
+}

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [patch 23/37] Immediate Values - Powerpc Optimization NMI MCE support
  2008-04-24 15:03 [patch 00/37] Linux Kernel Markers instrumentation for sched-devel.git Mathieu Desnoyers
                   ` (21 preceding siblings ...)
  2008-04-24 15:03 ` [patch 22/37] Immediate Values - x86 Optimization NMI and MCE support Mathieu Desnoyers
@ 2008-04-24 15:03 ` Mathieu Desnoyers
  2008-04-24 15:03 ` [patch 24/37] Immediate Values Use Arch NMI and MCE Support Mathieu Desnoyers
                   ` (14 subsequent siblings)
  37 siblings, 0 replies; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-04-24 15:03 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel
  Cc: Mathieu Desnoyers, Rusty Russell, Christoph Hellwig, Paul Mackerras

[-- Attachment #1: immediate-values-powerpc-optimization-nmi-mce-support.patch --]
[-- Type: text/plain, Size: 4991 bytes --]

Use an atomic update for immediate values.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Rusty Russell <rusty@rustcorp.com.au>
CC: Christoph Hellwig <hch@infradead.org>
CC: Paul Mackerras <paulus@samba.org>
---
 arch/powerpc/kernel/Makefile    |    1 
 arch/powerpc/kernel/immediate.c |   70 ++++++++++++++++++++++++++++++++++++++++
 include/asm-powerpc/immediate.h |   18 ++++++++++
 3 files changed, 89 insertions(+)

Index: linux-2.6-lttng/arch/powerpc/kernel/immediate.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/arch/powerpc/kernel/immediate.c	2008-04-16 21:22:29.000000000 -0400
@@ -0,0 +1,70 @@
+/*
+ * Powerpc optimized immediate values enabling/disabling.
+ *
+ * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
+ */
+
+#include <linux/module.h>
+#include <linux/immediate.h>
+#include <linux/string.h>
+#include <linux/kprobes.h>
+#include <asm/cacheflush.h>
+#include <asm/page.h>
+
+#define LI_OPCODE_LEN	2
+
+/**
+ * arch_imv_update - update one immediate value
+ * @imv: pointer of type const struct __imv to update
+ * @early: early boot (1), normal (0)
+ *
+ * Update one immediate value. Must be called with imv_mutex held.
+ */
+int arch_imv_update(const struct __imv *imv, int early)
+{
+#ifdef CONFIG_KPROBES
+	kprobe_opcode_t *insn;
+	/*
+	 * Fail if a kprobe has been set on this instruction.
+	 * (TODO: we could eventually do better and modify all the (possibly
+	 * nested) kprobes for this site if kprobes had an API for this.
+	 */
+	switch (imv->size) {
+	case 1:	/* The uint8_t points to the 3rd byte of the
+		 * instruction */
+		insn = (void *)(imv->imv - 1 - LI_OPCODE_LEN);
+		break;
+	case 2:	insn = (void *)(imv->imv - LI_OPCODE_LEN);
+		break;
+	default:
+	return -EINVAL;
+	}
+
+	if (unlikely(!early && *insn == BREAKPOINT_INSTRUCTION)) {
+		printk(KERN_WARNING "Immediate value in conflict with kprobe. "
+				    "Variable at %p, "
+				    "instruction at %p, size %lu\n",
+				    (void *)imv->imv,
+				    (void *)imv->var, imv->size);
+		return -EBUSY;
+	}
+#endif
+
+	/*
+	 * If the variable and the instruction have the same value, there is
+	 * nothing to do.
+	 */
+	switch (imv->size) {
+	case 1:	if (*(uint8_t *)imv->imv == *(uint8_t *)imv->var)
+			return 0;
+		*(uint8_t *)imv->imv = *(uint8_t *)imv->var;
+		break;
+	case 2:	if (*(uint16_t *)imv->imv == *(uint16_t *)imv->var)
+			return 0;
+		*(uint16_t *)imv->imv = *(uint16_t *)imv->var;
+		break;
+	default:return -EINVAL;
+	}
+	flush_icache_range(imv->imv, imv->imv + imv->size);
+	return 0;
+}
Index: linux-2.6-lttng/include/asm-powerpc/immediate.h
===================================================================
--- linux-2.6-lttng.orig/include/asm-powerpc/immediate.h	2008-04-16 12:25:42.000000000 -0400
+++ linux-2.6-lttng/include/asm-powerpc/immediate.h	2008-04-16 20:49:48.000000000 -0400
@@ -12,6 +12,16 @@
 
 #include <asm/asm-compat.h>
 
+struct __imv {
+	unsigned long var;	/* Identifier variable of the immediate value */
+	unsigned long imv;	/*
+				 * Pointer to the memory location that holds
+				 * the immediate value within the load immediate
+				 * instruction.
+				 */
+	unsigned char size;	/* Type size. */
+} __attribute__ ((packed));
+
 /**
  * imv_read - read immediate variable
  * @name: immediate value name
@@ -19,6 +29,11 @@
  * Reads the value of @name.
  * Optimized version of the immediate.
  * Do not use in __init and __exit functions. Use _imv_read() instead.
+ * Makes sure the 2-byte update will be atomic by aligning the immediate
+ * value. Use a normal memory read for the 4-byte immediate because there is no
+ * way to atomically update it without using a seqlock read side, which would
+ * cost more in terms of total i-cache and d-cache space than a simple memory
+ * read.
  */
 #define imv_read(name)							\
 	({								\
@@ -40,6 +55,7 @@
 					PPC_LONG "%c1, ((1f)-2)\n\t"	\
 					".byte 2\n\t"			\
 					".previous\n\t"			\
+					".align 2\n\t"			\
 					"li %0,0\n\t"			\
 					"1:\n\t"			\
 				: "=r" (value)				\
@@ -52,4 +68,6 @@
 		value;							\
 	})
 
+extern int arch_imv_update(const struct __imv *imv, int early);
+
 #endif /* _ASM_POWERPC_IMMEDIATE_H */
Index: linux-2.6-lttng/arch/powerpc/kernel/Makefile
===================================================================
--- linux-2.6-lttng.orig/arch/powerpc/kernel/Makefile	2008-04-16 12:23:07.000000000 -0400
+++ linux-2.6-lttng/arch/powerpc/kernel/Makefile	2008-04-16 12:25:44.000000000 -0400
@@ -45,6 +45,7 @@ obj-$(CONFIG_HIBERNATION)	+= swsusp.o su
 obj64-$(CONFIG_HIBERNATION)	+= swsusp_asm64.o
 obj-$(CONFIG_MODULES)		+= module_$(CONFIG_WORD_SIZE).o
 obj-$(CONFIG_44x)		+= cpu_setup_44x.o
+obj-$(CONFIG_IMMEDIATE)		+= immediate.o
 
 ifeq ($(CONFIG_PPC_MERGE),y)
 

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [patch 24/37] Immediate Values Use Arch NMI and MCE Support
  2008-04-24 15:03 [patch 00/37] Linux Kernel Markers instrumentation for sched-devel.git Mathieu Desnoyers
                   ` (22 preceding siblings ...)
  2008-04-24 15:03 ` [patch 23/37] Immediate Values - Powerpc Optimization NMI " Mathieu Desnoyers
@ 2008-04-24 15:03 ` Mathieu Desnoyers
  2008-04-24 15:03 ` [patch 25/37] Immediate Values - Jump Mathieu Desnoyers
                   ` (13 subsequent siblings)
  37 siblings, 0 replies; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-04-24 15:03 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel; +Cc: Mathieu Desnoyers

[-- Attachment #1: immediate-values-use-arch-nmi-mce-support.patch --]
[-- Type: text/plain, Size: 4216 bytes --]

Remove the architecture-agnostic code, now replaced by architecture-specific
atomic instruction updates.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
---
 include/linux/immediate.h |   11 ------
 kernel/immediate.c        |   73 +---------------------------------------------
 2 files changed, 3 insertions(+), 81 deletions(-)

Index: linux-2.6-lttng/kernel/immediate.c
===================================================================
--- linux-2.6-lttng.orig/kernel/immediate.c	2008-04-11 09:41:33.000000000 -0400
+++ linux-2.6-lttng/kernel/immediate.c	2008-04-14 18:48:05.000000000 -0400
@@ -19,92 +19,23 @@
 #include <linux/mutex.h>
 #include <linux/immediate.h>
 #include <linux/memory.h>
-#include <linux/cpu.h>
-#include <linux/stop_machine.h>
 
 #include <asm/sections.h>
-#include <asm/cacheflush.h>
 
 /*
  * Kernel ready to execute the SMP update that may depend on trap and ipi.
  */
 static int imv_early_boot_complete;
-static int wrote_text;
 
 extern struct __imv __start___imv[];
 extern struct __imv __stop___imv[];
 
-static int stop_machine_imv_update(void *imv_ptr)
-{
-	struct __imv *imv = imv_ptr;
-
-	if (!wrote_text) {
-		text_poke((void *)imv->imv, (void *)imv->var, imv->size);
-		wrote_text = 1;
-		smp_wmb(); /* make sure other cpus see that this has run */
-	} else
-		sync_core();
-
-	flush_icache_range(imv->imv, imv->imv + imv->size);
-
-	return 0;
-}
-
 /*
  * imv_mutex nests inside module_mutex. imv_mutex protects builtin
  * immediates and module immediates.
  */
 static DEFINE_MUTEX(imv_mutex);
 
-
-/**
- * apply_imv_update - update one immediate value
- * @imv: pointer of type const struct __imv to update
- *
- * Update one immediate value. Must be called with imv_mutex held.
- * It makes sure all CPUs are not executing the modified code by having them
- * busy looping with interrupts disabled.
- * It does _not_ protect against NMI and MCE (could be a problem with Intel's
- * errata if we use immediate values in their code path).
- */
-static int apply_imv_update(const struct __imv *imv)
-{
-	/*
-	 * If the variable and the instruction have the same value, there is
-	 * nothing to do.
-	 */
-	switch (imv->size) {
-	case 1:	if (*(uint8_t *)imv->imv
-				== *(uint8_t *)imv->var)
-			return 0;
-		break;
-	case 2:	if (*(uint16_t *)imv->imv
-				== *(uint16_t *)imv->var)
-			return 0;
-		break;
-	case 4:	if (*(uint32_t *)imv->imv
-				== *(uint32_t *)imv->var)
-			return 0;
-		break;
-	case 8:	if (*(uint64_t *)imv->imv
-				== *(uint64_t *)imv->var)
-			return 0;
-		break;
-	default:return -EINVAL;
-	}
-
-	if (imv_early_boot_complete) {
-		kernel_text_lock();
-		wrote_text = 0;
-		stop_machine_run(stop_machine_imv_update, (void *)imv,
-					ALL_CPUS);
-		kernel_text_unlock();
-	} else
-		text_poke_early((void *)imv->imv, (void *)imv->var,
-				imv->size);
-	return 0;
-}
-
 /**
  * imv_update_range - Update immediate values in a range
  * @begin: pointer to the beginning of the range
@@ -121,7 +52,9 @@ void imv_update_range(const struct __imv
 		mutex_lock(&imv_mutex);
 		if (!iter->imv)	/* Skip removed __init immediate values */
 			goto skip;
-		ret = apply_imv_update(iter);
+		kernel_text_lock();
+		ret = arch_imv_update(iter, !imv_early_boot_complete);
+		kernel_text_unlock();
 		if (imv_early_boot_complete && ret)
 			printk(KERN_WARNING
 				"Invalid immediate value. "
Index: linux-2.6-lttng/include/linux/immediate.h
===================================================================
--- linux-2.6-lttng.orig/include/linux/immediate.h	2008-04-11 09:36:58.000000000 -0400
+++ linux-2.6-lttng/include/linux/immediate.h	2008-04-14 18:46:47.000000000 -0400
@@ -12,17 +12,6 @@
 
 #ifdef CONFIG_IMMEDIATE
 
-struct __imv {
-	unsigned long var;	/* Pointer to the identifier variable of the
-				 * immediate value
-				 */
-	unsigned long imv;	/*
-				 * Pointer to the memory location of the
-				 * immediate value within the instruction.
-				 */
-	unsigned char size;	/* Type size. */
-} __attribute__ ((packed));
-
 #include <asm/immediate.h>
 
 /**

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [patch 25/37] Immediate Values - Jump
  2008-04-24 15:03 [patch 00/37] Linux Kernel Markers instrumentation for sched-devel.git Mathieu Desnoyers
                   ` (23 preceding siblings ...)
  2008-04-24 15:03 ` [patch 24/37] Immediate Values Use Arch NMI and MCE Support Mathieu Desnoyers
@ 2008-04-24 15:03 ` Mathieu Desnoyers
  2008-04-24 15:03 ` [patch 26/37] Scheduler Profiling - Use Immediate Values Mathieu Desnoyers
                   ` (12 subsequent siblings)
  37 siblings, 0 replies; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-04-24 15:03 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel; +Cc: Mathieu Desnoyers

[-- Attachment #1: immediate-values-jump.patch --]
[-- Type: text/plain, Size: 19370 bytes --]

Adds a new imv_cond() macro to declare a byte read that is meant to be embedded
in unlikely(imv_cond(var)), so the kernel can dynamically detect patterns such
as mov, test, jne or mov, test, je and patch them with nops and a jump.

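For illustration, the intended usage pattern (the flag and function names are
made up):

void my_hot_path(void)
{
	if (unlikely(imv_cond(sched_tracing_on)))	/* %al-based byte test */
		my_rarely_taken_tracing_code();
}

On x86 this typically compiles to movb $imm,%al; test %al,%al; jne/je
(b0 xx / 84 c0 / 75 cb or 0f 85 cd), which is the pattern that
detect_mov_test_jne()/detect_mov_test_je() below look for before patching the
site with nops or a direct jump.
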
Changelog:
- fix !CONFIG_IMMEDIATE

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
---
 arch/x86/kernel/immediate.c     |  406 +++++++++++++++++++++++++++++++++-------
 include/asm-powerpc/immediate.h |    2 
 include/asm-x86/immediate.h     |   34 +++
 include/linux/immediate.h       |   11 -
 kernel/immediate.c              |    6 
 5 files changed, 384 insertions(+), 75 deletions(-)

Index: linux-2.6-sched-devel/include/asm-x86/immediate.h
===================================================================
--- linux-2.6-sched-devel.orig/include/asm-x86/immediate.h	2008-04-24 10:10:15.000000000 -0400
+++ linux-2.6-sched-devel/include/asm-x86/immediate.h	2008-04-24 10:10:19.000000000 -0400
@@ -20,6 +20,7 @@ struct __imv {
 				 * Pointer to the memory location of the
 				 * immediate value within the instruction.
 				 */
+	int  jmp_off;		/* offset for jump target */
 	unsigned char size;	/* Type size. */
 	unsigned char insn_size;/* Instruction size. */
 } __attribute__ ((packed));
@@ -57,6 +58,7 @@ struct __imv {
 				".previous\n\t"				\
 				".section __imv,\"aw\",@progbits\n\t"	\
 				_ASM_PTR "%c1, (3f)-%c2\n\t"		\
+				".int 0\n\t"				\
 				".byte %c2, (2b-1b)\n\t"		\
 				".previous\n\t"				\
 				"mov $0,%0\n\t"				\
@@ -74,6 +76,7 @@ struct __imv {
 				".previous\n\t"				\
 				".section __imv,\"aw\",@progbits\n\t"	\
 				_ASM_PTR "%c1, (3f)-%c2\n\t"		\
+				".int 0\n\t"				\
 				".byte %c2, (2b-1b)\n\t"		\
 				".previous\n\t"				\
 				".org . + ((-.-(2b-1b)) & (%c2-1)), 0x90\n\t" \
@@ -95,6 +98,7 @@ struct __imv {
 				".previous\n\t"				\
 				".section __imv,\"aw\",@progbits\n\t"	\
 				_ASM_PTR "%c1, (3f)-%c2\n\t"		\
+				".int 0\n\t"				\
 				".byte %c2, (2b-1b)\n\t"		\
 				".previous\n\t"				\
 				".org . + ((-.-(2b-1b)) & (%c2-1)), 0x90\n\t" \
@@ -108,6 +112,34 @@ struct __imv {
 		value;							\
 	})
 
-extern int arch_imv_update(const struct __imv *imv, int early);
+/*
+ * Uses %al.
+ * size is 0.
+ * Use in if (unlikely(imv_cond(var)))
+ * Given a char as argument.
+ */
+#define imv_cond(name)							\
+	({								\
+		__typeof__(name##__imv) value;				\
+		BUILD_BUG_ON(sizeof(value) > 1);			\
+		asm (".section __discard,\"\",@progbits\n\t"		\
+			"1:\n\t"					\
+			"mov $0,%0\n\t"					\
+			"2:\n\t"					\
+			".previous\n\t"					\
+			".section __imv,\"aw\",@progbits\n\t"		\
+			_ASM_PTR "%c1, (3f)-1\n\t"			\
+			".int 0\n\t"					\
+			".byte %c2, (2b-1b)\n\t"			\
+			".previous\n\t"					\
+			"mov $0,%0\n\t"					\
+			"3:\n\t"					\
+			: "=a" (value)					\
+			: "i" (&name##__imv),				\
+			  "i" (0));					\
+		value;							\
+	})
+
+extern int arch_imv_update(struct __imv *imv, int early);
 
 #endif /* _ASM_X86_IMMEDIATE_H */
Index: linux-2.6-sched-devel/arch/x86/kernel/immediate.c
===================================================================
--- linux-2.6-sched-devel.orig/arch/x86/kernel/immediate.c	2008-04-24 10:10:15.000000000 -0400
+++ linux-2.6-sched-devel/arch/x86/kernel/immediate.c	2008-04-24 10:13:27.000000000 -0400
@@ -80,13 +80,25 @@
 #include <asm/cacheflush.h>
 
 #define BREAKPOINT_INSTRUCTION  0xcc
+#define JMP_REL8		0xeb
+#define JMP_REL32		0xe9
+#define INSN_NOP1		0x90
+#define INSN_NOP2		0x89, 0xf6
 #define BREAKPOINT_INS_LEN	1
 #define NR_NOPS			10
 
+/*#define DEBUG_IMMEDIATE 1*/
+
+#ifdef DEBUG_IMMEDIATE
+#define printk_dbg printk
+#else
+#define printk_dbg(fmt , a...)
+#endif
+
 static unsigned long target_after_int3;	/* EIP of the target after the int3 */
 static unsigned long bypass_eip;	/* EIP of the bypass. */
 static unsigned long bypass_after_int3;	/* EIP after the end-of-bypass int3 */
-static unsigned long after_imv;	/*
+static unsigned long after_imv;		/*
 					 * EIP where to resume after the
 					 * single-stepping.
 					 */
@@ -142,6 +154,25 @@ static int imv_notifier(struct notifier_
 
 	if (die_val == DIE_INT3) {
 		if (args->regs->ip == target_after_int3) {
+			/* deal with non-relocatable jmp instructions */
+			switch (*(uint8_t *)bypass_eip) {
+			case JMP_REL8: /* eb cb       jmp rel8 */
+				args->regs->ip +=
+					*(signed char *)(bypass_eip + 1) + 1;
+				return NOTIFY_STOP;
+			case JMP_REL32: /* e9 cw    jmp rel16 (valid on ia32) */
+					/* e9 cd    jmp rel32 */
+				args->regs->ip +=
+					*(int *)(bypass_eip + 1) + 4;
+				return NOTIFY_STOP;
+			case INSN_NOP1:
+				/* deal with insertion of nop + jmp_rel32 */
+				if (*((uint8_t *)bypass_eip + 1) == JMP_REL32) {
+					args->regs->ip +=
+						*(int *)(bypass_eip + 2) + 5;
+					return NOTIFY_STOP;
+				}
+			}
 			preempt_disable();
 			args->regs->ip = bypass_eip;
 			return NOTIFY_STOP;
@@ -159,71 +190,107 @@ static struct notifier_block imv_notify 
 	.priority = 0x7fffffff,	/* we need to be notified first */
 };
 
-/**
- * arch_imv_update - update one immediate value
- * @imv: pointer of type const struct __imv to update
- * @early: early boot (1) or normal (0)
- *
- * Update one immediate value. Must be called with imv_mutex held.
+/*
+ * returns -1 if not found
+ * return 0 if found.
  */
-__kprobes int arch_imv_update(const struct __imv *imv, int early)
+static inline int detect_mov_test_jne(uint8_t *addr, uint8_t **opcode,
+		uint8_t **jmp_offset, int *offset_len)
 {
-	int ret;
-	unsigned char opcode_size = imv->insn_size - imv->size;
-	unsigned long insn = imv->imv - opcode_size;
-	unsigned long len;
-	char *vaddr;
-	struct page *pages[1];
-
-#ifdef CONFIG_KPROBES
-	/*
-	 * Fail if a kprobe has been set on this instruction.
-	 * (TODO: we could eventually do better and modify all the (possibly
-	 * nested) kprobes for this site if kprobes had an API for this.
-	 */
-	if (unlikely(!early
-			&& *(unsigned char *)insn == BREAKPOINT_INSTRUCTION)) {
-		printk(KERN_WARNING "Immediate value in conflict with kprobe. "
-				    "Variable at %p, "
-				    "instruction at %p, size %hu\n",
-				    (void *)imv->imv,
-				    (void *)imv->var, imv->size);
-		return -EBUSY;
-	}
-#endif
-
-	/*
-	 * If the variable and the instruction have the same value, there is
-	 * nothing to do.
-	 */
-	switch (imv->size) {
-	case 1:	if (*(uint8_t *)imv->imv
-				== *(uint8_t *)imv->var)
-			return 0;
-		break;
-	case 2:	if (*(uint16_t *)imv->imv
-				== *(uint16_t *)imv->var)
-			return 0;
-		break;
-	case 4:	if (*(uint32_t *)imv->imv
-				== *(uint32_t *)imv->var)
+	printk_dbg(KERN_DEBUG "Trying at %p %hx %hx %hx %hx %hx %hx\n",
+		addr, addr[0], addr[1], addr[2], addr[3], addr[4], addr[5]);
+	/* b0 cb    movb cb,%al */
+	if (addr[0] != 0xb0)
+		return -1;
+	/* 84 c0    test %al,%al */
+	if (addr[2] != 0x84 || addr[3] != 0xc0)
+		return -1;
+	printk_dbg(KERN_DEBUG "Found test %%al,%%al at %p\n", addr + 2);
+	switch (addr[4]) {
+	case 0x75: /* 75 cb       jne rel8 */
+		printk_dbg(KERN_DEBUG "Found jne rel8 at %p\n", addr + 4);
+		*opcode = addr + 4;
+		*jmp_offset = addr + 5;
+		*offset_len = 1;
+		return 0;
+	case 0x0f:
+		switch (addr[5]) {
+		case 0x85:	 /* 0F 85 cw    jne rel16 (valid on ia32) */
+				 /* 0F 85 cd    jne rel32 */
+			printk_dbg(KERN_DEBUG "Found jne rel16/32 at %p\n",
+				addr + 5);
+			*opcode = addr + 4;
+			*jmp_offset = addr + 6;
+			*offset_len = 4;
 			return 0;
+		default:
+			return -1;
+		}
 		break;
-#ifdef CONFIG_X86_64
-	case 8:	if (*(uint64_t *)imv->imv
-				== *(uint64_t *)imv->var)
+	default: return -1;
+	}
+}
+
+/*
+ * returns -1 if not found
+ * return 0 if found.
+ */
+static inline int detect_mov_test_je(uint8_t *addr, uint8_t **opcode,
+		uint8_t **jmp_offset, int *offset_len)
+{
+	/* b0 cb    movb cb,%al */
+	if (addr[0] != 0xb0)
+		return -1;
+	/* 84 c0    test %al,%al */
+	if (addr[2] != 0x84 || addr[3] != 0xc0)
+		return -1;
+	printk_dbg(KERN_DEBUG "Found test %%al,%%al at %p\n", addr + 2);
+	switch (addr[4]) {
+	case 0x74: /* 74 cb       je rel8 */
+		printk_dbg(KERN_DEBUG "Found je rel8 at %p\n", addr + 4);
+		*opcode = addr + 4;
+		*jmp_offset = addr + 5;
+		*offset_len = 1;
+		return 0;
+	case 0x0f:
+		switch (addr[5]) {
+		case 0x84:	 /* 0F 84 cw    je rel16 (valid on ia32) */
+				 /* 0F 84 cd    je rel32 */
+			printk_dbg(KERN_DEBUG "Found je rel16/32 at %p\n",
+				addr + 5);
+			*opcode = addr + 4;
+			*jmp_offset = addr + 6;
+			*offset_len = 4;
 			return 0;
+		default:
+			return -1;
+		}
 		break;
-#endif
-	default:return -EINVAL;
+	default: return -1;
 	}
+}
+
+static int static_early;
 
-	if (!early) {
-		/* bypass is 10 bytes long for x86_64 long */
-		WARN_ON(imv->insn_size > 10);
-		_imv_bypass(&bypass_eip, &bypass_after_int3);
+/*
+ * Marked noinline because we prefer to have only one _imv_bypass. Not that it
+ * is required, but there is no need to edit two bypasses.
+ */
+static noinline int replace_instruction_safe(uint8_t *addr, uint8_t *newcode,
+		int size)
+{
+	char *vaddr;
+	struct page *pages[1];
+	int len;
+	int ret;
+
+	/* bypass is 10 bytes long for x86_64 long */
+	WARN_ON(size > 10);
 
-		after_imv = imv->imv + imv->size;
+	_imv_bypass(&bypass_eip, &bypass_after_int3);
+
+	if (!static_early) {
+		after_imv = (unsigned long)addr + size;
 
 		/*
 		 * Using the _early variants because nobody is executing the
@@ -238,22 +305,23 @@ __kprobes int arch_imv_update(const stru
 		vaddr = vmap(pages, 1, VM_MAP, PAGE_KERNEL);
 		BUG_ON(!vaddr);
 		text_poke_early(&vaddr[bypass_eip & ~PAGE_MASK],
-			(void *)insn, imv->insn_size);
+			(void *)addr, size);
 		/*
 		 * Fill the rest with nops.
 		 */
-		len = NR_NOPS - imv->insn_size;
+		len = NR_NOPS - size;
 		add_nops((void *)
-			&vaddr[(bypass_eip & ~PAGE_MASK) + imv->insn_size],
+			&vaddr[(bypass_eip & ~PAGE_MASK) + size],
 			len);
 		vunmap(vaddr);
 
-		target_after_int3 = insn + BREAKPOINT_INS_LEN;
+		target_after_int3 = (unsigned long)addr + BREAKPOINT_INS_LEN;
 		/* register_die_notifier has memory barriers */
 		register_die_notifier(&imv_notify);
-		/* The breakpoint will single-step the bypass */
-		text_poke((void *)insn,
-			((unsigned char[]){BREAKPOINT_INSTRUCTION}), 1);
+		/* The breakpoint will execute the bypass */
+		text_poke((void *)addr,
+			((unsigned char[]){BREAKPOINT_INSTRUCTION}),
+			BREAKPOINT_INS_LEN);
 		/*
 		 * Make sure the breakpoint is set before we continue (visible
 		 * to other CPUs and interrupts).
@@ -265,14 +333,18 @@ __kprobes int arch_imv_update(const stru
 		ret = on_each_cpu(imv_synchronize_core, NULL, 1, 1);
 		BUG_ON(ret != 0);
 
-		text_poke((void *)(insn + opcode_size), (void *)imv->var,
-				imv->size);
+		text_poke((void *)(addr + BREAKPOINT_INS_LEN),
+			&newcode[BREAKPOINT_INS_LEN],
+			size - BREAKPOINT_INS_LEN);
 		/*
 		 * Make sure the value can be seen from other CPUs and
 		 * interrupts.
 		 */
 		wmb();
-		text_poke((void *)insn, (unsigned char *)bypass_eip, 1);
+#ifdef DEBUG_IMMEDIATE
+		mdelay(10);	/* leave the breakpoint in place for a while */
+#endif
+		text_poke(addr, newcode, BREAKPOINT_INS_LEN);
 		/*
 		 * Wait for all int3 handlers to end (interrupts are disabled in
 		 * int3). This CPU is clearly not in a int3 handler, because
@@ -285,7 +357,203 @@ __kprobes int arch_imv_update(const stru
 		unregister_die_notifier(&imv_notify);
 		/* unregister_die_notifier has memory barriers */
 	} else
-		text_poke_early((void *)imv->imv, (void *)imv->var,
-			imv->size);
+		text_poke_early(addr, newcode, size);
+	return 0;
+}
+
+static int patch_jump_target(struct __imv *imv)
+{
+	uint8_t *opcode, *jmp_offset;
+	int offset_len;
+	int mov_test_j_found = 0;
+
+	if (!detect_mov_test_jne((uint8_t *)imv->imv - 1,
+			&opcode, &jmp_offset, &offset_len)) {
+		imv->insn_size = 1;	/* positive logic */
+		mov_test_j_found = 1;
+	} else if (!detect_mov_test_je((uint8_t *)imv->imv - 1,
+			&opcode, &jmp_offset, &offset_len)) {
+		imv->insn_size = 0;	/* negative logic */
+		mov_test_j_found = 1;
+	}
+
+	if (mov_test_j_found) {
+		int logicvar = imv->insn_size ? imv->var : !imv->var;
+		int newoff;
+
+		if (offset_len == 1) {
+			imv->jmp_off = *(signed char *)jmp_offset;
+			/* replace with JMP_REL8 opcode. */
+			replace_instruction_safe(opcode,
+				((unsigned char[]){ JMP_REL8,
+				(logicvar ? (signed char)imv->jmp_off : 0) }),
+				2);
+		} else {
+			/* replace with nop and JMP_REL16/32 opcode.
+			 * It's ok to shrink an instruction, never ok to
+			 * grow it afterward. */
+			imv->jmp_off = *(int *)jmp_offset;
+			newoff = logicvar ? (int)imv->jmp_off : 0;
+			replace_instruction_safe(opcode,
+				((unsigned char[]){ INSN_NOP1, JMP_REL32,
+				((unsigned char *)&newoff)[0],
+				((unsigned char *)&newoff)[1],
+				((unsigned char *)&newoff)[2],
+				((unsigned char *)&newoff)[3] }),
+				6);
+		}
+		/* now we can get rid of the movb */
+		replace_instruction_safe((uint8_t *)imv->imv - 1,
+			((unsigned char[]){ INSN_NOP2 }),
+			2);
+		/* now we can get rid of the testb */
+		replace_instruction_safe((uint8_t *)imv->imv + 1,
+			((unsigned char[]){ INSN_NOP2 }),
+			2);
+		/* remember opcode + 1 to enable the JMP_REL patching */
+		if (offset_len == 1)
+			imv->imv = (unsigned long)opcode + 1;
+		else
+			imv->imv = (unsigned long)opcode + 2;	/* skip nop */
+		return 0;
+
+	}
+
+	if (*((uint8_t *)imv->imv - 1) == JMP_REL8) {
+		int logicvar = imv->insn_size ? imv->var : !imv->var;
+
+		printk_dbg(KERN_DEBUG "Found JMP_REL8 at %p\n",
+			((uint8_t *)imv->imv - 1));
+		/* Speed up by skipping if not changed */
+		if (logicvar) {
+			if (*(int8_t *)imv->imv == (int8_t)imv->jmp_off)
+				return 0;
+		} else {
+			if (*(int8_t *)imv->imv == 0)
+				return 0;
+		}
+
+		replace_instruction_safe((uint8_t *)imv->imv - 1,
+			((unsigned char[]){ JMP_REL8,
+			(logicvar ? (signed char)imv->jmp_off : 0) }),
+			2);
+		return 0;
+	}
+
+	if (*((uint8_t *)imv->imv - 1) == JMP_REL32) {
+		int logicvar = imv->insn_size ? imv->var : !imv->var;
+		int newoff = logicvar ? (int)imv->jmp_off : 0;
+
+		printk_dbg(KERN_DEBUG "Found JMP_REL32 at %p, update with %x\n",
+			((uint8_t *)imv->imv - 1), newoff);
+		/* Speed up by skipping if not changed */
+		if (logicvar) {
+			if (*(int *)imv->imv == (int)imv->jmp_off)
+				return 0;
+		} else {
+			if (*(int *)imv->imv == 0)
+				return 0;
+		}
+
+		replace_instruction_safe((uint8_t *)imv->imv - 1,
+			((unsigned char[]){ JMP_REL32,
+			((unsigned char *)&newoff)[0],
+			((unsigned char *)&newoff)[1],
+			((unsigned char *)&newoff)[2],
+			((unsigned char *)&newoff)[3] }),
+			5);
+		return 0;
+	}
+
+	/* Nothing known found. */
+	return -1;
+}
+
+/**
+ * arch_imv_update - update one immediate value
+ * @imv: pointer of type const struct __imv to update
+ * @early: early boot (1) or normal (0)
+ *
+ * Update one immediate value. Must be called with imv_mutex held.
+ */
+__kprobes int arch_imv_update(struct __imv *imv, int early)
+{
+	int ret;
+	uint8_t buf[10];
+	unsigned long insn, opcode_size;
+
+	static_early = early;
+
+	/*
+	 * If imv_cond is encountered, try to patch it with
+	 * patch_jump_target. Continue with normal immediate values if the area
+	 * surrounding the instruction is not as expected.
+	 */
+	if (imv->size == 0) {
+		ret = patch_jump_target(imv);
+		if (ret) {
+#ifdef DEBUG_IMMEDIATE
+			static int nr_fail;
+			printk(KERN_DEBUG
+				"Jump target fallback at %lX, nr fail %d\n",
+				imv->imv, ++nr_fail);
+#endif
+			imv->size = 1;
+		} else {
+#ifdef DEBUG_IMMEDIATE
+			static int nr_success;
+			printk(KERN_DEBUG "Jump target at %lX, nr success %d\n",
+				imv->imv, ++nr_success);
+#endif
+			return 0;
+		}
+	}
+
+	opcode_size = imv->insn_size - imv->size;
+	insn = imv->imv - opcode_size;
+
+#ifdef CONFIG_KPROBES
+	/*
+	 * Fail if a kprobe has been set on this instruction.
+	 * (TODO: we could eventually do better and modify all the (possibly
+	 * nested) kprobes for this site if kprobes had an API for this.
+	 */
+	if (unlikely(!early
+			&& *(unsigned char *)insn == BREAKPOINT_INSTRUCTION)) {
+		printk(KERN_WARNING "Immediate value in conflict with kprobe. "
+				    "Variable at %p, "
+				    "instruction at %p, size %hu\n",
+				    (void *)imv->var,
+				    (void *)imv->imv, imv->size);
+		return -EBUSY;
+	}
+#endif
+
+	/*
+	 * If the variable and the instruction have the same value, there is
+	 * nothing to do.
+	 */
+	switch (imv->size) {
+	case 1:	if (*(uint8_t *)imv->imv == *(uint8_t *)imv->var)
+			return 0;
+		break;
+	case 2:	if (*(uint16_t *)imv->imv == *(uint16_t *)imv->var)
+			return 0;
+		break;
+	case 4:	if (*(uint32_t *)imv->imv == *(uint32_t *)imv->var)
+			return 0;
+		break;
+#ifdef CONFIG_X86_64
+	case 8:	if (*(uint64_t *)imv->imv == *(uint64_t *)imv->var)
+			return 0;
+		break;
+#endif
+	default:return -EINVAL;
+	}
+
+	memcpy(buf, (uint8_t *)insn, opcode_size);
+	memcpy(&buf[opcode_size], (void *)imv->var, imv->size);
+	replace_instruction_safe((uint8_t *)insn, buf, imv->insn_size);
+
 	return 0;
 }
Index: linux-2.6-sched-devel/include/linux/immediate.h
===================================================================
--- linux-2.6-sched-devel.orig/include/linux/immediate.h	2008-04-24 10:10:18.000000000 -0400
+++ linux-2.6-sched-devel/include/linux/immediate.h	2008-04-24 10:10:19.000000000 -0400
@@ -33,8 +33,7 @@
  * Internal update functions.
  */
 extern void core_imv_update(void);
-extern void imv_update_range(const struct __imv *begin,
-	const struct __imv *end);
+extern void imv_update_range(struct __imv *begin, struct __imv *end);
 extern void imv_unref_core_init(void);
 extern void imv_unref(struct __imv *begin, struct __imv *end, void *start,
 		unsigned long size);
@@ -54,6 +53,14 @@ extern void imv_unref(struct __imv *begi
 #define imv_read(name)			_imv_read(name)
 
 /**
+ * imv_cond - read immediate variable use as condition for if()
+ * @name: immediate value name
+ *
+ * Reads the value of @name.
+ */
+#define imv_cond(name)			_imv_read(name)
+
+/**
  * imv_set - set immediate variable (with locking)
  * @name: immediate value name
  * @i: required value
Index: linux-2.6-sched-devel/kernel/immediate.c
===================================================================
--- linux-2.6-sched-devel.orig/kernel/immediate.c	2008-04-24 10:10:18.000000000 -0400
+++ linux-2.6-sched-devel/kernel/immediate.c	2008-04-24 10:10:19.000000000 -0400
@@ -43,10 +43,10 @@ static DEFINE_MUTEX(imv_mutex);
  *
  * Updates a range of immediates.
  */
-void imv_update_range(const struct __imv *begin,
-		const struct __imv *end)
+void imv_update_range(struct __imv *begin,
+		struct __imv *end)
 {
-	const struct __imv *iter;
+	struct __imv *iter;
 	int ret;
 	for (iter = begin; iter < end; iter++) {
 		mutex_lock(&imv_mutex);
Index: linux-2.6-sched-devel/include/asm-powerpc/immediate.h
===================================================================
--- linux-2.6-sched-devel.orig/include/asm-powerpc/immediate.h	2008-04-24 10:10:17.000000000 -0400
+++ linux-2.6-sched-devel/include/asm-powerpc/immediate.h	2008-04-24 10:10:19.000000000 -0400
@@ -68,6 +68,8 @@ struct __imv {
 		value;							\
 	})
 
+#define imv_cond(name)	imv_read(name)
+
 extern int arch_imv_update(const struct __imv *imv, int early);
 
 #endif /* _ASM_POWERPC_IMMEDIATE_H */
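
As a side note for readers, here is a small user-space sketch (not part of the
patch) of the byte-pattern matching that patch_jump_target() above relies on:
gcc compiles "if (unlikely(imv_cond(var)))" into a movb $imm,%al / test
%al,%al / jne sequence, which the patch then rewrites into a plain jmp (or a
jmp with a zero offset when the condition is disabled). The opcodes below are
the real x86 ones and mirror detect_mov_test_jne() (the je variant is
analogous); the sample "site" array and its rel8 offset are fabricated for
illustration.

#include <stdio.h>
#include <stdint.h>

/*
 * Simplified user-space mirror of detect_mov_test_jne(): returns the
 * offset of the jne opcode within addr[], or -1 if the
 * movb $imm8,%al ; test %al,%al ; jne pattern is not found.
 */
static int find_mov_test_jne(const uint8_t *addr)
{
	if (addr[0] != 0xb0)			/* b0 ib : movb $imm8,%al */
		return -1;
	if (addr[2] != 0x84 || addr[3] != 0xc0)	/* 84 c0 : test %al,%al */
		return -1;
	if (addr[4] == 0x75)			/* 75 cb : jne rel8 */
		return 4;
	if (addr[4] == 0x0f && addr[5] == 0x85)	/* 0f 85 cd : jne rel32 */
		return 4;
	return -1;
}

int main(void)
{
	/* movb $0,%al ; test %al,%al ; jne with a fabricated rel8 offset */
	const uint8_t site[] = { 0xb0, 0x00, 0x84, 0xc0, 0x75, 0x12 };
	int off = find_mov_test_jne(site);

	if (off >= 0)
		printf("mov/test/jne pattern found, jne opcode at byte %d\n", off);
	else
		printf("pattern not found\n");
	return 0;
}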

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [patch 26/37] Scheduler Profiling - Use Immediate Values
  2008-04-24 15:03 [patch 00/37] Linux Kernel Markers instrumentation for sched-devel.git Mathieu Desnoyers
                   ` (24 preceding siblings ...)
  2008-04-24 15:03 ` [patch 25/37] Immediate Values - Jump Mathieu Desnoyers
@ 2008-04-24 15:03 ` Mathieu Desnoyers
  2008-04-24 15:03 ` [patch 27/37] From: Adrian Bunk <bunk@kernel.org> Mathieu Desnoyers
                   ` (11 subsequent siblings)
  37 siblings, 0 replies; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-04-24 15:03 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel
  Cc: Mathieu Desnoyers, Rusty Russell, Adrian Bunk, Andi Kleen,
	Christoph Hellwig, akpm

[-- Attachment #1: scheduler-profiling-use-immediate-values.patch --]
[-- Type: text/plain, Size: 6060 bytes --]

Use immediate values, which have a lower d-cache footprint in their optimized
version, as the condition for the scheduler profiling call.

Changelog :
- Use imv_* instead of immediate_*.
- Follow the white rabbit : kvm_main.c which becomes x86.c.
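
For illustration, the conversion pattern applied to prof_on below can be
summarized by the following sketch (the flag name, do_slow_feature_work() and
my_feature_switch() are made-up, hypothetical names):

#include <linux/cache.h>
#include <linux/compiler.h>
#include <linux/immediate.h>
#include <linux/module.h>

void do_slow_feature_work(void);	/* hypothetical slow path */

/* Was:  int my_feature_enabled __read_mostly;  */
DEFINE_IMV(char, my_feature_enabled) __read_mostly;
EXPORT_IMV_SYMBOL_GPL(my_feature_enabled);

void my_fast_path(void)
{
	/*
	 * Fast path: the flag value is encoded in the instruction
	 * stream, so testing it does not touch a d-cache line.
	 */
	if (unlikely(imv_read(my_feature_enabled)))
		do_slow_feature_work();
}

void my_feature_switch(int on)
{
	/* Slow path: imv_set() patches every imv_read() site. */
	imv_set(my_feature_enabled, on);
}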

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Rusty Russell <rusty@rustcorp.com.au>
CC: Adrian Bunk <bunk@stusta.de>
CC: Andi Kleen <andi@firstfloor.org>
CC: Christoph Hellwig <hch@infradead.org>
CC: mingo@elte.hu
CC: akpm@osdl.org
---
 arch/x86/kvm/x86.c      |    2 +-
 include/linux/profile.h |    5 +++--
 kernel/profile.c        |   22 +++++++++++-----------
 kernel/sched_fair.c     |    5 +----
 4 files changed, 16 insertions(+), 18 deletions(-)

Index: linux-2.6-sched-devel/kernel/profile.c
===================================================================
--- linux-2.6-sched-devel.orig/kernel/profile.c	2008-04-22 20:04:13.000000000 -0400
+++ linux-2.6-sched-devel/kernel/profile.c	2008-04-22 20:23:26.000000000 -0400
@@ -40,8 +40,8 @@ static int (*timer_hook)(struct pt_regs 
 static atomic_t *prof_buffer;
 static unsigned long prof_len, prof_shift;
 
-int prof_on __read_mostly;
-EXPORT_SYMBOL_GPL(prof_on);
+DEFINE_IMV(char, prof_on) __read_mostly;
+EXPORT_IMV_SYMBOL_GPL(prof_on);
 
 static cpumask_t prof_cpu_mask = CPU_MASK_ALL;
 #ifdef CONFIG_SMP
@@ -59,7 +59,7 @@ static int __init profile_setup(char *st
 
 	if (!strncmp(str, sleepstr, strlen(sleepstr))) {
 #ifdef CONFIG_SCHEDSTATS
-		prof_on = SLEEP_PROFILING;
+		imv_set(prof_on, SLEEP_PROFILING);
 		if (str[strlen(sleepstr)] == ',')
 			str += strlen(sleepstr) + 1;
 		if (get_option(&str, &par))
@@ -72,7 +72,7 @@ static int __init profile_setup(char *st
 			"kernel sleep profiling requires CONFIG_SCHEDSTATS\n");
 #endif /* CONFIG_SCHEDSTATS */
 	} else if (!strncmp(str, schedstr, strlen(schedstr))) {
-		prof_on = SCHED_PROFILING;
+		imv_set(prof_on, SCHED_PROFILING);
 		if (str[strlen(schedstr)] == ',')
 			str += strlen(schedstr) + 1;
 		if (get_option(&str, &par))
@@ -81,7 +81,7 @@ static int __init profile_setup(char *st
 			"kernel schedule profiling enabled (shift: %ld)\n",
 			prof_shift);
 	} else if (!strncmp(str, kvmstr, strlen(kvmstr))) {
-		prof_on = KVM_PROFILING;
+		imv_set(prof_on, KVM_PROFILING);
 		if (str[strlen(kvmstr)] == ',')
 			str += strlen(kvmstr) + 1;
 		if (get_option(&str, &par))
@@ -91,7 +91,7 @@ static int __init profile_setup(char *st
 			prof_shift);
 	} else if (get_option(&str, &par)) {
 		prof_shift = par;
-		prof_on = CPU_PROFILING;
+		imv_set(prof_on, CPU_PROFILING);
 		printk(KERN_INFO "kernel profiling enabled (shift: %ld)\n",
 			prof_shift);
 	}
@@ -102,7 +102,7 @@ __setup("profile=", profile_setup);
 
 void __init profile_init(void)
 {
-	if (!prof_on)
+	if (!_imv_read(prof_on))
 		return;
 
 	/* only text is profiled */
@@ -289,7 +289,7 @@ void profile_hits(int type, void *__pc, 
 	int i, j, cpu;
 	struct profile_hit *hits;
 
-	if (prof_on != type || !prof_buffer)
+	if (!prof_buffer)
 		return;
 	pc = min((pc - (unsigned long)_stext) >> prof_shift, prof_len - 1);
 	i = primary = (pc & (NR_PROFILE_GRP - 1)) << PROFILE_GRPSHIFT;
@@ -399,7 +399,7 @@ void profile_hits(int type, void *__pc, 
 {
 	unsigned long pc;
 
-	if (prof_on != type || !prof_buffer)
+	if (!prof_buffer)
 		return;
 	pc = ((unsigned long)__pc - (unsigned long)_stext) >> prof_shift;
 	atomic_add(nr_hits, &prof_buffer[min(pc, prof_len - 1)]);
@@ -556,7 +556,7 @@ static int __init create_hash_tables(voi
 	}
 	return 0;
 out_cleanup:
-	prof_on = 0;
+	imv_set(prof_on, 0);
 	smp_mb();
 	on_each_cpu(profile_nop, NULL, 0, 1);
 	for_each_online_cpu(cpu) {
@@ -583,7 +583,7 @@ static int __init create_proc_profile(vo
 {
 	struct proc_dir_entry *entry;
 
-	if (!prof_on)
+	if (!_imv_read(prof_on))
 		return 0;
 	if (create_hash_tables())
 		return -1;
Index: linux-2.6-sched-devel/include/linux/profile.h
===================================================================
--- linux-2.6-sched-devel.orig/include/linux/profile.h	2008-04-19 17:41:23.000000000 -0400
+++ linux-2.6-sched-devel/include/linux/profile.h	2008-04-22 20:23:26.000000000 -0400
@@ -7,10 +7,11 @@
 #include <linux/init.h>
 #include <linux/cpumask.h>
 #include <linux/cache.h>
+#include <linux/immediate.h>
 
 #include <asm/errno.h>
 
-extern int prof_on __read_mostly;
+DECLARE_IMV(char, prof_on) __read_mostly;
 
 #define CPU_PROFILING	1
 #define SCHED_PROFILING	2
@@ -38,7 +39,7 @@ static inline void profile_hit(int type,
 	/*
 	 * Speedup for the common (no profiling enabled) case:
 	 */
-	if (unlikely(prof_on == type))
+	if (unlikely(imv_read(prof_on) == type))
 		profile_hits(type, ip, 1);
 }
 
Index: linux-2.6-sched-devel/kernel/sched_fair.c
===================================================================
--- linux-2.6-sched-devel.orig/kernel/sched_fair.c	2008-04-22 20:04:13.000000000 -0400
+++ linux-2.6-sched-devel/kernel/sched_fair.c	2008-04-22 20:23:26.000000000 -0400
@@ -616,11 +616,8 @@ static void enqueue_sleeper(struct cfs_r
 		 * get a milliseconds-range estimation of the amount of
 		 * time that the task spent sleeping:
 		 */
-		if (unlikely(prof_on == SLEEP_PROFILING)) {
-
-			profile_hits(SLEEP_PROFILING, (void *)get_wchan(tsk),
+		profile_hits(SLEEP_PROFILING, (void *)get_wchan(task_of(se)),
 				     delta >> 20);
-		}
 		account_scheduler_latency(tsk, delta >> 10, 0);
 	}
 #endif
Index: linux-2.6-sched-devel/arch/x86/kvm/x86.c
===================================================================
--- linux-2.6-sched-devel.orig/arch/x86/kvm/x86.c	2008-04-19 17:41:23.000000000 -0400
+++ linux-2.6-sched-devel/arch/x86/kvm/x86.c	2008-04-22 20:23:26.000000000 -0400
@@ -2604,7 +2604,7 @@ again:
 	/*
 	 * Profile KVM exit RIPs:
 	 */
-	if (unlikely(prof_on == KVM_PROFILING)) {
+	if (unlikely(imv_read(prof_on) == KVM_PROFILING)) {
 		kvm_x86_ops->cache_regs(vcpu);
 		profile_hit(KVM_PROFILING, (void *)vcpu->arch.rip);
 	}

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [patch 27/37] From: Adrian Bunk <bunk@kernel.org>
  2008-04-24 15:03 [patch 00/37] Linux Kernel Markers instrumentation for sched-devel.git Mathieu Desnoyers
                   ` (25 preceding siblings ...)
  2008-04-24 15:03 ` [patch 26/37] Scheduler Profiling - Use Immediate Values Mathieu Desnoyers
@ 2008-04-24 15:03 ` Mathieu Desnoyers
  2008-04-28  9:54   ` Adrian Bunk
  2008-04-24 15:03 ` [patch 28/37] Markers - remove extra format argument Mathieu Desnoyers
                   ` (10 subsequent siblings)
  37 siblings, 1 reply; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-04-24 15:03 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel; +Cc: Adrian Bunk, Mathieu Desnoyers

[-- Attachment #1: make-marker_debug-static.patch --]
[-- Type: text/plain, Size: 931 bytes --]

With the needlessly global marker_debug made static, gcc can optimize the
unused code away.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 kernel/marker.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff -puN kernel/marker.c~make-marker_debug-static kernel/marker.c
--- a/kernel/marker.c~make-marker_debug-static
+++ a/kernel/marker.c
@@ -28,7 +28,7 @@ extern struct marker __start___markers[]
 extern struct marker __stop___markers[];
 
 /* Set to 1 to enable marker debug output */
-const int marker_debug;
+static const int marker_debug;
 
 /*
  * markers_mutex nests inside module_mutex. Markers mutex protects the builtin
_

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [patch 28/37] Markers - remove extra format argument
  2008-04-24 15:03 [patch 00/37] Linux Kernel Markers instrumentation for sched-devel.git Mathieu Desnoyers
                   ` (26 preceding siblings ...)
  2008-04-24 15:03 ` [patch 27/37] From: Adrian Bunk <bunk@kernel.org> Mathieu Desnoyers
@ 2008-04-24 15:03 ` Mathieu Desnoyers
  2008-04-24 15:03 ` [patch 29/37] Markers - define non optimized marker Mathieu Desnoyers
                   ` (9 subsequent siblings)
  37 siblings, 0 replies; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-04-24 15:03 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel; +Cc: Mathieu Desnoyers, Denys Vlasenko

[-- Attachment #1: markers-remove-extra-format-argument.patch --]
[-- Type: text/plain, Size: 7731 bytes --]


Denys Vlasenko <vda.linux@googlemail.com> :

> Not in this patch, but I noticed:
> 
> #define __trace_mark(name, call_private, format, args...)               \
>         do {                                                            \
>                 static const char __mstrtab_##name[]                    \
>                 __attribute__((section("__markers_strings")))           \
>                 = #name "\0" format;                                    \
>                 static struct marker __mark_##name                      \
>                 __attribute__((section("__markers"), aligned(8))) =     \
>                 { __mstrtab_##name, &__mstrtab_##name[sizeof(#name)],   \
>                 0, 0, marker_probe_cb,                                  \
>                 { __mark_empty_function, NULL}, NULL };                 \
>                 __mark_check_format(format, ## args);                   \
>                 if (unlikely(__mark_##name.state)) {                    \
>                         (*__mark_##name.call)                           \
>                                 (&__mark_##name, call_private,          \
>                                 format, ## args);                       \
>                 }                                                       \
>         } while (0)
> 
> In this call:
> 
>                         (*__mark_##name.call)                           \
>                                 (&__mark_##name, call_private,          \
>                                 format, ## args);                       \
> 
> you make gcc allocate duplicate format string. You can use
> &__mstrtab_##name[sizeof(#name)] instead since it holds the same string,
> or drop ", format," above and "const char *fmt" from here:
> 
>         void (*call)(const struct marker *mdata,        /* Probe wrapper */
>                 void *call_private, const char *fmt, ...);
> 
> since mdata->format is the same and all callees which need it can take it there.

Very good point. I actually thought about dropping it, since it would
remove an unnecessary argument from the stack. And actually, since I now
have the marker_probe_cb sitting between the marker site and the
callbacks, there is no API change required. Thanks :)

Mathieu
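
For clarity, here is a sketch (not from the patch; my_probe is a made-up name)
of a probe callback after this change: its signature is untouched, because
marker_probe_cb() still passes the format string to the probe -- it simply
takes it from mdata->format now instead of a second copy pushed on the stack
by the call site.

#include <stdarg.h>
#include <linux/kernel.h>
#include <linux/marker.h>

static void my_probe(void *probe_private, void *call_private,
		     const char *fmt, va_list *args)
{
	/* fmt is mdata->format of the marker this probe is attached to */
	printk(KERN_DEBUG "marker hit, format string: %s\n", fmt);
}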

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Denys Vlasenko <vda.linux@googlemail.com>
---
 include/linux/marker.h |   11 +++++------
 kernel/marker.c        |   30 ++++++++++++++----------------
 2 files changed, 19 insertions(+), 22 deletions(-)

Index: linux-2.6-lttng/include/linux/marker.h
===================================================================
--- linux-2.6-lttng.orig/include/linux/marker.h	2008-03-27 20:51:34.000000000 -0400
+++ linux-2.6-lttng/include/linux/marker.h	2008-03-27 20:54:55.000000000 -0400
@@ -44,8 +44,8 @@ struct marker {
 				 */
 	char state;		/* Marker state. */
 	char ptype;		/* probe type : 0 : single, 1 : multi */
-	void (*call)(const struct marker *mdata,	/* Probe wrapper */
-		void *call_private, const char *fmt, ...);
+				/* Probe wrapper */
+	void (*call)(const struct marker *mdata, void *call_private, ...);
 	struct marker_probe_closure single;
 	struct marker_probe_closure *multi;
 } __attribute__((aligned(8)));
@@ -72,8 +72,7 @@ struct marker {
 		__mark_check_format(format, ## args);			\
 		if (unlikely(__mark_##name.state)) {			\
 			(*__mark_##name.call)				\
-				(&__mark_##name, call_private,		\
-				format, ## args);			\
+				(&__mark_##name, call_private, ## args);\
 		}							\
 	} while (0)
 
@@ -117,9 +116,9 @@ static inline void __printf(1, 2) ___mar
 extern marker_probe_func __mark_empty_function;
 
 extern void marker_probe_cb(const struct marker *mdata,
-	void *call_private, const char *fmt, ...);
+	void *call_private, ...);
 extern void marker_probe_cb_noarg(const struct marker *mdata,
-	void *call_private, const char *fmt, ...);
+	void *call_private, ...);
 
 /*
  * Connect a probe to a marker.
Index: linux-2.6-lttng/kernel/marker.c
===================================================================
--- linux-2.6-lttng.orig/kernel/marker.c	2008-03-27 20:52:09.000000000 -0400
+++ linux-2.6-lttng/kernel/marker.c	2008-03-27 20:56:13.000000000 -0400
@@ -54,8 +54,8 @@ static DEFINE_MUTEX(markers_mutex);
 struct marker_entry {
 	struct hlist_node hlist;
 	char *format;
-	void (*call)(const struct marker *mdata,	/* Probe wrapper */
-		void *call_private, const char *fmt, ...);
+			/* Probe wrapper */
+	void (*call)(const struct marker *mdata, void *call_private, ...);
 	struct marker_probe_closure single;
 	struct marker_probe_closure *multi;
 	int refcount;	/* Number of times armed. 0 if disarmed. */
@@ -90,15 +90,13 @@ EXPORT_SYMBOL_GPL(__mark_empty_function)
  * marker_probe_cb Callback that prepares the variable argument list for probes.
  * @mdata: pointer of type struct marker
  * @call_private: caller site private data
- * @fmt: format string
  * @...:  Variable argument list.
  *
  * Since we do not use "typical" pointer based RCU in the 1 argument case, we
  * need to put a full smp_rmb() in this branch. This is why we do not use
  * rcu_dereference() for the pointer read.
  */
-void marker_probe_cb(const struct marker *mdata, void *call_private,
-	const char *fmt, ...)
+void marker_probe_cb(const struct marker *mdata, void *call_private, ...)
 {
 	va_list args;
 	char ptype;
@@ -119,8 +117,9 @@ void marker_probe_cb(const struct marker
 		/* Must read the ptr before private data. They are not data
 		 * dependant, so we put an explicit smp_rmb() here. */
 		smp_rmb();
-		va_start(args, fmt);
-		func(mdata->single.probe_private, call_private, fmt, &args);
+		va_start(args, call_private);
+		func(mdata->single.probe_private, call_private, mdata->format,
+			&args);
 		va_end(args);
 	} else {
 		struct marker_probe_closure *multi;
@@ -135,9 +134,9 @@ void marker_probe_cb(const struct marker
 		smp_read_barrier_depends();
 		multi = mdata->multi;
 		for (i = 0; multi[i].func; i++) {
-			va_start(args, fmt);
-			multi[i].func(multi[i].probe_private, call_private, fmt,
-				&args);
+			va_start(args, call_private);
+			multi[i].func(multi[i].probe_private, call_private,
+				mdata->format, &args);
 			va_end(args);
 		}
 	}
@@ -149,13 +148,11 @@ EXPORT_SYMBOL_GPL(marker_probe_cb);
  * marker_probe_cb Callback that does not prepare the variable argument list.
  * @mdata: pointer of type struct marker
  * @call_private: caller site private data
- * @fmt: format string
  * @...:  Variable argument list.
  *
  * Should be connected to markers "MARK_NOARGS".
  */
-void marker_probe_cb_noarg(const struct marker *mdata,
-	void *call_private, const char *fmt, ...)
+void marker_probe_cb_noarg(const struct marker *mdata, void *call_private, ...)
 {
 	va_list args;	/* not initialized */
 	char ptype;
@@ -171,7 +168,8 @@ void marker_probe_cb_noarg(const struct 
 		/* Must read the ptr before private data. They are not data
 		 * dependant, so we put an explicit smp_rmb() here. */
 		smp_rmb();
-		func(mdata->single.probe_private, call_private, fmt, &args);
+		func(mdata->single.probe_private, call_private, mdata->format,
+			&args);
 	} else {
 		struct marker_probe_closure *multi;
 		int i;
@@ -185,8 +183,8 @@ void marker_probe_cb_noarg(const struct 
 		smp_read_barrier_depends();
 		multi = mdata->multi;
 		for (i = 0; multi[i].func; i++)
-			multi[i].func(multi[i].probe_private, call_private, fmt,
-				&args);
+			multi[i].func(multi[i].probe_private, call_private,
+				mdata->format, &args);
 	}
 	preempt_enable();
 }

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [patch 29/37] Markers - define non optimized marker
  2008-04-24 15:03 [patch 00/37] Linux Kernel Markers instrumentation for sched-devel.git Mathieu Desnoyers
                   ` (27 preceding siblings ...)
  2008-04-24 15:03 ` [patch 28/37] Markers - remove extra format argument Mathieu Desnoyers
@ 2008-04-24 15:03 ` Mathieu Desnoyers
  2008-04-24 15:03 ` [patch 30/37] Linux Kernel Markers - Use Immediate Values Mathieu Desnoyers
                   ` (8 subsequent siblings)
  37 siblings, 0 replies; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-04-24 15:03 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel; +Cc: Mathieu Desnoyers

[-- Attachment #1: markers-define-non-optimized-marker.patch --]
[-- Type: text/plain, Size: 3184 bytes --]

To support the forthcoming "immediate values" marker optimization, we must have
a way to declare markers in a few code paths that do not use instruction
modification based enabling. This will be the case for printk(), some traps and
eventually lockdep instrumentation.

Changelog :
- Fix reversed boolean logic of "generic".
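
As a hypothetical usage example (the marker names and formats below are made
up), the two variants would be used as follows:

#include <linux/init.h>
#include <linux/marker.h>

void do_request(int cpu, unsigned long len)
{
	/* Hot path: optimized, code-patching based enabling. */
	trace_mark(subsys_do_request, "cpu %d len %lu", cpu, len);
}

static int __init my_driver_init(void)
{
	/* __init code: use the generic, variable-read based variant. */
	_trace_mark(subsys_driver_init, MARK_NOARGS);
	return 0;
}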

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
---
 include/linux/marker.h |   29 ++++++++++++++++++++++++-----
 1 file changed, 24 insertions(+), 5 deletions(-)

Index: linux-2.6-lttng/include/linux/marker.h
===================================================================
--- linux-2.6-lttng.orig/include/linux/marker.h	2008-03-27 20:47:44.000000000 -0400
+++ linux-2.6-lttng/include/linux/marker.h	2008-03-27 20:49:04.000000000 -0400
@@ -58,8 +58,12 @@ struct marker {
  * Make sure the alignment of the structure in the __markers section will
  * not add unwanted padding between the beginning of the section and the
  * structure. Force alignment to the same alignment as the section start.
+ *
+ * The "generic" argument controls which marker enabling mechanism must be used.
+ * If generic is true, a variable read is used.
+ * If generic is false, immediate values are used.
  */
-#define __trace_mark(name, call_private, format, args...)		\
+#define __trace_mark(generic, name, call_private, format, args...)	\
 	do {								\
 		static const char __mstrtab_##name[]			\
 		__attribute__((section("__markers_strings")))		\
@@ -79,7 +83,7 @@ struct marker {
 extern void marker_update_probe_range(struct marker *begin,
 	struct marker *end);
 #else /* !CONFIG_MARKERS */
-#define __trace_mark(name, call_private, format, args...) \
+#define __trace_mark(generic, name, call_private, format, args...) \
 		__mark_check_format(format, ## args)
 static inline void marker_update_probe_range(struct marker *begin,
 	struct marker *end)
@@ -87,15 +91,30 @@ static inline void marker_update_probe_r
 #endif /* CONFIG_MARKERS */
 
 /**
- * trace_mark - Marker
+ * trace_mark - Marker using code patching
  * @name: marker name, not quoted.
  * @format: format string
  * @args...: variable argument list
  *
- * Places a marker.
+ * Places a marker using optimized code patching technique (imv_read())
+ * to be enabled when immediate values are present.
  */
 #define trace_mark(name, format, args...) \
-	__trace_mark(name, NULL, format, ## args)
+	__trace_mark(0, name, NULL, format, ## args)
+
+/**
+ * _trace_mark - Marker using variable read
+ * @name: marker name, not quoted.
+ * @format: format string
+ * @args...: variable argument list
+ *
+ * Places a marker using a standard memory read (_imv_read()) to be
+ * enabled. Should be used for markers in code paths where instruction
+ * modification based enabling is not welcome. (__init and __exit functions,
+ * lockdep, some traps, printk).
+ */
+#define _trace_mark(name, format, args...) \
+	__trace_mark(1, name, NULL, format, ## args)
 
 /**
  * MARK_NOARGS - Format string for a marker with no argument.

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [patch 30/37] Linux Kernel Markers - Use Immediate Values
  2008-04-24 15:03 [patch 00/37] Linux Kernel Markers instrumentation for sched-devel.git Mathieu Desnoyers
                   ` (28 preceding siblings ...)
  2008-04-24 15:03 ` [patch 29/37] Markers - define non optimized marker Mathieu Desnoyers
@ 2008-04-24 15:03 ` Mathieu Desnoyers
  2008-04-24 15:03 ` [patch 31/37] Markers use imv jump Mathieu Desnoyers
                   ` (7 subsequent siblings)
  37 siblings, 0 replies; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-04-24 15:03 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel; +Cc: Mathieu Desnoyers

[-- Attachment #1: linux-kernel-markers-immediate-values.patch --]
[-- Type: text/plain, Size: 5749 bytes --]

Make markers use immediate values.

Changelog :
- Use imv_* instead of immediate_*.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
---
 Documentation/markers.txt |   17 +++++++++++++----
 include/linux/marker.h    |   16 ++++++++++++----
 kernel/marker.c           |    8 ++++++--
 kernel/module.c           |    1 +
 4 files changed, 32 insertions(+), 10 deletions(-)

Index: linux-2.6-sched-devel/include/linux/marker.h
===================================================================
--- linux-2.6-sched-devel.orig/include/linux/marker.h	2008-04-24 09:17:56.000000000 -0400
+++ linux-2.6-sched-devel/include/linux/marker.h	2008-04-24 09:17:58.000000000 -0400
@@ -12,6 +12,7 @@
  * See the file COPYING for more details.
  */
 
+#include <linux/immediate.h>
 #include <linux/types.h>
 
 struct module;
@@ -42,7 +43,7 @@ struct marker {
 	const char *format;	/* Marker format string, describing the
 				 * variable argument list.
 				 */
-	char state;		/* Marker state. */
+	DEFINE_IMV(char, state);/* Immediate value state. */
 	char ptype;		/* probe type : 0 : single, 1 : multi */
 				/* Probe wrapper */
 	void (*call)(const struct marker *mdata, void *call_private, ...);
@@ -74,9 +75,16 @@ struct marker {
 		0, 0, marker_probe_cb,					\
 		{ __mark_empty_function, NULL}, NULL };			\
 		__mark_check_format(format, ## args);			\
-		if (unlikely(__mark_##name.state)) {			\
-			(*__mark_##name.call)				\
-				(&__mark_##name, call_private, ## args);\
+		if (!generic) {						\
+			if (unlikely(imv_read(__mark_##name.state)))	\
+				(*__mark_##name.call)			\
+					(&__mark_##name, call_private,	\
+					## args);			\
+		} else {						\
+			if (unlikely(_imv_read(__mark_##name.state)))	\
+				(*__mark_##name.call)			\
+					(&__mark_##name, call_private,	\
+					## args);			\
 		}							\
 	} while (0)
 
Index: linux-2.6-sched-devel/kernel/marker.c
===================================================================
--- linux-2.6-sched-devel.orig/kernel/marker.c	2008-04-24 09:17:56.000000000 -0400
+++ linux-2.6-sched-devel/kernel/marker.c	2008-04-24 09:17:58.000000000 -0400
@@ -23,6 +23,7 @@
 #include <linux/rcupdate.h>
 #include <linux/marker.h>
 #include <linux/err.h>
+#include <linux/immediate.h>
 
 extern struct marker __start___markers[];
 extern struct marker __stop___markers[];
@@ -542,7 +543,7 @@ static int set_marker(struct marker_entr
 	 */
 	smp_wmb();
 	elem->ptype = (*entry)->ptype;
-	elem->state = active;
+	elem->state__imv = active;
 
 	return 0;
 }
@@ -556,7 +557,7 @@ static int set_marker(struct marker_entr
 static void disable_marker(struct marker *elem)
 {
 	/* leave "call" as is. It is known statically. */
-	elem->state = 0;
+	elem->state__imv = 0;
 	elem->single.func = __mark_empty_function;
 	/* Update the function before setting the ptype */
 	smp_wmb();
@@ -620,6 +621,9 @@ static void marker_update_probes(void)
 	marker_update_probe_range(__start___markers, __stop___markers);
 	/* Markers in modules. */
 	module_update_markers();
+	/* Update immediate values */
+	core_imv_update();
+	module_imv_update();
 }
 
 /**
Index: linux-2.6-sched-devel/Documentation/markers.txt
===================================================================
--- linux-2.6-sched-devel.orig/Documentation/markers.txt	2008-04-24 09:17:57.000000000 -0400
+++ linux-2.6-sched-devel/Documentation/markers.txt	2008-04-24 09:17:58.000000000 -0400
@@ -15,10 +15,12 @@ provide at runtime. A marker can be "on"
 (no probe is attached). When a marker is "off" it has no effect, except for
 adding a tiny time penalty (checking a condition for a branch) and space
 penalty (adding a few bytes for the function call at the end of the
-instrumented function and adds a data structure in a separate section).  When a
-marker is "on", the function you provide is called each time the marker is
-executed, in the execution context of the caller. When the function provided
-ends its execution, it returns to the caller (continuing from the marker site).
+instrumented function and adds a data structure in a separate section). The
+immediate values are used to minimize the impact on data cache, encoding the
+condition in the instruction stream. When a marker is "on", the function you
+provide is called each time the marker is executed, in the execution context of
+the caller. When the function provided ends its execution, it returns to the
+caller (continuing from the marker site).
 
 You can put markers at important locations in the code. Markers are
 lightweight hooks that can pass an arbitrary number of parameters,
@@ -69,6 +71,13 @@ a printk warning which identifies the in
 "Format mismatch for probe probe_name (format), marker (format)"
 
 
+* Optimization for a given architecture
+
+To force use of a non-optimized version of the markers, _trace_mark() should be
+used. It takes the same parameters as the normal markers, but it does not use
+the code-patching-based immediate values.
+
+
 * Probe / marker example
 
 See the example provided in samples/markers/src
Index: linux-2.6-sched-devel/kernel/module.c
===================================================================
--- linux-2.6-sched-devel.orig/kernel/module.c	2008-04-24 09:17:56.000000000 -0400
+++ linux-2.6-sched-devel/kernel/module.c	2008-04-24 09:17:58.000000000 -0400
@@ -2052,6 +2052,7 @@ static struct module *load_module(void _
 			mod->markers + mod->num_markers);
 #endif
 #ifdef CONFIG_IMMEDIATE
+		/* Immediate values must be updated after markers */
 		imv_update_range(mod->immediate,
 			mod->immediate + mod->num_immediate);
 #endif

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [patch 31/37] Markers use imv jump
  2008-04-24 15:03 [patch 00/37] Linux Kernel Markers instrumentation for sched-devel.git Mathieu Desnoyers
                   ` (29 preceding siblings ...)
  2008-04-24 15:03 ` [patch 30/37] Linux Kernel Markers - Use Immediate Values Mathieu Desnoyers
@ 2008-04-24 15:03 ` Mathieu Desnoyers
  2008-04-24 15:03 ` [patch 32/37] Port ftrace to markers Mathieu Desnoyers
                   ` (6 subsequent siblings)
  37 siblings, 0 replies; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-04-24 15:03 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel; +Cc: Mathieu Desnoyers

[-- Attachment #1: markers-use-imv-jump.patch --]
[-- Type: text/plain, Size: 1017 bytes --]

Let markers use the heavily optimized imv_cond() version of immediate values.
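
As a reminder of the constraints documented in the x86 <asm/immediate.h>
comment earlier in this series (the flag and emit_trace_event() below are
made-up names): imv_cond() only takes a char-sized immediate value and is
meant to be used directly as an if() condition, so that the compiler emits
the mov/test/conditional-jump sequence which arch_imv_update() can later
rewrite into a plain jump.

#include <linux/compiler.h>
#include <linux/immediate.h>

void emit_trace_event(void);	/* hypothetical slow path */

DEFINE_IMV(char, my_tracing_on);

void hot_path(void)
{
	if (unlikely(imv_cond(my_tracing_on)))
		emit_trace_event();
}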

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
---
 include/linux/marker.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6-lttng/include/linux/marker.h
===================================================================
--- linux-2.6-lttng.orig/include/linux/marker.h	2008-04-16 00:16:52.000000000 -0400
+++ linux-2.6-lttng/include/linux/marker.h	2008-04-16 00:17:12.000000000 -0400
@@ -76,7 +76,7 @@ struct marker {
 		{ __mark_empty_function, NULL}, NULL };			\
 		__mark_check_format(format, ## args);			\
 		if (!generic) {						\
-			if (unlikely(imv_read(__mark_##name.state)))	\
+			if (unlikely(imv_cond(__mark_##name.state)))	\
 				(*__mark_##name.call)			\
 					(&__mark_##name, call_private,	\
 					## args);			\

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [patch 32/37] Port ftrace to markers
  2008-04-24 15:03 [patch 00/37] Linux Kernel Markers instrumentation for sched-devel.git Mathieu Desnoyers
                   ` (30 preceding siblings ...)
  2008-04-24 15:03 ` [patch 31/37] Markers use imv jump Mathieu Desnoyers
@ 2008-04-24 15:03 ` Mathieu Desnoyers
  2008-04-24 15:03 ` [patch 33/37] LTTng instrumentation fs Mathieu Desnoyers
                   ` (5 subsequent siblings)
  37 siblings, 0 replies; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-04-24 15:03 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel; +Cc: Mathieu Desnoyers, Steven Rostedt

[-- Attachment #1: port-ftrace-to-markers.patch --]
[-- Type: text/plain, Size: 15365 bytes --]

Porting ftrace to the marker infrastructure.

We no longer need to chain to the wakeup tracer from the sched tracer, because
markers support connecting multiple probes to the same marker.
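
For reference, a minimal sketch of the registration/parsing pattern used in
this patch (the probe and init function names are made up; the marker name and
format string are the ones added to kernel/sched.c below): the probe receives
the marker's variable argument list and must consume it in the order given by
the format string, skipping the human-readable scalars before the "##"
separator to reach the raw pointers.

#include <stdarg.h>
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/marker.h>
#include <linux/module.h>

static void my_sched_probe(void *probe_data, void *call_data,
			   const char *format, va_list *args)
{
	int prev_pid, next_pid;
	long prev_state;

	/* "prev_pid %d next_pid %d prev_state %ld ## rq %p prev %p next %p" */
	prev_pid   = va_arg(*args, int);
	next_pid   = va_arg(*args, int);
	prev_state = va_arg(*args, long);
	/* the three trailing %p arguments would be fetched the same way */

	pr_debug("switch %d -> %d (prev state %ld)\n",
		 prev_pid, next_pid, prev_state);
}

static int __init my_probe_init(void)
{
	return marker_probe_register("kernel_sched_schedule",
			"prev_pid %d next_pid %d prev_state %ld "
			"## rq %p prev %p next %p",
			my_sched_probe, NULL);
}
module_init(my_probe_init);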

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Ingo Molnar <mingo@elte.hu>
CC: Steven Rostedt <rostedt@goodmis.org>
---
 include/linux/sched.h             |   32 -------
 kernel/sched.c                    |   14 ++-
 kernel/trace/trace.h              |   20 ----
 kernel/trace/trace_sched_switch.c |  173 +++++++++++++++++++++++++++++++-------
 kernel/trace/trace_sched_wakeup.c |  108 ++++++++++++++++++++++-
 5 files changed, 257 insertions(+), 90 deletions(-)

Index: linux-2.6-sched-devel/include/linux/sched.h
===================================================================
--- linux-2.6-sched-devel.orig/include/linux/sched.h	2008-04-24 11:00:30.000000000 -0400
+++ linux-2.6-sched-devel/include/linux/sched.h	2008-04-24 11:00:41.000000000 -0400
@@ -2080,38 +2080,6 @@ __trace_special(void *__tr, void *__data
 }
 #endif
 
-#ifdef CONFIG_CONTEXT_SWITCH_TRACER
-extern void
-ftrace_ctx_switch(void *rq, struct task_struct *prev, struct task_struct *next);
-extern void
-ftrace_wake_up_task(void *rq, struct task_struct *wakee,
-		    struct task_struct *curr);
-extern void ftrace_all_fair_tasks(void *__rq, void *__tr, void *__data);
-extern void
-ftrace_special(unsigned long arg1, unsigned long arg2, unsigned long arg3);
-#else
-static inline void
-ftrace_ctx_switch(void *rq, struct task_struct *prev, struct task_struct *next)
-{
-}
-static inline void
-sched_trace_special(unsigned long p1, unsigned long p2, unsigned long p3)
-{
-}
-static inline void
-ftrace_wake_up_task(void *rq, struct task_struct *wakee,
-		    struct task_struct *curr)
-{
-}
-static inline void ftrace_all_fair_tasks(void *__rq, void *__tr, void *__data)
-{
-}
-static inline void
-ftrace_special(unsigned long arg1, unsigned long arg2, unsigned long arg3)
-{
-}
-#endif
-
 extern long sched_setaffinity(pid_t pid, const cpumask_t *new_mask);
 extern long sched_getaffinity(pid_t pid, cpumask_t *mask);
 
Index: linux-2.6-sched-devel/kernel/sched.c
===================================================================
--- linux-2.6-sched-devel.orig/kernel/sched.c	2008-04-24 11:00:30.000000000 -0400
+++ linux-2.6-sched-devel/kernel/sched.c	2008-04-24 11:01:35.000000000 -0400
@@ -2618,7 +2618,9 @@ out_activate:
 	success = 1;
 
 out_running:
-	ftrace_wake_up_task(rq, p, rq->curr);
+	trace_mark(kernel_sched_wakeup,
+		"pid %d state %ld ## rq %p task %p rq->curr %p",
+		p->pid, p->state, rq, p, rq->curr);
 	check_preempt_curr(rq, p);
 
 	p->state = TASK_RUNNING;
@@ -2749,7 +2751,9 @@ void wake_up_new_task(struct task_struct
 		p->sched_class->task_new(rq, p);
 		inc_nr_running(rq);
 	}
-	ftrace_wake_up_task(rq, p, rq->curr);
+	trace_mark(kernel_sched_wakeup_new,
+		"pid %d state %ld ## rq %p task %p rq->curr %p",
+		p->pid, p->state, rq, p, rq->curr);
 	check_preempt_curr(rq, p);
 #ifdef CONFIG_SMP
 	if (p->sched_class->task_wake_up)
@@ -2922,7 +2926,11 @@ context_switch(struct rq *rq, struct tas
 	struct mm_struct *mm, *oldmm;
 
 	prepare_task_switch(rq, prev, next);
-	ftrace_ctx_switch(rq, prev, next);
+	trace_mark(kernel_sched_schedule,
+		"prev_pid %d next_pid %d prev_state %ld "
+		"## rq %p prev %p next %p",
+		prev->pid, next->pid, prev->state,
+		rq, prev, next);
 	mm = next->mm;
 	oldmm = prev->active_mm;
 	/*
Index: linux-2.6-sched-devel/kernel/trace/trace.h
===================================================================
--- linux-2.6-sched-devel.orig/kernel/trace/trace.h	2008-04-24 11:00:30.000000000 -0400
+++ linux-2.6-sched-devel/kernel/trace/trace.h	2008-04-24 11:00:41.000000000 -0400
@@ -240,25 +240,10 @@ void update_max_tr_single(struct trace_a
 
 extern cycle_t ftrace_now(int cpu);
 
-#ifdef CONFIG_SCHED_TRACER
-extern void
-wakeup_sched_switch(struct task_struct *prev, struct task_struct *next);
-extern void
-wakeup_sched_wakeup(struct task_struct *wakee, struct task_struct *curr);
-#else
-static inline void
-wakeup_sched_switch(struct task_struct *prev, struct task_struct *next)
-{
-}
-static inline void
-wakeup_sched_wakeup(struct task_struct *wakee, struct task_struct *curr)
-{
-}
-#endif
-
 #ifdef CONFIG_CONTEXT_SWITCH_TRACER
 typedef void
 (*tracer_switch_func_t)(void *private,
+			void *__rq,
 			struct task_struct *prev,
 			struct task_struct *next);
 
@@ -268,9 +253,6 @@ struct tracer_switch_ops {
 	struct tracer_switch_ops	*next;
 };
 
-extern int register_tracer_switch(struct tracer_switch_ops *ops);
-extern int unregister_tracer_switch(struct tracer_switch_ops *ops);
-
 #endif /* CONFIG_CONTEXT_SWITCH_TRACER */
 
 #ifdef CONFIG_DYNAMIC_FTRACE
Index: linux-2.6-sched-devel/kernel/trace/trace_sched_switch.c
===================================================================
--- linux-2.6-sched-devel.orig/kernel/trace/trace_sched_switch.c	2008-04-24 11:00:30.000000000 -0400
+++ linux-2.6-sched-devel/kernel/trace/trace_sched_switch.c	2008-04-24 11:00:41.000000000 -0400
@@ -16,11 +16,14 @@
 
 static struct trace_array	*ctx_trace;
 static int __read_mostly	tracer_enabled;
+static atomic_t			sched_ref;
 
 static void
-ctx_switch_func(void *__rq, struct task_struct *prev, struct task_struct *next)
+sched_switch_func(void *private, void *__rq, struct task_struct *prev,
+			struct task_struct *next)
 {
-	struct trace_array *tr = ctx_trace;
+	struct trace_array **ptr = private;
+	struct trace_array *tr = *ptr;
 	struct trace_array_cpu *data;
 	unsigned long flags;
 	long disabled;
@@ -41,10 +44,40 @@ ctx_switch_func(void *__rq, struct task_
 	local_irq_restore(flags);
 }
 
+static notrace void
+sched_switch_callback(void *probe_data, void *call_data,
+		      const char *format, va_list *args)
+{
+	struct task_struct *prev;
+	struct task_struct *next;
+	struct rq *__rq;
+
+	if (!atomic_read(&sched_ref))
+		return;
+
+	/* skip prev_pid %d next_pid %d prev_state %ld */
+	(void)va_arg(*args, int);
+	(void)va_arg(*args, int);
+	(void)va_arg(*args, long);
+	__rq = va_arg(*args, typeof(__rq));
+	prev = va_arg(*args, typeof(prev));
+	next = va_arg(*args, typeof(next));
+
+	tracing_record_cmdline(prev);
+
+	/*
+	 * If tracer_switch_func only points to the local
+	 * switch func, it still needs the ptr passed to it.
+	 */
+	sched_switch_func(probe_data, __rq, prev, next);
+}
+
 static void
-wakeup_func(void *__rq, struct task_struct *wakee, struct task_struct *curr)
+wakeup_func(void *private, void *__rq, struct task_struct *wakee, struct
+			task_struct *curr)
 {
-	struct trace_array *tr = ctx_trace;
+	struct trace_array **ptr = private;
+	struct trace_array *tr = *ptr;
 	struct trace_array_cpu *data;
 	unsigned long flags;
 	long disabled;
@@ -67,35 +100,29 @@ wakeup_func(void *__rq, struct task_stru
 	local_irq_restore(flags);
 }
 
-void
-ftrace_ctx_switch(void *__rq, struct task_struct *prev,
-		  struct task_struct *next)
-{
-	if (unlikely(atomic_read(&trace_record_cmdline_enabled)))
-		tracing_record_cmdline(prev);
+static notrace void
+wake_up_callback(void *probe_data, void *call_data,
+		 const char *format, va_list *args)
+{
+	struct task_struct *curr;
+	struct task_struct *task;
+	struct rq *__rq;
 
-	/*
-	 * If tracer_switch_func only points to the local
-	 * switch func, it still needs the ptr passed to it.
-	 */
-	ctx_switch_func(__rq, prev, next);
+	if (likely(!tracer_enabled))
+		return;
 
-	/*
-	 * Chain to the wakeup tracer (this is a NOP if disabled):
-	 */
-	wakeup_sched_switch(prev, next);
-}
+	/* Skip pid %d state %ld */
+	(void)va_arg(*args, int);
+	(void)va_arg(*args, long);
+	/* now get the meat: "rq %p task %p rq->curr %p" */
+	__rq = va_arg(*args, typeof(__rq));
+	task = va_arg(*args, typeof(task));
+	curr = va_arg(*args, typeof(curr));
 
-void
-ftrace_wake_up_task(void *__rq, struct task_struct *wakee,
-		    struct task_struct *curr)
-{
-	wakeup_func(__rq, wakee, curr);
+	tracing_record_cmdline(task);
+	tracing_record_cmdline(curr);
 
-	/*
-	 * Chain to the wakeup tracer (this is a NOP if disabled):
-	 */
-	wakeup_sched_wakeup(wakee, curr);
+	wakeup_func(probe_data, __rq, task, curr);
 }
 
 void
@@ -132,15 +159,95 @@ static void sched_switch_reset(struct tr
 		tracing_reset(tr->data[cpu]);
 }
 
+static int tracing_sched_register(void)
+{
+	int ret;
+
+	ret = marker_probe_register("kernel_sched_wakeup",
+			"pid %d state %ld ## rq %p task %p rq->curr %p",
+			wake_up_callback,
+			&ctx_trace);
+	if (ret) {
+		pr_info("wakeup trace: Couldn't add marker"
+			" probe to kernel_sched_wakeup\n");
+		return ret;
+	}
+
+	ret = marker_probe_register("kernel_sched_wakeup_new",
+			"pid %d state %ld ## rq %p task %p rq->curr %p",
+			wake_up_callback,
+			&ctx_trace);
+	if (ret) {
+		pr_info("wakeup trace: Couldn't add marker"
+			" probe to kernel_sched_wakeup_new\n");
+		goto fail_deprobe;
+	}
+
+	ret = marker_probe_register("kernel_sched_schedule",
+		"prev_pid %d next_pid %d prev_state %ld "
+		"## rq %p prev %p next %p",
+		sched_switch_callback,
+		&ctx_trace);
+	if (ret) {
+		pr_info("sched trace: Couldn't add marker"
+			" probe to kernel_sched_schedule\n");
+		goto fail_deprobe_wake_new;
+	}
+
+	return ret;
+fail_deprobe_wake_new:
+	marker_probe_unregister("kernel_sched_wakeup_new",
+				wake_up_callback,
+				&ctx_trace);
+fail_deprobe:
+	marker_probe_unregister("kernel_sched_wakeup",
+				wake_up_callback,
+				&ctx_trace);
+	return ret;
+}
+
+static void tracing_sched_unregister(void)
+{
+	marker_probe_unregister("kernel_sched_schedule",
+				sched_switch_callback,
+				&ctx_trace);
+	marker_probe_unregister("kernel_sched_wakeup_new",
+				wake_up_callback,
+				&ctx_trace);
+	marker_probe_unregister("kernel_sched_wakeup",
+				wake_up_callback,
+				&ctx_trace);
+}
+
+void tracing_start_sched_switch(void)
+{
+	long ref;
+
+	ref = atomic_inc_return(&sched_ref);
+	if (ref == 1)
+		tracing_sched_register();
+}
+
+void tracing_stop_sched_switch(void)
+{
+	long ref;
+
+	ref = atomic_dec_and_test(&sched_ref);
+	if (ref)
+		tracing_sched_unregister();
+}
+
 static void start_sched_trace(struct trace_array *tr)
 {
 	sched_switch_reset(tr);
 	atomic_inc(&trace_record_cmdline_enabled);
 	tracer_enabled = 1;
+	tracing_start_sched_switch();
 }
 
 static void stop_sched_trace(struct trace_array *tr)
 {
+	tracing_stop_sched_switch();
 	atomic_dec(&trace_record_cmdline_enabled);
 	tracer_enabled = 0;
 }
@@ -181,6 +288,14 @@ static struct tracer sched_switch_trace 
 
 __init static int init_sched_switch_trace(void)
 {
+	int ret = 0;
+
+	if (atomic_read(&sched_ref))
+		ret = tracing_sched_register();
+	if (ret) {
+		pr_info("error registering scheduler trace\n");
+		return ret;
+	}
 	return register_tracer(&sched_switch_trace);
 }
 device_initcall(init_sched_switch_trace);
Index: linux-2.6-sched-devel/kernel/trace/trace_sched_wakeup.c
===================================================================
--- linux-2.6-sched-devel.orig/kernel/trace/trace_sched_wakeup.c	2008-04-24 11:00:30.000000000 -0400
+++ linux-2.6-sched-devel/kernel/trace/trace_sched_wakeup.c	2008-04-24 11:00:41.000000000 -0400
@@ -15,6 +15,7 @@
 #include <linux/kallsyms.h>
 #include <linux/uaccess.h>
 #include <linux/ftrace.h>
+#include <linux/marker.h>
 
 #include "trace.h"
 
@@ -44,11 +45,13 @@ static int report_latency(cycle_t delta)
 	return 1;
 }
 
-void
-wakeup_sched_switch(struct task_struct *prev, struct task_struct *next)
+static void notrace
+wakeup_sched_switch(void *private, void *rq, struct task_struct *prev,
+	struct task_struct *next)
 {
 	unsigned long latency = 0, t0 = 0, t1 = 0;
-	struct trace_array *tr = wakeup_trace;
+	struct trace_array **ptr = private;
+	struct trace_array *tr = *ptr;
 	struct trace_array_cpu *data;
 	cycle_t T0, T1, delta;
 	unsigned long flags;
@@ -113,6 +116,31 @@ out:
 	atomic_dec(&tr->data[cpu]->disabled);
 }
 
+static notrace void
+sched_switch_callback(void *probe_data, void *call_data,
+		      const char *format, va_list *args)
+{
+	struct task_struct *prev;
+	struct task_struct *next;
+	struct rq *__rq;
+
+	/* skip prev_pid %d next_pid %d prev_state %ld */
+	(void)va_arg(*args, int);
+	(void)va_arg(*args, int);
+	(void)va_arg(*args, long);
+	__rq = va_arg(*args, typeof(__rq));
+	prev = va_arg(*args, typeof(prev));
+	next = va_arg(*args, typeof(next));
+
+	tracing_record_cmdline(prev);
+
+	/*
+	 * If tracer_switch_func only points to the local
+	 * switch func, it still needs the ptr passed to it.
+	 */
+	wakeup_sched_switch(probe_data, __rq, prev, next);
+}
+
 static void __wakeup_reset(struct trace_array *tr)
 {
 	struct trace_array_cpu *data;
@@ -188,19 +216,68 @@ out:
 	atomic_dec(&tr->data[cpu]->disabled);
 }
 
-void wakeup_sched_wakeup(struct task_struct *wakee, struct task_struct *curr)
-{
+static notrace void
+wake_up_callback(void *probe_data, void *call_data,
+		 const char *format, va_list *args)
+{
+	struct trace_array **ptr = probe_data;
+	struct trace_array *tr = *ptr;
+	struct task_struct *curr;
+	struct task_struct *task;
+	struct rq *__rq;
+
 	if (likely(!tracer_enabled))
 		return;
 
+	/* Skip pid %d state %ld */
+	(void)va_arg(*args, int);
+	(void)va_arg(*args, long);
+	/* now get the meat: "rq %p task %p rq->curr %p" */
+	__rq = va_arg(*args, typeof(__rq));
+	task = va_arg(*args, typeof(task));
+	curr = va_arg(*args, typeof(curr));
+
+	tracing_record_cmdline(task);
 	tracing_record_cmdline(curr);
-	tracing_record_cmdline(wakee);
 
-	wakeup_check_start(wakeup_trace, wakee, curr);
+	wakeup_check_start(tr, task, curr);
 }
 
 static void start_wakeup_tracer(struct trace_array *tr)
 {
+	int ret;
+
+	ret = marker_probe_register("kernel_sched_wakeup",
+			"pid %d state %ld ## rq %p task %p rq->curr %p",
+			wake_up_callback,
+			&wakeup_trace);
+	if (ret) {
+		pr_info("wakeup trace: Couldn't add marker"
+			" probe to kernel_sched_wakeup\n");
+		return;
+	}
+
+	ret = marker_probe_register("kernel_sched_wakeup_new",
+			"pid %d state %ld ## rq %p task %p rq->curr %p",
+			wake_up_callback,
+			&wakeup_trace);
+	if (ret) {
+		pr_info("wakeup trace: Couldn't add marker"
+			" probe to kernel_sched_wakeup_new\n");
+		goto fail_deprobe;
+	}
+
+	ret = marker_probe_register("kernel_sched_schedule",
+		"prev_pid %d next_pid %d prev_state %ld "
+		"## rq %p prev %p next %p",
+		sched_switch_callback,
+		&wakeup_trace);
+	if (ret) {
+		pr_info("sched trace: Couldn't add marker"
+			" probe to kernel_sched_schedule\n");
+		goto fail_deprobe_wake_new;
+	}
+
 	wakeup_reset(tr);
 
 	/*
@@ -215,11 +292,28 @@ static void start_wakeup_tracer(struct t
 	tracer_enabled = 1;
 
 	return;
+fail_deprobe_wake_new:
+	marker_probe_unregister("kernel_sched_wakeup_new",
+				wake_up_callback,
+				&wakeup_trace);
+fail_deprobe:
+	marker_probe_unregister("kernel_sched_wakeup",
+				wake_up_callback,
+				&wakeup_trace);
 }
 
 static void stop_wakeup_tracer(struct trace_array *tr)
 {
 	tracer_enabled = 0;
+	marker_probe_unregister("kernel_sched_schedule",
+				sched_switch_callback,
+				&wakeup_trace);
+	marker_probe_unregister("kernel_sched_wakeup_new",
+				wake_up_callback,
+				&wakeup_trace);
+	marker_probe_unregister("kernel_sched_wakeup",
+				wake_up_callback,
+				&wakeup_trace);
 }
 
 static void wakeup_tracer_init(struct trace_array *tr)

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [patch 33/37] LTTng instrumentation fs
  2008-04-24 15:03 [patch 00/37] Linux Kernel Markers instrumentation for sched-devel.git Mathieu Desnoyers
                   ` (31 preceding siblings ...)
  2008-04-24 15:03 ` [patch 32/37] Port ftrace to markers Mathieu Desnoyers
@ 2008-04-24 15:03 ` Mathieu Desnoyers
  2008-04-24 15:03 ` [patch 34/37] LTTng instrumentation ipc Mathieu Desnoyers
                   ` (4 subsequent siblings)
  37 siblings, 0 replies; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-04-24 15:03 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel; +Cc: Mathieu Desnoyers, Alexander Viro

[-- Attachment #1: lttng-instrumentation-fs.patch --]
[-- Type: text/plain, Size: 8702 bytes --]

Core filesystem event markers.

Markers added :

fs_buffer_wait_end
fs_buffer_wait_start
fs_close
fs_exec
fs_ioctl
fs_llseek
fs_lseek
fs_open
fs_pollfd
fs_pread64
fs_pwrite64
fs_read
fs_readv
fs_select
fs_write
fs_writev

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Alexander Viro <viro@zeniv.linux.org.uk>
---
 fs/buffer.c     |    3 +++
 fs/compat.c     |    2 ++
 fs/exec.c       |    2 ++
 fs/ioctl.c      |    3 +++
 fs/open.c       |    3 +++
 fs/read_write.c |   23 +++++++++++++++++++++--
 fs/select.c     |    5 +++++
 7 files changed, 39 insertions(+), 2 deletions(-)

Index: linux-2.6-sched-devel/fs/buffer.c
===================================================================
--- linux-2.6-sched-devel.orig/fs/buffer.c	2008-04-22 20:04:10.000000000 -0400
+++ linux-2.6-sched-devel/fs/buffer.c	2008-04-22 20:23:31.000000000 -0400
@@ -41,6 +41,7 @@
 #include <linux/bitops.h>
 #include <linux/mpage.h>
 #include <linux/bit_spinlock.h>
+#include <linux/marker.h>
 
 static int fsync_buffers_list(spinlock_t *lock, struct list_head *list);
 
@@ -89,7 +90,9 @@ void unlock_buffer(struct buffer_head *b
  */
 void __wait_on_buffer(struct buffer_head * bh)
 {
+	trace_mark(fs_buffer_wait_start, "bh %p", bh);
 	wait_on_bit(&bh->b_state, BH_Lock, sync_buffer, TASK_UNINTERRUPTIBLE);
+	trace_mark(fs_buffer_wait_end, "bh %p", bh);
 }
 
 static void
Index: linux-2.6-sched-devel/fs/compat.c
===================================================================
--- linux-2.6-sched-devel.orig/fs/compat.c	2008-03-27 14:02:05.000000000 -0400
+++ linux-2.6-sched-devel/fs/compat.c	2008-04-22 20:23:31.000000000 -0400
@@ -50,6 +50,7 @@
 #include <linux/poll.h>
 #include <linux/mm.h>
 #include <linux/eventpoll.h>
+#include <linux/marker.h>
 
 #include <asm/uaccess.h>
 #include <asm/mmu_context.h>
@@ -1401,6 +1402,7 @@ int compat_do_execve(char * filename,
 
 	retval = search_binary_handler(bprm, regs);
 	if (retval >= 0) {
+		trace_mark(fs_exec, "filename %s", filename);
 		/* execve success */
 		security_bprm_free(bprm);
 		acct_update_integrals(current);
Index: linux-2.6-sched-devel/fs/ioctl.c
===================================================================
--- linux-2.6-sched-devel.orig/fs/ioctl.c	2008-03-27 14:02:05.000000000 -0400
+++ linux-2.6-sched-devel/fs/ioctl.c	2008-04-22 20:23:31.000000000 -0400
@@ -13,6 +13,7 @@
 #include <linux/security.h>
 #include <linux/module.h>
 #include <linux/uaccess.h>
+#include <linux/marker.h>
 
 #include <asm/ioctls.h>
 
@@ -201,6 +202,8 @@ asmlinkage long sys_ioctl(unsigned int f
 	if (!filp)
 		goto out;
 
+	trace_mark(fs_ioctl, "fd %u cmd %u arg %lu", fd, cmd, arg);
+
 	error = security_file_ioctl(filp, cmd, arg);
 	if (error)
 		goto out_fput;
Index: linux-2.6-sched-devel/fs/open.c
===================================================================
--- linux-2.6-sched-devel.orig/fs/open.c	2008-04-22 20:04:11.000000000 -0400
+++ linux-2.6-sched-devel/fs/open.c	2008-04-22 20:23:31.000000000 -0400
@@ -27,6 +27,7 @@
 #include <linux/rcupdate.h>
 #include <linux/audit.h>
 #include <linux/falloc.h>
+#include <linux/marker.h>
 
 int vfs_statfs(struct dentry *dentry, struct kstatfs *buf)
 {
@@ -1089,6 +1090,7 @@ long do_sys_open(int dfd, const char __u
 				fsnotify_open(f->f_path.dentry);
 				fd_install(fd, f);
 			}
+			trace_mark(fs_open, "fd %d filename %s", fd, tmp);
 		}
 		putname(tmp);
 	}
@@ -1178,6 +1180,7 @@ asmlinkage long sys_close(unsigned int f
 	filp = fdt->fd[fd];
 	if (!filp)
 		goto out_unlock;
+	trace_mark(fs_close, "fd %u", fd);
 	rcu_assign_pointer(fdt->fd[fd], NULL);
 	FD_CLR(fd, fdt->close_on_exec);
 	__put_unused_fd(files, fd);
Index: linux-2.6-sched-devel/fs/read_write.c
===================================================================
--- linux-2.6-sched-devel.orig/fs/read_write.c	2008-03-27 14:02:06.000000000 -0400
+++ linux-2.6-sched-devel/fs/read_write.c	2008-04-22 20:23:31.000000000 -0400
@@ -16,6 +16,7 @@
 #include <linux/syscalls.h>
 #include <linux/pagemap.h>
 #include <linux/splice.h>
+#include <linux/marker.h>
 #include "read_write.h"
 
 #include <asm/uaccess.h>
@@ -146,6 +147,9 @@ asmlinkage off_t sys_lseek(unsigned int 
 		if (res != (loff_t)retval)
 			retval = -EOVERFLOW;	/* LFS: should only happen on 32 bit platforms */
 	}
+
+	trace_mark(fs_lseek, "fd %u offset %ld origin %u", fd, offset, origin);
+
 	fput_light(file, fput_needed);
 bad:
 	return retval;
@@ -173,6 +177,10 @@ asmlinkage long sys_llseek(unsigned int 
 	offset = vfs_llseek(file, ((loff_t) offset_high << 32) | offset_low,
 			origin);
 
+	trace_mark(fs_llseek, "fd %u offset %llu origin %u", fd,
+			(unsigned long long)offset,
+			origin);
+
 	retval = (int)offset;
 	if (offset >= 0) {
 		retval = -EFAULT;
@@ -359,6 +367,7 @@ asmlinkage ssize_t sys_read(unsigned int
 	file = fget_light(fd, &fput_needed);
 	if (file) {
 		loff_t pos = file_pos_read(file);
+		trace_mark(fs_read, "fd %u count %zu", fd, count);
 		ret = vfs_read(file, buf, count, &pos);
 		file_pos_write(file, pos);
 		fput_light(file, fput_needed);
@@ -376,6 +385,7 @@ asmlinkage ssize_t sys_write(unsigned in
 	file = fget_light(fd, &fput_needed);
 	if (file) {
 		loff_t pos = file_pos_read(file);
+		trace_mark(fs_write, "fd %u count %zu", fd, count);
 		ret = vfs_write(file, buf, count, &pos);
 		file_pos_write(file, pos);
 		fput_light(file, fput_needed);
@@ -397,8 +407,12 @@ asmlinkage ssize_t sys_pread64(unsigned 
 	file = fget_light(fd, &fput_needed);
 	if (file) {
 		ret = -ESPIPE;
-		if (file->f_mode & FMODE_PREAD)
+		if (file->f_mode & FMODE_PREAD) {
+			trace_mark(fs_pread64, "fd %u count %zu pos %llu",
+				fd, count, (unsigned long long)pos);
 			ret = vfs_read(file, buf, count, &pos);
+		}
+
 		fput_light(file, fput_needed);
 	}
 
@@ -418,8 +432,11 @@ asmlinkage ssize_t sys_pwrite64(unsigned
 	file = fget_light(fd, &fput_needed);
 	if (file) {
 		ret = -ESPIPE;
-		if (file->f_mode & FMODE_PWRITE)  
+		if (file->f_mode & FMODE_PWRITE) {
+			trace_mark(fs_pwrite64, "fd %u count %zu pos %llu",
+				fd, count, (unsigned long long)pos);
 			ret = vfs_write(file, buf, count, &pos);
+		}
 		fput_light(file, fput_needed);
 	}
 
@@ -663,6 +680,7 @@ sys_readv(unsigned long fd, const struct
 	file = fget_light(fd, &fput_needed);
 	if (file) {
 		loff_t pos = file_pos_read(file);
+		trace_mark(fs_readv, "fd %lu vlen %lu", fd, vlen);
 		ret = vfs_readv(file, vec, vlen, &pos);
 		file_pos_write(file, pos);
 		fput_light(file, fput_needed);
@@ -684,6 +702,7 @@ sys_writev(unsigned long fd, const struc
 	file = fget_light(fd, &fput_needed);
 	if (file) {
 		loff_t pos = file_pos_read(file);
+		trace_mark(fs_writev, "fd %lu vlen %lu", fd, vlen);
 		ret = vfs_writev(file, vec, vlen, &pos);
 		file_pos_write(file, pos);
 		fput_light(file, fput_needed);
Index: linux-2.6-sched-devel/fs/select.c
===================================================================
--- linux-2.6-sched-devel.orig/fs/select.c	2008-04-22 20:04:11.000000000 -0400
+++ linux-2.6-sched-devel/fs/select.c	2008-04-22 20:23:31.000000000 -0400
@@ -23,6 +23,7 @@
 #include <linux/file.h>
 #include <linux/fs.h>
 #include <linux/rcupdate.h>
+#include <linux/marker.h>
 
 #include <asm/uaccess.h>
 
@@ -231,6 +232,9 @@ int do_select(int n, fd_set_bits *fds, s
 				file = fget_light(i, &fput_needed);
 				if (file) {
 					f_op = file->f_op;
+					trace_mark(fs_select,
+							"fd %d timeout #8d%lld",
+							i, (long long)*timeout);
 					mask = DEFAULT_POLLMASK;
 					if (f_op && f_op->poll)
 						mask = (*f_op->poll)(file, retval ? NULL : wait);
@@ -559,6 +563,7 @@ static inline unsigned int do_pollfd(str
 		file = fget_light(fd, &fput_needed);
 		mask = POLLNVAL;
 		if (file != NULL) {
+			trace_mark(fs_pollfd, "fd %d", fd);
 			mask = DEFAULT_POLLMASK;
 			if (file->f_op && file->f_op->poll)
 				mask = file->f_op->poll(file, pwait);
Index: linux-2.6-sched-devel/fs/exec.c
===================================================================
--- linux-2.6-sched-devel.orig/fs/exec.c	2008-03-27 14:02:05.000000000 -0400
+++ linux-2.6-sched-devel/fs/exec.c	2008-04-22 20:23:31.000000000 -0400
@@ -51,6 +51,7 @@
 #include <linux/tsacct_kern.h>
 #include <linux/cn_proc.h>
 #include <linux/audit.h>
+#include <linux/marker.h>
 
 #include <asm/uaccess.h>
 #include <asm/mmu_context.h>
@@ -1338,6 +1339,7 @@ int do_execve(char * filename,
 
 	retval = search_binary_handler(bprm,regs);
 	if (retval >= 0) {
+		trace_mark(fs_exec, "filename %s", filename);
 		/* execve success */
 		free_arg_pages(bprm);
 		security_bprm_free(bprm);

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [patch 34/37] LTTng instrumentation ipc
  2008-04-24 15:03 [patch 00/37] Linux Kernel Markers instrumentation for sched-devel.git Mathieu Desnoyers
                   ` (32 preceding siblings ...)
  2008-04-24 15:03 ` [patch 33/37] LTTng instrumentation fs Mathieu Desnoyers
@ 2008-04-24 15:03 ` Mathieu Desnoyers
  2008-04-24 23:02   ` Alexey Dobriyan
  2008-04-24 15:03 ` [patch 35/37] LTTng instrumentation kernel Mathieu Desnoyers
                   ` (3 subsequent siblings)
  37 siblings, 1 reply; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-04-24 15:03 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel; +Cc: Mathieu Desnoyers

[-- Attachment #1: lttng-instrumentation-ipc.patch --]
[-- Type: text/plain, Size: 3432 bytes --]

Interprocess communication, core events.

Added markers :

ipc_msg_create
ipc_sem_create
ipc_shm_create

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
---
 ipc/msg.c |    6 +++++-
 ipc/sem.c |    6 +++++-
 ipc/shm.c |    6 +++++-
 3 files changed, 15 insertions(+), 3 deletions(-)

Index: linux-2.6-lttng/ipc/msg.c
===================================================================
--- linux-2.6-lttng.orig/ipc/msg.c	2008-03-27 07:40:09.000000000 -0400
+++ linux-2.6-lttng/ipc/msg.c	2008-03-27 07:40:12.000000000 -0400
@@ -37,6 +37,7 @@
 #include <linux/rwsem.h>
 #include <linux/nsproxy.h>
 #include <linux/ipc_namespace.h>
+#include <linux/marker.h>
 
 #include <asm/current.h>
 #include <asm/uaccess.h>
@@ -293,6 +294,7 @@ asmlinkage long sys_msgget(key_t key, in
 	struct ipc_namespace *ns;
 	struct ipc_ops msg_ops;
 	struct ipc_params msg_params;
+	long ret;
 
 	ns = current->nsproxy->ipc_ns;
 
@@ -303,7 +305,9 @@ asmlinkage long sys_msgget(key_t key, in
 	msg_params.key = key;
 	msg_params.flg = msgflg;
 
-	return ipcget(ns, &msg_ids(ns), &msg_ops, &msg_params);
+	ret = ipcget(ns, &msg_ids(ns), &msg_ops, &msg_params);
+	trace_mark(ipc_msg_create, "id %ld flags %d", ret, msgflg);
+	return ret;
 }
 
 static inline unsigned long
Index: linux-2.6-lttng/ipc/sem.c
===================================================================
--- linux-2.6-lttng.orig/ipc/sem.c	2008-03-27 07:40:09.000000000 -0400
+++ linux-2.6-lttng/ipc/sem.c	2008-03-27 07:40:12.000000000 -0400
@@ -83,6 +83,7 @@
 #include <linux/rwsem.h>
 #include <linux/nsproxy.h>
 #include <linux/ipc_namespace.h>
+#include <linux/marker.h>
 
 #include <asm/uaccess.h>
 #include "util.h"
@@ -312,6 +313,7 @@ asmlinkage long sys_semget(key_t key, in
 	struct ipc_namespace *ns;
 	struct ipc_ops sem_ops;
 	struct ipc_params sem_params;
+	long err;
 
 	ns = current->nsproxy->ipc_ns;
 
@@ -326,7 +328,9 @@ asmlinkage long sys_semget(key_t key, in
 	sem_params.flg = semflg;
 	sem_params.u.nsems = nsems;
 
-	return ipcget(ns, &sem_ids(ns), &sem_ops, &sem_params);
+	err = ipcget(ns, &sem_ids(ns), &sem_ops, &sem_params);
+	trace_mark(ipc_sem_create, "id %ld flags %d", err, semflg);
+	return err;
 }
 
 /* Manage the doubly linked list sma->sem_pending as a FIFO:
Index: linux-2.6-lttng/ipc/shm.c
===================================================================
--- linux-2.6-lttng.orig/ipc/shm.c	2008-03-27 07:40:09.000000000 -0400
+++ linux-2.6-lttng/ipc/shm.c	2008-03-27 07:40:12.000000000 -0400
@@ -39,6 +39,7 @@
 #include <linux/nsproxy.h>
 #include <linux/mount.h>
 #include <linux/ipc_namespace.h>
+#include <linux/marker.h>
 
 #include <asm/uaccess.h>
 
@@ -482,6 +483,7 @@ asmlinkage long sys_shmget (key_t key, s
 	struct ipc_namespace *ns;
 	struct ipc_ops shm_ops;
 	struct ipc_params shm_params;
+	long err;
 
 	ns = current->nsproxy->ipc_ns;
 
@@ -493,7 +495,9 @@ asmlinkage long sys_shmget (key_t key, s
 	shm_params.flg = shmflg;
 	shm_params.u.size = size;
 
-	return ipcget(ns, &shm_ids(ns), &shm_ops, &shm_params);
+	err = ipcget(ns, &shm_ids(ns), &shm_ops, &shm_params);
+	trace_mark(ipc_shm_create, "id %ld flags %d", err, shmflg);
+	return err;
 }
 
 static inline unsigned long copy_shmid_to_user(void __user *buf, struct shmid64_ds *in, int version)

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [patch 35/37] LTTng instrumentation kernel
  2008-04-24 15:03 [patch 00/37] Linux Kernel Markers instrumentation for sched-devel.git Mathieu Desnoyers
                   ` (33 preceding siblings ...)
  2008-04-24 15:03 ` [patch 34/37] LTTng instrumentation ipc Mathieu Desnoyers
@ 2008-04-24 15:03 ` Mathieu Desnoyers
  2008-04-24 15:04 ` [patch 36/37] LTTng instrumentation mm Mathieu Desnoyers
                   ` (2 subsequent siblings)
  37 siblings, 0 replies; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-04-24 15:03 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel; +Cc: Mathieu Desnoyers

[-- Attachment #1: lttng-instrumentation-kernel.patch --]
[-- Type: text/plain, Size: 16738 bytes --]

Core kernel events.

The following are *not* present in this patch because they are architecture-specific :
- syscall entry/exit
- traps
- kernel thread creation

Added markers :

kernel_irq_entry
kernel_irq_exit
kernel_kthread_stop
kernel_kthread_stop_ret
kernel_module_free
kernel_module_load
kernel_printk
kernel_process_exit
kernel_process_fork
kernel_process_free
kernel_process_wait
kernel_sched_migrate_task
kernel_sched_schedule
kernel_sched_try_wakeup
kernel_sched_wait_task
kernel_sched_wakeup_new_task
kernel_send_signal
kernel_softirq_entry
kernel_softirq_exit
kernel_softirq_raise
kernel_tasklet_high_entry
kernel_tasklet_high_exit
kernel_tasklet_low_entry
kernel_tasklet_low_exit
kernel_timer_itimer_expired
kernel_timer_itimer_set
kernel_timer_set
kernel_timer_timeout
kernel_timer_update_time
kernel_vprintk
locking_hardirqs_off
locking_hardirqs_on
locking_lock_acquire
locking_lock_release
locking_softirqs_off
locking_softirqs_on
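
As a usage note (not part of the patch), the numeric formats above lend
themselves to cheap aggregation probes. Below is a rough, untested sketch
that counts kernel_softirq_entry hits per softirq id; it would be registered
with marker_probe_register("kernel_softirq_entry", "softirq_id %lu",
probe_softirq_entry, NULL) and torn down with the matching
marker_probe_unregister() call, exactly as the ftrace port does. The array
bound and all names are illustrative.

#include <linux/kernel.h>
#include <linux/marker.h>
#include <asm/atomic.h>

/* Illustrative only: per-softirq-id hit counters. */
static atomic_t softirq_hits[32];	/* arbitrary upper bound */

static void probe_softirq_entry(void *probe_data, void *call_data,
				const char *format, va_list *args)
{
	/* The marker format is "softirq_id %lu". */
	unsigned long id = va_arg(*args, unsigned long);

	if (id < ARRAY_SIZE(softirq_hits))
		atomic_inc(&softirq_hits[id]);
}

Reading the counters back (through a proc/debugfs file or a printk on
unregister) is left out of the sketch.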

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
---
 kernel/exit.c       |    8 ++++++++
 kernel/fork.c       |    5 +++++
 kernel/irq/handle.c |    7 +++++++
 kernel/itimer.c     |   13 +++++++++++++
 kernel/kthread.c    |    5 +++++
 kernel/lockdep.c    |   20 ++++++++++++++++++++
 kernel/module.c     |    5 +++++
 kernel/printk.c     |   27 +++++++++++++++++++++++++++
 kernel/sched.c      |    5 +++++
 kernel/signal.c     |    3 +++
 kernel/softirq.c    |   23 +++++++++++++++++++++++
 kernel/timer.c      |   13 ++++++++++++-
 12 files changed, 133 insertions(+), 1 deletion(-)

Index: linux-2.6-sched-devel/kernel/irq/handle.c
===================================================================
--- linux-2.6-sched-devel.orig/kernel/irq/handle.c	2008-04-24 11:00:47.000000000 -0400
+++ linux-2.6-sched-devel/kernel/irq/handle.c	2008-04-24 11:00:49.000000000 -0400
@@ -15,6 +15,7 @@
 #include <linux/random.h>
 #include <linux/interrupt.h>
 #include <linux/kernel_stat.h>
+#include <linux/marker.h>
 
 #include "internals.h"
 
@@ -130,6 +131,10 @@ irqreturn_t handle_IRQ_event(unsigned in
 {
 	irqreturn_t ret, retval = IRQ_NONE;
 	unsigned int status = 0;
+	struct pt_regs *regs = get_irq_regs();
+
+	trace_mark(kernel_irq_entry, "irq_id %u kernel_mode %u", irq,
+		(regs)?(!user_mode(regs)):(1));
 
 	handle_dynamic_tick(action);
 
@@ -148,6 +153,8 @@ irqreturn_t handle_IRQ_event(unsigned in
 		add_interrupt_randomness(irq);
 	local_irq_disable();
 
+	trace_mark(kernel_irq_exit, MARK_NOARGS);
+
 	return retval;
 }
 
Index: linux-2.6-sched-devel/kernel/itimer.c
===================================================================
--- linux-2.6-sched-devel.orig/kernel/itimer.c	2008-04-24 11:00:47.000000000 -0400
+++ linux-2.6-sched-devel/kernel/itimer.c	2008-04-24 11:00:49.000000000 -0400
@@ -12,6 +12,7 @@
 #include <linux/time.h>
 #include <linux/posix-timers.h>
 #include <linux/hrtimer.h>
+#include <linux/marker.h>
 
 #include <asm/uaccess.h>
 
@@ -132,6 +133,9 @@ enum hrtimer_restart it_real_fn(struct h
 	struct signal_struct *sig =
 		container_of(timer, struct signal_struct, real_timer);
 
+	trace_mark(kernel_timer_itimer_expired, "pid %d",
+		pid_nr(sig->leader_pid));
+
 	kill_pid_info(SIGALRM, SEND_SIG_PRIV, sig->leader_pid);
 
 	return HRTIMER_NORESTART;
@@ -157,6 +161,15 @@ int do_setitimer(int which, struct itime
 	    !timeval_valid(&value->it_interval))
 		return -EINVAL;
 
+	trace_mark(kernel_timer_itimer_set,
+			"which %d interval_sec %ld interval_usec %ld "
+			"value_sec %ld value_usec %ld",
+			which,
+			value->it_interval.tv_sec,
+			value->it_interval.tv_usec,
+			value->it_value.tv_sec,
+			value->it_value.tv_usec);
+
 	switch (which) {
 	case ITIMER_REAL:
 again:
Index: linux-2.6-sched-devel/kernel/kthread.c
===================================================================
--- linux-2.6-sched-devel.orig/kernel/kthread.c	2008-04-24 11:00:47.000000000 -0400
+++ linux-2.6-sched-devel/kernel/kthread.c	2008-04-24 11:00:49.000000000 -0400
@@ -15,6 +15,7 @@
 #include <linux/mutex.h>
 #include <linux/cpumask.h>
 #include <linux/cpuset.h>
+#include <linux/marker.h>
 
 #define KTHREAD_NICE_LEVEL (-5)
 
@@ -207,6 +208,8 @@ int kthread_stop(struct task_struct *k)
 	/* It could exit after stop_info.k set, but before wake_up_process. */
 	get_task_struct(k);
 
+	trace_mark(kernel_kthread_stop, "pid %d", k->pid);
+
 	/* Must init completion *before* thread sees kthread_stop_info.k */
 	init_completion(&kthread_stop_info.done);
 	smp_wmb();
@@ -222,6 +225,8 @@ int kthread_stop(struct task_struct *k)
 	ret = kthread_stop_info.err;
 	mutex_unlock(&kthread_stop_lock);
 
+	trace_mark(kernel_kthread_stop_ret, "ret %d", ret);
+
 	return ret;
 }
 EXPORT_SYMBOL(kthread_stop);
Index: linux-2.6-sched-devel/kernel/lockdep.c
===================================================================
--- linux-2.6-sched-devel.orig/kernel/lockdep.c	2008-04-24 11:00:47.000000000 -0400
+++ linux-2.6-sched-devel/kernel/lockdep.c	2008-04-24 11:00:49.000000000 -0400
@@ -40,6 +40,7 @@
 #include <linux/utsname.h>
 #include <linux/hash.h>
 #include <linux/ftrace.h>
+#include <linux/marker.h>
 
 #include <asm/sections.h>
 
@@ -2024,6 +2025,9 @@ void trace_hardirqs_on_caller(unsigned l
 
 	time_hardirqs_on(CALLER_ADDR0, a0);
 
+	_trace_mark(locking_hardirqs_on, "ip #p%lu",
+		(unsigned long) __builtin_return_address(0));
+
 	if (unlikely(!debug_locks || current->lockdep_recursion))
 		return;
 
@@ -2078,6 +2082,9 @@ void trace_hardirqs_off_caller(unsigned 
 
 	time_hardirqs_off(CALLER_ADDR0, a0);
 
+	_trace_mark(locking_hardirqs_off, "ip #p%lu",
+		(unsigned long) __builtin_return_address(0));
+
 	if (unlikely(!debug_locks || current->lockdep_recursion))
 		return;
 
@@ -2110,6 +2117,9 @@ void trace_softirqs_on(unsigned long ip)
 {
 	struct task_struct *curr = current;
 
+	_trace_mark(locking_softirqs_on, "ip #p%lu",
+		(unsigned long) __builtin_return_address(0));
+
 	if (unlikely(!debug_locks))
 		return;
 
@@ -2144,6 +2154,9 @@ void trace_softirqs_off(unsigned long ip
 {
 	struct task_struct *curr = current;
 
+	_trace_mark(locking_softirqs_off, "ip #p%lu",
+		(unsigned long) __builtin_return_address(0));
+
 	if (unlikely(!debug_locks))
 		return;
 
@@ -2376,6 +2389,10 @@ static int __lock_acquire(struct lockdep
 	int chain_head = 0;
 	u64 chain_key;
 
+	_trace_mark(locking_lock_acquire,
+		"ip #p%lu subclass %u lock %p trylock %d",
+		ip, subclass, lock, trylock);
+
 	if (!prove_locking)
 		check = 1;
 
@@ -2649,6 +2666,9 @@ __lock_release(struct lockdep_map *lock,
 {
 	struct task_struct *curr = current;
 
+	_trace_mark(locking_lock_release, "ip #p%lu lock %p nested %d",
+		ip, lock, nested);
+
 	if (!check_unlock(curr, lock, ip))
 		return;
 
Index: linux-2.6-sched-devel/kernel/printk.c
===================================================================
--- linux-2.6-sched-devel.orig/kernel/printk.c	2008-04-24 11:00:47.000000000 -0400
+++ linux-2.6-sched-devel/kernel/printk.c	2008-04-24 11:00:49.000000000 -0400
@@ -32,6 +32,7 @@
 #include <linux/security.h>
 #include <linux/bootmem.h>
 #include <linux/syscalls.h>
+#include <linux/marker.h>
 
 #include <asm/uaccess.h>
 
@@ -607,6 +608,7 @@ asmlinkage int printk(const char *fmt, .
 	int r;
 
 	va_start(args, fmt);
+	trace_mark(kernel_printk, "ip %p", __builtin_return_address(0));
 	r = vprintk(fmt, args);
 	va_end(args);
 
@@ -683,6 +685,31 @@ asmlinkage int vprintk(const char *fmt, 
 	raw_local_irq_save(flags);
 	this_cpu = smp_processor_id();
 
+	if (printed_len > 0) {
+		unsigned int loglevel;
+		int mark_len;
+		char *mark_buf;
+		char saved_char;
+
+		if (printk_buf[0] == '<' && printk_buf[1] >= '0' &&
+		   printk_buf[1] <= '7' && printk_buf[2] == '>') {
+			loglevel = printk_buf[1] - '0';
+			mark_buf = &printk_buf[3];
+			mark_len = printed_len - 3;
+		} else {
+			loglevel = default_message_loglevel;
+			mark_buf = printk_buf;
+			mark_len = printed_len;
+		}
+		if (mark_buf[mark_len - 1] == '\n')
+			mark_len--;
+		saved_char = mark_buf[mark_len];
+		mark_buf[mark_len] = '\0';
+		_trace_mark(kernel_vprintk, "loglevel %c string %s ip %p",
+			loglevel, mark_buf, __builtin_return_address(0));
+		mark_buf[mark_len] = saved_char;
+	}
+
 	/*
 	 * Ouch, printk recursed into itself!
 	 */
Index: linux-2.6-sched-devel/kernel/sched.c
===================================================================
--- linux-2.6-sched-devel.orig/kernel/sched.c	2008-04-24 11:00:47.000000000 -0400
+++ linux-2.6-sched-devel/kernel/sched.c	2008-04-24 11:00:49.000000000 -0400
@@ -71,6 +71,7 @@
 #include <linux/debugfs.h>
 #include <linux/ctype.h>
 #include <linux/ftrace.h>
+#include <linux/marker.h>
 
 #include <asm/tlb.h>
 #include <asm/irq_regs.h>
@@ -2275,6 +2276,8 @@ void wait_task_inactive(struct task_stru
 		 * just go back and repeat.
 		 */
 		rq = task_rq_lock(p, &flags);
+		trace_mark(kernel_sched_wait_task, "pid %d state %ld",
+			p->pid, p->state);
 		running = task_running(rq, p);
 		on_rq = p->se.on_rq;
 		task_rq_unlock(rq, &flags);
@@ -3163,6 +3166,8 @@ static void sched_migrate_task(struct ta
 	    || unlikely(cpu_is_offline(dest_cpu)))
 		goto out;
 
+	trace_mark(kernel_sched_migrate_task, "pid %d state %ld dest_cpu %d",
+		p->pid, p->state, dest_cpu);
 	/* force the process onto the specified CPU */
 	if (migrate_task(p, dest_cpu, &req)) {
 		/* Need to wait for migration thread (might exit: take ref). */
Index: linux-2.6-sched-devel/kernel/signal.c
===================================================================
--- linux-2.6-sched-devel.orig/kernel/signal.c	2008-04-24 11:00:47.000000000 -0400
+++ linux-2.6-sched-devel/kernel/signal.c	2008-04-24 11:00:49.000000000 -0400
@@ -26,6 +26,7 @@
 #include <linux/freezer.h>
 #include <linux/pid_namespace.h>
 #include <linux/nsproxy.h>
+#include <linux/marker.h>
 
 #include <asm/param.h>
 #include <asm/uaccess.h>
@@ -663,6 +664,8 @@ static int send_signal(int sig, struct s
 	struct sigqueue * q = NULL;
 	int ret = 0;
 
+	trace_mark(kernel_send_signal, "pid %d signal %d", t->pid, sig);
+
 	/*
 	 * Deliver the signal to listening signalfds. This must be called
 	 * with the sighand lock held.
Index: linux-2.6-sched-devel/kernel/softirq.c
===================================================================
--- linux-2.6-sched-devel.orig/kernel/softirq.c	2008-04-24 11:00:47.000000000 -0400
+++ linux-2.6-sched-devel/kernel/softirq.c	2008-04-24 11:00:49.000000000 -0400
@@ -21,6 +21,7 @@
 #include <linux/rcupdate.h>
 #include <linux/smp.h>
 #include <linux/tick.h>
+#include <linux/marker.h>
 
 #include <asm/irq.h>
 /*
@@ -231,7 +232,15 @@ restart:
 
 	do {
 		if (pending & 1) {
+			trace_mark(kernel_softirq_entry, "softirq_id %lu",
+				((unsigned long)h
+					- (unsigned long)softirq_vec)
+					/ sizeof(*h));
 			h->action(h);
+			trace_mark(kernel_softirq_exit, "softirq_id %lu",
+				((unsigned long)h
+					- (unsigned long)softirq_vec)
+					/ sizeof(*h));
 			rcu_bh_qsctr_inc(cpu);
 		}
 		h++;
@@ -323,6 +332,8 @@ void irq_exit(void)
  */
 inline void raise_softirq_irqoff(unsigned int nr)
 {
+	trace_mark(kernel_softirq_raise, "softirq_id %u", nr);
+
 	__raise_softirq_irqoff(nr);
 
 	/*
@@ -412,7 +423,13 @@ static void tasklet_action(struct softir
 			if (!atomic_read(&t->count)) {
 				if (!test_and_clear_bit(TASKLET_STATE_SCHED, &t->state))
 					BUG();
+				trace_mark(kernel_tasklet_low_entry,
+						"func %p data %lu",
+						t->func, t->data);
 				t->func(t->data);
+				trace_mark(kernel_tasklet_low_exit,
+						"func %p data %lu",
+						t->func, t->data);
 				tasklet_unlock(t);
 				continue;
 			}
@@ -447,7 +464,13 @@ static void tasklet_hi_action(struct sof
 			if (!atomic_read(&t->count)) {
 				if (!test_and_clear_bit(TASKLET_STATE_SCHED, &t->state))
 					BUG();
+				trace_mark(kernel_tasklet_high_entry,
+						"func %p data %lu",
+						t->func, t->data);
 				t->func(t->data);
+				trace_mark(kernel_tasklet_high_exit,
+						"func %p data %lu",
+						t->func, t->data);
 				tasklet_unlock(t);
 				continue;
 			}
Index: linux-2.6-sched-devel/kernel/timer.c
===================================================================
--- linux-2.6-sched-devel.orig/kernel/timer.c	2008-04-24 11:00:47.000000000 -0400
+++ linux-2.6-sched-devel/kernel/timer.c	2008-04-24 11:00:49.000000000 -0400
@@ -37,12 +37,14 @@
 #include <linux/delay.h>
 #include <linux/tick.h>
 #include <linux/kallsyms.h>
+#include <linux/marker.h>
 
 #include <asm/uaccess.h>
 #include <asm/unistd.h>
 #include <asm/div64.h>
 #include <asm/timex.h>
 #include <asm/io.h>
+#include <asm/irq_regs.h>
 
 u64 jiffies_64 __cacheline_aligned_in_smp = INITIAL_JIFFIES;
 
@@ -288,6 +290,8 @@ static void internal_add_timer(struct tv
 		i = (expires >> (TVR_BITS + 3 * TVN_BITS)) & TVN_MASK;
 		vec = base->tv5.vec + i;
 	}
+	trace_mark(kernel_timer_set, "expires %lu function %p data %lu",
+		expires, timer->function, timer->data);
 	/*
 	 * Timers are FIFO:
 	 */
@@ -940,6 +944,11 @@ void do_timer(unsigned long ticks)
 {
 	jiffies_64 += ticks;
 	update_times(ticks);
+	trace_mark(kernel_timer_update_time,
+		"jiffies #8u%llu xtime_sec %ld xtime_nsec %ld "
+		"walltomonotonic_sec %ld walltomonotonic_nsec %ld",
+		(unsigned long long)jiffies_64, xtime.tv_sec, xtime.tv_nsec,
+		wall_to_monotonic.tv_sec, wall_to_monotonic.tv_nsec);
 }
 
 #ifdef __ARCH_WANT_SYS_ALARM
@@ -1021,7 +1030,9 @@ asmlinkage long sys_getegid(void)
 
 static void process_timeout(unsigned long __data)
 {
-	wake_up_process((struct task_struct *)__data);
+	struct task_struct *task = (struct task_struct *)__data;
+	trace_mark(kernel_timer_timeout, "pid %d", task->pid);
+	wake_up_process(task);
 }
 
 /**
Index: linux-2.6-sched-devel/kernel/exit.c
===================================================================
--- linux-2.6-sched-devel.orig/kernel/exit.c	2008-04-24 11:00:47.000000000 -0400
+++ linux-2.6-sched-devel/kernel/exit.c	2008-04-24 11:01:06.000000000 -0400
@@ -44,6 +44,7 @@
 #include <linux/resource.h>
 #include <linux/blkdev.h>
 #include <linux/task_io_accounting_ops.h>
+#include <linux/marker.h>
 
 #include <asm/uaccess.h>
 #include <asm/unistd.h>
@@ -137,6 +138,8 @@ static void __exit_signal(struct task_st
 
 static void delayed_put_task_struct(struct rcu_head *rhp)
 {
+	trace_mark(kernel_process_free, "pid %d",
+		container_of(rhp, struct task_struct, rcu)->pid);
 	put_task_struct(container_of(rhp, struct task_struct, rcu));
 }
 
@@ -951,6 +954,9 @@ NORET_TYPE void do_exit(long code)
 
 	if (group_dead)
 		acct_process();
+
+	trace_mark(kernel_process_exit, "pid %d", tsk->pid);
+
 	exit_sem(tsk);
 	exit_files(tsk);
 	exit_fs(tsk);
@@ -1435,6 +1441,8 @@ static long do_wait(enum pid_type type, 
 	struct task_struct *tsk;
 	int flag, retval;
 
+	trace_mark(kernel_process_wait, "pid %d", pid_nr(pid));
+
 	add_wait_queue(&current->signal->wait_chldexit,&wait);
 repeat:
 	/* If there is nothing that can match our critier just get out */
Index: linux-2.6-sched-devel/kernel/fork.c
===================================================================
--- linux-2.6-sched-devel.orig/kernel/fork.c	2008-04-24 11:00:47.000000000 -0400
+++ linux-2.6-sched-devel/kernel/fork.c	2008-04-24 11:00:49.000000000 -0400
@@ -54,6 +54,7 @@
 #include <linux/proc_fs.h>
 #include <linux/blkdev.h>
 #include <linux/magic.h>
+#include <linux/marker.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -1516,6 +1517,10 @@ long do_fork(unsigned long clone_flags,
 	if (!IS_ERR(p)) {
 		struct completion vfork;
 
+		trace_mark(kernel_process_fork,
+			"parent_pid %d child_pid %d child_tgid %d",
+			current->pid, p->pid, p->tgid);
+
 		nr = task_pid_vnr(p);
 
 		if (clone_flags & CLONE_PARENT_SETTID)
Index: linux-2.6-sched-devel/kernel/module.c
===================================================================
--- linux-2.6-sched-devel.orig/kernel/module.c	2008-04-24 11:00:47.000000000 -0400
+++ linux-2.6-sched-devel/kernel/module.c	2008-04-24 11:00:49.000000000 -0400
@@ -47,6 +47,7 @@
 #include <asm/cacheflush.h>
 #include <linux/license.h>
 #include <asm/sections.h>
+#include <linux/marker.h>
 
 #if 0
 #define DEBUGP printk
@@ -1332,6 +1333,8 @@ static int __unlink_module(void *_mod)
 /* Free a module, remove from lists, etc (must hold module_mutex). */
 static void free_module(struct module *mod)
 {
+	trace_mark(kernel_module_free, "name %s", mod->name);
+
 	/* Delete from various lists */
 	stop_machine_run(__unlink_module, mod, NR_CPUS);
 	remove_notes_attrs(mod);
@@ -2117,6 +2120,8 @@ static struct module *load_module(void _
 	/* Get rid of temporary copy */
 	vfree(hdr);
 
+	trace_mark(kernel_module_load, "name %s", mod->name);
+
 	/* Done! */
 	return mod;
 

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [patch 36/37] LTTng instrumentation mm
  2008-04-24 15:03 [patch 00/37] Linux Kernel Markers instrumentation for sched-devel.git Mathieu Desnoyers
                   ` (34 preceding siblings ...)
  2008-04-24 15:03 ` [patch 35/37] LTTng instrumentation kernel Mathieu Desnoyers
@ 2008-04-24 15:04 ` Mathieu Desnoyers
  2008-04-28  2:12   ` Masami Hiramatsu
  2008-04-24 15:04 ` [patch 37/37] LTTng instrumentation net Mathieu Desnoyers
  2008-04-26 19:38 ` [patch 00/37] Linux Kernel Markers instrumentation for sched-devel.git Peter Zijlstra
  37 siblings, 1 reply; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-04-24 15:04 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel; +Cc: Mathieu Desnoyers, linux-mm, Dave Hansen

[-- Attachment #1: lttng-instrumentation-mm.patch --]
[-- Type: text/plain, Size: 9803 bytes --]

Memory management core events.

Added markers :

mm_filemap_wait_end
mm_filemap_wait_start
mm_handle_fault_entry
mm_handle_fault_exit
mm_huge_page_alloc
mm_huge_page_free
mm_page_alloc
mm_page_free
mm_swap_file_close
mm_swap_file_open
mm_swap_in
mm_swap_out
statedump_swap_files

Changelog:
- Use page_to_pfn for swap out instrumentation, wait_on_page_bit, do_swap_page,
  page alloc/free.
- Add missing free_hot_cold_page instrumentation.
- Add hugetlb page_alloc page_free instrumentation.
- Add write_access to mm fault.
- Add page bit_nr waited for by wait_on_page_bit.
- Move page alloc instrumentation to __alloc_pages so we cover the alloc zeroed
  page path.
- Add swap file used for swap in and swap out events.
- Dump the swap files, instrument swapon and swapoff.
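
For illustration (not part of the patch), the probe private-data pointer can
carry per-probe state, much like the ftrace port passes &wakeup_trace to its
callbacks. Below is a rough, untested sketch that tallies mm_page_alloc
events per allocation order; the structure, names and order bound are all
illustrative.

#include <linux/kernel.h>
#include <linux/marker.h>
#include <asm/atomic.h>

/* Illustrative only: tally page allocations per order. */
struct page_alloc_stats {
	atomic_t count[11];		/* orders 0..10, bound is arbitrary */
};

static struct page_alloc_stats alloc_stats;

static void probe_page_alloc(void *probe_data, void *call_data,
			     const char *format, va_list *args)
{
	struct page_alloc_stats *stats = probe_data;
	/* The marker format is "order %u pfn %lu". */
	unsigned int order = va_arg(*args, unsigned int);

	(void)va_arg(*args, unsigned long);	/* skip the pfn */

	if (order < ARRAY_SIZE(stats->count))
		atomic_inc(&stats->count[order]);
}

static int page_alloc_probe_init(void)
{
	/* Call from module init; the matching marker_probe_unregister()
	 * on module exit is omitted here. */
	return marker_probe_register("mm_page_alloc", "order %u pfn %lu",
				     probe_page_alloc, &alloc_stats);
}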

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: linux-mm@kvack.org
CC: Dave Hansen <haveblue@us.ibm.com>
---
 include/linux/swapops.h |    8 ++++++++
 mm/filemap.c            |    7 +++++++
 mm/hugetlb.c            |    3 +++
 mm/memory.c             |   41 ++++++++++++++++++++++++++++++++---------
 mm/page_alloc.c         |    9 +++++++++
 mm/page_io.c            |    6 ++++++
 mm/swapfile.c           |   23 +++++++++++++++++++++++
 7 files changed, 88 insertions(+), 9 deletions(-)

Index: linux-2.6-lttng/mm/filemap.c
===================================================================
--- linux-2.6-lttng.orig/mm/filemap.c	2008-04-21 09:53:24.000000000 -0400
+++ linux-2.6-lttng/mm/filemap.c	2008-04-21 10:08:01.000000000 -0400
@@ -33,6 +33,7 @@
 #include <linux/cpuset.h>
 #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
 #include <linux/memcontrol.h>
+#include <linux/marker.h>
 #include "internal.h"
 
 /*
@@ -540,9 +541,15 @@ void wait_on_page_bit(struct page *page,
 {
 	DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);
 
+	trace_mark(mm_filemap_wait_start, "pfn %lu bit_nr %d",
+		page_to_pfn(page), bit_nr);
+
 	if (test_bit(bit_nr, &page->flags))
 		__wait_on_bit(page_waitqueue(page), &wait, sync_page,
 							TASK_UNINTERRUPTIBLE);
+
+	trace_mark(mm_filemap_wait_end, "pfn %lu bit_nr %d",
+		page_to_pfn(page), bit_nr);
 }
 EXPORT_SYMBOL(wait_on_page_bit);
 
Index: linux-2.6-lttng/mm/memory.c
===================================================================
--- linux-2.6-lttng.orig/mm/memory.c	2008-04-21 10:08:00.000000000 -0400
+++ linux-2.6-lttng/mm/memory.c	2008-04-21 11:16:52.000000000 -0400
@@ -45,6 +45,7 @@
 #include <linux/swap.h>
 #include <linux/highmem.h>
 #include <linux/pagemap.h>
+#include <linux/marker.h>
 #include <linux/rmap.h>
 #include <linux/module.h>
 #include <linux/delayacct.h>
@@ -2058,6 +2059,12 @@ static int do_swap_page(struct mm_struct
 		/* Had to read the page from swap area: Major fault */
 		ret = VM_FAULT_MAJOR;
 		count_vm_event(PGMAJFAULT);
+#ifdef CONFIG_SWAP
+		trace_mark(mm_swap_in, "pfn %lu filp %p offset %lu",
+			page_to_pfn(page),
+			get_swap_info_struct(swp_type(entry))->swap_file,
+			swp_offset(entry));
+#endif
 	}
 
 	if (mem_cgroup_charge(page, mm, GFP_KERNEL)) {
@@ -2517,30 +2524,46 @@ unlock:
 int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		unsigned long address, int write_access)
 {
+	int res;
 	pgd_t *pgd;
 	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
 
+	trace_mark(mm_handle_fault_entry,
+		"address %lu ip #p%ld write_access %d",
+		address, KSTK_EIP(current), write_access);
+
 	__set_current_state(TASK_RUNNING);
 
 	count_vm_event(PGFAULT);
 
-	if (unlikely(is_vm_hugetlb_page(vma)))
-		return hugetlb_fault(mm, vma, address, write_access);
+	if (unlikely(is_vm_hugetlb_page(vma))) {
+		res = hugetlb_fault(mm, vma, address, write_access);
+		goto end;
+	}
 
 	pgd = pgd_offset(mm, address);
 	pud = pud_alloc(mm, pgd, address);
-	if (!pud)
-		return VM_FAULT_OOM;
+	if (!pud) {
+		res = VM_FAULT_OOM;
+		goto end;
+	}
 	pmd = pmd_alloc(mm, pud, address);
-	if (!pmd)
-		return VM_FAULT_OOM;
+	if (!pmd) {
+		res = VM_FAULT_OOM;
+		goto end;
+	}
 	pte = pte_alloc_map(mm, pmd, address);
-	if (!pte)
-		return VM_FAULT_OOM;
+	if (!pte) {
+		res = VM_FAULT_OOM;
+		goto end;
+	}
 
-	return handle_pte_fault(mm, vma, address, pte, pmd, write_access);
+	res = handle_pte_fault(mm, vma, address, pte, pmd, write_access);
+end:
+	trace_mark(mm_handle_fault_exit, MARK_NOARGS);
+	return res;
 }
 
 #ifndef __PAGETABLE_PUD_FOLDED
Index: linux-2.6-lttng/mm/page_alloc.c
===================================================================
--- linux-2.6-lttng.orig/mm/page_alloc.c	2008-04-21 09:53:24.000000000 -0400
+++ linux-2.6-lttng/mm/page_alloc.c	2008-04-21 10:08:01.000000000 -0400
@@ -45,6 +45,7 @@
 #include <linux/fault-inject.h>
 #include <linux/page-isolation.h>
 #include <linux/memcontrol.h>
+#include <linux/marker.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -527,6 +528,9 @@ static void __free_pages_ok(struct page 
 	int i;
 	int reserved = 0;
 
+	trace_mark(mm_page_free, "order %u pfn %lu",
+		order, page_to_pfn(page));
+
 	for (i = 0 ; i < (1 << order) ; ++i)
 		reserved += free_pages_check(page + i);
 	if (reserved)
@@ -990,6 +994,8 @@ static void free_hot_cold_page(struct pa
 	struct per_cpu_pages *pcp;
 	unsigned long flags;
 
+	trace_mark(mm_page_free, "order %u pfn %lu", 0, page_to_pfn(page));
+
 	if (PageAnon(page))
 		page->mapping = NULL;
 	if (free_pages_check(page))
@@ -1643,6 +1649,9 @@ nopage:
 		show_mem();
 	}
 got_pg:
+	if (page)
+		trace_mark(mm_page_alloc, "order %u pfn %lu", order,
+			   page_to_pfn(page));
 	return page;
 }
 
Index: linux-2.6-lttng/mm/page_io.c
===================================================================
--- linux-2.6-lttng.orig/mm/page_io.c	2008-04-21 09:53:24.000000000 -0400
+++ linux-2.6-lttng/mm/page_io.c	2008-04-21 10:08:01.000000000 -0400
@@ -17,6 +17,7 @@
 #include <linux/bio.h>
 #include <linux/swapops.h>
 #include <linux/writeback.h>
+#include <linux/marker.h>
 #include <asm/pgtable.h>
 
 static struct bio *get_swap_bio(gfp_t gfp_flags, pgoff_t index,
@@ -114,6 +115,11 @@ int swap_writepage(struct page *page, st
 		rw |= (1 << BIO_RW_SYNC);
 	count_vm_event(PSWPOUT);
 	set_page_writeback(page);
+	trace_mark(mm_swap_out, "pfn %lu filp %p offset %lu",
+			page_to_pfn(page),
+			get_swap_info_struct(swp_type(
+				page_swp_entry(page)))->swap_file,
+			swp_offset(page_swp_entry(page)));
 	unlock_page(page);
 	submit_bio(rw, bio);
 out:
Index: linux-2.6-lttng/mm/hugetlb.c
===================================================================
--- linux-2.6-lttng.orig/mm/hugetlb.c	2008-04-21 09:53:24.000000000 -0400
+++ linux-2.6-lttng/mm/hugetlb.c	2008-04-21 10:08:01.000000000 -0400
@@ -14,6 +14,7 @@
 #include <linux/mempolicy.h>
 #include <linux/cpuset.h>
 #include <linux/mutex.h>
+#include <linux/marker.h>
 
 #include <asm/page.h>
 #include <asm/pgtable.h>
@@ -137,6 +138,7 @@ static void free_huge_page(struct page *
 	int nid = page_to_nid(page);
 	struct address_space *mapping;
 
+	trace_mark(mm_huge_page_free, "pfn %lu", page_to_pfn(page));
 	mapping = (struct address_space *) page_private(page);
 	set_page_private(page, 0);
 	BUG_ON(page_count(page));
@@ -485,6 +487,7 @@ static struct page *alloc_huge_page(stru
 	if (!IS_ERR(page)) {
 		set_page_refcounted(page);
 		set_page_private(page, (unsigned long) mapping);
+		trace_mark(mm_huge_page_alloc, "pfn %lu", page_to_pfn(page));
 	}
 	return page;
 }
Index: linux-2.6-lttng/include/linux/swapops.h
===================================================================
--- linux-2.6-lttng.orig/include/linux/swapops.h	2008-04-21 09:53:24.000000000 -0400
+++ linux-2.6-lttng/include/linux/swapops.h	2008-04-21 10:08:01.000000000 -0400
@@ -76,6 +76,14 @@ static inline pte_t swp_entry_to_pte(swp
 	return __swp_entry_to_pte(arch_entry);
 }
 
+static inline swp_entry_t page_swp_entry(struct page *page)
+{
+	swp_entry_t entry;
+	VM_BUG_ON(!PageSwapCache(page));
+	entry.val = page_private(page);
+	return entry;
+}
+
 #ifdef CONFIG_MIGRATION
 static inline swp_entry_t make_migration_entry(struct page *page, int write)
 {
Index: linux-2.6-lttng/mm/swapfile.c
===================================================================
--- linux-2.6-lttng.orig/mm/swapfile.c	2008-04-21 09:53:24.000000000 -0400
+++ linux-2.6-lttng/mm/swapfile.c	2008-04-21 10:08:01.000000000 -0400
@@ -28,6 +28,7 @@
 #include <linux/capability.h>
 #include <linux/syscalls.h>
 #include <linux/memcontrol.h>
+#include <linux/marker.h>
 
 #include <asm/pgtable.h>
 #include <asm/tlbflush.h>
@@ -1310,6 +1311,7 @@ asmlinkage long sys_swapoff(const char _
 	swap_map = p->swap_map;
 	p->swap_map = NULL;
 	p->flags = 0;
+	trace_mark(mm_swap_file_close, "filp %p", swap_file);
 	spin_unlock(&swap_lock);
 	mutex_unlock(&swapon_mutex);
 	vfree(swap_map);
@@ -1691,6 +1693,8 @@ asmlinkage long sys_swapon(const char __
 	} else {
 		swap_info[prev].next = p - swap_info;
 	}
+	trace_mark(mm_swap_file_open, "filp %p filename %s",
+		swap_file, name);
 	spin_unlock(&swap_lock);
 	mutex_unlock(&swapon_mutex);
 	error = 0;
@@ -1844,3 +1848,22 @@ int valid_swaphandles(swp_entry_t entry,
 	*offset = ++toff;
 	return nr_pages? ++nr_pages: 0;
 }
+
+void ltt_dump_swap_files(void *call_data)
+{
+	int type;
+	struct swap_info_struct *p = NULL;
+
+	mutex_lock(&swapon_mutex);
+	for (type = swap_list.head; type >= 0; type = swap_info[type].next) {
+		p = swap_info + type;
+		if ((p->flags & SWP_ACTIVE) != SWP_ACTIVE)
+			continue;
+		__trace_mark(0, statedump_swap_files, call_data,
+			"filp %p vfsmount %p dname %s",
+			p->swap_file, p->swap_file->f_vfsmnt,
+			p->swap_file->f_dentry->d_name.name);
+	}
+	mutex_unlock(&swapon_mutex);
+}
+EXPORT_SYMBOL_GPL(ltt_dump_swap_files);

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [patch 37/37] LTTng instrumentation net
  2008-04-24 15:03 [patch 00/37] Linux Kernel Markers instrumentation for sched-devel.git Mathieu Desnoyers
                   ` (35 preceding siblings ...)
  2008-04-24 15:04 ` [patch 36/37] LTTng instrumentation mm Mathieu Desnoyers
@ 2008-04-24 15:04 ` Mathieu Desnoyers
  2008-04-24 15:52   ` Pavel Emelyanov
  2008-04-26 19:38 ` [patch 00/37] Linux Kernel Markers instrumentation for sched-devel.git Peter Zijlstra
  37 siblings, 1 reply; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-04-24 15:04 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel; +Cc: Mathieu Desnoyers, netdev

[-- Attachment #1: lttng-instrumentation-net.patch --]
[-- Type: text/plain, Size: 4381 bytes --]

Network core events.

Added markers :

net_del_ifa_ipv4
net_dev_receive
net_dev_xmit
net_insert_ifa_ipv4
net_socket_call
net_socket_create
net_socket_recvmsg
net_socket_sendmsg

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: netdev@vger.kernel.org
---
 net/core/dev.c     |    6 ++++++
 net/ipv4/devinet.c |    6 ++++++
 net/socket.c       |   19 +++++++++++++++++++
 3 files changed, 31 insertions(+)

Index: linux-2.6-lttng/net/core/dev.c
===================================================================
--- linux-2.6-lttng.orig/net/core/dev.c	2008-03-27 07:26:26.000000000 -0400
+++ linux-2.6-lttng/net/core/dev.c	2008-03-27 07:31:44.000000000 -0400
@@ -119,6 +119,7 @@
 #include <linux/err.h>
 #include <linux/ctype.h>
 #include <linux/if_arp.h>
+#include <linux/marker.h>
 
 #include "net-sysfs.h"
 
@@ -1643,6 +1644,8 @@ int dev_queue_xmit(struct sk_buff *skb)
 	}
 
 gso:
+	trace_mark(net_dev_xmit, "skb %p protocol #2u%hu", skb, skb->protocol);
+
 	spin_lock_prefetch(&dev->queue_lock);
 
 	/* Disable soft irqs for various locks below. Also
@@ -2043,6 +2046,9 @@ int netif_receive_skb(struct sk_buff *sk
 
 	__get_cpu_var(netdev_rx_stat).total++;
 
+	trace_mark(net_dev_receive, "skb %p protocol #2u%hu",
+		skb, skb->protocol);
+
 	skb_reset_network_header(skb);
 	skb_reset_transport_header(skb);
 	skb->mac_len = skb->network_header - skb->mac_header;
Index: linux-2.6-lttng/net/ipv4/devinet.c
===================================================================
--- linux-2.6-lttng.orig/net/ipv4/devinet.c	2008-03-27 07:26:26.000000000 -0400
+++ linux-2.6-lttng/net/ipv4/devinet.c	2008-03-27 07:31:49.000000000 -0400
@@ -56,6 +56,7 @@
 #include <linux/sysctl.h>
 #endif
 #include <linux/kmod.h>
+#include <linux/marker.h>
 
 #include <net/arp.h>
 #include <net/ip.h>
@@ -258,6 +259,8 @@ static void __inet_del_ifa(struct in_dev
 		struct in_ifaddr **ifap1 = &ifa1->ifa_next;
 
 		while ((ifa = *ifap1) != NULL) {
+			trace_mark(net_del_ifa_ipv4, "label %s",
+				ifa->ifa_label);
 			if (!(ifa->ifa_flags & IFA_F_SECONDARY) &&
 			    ifa1->ifa_scope <= ifa->ifa_scope)
 				last_prim = ifa;
@@ -364,6 +367,9 @@ static int __inet_insert_ifa(struct in_i
 			}
 			ifa->ifa_flags |= IFA_F_SECONDARY;
 		}
+		trace_mark(net_insert_ifa_ipv4, "label %s address #4u%lu",
+			ifa->ifa_label,
+			(unsigned long)ifa->ifa_address);
 	}
 
 	if (!(ifa->ifa_flags & IFA_F_SECONDARY)) {
Index: linux-2.6-lttng/net/socket.c
===================================================================
--- linux-2.6-lttng.orig/net/socket.c	2008-03-27 07:26:26.000000000 -0400
+++ linux-2.6-lttng/net/socket.c	2008-03-27 07:31:57.000000000 -0400
@@ -85,6 +85,7 @@
 #include <linux/audit.h>
 #include <linux/wireless.h>
 #include <linux/nsproxy.h>
+#include <linux/marker.h>
 
 #include <asm/uaccess.h>
 #include <asm/unistd.h>
@@ -567,6 +568,11 @@ int sock_sendmsg(struct socket *sock, st
 	struct sock_iocb siocb;
 	int ret;
 
+	trace_mark(net_socket_sendmsg,
+		"sock %p family %d type %d protocol %d size %zu",
+		sock, sock->sk->sk_family, sock->sk->sk_type,
+		sock->sk->sk_protocol, size);
+
 	init_sync_kiocb(&iocb, NULL);
 	iocb.private = &siocb;
 	ret = __sock_sendmsg(&iocb, sock, msg, size);
@@ -650,7 +656,13 @@ int sock_recvmsg(struct socket *sock, st
 	struct sock_iocb siocb;
 	int ret;
 
+	trace_mark(net_socket_recvmsg,
+		"sock %p family %d type %d protocol %d size %zu",
+		sock, sock->sk->sk_family, sock->sk->sk_type,
+		sock->sk->sk_protocol, size);
+
 	init_sync_kiocb(&iocb, NULL);
+
 	iocb.private = &siocb;
 	ret = __sock_recvmsg(&iocb, sock, msg, size, flags);
 	if (-EIOCBQUEUED == ret)
@@ -1226,6 +1238,11 @@ asmlinkage long sys_socket(int family, i
 	if (retval < 0)
 		goto out_release;
 
+	trace_mark(net_socket_create,
+		"sock %p family %d type %d protocol %d fd %d",
+		sock, sock->sk->sk_family, sock->sk->sk_type,
+		sock->sk->sk_protocol, retval);
+
 out:
 	/* It may be already another descriptor 8) Not kernel problem. */
 	return retval;
@@ -2024,6 +2041,8 @@ asmlinkage long sys_socketcall(int call,
 	a0 = a[0];
 	a1 = a[1];
 
+	trace_mark(net_socket_call, "call %d a0 %lu", call, a0);
+
 	switch (call) {
 	case SYS_SOCKET:
 		err = sys_socket(a0, a1, a[2]);

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 37/37] LTTng instrumentation net
  2008-04-24 15:04 ` [patch 37/37] LTTng instrumentation net Mathieu Desnoyers
@ 2008-04-24 15:52   ` Pavel Emelyanov
  2008-04-24 16:13     ` Mathieu Desnoyers
  0 siblings, 1 reply; 66+ messages in thread
From: Pavel Emelyanov @ 2008-04-24 15:52 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: akpm, Ingo Molnar, linux-kernel, netdev

Mathieu Desnoyers wrote:
> Network core events.
> 
> Added markers :
> 
> net_del_ifa_ipv4
> net_dev_receive
> net_dev_xmit
> net_insert_ifa_ipv4
> net_socket_call
> net_socket_create
> net_socket_recvmsg
> net_socket_sendmsg

Network "core" events are not limited with the above calls.

Besides, real "core" events already sent notifications about themselves.
Why do we need additional hooks?

> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> CC: netdev@vger.kernel.org
> ---
>  net/core/dev.c     |    6 ++++++
>  net/ipv4/devinet.c |    6 ++++++
>  net/socket.c       |   19 +++++++++++++++++++
>  3 files changed, 31 insertions(+)


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 37/37] LTTng instrumentation net
  2008-04-24 15:52   ` Pavel Emelyanov
@ 2008-04-24 16:13     ` Mathieu Desnoyers
  2008-04-24 16:30       ` Pavel Emelyanov
  0 siblings, 1 reply; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-04-24 16:13 UTC (permalink / raw)
  To: Pavel Emelyanov; +Cc: akpm, Ingo Molnar, linux-kernel, netdev

* Pavel Emelyanov (xemul@openvz.org) wrote:
> Mathieu Desnoyers wrote:
> > Network core events.
> > 
> > Added markers :
> > 
> > net_del_ifa_ipv4
> > net_dev_receive
> > net_dev_xmit
> > net_insert_ifa_ipv4
> > net_socket_call
> > net_socket_create
> > net_socket_recvmsg
> > net_socket_sendmsg
> 
> Network "core" events are not limited with the above calls.
> 

True. This is by no means an exhaustive list of network events. It just
happens to be the ones which have been useful to LTT/LTTng users for the
past ~10 years.

> Besides, real "core" events already sent notifications about themselves.
> Why do we need additional hooks?
> 

I doubt the current notification hooks have a performance impact as
small as the proposed markers. Which notification mechanism do you refer
to ? It could be interesting to put markers in there instead.

The goal behind this is to feed information to a general purpose tracer
like lttng, a scripting mechanism like systemtap or a special-purpose
tracer like ftrace.

I think that the most important instrumentation in this patchset is the
xmit/recv of a packet at the device level. The net_socket_*
instrumentation could eventually be replaced by an architecture-specific
system call parameters instrumentation.

Mathieu

> > Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> > CC: netdev@vger.kernel.org
> > ---
> >  net/core/dev.c     |    6 ++++++
> >  net/ipv4/devinet.c |    6 ++++++
> >  net/socket.c       |   19 +++++++++++++++++++
> >  3 files changed, 31 insertions(+)
> 

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 37/37] LTTng instrumentation net
  2008-04-24 16:13     ` Mathieu Desnoyers
@ 2008-04-24 16:30       ` Pavel Emelyanov
  0 siblings, 0 replies; 66+ messages in thread
From: Pavel Emelyanov @ 2008-04-24 16:30 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: akpm, Ingo Molnar, linux-kernel, netdev

Mathieu Desnoyers wrote:
> * Pavel Emelyanov (xemul@openvz.org) wrote:
>> Mathieu Desnoyers wrote:
>>> Network core events.
>>>
>>> Added markers :
>>>
>>> net_del_ifa_ipv4
>>> net_dev_receive
>>> net_dev_xmit
>>> net_insert_ifa_ipv4
>>> net_socket_call
>>> net_socket_create
>>> net_socket_recvmsg
>>> net_socket_sendmsg
>> Network "core" events are not limited with the above calls.
>>
> 
> True. This is by no means an exhaustive list of network events. It just
> happens to be the ones which have been useful to LTT/LTTng users for the
> past ~10 years.

Do you mean that we'll have this debris all over the networking code some day?

>> Besides, real "core" events already sent notifications about themselves.
>> Why do we need additional hooks?
>>
> 
> I doubt the current notification hooks have a performance impact as
> small as the proposed markers. Which notification mechanism do you refer
> to ? It could be interesting to put markers in there instead.

E.g. call_netdevice_notifiers and co.
And they have nothing to do with performance, since configuration code is
not supposed to run at rocket speed.

> The goal behind this is to feed information to a general purpose tracer
> like lttng, a scripting mechanism like systemtap or a special-purpose
> tracer like ftrace.
> 
> I think that the most important instrumentation in this patchset is the
> xmit/recv of a packet at the device level. The net_socket_*
> instrumentation could eventually be replaced by an architecture-specific
> system call parameters instrumentation.

I will not argue about the value of such hooks in the xmit/recv paths, but
as far as net_socket_xxx is concerned - there are already the
* ptrace
* security
* kprobes
ways to screw up the normal code flow in these places.

> Mathieu
> 
>>> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
>>> CC: netdev@vger.kernel.org
>>> ---
>>>  net/core/dev.c     |    6 ++++++
>>>  net/ipv4/devinet.c |    6 ++++++
>>>  net/socket.c       |   19 +++++++++++++++++++
>>>  3 files changed, 31 insertions(+)
> 


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 34/37] LTTng instrumentation ipc
  2008-04-24 15:03 ` [patch 34/37] LTTng instrumentation ipc Mathieu Desnoyers
@ 2008-04-24 23:02   ` Alexey Dobriyan
  2008-04-25  2:15     ` Frank Ch. Eigler
  0 siblings, 1 reply; 66+ messages in thread
From: Alexey Dobriyan @ 2008-04-24 23:02 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: akpm, Ingo Molnar, linux-kernel

On Thu, Apr 24, 2008 at 11:03:58AM -0400, Mathieu Desnoyers wrote:
> Interprocess communication, core events.
> 
> Added markers :
> 
> ipc_msg_create
> ipc_sem_create
> ipc_shm_create

> --- linux-2.6-lttng.orig/ipc/shm.c
> +++ linux-2.6-lttng/ipc/shm.c
> @@ -39,6 +39,7 @@
>  #include <linux/nsproxy.h>
>  #include <linux/mount.h>
>  #include <linux/ipc_namespace.h>
> +#include <linux/marker.h>
>  
>  #include <asm/uaccess.h>
>  
> @@ -482,6 +483,7 @@ asmlinkage long sys_shmget (key_t key, s
>  	struct ipc_namespace *ns;
>  	struct ipc_ops shm_ops;
>  	struct ipc_params shm_params;
> +	long err;
>  
>  	ns = current->nsproxy->ipc_ns;
>  
> @@ -493,7 +495,9 @@ asmlinkage long sys_shmget (key_t key, s
>  	shm_params.flg = shmflg;
>  	shm_params.u.size = size;
>  
> -	return ipcget(ns, &shm_ids(ns), &shm_ops, &shm_params);
> +	err = ipcget(ns, &shm_ids(ns), &shm_ops, &shm_params);
> +	trace_mark(ipc_shm_create, "id %ld flags %d", err, shmflg);
> +	return err;
>  }

OK, finally the meat of markers facility was posted and we can actually
see the end result. And the end result is unwieldy and limited.

Today I was debugging some SysV shmem stuff and only half of the marker
above would be useful. I ended up with the following:

	rv = ipcget(...);
	if (rv < 0)
		printk("%s: rv = %d\n", __func__, rv);
	return rv;

because I knew the app was doing a lot of shmget/IPC_RMID and only -E events
were interesting. The rest was inserted deep in mm/shmem.c internals,
which these patches avoid for some reason :^)

Can I write

	if (rv < 0)
		trace_mark(foo, "rv %d", rv);

		?

Looks like I could. But people also want to see success, so what?
Two markers per exit?

	rv = ipc_get(...);
	if (rv < 0)
		trace_marker(foo_err, ...);
	trace_marker(foo_all, ...);


Also everything inserted so far is static. Sometimes only one bit in
mask is interesting and to save time to parse nibbles people do:

	printk("foo = %d\n", !!(mask & foo));

And interesting bits vary.

Again, all events aren't interesting:

	if (file && file->f_op == &shm_file_operations)
		printk("%s: file = %p: START\n", __func__, file);

Can I write this with markers?

And finally, if we are talking about debugging by printks (which, to me,
markers essentially are), they come in generations: you insert some initial
stuff, get information, narrow the search area, insert some more in places
that depend on what you've seen in step 1, and voila, the bug is
understood.

So what is proposed? Insert markers at places that look strategic? Feed
me data I DO NOT care about and DO NOT want to see?

mm/ patch is full of "file %p". Do you realize that pointers are
only sometimes interesting and sometimes you want dentry (not pointer,
but name!):

	printk("file = '%s'\n", file->f_dentry->d_name.name);

You seem to place _one_ marker at the very end of the error path, but,
c'mon, information about _which_ "goto" exactly led to the error path is
also interesting!

Should I preventively insert a marker when writing new code?



So, let me say even more explicitly. Looking at proposed places elite
enough to be blessed with marker...

Those which are close enough to system call boundary are essentially
strace(1).

Markers are very visible and distract attention from real code.
Markers steal critical vertical space (80 x only _25_ ).

Markers fmt strings are unconditional and can't take prior information
about the problem into account.

Markers points won't be removed, only accumulated -- somebody _might_ be
interested in this information.

Developers want to see wildly different information. Marker is one for
everyone.

That said, markers are lame and should be removed entirely before it's too late.


P.S.: I probably miss some obvious things and look only from a specific
debugging perspective. Let someone speak up from another angle. It's a very
good moment now because the ipc/ kernel/ mm/ net/ patches show CRYSTAL CLEAR
the end result (and the futex patch earlier).


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 34/37] LTTng instrumentation ipc
  2008-04-24 23:02   ` Alexey Dobriyan
@ 2008-04-25  2:15     ` Frank Ch. Eigler
  2008-04-25 12:56       ` Mathieu Desnoyers
  0 siblings, 1 reply; 66+ messages in thread
From: Frank Ch. Eigler @ 2008-04-25  2:15 UTC (permalink / raw)
  To: Alexey Dobriyan; +Cc: Mathieu Desnoyers, akpm, Ingo Molnar, linux-kernel


Alexey Dobriyan <adobriyan@gmail.com> writes:

> [...]
> Can I write
> 	if (rv < 0)
> 		trace_mark(foo, "rv %d", rv);

Sure.

> Looks like I could. But people also want to see success, so what?
> Two markers per exit?
>
> 	rv = ipc_get(...);
> 	if (rv < 0)
> 		trace_marker(foo_err, ...);
> 	trace_marker(foo_all, ...);

That seems excessive.  Just pass "rv" value and let the consumer
decide whether they care about < 0.

You seem to be operating under the mistaken assumption that marker
consumers will simply have to pass on the full firehose flow without
filtering.  That is not so.  I suspect lttng can do it, but I know
that with systemtap, it's trivial to encode conditions on the marker
parameters and other state (e.g., recent events of interest), so that
only finely tuned events actually get sent to the end user.


> Also everything inserted so far is static. Sometimes only one bit in
> mask is interesting and to save time to parse nibbles people do:
> 	printk("foo = %d\n", !!(mask & foo));
> And interesting bits vary.

OK, perhaps pass both mask & foo, and let the consumer perform the
arithmetic they deem appropriate.


> Again, all events aren't interesting:
> 	if (file && file->f_op == &shm_file_operations)
> 		printk("%s: file = %p: START\n", __func__, file);
> Can I write this with markers?

Of course, if you really want to.

> So what is proposed? Insert markers at places that look strategic? 

"strategic" is the wrong term.  Choose those places that reflect
internal occurrences that are useful but difficult to reverse-engineer
from other visible interface points like system calls.  Data that
helps answer questions like "Why did (subtle internal phenomenon)
happen?" in a live system.


> mm/ patch is full of "file %p". Do you realize that pointers are
> only sometimes interesting and sometimes you want dentry (not pointer,
> but name!):
> 	printk("file = '%s'\n", file->f_dentry->d_name.name);

It may not be excessive to put both file and the dname.name as marker
parameters.


> So, let me say even more explicitly. Looking at proposed places elite
> enough to be blessed with marker...
>
> Those which are close enough to system call boundary are essentially
> strace(1).

Those may not sound worthwhile to put a marker for, BUT, you're
ignoring the huge differences of impact and scope.  A system-wide
marker-based trace (filtered a la systemtap if desired) can be done
with a tiny fraction of system load and none of the disruption caused
by an strace of all the processes.


> [...]  Markers points won't be removed, only accumulated -- somebody
> _might_ be interested in this information.

We all (data producers and consumers) need to use good judgment and
accept moderate change.


- FChE

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 34/37] LTTng instrumentation ipc
  2008-04-25  2:15     ` Frank Ch. Eigler
@ 2008-04-25 12:56       ` Mathieu Desnoyers
  2008-04-25 13:17         ` [RFC] system-wide in-kernel syscall tracing Mathieu Desnoyers
  2008-05-04 21:04         ` [patch 34/37] LTTng instrumentation ipc Alexey Dobriyan
  0 siblings, 2 replies; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-04-25 12:56 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: Alexey Dobriyan, akpm, Ingo Molnar, linux-kernel

* Frank Ch. Eigler (fche@redhat.com) wrote:
> 
> Alexey Dobriyan <adobriyan@gmail.com> writes:
> 

Alexey, I'll reply to Frank's email, but first let me thank you for
looking into this.

> > [...]
> > Can I write
> > 	if (rv < 0)
> > 		trace_mark(foo, "rv %d", rv);
> 
> Sure.
> 
> > Looks like I could. But people also want to see success, so what?
> > Two markers per exit?
> >
> > 	rv = ipc_get(...);
> > 	if (rv < 0)
> > 		trace_marker(foo_err, ...);
> > 	trace_marker(foo_all, ...);
> 
> That seems excessive.  Just pass "rv" value and let the consumer
> decide whether they care about < 0.
> 
> You seem to be operating under the mistaken assumption that marker
> consumers will simply have to pass on the full firehose flow without
> filtering.  That is not so.  I suspect lttng can do it, but I know
> that with systemtap, it's trivial to encode conditions on the marker
> parameters and other state (e.g., recent events of interest), so that
> only finely tuned events actually get sent to the end user.
> 

The preferred way to do it would be

  trace_mark(foo, "rv %d", rv);

And let the probe deal with rv. The main reason is that by adding

if (test)
  trace_mark()

you will add a branch in the normal kernel code flow and slow down the
kernel a bit when disabled compared to the optimized markers.
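
For example, the filtering can live entirely in the probe. A rough,
untested sketch (record_event() is just a placeholder for whatever the
tracer does with the data, and I assume the va_list-based probe
prototype used by this patchset) :

static void probe_ipc_shm_create(void *probe_data, void *call_data,
		const char *fmt, va_list *args)
{
	long id = va_arg(*args, long);		/* "id %ld" */
	int flags = va_arg(*args, int);		/* "flags %d" */

	if (id < 0)				/* only record failures */
		record_event(probe_data, id, flags);
}

	/* in the tracer's init code : */
	marker_probe_register("ipc_shm_create", "id %ld flags %d",
			probe_ipc_shm_create, NULL);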

> 
> > Also everything inserted so far is static. Sometimes only one bit in
> > mask is interesting and to save time to parse nibbles people do:
> > 	printk("foo = %d\n", !!(mask & foo));
> > And interesting bits vary.
> 
> OK, perhaps pass both mask & foo, and let the consumer perform the
> arithmetic they deem appropriate.
> 

Agreed. However, adding stuff like

!!(mask & foo)

as parameter to a marker won't add to the disabled marker runtime cost,
since it's evaluated within the "marker enabled" block. So, if all you
really need is !!(mask & foo) (and never other information about foo),
then it could make sense to use the most restrictive version so we don't
export internal details about the kernel implementation.
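
In other words, a sketch of the two forms (hypothetical event and field
names, mask assumed to fit in an unsigned long) :

	/* restrictive: only the derived bit is exported */
	trace_mark(foo_bit, "foo_set %d", !!(mask & foo));

	/* permissive: the consumer does the arithmetic */
	trace_mark(foo_mask, "mask %lu foo %lu",
		(unsigned long)mask, (unsigned long)foo);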

> 
> > Again, all events aren't interesting:
> > 	if (file && file->f_op == &shm_file_operations)
> > 		printk("%s: file = %p: START\n", __func__, file);
> > Can I write this with markers?
> 
> Of course, if you really want to.
> 

__func__ is not really interesting here, because you can name your
marker. A useful trick can be to use __builtin_return_address(0) when
needed though.

for the rest, a way that would not export too much information about
kernel's internals :

trace_mark(shm_start, "is_shm_fop %d file %p",
  file->f_op == &shm_file_operations, file);

And yes, there is a small runtime cost added here when the marker is
enabled: the test is done in the probe being called rather than at the spot.
On the other hand, the impact is nearly zero when the marker is
disabled.

If you really really want to, I could modify the markers to make this
even easier by doing something like :

trace_mark_cond(file->f_op == &shm_file_operations,
  shm_start, "file %p", file);

Where the first argument would be a supplementary condition tested in
the marker block. That would make the active marker case faster. How do
you like that ? (see patch appended at the end of the email)



> > So what is proposed? Insert markers at places that look strategic? 
> 
> "strategic" is the wrong term.  Choose those places that reflect
> internal occurrences that are useful but difficult to reverse-engineer
> from other visible interface points like system calls.  Data that
> helps answer questions like "Why did (subtle internal phenomenon)
> happen?" in a live system.
> 

I totally agree. And we need to do some work in the system call tracing
area as a starting point. That will help remove some unnecessary
instrumentation I have in LTTng.

> 
> > mm/ patch is full of "file %p". Do you realize that pointers are
> > only sometimes interesting and sometimes you want dentry (not pointer,
> > but name!):
> > 	printk("file = '%s'\n", file->f_dentry->d_name.name);
> 
> It may not be excessive to put both file and the dname.name as marker
> parameters.
> 

Eek, well, if we really want to identify a file, we need more than its
name: the mount point, full path and file name are required. It brings me a
few years back, but I don't think the dentry name gives us that. This is
why I extract information about all opened files into my tracer once
(mapping the mount point, path and file name to the file pointer) and then I
don't have to do the lookup each time the marker is encountered. Yes,
this involves dumping the file pointers at tracer start and keeping track
of open/close events.

> 
> > So, let me say even more explicitly. Looking at proposed places elite
> > enough to be blessed with marker...
> >
> > Those which are close enough to system call boundary are essentially
> > strace(1).
> 
> Those may not sound worthwhile to put a marker for, BUT, you're
> ignoring the huge differences of impact and scope.  A system-wide
> marker-based trace (filtered a la systemtap if desired) can be done
> with a tiny fraction of system load and none of the disruption caused
> by an strace of all the processes.
> 

I agree with both ;) Actually we need a low-overhead hook in
syscall_trace(), so we can perform efficient system-wide tracing of
system calls. I'll dig in this as soon as I find time.

Basic ideas :

- I already have the TIF_KERNEL_TRACE thread flag added to all
  architectures in another patchset.
- We add a function called on TIF_KERNEL_TRACE, from do_syscall_trace(),
  which is architecture-specific. It's basically a big switch() for all
  system calls. syscalls which takes similar types could be grouped
  together, but I don't think it would be useful at all. It might be
  better just to add a trace_mark for each so we extract the syscall
  fields in the marker string.
- We perform the page fault (caused by strings and structures) reads on
  the spot, because we prefer not to do this in atomic context.
- We put a marker, e.g., for x86_32, a pseudo-code like :

syscall_trace_enter()
{
  ...
  if (test_thread_flag(TIF_KERNEL_TRACE))
    do_marker_syscall_trace();
  ...
}

do_marker_syscall_trace()
{
  char *tmpbuf;

  switch(regs->orig_ax) {

  case SYS_OPEN:
    tmpbuf = vmalloc(4096); /* what size is needed ? */
    copy_from_user(tmpbuf, regs->bx);
    trace_mark(sys_open, "filename %p flags %d mode %d",
      tmpbuf, regs->cx, regs->dx);
    vfree(tmpbuf);
    break;
  }
}

Modulo some optimization, what do you think of this ? If someone is
willing to implement this, I can provide the patchset for
TIF_KERNEL_TRACE.
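
To get system-wide coverage, the tracer would then basically just flip
the flag on every thread, roughly like this (untested sketch,
TIF_KERNEL_TRACE comes from the separate patchset mentioned above) :

void syscall_trace_set_all(int enable)
{
	struct task_struct *g, *t;

	read_lock(&tasklist_lock);
	do_each_thread(g, t) {
		if (enable)
			set_tsk_thread_flag(t, TIF_KERNEL_TRACE);
		else
			clear_tsk_thread_flag(t, TIF_KERNEL_TRACE);
	} while_each_thread(g, t);
	read_unlock(&tasklist_lock);
}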

> 
> > [...]  Markers points won't be removed, only accumulated -- somebody
> > _might_ be interested in this information.
> 
> We all (data producers and consumers) need to use good judgment and
> accept moderate change.
> 
> 
> - FChE

Yes, and I see more and more that we need the in-kernel syscall tracing
infrastructure as a starting point. Then, the only markers left will
deal with useful inner-kernel information like scheduler change, vmalloc
memory allocation, and so on.

Mathieu


Markers condition

> 
> > Again, all events aren't interesting:
> >   if (file && file->f_op == &shm_file_operations)
> >     printk("%s: file = %p: START\n", __func__, file);
> > Can I write this with markers?
> 
> Of course, if you really want to.
> 

__func__ is not really interesting here, because you can name your
marker. A useful trick can be to use __builtin_return_address(0) when
needed though.

for the rest, a way that would not export too much information about
kernel's internals :

trace_mark(shm_start, "is_shm_fop %d file %p",
  file->f_op == &shm_file_operations, file);

And yes, there is a small runtime cost added here when the marker is
enabled: the test is done in the probe being called rather than at the spot.
On the other hand, the impact is nearly zero when the marker is
disabled.

If you really really want to, I could modify the markers to make this
even easier by doing something like :

trace_mark_cond(file->f_op == &shm_file_operations,
  shm_start, "file %p", file);

Where the first argument would be a supplementary condition tested in
the marker block. That would make the active marker case faster. How do
you like that ?

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: "Frank Ch. Eigler" <fche@redhat.com>
CC: Alexey Dobriyan <adobriyan@gmail.com>
CC: Ingo Molnar <mingo@elte.hu>
CC: akpm@linux-foundation.org
---
 include/linux/marker.h |   34 +++++++++++++++++++++++++++++-----
 1 file changed, 29 insertions(+), 5 deletions(-)

Index: linux-2.6-lttng/include/linux/marker.h
===================================================================
--- linux-2.6-lttng.orig/include/linux/marker.h	2008-04-25 08:16:21.000000000 -0400
+++ linux-2.6-lttng/include/linux/marker.h	2008-04-25 08:21:27.000000000 -0400
@@ -64,7 +64,7 @@ struct marker {
  * If generic is true, a variable read is used.
  * If generic is false, immediate values are used.
  */
-#define __trace_mark(generic, name, call_private, format, args...)	\
+#define __trace_mark(generic, cond, name, call_private, format, args...)\
 	do {								\
 		static const char __mstrtab_##name[]			\
 		__attribute__((section("__markers_strings")))		\
@@ -76,12 +76,12 @@ struct marker {
 		{ __mark_empty_function, NULL}, NULL };			\
 		__mark_check_format(format, ## args);			\
 		if (!generic) {						\
-			if (unlikely(imv_cond(__mark_##name.state)))	\
+			if (unlikely(imv_cond(__mark_##name.state) && (cond))) \
 				(*__mark_##name.call)			\
 					(&__mark_##name, call_private,	\
 					## args);			\
 		} else {						\
-			if (unlikely(_imv_read(__mark_##name.state)))	\
+			if (unlikely(_imv_read(__mark_##name.state) && (cond)))\
 				(*__mark_##name.call)			\
 					(&__mark_##name, call_private,	\
 					## args);			\
@@ -108,7 +108,7 @@ static inline void marker_update_probe_r
  * to be enabled when immediate values are present.
  */
 #define trace_mark(name, format, args...) \
-	__trace_mark(0, name, NULL, format, ## args)
+	__trace_mark(0, 1, name, NULL, format, ## args)
 
 /**
  * _trace_mark - Marker using variable read
@@ -122,7 +122,31 @@ static inline void marker_update_probe_r
  * lockdep, some traps, printk).
  */
 #define _trace_mark(name, format, args...) \
-	__trace_mark(1, name, NULL, format, ## args)
+	__trace_mark(1, 1, name, NULL, format, ## args)
+
+/**
+ * trace_mark_cond - Marker using code patching, testing a condition
+ * @cond: condition to test
+ * @name: marker name, not quoted.
+ * @format: format string
+ * @args...: variable argument list
+ *
+ * Like trace_mark(), but tests if cond is true to execute the trace mark.
+ */
+#define trace_mark_cond(cond, name, format, args...) \
+	__trace_mark(0, cond, name, NULL, format, ## args)
+
+/**
+ * _trace_mark_cond - Marker using variable read, testing a condition
+ * @cond: condition to test
+ * @name: marker name, not quoted.
+ * @format: format string
+ * @args...: variable argument list
+ *
+ * Like _trace_mark(), but tests if cond is true to execute the trace mark.
+ */
+#define _trace_mark_cond(cond, name, format, args...) \
+	__trace_mark(1, cond, name, NULL, format, ## args)
 
 /**
  * MARK_NOARGS - Format string for a marker with no argument.

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC] system-wide in-kernel syscall tracing
  2008-04-25 12:56       ` Mathieu Desnoyers
@ 2008-04-25 13:17         ` Mathieu Desnoyers
  2008-05-04 21:04         ` [patch 34/37] LTTng instrumentation ipc Alexey Dobriyan
  1 sibling, 0 replies; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-04-25 13:17 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: Alexey Dobriyan, akpm, Ingo Molnar, linux-kernel

* Mathieu Desnoyers (mathieu.desnoyers@polymtl.ca) wrote:
> > >
> > > Those which are close enough to system call boundary are essentially
> > > strace(1).
> > 
> > Those may not sound worthwhile to put a marker for, BUT, you're
> > ignoring the huge differences of impact and scope.  A system-wide
> > marker-based trace (filtered a la systemtap if desired) can be done
> > with a tiny fraction of system load and none of the disruption caused
> > by an strace of all the processes.
> > 
> 
> I agree with both ;) Actually we need a low-overhead hook in
> syscall_trace(), so we can perform efficient system-wide tracing of
> system calls. I'll dig in this as soon as I find time.
> 
> Basic ideas :
> 
> - I already have the TIF_KERNEL_TRACE thread flag added to all
>   architectures in another patchset.
> - We add a function called on TIF_KERNEL_TRACE, from do_syscall_trace(),
>   which is architecture-specific. It's basically a big switch() for all
>   system calls. syscalls which takes similar types could be grouped
>   together, but I don't think it would be useful at all. It might be
>   better just to add a trace_mark for each so we extract the syscall
>   fields in the marker string.
> - We perform the page fault (caused by strings and structures) reads on
>   the spot, because we prefer not to do this in atomic context.
> - We put a marker, e.g., for x86_32, a pseudo-code like :
> 
> syscall_trace_enter()
> {
>   ...
>   if (test_thread_flag(TIF_KERNEL_TRACE))
>     do_marker_syscall_trace();
>   ...
> }
> 
> do_marker_syscall_trace()
> {
>   char *tmpbuf;
> 
>   switch(regs->orig_ax) {
> 
>   case SYS_OPEN:
>     tmpbuf = vmalloc(4096); /* what size is needed ? */
>     copy_from_user(tmpbuf, regs->bx);
>     trace_mark(sys_open, "filename %p flags %d mode %d",

Actually, I meant :
     trace_mark(sys_open, "filename %s flags %d mode %d",

and it would be even better to pass the __user pointer directly to the
probe to eliminate the copy. I think this could be done by making sure
the memory is faulted in and locked when we call the trace_mark. It
could require thinking of a way to specify a special format string type
though, so that an automated tracer would know to use strncpy_from_user
and the like in atomic context instead of trying to dereference the
userspace pointer directly.
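
On the probe side, it could then look more or less like this (untested
sketch, glossing over the atomic-context issue mentioned above and over
where the copied string actually ends up) :

static void probe_sys_open(void *probe_data, void *call_data,
		const char *fmt, va_list *args)
{
	const char __user *ufilename = va_arg(*args, const char __user *);
	int flags = va_arg(*args, int);
	int mode = va_arg(*args, int);
	char buf[256];

	/* assumes the string was faulted in before the marker was hit */
	if (strncpy_from_user(buf, ufilename, sizeof(buf) - 1) < 0)
		buf[0] = '\0';
	buf[sizeof(buf) - 1] = '\0';

	/* hand (buf, flags, mode) over to the trace buffer here */
}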

Mathieu

>       tmpbuf, regs->cx, regs->dx);
>     vfree(tmpbuf);
>     break;
>   }
> }
> 
> Modulo some optimization, what do you think of this ? If someone is
> willing to implement this, I can provide the patchset for
> TIF_KERNEL_TRACE.
> 

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 13/37] Immediate Values - Architecture Independent Code
  2008-04-24 15:03 ` [patch 13/37] Immediate Values - Architecture Independent Code Mathieu Desnoyers
@ 2008-04-25 14:55   ` Ingo Molnar
  2008-04-26  9:36     ` Ingo Molnar
  0 siblings, 1 reply; 66+ messages in thread
From: Ingo Molnar @ 2008-04-25 14:55 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: akpm, linux-kernel, Jason Baron, Rusty Russell, Adrian Bunk,
	Andi Kleen, Christoph Hellwig, akpm


randconfig testing in sched-devel caught a build bug - fixed by the 
patch below.

	Ingo

-------------->
Subject: markers: build fix
From: Ingo Molnar <mingo@elte.hu>
Date: Fri Apr 25 16:36:15 CEST 2008

fix:

 include/linux/module.h:587: error: redefinition of 'module_imv_update'
 include/linux/immediate.h:74: error: previous definition of 'module_imv_update' was here

the prototype was unnecessary.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 include/linux/immediate.h |    1 -
 1 file changed, 1 deletion(-)

Index: linux/include/linux/immediate.h
===================================================================
--- linux.orig/include/linux/immediate.h
+++ linux/include/linux/immediate.h
@@ -71,7 +71,6 @@ extern void imv_unref(struct __imv *begi
 #define imv_set(name, i)		(name##__imv = (i))
 
 static inline void core_imv_update(void) { }
-static inline void module_imv_update(void) { }
 static inline void imv_unref_core_init(void) { }
 
 #endif

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 13/37] Immediate Values - Architecture Independent Code
  2008-04-25 14:55   ` Ingo Molnar
@ 2008-04-26  9:36     ` Ingo Molnar
  2008-04-26 11:09       ` Ingo Molnar
  0 siblings, 1 reply; 66+ messages in thread
From: Ingo Molnar @ 2008-04-26  9:36 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: akpm, linux-kernel, Jason Baron, Rusty Russell, Adrian Bunk,
	Andi Kleen, Christoph Hellwig, akpm


* Ingo Molnar <mingo@elte.hu> wrote:

> randconfig testing in sched-devel caught a build bug - fixed by the 
> patch below.

found another build failure - the better fix is the one below.

	Ingo

------------------------->
Subject: markers: fix2
From: Ingo Molnar <mingo@elte.hu>
Date: Sat Apr 26 11:19:00 CEST 2008

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 include/linux/module.h |   13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

Index: linux/include/linux/module.h
===================================================================
--- linux.orig/include/linux/module.h
+++ linux/include/linux/module.h
@@ -472,9 +472,6 @@ extern void print_modules(void);
 
 extern void module_update_markers(void);
 
-extern void _module_imv_update(void);
-extern void module_imv_update(void);
-
 #else /* !CONFIG_MODULES... */
 #define EXPORT_SYMBOL(sym)
 #define EXPORT_SYMBOL_GPL(sym)
@@ -579,15 +576,19 @@ static inline void module_update_markers
 {
 }
 
+#endif /* CONFIG_MODULES */
+
+#if defined(MODULES) && defined(CONFIG_IMMEDIATE)
+extern void _module_imv_update(void);
+extern void module_imv_update(void);
+#else
 static inline void _module_imv_update(void)
 {
 }
-
 static inline void module_imv_update(void)
 {
 }
-
-#endif /* CONFIG_MODULES */
+#endif
 
 struct device_driver;
 #ifdef CONFIG_SYSFS

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 13/37] Immediate Values - Architecture Independent Code
  2008-04-26  9:36     ` Ingo Molnar
@ 2008-04-26 11:09       ` Ingo Molnar
  2008-04-26 14:17         ` Mathieu Desnoyers
  0 siblings, 1 reply; 66+ messages in thread
From: Ingo Molnar @ 2008-04-26 11:09 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: akpm, linux-kernel, Jason Baron, Rusty Russell, Adrian Bunk,
	Andi Kleen, Christoph Hellwig, akpm


* Ingo Molnar <mingo@elte.hu> wrote:

> > randconfig testing in sched-devel caught a build bug - fixed by the 
> > patch below.
> 
> found another build failure - the better fix is the one below.

or rather the one below ...

------------>
Subject: markers: fix2
From: Ingo Molnar <mingo@elte.hu>
Date: Sat Apr 26 11:19:00 CEST 2008

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 include/linux/module.h |   13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

Index: linux/include/linux/module.h
===================================================================
--- linux.orig/include/linux/module.h
+++ linux/include/linux/module.h
@@ -472,9 +472,6 @@ extern void print_modules(void);
 
 extern void module_update_markers(void);
 
-extern void _module_imv_update(void);
-extern void module_imv_update(void);
-
 #else /* !CONFIG_MODULES... */
 #define EXPORT_SYMBOL(sym)
 #define EXPORT_SYMBOL_GPL(sym)
@@ -579,15 +576,19 @@ static inline void module_update_markers
 {
 }
 
+#endif /* CONFIG_MODULES */
+
+#if defined(CONFIG_MODULES) && defined(CONFIG_IMMEDIATE)
+extern void _module_imv_update(void);
+extern void module_imv_update(void);
+#else
 static inline void _module_imv_update(void)
 {
 }
-
 static inline void module_imv_update(void)
 {
 }
-
-#endif /* CONFIG_MODULES */
+#endif
 
 struct device_driver;
 #ifdef CONFIG_SYSFS

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 13/37] Immediate Values - Architecture Independent Code
  2008-04-26 11:09       ` Ingo Molnar
@ 2008-04-26 14:17         ` Mathieu Desnoyers
  0 siblings, 0 replies; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-04-26 14:17 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: akpm, linux-kernel, Jason Baron, Rusty Russell, Adrian Bunk,
	Andi Kleen, Christoph Hellwig, akpm

* Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Ingo Molnar <mingo@elte.hu> wrote:
> 
> > > randconfig testing in sched-devel caught a build bug - fixed by the 
> > > patch below.
> > 
> > found another build failure - the better fix is the one below.
> 
> or rather the one below ...
> 
> ------------>
> Subject: markers: fix2
> From: Ingo Molnar <mingo@elte.hu>
> Date: Sat Apr 26 11:19:00 CEST 2008
> 
> Signed-off-by: Ingo Molnar <mingo@elte.hu>

Good catch, thanks!

Acked-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>

> ---
>  include/linux/module.h |   13 +++++++------
>  1 file changed, 7 insertions(+), 6 deletions(-)
> 
> Index: linux/include/linux/module.h
> ===================================================================
> --- linux.orig/include/linux/module.h
> +++ linux/include/linux/module.h
> @@ -472,9 +472,6 @@ extern void print_modules(void);
>  
>  extern void module_update_markers(void);
>  
> -extern void _module_imv_update(void);
> -extern void module_imv_update(void);
> -
>  #else /* !CONFIG_MODULES... */
>  #define EXPORT_SYMBOL(sym)
>  #define EXPORT_SYMBOL_GPL(sym)
> @@ -579,15 +576,19 @@ static inline void module_update_markers
>  {
>  }
>  
> +#endif /* CONFIG_MODULES */
> +
> +#if defined(CONFIG_MODULES) && defined(CONFIG_IMMEDIATE)
> +extern void _module_imv_update(void);
> +extern void module_imv_update(void);
> +#else
>  static inline void _module_imv_update(void)
>  {
>  }
> -
>  static inline void module_imv_update(void)
>  {
>  }
> -
> -#endif /* CONFIG_MODULES */
> +#endif
>  
>  struct device_driver;
>  #ifdef CONFIG_SYSFS

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 00/37] Linux Kernel Markers instrumentation for sched-devel.git
  2008-04-24 15:03 [patch 00/37] Linux Kernel Markers instrumentation for sched-devel.git Mathieu Desnoyers
                   ` (36 preceding siblings ...)
  2008-04-24 15:04 ` [patch 37/37] LTTng instrumentation net Mathieu Desnoyers
@ 2008-04-26 19:38 ` Peter Zijlstra
  2008-04-26 20:11   ` Mathieu Desnoyers
                     ` (2 more replies)
  37 siblings, 3 replies; 66+ messages in thread
From: Peter Zijlstra @ 2008-04-26 19:38 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: akpm, Ingo Molnar, linux-kernel

On Thu, 2008-04-24 at 11:03 -0400, Mathieu Desnoyers wrote:
> Hi Ingo,
> 
> Here is a rather large patchset applying kernel instrumentation to
> sched-devel.git. It includes, mainly :

I saw this land in sched-devel, how about this:

---
Subject: sched: de-uglyfy marker impact

These trace_mark() things look like someone puked all over the code,
lets hide the ugly bits.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched.c       |   24 ++++++++----------------
 kernel/sched_trace.h |   41 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 49 insertions(+), 16 deletions(-)

Index: linux-2.6-2/kernel/sched.c
===================================================================
--- linux-2.6-2.orig/kernel/sched.c
+++ linux-2.6-2/kernel/sched.c
@@ -71,7 +71,6 @@
 #include <linux/debugfs.h>
 #include <linux/ctype.h>
 #include <linux/ftrace.h>
-#include <linux/marker.h>
 
 #include <asm/tlb.h>
 #include <asm/irq_regs.h>
@@ -745,6 +744,8 @@ static void update_rq_clock(struct rq *r
 #define task_rq(p)		cpu_rq(task_cpu(p))
 #define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
 
+#include "sched_trace.h"
+
 /*
  * Tunables that become constants when CONFIG_SCHED_DEBUG is off:
  */
@@ -2258,8 +2259,7 @@ void wait_task_inactive(struct task_stru
 		 * just go back and repeat.
 		 */
 		rq = task_rq_lock(p, &flags);
-		trace_mark(kernel_sched_wait_task, "pid %d state %ld",
-			p->pid, p->state);
+		trace_kernel_sched_wait(p);
 		running = task_running(rq, p);
 		on_rq = p->se.on_rq;
 		task_rq_unlock(rq, &flags);
@@ -2603,9 +2603,7 @@ out_activate:
 	success = 1;
 
 out_running:
-	trace_mark(kernel_sched_wakeup,
-		"pid %d state %ld ## rq %p task %p rq->curr %p",
-		p->pid, p->state, rq, p, rq->curr);
+	trace_kernel_sched_wakeup(rq, p);
 	check_preempt_curr(rq, p);
 
 	p->state = TASK_RUNNING;
@@ -2736,9 +2734,7 @@ void wake_up_new_task(struct task_struct
 		p->sched_class->task_new(rq, p);
 		inc_nr_running(rq);
 	}
-	trace_mark(kernel_sched_wakeup_new,
-		"pid %d state %ld ## rq %p task %p rq->curr %p",
-		p->pid, p->state, rq, p, rq->curr);
+	trace_kernel_sched_wakeup_new(rq, p);
 	check_preempt_curr(rq, p);
 #ifdef CONFIG_SMP
 	if (p->sched_class->task_wake_up)
@@ -2911,11 +2907,8 @@ context_switch(struct rq *rq, struct tas
 	struct mm_struct *mm, *oldmm;
 
 	prepare_task_switch(rq, prev, next);
-	trace_mark(kernel_sched_schedule,
-		"prev_pid %d next_pid %d prev_state %ld "
-		"## rq %p prev %p next %p",
-		prev->pid, next->pid, prev->state,
-		rq, prev, next);
+
+	trace_kernel_sched_switch(rq, prev, next);
 	mm = next->mm;
 	oldmm = prev->active_mm;
 	/*
@@ -3148,8 +3141,7 @@ static void sched_migrate_task(struct ta
 	    || unlikely(cpu_is_offline(dest_cpu)))
 		goto out;
 
-	trace_mark(kernel_sched_migrate_task, "pid %d state %ld dest_cpu %d",
-		p->pid, p->state, dest_cpu);
+	trace_kernel_sched_migrate_task(p, cpu_of(rq), dest_cpu);
 	/* force the process onto the specified CPU */
 	if (migrate_task(p, dest_cpu, &req)) {
 		/* Need to wait for migration thread (might exit: take ref). */
Index: linux-2.6-2/kernel/sched_trace.h
===================================================================
--- /dev/null
+++ linux-2.6-2/kernel/sched_trace.h
@@ -0,0 +1,41 @@
+#include <linux/marker.h>
+
+static inline void trace_kernel_sched_wait(struct task_struct *p)
+{
+	trace_mark(kernel_sched_wait_task, "pid %d state %ld",
+			p->pid, p->state);
+}
+
+static inline
+void trace_kernel_sched_wakeup(struct rq *rq, struct task_struct *p)
+{
+	trace_mark(kernel_sched_wakeup,
+			"pid %d state %ld ## rq %p task %p rq->curr %p",
+			p->pid, p->state, rq, p, rq->curr);
+}
+
+static inline
+void trace_kernel_sched_wakeup_new(struct rq *rq, struct task_struct *p)
+{
+	trace_mark(kernel_sched_wakeup_new,
+			"pid %d state %ld ## rq %p task %p rq->curr %p",
+			p->pid, p->state, rq, p, rq->curr);
+}
+
+static inline void trace_kernel_sched_switch(struct rq *rq,
+		struct task_struct *prev, struct task_struct *next)
+{
+	trace_mark(kernel_sched_schedule,
+			"prev_pid %d next_pid %d prev_state %ld "
+			"## rq %p prev %p next %p",
+			prev->pid, next->pid, prev->state,
+			rq, prev, next);
+}
+
+static inline void
+trace_kernel_sched_migrate_task(struct task_struct *p, int src, int dst)
+{
+	trace_mark(kernel_sched_migrate_task,
+			"pid %d state %ld dest_cpu %d",
+			p->pid, p->state, dst);
+}



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 00/37] Linux Kernel Markers instrumentation for sched-devel.git
  2008-04-26 19:38 ` [patch 00/37] Linux Kernel Markers instrumentation for sched-devel.git Peter Zijlstra
@ 2008-04-26 20:11   ` Mathieu Desnoyers
  2008-04-27 10:00     ` Peter Zijlstra
  2008-04-28 18:22   ` Ingo Molnar
  2008-04-28 18:36   ` Andrew Morton
  2 siblings, 1 reply; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-04-26 20:11 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: akpm, Ingo Molnar, linux-kernel

* Peter Zijlstra (a.p.zijlstra@chello.nl) wrote:
> On Thu, 2008-04-24 at 11:03 -0400, Mathieu Desnoyers wrote:
> > Hi Ingo,
> > 
> > Here is a rather large patchset applying kernel instrumentation to
> > sched-devel.git. It includes, mainly :
> 
> I saw this land in sched-devel, how about this:
> 

I think the main reason why those markers are ugly is they have
information meant for use by a general purpose tracer (it will only take
the parameters before the ## signs).

The other way around is to start specializing the general purpose tracer
to extract the information it needs from the task and rq pointers and
put that in the traces. Actually, that's the approach I currently use
in LTTng, but Ingo seemed interested to have the union of parameters
needed by specialized and GP tracers, so this is what I did.

The only thing I dislike with the approach of putting everything in a
separate header is that it adds a layer of indirection when one tries
to quickly grep for trace_mark() in the kernel tree to see where the
conceptual tracing points are. Therefore, it might be interesting to
simply remove the parameters meant for the general purpose tracer and
let this GP tracer create specialized probes for these instrumentation
sites. It would therefore remove the ugliness without creating a
supplementary indirection layer.
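
Concretely, assuming the marker passed only the rq, prev and next
pointers (in that order), the GP tracer could register a specialized
probe for this site and derive the scalar fields itself. A rough,
untested sketch (record_switch() is a placeholder for the tracer's
output path) :

static void probe_sched_switch(void *probe_data, void *call_data,
		const char *fmt, va_list *args)
{
	struct rq *rq = va_arg(*args, struct rq *);
	struct task_struct *prev = va_arg(*args, struct task_struct *);
	struct task_struct *next = va_arg(*args, struct task_struct *);

	record_switch(prev->pid, next->pid, prev->state);
	(void)rq;	/* opaque outside sched.c, kept for probes that need it */
}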

Mathieu

> ---
> Subject: sched: de-uglyfy marker impact
> 
> These trace_mark() things look like someone puked all over the code,
> lets hide the ugly bits.
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  kernel/sched.c       |   24 ++++++++----------------
>  kernel/sched_trace.h |   41 +++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 49 insertions(+), 16 deletions(-)
> 
> Index: linux-2.6-2/kernel/sched.c
> ===================================================================
> --- linux-2.6-2.orig/kernel/sched.c
> +++ linux-2.6-2/kernel/sched.c
> @@ -71,7 +71,6 @@
>  #include <linux/debugfs.h>
>  #include <linux/ctype.h>
>  #include <linux/ftrace.h>
> -#include <linux/marker.h>
>  
>  #include <asm/tlb.h>
>  #include <asm/irq_regs.h>
> @@ -745,6 +744,8 @@ static void update_rq_clock(struct rq *r
>  #define task_rq(p)		cpu_rq(task_cpu(p))
>  #define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
>  
> +#include "sched_trace.h"
> +
>  /*
>   * Tunables that become constants when CONFIG_SCHED_DEBUG is off:
>   */
> @@ -2258,8 +2259,7 @@ void wait_task_inactive(struct task_stru
>  		 * just go back and repeat.
>  		 */
>  		rq = task_rq_lock(p, &flags);
> -		trace_mark(kernel_sched_wait_task, "pid %d state %ld",
> -			p->pid, p->state);
> +		trace_kernel_sched_wait(p);
>  		running = task_running(rq, p);
>  		on_rq = p->se.on_rq;
>  		task_rq_unlock(rq, &flags);
> @@ -2603,9 +2603,7 @@ out_activate:
>  	success = 1;
>  
>  out_running:
> -	trace_mark(kernel_sched_wakeup,
> -		"pid %d state %ld ## rq %p task %p rq->curr %p",
> -		p->pid, p->state, rq, p, rq->curr);
> +	trace_kernel_sched_wakeup(rq, p);
>  	check_preempt_curr(rq, p);
>  
>  	p->state = TASK_RUNNING;
> @@ -2736,9 +2734,7 @@ void wake_up_new_task(struct task_struct
>  		p->sched_class->task_new(rq, p);
>  		inc_nr_running(rq);
>  	}
> -	trace_mark(kernel_sched_wakeup_new,
> -		"pid %d state %ld ## rq %p task %p rq->curr %p",
> -		p->pid, p->state, rq, p, rq->curr);
> +	trace_kernel_sched_wakeup_new(rq, p);
>  	check_preempt_curr(rq, p);
>  #ifdef CONFIG_SMP
>  	if (p->sched_class->task_wake_up)
> @@ -2911,11 +2907,8 @@ context_switch(struct rq *rq, struct tas
>  	struct mm_struct *mm, *oldmm;
>  
>  	prepare_task_switch(rq, prev, next);
> -	trace_mark(kernel_sched_schedule,
> -		"prev_pid %d next_pid %d prev_state %ld "
> -		"## rq %p prev %p next %p",
> -		prev->pid, next->pid, prev->state,
> -		rq, prev, next);
> +
> +	trace_kernel_sched_switch(rq, prev, next);
>  	mm = next->mm;
>  	oldmm = prev->active_mm;
>  	/*
> @@ -3148,8 +3141,7 @@ static void sched_migrate_task(struct ta
>  	    || unlikely(cpu_is_offline(dest_cpu)))
>  		goto out;
>  
> -	trace_mark(kernel_sched_migrate_task, "pid %d state %ld dest_cpu %d",
> -		p->pid, p->state, dest_cpu);
> +	trace_kernel_sched_migrate_task(p, cpu_of(rq), dest_cpu);
>  	/* force the process onto the specified CPU */
>  	if (migrate_task(p, dest_cpu, &req)) {
>  		/* Need to wait for migration thread (might exit: take ref). */
> Index: linux-2.6-2/kernel/sched_trace.h
> ===================================================================
> --- /dev/null
> +++ linux-2.6-2/kernel/sched_trace.h
> @@ -0,0 +1,41 @@
> +#include <linux/marker.h>
> +
> +static inline void trace_kernel_sched_wait(struct task_struct *p)
> +{
> +	trace_mark(kernel_sched_wait_task, "pid %d state %ld",
> +			p->pid, p->state);
> +}
> +
> +static inline
> +void trace_kernel_sched_wakeup(struct rq *rq, struct task_struct *p)
> +{
> +	trace_mark(kernel_sched_wakeup,
> +			"pid %d state %ld ## rq %p task %p rq->curr %p",
> +			p->pid, p->state, rq, p, rq->curr);
> +}
> +
> +static inline
> +void trace_kernel_sched_wakeup_new(struct rq *rq, struct task_struct *p)
> +{
> +	trace_mark(kernel_sched_wakeup_new,
> +			"pid %d state %ld ## rq %p task %p rq->curr %p",
> +			p->pid, p->state, rq, p, rq->curr);
> +}
> +
> +static inline void trace_kernel_sched_switch(struct rq *rq,
> +		struct task_struct *prev, struct task_struct *next)
> +{
> +	trace_mark(kernel_sched_schedule,
> +			"prev_pid %d next_pid %d prev_state %ld "
> +			"## rq %p prev %p next %p",
> +			prev->pid, next->pid, prev->state,
> +			rq, prev, next);
> +}
> +
> +static inline void
> +trace_kernel_sched_migrate_task(struct task_struct *p, int src, int dst)
> +{
> +	trace_mark(kernel_sched_migrate_task,
> +			"pid %d state %ld dest_cpu %d",
> +			p->pid, p->state, dst);
> +}
> 
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 00/37] Linux Kernel Markers instrumentation for sched-devel.git
  2008-04-26 20:11   ` Mathieu Desnoyers
@ 2008-04-27 10:00     ` Peter Zijlstra
  2008-05-04 15:08       ` Mathieu Desnoyers
  0 siblings, 1 reply; 66+ messages in thread
From: Peter Zijlstra @ 2008-04-27 10:00 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: akpm, Ingo Molnar, linux-kernel

On Sat, 2008-04-26 at 16:11 -0400, Mathieu Desnoyers wrote:
> * Peter Zijlstra (a.p.zijlstra@chello.nl) wrote:
> > On Thu, 2008-04-24 at 11:03 -0400, Mathieu Desnoyers wrote:
> > > Hi Ingo,
> > > 
> > > Here is a rather large patchset applying kernel instrumentation to
> > > sched-devel.git. It includes, mainly :
> > 
> > I saw this land in sched-devel, how about this:
> > 
> 
> I think the main reason why those markers are ugly is they have
> information meant for use by a general purpose tracer (it will only take
> the parameters before the ## signs).
> 
> The other way around is to start specializing the general purpose tracer
> to extract the information it needs from the task and rq pointers and
> put that in the traces. Actually, that's the approach I currently use
> in LTTng, but Ingo seemed interested to have the union of parameters
> needed by specialized and GP tracers, so this is what I did.
> 
> The only thing I dislike with the approach of putting everything in a
> separate header is that it adds a layer of indirection when one tries
> to quickly grep for trace_mark() in the kernel tree to see where the
> conceptual tracing points are. Therefore, it might be interesting to
> simply remove the parameters meant for the general purpose tracer and
> let this GP tracer create specialized probes for these instrumentation
> sites. It would therefore remove the ugliness without creating a
> supplementary indirection layer.

In part it's the extra parameters, but to a large part it's that darn
string. I'm still not getting why we can't just live with trace marks
like 'regular' functions much like the ones I used to wrap trace_mark().

> > Index: linux-2.6-2/kernel/sched_trace.h
> > ===================================================================
> > --- /dev/null
> > +++ linux-2.6-2/kernel/sched_trace.h
> > @@ -0,0 +1,41 @@
> > +#include <linux/marker.h>
> > +
> > +static inline void trace_kernel_sched_wait(struct task_struct *p)
> > +{
> > +	trace_mark(kernel_sched_wait_task, "pid %d state %ld",
> > +			p->pid, p->state);
> > +}
> > +
> > +static inline
> > +void trace_kernel_sched_wakeup(struct rq *rq, struct task_struct *p)
> > +{
> > +	trace_mark(kernel_sched_wakeup,
> > +			"pid %d state %ld ## rq %p task %p rq->curr %p",
> > +			p->pid, p->state, rq, p, rq->curr);
> > +}
> > +
> > +static inline
> > +void trace_kernel_sched_wakeup_new(struct rq *rq, struct task_struct *p)
> > +{
> > +	trace_mark(kernel_sched_wakeup_new,
> > +			"pid %d state %ld ## rq %p task %p rq->curr %p",
> > +			p->pid, p->state, rq, p, rq->curr);
> > +}
> > +
> > +static inline void trace_kernel_sched_switch(struct rq *rq,
> > +		struct task_struct *prev, struct task_struct *next)
> > +{
> > +	trace_mark(kernel_sched_schedule,
> > +			"prev_pid %d next_pid %d prev_state %ld "
> > +			"## rq %p prev %p next %p",
> > +			prev->pid, next->pid, prev->state,
> > +			rq, prev, next);
> > +}
> > +
> > +static inline void
> > +trace_kernel_sched_migrate_task(struct task_struct *p, int src, int dst)
> > +{
> > +	trace_mark(kernel_sched_migrate_task,
> > +			"pid %d state %ld dest_cpu %d",
> > +			p->pid, p->state, dst);
> > +}

One advantage of having these things close together would be that it
stimulates consistency between the various trace points, something that
would be sorely needed with such a free-form mechanism.

Peter


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 36/37] LTTng instrumentation mm
  2008-04-24 15:04 ` [patch 36/37] LTTng instrumentation mm Mathieu Desnoyers
@ 2008-04-28  2:12   ` Masami Hiramatsu
  0 siblings, 0 replies; 66+ messages in thread
From: Masami Hiramatsu @ 2008-04-28  2:12 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: akpm, Ingo Molnar, linux-kernel, linux-mm, Dave Hansen

Hi Mathieu,

Mathieu Desnoyers wrote:
> @@ -1844,3 +1848,22 @@ int valid_swaphandles(swp_entry_t entry,
>  	*offset = ++toff;
>  	return nr_pages? ++nr_pages: 0;
>  }
> +
> +void ltt_dump_swap_files(void *call_data)
> +{
> +	int type;
> +	struct swap_info_struct *p = NULL;
> +
> +	mutex_lock(&swapon_mutex);
> +	for (type = swap_list.head; type >= 0; type = swap_info[type].next) {
> +		p = swap_info + type;
> +		if ((p->flags & SWP_ACTIVE) != SWP_ACTIVE)
> +			continue;
> +		__trace_mark(0, statedump_swap_files, call_data,
> +			"filp %p vfsmount %p dname %s",
> +			p->swap_file, p->swap_file->f_vfsmnt,
> +			p->swap_file->f_dentry->d_name.name);
> +	}
> +	mutex_unlock(&swapon_mutex);
> +}
> +EXPORT_SYMBOL_GPL(ltt_dump_swap_files);


I'm not sure this kind of function can be acceptable.
IMHO, you'd better use a more generic method (e.g. a callback function),
or just export swap_list and swapon_mutex, so that other subsystems can
use that interface.

Thank you,

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 27/37] From: Adrian Bunk <bunk@kernel.org>
  2008-04-24 15:03 ` [patch 27/37] From: Adrian Bunk <bunk@kernel.org> Mathieu Desnoyers
@ 2008-04-28  9:54   ` Adrian Bunk
  2008-04-28 12:37     ` Mathieu Desnoyers
  0 siblings, 1 reply; 66+ messages in thread
From: Adrian Bunk @ 2008-04-28  9:54 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: akpm, Ingo Molnar, linux-kernel

Something went wrong with the Subject.

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 27/37] From: Adrian Bunk <bunk@kernel.org>
  2008-04-28  9:54   ` Adrian Bunk
@ 2008-04-28 12:37     ` Mathieu Desnoyers
  0 siblings, 0 replies; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-04-28 12:37 UTC (permalink / raw)
  To: Adrian Bunk; +Cc: akpm, Ingo Molnar, linux-kernel

* Adrian Bunk (bunk@kernel.org) wrote:
> Something went wrong with the Subject.
> 

Yes, the patch should look like this. Thanks!

Mathieu


make marker_debug static

With the needlessly global marker_debug being static gcc can optimize the
unused code away.

From: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 kernel/marker.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff -puN kernel/marker.c~make-marker_debug-static kernel/marker.c
--- a/kernel/marker.c~make-marker_debug-static
+++ a/kernel/marker.c
@@ -28,7 +28,7 @@ extern struct marker __start___markers[]
 extern struct marker __stop___markers[];
 
 /* Set to 1 to enable marker debug output */
-const int marker_debug;
+static const int marker_debug;
 
 /*
  * markers_mutex nests inside module_mutex. Markers mutex protects the builtin
_

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 00/37] Linux Kernel Markers instrumentation for sched-devel.git
  2008-04-26 19:38 ` [patch 00/37] Linux Kernel Markers instrumentation for sched-devel.git Peter Zijlstra
  2008-04-26 20:11   ` Mathieu Desnoyers
@ 2008-04-28 18:22   ` Ingo Molnar
  2008-04-28 18:36   ` Andrew Morton
  2 siblings, 0 replies; 66+ messages in thread
From: Ingo Molnar @ 2008-04-28 18:22 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Mathieu Desnoyers, akpm, linux-kernel


* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> Subject: sched: de-uglyfy marker impact

>  		rq = task_rq_lock(p, &flags);
> -		trace_mark(kernel_sched_wait_task, "pid %d state %ld",
> -			p->pid, p->state);
> +		trace_kernel_sched_wait(p);

that does look more compact and more maintainable indeed. (and should 
not impact tracers in any way - we still get the very same marker) Does 
this make scheduler markers more acceptable to you? If yes, I'll queue 
this up.

	Ingo

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 00/37] Linux Kernel Markers instrumentation for sched-devel.git
  2008-04-26 19:38 ` [patch 00/37] Linux Kernel Markers instrumentation for sched-devel.git Peter Zijlstra
  2008-04-26 20:11   ` Mathieu Desnoyers
  2008-04-28 18:22   ` Ingo Molnar
@ 2008-04-28 18:36   ` Andrew Morton
  2008-04-28 18:40     ` Christoph Hellwig
  2008-04-28 19:52     ` Peter Zijlstra
  2 siblings, 2 replies; 66+ messages in thread
From: Andrew Morton @ 2008-04-28 18:36 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Mathieu Desnoyers, Ingo Molnar, linux-kernel

On Sat, 26 Apr 2008 21:38:54 +0200 Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Thu, 2008-04-24 at 11:03 -0400, Mathieu Desnoyers wrote:
> > Hi Ingo,
> > 
> > Here is a rather large patchset applying kernel instrumentation to
> > sched-devel.git. It includes, mainly :
> 
> I saw this land in sched-devel, how about this:
> 
> ---
> Subject: sched: de-uglyfy marker impact
> 
> These trace_mark() things look like someone puked all over the code,

lol.

> lets hide the ugly bits.

It hides the cosmetically-ugly bits, but not the deeply ugly: each of these
trace points is an extension to the kernel->userspace API, with all that
this implies.

> +static inline
> +void trace_kernel_sched_wakeup(struct rq *rq, struct task_struct *p)

When doing this please put the newline immediately preceding the function
name.  Putting it between the `inline' and the return-type-declaration is
weird.



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 00/37] Linux Kernel Markers instrumentation for sched-devel.git
  2008-04-28 18:36   ` Andrew Morton
@ 2008-04-28 18:40     ` Christoph Hellwig
  2008-04-28 18:47       ` Andrew Morton
  2008-04-28 19:52     ` Peter Zijlstra
  1 sibling, 1 reply; 66+ messages in thread
From: Christoph Hellwig @ 2008-04-28 18:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Peter Zijlstra, Mathieu Desnoyers, Ingo Molnar, linux-kernel

On Mon, Apr 28, 2008 at 11:36:10AM -0700, Andrew Morton wrote:
> It hides the cosmetically-ugly bits, but not the deeply ugly: each of these
> trace points is an extension to the kernel->userspace API, with all that
> this implies.

Not at all.  It's only accessible to kernel code, so it per definition
can't be a userspace API.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 00/37] Linux Kernel Markers instrumentation for sched-devel.git
  2008-04-28 18:40     ` Christoph Hellwig
@ 2008-04-28 18:47       ` Andrew Morton
  2008-04-28 18:49         ` Christoph Hellwig
  2008-04-28 19:01         ` KOSAKI Motohiro
  0 siblings, 2 replies; 66+ messages in thread
From: Andrew Morton @ 2008-04-28 18:47 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Peter Zijlstra, Mathieu Desnoyers, Ingo Molnar, linux-kernel

On Mon, 28 Apr 2008 14:40:51 -0400 Christoph Hellwig <hch@infradead.org> wrote:

> On Mon, Apr 28, 2008 at 11:36:10AM -0700, Andrew Morton wrote:
> > It hides the cosmetically-ugly bits, but not the deeply ugly: each of these
> > trace points is an extension to the kernel->userspace API, with all that
> > this implies.
> 
> Not at all.  It's only accessible to kernel code, so it per definition
> can't be a userspace API.

eh?  It adds human-readable printk strings.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 00/37] Linux Kernel Markers instrumentation for sched-devel.git
  2008-04-28 18:47       ` Andrew Morton
@ 2008-04-28 18:49         ` Christoph Hellwig
  2008-04-28 19:01         ` KOSAKI Motohiro
  1 sibling, 0 replies; 66+ messages in thread
From: Christoph Hellwig @ 2008-04-28 18:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Hellwig, Peter Zijlstra, Mathieu Desnoyers,
	Ingo Molnar, linux-kernel

On Mon, Apr 28, 2008 at 11:47:58AM -0700, Andrew Morton wrote:
> > On Mon, Apr 28, 2008 at 11:36:10AM -0700, Andrew Morton wrote:
> > > It hides the cosmetically-ugly bits, but not the deeply ugly: each of these
> > > trace points is an extension to the kernel->userspace API, with all that
> > > this implies.
> > 
> > Not at all.  It's only accessible to kernel code, so it per definition
> > can't be a userspace API.
> 
> eh?  It adds human-readable printk strings.

It looks like a printk string but it isn't.  You need a kernel module
to actually connect to it and do the tracing.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 00/37] Linux Kernel Markers instrumentation for sched-devel.git
  2008-04-28 18:47       ` Andrew Morton
  2008-04-28 18:49         ` Christoph Hellwig
@ 2008-04-28 19:01         ` KOSAKI Motohiro
  1 sibling, 0 replies; 66+ messages in thread
From: KOSAKI Motohiro @ 2008-04-28 19:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Hellwig, Peter Zijlstra, Mathieu Desnoyers,
	Ingo Molnar, linux-kernel

>  > > It hides the cosmetically-ugly bits, but not the deeply ugly: each of these
>  > > trace points is an extension to the kernel->userspace API, with all that
>  > > this implies.
>  >
>  > Not at all.  It's only accessible to kernel code, so it per definition
>  > can't be a userspace API.
>
>  eh?  It adds human-readable printk strings.

Yes, human-readable.
But I think trace points shouldn't be treated as a kernel API,
because that would make it impossible to show implementation details.

That would decrease the value of trace points.

For example, nobody expects the crash command to hide implementation details;
I hope trace points are treated the same way.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 00/37] Linux Kernel Markers instrumentation for sched-devel.git
  2008-04-28 18:36   ` Andrew Morton
  2008-04-28 18:40     ` Christoph Hellwig
@ 2008-04-28 19:52     ` Peter Zijlstra
  2008-04-28 22:25       ` Masami Hiramatsu
  1 sibling, 1 reply; 66+ messages in thread
From: Peter Zijlstra @ 2008-04-28 19:52 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Mathieu Desnoyers, Ingo Molnar, linux-kernel

On Mon, 2008-04-28 at 11:36 -0700, Andrew Morton wrote:
> On Sat, 26 Apr 2008 21:38:54 +0200 Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> 
> > On Thu, 2008-04-24 at 11:03 -0400, Mathieu Desnoyers wrote:
> > > Hi Ingo,
> > > 
> > > Here is a rather large patchset applying kernel instrumentation to
> > > sched-devel.git. It includes, mainly :
> > 
> > I saw this land in sched-devel, how about this:
> > 
> > ---
> > Subject: sched: de-uglyfy marker impact
> > 
> > These trace_mark() things look like someone puked all over the code,
> 
> lol.

Glad to lighten your day a little ;-)

> > lets hide the ugly bits.
> 
> It hides the cosmetically-ugly bits, but not the deeply ugly: each of these
> trace points is an extension to the kernel->userspace API, with all that
> this implies.

Agreed, and I'm rather concerned about that as well. OTOH it's very
unlikely we'll ever have a Linux that will not have a context switch or
a task wakeup operation.

So tracing these and things like syscall seem safe enough to do -
although I wish it wouldn't look so ugly. 

As for some of these other trace points in this set, dubious.

We can of course clearly state that any marker is free of API
constraints and users will have to cope with them changing. But I'm not
sure that's a realistic position.

> > +static inline
> > +void trace_kernel_sched_wakeup(struct rq *rq, struct task_struct *p)
> 
> When doing this please put the newline immediately preceding the function
> name.  Putting it between the `inline' and the return-type-declaration is
> weird.

Sure. Just to clarify my rationale, I like to keep the return type on
the same line when possible so you get a complete picture - the static
and inline qualifiers seem less important, but I don't particularly
care.
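
For reference, the two layouts under discussion look like this (declarations
only, bodies omitted; purely an illustration of the style question, not code
from the patch):

	/* what Andrew asks for: newline immediately before the function name */
	static inline void
	trace_kernel_sched_wakeup(struct rq *rq, struct task_struct *p);

	/* what the posted header does: newline between the qualifiers and
	 * the return type */
	static inline
	void trace_kernel_sched_wakeup(struct rq *rq, struct task_struct *p);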




^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 00/37] Linux Kernel Markers instrumentation for sched-devel.git
  2008-04-28 19:52     ` Peter Zijlstra
@ 2008-04-28 22:25       ` Masami Hiramatsu
  0 siblings, 0 replies; 66+ messages in thread
From: Masami Hiramatsu @ 2008-04-28 22:25 UTC (permalink / raw)
  To: Peter Zijlstra, Mathieu Desnoyers
  Cc: Andrew Morton, Ingo Molnar, linux-kernel

Peter Zijlstra wrote:
> On Mon, 2008-04-28 at 11:36 -0700, Andrew Morton wrote:
>> On Sat, 26 Apr 2008 21:38:54 +0200 Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>>> lets hide the ugly bits.
>> It hides the cosmetically-ugly bits, but not the deeply ugly: each of these
>> trace points is an extension to the kernel->userspace API, with all that
>> this implies.
> 
> Agreed, and I'm rather concerned about that as well. OTOH its very
> unlikely we'll ever have a Linux that will not have a context switch, or
> task wakeup operation.
> 
> So tracing these and things like syscall seem safe enough to do -
> although I wish it wouldn't look so ugly. 

What would you think about basic hardware events like
interrupts and exceptions? :-)

> As for some of these other trace points in this set, dubious.
> 
> We can of course clearly state that any marker is free of API
> constraints and users will have to cope with them changing. But I'm not
> sure that's a realistic position.


BTW, I also have a question about the maintenance policy of markers.
Who will pay the cost of updating (maintaining) those trace points as
the logic of the kernel changes?

I think each developer who modifies the kernel has to fix the trace
points just enough to remove compile errors. They can (but don't need
to) leave, update or remove the trace points to fit their changes,
because they know their changes precisely, but they don't know why the
trace points are there and what information is required.
So trace points should basically be maintained by trace point
maintainers who know all about them.
Is that right?

Thanks,

Best regards,

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 00/37] Linux Kernel Markers instrumentation for sched-devel.git
  2008-04-27 10:00     ` Peter Zijlstra
@ 2008-05-04 15:08       ` Mathieu Desnoyers
  0 siblings, 0 replies; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-05-04 15:08 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: akpm, Ingo Molnar, linux-kernel

* Peter Zijlstra (a.p.zijlstra@chello.nl) wrote:
> On Sat, 2008-04-26 at 16:11 -0400, Mathieu Desnoyers wrote:
> > * Peter Zijlstra (a.p.zijlstra@chello.nl) wrote:
> > > On Thu, 2008-04-24 at 11:03 -0400, Mathieu Desnoyers wrote:
> > > > Hi Ingo,
> > > > 
> > > > Here is a rather large patchset applying kernel instrumentation to
> > > > sched-devel.git. It includes, mainly :
> > > 
> > > I saw this land in sched-devel, how about this:
> > > 
> > 
> > I think the main reason why those markers are ugly is that they have
> > information meant for use by a general purpose tracer (it will only take
> > the parameters before the ## signs).
> > 
> > The other way around is to start specializing the general purpose tracer
> > to extract the information it needs from the task and rq pointers and
> > put that in the traces. Actually, that's the approach I currently use
> > in LTTng, but Ingo seemed interested in having the union of the parameters
> > needed by specialized and GP tracers, so this is what I did.
> > 
> > The only thing I dislike about the approach of putting everything in a
> > separate header is that it adds a layer of indirection when one tries
> > to quickly grep for trace_mark() in the kernel tree to see where the
> > conceptual tracing points are. Therefore, it might be interesting to
> > simply remove the parameters meant for the general purpose tracer and
> > let this GP tracer create specialized probes for these instrumentation
> > sites. It would therefore remove the ugliness without creating a
> > supplementary indirection layer.
> 
> In part it's the extra parameters, but to a large part it's that darn
> string. I'm still not getting why we can't just live with trace marks
> that look like 'regular' functions, much like the ones I used to wrap
> trace_mark().
> 
> > > Index: linux-2.6-2/kernel/sched_trace.h
> > > ===================================================================
> > > --- /dev/null
> > > +++ linux-2.6-2/kernel/sched_trace.h
> > > @@ -0,0 +1,41 @@
> > > +#include <linux/marker.h>
> > > +
> > > +static inline void trace_kernel_sched_wait(struct task_struct *p)
> > > +{
> > > +	trace_mark(kernel_sched_wait_task, "pid %d state %ld",
> > > +			p->pid, p->state);
> > > +}
> > > +
> > > +static inline
> > > +void trace_kernel_sched_wakeup(struct rq *rq, struct task_struct *p)
> > > +{
> > > +	trace_mark(kernel_sched_wakeup,
> > > +			"pid %d state %ld ## rq %p task %p rq->curr %p",
> > > +			p->pid, p->state, rq, p, rq->curr);
> > > +}
> > > +
> > > +static inline
> > > +void trace_kernel_sched_wakeup_new(struct rq *rq, struct task_struct *p)
> > > +{
> > > +	trace_mark(kernel_sched_wakeup_new,
> > > +			"pid %d state %ld ## rq %p task %p rq->curr %p",
> > > +			p->pid, p->state, rq, p, rq->curr);
> > > +}
> > > +
> > > +static inline void trace_kernel_sched_switch(struct rq *rq,
> > > +		struct task_struct *prev, struct task_struct *next)
> > > +{
> > > +	trace_mark(kernel_sched_schedule,
> > > +			"prev_pid %d next_pid %d prev_state %ld "
> > > +			"## rq %p prev %p next %p",
> > > +			prev->pid, next->pid, prev->state,
> > > +			rq, prev, next);
> > > +}
> > > +
> > > +static inline void
> > > +trace_kernel_sched_migrate_task(struct task_struct *p, int src, int dst)
> > > +{
> > > +	trace_mark(kernel_sched_migrate_task,
> > > +			"pid %d state %ld dest_cpu %d",
> > > +			p->pid, p->state, dst);
> > > +}
> 
> One advantage of having these things close together would be that it
> stimulates consistency between the various trace points. Something that
> would be sorely needed with such a free-form mechanism.
> 

I was thinking of the possibility of a per-subsystem list of
instrumentation sites, which would become more or less fixed anyway,
and which is what you seem to propose here. Given that it also cleans up
the scheduler code, I think it's worth doing.

Do you have an updated version following Andrew's comments that I could
Ack? :)

Mathieu


> Peter
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 34/37] LTTng instrumentation ipc
  2008-05-04 21:04         ` [patch 34/37] LTTng instrumentation ipc Alexey Dobriyan
@ 2008-05-04 20:39           ` Frank Ch. Eigler
  0 siblings, 0 replies; 66+ messages in thread
From: Frank Ch. Eigler @ 2008-05-04 20:39 UTC (permalink / raw)
  To: Alexey Dobriyan; +Cc: Mathieu Desnoyers, akpm, Ingo Molnar, linux-kernel

Hi -

On Mon, May 05, 2008 at 01:04:33AM +0400, Alexey Dobriyan wrote:
> [...]
> > For the rest, a way that would not export too much information about
> > the kernel's internals:
> > 
> > trace_mark(shm_start, "is_shm_fop %d file %p",
> >   file->f_op == &shm_file_operations, file);
> 
> And this is totally unexpected from you.
> 		In the name of a bug-free kernel,
> 		  I DO WANT KERNEL INTERNALS!
> And I know perfectly well which ones, and at which spots!

You as (or having the ear of) a subsystem maintainer can make that
judgement.  If your marker is more of a low-level field diagnostic
sort of thing, feel free to pass the most appropriate values - even
binary dumps of things.  An end user will not normally attach to those
markers.

> [...]
> And finally, systemtap.

(Thanks for trying it.)

> Reading some systemtap docs,
> http://sourceware.org/systemtap/wiki/UsingMarkers
> where examples of intelligent filtering are shown:
> 
> 	function inode_get_i_ino:long (i:long) %{ /* pure */
> 		struct inode *inode = (struct inode *)(long)THIS->i;
> 		THIS->__retvalue = kread(&(inode->i_ino));
> 		CATCH_DEREF_FAULT();
> 	%}
> 	probe kernel.mark("kfunc_entry")
> 	{
> 		printf("inode number: %d\n", inode_get_i_ino($arg1))
> 	}
> 
> Is this a representative example?

I hope not -- this embedded-C stuff should only be necessary if we
need to compensate for the data-stinginess of the marker.  See just
below that on the wiki page for a more generous marker.

> If yes, systemtap only wants the marker's name and totally ignores its
> fmt string. So why add them in the first place?

Actually, systemtap will have consumed the format strings by then
(during script translate time), in order to figure out the types of
$argN.

> And if systemtap can hook in at any place, it doesn't need markers (I
> haven't used it, though, so correct me).

Systemtap doesn't *need* markers in the sense of it being an essential
prerequisite.  But it is beneficial to have them around, in order to:

- suffer only a small fraction of the kprobes dispatching overhead
- pass pretty arbitrary parameters without relying on much compiler
  cooperation, and specifically without requiring debugging info


- FChE

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 34/37] LTTng instrumentation ipc
  2008-04-25 12:56       ` Mathieu Desnoyers
  2008-04-25 13:17         ` [RFC] system-wide in-kernel syscall tracing Mathieu Desnoyers
@ 2008-05-04 21:04         ` Alexey Dobriyan
  2008-05-04 20:39           ` Frank Ch. Eigler
  1 sibling, 1 reply; 66+ messages in thread
From: Alexey Dobriyan @ 2008-05-04 21:04 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: Frank Ch. Eigler, akpm, Ingo Molnar, linux-kernel

On Fri, Apr 25, 2008 at 08:56:07AM -0400, Mathieu Desnoyers wrote:
> * Frank Ch. Eigler (fche@redhat.com) wrote:

> > > Can I write
> > > 	if (rv < 0)
> > > 		trace_mark(foo, "rv %d", rv);
> > 
> > Sure.
> > 
> > > Looks like i could. But people want also want to see success, so what?
> > > Two markers per exit?
> > >
> > > 	rv = ipc_get(...);
> > > 	if (rv < 0)
> > > 		trace_marker(foo_err, ...);
> > > 	trace_marker(foo_all, ...);
> > 
> > That seems excessive.  Just pass the "rv" value and let the consumer
> > decide whether they care about < 0.
> > 
> > You seem to be operating under the mistaken assumption that marker
> > consumers will simply have to pass on the full firehose flow without
> > filtering.  That is not so.  I suspect lttng can do it, but I know
> > that with systemtap, it's trivial to encode conditions on the marker
> > parameters and other state (e.g., recent events of interest), so that
> > only finely tuned events actually get sent to the end user.
> > 
> 
> The preferred way to do it would be
> 
>   trace_mark(foo, "rv %d", rv);
> 
> And let the probe deal with rv. The main reason is that by adding
> 
> if (test)
>   trace_mark()
> 
> you will add a branch to the normal kernel code flow and slow down the
> kernel a bit even when tracing is disabled, compared to the optimized markers.

Mathieu, let's forget about performance for this thread. I'm not complaining
about performance degradation at all; there is something more serious.
People _will_ find you if something goes noticeably slower. :^)


OK, an unconditional marker forces me to do some post-filtering; that's
probably tolerable. But! Some information can be lost once serialization
to strings is done, or post-filtering becomes more difficult than it needs
to be. Examples below.

> > > Also everything inserted so far is static. Sometimes only one bit in
> > > mask is interesting and to save time to parse nibbles people do:
> > > 	printk("foo = %d\n", !!(mask & foo));
> > > And interesting bits vary.
> > 
> > OK, perhaps pass both mask & foo, and let the consumer perform the
> > arithmetic they deem appropriate.
> > 
> 
> Agreed. However, adding stuff like
> 
> !!(mask & foo)
> 
> as a parameter to a marker won't add to the disabled marker runtime cost,
> since it's evaluated within the "marker enabled" block. So, if all you
> really need is !!(mask & foo) (and never any other information about foo),
> then it could make sense to use the most restrictive version so we don't
> export internal details about the kernel implementation.

People will want to see different bits, so you'll end up showing the full
mask and letting people do post-filtering again.
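
To make the trade-off concrete, a minimal sketch (the marker names, the type
of mask and the format strings are hypothetical, only for illustration):

	/* derived value: !!(mask & foo) is computed only inside the
	 * "marker enabled" block, so it costs nothing while the marker is
	 * off, but it exports a single bit */
	trace_mark(foo_mask_bit, "bit %d", !!(mask & foo));

	/* raw values: export the whole mask (and the bit of interest) and
	 * let the probe or the post-processing pick the bits it cares about */
	trace_mark(foo_mask, "mask %lu foo %lu", mask, foo);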

> > > Again, all events aren't interesting:
> > > 	if (file && file->f_op == &shm_file_operations)
> > > 		printk("%s: file = %p: START\n", __func__, file);
> > > Can I write this with markers?
> > 
> > Of course, if you really want to.
> > 
> 
> __func__ is not really interesting here, because you can name your
> marker. A useful trick can be to use __builtin_return_address(0) when
> needed though.

__func__ is just a real-world nit.

This example is about a very specific set of struct files (hi, Eric!).
Post-filtering again.
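
As an aside, the __builtin_return_address() trick Mathieu mentions would look
something like this (purely illustrative, reusing the shm_start marker name
that appears below):

	trace_mark(shm_start, "caller %p file %p",
		   __builtin_return_address(0), file);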

> For the rest, a way that would not export too much information about
> the kernel's internals:
> 
> trace_mark(shm_start, "is_shm_fop %d file %p",
>   file->f_op == &shm_file_operations, file);

And this is totally unexpected from you.

		In the name of a bug-free kernel,
		  I DO WANT KERNEL INTERNALS!

And I know perfectly well which ones, and at which spots!

If you're scared of internals, keep this marker thingy on the
kernel/luserspace boundary -- TIF_SYSCALL_TRACE or whatever it's called.
Don't do tree-wide source code pollution!

> If you really really want to, I could modify the markers to make this
> even easier by doing something like :
> 
> trace_mark_cond(file->f_op == &shm_file_operations,
>   shm_start, "file %p", file);

That's also strange to hear.

If I can't reboot a box, I can't insert my carefully crafted marker.
If I can, I can rebuild the kernel with all the debugging I want.
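
For what it's worth, a naive sketch of what such a trace_mark_cond() helper
could expand to (hypothetical, not part of the posted patches; note that
written this way the condition is still tested unconditionally in the fast
path, i.e. it reintroduces exactly the branch Mathieu argued against above,
so a real implementation would presumably fold the test into the
marker-enabled path):

	#define trace_mark_cond(cond, name, format, args...)		\
		do {							\
			if (cond)					\
				trace_mark(name, format, ## args);	\
		} while (0)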

> > > So what is proposed? Insert markers at places that look strategic? 
> > 
> > "strategic" is the wrong term.  Choose those places that reflect
> > internal occurrences that are useful but difficult to reverse-engineer
> > from other visible interface points like system calls.  Data that
> > helps answer questions like "Why did (subtle internal phenomenon)
> > happen?" in a live system.

This is what's called strategic :^) But there are too many of them, and you
never know where a bug report will point you in the end.

> > > mm/ patch is full of "file %p". Do you realize that pointers are
> > > only sometimes interesting and sometimes you want dentry (not pointer,
> > > but name!):
> > > 	printk("file = '%s'\n", file->f_dentry->d_name.name);
> > 
> > It may not be excessive to put both file and the dname.name as marker
> > parameters.
> > 
> 
> Eek, well, if we really want to identify a file, we need more than its
> name: mount point, full path and file name are required.

Just the name is sometimes a good clue: SYSV00000000. Somebody will want
the full path, though.

> It brings me a
> few years back, but I don't think the dentry name gives us that. This is
> why I extract information about all opened files to my tracer once
> (mapping the mount point, path and file name to the file pointer) and then I
> don't have to do the lookup each time the marker is encountered. Yes,
> this involved dumping the file pointers at tracer start and keeping track
> of open/close events.

> > > So, let me say even more explicitly. Looking at the proposed places
> > > elite enough to be blessed with a marker...
> > >
> > > Those which are close enough to the system call boundary are essentially
> > > strace(1).
> > 
> > Those may not sound worthwhile to put a marker for, BUT, you're
> > ignoring the huge differences of impact and scope.  A system-wide
> > marker-based trace (filtered a la systemtap if desired) can be done
> > with a tiny fraction of system load and none of the disruption caused
> > by an strace of all the processes.

And finally, systemtap.

Reading some systemtap docs,
http://sourceware.org/systemtap/wiki/UsingMarkers
where examples of intelligent filtering are shown:

	function inode_get_i_ino:long (i:long) %{ /* pure */
		struct inode *inode = (struct inode *)(long)THIS->i;
		THIS->__retvalue = kread(&(inode->i_ino));
		CATCH_DEREF_FAULT();
	%}
	probe kernel.mark("kfunc_entry")
	{
		printf("inode number: %d\n", inode_get_i_ino($arg1))
	}

Is this a representative example?

If yes, systemtap only wants the marker's name and totally ignores its fmt
string. So why add them in the first place?

And if systemtap can hook in at any place, it doesn't need markers (I
haven't used it, though, so correct me).


^ permalink raw reply	[flat|nested] 66+ messages in thread

end of thread, other threads:[~2008-05-04 20:41 UTC | newest]

Thread overview: 66+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-04-24 15:03 [patch 00/37] Linux Kernel Markers instrumentation for sched-devel.git Mathieu Desnoyers
2008-04-24 15:03 ` [patch 01/37] Stringify support commas Mathieu Desnoyers
2008-04-24 15:03 ` [patch 02/37] x86_64 page fault NMI-safe Mathieu Desnoyers
2008-04-24 15:03 ` [patch 03/37] Change Alpha active count bit Mathieu Desnoyers
2008-04-24 15:03 ` [patch 04/37] Change avr32 " Mathieu Desnoyers
2008-04-24 15:03 ` [patch 05/37] x86 NMI-safe INT3 and Page Fault Mathieu Desnoyers
2008-04-24 15:03 ` [patch 06/37] Kprobes - use a mutex to protect the instruction pages list Mathieu Desnoyers
2008-04-24 15:03 ` [patch 07/37] Kprobes - do not use kprobes mutex in arch code Mathieu Desnoyers
2008-04-24 15:03 ` [patch 08/37] Kprobes - declare kprobe_mutex static Mathieu Desnoyers
2008-04-24 15:03 ` [patch 09/37] Fix sched-devel text_poke Mathieu Desnoyers
2008-04-24 15:03 ` [patch 10/37] Text Edit Lock - Architecture Independent Code Mathieu Desnoyers
2008-04-24 15:03 ` [patch 11/37] Text Edit Lock - kprobes architecture independent support Mathieu Desnoyers
2008-04-24 15:03 ` [patch 12/37] Add all cpus option to stop machine run Mathieu Desnoyers
2008-04-24 15:03 ` [patch 13/37] Immediate Values - Architecture Independent Code Mathieu Desnoyers
2008-04-25 14:55   ` Ingo Molnar
2008-04-26  9:36     ` Ingo Molnar
2008-04-26 11:09       ` Ingo Molnar
2008-04-26 14:17         ` Mathieu Desnoyers
2008-04-24 15:03 ` [patch 14/37] Immediate Values - Kconfig menu in EMBEDDED Mathieu Desnoyers
2008-04-24 15:03 ` [patch 15/37] Immediate Values - x86 Optimization Mathieu Desnoyers
2008-04-24 15:03 ` [patch 16/37] Add text_poke and sync_core to powerpc Mathieu Desnoyers
2008-04-24 15:03 ` [patch 17/37] Immediate Values - Powerpc Optimization Mathieu Desnoyers
2008-04-24 15:03 ` [patch 18/37] Immediate Values - Documentation Mathieu Desnoyers
2008-04-24 15:03 ` [patch 19/37] Immediate Values Support init Mathieu Desnoyers
2008-04-24 15:03 ` [patch 20/37] Immediate Values - Move Kprobes x86 restore_interrupt to kdebug.h Mathieu Desnoyers
2008-04-24 15:03 ` [patch 21/37] Add __discard section to x86 Mathieu Desnoyers
2008-04-24 15:03 ` [patch 22/37] Immediate Values - x86 Optimization NMI and MCE support Mathieu Desnoyers
2008-04-24 15:03 ` [patch 23/37] Immediate Values - Powerpc Optimization NMI " Mathieu Desnoyers
2008-04-24 15:03 ` [patch 24/37] Immediate Values Use Arch NMI and MCE Support Mathieu Desnoyers
2008-04-24 15:03 ` [patch 25/37] Immediate Values - Jump Mathieu Desnoyers
2008-04-24 15:03 ` [patch 26/37] Scheduler Profiling - Use Immediate Values Mathieu Desnoyers
2008-04-24 15:03 ` [patch 27/37] From: Adrian Bunk <bunk@kernel.org> Mathieu Desnoyers
2008-04-28  9:54   ` Adrian Bunk
2008-04-28 12:37     ` Mathieu Desnoyers
2008-04-24 15:03 ` [patch 28/37] Markers - remove extra format argument Mathieu Desnoyers
2008-04-24 15:03 ` [patch 29/37] Markers - define non optimized marker Mathieu Desnoyers
2008-04-24 15:03 ` [patch 30/37] Linux Kernel Markers - Use Immediate Values Mathieu Desnoyers
2008-04-24 15:03 ` [patch 31/37] Markers use imv jump Mathieu Desnoyers
2008-04-24 15:03 ` [patch 32/37] Port ftrace to markers Mathieu Desnoyers
2008-04-24 15:03 ` [patch 33/37] LTTng instrumentation fs Mathieu Desnoyers
2008-04-24 15:03 ` [patch 34/37] LTTng instrumentation ipc Mathieu Desnoyers
2008-04-24 23:02   ` Alexey Dobriyan
2008-04-25  2:15     ` Frank Ch. Eigler
2008-04-25 12:56       ` Mathieu Desnoyers
2008-04-25 13:17         ` [RFC] system-wide in-kernel syscall tracing Mathieu Desnoyers
2008-05-04 21:04         ` [patch 34/37] LTTng instrumentation ipc Alexey Dobriyan
2008-05-04 20:39           ` Frank Ch. Eigler
2008-04-24 15:03 ` [patch 35/37] LTTng instrumentation kernel Mathieu Desnoyers
2008-04-24 15:04 ` [patch 36/37] LTTng instrumentation mm Mathieu Desnoyers
2008-04-28  2:12   ` Masami Hiramatsu
2008-04-24 15:04 ` [patch 37/37] LTTng instrumentation net Mathieu Desnoyers
2008-04-24 15:52   ` Pavel Emelyanov
2008-04-24 16:13     ` Mathieu Desnoyers
2008-04-24 16:30       ` Pavel Emelyanov
2008-04-26 19:38 ` [patch 00/37] Linux Kernel Markers instrumentation for sched-devel.git Peter Zijlstra
2008-04-26 20:11   ` Mathieu Desnoyers
2008-04-27 10:00     ` Peter Zijlstra
2008-05-04 15:08       ` Mathieu Desnoyers
2008-04-28 18:22   ` Ingo Molnar
2008-04-28 18:36   ` Andrew Morton
2008-04-28 18:40     ` Christoph Hellwig
2008-04-28 18:47       ` Andrew Morton
2008-04-28 18:49         ` Christoph Hellwig
2008-04-28 19:01         ` KOSAKI Motohiro
2008-04-28 19:52     ` Peter Zijlstra
2008-04-28 22:25       ` Masami Hiramatsu
