* [GIT PULL] perf updates
@ 2010-03-03  6:54 Frederic Weisbecker
  2010-03-03  6:55 ` [PATCH 1/3] lockdep: Move lock events under lockdep recursion protection Frederic Weisbecker
                   ` (2 more replies)
  0 siblings, 3 replies; 51+ messages in thread
From: Frederic Weisbecker @ 2010-03-03  6:54 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: LKML, Frederic Weisbecker, Peter Zijlstra, Paul Mackerras

There is an RFC patch in this set, so maybe it's not really
pullable yet.

But for those who want to test it, the set is in the
perf/core branch, which can be found at:

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing.git
	perf/core

Thanks,
	Frederic
---

Frederic Weisbecker (3):
      lockdep: Move lock events under lockdep recursion protection
      perf: Take a hot regs snapshot for trace events
      perf/x86-64: Use frame pointer to walk on irq and process stacks


 arch/x86/include/asm/asm.h        |    1 +
 arch/x86/include/asm/perf_event.h |   25 +++++++++++++++++++++++++
 arch/x86/kernel/dumpstack_64.c    |    4 ++--
 kernel/lockdep.c                  |    9 +++------
 kernel/perf_event.c               |   19 ++++++++++++++-----
 5 files changed, 45 insertions(+), 13 deletions(-)

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH 1/3] lockdep: Move lock events under lockdep recursion protection
  2010-03-03  6:54 [GIT PULL] perf updates Frederic Weisbecker
@ 2010-03-03  6:55 ` Frederic Weisbecker
  2010-03-09  7:18   ` Hitoshi Mitake
  2010-03-09  8:34   ` Jens Axboe
  2010-03-03  6:55 ` [RFC][PATCH 2/3] perf: Take a hot regs snapshot for trace events Frederic Weisbecker
  2010-03-03  6:55 ` [PATCH 3/3] perf/x86-64: Use frame pointer to walk on irq and process stacks Frederic Weisbecker
  2 siblings, 2 replies; 51+ messages in thread
From: Frederic Weisbecker @ 2010-03-03  6:55 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, Frederic Weisbecker, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Steven Rostedt, Paul Mackerras,
	Hitoshi Mitake, Li Zefan, Lai Jiangshan, Masami Hiramatsu,
	Jens Axboe

There are RCU read-side critical sections on the path where we
submit a trace event, and their rcu_read_(un)lock() calls trigger
lock events, which create recursive trace events.

One pair in do_perf_sw_event:

__lock_acquire
      |
      |--96.11%-- lock_acquire
      |          |
      |          |--27.21%-- do_perf_sw_event
      |          |          perf_tp_event
      |          |          |
      |          |          |--49.62%-- ftrace_profile_lock_release
      |          |          |          lock_release
      |          |          |          |
      |          |          |          |--33.85%-- _raw_spin_unlock

Another pair in perf_output_begin/end:

__lock_acquire
      |--23.40%-- perf_output_begin
      |          |          __perf_event_overflow
      |          |          perf_swevent_overflow
      |          |          perf_swevent_add
      |          |          perf_swevent_ctx_event
      |          |          do_perf_sw_event
      |          |          perf_tp_event
      |          |          |
      |          |          |--55.37%-- ftrace_profile_lock_acquire
      |          |          |          lock_acquire
      |          |          |          |
      |          |          |          |--37.31%-- _raw_spin_lock

The problem is not so much the trace recursion itself, as we
already have recursion protection (though recursing is always
wasteful). The problem is that the trace events sit outside the
lockdep recursion protection, so each lockdep event triggers a
lock trace, which in turn triggers two more lockdep events. The
recursive lock trace events are dropped by the trace recursion
protection, so the recursion stops there, but lockdep still
analyses the new events.

To sum up, for each lockdep event we have:

	lock_*()
	     |
	     trace_lock_acquire()
	          |
	          ----- rcu_read_lock()
	          |          |
	          |          lock_acquire()
	          |          |
	          |          trace_lock_acquire() (stopped)
	          |          |
	          |          lockdep analyze
	          |
	          ----- rcu_read_unlock()
	                     |
	                     lock_release()
	                     |
	                     trace_lock_release() (stopped)
	                     |
	                     lockdep analyze

And the above repeats twice, since we have two RCU read-side
sections on the event submission path.

This patch fixes it by moving the lock trace events inside
the lockdep recursion protection.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
---
 kernel/lockdep.c |    9 +++------
 1 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/kernel/lockdep.c b/kernel/lockdep.c
index c62ec14..3de6085 100644
--- a/kernel/lockdep.c
+++ b/kernel/lockdep.c
@@ -3211,8 +3211,6 @@ void lock_acquire(struct lockdep_map *lock, unsigned int subclass,
 {
 	unsigned long flags;
 
-	trace_lock_acquire(lock, subclass, trylock, read, check, nest_lock, ip);
-
 	if (unlikely(current->lockdep_recursion))
 		return;
 
@@ -3220,6 +3218,7 @@ void lock_acquire(struct lockdep_map *lock, unsigned int subclass,
 	check_flags(flags);
 
 	current->lockdep_recursion = 1;
+	trace_lock_acquire(lock, subclass, trylock, read, check, nest_lock, ip);
 	__lock_acquire(lock, subclass, trylock, read, check,
 		       irqs_disabled_flags(flags), nest_lock, ip, 0);
 	current->lockdep_recursion = 0;
@@ -3232,14 +3231,13 @@ void lock_release(struct lockdep_map *lock, int nested,
 {
 	unsigned long flags;
 
-	trace_lock_release(lock, nested, ip);
-
 	if (unlikely(current->lockdep_recursion))
 		return;
 
 	raw_local_irq_save(flags);
 	check_flags(flags);
 	current->lockdep_recursion = 1;
+	trace_lock_release(lock, nested, ip);
 	__lock_release(lock, nested, ip);
 	current->lockdep_recursion = 0;
 	raw_local_irq_restore(flags);
@@ -3413,8 +3411,6 @@ void lock_contended(struct lockdep_map *lock, unsigned long ip)
 {
 	unsigned long flags;
 
-	trace_lock_contended(lock, ip);
-
 	if (unlikely(!lock_stat))
 		return;
 
@@ -3424,6 +3420,7 @@ void lock_contended(struct lockdep_map *lock, unsigned long ip)
 	raw_local_irq_save(flags);
 	check_flags(flags);
 	current->lockdep_recursion = 1;
+	trace_lock_contended(lock, ip);
 	__lock_contended(lock, ip);
 	current->lockdep_recursion = 0;
 	raw_local_irq_restore(flags);
-- 
1.6.2.3



* [RFC][PATCH 2/3] perf: Take a hot regs snapshot for trace events
  2010-03-03  6:54 [GIT PULL] perf updates Frederic Weisbecker
  2010-03-03  6:55 ` [PATCH 1/3] lockdep: Move lock events under lockdep recursion protection Frederic Weisbecker
@ 2010-03-03  6:55 ` Frederic Weisbecker
  2010-03-03  8:38   ` Peter Zijlstra
  2010-03-03 16:06   ` [RFC][PATCH 2/3] " Steven Rostedt
  2010-03-03  6:55 ` [PATCH 3/3] perf/x86-64: Use frame pointer to walk on irq and process stacks Frederic Weisbecker
  2 siblings, 2 replies; 51+ messages in thread
From: Frederic Weisbecker @ 2010-03-03  6:55 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, Frederic Weisbecker, Ingo Molnar, Thomas Gleixner,
	H. Peter Anvin, Peter Zijlstra, Paul Mackerras, Steven Rostedt,
	Arnaldo Carvalho de Melo

We take the wrong regs snapshot when a trace event triggers:
either we use get_irq_regs(), which gives us the interrupted
registers if we are in an interrupt, or we use task_pt_regs(),
which gives us the state before we entered the kernel, assuming
we are lucky enough not to be a kernel thread, in which case
task_pt_regs() returns the initial set of regs from when the
kernel thread was started.

What we want is different. We need a hot snapshot of the regs,
so that we can get the instruction pointer to record in the
sample, the frame pointer for the callchain, and some other
things.

To achieve that, we introduce PERF_SAVE_REGS(), which takes a
live snapshot of the regs we are interested in, and we use it
from perf_tp_event().

Comparison with perf record -e lock: -R -a -f -g
Before:

        perf  [kernel]                   [k] __do_softirq
               |
               --- __do_softirq
                  |
                  |--55.16%-- __open
                  |
                   --44.84%-- __write_nocancel

After:

            perf  [kernel]           [k] perf_tp_event
               |
               --- perf_tp_event
                  |
                  |--41.07%-- ftrace_profile_lock_acquire
                  |          lock_acquire
                  |          |
                  |          |--39.36%-- _raw_spin_lock
                  |          |          |
                  |          |          |--7.81%-- hrtimer_interrupt
                  |          |          |          smp_apic_timer_interrupt
                  |          |          |          apic_timer_interrupt

The old code produced unreliable callchains. Now that we have
the right frame and instruction pointers, we get the trace we
want. We could even implement a "skip" field in
PERF_SAVE_REGS() to rewind by "n" callers (here n would be 2,
so that we start at lock_acquire()).

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 arch/x86/include/asm/asm.h        |    1 +
 arch/x86/include/asm/perf_event.h |   25 +++++++++++++++++++++++++
 kernel/perf_event.c               |   19 ++++++++++++++-----
 3 files changed, 40 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/asm.h b/arch/x86/include/asm/asm.h
index b3ed1e1..d86e7dc 100644
--- a/arch/x86/include/asm/asm.h
+++ b/arch/x86/include/asm/asm.h
@@ -27,6 +27,7 @@
 #define _ASM_ADD	__ASM_SIZE(add)
 #define _ASM_SUB	__ASM_SIZE(sub)
 #define _ASM_XADD	__ASM_SIZE(xadd)
+#define _ASM_POP	__ASM_SIZE(pop)
 
 #define _ASM_AX		__ASM_REG(ax)
 #define _ASM_BX		__ASM_REG(bx)
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index befd172..a166fa5 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -122,6 +122,31 @@ union cpuid10_edx {
 extern void init_hw_perf_events(void);
 extern void perf_events_lapic_init(void);
 
+/*
+ * Take a snapshot of the regs. We only need a few of them:
+ * ip for PERF_SAMPLE_IP
+ * cs for user_mode() tests
+ * bp for callchains
+ * eflags, for future purposes, just in case
+ */
+#define PERF_SAVE_REGS(regs)				{\
+	memset((regs), 0, sizeof(*(regs)));		\
+	(regs)->cs = __KERNEL_CS;			\
+							\
+	asm volatile(					\
+		"call 1f\n"				\
+		"1:\n"					\
+		_ASM_POP " %[ip]\n"			\
+		_ASM_MOV "%%" _ASM_BP ", %[bp]\n"	\
+		"pushf\n"				\
+		_ASM_POP " %[flags]\n"			\
+		:  [ip]    "=m" ((regs)->ip),		\
+		   [bp]    "=m" ((regs)->bp),		\
+		   [flags] "=m" ((regs)->flags)		\
+		:: "memory"				\
+	);						\
+}
+
 #define PERF_EVENT_INDEX_OFFSET			0
 
 #else
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index a661e79..f412be1 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -2807,6 +2807,16 @@ __weak struct perf_callchain_entry *perf_callchain(struct pt_regs *regs)
 }
 
 /*
+ * Hot regs snapshot support -- arch specific
+ * This needs to be a macro because we want the current
+ * frame pointer.
+ */
+#ifndef PERF_SAVE_REGS
+#define PERF_SAVE_REGS(regs)	memset(regs, 0, sizeof(*regs));
+#endif
+
+
+/*
  * Output
  */
 static bool perf_output_space(struct perf_mmap_data *data, unsigned long tail,
@@ -4337,6 +4347,8 @@ static const struct pmu perf_ops_task_clock = {
 void perf_tp_event(int event_id, u64 addr, u64 count, void *record,
 			  int entry_size)
 {
+	struct pt_regs regs;
+
 	struct perf_raw_record raw = {
 		.size = entry_size,
 		.data = record,
@@ -4347,14 +4359,11 @@ void perf_tp_event(int event_id, u64 addr, u64 count, void *record,
 		.raw = &raw,
 	};
 
-	struct pt_regs *regs = get_irq_regs();
-
-	if (!regs)
-		regs = task_pt_regs(current);
+	PERF_SAVE_REGS(&regs);
 
 	/* Trace events already protected against recursion */
 	do_perf_sw_event(PERF_TYPE_TRACEPOINT, event_id, count, 1,
-				&data, regs);
+				&data, &regs);
 }
 EXPORT_SYMBOL_GPL(perf_tp_event);
 
-- 
1.6.2.3



* [PATCH 3/3] perf/x86-64: Use frame pointer to walk on irq and process stacks
  2010-03-03  6:54 [GIT PULL] perf updates Frederic Weisbecker
  2010-03-03  6:55 ` [PATCH 1/3] lockdep: Move lock events under lockdep recursion protection Frederic Weisbecker
  2010-03-03  6:55 ` [RFC][PATCH 2/3] perf: Take a hot regs snapshot for trace events Frederic Weisbecker
@ 2010-03-03  6:55 ` Frederic Weisbecker
  2 siblings, 0 replies; 51+ messages in thread
From: Frederic Weisbecker @ 2010-03-03  6:55 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, Frederic Weisbecker, Ingo Molnar, Peter Zijlstra,
	Paul Mackerras, Arnaldo Carvalho de Melo

We were using the frame-pointer-based stack walker in every
context on x86-32, but not on x86-64, where we only use these
seven-league boots on the exception stacks.

Use it on the irq and process stacks too. This greatly
accelerates callchain captures.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 arch/x86/kernel/dumpstack_64.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
index 676bc05..cab4f72 100644
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -202,7 +202,7 @@ void dump_trace(struct task_struct *task, struct pt_regs *regs,
 			if (in_irq_stack(stack, irq_stack, irq_stack_end)) {
 				if (ops->stack(data, "IRQ") < 0)
 					break;
-				bp = print_context_stack(tinfo, stack, bp,
+				bp = ops->walk_stack(tinfo, stack, bp,
 					ops, data, irq_stack_end, &graph);
 				/*
 				 * We link to the next stack (which would be
@@ -223,7 +223,7 @@ void dump_trace(struct task_struct *task, struct pt_regs *regs,
 	/*
 	 * This handles the process stack:
 	 */
-	bp = print_context_stack(tinfo, stack, bp, ops, data, NULL, &graph);
+	bp = ops->walk_stack(tinfo, stack, bp, ops, data, NULL, &graph);
 	put_cpu();
 }
 EXPORT_SYMBOL(dump_trace);
-- 
1.6.2.3



* Re: [RFC][PATCH 2/3] perf: Take a hot regs snapshot for trace events
  2010-03-03  6:55 ` [RFC][PATCH 2/3] perf: Take a hot regs snapshot for trace events Frederic Weisbecker
@ 2010-03-03  8:38   ` Peter Zijlstra
  2010-03-03 20:07     ` Frederic Weisbecker
                       ` (4 more replies)
  2010-03-03 16:06   ` [RFC][PATCH 2/3] " Steven Rostedt
  1 sibling, 5 replies; 51+ messages in thread
From: Peter Zijlstra @ 2010-03-03  8:38 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Ingo Molnar, LKML, Thomas Gleixner, H. Peter Anvin,
	Paul Mackerras, Steven Rostedt, Arnaldo Carvalho de Melo

On Wed, 2010-03-03 at 07:55 +0100, Frederic Weisbecker wrote:
>  /*
> + * Hot regs snapshot support -- arch specific
> + * This needs to be a macro because we want the current
> + * frame pointer.
> + */
> +#ifndef PERF_SAVE_REGS
> +#define PERF_SAVE_REGS(regs)   memset(regs, 0, sizeof(*regs));
> +#endif

It would be nice to have the fallback at least set the current or
calling IP, now you've basically wrecked stuff for everything !x86.



* Re: [RFC][PATCH 2/3] perf: Take a hot regs snapshot for trace events
  2010-03-03  6:55 ` [RFC][PATCH 2/3] perf: Take a hot regs snapshot for trace events Frederic Weisbecker
  2010-03-03  8:38   ` Peter Zijlstra
@ 2010-03-03 16:06   ` Steven Rostedt
  2010-03-03 16:37     ` Peter Zijlstra
  1 sibling, 1 reply; 51+ messages in thread
From: Steven Rostedt @ 2010-03-03 16:06 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Ingo Molnar, LKML, Thomas Gleixner, H. Peter Anvin,
	Peter Zijlstra, Paul Mackerras, Arnaldo Carvalho de Melo

On Wed, 2010-03-03 at 07:55 +0100, Frederic Weisbecker wrote:

> +/*
>   * Output
>   */
>  static bool perf_output_space(struct perf_mmap_data *data, unsigned long tail,
> @@ -4337,6 +4347,8 @@ static const struct pmu perf_ops_task_clock = {
>  void perf_tp_event(int event_id, u64 addr, u64 count, void *record,
>  			  int entry_size)
>  {
> +	struct pt_regs regs;
> +
>  	struct perf_raw_record raw = {
>  		.size = entry_size,
>  		.data = record,
> @@ -4347,14 +4359,11 @@ void perf_tp_event(int event_id, u64 addr, u64 count, void *record,
>  		.raw = &raw,
>  	};
>  
> -	struct pt_regs *regs = get_irq_regs();
> -
> -	if (!regs)
> -		regs = task_pt_regs(current);
> +	PERF_SAVE_REGS(&regs);
>  
>  	/* Trace events already protected against recursion */
>  	do_perf_sw_event(PERF_TYPE_TRACEPOINT, event_id, count, 1,
> -				&data, regs);
> +				&data, &regs);

Off-topic: Why is the above a perf sw event? Couldn't that also be a
normal TRACE_EVENT()?

-- Steve

>  }
>  EXPORT_SYMBOL_GPL(perf_tp_event);
>  




* Re: [RFC][PATCH 2/3] perf: Take a hot regs snapshot for trace events
  2010-03-03 16:06   ` [RFC][PATCH 2/3] " Steven Rostedt
@ 2010-03-03 16:37     ` Peter Zijlstra
  2010-03-03 17:07       ` Steven Rostedt
  0 siblings, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2010-03-03 16:37 UTC (permalink / raw)
  To: rostedt
  Cc: Frederic Weisbecker, Ingo Molnar, LKML, Thomas Gleixner,
	H. Peter Anvin, Paul Mackerras, Arnaldo Carvalho de Melo

On Wed, 2010-03-03 at 11:06 -0500, Steven Rostedt wrote:
> On Wed, 2010-03-03 at 07:55 +0100, Frederic Weisbecker wrote:
> 
> > +/*
> >   * Output
> >   */
> >  static bool perf_output_space(struct perf_mmap_data *data, unsigned long tail,
> > @@ -4337,6 +4347,8 @@ static const struct pmu perf_ops_task_clock = {
> >  void perf_tp_event(int event_id, u64 addr, u64 count, void *record,
> >  			  int entry_size)
> >  {
> > +	struct pt_regs regs;
> > +
> >  	struct perf_raw_record raw = {
> >  		.size = entry_size,
> >  		.data = record,
> > @@ -4347,14 +4359,11 @@ void perf_tp_event(int event_id, u64 addr, u64 count, void *record,
> >  		.raw = &raw,
> >  	};
> >  
> > -	struct pt_regs *regs = get_irq_regs();
> > -
> > -	if (!regs)
> > -		regs = task_pt_regs(current);
> > +	PERF_SAVE_REGS(&regs);
> >  
> >  	/* Trace events already protected against recursion */
> >  	do_perf_sw_event(PERF_TYPE_TRACEPOINT, event_id, count, 1,
> > -				&data, regs);
> > +				&data, &regs);
> 
> Off-topic: Why is the above a perf sw event? Couldn't that also be a
> normal TRACE_EVENT()?

Well, no, this is the stuff that transforms TRACE_EVENT() into perf
software events ;-)



* Re: [RFC][PATCH 2/3] perf: Take a hot regs snapshot for trace events
  2010-03-03 16:37     ` Peter Zijlstra
@ 2010-03-03 17:07       ` Steven Rostedt
  2010-03-03 17:16         ` Peter Zijlstra
  0 siblings, 1 reply; 51+ messages in thread
From: Steven Rostedt @ 2010-03-03 17:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, Ingo Molnar, LKML, Thomas Gleixner,
	H. Peter Anvin, Paul Mackerras, Arnaldo Carvalho de Melo

On Wed, 2010-03-03 at 17:37 +0100, Peter Zijlstra wrote:

> > >  	/* Trace events already protected against recursion */
> > >  	do_perf_sw_event(PERF_TYPE_TRACEPOINT, event_id, count, 1,
> > > -				&data, regs);
> > > +				&data, &regs);
> > 
> > Off-topic: Why is the above a perf sw event? Couldn't that also be a
> > normal TRACE_EVENT()?
> 
> Well, no, this is the stuff that transforms TRACE_EVENT() into perf
> software events ;-)
> 

oops, my bad :-), I thought this was in the x86 arch directory. For the
University, I was helping them with adding trace points for page faults
when I came across this in arch/x86/mm/fault.c:

	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, 0, regs, address);


This is what I actually was wondering about. Why is it a "perf only"
trace point instead of a TRACE_EVENT()?

-- Steve




* Re: [RFC][PATCH 2/3] perf: Take a hot regs snapshot for trace events
  2010-03-03 17:07       ` Steven Rostedt
@ 2010-03-03 17:16         ` Peter Zijlstra
  2010-03-03 17:45           ` Steven Rostedt
  2010-03-04 11:25           ` Ingo Molnar
  0 siblings, 2 replies; 51+ messages in thread
From: Peter Zijlstra @ 2010-03-03 17:16 UTC (permalink / raw)
  To: rostedt
  Cc: Frederic Weisbecker, Ingo Molnar, LKML, Thomas Gleixner,
	H. Peter Anvin, Paul Mackerras, Arnaldo Carvalho de Melo

On Wed, 2010-03-03 at 12:07 -0500, Steven Rostedt wrote:
> oops, my bad :-), I thought this was in the x86 arch directory. For the
> University, I was helping them with adding trace points for page faults
> when I came across this in arch/x86/mm/fault.c:
> 
>         perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, 0, regs, address);
> 
> 
> This is what I actually was wondering about. Why is it a "perf only"
> trace point instead of a TRACE_EVENT()?

Because I wanted to make perf usable without having to rely on funny
tracepoints. That is, I am less worried about committing software
counters to ABI than I am about TRACE_EVENT(), which still gives me a
terribly uncomfortable feeling.

Also, building with all CONFIG_TRACE_*=n will still yield a usable perf,
which is something the embedded people might fancy, all that TRACE stuff
adds lots of code.



* Re: [RFC][PATCH 2/3] perf: Take a hot regs snapshot for trace events
  2010-03-03 17:16         ` Peter Zijlstra
@ 2010-03-03 17:45           ` Steven Rostedt
  2010-03-03 20:37             ` Frederic Weisbecker
  2010-03-04 11:25           ` Ingo Molnar
  1 sibling, 1 reply; 51+ messages in thread
From: Steven Rostedt @ 2010-03-03 17:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, Ingo Molnar, LKML, Thomas Gleixner,
	H. Peter Anvin, Paul Mackerras, Arnaldo Carvalho de Melo

On Wed, 2010-03-03 at 18:16 +0100, Peter Zijlstra wrote:

> > This is what I actually was wondering about. Why is it a "perf only"
> > trace point instead of a TRACE_EVENT()?
> 
> Because I wanted to make perf usable without having to rely on funny
> tracepoints. That is, I am less worried about committing software
> counters to ABI than I am about TRACE_EVENT(), which still gives me a
> terribly uncomfortable feeling.
> 
> Also, building with all CONFIG_TRACE_*=n will still yield a usable perf,
> which is something the embedded people might fancy, all that TRACE stuff
> adds lots of code.

We could make TRACE_EVENT() into a perf only trace point with
CONFIG_TRACE_*=n. 

Just saying that it would be nice if ftrace could also see page faults
and such.

-- Steve





* Re: [RFC][PATCH 2/3] perf: Take a hot regs snapshot for trace events
  2010-03-03  8:38   ` Peter Zijlstra
@ 2010-03-03 20:07     ` Frederic Weisbecker
  2010-03-04 19:01     ` Frederic Weisbecker
                       ` (3 subsequent siblings)
  4 siblings, 0 replies; 51+ messages in thread
From: Frederic Weisbecker @ 2010-03-03 20:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, LKML, Thomas Gleixner, H. Peter Anvin,
	Paul Mackerras, Steven Rostedt, Arnaldo Carvalho de Melo

On Wed, Mar 03, 2010 at 09:38:49AM +0100, Peter Zijlstra wrote:
> On Wed, 2010-03-03 at 07:55 +0100, Frederic Weisbecker wrote:
> >  /*
> > + * Hot regs snapshot support -- arch specific
> > + * This needs to be a macro because we want the current
> > + * frame pointer.
> > + */
> > +#ifndef PERF_SAVE_REGS
> > +#define PERF_SAVE_REGS(regs)   memset(regs, 0, sizeof(*regs));
> > +#endif
> 
> It would be nice to have the fallback at least set the current or
> calling IP, now you've basically wrecked stuff for everything !x86.


Oh, I haven't wrecked !x86 that much: no callchain was working on
trace events before, and the ip wasn't correct either :-)

That said, I agree with you. I'm going to reuse the CALLER_ADDRx
macros for that. This is also a good occasion to add a skip field
to PERF_SAVE_REGS so we can rewind to where we want.

Thanks.



* Re: [RFC][PATCH 2/3] perf: Take a hot regs snapshot for trace events
  2010-03-03 17:45           ` Steven Rostedt
@ 2010-03-03 20:37             ` Frederic Weisbecker
  0 siblings, 0 replies; 51+ messages in thread
From: Frederic Weisbecker @ 2010-03-03 20:37 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Ingo Molnar, LKML, Thomas Gleixner,
	H. Peter Anvin, Paul Mackerras, Arnaldo Carvalho de Melo

On Wed, Mar 03, 2010 at 12:45:55PM -0500, Steven Rostedt wrote:
> On Wed, 2010-03-03 at 18:16 +0100, Peter Zijlstra wrote:
> 
> > > This is what I actually was wondering about. Why is it a "perf only"
> > > trace point instead of a TRACE_EVENT()?
> > 
> > Because I wanted to make perf usable without having to rely on funny
> > tracepoints. That is, I am less worried about committing software
> > counters to ABI than I am about TRACE_EVENT(), which still gives me a
> > terribly uncomfortable feeling.
> > 
> > Also, building with all CONFIG_TRACE_*=n will still yield a usable perf,
> > which is something the embedded people might fancy, all that TRACE stuff
> > adds lots of code.
> 
> We could make TRACE_EVENT() into a perf only trace point with
> CONFIG_TRACE_*=n. 



Yeah.


 
> Just saying that it would be nice if ftrace could also see page faults
> and such.


Agreed, we could make it a TRACE_EVENT. IIRC, someone proposed patches
for that in the past.

That notwithstanding, one of the main worries is that TRACE_EVENTs
come with weaker ABI-stability guarantees.



* Re: [RFC][PATCH 2/3] perf: Take a hot regs snapshot for trace events
  2010-03-03 17:16         ` Peter Zijlstra
  2010-03-03 17:45           ` Steven Rostedt
@ 2010-03-04 11:25           ` Ingo Molnar
  2010-03-04 15:16             ` Steven Rostedt
  1 sibling, 1 reply; 51+ messages in thread
From: Ingo Molnar @ 2010-03-04 11:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rostedt, Frederic Weisbecker, LKML, Thomas Gleixner,
	H. Peter Anvin, Paul Mackerras, Arnaldo Carvalho de Melo


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, 2010-03-03 at 12:07 -0500, Steven Rostedt wrote:
> > oops, my bad :-), I thought this was in the x86 arch directory. For the
> > University, I was helping them with adding trace points for page faults
> > when I came across this in arch/x86/mm/fault.c:
> > 
> >         perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, 0, regs, address);
> > 
> > 
> > This is what I actually was wondering about. Why is it a "perf only" trace 
> > point instead of a TRACE_EVENT()?
> 
> Because I wanted to make perf usable without having to rely on funny 
> tracepoints. That is, I am less worried about committing software counters 
> to ABI than I am about TRACE_EVENT(), which still gives me a terribly 
> uncomfortable feeling.

I'd still like a much less error-prone and less work-intensive way of doing it.

I'd suggest we simply add a TRACE_EVENT_ABI() for such cases, where we really 
want to expose a tracepoint to tooling, programmatically. Maybe even change 
the usage sites to trace_foo_ABI(), to make it really clear and to make people 
aware of the consequences.

> Also, building with all CONFIG_TRACE_*=n will still yield a usable perf, 
> which is something the embedded people might fancy, all that TRACE stuff 
> adds lots of code.

Not a real issue i suspect when you do lock profiling ...

Or if it is, some debloating might be in order - and the detaching of event 
enumeration and ftrace TRACE_EVENT infrastructure from other ftrace bits. (I 
suggested an '/eventfs' special filesystem before, for a nicely laid out 
hierarchy of ftrace/perf events.)

Thanks,

	Ingo


* Re: [RFC][PATCH 2/3] perf: Take a hot regs snapshot for trace events
  2010-03-04 11:25           ` Ingo Molnar
@ 2010-03-04 15:16             ` Steven Rostedt
  2010-03-04 15:36               ` Ingo Molnar
  0 siblings, 1 reply; 51+ messages in thread
From: Steven Rostedt @ 2010-03-04 15:16 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Frederic Weisbecker, LKML, Thomas Gleixner,
	H. Peter Anvin, Paul Mackerras, Arnaldo Carvalho de Melo

On Thu, 2010-03-04 at 12:25 +0100, Ingo Molnar wrote:
> * Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Wed, 2010-03-03 at 12:07 -0500, Steven Rostedt wrote:
> > > oops, my bad :-), I thought this was in the x86 arch directory. For the
> > > University, I was helping them with adding trace points for page faults
> > > when I came across this in arch/x86/mm/fault.c:
> > > 
> > >         perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, 0, regs, address);
> > > 
> > > 
> > > This is what I actually was wondering about. Why is it a "perf only" trace 
> > > point instead of a TRACE_EVENT()?
> > 
> > Because I wanted to make perf usable without having to rely on funny 
> > tracepoints. That is, I am less worried about committing software counters 
> > to ABI than I am about TRACE_EVENT(), which still gives me a terribly 
> > uncomfortable feeling.
> 
> I'd still like a much less error-prone and work-intense way of doing it.
> 
> I'd suggest we simply add a TRACE_EVENT_ABI() for such cases, where we really 
> want to expose a tracepoint to tooling, programmatically. Maybe even change 
> the usage sites to trace_foo_ABI(), to make it really clear and to make people 
> aware of the consequences.

Would this still be available as a normal trace event?

> 
> > Also, building with all CONFIG_TRACE_*=n will still yield a usable perf, 
> > which is something the embedded people might fancy, all that TRACE stuff 
> > adds lots of code.
> 
> Not a real issue, I suspect, when you do lock profiling ...
> 
> Or if it is, some debloating might be in order - and the detaching of event 
> enumeration and ftrace TRACE_EVENT infrastructure from other ftrace bits. (I 
> suggested an '/eventfs' special filesystem before, for a nicely laid out 
> hierarchy of ftrace/perf events.)

Actually, we already have a way to decouple it.

include/trace/define_trace.h is the file that just adds the tracepoint
that is needed.

include/trace/ftrace.h is the file that does the magic and adds the code
for callbacks and tracing.

The perf hooks probably should not have gone into that file; they
should have been put into an include/trace/perf.h file instead.
Then in define_trace.h we would add:

 #ifdef CONFIG_EVENT_TRACING
 #include <trace/ftrace.h>
 #endif

+#ifdef CONFIG_PERF_EVENTS
+#include <trace/perf.h>
+#endif

This should be done anyway, but it would also let you decouple ftrace
trace events from perf trace events while still letting the two use
the same trace points.

-- Steve





* Re: [RFC][PATCH 2/3] perf: Take a hot regs snapshot for trace events
  2010-03-04 15:16             ` Steven Rostedt
@ 2010-03-04 15:36               ` Ingo Molnar
  2010-03-04 15:55                 ` Steven Rostedt
  0 siblings, 1 reply; 51+ messages in thread
From: Ingo Molnar @ 2010-03-04 15:36 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Frederic Weisbecker, LKML, Thomas Gleixner,
	H. Peter Anvin, Paul Mackerras, Arnaldo Carvalho de Melo


* Steven Rostedt <rostedt@goodmis.org> wrote:

> On Thu, 2010-03-04 at 12:25 +0100, Ingo Molnar wrote:
> > * Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > > On Wed, 2010-03-03 at 12:07 -0500, Steven Rostedt wrote:
> > > > oops, my bad :-), I thought this was in the x86 arch directory. For the
> > > > University, I was helping them with adding trace points for page faults
> > > > when I came across this in arch/x86/mm/fault.c:
> > > > 
> > > >         perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, 0, regs, address);
> > > > 
> > > > 
> > > > This is what I actually was wondering about. Why is it a "perf only" trace 
> > > > point instead of a TRACE_EVENT()?
> > > 
> > > Because I wanted to make perf usable without having to rely on funny 
> > > tracepoints. That is, I am less worried about committing software counters 
> > > to ABI than I am about TRACE_EVENT(), which still gives me a terribly 
> > > uncomfortable feeling.
> > 
> > I'd still like a much less error-prone and work-intensive way of doing it.
> > 
> > I'd suggest we simply add a TRACE_EVENT_ABI() for such cases, where we 
> > really want to expose a tracepoint to tooling, programmatically. Maybe 
> > even change the usage sites to trace_foo_ABI(), to make it really clear 
> > and to make people aware of the consequences.
> 
> Would this still be available as a normal trace event?

Yeah, of course. It would not result in any usage or flexibility restriction.

( In the future we might want to add some sort of automated signature thing to 
  make sure that an event that has been declared an 'ABI' isn't changed - at 
  least as far as the record format goes. )

> > 
> > > Also, building with all CONFIG_TRACE_*=n will still yield a usable perf, 
> > > which is something the embedded people might fancy; all that TRACE stuff 
> > > adds lots of code.
> > 
> > Not a real issue i suspect when you do lock profiling ...
> > 
> > Or if it is, some debloating might be in order - and the detaching of event 
> > enumeration and ftrace TRACE_EVENT infrastructure from other ftrace bits. (i 
> > suggested an '/eventfs' special filesystem before, for a nicely laid out 
> > hierarchy of ftrace/perf events.)
> 
> Actually, we already have a way to decouple it.
> 
> include/trace/define_trace.h is the file that just adds the tracepoint
> that is needed.
> 
> include/trace/ftrace.h is the file that does the magic and adds the code
> for callbacks and tracing.
> 
> The perf hooks probably should not have gone into that file; they
> should have been put into an include/trace/perf.h file. Then, in
> define_trace.h, we would add:
> 
>  #ifdef CONFIG_EVENT_TRACING
>  #include <trace/ftrace.h>
>  #endif
> 
> +#ifdef CONFIG_PERF_EVENTS
> +#include <trace/perf.h>
> +#endif
> 
> This should be done anyway. It would also let you decouple ftrace trace 
> events from perf trace events while still letting the two use the same 
> tracepoints.

I think the main thing would be to have a decoupled /eventfs - basically 
/debug/tracing/events/ moved to "/eventfs" or maybe to "/proc/events/". This 
would make them available more widely, and in a standardized way.

Thanks,

	Ingo


* Re: [RFC][PATCH 2/3] perf: Take a hot regs snapshot for trace events
  2010-03-04 15:36               ` Ingo Molnar
@ 2010-03-04 15:55                 ` Steven Rostedt
  2010-03-04 21:17                   ` Ingo Molnar
  0 siblings, 1 reply; 51+ messages in thread
From: Steven Rostedt @ 2010-03-04 15:55 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Frederic Weisbecker, LKML, Thomas Gleixner,
	H. Peter Anvin, Paul Mackerras, Arnaldo Carvalho de Melo,
	Greg KH

On Thu, 2010-03-04 at 16:36 +0100, Ingo Molnar wrote:

> > This should be done anyway. It would also let you decouple ftrace trace 
> > events from perf trace events while still letting the two use the same 
> > tracepoints.
> 
> I think the main thing would be to have a decoupled /eventfs - basically 
> /debug/tracing/events/ moved to "/eventfs" or maybe to "/proc/events/". This 
> would make them available more widely, and in a standardized way.

I know Greg once proposed a /tracefs directory. I don't really care how
things work as long as we don't lose functionality. Perhaps we should
have a standard tracefs dir, and have:

/sys/kernel/trace
/sys/kernel/trace/events
/sys/kernel/trace/ftrace
/sys/kernel/trace/perf

This would keep things nicely grouped but separate.

I could also decouple the printing of the formats from ftrace.h and
then, in define_trace.h:

#ifdef CONFIG_EVENTS
# include <trace/events.h>
# ifdef CONFIG_FTRACE_EVENTS
#  include <trace/ftrace.h>
# endif
# ifdef CONFIG_PERF_EVENTS
#  include <trace/perf.h>
# endif
#endif

Have the trace/events.h file create the files for the event directory.

But what about the enable and filter files in the event directory? How
would they be attached? Currently these modify the way ftrace works. I'm
assuming that perf enables these with the syscall. Should these files
still be specific to ftrace if enabled?

-- Steve





* Re: [RFC][PATCH 2/3] perf: Take a hot regs snapshot for trace events
  2010-03-03  8:38   ` Peter Zijlstra
  2010-03-03 20:07     ` Frederic Weisbecker
@ 2010-03-04 19:01     ` Frederic Weisbecker
  2010-03-05  3:08     ` [PATCH 0/2] Perf " Frederic Weisbecker
                       ` (2 subsequent siblings)
  4 siblings, 0 replies; 51+ messages in thread
From: Frederic Weisbecker @ 2010-03-04 19:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, LKML, Thomas Gleixner, H. Peter Anvin,
	Paul Mackerras, Steven Rostedt, Arnaldo Carvalho de Melo

On Wed, Mar 03, 2010 at 09:38:49AM +0100, Peter Zijlstra wrote:
> On Wed, 2010-03-03 at 07:55 +0100, Frederic Weisbecker wrote:
> >  /*
> > + * Hot regs snapshot support -- arch specific
> > + * This needs to be a macro because we want the current
> > + * frame pointer.
> > + */
> > +#ifndef PERF_SAVE_REGS
> > +#define PERF_SAVE_REGS(regs)   memset(regs, 0, sizeof(*regs));
> > +#endif
> 
> It would be nice to have the fallback at least set the current or
> calling IP; as it stands you've basically wrecked stuff for everything !x86.


Actually I can't, as there doesn't seem to be a way to write
regs->ip in a generic way :-s



* Re: [RFC][PATCH 2/3] perf: Take a hot regs snapshot for trace events
  2010-03-04 15:55                 ` Steven Rostedt
@ 2010-03-04 21:17                   ` Ingo Molnar
  2010-03-04 21:30                     ` Steven Rostedt
  0 siblings, 1 reply; 51+ messages in thread
From: Ingo Molnar @ 2010-03-04 21:17 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Frederic Weisbecker, LKML, Thomas Gleixner,
	H. Peter Anvin, Paul Mackerras, Arnaldo Carvalho de Melo,
	Greg KH


* Steven Rostedt <rostedt@goodmis.org> wrote:

> On Thu, 2010-03-04 at 16:36 +0100, Ingo Molnar wrote:
> 
> > > This should be done anyway. It would also let you decouple ftrace trace 
> > > events from perf trace events while still letting the two use the same 
> > > tracepoints.
> > 
> > I think the main thing would be to have a decoupled /eventfs - basically 
> > /debug/tracing/events/ moved to "/eventfs" or maybe to "/proc/events/". This 
> > would make them available more widely, and in a standardized way.
> 
> I know Greg once proposed a /tracefs directory. I don't really care how 
> things work as long as we don't lose functionality. Perhaps we should have a 
> standard tracefs dir, and have:

No, we want to decouple it from 'tracing'. It's events, not tracing. Events 
are broader: they can be used for RAS, profiling, counting, etc. - not 
just tracing.

Furthermore, we only want /debug/tracing/events really, not the various 
dynamic ftrace controls - those could remain in /debug/tracing/.

Thanks,

	Ingo


* Re: [RFC][PATCH 2/3] perf: Take a hot regs snapshot for trace events
  2010-03-04 21:17                   ` Ingo Molnar
@ 2010-03-04 21:30                     ` Steven Rostedt
  2010-03-04 21:37                       ` Frederic Weisbecker
  0 siblings, 1 reply; 51+ messages in thread
From: Steven Rostedt @ 2010-03-04 21:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Frederic Weisbecker, LKML, Thomas Gleixner,
	H. Peter Anvin, Paul Mackerras, Arnaldo Carvalho de Melo,
	Greg KH

On Thu, 2010-03-04 at 22:17 +0100, Ingo Molnar wrote:
> * Steven Rostedt <rostedt@goodmis.org> wrote:
> 

> No, we want to decouple it from 'tracing'. It's events, not tracing. Events 
> are broader: they can be used for RAS, profiling, counting, etc. - not 
> just tracing.
> 
> Furthermore, we only want /debug/tracing/events really, not the various 
> dynamic ftrace controls - those could remain in /debug/tracing/.

I was talking about the files in the events directory:

events/sched/sched_switch/{id,format,enable,filter}


Seems only the format file should go in, and perhaps the id.

I can keep the debug/tracing/events/* as is too, where the format and id
just call the same routines that the eventfs calls, but add the enable
and filter to be specific to ftrace.

-- Steve




* Re: [RFC][PATCH 2/3] perf: Take a hot regs snapshot for trace events
  2010-03-04 21:30                     ` Steven Rostedt
@ 2010-03-04 21:37                       ` Frederic Weisbecker
  2010-03-04 21:52                         ` Thomas Gleixner
  2010-03-04 22:02                         ` Steven Rostedt
  0 siblings, 2 replies; 51+ messages in thread
From: Frederic Weisbecker @ 2010-03-04 21:37 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Thomas Gleixner,
	H. Peter Anvin, Paul Mackerras, Arnaldo Carvalho de Melo,
	Greg KH

On Thu, Mar 04, 2010 at 04:30:38PM -0500, Steven Rostedt wrote:
> On Thu, 2010-03-04 at 22:17 +0100, Ingo Molnar wrote:
> > * Steven Rostedt <rostedt@goodmis.org> wrote:
> > 
> 
> > No, we want to decouple it from 'tracing'. It's events, not tracing. Events 
> > are broader: they can be used for RAS, profiling, counting, etc. - not 
> > just tracing.
> > 
> > Furthermore, we only want /debug/tracing/events really, not the various 
> > dynamic ftrace controls - those could remain in /debug/tracing/.
> 
> I was talking about the files in the events directory:
> 
> events/sched/sched_switch/{id,format,enable,filter}
> 
> 
> Seems only the format file should go in, and perhaps the id.
> 
> I can keep the debug/tracing/events/* as is too, where the format and id
> just call the same routines that the eventfs calls, but add the enable
> and filter to be specific to ftrace.


The /debug/tracing/events could contain symlinks for the format
files so that the rest can stay there.




* Re: [RFC][PATCH 2/3] perf: Take a hot regs snapshot for trace events
  2010-03-04 21:37                       ` Frederic Weisbecker
@ 2010-03-04 21:52                         ` Thomas Gleixner
  2010-03-04 22:01                           ` Frederic Weisbecker
  2010-03-04 22:02                         ` Steven Rostedt
  1 sibling, 1 reply; 51+ messages in thread
From: Thomas Gleixner @ 2010-03-04 21:52 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Steven Rostedt, Ingo Molnar, Peter Zijlstra, LKML,
	H. Peter Anvin, Paul Mackerras, Arnaldo Carvalho de Melo,
	Greg KH

On Thu, 4 Mar 2010, Frederic Weisbecker wrote:

> On Thu, Mar 04, 2010 at 04:30:38PM -0500, Steven Rostedt wrote:
> > On Thu, 2010-03-04 at 22:17 +0100, Ingo Molnar wrote:
> > > * Steven Rostedt <rostedt@goodmis.org> wrote:
> > > 
> > 
> > > No, we want to decouple it from 'tracing'. It's events, not tracing. Events 
> > > are broader: they can be used for RAS, profiling, counting, etc. - not 
> > > just tracing.
> > > 
> > > Furthermore, we only want /debug/tracing/events really, not the various 
> > > dynamic ftrace controls - those could remain in /debug/tracing/.
> > 
> > I was talking about the files in the events directory:
> > 
> > events/sched/sched_switch/{id,format,enable,filter}
> > 
> > 
> > Seems only the format file should go in, and perhaps the id.
> > 
> > I can keep the debug/tracing/events/* as is too, where the format and id
> > just call the same routines that the eventfs calls, but add the enable
> > and filter to be specific to ftrace.
> 
> 
> The /debug/tracing/events could contain symlinks for the format
> files so that the rest can stay there.

Are you proposing to create another sysfs symlink maze ?

Thanks,

	tglx


* Re: [RFC][PATCH 2/3] perf: Take a hot regs snapshot for trace events
  2010-03-04 21:52                         ` Thomas Gleixner
@ 2010-03-04 22:01                           ` Frederic Weisbecker
  0 siblings, 0 replies; 51+ messages in thread
From: Frederic Weisbecker @ 2010-03-04 22:01 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Steven Rostedt, Ingo Molnar, Peter Zijlstra, LKML,
	H. Peter Anvin, Paul Mackerras, Arnaldo Carvalho de Melo,
	Greg KH

On Thu, Mar 04, 2010 at 10:52:36PM +0100, Thomas Gleixner wrote:
> On Thu, 4 Mar 2010, Frederic Weisbecker wrote:
> 
> > On Thu, Mar 04, 2010 at 04:30:38PM -0500, Steven Rostedt wrote:
> > > On Thu, 2010-03-04 at 22:17 +0100, Ingo Molnar wrote:
> > > > * Steven Rostedt <rostedt@goodmis.org> wrote:
> > > > 
> > > 
> > > > No, we want to decouple it from 'tracing'. It's events, not tracing. Events 
> > > > are broader: they can be used for RAS, profiling, counting, etc. - not 
> > > > just tracing.
> > > > 
> > > > Furthermore, we only want /debug/tracing/events really, not the various 
> > > > dynamic ftrace controls - those could remain in /debug/tracing/.
> > > 
> > > I was talking about the files in the events directory:
> > > 
> > > events/sched/sched_switch/{id,format,enable,filter}
> > > 
> > > 
> > > Seems only the format file should go in, and perhaps the id.
> > > 
> > > I can keep the debug/tracing/events/* as is too, where the format and id
> > > just call the same routines that the eventfs calls, but add the enable
> > > and filter to be specific to ftrace.
> > 
> > 
> > The /debug/tracing/events could contain symlinks for the format
> > files so that the rest can stay there.
> 
> Are you proposing to create another sysfs symlink maze ?
> 
> Thanks,
> 
> 	tglx


I don't know. I like the idea of generalizing the events, but I hope
it won't be to the detriment of ftrace usability.



* Re: [RFC][PATCH 2/3] perf: Take a hot regs snapshot for trace events
  2010-03-04 21:37                       ` Frederic Weisbecker
  2010-03-04 21:52                         ` Thomas Gleixner
@ 2010-03-04 22:02                         ` Steven Rostedt
  2010-03-04 22:09                           ` Frederic Weisbecker
  1 sibling, 1 reply; 51+ messages in thread
From: Steven Rostedt @ 2010-03-04 22:02 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Thomas Gleixner,
	H. Peter Anvin, Paul Mackerras, Arnaldo Carvalho de Melo,
	Greg KH

On Thu, 2010-03-04 at 22:37 +0100, Frederic Weisbecker wrote:

> > I can keep the debug/tracing/events/* as is too, where the format and id
> > just call the same routines that the eventfs calls, but add the enable
> > and filter to be specific to ftrace.
> 
> 
> The /debug/tracing/events could contain symlinks for the format
> files so that the rest can stay there.

Symlinks buy us nothing but headaches. This is a pseudo filesystem, we
can use the same code that generates the file with no extra overhead.

-- Steve
 




* Re: [RFC][PATCH 2/3] perf: Take a hot regs snapshot for trace events
  2010-03-04 22:02                         ` Steven Rostedt
@ 2010-03-04 22:09                           ` Frederic Weisbecker
  0 siblings, 0 replies; 51+ messages in thread
From: Frederic Weisbecker @ 2010-03-04 22:09 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Thomas Gleixner,
	H. Peter Anvin, Paul Mackerras, Arnaldo Carvalho de Melo,
	Greg KH

On Thu, Mar 04, 2010 at 05:02:48PM -0500, Steven Rostedt wrote:
> On Thu, 2010-03-04 at 22:37 +0100, Frederic Weisbecker wrote:
> 
> > > I can keep the debug/tracing/events/* as is too, where the format and id
> > > just call the same routines that the eventfs calls, but add the enable
> > > and filter to be specific to ftrace.
> > 
> > 
> > The /debug/tracing/events could contain symlinks for the format
> > files so that the rest can stay there.
> 
> Symlinks buy us nothing but headaches. This is a pseudo filesystem, we
> can use the same code that generates the file with no extra overhead.


Yeah, you're right.



* [PATCH 0/2] Perf hot regs snapshot for trace events
  2010-03-03  8:38   ` Peter Zijlstra
  2010-03-03 20:07     ` Frederic Weisbecker
  2010-03-04 19:01     ` Frederic Weisbecker
@ 2010-03-05  3:08     ` Frederic Weisbecker
  2010-03-05  3:08       ` Frederic Weisbecker
  2010-03-05  3:08       ` Frederic Weisbecker
  4 siblings, 0 replies; 51+ messages in thread
From: Frederic Weisbecker @ 2010-03-05  3:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Frederic Weisbecker, Ingo Molnar, Thomas Gleixner,
	H . Peter Anvin, Paul Mackerras, Steven Rostedt,
	Arnaldo Carvalho de Melo, Masami Hiramatsu, Jason Baron, Archs

Please tell me what you think about this new version.
I still cannot fall back to saving the ip for archs that
don't implement it, as that can't be done in a generic way,
but more work has been done to make this support easier
to implement.

There is no more assembly in this new version; it also uses
functions instead of macros and implements caller-level
skipping: instead of starting the callchain from the
perf/ftrace event callbacks, we start at the real event
source.

Thanks.

Frederic Weisbecker (2):
  perf: Introduce new perf_save_regs() for hot regs snapshot
  perf: Take a hot regs snapshot for trace events

 arch/x86/kernel/cpu/perf_event.c   |   12 ++++++++++
 arch/x86/kernel/dumpstack.h        |   15 ++++++++++++
 include/linux/ftrace_event.h       |    7 ++++-
 include/linux/perf_event.h         |   42 +++++++++++++++++++++++++++++++++++-
 include/trace/ftrace.h             |    6 ++++-
 kernel/perf_event.c                |   14 ++++++------
 kernel/trace/trace_event_profile.c |    3 +-
 kernel/trace/trace_kprobe.c        |    5 ++-
 kernel/trace/trace_syscalls.c      |    4 +-
 9 files changed, 92 insertions(+), 16 deletions(-)



* [PATCH 1/2] perf: Introduce new perf_save_regs() for hot regs snapshot
  2010-03-03  8:38   ` Peter Zijlstra
  2010-03-03 20:07     ` Frederic Weisbecker
@ 2010-03-05  3:08       ` Frederic Weisbecker
  2010-03-05  3:08     ` [PATCH 0/2] Perf " Frederic Weisbecker
                         ` (2 subsequent siblings)
  4 siblings, 0 replies; 51+ messages in thread
From: Frederic Weisbecker @ 2010-03-05  3:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Frederic Weisbecker, Ingo Molnar, Thomas Gleixner,
	H . Peter Anvin, Paul Mackerras, Steven Rostedt,
	Arnaldo Carvalho de Melo, Masami Hiramatsu, Jason Baron, Archs,
	Peter Zijlstra

Events that trigger overflows by interrupting a context can
use get_irq_regs() or task_pt_regs() to retrieve the state
when the event triggered. But this is not the case for some
other classes of events, like trace events, as tracepoints are
executed in the same context as the code that triggered
the event.

It means we need a different API to capture the regs there;
namely, we need a hot snapshot to get the most important
information for perf: the instruction pointer to get the
event origin, the frame pointer for the callchain, the code
segment for user_mode() tests (we always use __KERNEL_CS as
trace events always occur from the kernel) and the eflags
for future purposes.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Archs <linux-arch@vger.kernel.org>
---
 arch/x86/kernel/cpu/perf_event.c |   12 ++++++++++
 arch/x86/kernel/dumpstack.h      |   15 +++++++++++++
 include/linux/perf_event.h       |   42 +++++++++++++++++++++++++++++++++++++-
 kernel/perf_event.c              |    5 ++++
 4 files changed, 73 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 641ccb9..274d0cd 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1662,6 +1662,18 @@ struct perf_callchain_entry *perf_callchain(struct pt_regs *regs)
 	return entry;
 }
 
+void perf_arch_save_regs(struct pt_regs *regs, unsigned long ip, int skip)
+{
+	regs->ip = ip;
+	/*
+	 * perf_arch_save_regs adds another call, we need to increment
+	 * the skip level
+	 */
+	regs->bp = rewind_frame_pointer(skip + 1);
+	regs->cs = __KERNEL_CS;
+	local_save_flags(regs->flags);
+}
+
 void hw_perf_event_setup_online(int cpu)
 {
 	init_debug_store_on_cpu(cpu);
diff --git a/arch/x86/kernel/dumpstack.h b/arch/x86/kernel/dumpstack.h
index 4fd1420..29e5f7c 100644
--- a/arch/x86/kernel/dumpstack.h
+++ b/arch/x86/kernel/dumpstack.h
@@ -29,4 +29,19 @@ struct stack_frame {
 	struct stack_frame *next_frame;
 	unsigned long return_address;
 };
+
+static inline unsigned long rewind_frame_pointer(int n)
+{
+	struct stack_frame *frame;
+
+	get_bp(frame);
+
+#ifdef CONFIG_FRAME_POINTER
+	while (n--)
+		frame = frame->next_frame;
 #endif
+
+	return (unsigned long)frame;
+}
+
+#endif /* DUMPSTACK_H */
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 04f06b4..e35ad6f 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -452,6 +452,7 @@ enum perf_callchain_context {
 #include <linux/fs.h>
 #include <linux/pid_namespace.h>
 #include <linux/workqueue.h>
+#include <linux/ftrace.h>
 #include <asm/atomic.h>
 
 #define PERF_MAX_STACK_DEPTH		255
@@ -840,6 +841,44 @@ perf_sw_event(u32 event_id, u64 nr, int nmi, struct pt_regs *regs, u64 addr)
 		__perf_sw_event(event_id, nr, nmi, regs, addr);
 }
 
+extern void
+perf_arch_save_regs(struct pt_regs *regs, unsigned long ip, int skip);
+
+/*
+ * Take a snapshot of the regs. Skip ip and frame pointer to
+ * the nth caller. We only need a few of the regs:
+ * - ip for PERF_SAMPLE_IP
+ * - cs for user_mode() tests
+ * - bp for callchains
+ * - eflags, for future purposes, just in case
+ */
+static inline void perf_save_regs(struct pt_regs *regs, int skip)
+{
+	unsigned long ip;
+
+	memset(regs, 0, sizeof(*regs));
+
+	switch (skip) {
+	case 1 :
+		ip = CALLER_ADDR0;
+		break;
+	case 2 :
+		ip = CALLER_ADDR1;
+		break;
+	case 3 :
+		ip = CALLER_ADDR2;
+		break;
+	case 4:
+		ip = CALLER_ADDR3;
+		break;
+	/* No need to support further for now */
+	default:
+		ip = 0;
+	}
+
+	return perf_arch_save_regs(regs, ip, skip);
+}
+
 extern void __perf_event_mmap(struct vm_area_struct *vma);
 
 static inline void perf_event_mmap(struct vm_area_struct *vma)
@@ -858,7 +897,8 @@ extern int sysctl_perf_event_mlock;
 extern int sysctl_perf_event_sample_rate;
 
 extern void perf_event_init(void);
-extern void perf_tp_event(int event_id, u64 addr, u64 count, void *record, int entry_size);
+extern void perf_tp_event(int event_id, u64 addr, u64 count, void *record,
+			  int entry_size, struct pt_regs *regs);
 extern void perf_bp_event(struct perf_event *event, void *data);
 
 #ifndef perf_misc_flags
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index a661e79..42b8276 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -2806,6 +2806,11 @@ __weak struct perf_callchain_entry *perf_callchain(struct pt_regs *regs)
 	return NULL;
 }
 
+__weak
+void perf_arch_save_regs(struct pt_regs *regs, unsigned long ip, int skip)
+{
+}
+
 /*
  * Output
  */
-- 
1.6.2.3




* [PATCH 1/2] perf: Introduce new perf_save_regs() for hot regs snapshot
@ 2010-03-05  3:08       ` Frederic Weisbecker
  0 siblings, 0 replies; 51+ messages in thread
From: Frederic Weisbecker @ 2010-03-05  3:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Frederic Weisbecker, Ingo Molnar, Thomas Gleixner,
	H . Peter Anvin, Paul Mackerras, Steven Rostedt,
	Arnaldo Carvalho de Melo, Masami Hiramatsu, Jason Baron, Archs

Events that trigger overflows by interrupting a context can
use get_irq_regs() or task_pt_regs() to retrieve the state
when the event triggered. But this is not the case for some
other classes of events, like trace events, as tracepoints are
executed in the same context as the code that triggered
the event.

It means we need a different API to capture the regs there;
namely, we need a hot snapshot to get the most important
information for perf: the instruction pointer to get the
event origin, the frame pointer for the callchain, the code
segment for user_mode() tests (we always use __KERNEL_CS as
trace events always occur from the kernel) and the eflags
for future purposes.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Archs <linux-arch@vger.kernel.org>
---
 arch/x86/kernel/cpu/perf_event.c |   12 ++++++++++
 arch/x86/kernel/dumpstack.h      |   15 +++++++++++++
 include/linux/perf_event.h       |   42 +++++++++++++++++++++++++++++++++++++-
 kernel/perf_event.c              |    5 ++++
 4 files changed, 73 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 641ccb9..274d0cd 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1662,6 +1662,18 @@ struct perf_callchain_entry *perf_callchain(struct pt_regs *regs)
 	return entry;
 }
 
+void perf_arch_save_regs(struct pt_regs *regs, unsigned long ip, int skip)
+{
+	regs->ip = ip;
+	/*
+	 * perf_arch_save_regs adds another call, we need to increment
+	 * the skip level
+	 */
+	regs->bp = rewind_frame_pointer(skip + 1);
+	regs->cs = __KERNEL_CS;
+	local_save_flags(regs->flags);
+}
+
 void hw_perf_event_setup_online(int cpu)
 {
 	init_debug_store_on_cpu(cpu);
diff --git a/arch/x86/kernel/dumpstack.h b/arch/x86/kernel/dumpstack.h
index 4fd1420..29e5f7c 100644
--- a/arch/x86/kernel/dumpstack.h
+++ b/arch/x86/kernel/dumpstack.h
@@ -29,4 +29,19 @@ struct stack_frame {
 	struct stack_frame *next_frame;
 	unsigned long return_address;
 };
+
+static inline unsigned long rewind_frame_pointer(int n)
+{
+	struct stack_frame *frame;
+
+	get_bp(frame);
+
+#ifdef CONFIG_FRAME_POINTER
+	while (n--)
+		frame = frame->next_frame;
 #endif
+
+	return (unsigned long)frame;
+}
+
+#endif /* DUMPSTACK_H */
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 04f06b4..e35ad6f 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -452,6 +452,7 @@ enum perf_callchain_context {
 #include <linux/fs.h>
 #include <linux/pid_namespace.h>
 #include <linux/workqueue.h>
+#include <linux/ftrace.h>
 #include <asm/atomic.h>
 
 #define PERF_MAX_STACK_DEPTH		255
@@ -840,6 +841,44 @@ perf_sw_event(u32 event_id, u64 nr, int nmi, struct pt_regs *regs, u64 addr)
 		__perf_sw_event(event_id, nr, nmi, regs, addr);
 }
 
+extern void
+perf_arch_save_regs(struct pt_regs *regs, unsigned long ip, int skip);
+
+/*
+ * Take a snapshot of the regs. Skip ip and frame pointer to
+ * the nth caller. We only need a few of the regs:
+ * - ip for PERF_SAMPLE_IP
+ * - cs for user_mode() tests
+ * - bp for callchains
+ * - eflags, for future purposes, just in case
+ */
+static inline void perf_save_regs(struct pt_regs *regs, int skip)
+{
+	unsigned long ip;
+
+	memset(regs, 0, sizeof(*regs));
+
+	switch (skip) {
+	case 1 :
+		ip = CALLER_ADDR0;
+		break;
+	case 2 :
+		ip = CALLER_ADDR1;
+		break;
+	case 3 :
+		ip = CALLER_ADDR2;
+		break;
+	case 4:
+		ip = CALLER_ADDR3;
+		break;
+	/* No need to support further for now */
+	default:
+		ip = 0;
+	}
+
+	return perf_arch_save_regs(regs, ip, skip);
+}
+
 extern void __perf_event_mmap(struct vm_area_struct *vma);
 
 static inline void perf_event_mmap(struct vm_area_struct *vma)
@@ -858,7 +897,8 @@ extern int sysctl_perf_event_mlock;
 extern int sysctl_perf_event_sample_rate;
 
 extern void perf_event_init(void);
-extern void perf_tp_event(int event_id, u64 addr, u64 count, void *record, int entry_size);
+extern void perf_tp_event(int event_id, u64 addr, u64 count, void *record,
+			  int entry_size, struct pt_regs *regs);
 extern void perf_bp_event(struct perf_event *event, void *data);
 
 #ifndef perf_misc_flags
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index a661e79..42b8276 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -2806,6 +2806,11 @@ __weak struct perf_callchain_entry *perf_callchain(struct pt_regs *regs)
 	return NULL;
 }
 
+__weak
+void perf_arch_save_regs(struct pt_regs *regs, unsigned long ip, int skip)
+{
+}
+
 /*
  * Output
  */
-- 
1.6.2.3


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 2/2] perf: Take a hot regs snapshot for trace events
  2010-03-03  8:38   ` Peter Zijlstra
  2010-03-03 20:07     ` Frederic Weisbecker
@ 2010-03-05  3:08       ` Frederic Weisbecker
  2010-03-05  3:08     ` [PATCH 0/2] Perf " Frederic Weisbecker
                         ` (2 subsequent siblings)
  4 siblings, 0 replies; 51+ messages in thread
From: Frederic Weisbecker @ 2010-03-05  3:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Frederic Weisbecker, Ingo Molnar, Thomas Gleixner,
	H . Peter Anvin, Paul Mackerras, Steven Rostedt,
	Arnaldo Carvalho de Melo, Masami Hiramatsu, Jason Baron, Archs,
	Peter Zijlstra

We are taking a wrong regs snapshot when a trace event triggers.
Either we use get_irq_regs(), which gives us the interrupted
registers if we are in an interrupt, or we use task_pt_regs(),
which gives us the state before we entered the kernel, assuming
we are lucky enough not to be a kernel thread; for a kernel
thread, task_pt_regs() returns the initial set of regs from
when the thread was started.

What we want is different. We need a hot snapshot of the regs,
so that we can get the instruction pointer to record in the
sample, the frame pointer for the callchain, and some other
things.

Let's use the new perf_save_regs() for that.

Comparison with perf record -e lock: -R -a -f -g
Before:

        perf  [kernel]                   [k] __do_softirq
               |
               --- __do_softirq
                  |
                  |--55.16%-- __open
                  |
                   --44.84%-- __write_nocancel

After:

            perf  [kernel]           [k] perf_tp_event
               |
               --- perf_tp_event
                  |
                  |--41.07%-- lock_acquire
                  |          |
                  |          |--39.36%-- _raw_spin_lock
                  |          |          |
                  |          |          |--7.81%-- hrtimer_interrupt
                  |          |          |          smp_apic_timer_interrupt
                  |          |          |          apic_timer_interrupt

The old case was producing unreliable callchains. Now that we
have the right frame and instruction pointers, we get the
trace we want.

Also, syscall and kprobe events already have the right regs;
let's use them instead of wasting another retrieval.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Archs <linux-arch@vger.kernel.org>
---
 include/linux/ftrace_event.h       |    7 +++++--
 include/trace/ftrace.h             |    6 +++++-
 kernel/perf_event.c                |    9 ++-------
 kernel/trace/trace_event_profile.c |    3 ++-
 kernel/trace/trace_kprobe.c        |    5 +++--
 kernel/trace/trace_syscalls.c      |    4 ++--
 6 files changed, 19 insertions(+), 15 deletions(-)

diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
index cd95919..48b367e 100644
--- a/include/linux/ftrace_event.h
+++ b/include/linux/ftrace_event.h
@@ -188,6 +188,9 @@ do {									\
 
 #ifdef CONFIG_PERF_EVENTS
 struct perf_event;
+
+DECLARE_PER_CPU(struct pt_regs, perf_trace_regs);
+
 extern int ftrace_profile_enable(int event_id);
 extern void ftrace_profile_disable(int event_id);
 extern int ftrace_profile_set_filter(struct perf_event *event, int event_id,
@@ -199,11 +202,11 @@ ftrace_perf_buf_prepare(int size, unsigned short type, int *rctxp,
 
 static inline void
 ftrace_perf_buf_submit(void *raw_data, int size, int rctx, u64 addr,
-		       u64 count, unsigned long irq_flags)
+		       u64 count, unsigned long irq_flags, struct pt_regs *regs)
 {
 	struct trace_entry *entry = raw_data;
 
-	perf_tp_event(entry->type, addr, count, raw_data, size);
+	perf_tp_event(entry->type, addr, count, raw_data, size, regs);
 	perf_swevent_put_recursion_context(rctx);
 	local_irq_restore(irq_flags);
 }
diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h
index f2c09e4..a4fa51c 100644
--- a/include/trace/ftrace.h
+++ b/include/trace/ftrace.h
@@ -853,6 +853,7 @@ ftrace_profile_templ_##call(struct ftrace_event_call *event_call,	\
 	struct ftrace_raw_##call *entry;				\
 	u64 __addr = 0, __count = 1;					\
 	unsigned long irq_flags;					\
+	struct pt_regs *__regs;						\
 	int __entry_size;						\
 	int __data_size;						\
 	int rctx;							\
@@ -873,8 +874,11 @@ ftrace_profile_templ_##call(struct ftrace_event_call *event_call,	\
 									\
 	{ assign; }							\
 									\
+	__regs = &__get_cpu_var(perf_trace_regs);			\
+	perf_save_regs(__regs, 2);					\
+									\
 	ftrace_perf_buf_submit(entry, __entry_size, rctx, __addr,	\
-			       __count, irq_flags);			\
+			       __count, irq_flags, __regs);		\
 }
 
 #undef DEFINE_EVENT
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 42b8276..763a482 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -4340,7 +4340,7 @@ static const struct pmu perf_ops_task_clock = {
 #ifdef CONFIG_EVENT_TRACING
 
 void perf_tp_event(int event_id, u64 addr, u64 count, void *record,
-			  int entry_size)
+		   int entry_size, struct pt_regs *regs)
 {
 	struct perf_raw_record raw = {
 		.size = entry_size,
@@ -4352,14 +4352,9 @@ void perf_tp_event(int event_id, u64 addr, u64 count, void *record,
 		.raw = &raw,
 	};
 
-	struct pt_regs *regs = get_irq_regs();
-
-	if (!regs)
-		regs = task_pt_regs(current);
-
 	/* Trace events already protected against recursion */
 	do_perf_sw_event(PERF_TYPE_TRACEPOINT, event_id, count, 1,
-				&data, regs);
+			 &data, regs);
 }
 EXPORT_SYMBOL_GPL(perf_tp_event);
 
diff --git a/kernel/trace/trace_event_profile.c b/kernel/trace/trace_event_profile.c
index f0d6930..e66d21e 100644
--- a/kernel/trace/trace_event_profile.c
+++ b/kernel/trace/trace_event_profile.c
@@ -2,13 +2,14 @@
  * trace event based perf counter profiling
  *
  * Copyright (C) 2009 Red Hat Inc, Peter Zijlstra <pzijlstr@redhat.com>
- *
+ * Copyright (C) 2009-2010 Frederic Weisbecker <fweisbec@gmail.com>
  */
 
 #include <linux/module.h>
 #include <linux/kprobes.h>
 #include "trace.h"
 
+DEFINE_PER_CPU(struct pt_regs, perf_trace_regs);
 
 static char *perf_trace_buf;
 static char *perf_trace_buf_nmi;
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 356c102..73e0526 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -1260,7 +1260,7 @@ static __kprobes void kprobe_profile_func(struct kprobe *kp,
 	for (i = 0; i < tp->nr_args; i++)
 		entry->args[i] = call_fetch(&tp->args[i].fetch, regs);
 
-	ftrace_perf_buf_submit(entry, size, rctx, entry->ip, 1, irq_flags);
+	ftrace_perf_buf_submit(entry, size, rctx, entry->ip, 1, irq_flags, regs);
 }
 
 /* Kretprobe profile handler */
@@ -1291,7 +1291,8 @@ static __kprobes void kretprobe_profile_func(struct kretprobe_instance *ri,
 	for (i = 0; i < tp->nr_args; i++)
 		entry->args[i] = call_fetch(&tp->args[i].fetch, regs);
 
-	ftrace_perf_buf_submit(entry, size, rctx, entry->ret_ip, 1, irq_flags);
+	ftrace_perf_buf_submit(entry, size, rctx, entry->ret_ip, 1,
+			       irq_flags, regs);
 }
 
 static int probe_profile_enable(struct ftrace_event_call *call)
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index 4e332b9..d0f34ab 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -462,7 +462,7 @@ static void prof_syscall_enter(struct pt_regs *regs, long id)
 	rec->nr = syscall_nr;
 	syscall_get_arguments(current, regs, 0, sys_data->nb_args,
 			       (unsigned long *)&rec->args);
-	ftrace_perf_buf_submit(rec, size, rctx, 0, 1, flags);
+	ftrace_perf_buf_submit(rec, size, rctx, 0, 1, flags, regs);
 }
 
 int prof_sysenter_enable(struct ftrace_event_call *call)
@@ -537,7 +537,7 @@ static void prof_syscall_exit(struct pt_regs *regs, long ret)
 	rec->nr = syscall_nr;
 	rec->ret = syscall_get_return_value(current, regs);
 
-	ftrace_perf_buf_submit(rec, size, rctx, 0, 1, flags);
+	ftrace_perf_buf_submit(rec, size, rctx, 0, 1, flags, regs);
 }
 
 int prof_sysexit_enable(struct ftrace_event_call *call)
-- 
1.6.2.3


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [PATCH 1/2] perf: Introduce new perf_save_regs() for hot regs snapshot
  2010-03-05  3:08       ` Frederic Weisbecker
  (?)
  (?)
@ 2010-03-05 15:08       ` Masami Hiramatsu
  2010-03-05 16:38         ` Frederic Weisbecker
  -1 siblings, 1 reply; 51+ messages in thread
From: Masami Hiramatsu @ 2010-03-05 15:08 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Peter Zijlstra, LKML, Ingo Molnar, Thomas Gleixner,
	H . Peter Anvin, Paul Mackerras, Steven Rostedt,
	Arnaldo Carvalho de Melo, Jason Baron, Archs

Frederic Weisbecker wrote:
> Events that trigger overflows by interrupting a context can
> use get_irq_regs() or task_pt_regs() to retrieve the state
> at the time the event triggered. But this is not the case for
> other classes of events, like trace events, as tracepoints
> are executed in the same context as the code that triggered
> the event.
> 
> This means we need a different API to capture the regs there,
> namely a hot snapshot that records the most important
> information for perf: the instruction pointer to locate the
> event origin, the frame pointer for the callchain, the code
> segment for user_mode() tests (we always use __KERNEL_CS, as
> trace events always occur in the kernel), and the eflags for
> future use.
> 
> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: Ingo Molnar <mingo@elte.hu>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: H. Peter Anvin <hpa@zytor.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Paul Mackerras <paulus@samba.org>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
> Cc: Masami Hiramatsu <mhiramat@redhat.com>
> Cc: Jason Baron <jbaron@redhat.com>
> Cc: Archs <linux-arch@vger.kernel.org>
> ---
>  arch/x86/kernel/cpu/perf_event.c |   12 ++++++++++
>  arch/x86/kernel/dumpstack.h      |   15 +++++++++++++
>  include/linux/perf_event.h       |   42 +++++++++++++++++++++++++++++++++++++-
>  kernel/perf_event.c              |    5 ++++
>  4 files changed, 73 insertions(+), 1 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
> index 641ccb9..274d0cd 100644
> --- a/arch/x86/kernel/cpu/perf_event.c
> +++ b/arch/x86/kernel/cpu/perf_event.c
> @@ -1662,6 +1662,18 @@ struct perf_callchain_entry *perf_callchain(struct pt_regs *regs)
>  	return entry;
>  }
>  
> +void perf_arch_save_regs(struct pt_regs *regs, unsigned long ip, int skip)

Hmm, why would you call it 'save_regs'?
It seems that this function is just for fixing up registers
rather than saving them somewhere...

Thank you,


-- 
Masami Hiramatsu
e-mail: mhiramat@redhat.com

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 1/2] perf: Introduce new perf_save_regs() for hot regs snapshot
  2010-03-05 15:08       ` Masami Hiramatsu
@ 2010-03-05 16:38         ` Frederic Weisbecker
  2010-03-05 17:08           ` Masami Hiramatsu
  0 siblings, 1 reply; 51+ messages in thread
From: Frederic Weisbecker @ 2010-03-05 16:38 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Peter Zijlstra, LKML, Ingo Molnar, Thomas Gleixner,
	H . Peter Anvin, Paul Mackerras, Steven Rostedt,
	Arnaldo Carvalho de Melo, Jason Baron, Archs

On Fri, Mar 05, 2010 at 10:08:07AM -0500, Masami Hiramatsu wrote:
> > +void perf_arch_save_regs(struct pt_regs *regs, unsigned long ip, int skip)
> 
> Hmm, why would you call it 'save_regs'?
> It seems that this function is just for fixing up registers
> rather than saving them somewhere...
> 
> Thank you,


Hmm, save_regs() describes what it does: you pass it
a pt_regs and it saves the registers inside. But it also
does a kind of fixup, as it rewinds the frame pointer.

I'm not sure a 'fixup' name would be correct, as we don't
start from an initial set of regs passed to the function
(just a raw buffer).

What about perf_save_caller_regs()?


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 1/2] perf: Introduce new perf_save_regs() for hot regs snapshot
  2010-03-05 16:38         ` Frederic Weisbecker
@ 2010-03-05 17:08           ` Masami Hiramatsu
  2010-03-05 17:17             ` Frederic Weisbecker
                               ` (2 more replies)
  0 siblings, 3 replies; 51+ messages in thread
From: Masami Hiramatsu @ 2010-03-05 17:08 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Peter Zijlstra, LKML, Ingo Molnar, Thomas Gleixner,
	H . Peter Anvin, Paul Mackerras, Steven Rostedt,
	Arnaldo Carvalho de Melo, Jason Baron, Archs

Frederic Weisbecker wrote:
> On Fri, Mar 05, 2010 at 10:08:07AM -0500, Masami Hiramatsu wrote:
>>> +void perf_arch_save_regs(struct pt_regs *regs, unsigned long ip, int skip)
>>
>> Hmm, why would you call it 'save_regs' ?
>> It seems that this function is just for fixing registers 
>> instead of saving it into somewhere...
>>
>> Thank you,
> 
> 
> Hmm, save_regs() describes what it does: you pass
> a pt_regs and it saves registers inside. But it
> has also a kind of fixup thing as it also rewinds.

Ah, I see. So this saves the current register values
into pt_regs. :)

> 
> I'm not sure using a fixup thing for the naming
> is correct as we are not starting with initial
> regs passed to the function (just a raw buffer).
> 
> What about perf_save_caller_regs() ?
> 

Hmm, I think perf_get_caller_regs() or something
like that might be better (fetch?).

Thank you,


-- 
Masami Hiramatsu
e-mail: mhiramat@redhat.com

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 1/2] perf: Introduce new perf_save_regs() for hot regs snapshot
  2010-03-05 17:08           ` Masami Hiramatsu
@ 2010-03-05 17:17             ` Frederic Weisbecker
  2010-03-05 20:55               ` Frederic Weisbecker
  2010-03-05 20:55               ` Frederic Weisbecker
  2 siblings, 0 replies; 51+ messages in thread
From: Frederic Weisbecker @ 2010-03-05 17:17 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Peter Zijlstra, LKML, Ingo Molnar, Thomas Gleixner,
	H . Peter Anvin, Paul Mackerras, Steven Rostedt,
	Arnaldo Carvalho de Melo, Jason Baron, Archs

On Fri, Mar 05, 2010 at 12:08:19PM -0500, Masami Hiramatsu wrote:
> Frederic Weisbecker wrote:
> > On Fri, Mar 05, 2010 at 10:08:07AM -0500, Masami Hiramatsu wrote:
> >>> +void perf_arch_save_regs(struct pt_regs *regs, unsigned long ip, int skip)
> >>
> >> Hmm, why would you call it 'save_regs' ?
> >> It seems that this function is just for fixing registers 
> >> instead of saving it into somewhere...
> >>
> >> Thank you,
> > 
> > 
> > Hmm, save_regs() describes what it does: you pass
> > a pt_regs and it saves registers inside. But it
> > has also a kind of fixup thing as it also rewinds.
> 
> Ah, I see. so this saves current register values
> into pt_regs. :)
> 
> > 
> > I'm not sure using a fixup thing for the naming
> > is correct as we are not starting with initial
> > regs passed to the function (just a raw buffer).
> > 
> > What about perf_save_caller_regs() ?
> > 
> 
> Hmm, I think, it might be better perf_get_caller_regs()
> or something like that (fetch ?). 


perf_fetch_caller_regs() looks fine. I'll update my
patch accordingly, thanks.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH 1/2] perf: Introduce new perf_fetch_caller_regs() for hot regs snapshot
  2010-03-05 17:08           ` Masami Hiramatsu
  2010-03-05 17:17             ` Frederic Weisbecker
@ 2010-03-05 20:55               ` Frederic Weisbecker
  2010-03-05 20:55               ` Frederic Weisbecker
  2 siblings, 0 replies; 51+ messages in thread
From: Frederic Weisbecker @ 2010-03-05 20:55 UTC (permalink / raw)
  To: LKML
  Cc: LKML, Frederic Weisbecker, Peter Zijlstra, Paul Mackerras,
	Steven Rostedt, Masami Hiramatsu, Jason Baron, Thomas Gleixner,
	H . Peter Anvin, Arnaldo Carvalho de Melo, Archs, Ingo Molnar

Events that trigger overflows by interrupting a context can
use get_irq_regs() or task_pt_regs() to retrieve the state
at the time the event triggered. But this is not the case for
other classes of events, such as trace events, as tracepoints
are executed in the same context as the code that triggered
the event.

It means we need a different API to capture the regs there;
namely, we need a hot snapshot to get the most important
information for perf: the instruction pointer to get the
event origin, the frame pointer for the callchain, the code
segment for user_mode() tests (we always use __KERNEL_CS, as
trace events always occur from the kernel) and the eflags
for further purposes.

v2: rename perf_save_regs to perf_fetch_caller_regs as per
Masami's suggestion.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Archs <linux-arch@vger.kernel.org>
---
 arch/x86/kernel/cpu/perf_event.c |   12 ++++++++++
 arch/x86/kernel/dumpstack.h      |   15 +++++++++++++
 include/linux/perf_event.h       |   42 +++++++++++++++++++++++++++++++++++++-
 kernel/perf_event.c              |    5 ++++
 4 files changed, 73 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 641ccb9..4c1f430 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1662,6 +1662,18 @@ struct perf_callchain_entry *perf_callchain(struct pt_regs *regs)
 	return entry;
 }
 
+void perf_arch_fetch_caller_regs(struct pt_regs *regs, unsigned long ip, int skip)
+{
+	regs->ip = ip;
+	/*
+	 * perf_arch_fetch_caller_regs adds another call, we need to increment
+	 * the skip level
+	 */
+	regs->bp = rewind_frame_pointer(skip + 1);
+	regs->cs = __KERNEL_CS;
+	local_save_flags(regs->flags);
+}
+
 void hw_perf_event_setup_online(int cpu)
 {
 	init_debug_store_on_cpu(cpu);
diff --git a/arch/x86/kernel/dumpstack.h b/arch/x86/kernel/dumpstack.h
index 4fd1420..29e5f7c 100644
--- a/arch/x86/kernel/dumpstack.h
+++ b/arch/x86/kernel/dumpstack.h
@@ -29,4 +29,19 @@ struct stack_frame {
 	struct stack_frame *next_frame;
 	unsigned long return_address;
 };
+
+static inline unsigned long rewind_frame_pointer(int n)
+{
+	struct stack_frame *frame;
+
+	get_bp(frame);
+
+#ifdef CONFIG_FRAME_POINTER
+	while (n--)
+		frame = frame->next_frame;
 #endif
+
+	return (unsigned long)frame;
+}
+
+#endif /* DUMPSTACK_H */
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 04f06b4..c859bee 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -452,6 +452,7 @@ enum perf_callchain_context {
 #include <linux/fs.h>
 #include <linux/pid_namespace.h>
 #include <linux/workqueue.h>
+#include <linux/ftrace.h>
 #include <asm/atomic.h>
 
 #define PERF_MAX_STACK_DEPTH		255
@@ -840,6 +841,44 @@ perf_sw_event(u32 event_id, u64 nr, int nmi, struct pt_regs *regs, u64 addr)
 		__perf_sw_event(event_id, nr, nmi, regs, addr);
 }
 
+extern void
+perf_arch_fetch_caller_regs(struct pt_regs *regs, unsigned long ip, int skip);
+
+/*
+ * Take a snapshot of the regs. Skip ip and frame pointer to
+ * the nth caller. We only need a few of the regs:
+ * - ip for PERF_SAMPLE_IP
+ * - cs for user_mode() tests
+ * - bp for callchains
+ * - eflags, for future purposes, just in case
+ */
+static inline void perf_fetch_caller_regs(struct pt_regs *regs, int skip)
+{
+	unsigned long ip;
+
+	memset(regs, 0, sizeof(*regs));
+
+	switch (skip) {
+	case 1 :
+		ip = CALLER_ADDR0;
+		break;
+	case 2 :
+		ip = CALLER_ADDR1;
+		break;
+	case 3 :
+		ip = CALLER_ADDR2;
+		break;
+	case 4:
+		ip = CALLER_ADDR3;
+		break;
+	/* No need to support further for now */
+	default:
+		ip = 0;
+	}
+
+	return perf_arch_fetch_caller_regs(regs, ip, skip);
+}
+
 extern void __perf_event_mmap(struct vm_area_struct *vma);
 
 static inline void perf_event_mmap(struct vm_area_struct *vma)
@@ -858,7 +897,8 @@ extern int sysctl_perf_event_mlock;
 extern int sysctl_perf_event_sample_rate;
 
 extern void perf_event_init(void);
-extern void perf_tp_event(int event_id, u64 addr, u64 count, void *record, int entry_size);
+extern void perf_tp_event(int event_id, u64 addr, u64 count, void *record,
+			  int entry_size, struct pt_regs *regs);
 extern void perf_bp_event(struct perf_event *event, void *data);
 
 #ifndef perf_misc_flags
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index a661e79..1c3059b 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -2806,6 +2806,11 @@ __weak struct perf_callchain_entry *perf_callchain(struct pt_regs *regs)
 	return NULL;
 }
 
+__weak
+void perf_arch_fetch_caller_regs(struct pt_regs *regs, unsigned long ip, int skip)
+{
+}
+
 /*
  * Output
  */
-- 
1.6.2.3


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 2/2] perf: Take a hot regs snapshot for trace events
  2010-03-05 17:08           ` Masami Hiramatsu
  2010-03-05 17:17             ` Frederic Weisbecker
@ 2010-03-05 20:55               ` Frederic Weisbecker
  2010-03-05 20:55               ` Frederic Weisbecker
  2 siblings, 0 replies; 51+ messages in thread
From: Frederic Weisbecker @ 2010-03-05 20:55 UTC (permalink / raw)
  To: LKML
  Cc: LKML, Frederic Weisbecker, Peter Zijlstra, Paul Mackerras,
	Steven Rostedt, Masami Hiramatsu, Jason Baron, Thomas Gleixner,
	H . Peter Anvin, Arnaldo Carvalho de Melo, Archs, Ingo Molnar

We are taking a wrong regs snapshot when a trace event triggers.
Either we use get_irq_regs(), which gives us the interrupted
registers if we are in an interrupt, or we use task_pt_regs(),
which gives us the state before we entered the kernel, assuming
we are lucky enough not to be a kernel thread; otherwise
task_pt_regs() returns the initial set of regs from when the
kernel thread was started.

What we want is different. We need a hot snapshot of the regs,
so that we can get the instruction pointer to record in the
sample, the frame pointer for the callchain, and some other
things.

Let's use the new perf_fetch_caller_regs() for that.

Comparison with perf record -e lock: -R -a -f -g
Before:

        perf  [kernel]                   [k] __do_softirq
               |
               --- __do_softirq
                  |
                  |--55.16%-- __open
                  |
                   --44.84%-- __write_nocancel

After:

            perf  [kernel]           [k] perf_tp_event
               |
               --- perf_tp_event
                  |
                  |--41.07%-- lock_acquire
                  |          |
                  |          |--39.36%-- _raw_spin_lock
                  |          |          |
                  |          |          |--7.81%-- hrtimer_interrupt
                  |          |          |          smp_apic_timer_interrupt
                  |          |          |          apic_timer_interrupt

The old case was producing unreliable callchains. Now that we
have the right frame and instruction pointers, we get the
trace we want.

Also, syscall and kprobe events already have the right regs;
let's use them instead of wasting a retrieval.

v2: Follow the rename perf_save_regs() -> perf_fetch_caller_regs()

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Archs <linux-arch@vger.kernel.org>
---
 include/linux/ftrace_event.h       |    7 +++++--
 include/trace/ftrace.h             |    6 +++++-
 kernel/perf_event.c                |    9 ++-------
 kernel/trace/trace_event_profile.c |    3 ++-
 kernel/trace/trace_kprobe.c        |    5 +++--
 kernel/trace/trace_syscalls.c      |    4 ++--
 6 files changed, 19 insertions(+), 15 deletions(-)

diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
index cd95919..48b367e 100644
--- a/include/linux/ftrace_event.h
+++ b/include/linux/ftrace_event.h
@@ -188,6 +188,9 @@ do {									\
 
 #ifdef CONFIG_PERF_EVENTS
 struct perf_event;
+
+DECLARE_PER_CPU(struct pt_regs, perf_trace_regs);
+
 extern int ftrace_profile_enable(int event_id);
 extern void ftrace_profile_disable(int event_id);
 extern int ftrace_profile_set_filter(struct perf_event *event, int event_id,
@@ -199,11 +202,11 @@ ftrace_perf_buf_prepare(int size, unsigned short type, int *rctxp,
 
 static inline void
 ftrace_perf_buf_submit(void *raw_data, int size, int rctx, u64 addr,
-		       u64 count, unsigned long irq_flags)
+		       u64 count, unsigned long irq_flags, struct pt_regs *regs)
 {
 	struct trace_entry *entry = raw_data;
 
-	perf_tp_event(entry->type, addr, count, raw_data, size);
+	perf_tp_event(entry->type, addr, count, raw_data, size, regs);
 	perf_swevent_put_recursion_context(rctx);
 	local_irq_restore(irq_flags);
 }
diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h
index f2c09e4..1f7677d 100644
--- a/include/trace/ftrace.h
+++ b/include/trace/ftrace.h
@@ -853,6 +853,7 @@ ftrace_profile_templ_##call(struct ftrace_event_call *event_call,	\
 	struct ftrace_raw_##call *entry;				\
 	u64 __addr = 0, __count = 1;					\
 	unsigned long irq_flags;					\
+	struct pt_regs *__regs;						\
 	int __entry_size;						\
 	int __data_size;						\
 	int rctx;							\
@@ -873,8 +874,11 @@ ftrace_profile_templ_##call(struct ftrace_event_call *event_call,	\
 									\
 	{ assign; }							\
 									\
+	__regs = &__get_cpu_var(perf_trace_regs);			\
+	perf_fetch_caller_regs(__regs, 2);				\
+									\
 	ftrace_perf_buf_submit(entry, __entry_size, rctx, __addr,	\
-			       __count, irq_flags);			\
+			       __count, irq_flags, __regs);		\
 }
 
 #undef DEFINE_EVENT
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 1c3059b..aa3b98c 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -4340,7 +4340,7 @@ static const struct pmu perf_ops_task_clock = {
 #ifdef CONFIG_EVENT_TRACING
 
 void perf_tp_event(int event_id, u64 addr, u64 count, void *record,
-			  int entry_size)
+		   int entry_size, struct pt_regs *regs)
 {
 	struct perf_raw_record raw = {
 		.size = entry_size,
@@ -4352,14 +4352,9 @@ void perf_tp_event(int event_id, u64 addr, u64 count, void *record,
 		.raw = &raw,
 	};
 
-	struct pt_regs *regs = get_irq_regs();
-
-	if (!regs)
-		regs = task_pt_regs(current);
-
 	/* Trace events already protected against recursion */
 	do_perf_sw_event(PERF_TYPE_TRACEPOINT, event_id, count, 1,
-				&data, regs);
+			 &data, regs);
 }
 EXPORT_SYMBOL_GPL(perf_tp_event);
 
diff --git a/kernel/trace/trace_event_profile.c b/kernel/trace/trace_event_profile.c
index f0d6930..e66d21e 100644
--- a/kernel/trace/trace_event_profile.c
+++ b/kernel/trace/trace_event_profile.c
@@ -2,13 +2,14 @@
  * trace event based perf counter profiling
  *
  * Copyright (C) 2009 Red Hat Inc, Peter Zijlstra <pzijlstr@redhat.com>
- *
+ * Copyright (C) 2009-2010 Frederic Weisbecker <fweisbec@gmail.com>
  */
 
 #include <linux/module.h>
 #include <linux/kprobes.h>
 #include "trace.h"
 
+DEFINE_PER_CPU(struct pt_regs, perf_trace_regs);
 
 static char *perf_trace_buf;
 static char *perf_trace_buf_nmi;
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 356c102..73e0526 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -1260,7 +1260,7 @@ static __kprobes void kprobe_profile_func(struct kprobe *kp,
 	for (i = 0; i < tp->nr_args; i++)
 		entry->args[i] = call_fetch(&tp->args[i].fetch, regs);
 
-	ftrace_perf_buf_submit(entry, size, rctx, entry->ip, 1, irq_flags);
+	ftrace_perf_buf_submit(entry, size, rctx, entry->ip, 1, irq_flags, regs);
 }
 
 /* Kretprobe profile handler */
@@ -1291,7 +1291,8 @@ static __kprobes void kretprobe_profile_func(struct kretprobe_instance *ri,
 	for (i = 0; i < tp->nr_args; i++)
 		entry->args[i] = call_fetch(&tp->args[i].fetch, regs);
 
-	ftrace_perf_buf_submit(entry, size, rctx, entry->ret_ip, 1, irq_flags);
+	ftrace_perf_buf_submit(entry, size, rctx, entry->ret_ip, 1,
+			       irq_flags, regs);
 }
 
 static int probe_profile_enable(struct ftrace_event_call *call)
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index 4e332b9..d0f34ab 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -462,7 +462,7 @@ static void prof_syscall_enter(struct pt_regs *regs, long id)
 	rec->nr = syscall_nr;
 	syscall_get_arguments(current, regs, 0, sys_data->nb_args,
 			       (unsigned long *)&rec->args);
-	ftrace_perf_buf_submit(rec, size, rctx, 0, 1, flags);
+	ftrace_perf_buf_submit(rec, size, rctx, 0, 1, flags, regs);
 }
 
 int prof_sysenter_enable(struct ftrace_event_call *call)
@@ -537,7 +537,7 @@ static void prof_syscall_exit(struct pt_regs *regs, long ret)
 	rec->nr = syscall_nr;
 	rec->ret = syscall_get_return_value(current, regs);
 
-	ftrace_perf_buf_submit(rec, size, rctx, 0, 1, flags);
+	ftrace_perf_buf_submit(rec, size, rctx, 0, 1, flags, regs);
 }
 
 int prof_sysexit_enable(struct ftrace_event_call *call)
-- 
1.6.2.3


^ permalink raw reply related	[flat|nested] 51+ messages in thread

 		.raw = &raw,
 	};
 
-	struct pt_regs *regs = get_irq_regs();
-
-	if (!regs)
-		regs = task_pt_regs(current);
-
 	/* Trace events already protected against recursion */
 	do_perf_sw_event(PERF_TYPE_TRACEPOINT, event_id, count, 1,
-				&data, regs);
+			 &data, regs);
 }
 EXPORT_SYMBOL_GPL(perf_tp_event);
 
diff --git a/kernel/trace/trace_event_profile.c b/kernel/trace/trace_event_profile.c
index f0d6930..e66d21e 100644
--- a/kernel/trace/trace_event_profile.c
+++ b/kernel/trace/trace_event_profile.c
@@ -2,13 +2,14 @@
  * trace event based perf counter profiling
  *
  * Copyright (C) 2009 Red Hat Inc, Peter Zijlstra <pzijlstr@redhat.com>
- *
+ * Copyright (C) 2009-2010 Frederic Weisbecker <fweisbec@gmail.com>
  */
 
 #include <linux/module.h>
 #include <linux/kprobes.h>
 #include "trace.h"
 
+DEFINE_PER_CPU(struct pt_regs, perf_trace_regs);
 
 static char *perf_trace_buf;
 static char *perf_trace_buf_nmi;
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 356c102..73e0526 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -1260,7 +1260,7 @@ static __kprobes void kprobe_profile_func(struct kprobe *kp,
 	for (i = 0; i < tp->nr_args; i++)
 		entry->args[i] = call_fetch(&tp->args[i].fetch, regs);
 
-	ftrace_perf_buf_submit(entry, size, rctx, entry->ip, 1, irq_flags);
+	ftrace_perf_buf_submit(entry, size, rctx, entry->ip, 1, irq_flags, regs);
 }
 
 /* Kretprobe profile handler */
@@ -1291,7 +1291,8 @@ static __kprobes void kretprobe_profile_func(struct kretprobe_instance *ri,
 	for (i = 0; i < tp->nr_args; i++)
 		entry->args[i] = call_fetch(&tp->args[i].fetch, regs);
 
-	ftrace_perf_buf_submit(entry, size, rctx, entry->ret_ip, 1, irq_flags);
+	ftrace_perf_buf_submit(entry, size, rctx, entry->ret_ip, 1,
+			       irq_flags, regs);
 }
 
 static int probe_profile_enable(struct ftrace_event_call *call)
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index 4e332b9..d0f34ab 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -462,7 +462,7 @@ static void prof_syscall_enter(struct pt_regs *regs, long id)
 	rec->nr = syscall_nr;
 	syscall_get_arguments(current, regs, 0, sys_data->nb_args,
 			       (unsigned long *)&rec->args);
-	ftrace_perf_buf_submit(rec, size, rctx, 0, 1, flags);
+	ftrace_perf_buf_submit(rec, size, rctx, 0, 1, flags, regs);
 }
 
 int prof_sysenter_enable(struct ftrace_event_call *call)
@@ -537,7 +537,7 @@ static void prof_syscall_exit(struct pt_regs *regs, long ret)
 	rec->nr = syscall_nr;
 	rec->ret = syscall_get_return_value(current, regs);
 
-	ftrace_perf_buf_submit(rec, size, rctx, 0, 1, flags);
+	ftrace_perf_buf_submit(rec, size, rctx, 0, 1, flags, regs);
 }
 
 int prof_sysexit_enable(struct ftrace_event_call *call)
-- 
1.6.2.3

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 2/2] perf: Take a hot regs snapshot for trace events
@ 2010-03-05 20:55               ` Frederic Weisbecker
  0 siblings, 0 replies; 51+ messages in thread
From: Frederic Weisbecker @ 2010-03-05 20:55 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Paul Mackerras,
	Steven Rostedt, Masami Hiramatsu, Jason Baron, Thomas Gleixner,
	H . Peter Anvin, Arnaldo Carvalho de Melo, Archs, Ingo Molnar

We are taking the wrong regs snapshot when a trace event triggers.
Either we use get_irq_regs(), which gives us the interrupted
registers if we are in an interrupt, or we use task_pt_regs(),
which gives us the state before we entered the kernel, assuming we
are lucky enough not to be a kernel thread; for a kernel thread,
task_pt_regs() returns the initial set of regs from when the thread
was started.

What we want is different. We need a hot snapshot of the regs,
so that we can get the instruction pointer to record in the
sample, the frame pointer for the callchain, and some other
things.

Let's use the new perf_fetch_caller_regs() for that.

Comparison with perf record -e lock: -R -a -f -g
Before:

        perf  [kernel]                   [k] __do_softirq
               |
               --- __do_softirq
                  |
                  |--55.16%-- __open
                  |
                   --44.84%-- __write_nocancel

After:

            perf  [kernel]           [k] perf_tp_event
               |
               --- perf_tp_event
                  |
                  |--41.07%-- lock_acquire
                  |          |
                  |          |--39.36%-- _raw_spin_lock
                  |          |          |
                  |          |          |--7.81%-- hrtimer_interrupt
                  |          |          |          smp_apic_timer_interrupt
                  |          |          |          apic_timer_interrupt

The old case was producing unreliable callchains. Now that we have
the right frame and instruction pointers, we get the trace we want.

Also, syscalls and kprobe events already have the right regs; let's
use them instead of wasting a redundant retrieval.

v2: Follow the rename perf_save_regs() -> perf_fetch_caller_regs()

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Archs <linux-arch@vger.kernel.org>
---
 include/linux/ftrace_event.h       |    7 +++++--
 include/trace/ftrace.h             |    6 +++++-
 kernel/perf_event.c                |    9 ++-------
 kernel/trace/trace_event_profile.c |    3 ++-
 kernel/trace/trace_kprobe.c        |    5 +++--
 kernel/trace/trace_syscalls.c      |    4 ++--
 6 files changed, 19 insertions(+), 15 deletions(-)

diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
index cd95919..48b367e 100644
--- a/include/linux/ftrace_event.h
+++ b/include/linux/ftrace_event.h
@@ -188,6 +188,9 @@ do {									\
 
 #ifdef CONFIG_PERF_EVENTS
 struct perf_event;
+
+DECLARE_PER_CPU(struct pt_regs, perf_trace_regs);
+
 extern int ftrace_profile_enable(int event_id);
 extern void ftrace_profile_disable(int event_id);
 extern int ftrace_profile_set_filter(struct perf_event *event, int event_id,
@@ -199,11 +202,11 @@ ftrace_perf_buf_prepare(int size, unsigned short type, int *rctxp,
 
 static inline void
 ftrace_perf_buf_submit(void *raw_data, int size, int rctx, u64 addr,
-		       u64 count, unsigned long irq_flags)
+		       u64 count, unsigned long irq_flags, struct pt_regs *regs)
 {
 	struct trace_entry *entry = raw_data;
 
-	perf_tp_event(entry->type, addr, count, raw_data, size);
+	perf_tp_event(entry->type, addr, count, raw_data, size, regs);
 	perf_swevent_put_recursion_context(rctx);
 	local_irq_restore(irq_flags);
 }
diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h
index f2c09e4..1f7677d 100644
--- a/include/trace/ftrace.h
+++ b/include/trace/ftrace.h
@@ -853,6 +853,7 @@ ftrace_profile_templ_##call(struct ftrace_event_call *event_call,	\
 	struct ftrace_raw_##call *entry;				\
 	u64 __addr = 0, __count = 1;					\
 	unsigned long irq_flags;					\
+	struct pt_regs *__regs;						\
 	int __entry_size;						\
 	int __data_size;						\
 	int rctx;							\
@@ -873,8 +874,11 @@ ftrace_profile_templ_##call(struct ftrace_event_call *event_call,	\
 									\
 	{ assign; }							\
 									\
+	__regs = &__get_cpu_var(perf_trace_regs);			\
+	perf_fetch_caller_regs(__regs, 2);				\
+									\
 	ftrace_perf_buf_submit(entry, __entry_size, rctx, __addr,	\
-			       __count, irq_flags);			\
+			       __count, irq_flags, __regs);		\
 }
 
 #undef DEFINE_EVENT
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 1c3059b..aa3b98c 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -4340,7 +4340,7 @@ static const struct pmu perf_ops_task_clock = {
 #ifdef CONFIG_EVENT_TRACING
 
 void perf_tp_event(int event_id, u64 addr, u64 count, void *record,
-			  int entry_size)
+		   int entry_size, struct pt_regs *regs)
 {
 	struct perf_raw_record raw = {
 		.size = entry_size,
@@ -4352,14 +4352,9 @@ void perf_tp_event(int event_id, u64 addr, u64 count, void *record,
 		.raw = &raw,
 	};
 
-	struct pt_regs *regs = get_irq_regs();
-
-	if (!regs)
-		regs = task_pt_regs(current);
-
 	/* Trace events already protected against recursion */
 	do_perf_sw_event(PERF_TYPE_TRACEPOINT, event_id, count, 1,
-				&data, regs);
+			 &data, regs);
 }
 EXPORT_SYMBOL_GPL(perf_tp_event);
 
diff --git a/kernel/trace/trace_event_profile.c b/kernel/trace/trace_event_profile.c
index f0d6930..e66d21e 100644
--- a/kernel/trace/trace_event_profile.c
+++ b/kernel/trace/trace_event_profile.c
@@ -2,13 +2,14 @@
  * trace event based perf counter profiling
  *
  * Copyright (C) 2009 Red Hat Inc, Peter Zijlstra <pzijlstr@redhat.com>
- *
+ * Copyright (C) 2009-2010 Frederic Weisbecker <fweisbec@gmail.com>
  */
 
 #include <linux/module.h>
 #include <linux/kprobes.h>
 #include "trace.h"
 
+DEFINE_PER_CPU(struct pt_regs, perf_trace_regs);
 
 static char *perf_trace_buf;
 static char *perf_trace_buf_nmi;
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 356c102..73e0526 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -1260,7 +1260,7 @@ static __kprobes void kprobe_profile_func(struct kprobe *kp,
 	for (i = 0; i < tp->nr_args; i++)
 		entry->args[i] = call_fetch(&tp->args[i].fetch, regs);
 
-	ftrace_perf_buf_submit(entry, size, rctx, entry->ip, 1, irq_flags);
+	ftrace_perf_buf_submit(entry, size, rctx, entry->ip, 1, irq_flags, regs);
 }
 
 /* Kretprobe profile handler */
@@ -1291,7 +1291,8 @@ static __kprobes void kretprobe_profile_func(struct kretprobe_instance *ri,
 	for (i = 0; i < tp->nr_args; i++)
 		entry->args[i] = call_fetch(&tp->args[i].fetch, regs);
 
-	ftrace_perf_buf_submit(entry, size, rctx, entry->ret_ip, 1, irq_flags);
+	ftrace_perf_buf_submit(entry, size, rctx, entry->ret_ip, 1,
+			       irq_flags, regs);
 }
 
 static int probe_profile_enable(struct ftrace_event_call *call)
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index 4e332b9..d0f34ab 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -462,7 +462,7 @@ static void prof_syscall_enter(struct pt_regs *regs, long id)
 	rec->nr = syscall_nr;
 	syscall_get_arguments(current, regs, 0, sys_data->nb_args,
 			       (unsigned long *)&rec->args);
-	ftrace_perf_buf_submit(rec, size, rctx, 0, 1, flags);
+	ftrace_perf_buf_submit(rec, size, rctx, 0, 1, flags, regs);
 }
 
 int prof_sysenter_enable(struct ftrace_event_call *call)
@@ -537,7 +537,7 @@ static void prof_syscall_exit(struct pt_regs *regs, long ret)
 	rec->nr = syscall_nr;
 	rec->ret = syscall_get_return_value(current, regs);
 
-	ftrace_perf_buf_submit(rec, size, rctx, 0, 1, flags);
+	ftrace_perf_buf_submit(rec, size, rctx, 0, 1, flags, regs);
 }
 
 int prof_sysexit_enable(struct ftrace_event_call *call)
-- 
1.6.2.3


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [PATCH 1/3] lockdep: Move lock events under lockdep recursion protection
  2010-03-03  6:55 ` [PATCH 1/3] lockdep: Move lock events under lockdep recursion protection Frederic Weisbecker
@ 2010-03-09  7:18   ` Hitoshi Mitake
  2010-03-10  0:10     ` Frederic Weisbecker
  2010-03-09  8:34   ` Jens Axboe
  1 sibling, 1 reply; 51+ messages in thread
From: Hitoshi Mitake @ 2010-03-09  7:18 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Ingo Molnar, LKML, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Steven Rostedt, Paul Mackerras, Li Zefan, Lai Jiangshan,
	Masami Hiramatsu, Jens Axboe

On 03/03/10 15:55, Frederic Weisbecker wrote:
> There are rcu locked read side areas in the path where we submit
> a trace event. And these rcu_read_(un)lock() trigger lock events,
> which create recursive events.
>
> One pair in do_perf_sw_event:
>
> __lock_acquire
>        |
>        |--96.11%-- lock_acquire
>        |          |
>        |          |--27.21%-- do_perf_sw_event
>        |          |          perf_tp_event
>        |          |          |
>        |          |          |--49.62%-- ftrace_profile_lock_release
>        |          |          |          lock_release
>        |          |          |          |
>        |          |          |          |--33.85%-- _raw_spin_unlock
>
> Another pair in perf_output_begin/end:
>
> __lock_acquire
>        |--23.40%-- perf_output_begin
>        |          |          __perf_event_overflow
>        |          |          perf_swevent_overflow
>        |          |          perf_swevent_add
>        |          |          perf_swevent_ctx_event
>        |          |          do_perf_sw_event
>        |          |          perf_tp_event
>        |          |          |
>        |          |          |--55.37%-- ftrace_profile_lock_acquire
>        |          |          |          lock_acquire
>        |          |          |          |
>        |          |          |          |--37.31%-- _raw_spin_lock
>
> The problem is not that much the trace recursion itself, as we have a
> recursion protection already (though it's always wasteful to recurse).
> But the trace events are outside the lockdep recursion protection, then
> each lockdep event triggers a lock trace, which will trigger two
> other lockdep events. Here the recursive lock trace event won't
> be taken because of the trace recursion, so the recursion stops there
> but lockdep will still analyse these new events:
>
> To sum up, for each lockdep events we have:
>
> 	lock_*()
> 	     |
>               trace lock_acquire
>                    |
>                    ----- rcu_read_lock()
>                    |          |
>                    |          lock_acquire()
>                    |          |
>                    |          trace_lock_acquire() (stopped)
>                    |          |
> 		  |          lockdep analyze
>                    |
>                    ----- rcu_read_unlock()
>                               |
>                               lock_release
>                               |
>                               trace_lock_release() (stopped)
>                               |
>                               lockdep analyze
>
> And you can repeat the above two times as we have two rcu read side
> sections when we submit an event.
>
> This is fixed in this patch by moving the lock trace event under
> the lockdep recursion protection.

Thanks a lot, Frederic!

I tested perf lock with your patch; the results look like this:

Typical scores:

before:
% sudo ./perf lock record ./perf bench sched messaging
# Running sched/messaging benchmark...
# 20 sender and receiver processes per group
# 10 groups == 400 processes run

      Total time: 3.265 [sec]
[ perf record: Woken up 0 times to write data ]
[ perf record: Captured and wrote 143.952 MB perf.data (~6289344 samples) ]

after:
% sudo ./perf lock record ./perf bench sched messaging
# Running sched/messaging benchmark...
# 20 sender and receiver processes per group
# 10 groups == 400 processes run

      Total time: 1.943 [sec]                       <--- about 1.7x faster!
[ perf record: Woken up 0 times to write data ]
[ perf record: Captured and wrote 98.161 MB perf.data (~4288734 samples) ]   <--- size of perf.data is also reduced


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 1/3] lockdep: Move lock events under lockdep recursion protection
  2010-03-03  6:55 ` [PATCH 1/3] lockdep: Move lock events under lockdep recursion protection Frederic Weisbecker
  2010-03-09  7:18   ` Hitoshi Mitake
@ 2010-03-09  8:34   ` Jens Axboe
  2010-03-09  8:35     ` Jens Axboe
  2010-03-09 23:45     ` Frederic Weisbecker
  1 sibling, 2 replies; 51+ messages in thread
From: Jens Axboe @ 2010-03-09  8:34 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Ingo Molnar, LKML, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Steven Rostedt, Paul Mackerras, Hitoshi Mitake, Li Zefan,
	Lai Jiangshan, Masami Hiramatsu

On Wed, Mar 03 2010, Frederic Weisbecker wrote:
> There are rcu locked read side areas in the path where we submit
> a trace event. And these rcu_read_(un)lock() trigger lock events,
> which create recursive events.
> 
> One pair in do_perf_sw_event:
> 
> __lock_acquire
>       |
>       |--96.11%-- lock_acquire
>       |          |
>       |          |--27.21%-- do_perf_sw_event
>       |          |          perf_tp_event
>       |          |          |
>       |          |          |--49.62%-- ftrace_profile_lock_release
>       |          |          |          lock_release
>       |          |          |          |
>       |          |          |          |--33.85%-- _raw_spin_unlock
> 
> Another pair in perf_output_begin/end:
> 
> __lock_acquire
>       |--23.40%-- perf_output_begin
>       |          |          __perf_event_overflow
>       |          |          perf_swevent_overflow
>       |          |          perf_swevent_add
>       |          |          perf_swevent_ctx_event
>       |          |          do_perf_sw_event
>       |          |          perf_tp_event
>       |          |          |
>       |          |          |--55.37%-- ftrace_profile_lock_acquire
>       |          |          |          lock_acquire
>       |          |          |          |
>       |          |          |          |--37.31%-- _raw_spin_lock
> 
> The problem is not that much the trace recursion itself, as we have a
> recursion protection already (though it's always wasteful to recurse).
> But the trace events are outside the lockdep recursion protection, then
> each lockdep event triggers a lock trace, which will trigger two
> other lockdep events. Here the recursive lock trace event won't
> be taken because of the trace recursion, so the recursion stops there
> but lockdep will still analyse these new events:
> 
> To sum up, for each lockdep events we have:
> 
> 	lock_*()
> 	     |
>              trace lock_acquire
>                   |
>                   ----- rcu_read_lock()
>                   |          |
>                   |          lock_acquire()
>                   |          |
>                   |          trace_lock_acquire() (stopped)
>                   |          |
> 		  |          lockdep analyze
>                   |
>                   ----- rcu_read_unlock()
>                              |
>                              lock_release
>                              |
>                              trace_lock_release() (stopped)
>                              |
>                              lockdep analyze
> 
> And you can repeat the above two times as we have two rcu read side
> sections when we submit an event.
> 
> This is fixed in this patch by moving the lock trace event under
> the lockdep recursion protection.

I went to try this on 2.6.34-rc1 to see how much it would improve things
here. With 2.6.34-rc1, a

$ time sudo perf lock record ls /dev/shm

essentially hangs the box; it ends up crashing hard (not just
livelocked like before). With the patch in place, it does eventually
finish:

real    0m21.301s
user    0m0.030s
sys     0m21.040s

The directory is empty.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 1/3] lockdep: Move lock events under lockdep recursion protection
  2010-03-09  8:34   ` Jens Axboe
@ 2010-03-09  8:35     ` Jens Axboe
  2010-03-10  0:05       ` Frederic Weisbecker
  2010-03-09 23:45     ` Frederic Weisbecker
  1 sibling, 1 reply; 51+ messages in thread
From: Jens Axboe @ 2010-03-09  8:35 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Ingo Molnar, LKML, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Steven Rostedt, Paul Mackerras, Hitoshi Mitake, Li Zefan,
	Lai Jiangshan, Masami Hiramatsu

On Tue, Mar 09 2010, Jens Axboe wrote:
> On Wed, Mar 03 2010, Frederic Weisbecker wrote:
> > There are rcu locked read side areas in the path where we submit
> > a trace event. And these rcu_read_(un)lock() trigger lock events,
> > which create recursive events.
> > 
> > One pair in do_perf_sw_event:
> > 
> > __lock_acquire
> >       |
> >       |--96.11%-- lock_acquire
> >       |          |
> >       |          |--27.21%-- do_perf_sw_event
> >       |          |          perf_tp_event
> >       |          |          |
> >       |          |          |--49.62%-- ftrace_profile_lock_release
> >       |          |          |          lock_release
> >       |          |          |          |
> >       |          |          |          |--33.85%-- _raw_spin_unlock
> > 
> > Another pair in perf_output_begin/end:
> > 
> > __lock_acquire
> >       |--23.40%-- perf_output_begin
> >       |          |          __perf_event_overflow
> >       |          |          perf_swevent_overflow
> >       |          |          perf_swevent_add
> >       |          |          perf_swevent_ctx_event
> >       |          |          do_perf_sw_event
> >       |          |          perf_tp_event
> >       |          |          |
> >       |          |          |--55.37%-- ftrace_profile_lock_acquire
> >       |          |          |          lock_acquire
> >       |          |          |          |
> >       |          |          |          |--37.31%-- _raw_spin_lock
> > 
> > The problem is not that much the trace recursion itself, as we have a
> > recursion protection already (though it's always wasteful to recurse).
> > But the trace events are outside the lockdep recursion protection, then
> > each lockdep event triggers a lock trace, which will trigger two
> > other lockdep events. Here the recursive lock trace event won't
> > be taken because of the trace recursion, so the recursion stops there
> > but lockdep will still analyse these new events:
> > 
> > To sum up, for each lockdep events we have:
> > 
> > 	lock_*()
> > 	     |
> >              trace lock_acquire
> >                   |
> >                   ----- rcu_read_lock()
> >                   |          |
> >                   |          lock_acquire()
> >                   |          |
> >                   |          trace_lock_acquire() (stopped)
> >                   |          |
> > 		  |          lockdep analyze
> >                   |
> >                   ----- rcu_read_unlock()
> >                              |
> >                              lock_release
> >                              |
> >                              trace_lock_release() (stopped)
> >                              |
> >                              lockdep analyze
> > 
> > And you can repeat the above two times as we have two rcu read side
> > sections when we submit an event.
> > 
> > This is fixed in this patch by moving the lock trace event under
> > the lockdep recursion protection.
> 
> I went to try this on 2.6.34-rc1 to see how much it would improve things
> here. With 2.6.34-rc1, a

Which, btw, throws a new lockdep warning fest:

[   42.247718] scsi7 : ioc0: LSISAS1068E B3, FwRev=011b0300h, Ports=1, MaxQ=483, IRQ=26
[   42.281125] BUG: key ffff880c7cb75250 not in .data!
[   42.288346] ------------[ cut here ]------------
[   42.294490] WARNING: at kernel/lockdep.c:2706 lockdep_init_map+0x545/0x5f0()
[   42.304013] Hardware name: QSSC-S4R
[   42.309674] Modules linked in: hid_apple usbhid ehci_hcd uhci_hcd usbcore nls_base mptsas(+) mptscsih mptbase scsi_transport_sas igb sg sr_mod cdrom
[   42.332072] Pid: 5197, comm: modprobe Not tainted 2.6.34-rc1 #176
[   42.340597] Call Trace:
[   42.345335]  [<ffffffff8107fb8d>] ? is_module_address+0x2d/0x60
[   42.353670]  [<ffffffff81074aa5>] ? lockdep_init_map+0x545/0x5f0
[   42.362154]  [<ffffffff81044648>] warn_slowpath_common+0x78/0xd0
[   42.371913]  [<ffffffff810446af>] warn_slowpath_null+0xf/0x20
[   42.380109]  [<ffffffff81074aa5>] lockdep_init_map+0x545/0x5f0
[   42.388402]  [<ffffffff81156636>] ? sysfs_new_dirent+0x76/0x120
[   42.396748]  [<ffffffff81155b6c>] sysfs_add_file_mode+0x6c/0xd0
[   42.405170]  [<ffffffff812dd260>] ? transport_add_class_device+0x0/0x50
[   42.414406]  [<ffffffff81155bdc>] sysfs_add_file+0xc/0x10
[   42.422183]  [<ffffffff81155ca1>] sysfs_create_file+0x21/0x40
[   42.430328]  [<ffffffff812d6574>] device_create_file+0x14/0x20
[   42.438584]  [<ffffffff812dcace>] attribute_container_add_attrs+0x8e/0xb0
[   42.447952]  [<ffffffff812dcb0d>] attribute_container_add_class_device+0x1d/0x30
[   42.457968]  [<ffffffff812dd27f>] transport_add_class_device+0x1f/0x50
[   42.467108]  [<ffffffff812dcc4f>] attribute_container_device_trigger+0x9f/0xe0
[   42.476978]  [<ffffffff812dd1e0>] transport_add_device+0x10/0x20
[   42.485415]  [<ffffffffa0049208>] sas_phy_add+0x28/0x40 [scsi_transport_sas]
[   42.495039]  [<ffffffffa0079e58>] mptsas_probe_one_phy+0x5f8/0x850 [mptsas]
[   42.504680]  [<ffffffff813fb469>] ? mutex_unlock+0x9/0x10
[   42.512429]  [<ffffffffa0079459>] ? mptsas_setup_wide_ports+0x269/0x380 [mptsas]
[   42.522481]  [<ffffffffa007a65d>] mptsas_probe_hba_phys+0x5ad/0xa10 [mptsas]
[   42.532154]  [<ffffffff8107401e>] ? lock_release_holdtime+0xfe/0x1f0
[   42.541071]  [<ffffffffa007af6d>] mptsas_scan_sas_topology+0x1d/0x340 [mptsas]
[   42.550920]  [<ffffffff812dd1c0>] ? transport_configure_device+0x10/0x20
[   42.560106]  [<ffffffff812ecd78>] ? scsi_sysfs_add_host+0x98/0xb0
[   42.568590]  [<ffffffffa007b693>] mptsas_probe+0x403/0x4b0 [mptsas]
[   42.577215]  [<ffffffff8123f2b2>] local_pci_probe+0x12/0x20
[   42.585257]  [<ffffffff8123f611>] pci_device_probe+0x111/0x120
[   42.593590]  [<ffffffff812d936b>] ? driver_sysfs_add+0x6b/0x90
[   42.601814]  [<ffffffff812d94a5>] driver_probe_device+0x95/0x1a0
[   42.610212]  [<ffffffff812d9643>] __driver_attach+0x93/0xa0
[   42.618261]  [<ffffffff812d95b0>] ? __driver_attach+0x0/0xa0
[   42.626320]  [<ffffffff812d8b5b>] bus_for_each_dev+0x6b/0xa0
[   42.634509]  [<ffffffff812d92fc>] driver_attach+0x1c/0x20
[   42.642204]  [<ffffffff812d8315>] bus_add_driver+0x1d5/0x320
[   42.650234]  [<ffffffffa0015000>] ? mptsas_init+0x0/0x114 [mptsas]
[   42.658889]  [<ffffffff812d994c>] driver_register+0x7c/0x170
[   42.667004]  [<ffffffffa0015000>] ? mptsas_init+0x0/0x114 [mptsas]
[   42.675673]  [<ffffffff8123f8ba>] __pci_register_driver+0x6a/0xf0
[   42.684215]  [<ffffffffa0015000>] ? mptsas_init+0x0/0x114 [mptsas]
[   42.692814]  [<ffffffffa00150f9>] mptsas_init+0xf9/0x114 [mptsas]
[   42.701336]  [<ffffffff810001d8>] do_one_initcall+0x38/0x190
[   42.709375]  [<ffffffff81083fe3>] sys_init_module+0xe3/0x260
[   42.717447]  [<ffffffff81002e2b>] system_call_fastpath+0x16/0x1b
[   42.726261] ---[ end trace 52cf32179b76155b ]---
[   42.731555] BUG: key ffff880c7cb75288 not in .data!
[   42.737130] BUG: key ffff880c7cb752c0 not in .data!
[   42.742817] BUG: key ffff880c7cb752f8 not in .data!
[   42.748476] BUG: key ffff880c7cb75330 not in .data!
[   42.754039] BUG: key ffff880c7cb75368 not in .data!
[   42.759634] BUG: key ffff880c7cb753a0 not in .data!
[   42.765192] BUG: key ffff880c7cb753d8 not in .data!
[   42.770833] BUG: key ffff880c7cb75410 not in .data!
[   42.776394] BUG: key ffff880c7cb75448 not in .data!
[   42.781967] BUG: key ffff880c7cb75480 not in .data!
[   42.787480] BUG: key ffff880c7cb754b8 not in .data!
[   42.793026] BUG: key ffff880c7cb754f0 not in .data!
[   42.798689] BUG: key ffff880c7cb75528 not in .data!
[   42.804235] BUG: key ffff880c7cb75560 not in .data!
[   42.809828] BUG: key ffff880c7cb75598 not in .data!
[   42.815401] BUG: key ffff880c7cb755d0 not in .data!
[   42.821078] BUG: key ffff880c7cb75250 not in .data!
[   42.826714] BUG: key ffff880c7cb75288 not in .data!
[   42.832401] BUG: key ffff880c7cb752c0 not in .data!
[   42.838064] BUG: key ffff880c7cb752f8 not in .data!
[   42.843619] BUG: key ffff880c7cb75330 not in .data!
[   42.849324] BUG: key ffff880c7cb75368 not in .data!
[   42.854900] BUG: key ffff880c7cb753a0 not in .data!
[   42.860519] BUG: key ffff880c7cb753d8 not in .data!
[   42.866135] BUG: key ffff880c7cb75410 not in .data!
[   42.871746] BUG: key ffff880c7cb75448 not in .data!
[   42.877257] BUG: key ffff880c7cb75480 not in .data!
[   42.882831] BUG: key ffff880c7cb754b8 not in .data!
[   42.888425] BUG: key ffff880c7cb754f0 not in .data!
[   42.893990] BUG: key ffff880c7cb75528 not in .data!
[   42.899668] BUG: key ffff880c7cb75560 not in .data!
[   42.905220] BUG: key ffff880c7cb75598 not in .data!
[   42.910881] BUG: key ffff880c7cb755d0 not in .data!
[   42.916753] BUG: key ffff880c7cb75250 not in .data!
[   42.922374] BUG: key ffff880c7cb75288 not in .data!
[   42.928045] BUG: key ffff880c7cb752c0 not in .data!
[   42.933596] BUG: key ffff880c7cb752f8 not in .data!
[   42.939245] BUG: key ffff880c7cb75330 not in .data!
[   42.944820] BUG: key ffff880c7cb75368 not in .data!
[   42.950467] BUG: key ffff880c7cb753a0 not in .data!
[   42.956052] BUG: key ffff880c7cb753d8 not in .data!
[   42.961694] BUG: key ffff880c7cb75410 not in .data!
[   42.967206] BUG: key ffff880c7cb75448 not in .data!
[   42.972741] BUG: key ffff880c7cb75480 not in .data!
[   42.978397] BUG: key ffff880c7cb754b8 not in .data!
[   42.983965] BUG: key ffff880c7cb754f0 not in .data!
[   42.989570] BUG: key ffff880c7cb75528 not in .data!
[   42.995116] BUG: key ffff880c7cb75560 not in .data!
[   43.000777] BUG: key ffff880c7cb75598 not in .data!
[   43.006418] BUG: key ffff880c7cb755d0 not in .data!
[ ... 85 further identical "BUG: key <addr> not in .data!" lines elided: the same block of 17 keys repeats five more times, timestamps 43.012110 through 43.483110 ... ]


-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 1/3] lockdep: Move lock events under lockdep recursion protection
  2010-03-09  8:34   ` Jens Axboe
  2010-03-09  8:35     ` Jens Axboe
@ 2010-03-09 23:45     ` Frederic Weisbecker
  2010-03-10 15:55       ` Jens Axboe
  1 sibling, 1 reply; 51+ messages in thread
From: Frederic Weisbecker @ 2010-03-09 23:45 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Ingo Molnar, LKML, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Steven Rostedt, Paul Mackerras, Hitoshi Mitake, Li Zefan,
	Lai Jiangshan, Masami Hiramatsu

On Tue, Mar 09, 2010 at 09:34:10AM +0100, Jens Axboe wrote:
> I went to try this on 2.6.34-rc1 to see how much it would improve things
> here. With 2.6.34-rc1, a
> 
> $ time sudo perf lock record ls /dev/shm
> 
> essentially hangs the box, it ends up crashing hard (not just live
> locked like before). With the patch in place, it does eventually finish:
> 
> real    0m21.301s
> user    0m0.030s
> sys     0m21.040s
> 
> The directory is empty.


Hehe :-)

That said, you are probably missing a part of the puzzle.
This patch avoids the scary recursions we had. But there
is another separate patch that fixes the buffers multiplexing
dependency we had. Buffering is now done per cpu.

You need this patch:

	commit b67577dfb45580c498bfdb1bc76c00c3b2ad6310
	Author: Frederic Weisbecker <fweisbec@gmail.com>
	Date:   Wed Feb 3 09:09:33 2010 +0100

	perf lock: Drop the buffers multiplexing dependency

But this is not in -rc1; you need to check out tip/perf/core

Well, the best would be to actually pull this on top of -rc1:

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing.git
	perf/core

so that you get the buffers multiplexing dependency fix that is in -tip,
and you also get the recursion fixes, not yet in -tip (but they should
make it soon).

I bet you still won't have magic results after that, but it should
still be better.

We still have further projects in mind to improve the scalability,
like the event injection thing (which avoids the string copy).

Again, you are my eyes on this, I'm still blind with my poor dual
laptop or my atom testbox.

Thanks for your testing.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 1/3] lockdep: Move lock events under lockdep recursion protection
  2010-03-09  8:35     ` Jens Axboe
@ 2010-03-10  0:05       ` Frederic Weisbecker
  2010-03-10 15:45         ` Peter Zijlstra
  2010-03-10 15:55         ` Jens Axboe
  0 siblings, 2 replies; 51+ messages in thread
From: Frederic Weisbecker @ 2010-03-10  0:05 UTC (permalink / raw)
  To: Jens Axboe, Peter Zijlstra, Greg KH
  Cc: Ingo Molnar, LKML, Arnaldo Carvalho de Melo, Steven Rostedt,
	Paul Mackerras, Hitoshi Mitake, Li Zefan, Lai Jiangshan,
	Masami Hiramatsu

On Tue, Mar 09, 2010 at 09:35:37AM +0100, Jens Axboe wrote:
> Which, btw, throws a new lockdep warning fest:
> 
> [   42.247718] scsi7 : ioc0: LSISAS1068E B3, FwRev=011b0300h, Ports=1, MaxQ=483, IRQ=26
> [   42.281125] BUG: key ffff880c7cb75250 not in .data!
> [   42.288346] ------------[ cut here ]------------
> [   42.294490] WARNING: at kernel/lockdep.c:2706 lockdep_init_map+0x545/0x5f0()
> [   42.304013] Hardware name: QSSC-S4R
> [   42.309674] Modules linked in: hid_apple usbhid ehci_hcd uhci_hcd usbcore nls_base mptsas(+) mptscsih mptbase scsi_transport_sas igb sg sr_mod cdrom
> [   42.332072] Pid: 5197, comm: modprobe Not tainted 2.6.34-rc1 #176
> [   42.340597] Call Trace:
> [   42.345335]  [<ffffffff8107fb8d>] ? is_module_address+0x2d/0x60
> [   42.353670]  [<ffffffff81074aa5>] ? lockdep_init_map+0x545/0x5f0
> [   42.362154]  [<ffffffff81044648>] warn_slowpath_common+0x78/0xd0
> [   42.371913]  [<ffffffff810446af>] warn_slowpath_null+0xf/0x20
> [   42.380109]  [<ffffffff81074aa5>] lockdep_init_map+0x545/0x5f0



This doesn't look related to my patch; it also happens in -rc1, right?



> [   42.388402]  [<ffffffff81156636>] ? sysfs_new_dirent+0x76/0x120


This calls:

#define sysfs_dirent_init_lockdep(sd)				\
do {								\
	struct attribute *attr = sd->s_attr.attr;		\
	struct lock_class_key *key = attr->key;			\
	if (!key)						\
		key = &attr->skey;				\
								\
	lockdep_init_map(&sd->dep_map, "s_active", key, 0);	\
} while(0)


Looks like the struct lock_class_key is dynamically allocated
and lockdep doesn't like that.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 1/3] lockdep: Move lock events under lockdep recursion protection
  2010-03-09  7:18   ` Hitoshi Mitake
@ 2010-03-10  0:10     ` Frederic Weisbecker
  0 siblings, 0 replies; 51+ messages in thread
From: Frederic Weisbecker @ 2010-03-10  0:10 UTC (permalink / raw)
  To: Hitoshi Mitake
  Cc: Ingo Molnar, LKML, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Steven Rostedt, Paul Mackerras, Li Zefan, Lai Jiangshan,
	Masami Hiramatsu, Jens Axboe

On Tue, Mar 09, 2010 at 04:18:58PM +0900, Hitoshi Mitake wrote:
> Thanks a lot, Frederic!
>
> I tested perf lock with your patch, result is like this,
>
> Typical scores:
>
> before:
> % sudo ./perf lock record ./perf bench sched messaging
> # Running sched/messaging benchmark...
> # 20 sender and receiver processes per group
> # 10 groups == 400 processes run
>
>      Total time: 3.265 [sec]
> [ perf record: Woken up 0 times to write data ]
> [ perf record: Captured and wrote 143.952 MB perf.data (~6289344 samples) ]
>
> after:
> % sudo ./perf lock record ./perf bench sched messaging
> # Running sched/messaging benchmark...
> # 20 sender and receiver processes per group
> # 10 groups == 400 processes run
>
>      Total time: 1.943 [sec]                       <--- about x1.5 faster!
> [ perf record: Woken up 0 times to write data ]
> [ perf record: Captured and wrote 98.161 MB perf.data (~4288734 samples) ]   <--- size of perf.data is also reduced
>


Oh great! Yeah this recursion thing was really bad.

Thanks.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 1/3] lockdep: Move lock events under lockdep recursion protection
  2010-03-10  0:05       ` Frederic Weisbecker
@ 2010-03-10 15:45         ` Peter Zijlstra
  2010-03-10 15:56           ` Jens Axboe
  2010-03-10 15:55         ` Jens Axboe
  1 sibling, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2010-03-10 15:45 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Jens Axboe, Greg KH, Ingo Molnar, LKML, Arnaldo Carvalho de Melo,
	Steven Rostedt, Paul Mackerras, Hitoshi Mitake, Li Zefan,
	Lai Jiangshan, Masami Hiramatsu

On Wed, 2010-03-10 at 01:05 +0100, Frederic Weisbecker wrote:
> On Tue, Mar 09, 2010 at 09:35:37AM +0100, Jens Axboe wrote:
> > Which, btw, throws a new lockdep warning fest:
> > 
> > [   42.247718] scsi7 : ioc0: LSISAS1068E B3, FwRev=011b0300h, Ports=1, MaxQ=483, IRQ=26
> > [   42.281125] BUG: key ffff880c7cb75250 not in .data!
> > [   42.288346] ------------[ cut here ]------------
> > [   42.294490] WARNING: at kernel/lockdep.c:2706 lockdep_init_map+0x545/0x5f0()
> > [   42.304013] Hardware name: QSSC-S4R
> > [   42.309674] Modules linked in: hid_apple usbhid ehci_hcd uhci_hcd usbcore nls_base mptsas(+) mptscsih mptbase scsi_transport_sas igb sg sr_mod cdrom
> > [   42.332072] Pid: 5197, comm: modprobe Not tainted 2.6.34-rc1 #176
> > [   42.340597] Call Trace:
> > [   42.345335]  [<ffffffff8107fb8d>] ? is_module_address+0x2d/0x60
> > [   42.353670]  [<ffffffff81074aa5>] ? lockdep_init_map+0x545/0x5f0
> > [   42.362154]  [<ffffffff81044648>] warn_slowpath_common+0x78/0xd0
> > [   42.371913]  [<ffffffff810446af>] warn_slowpath_null+0xf/0x20
> > [   42.380109]  [<ffffffff81074aa5>] lockdep_init_map+0x545/0x5f0
> 
> 
> 
> This doesn't look related to my patch; it also happens in -rc1, right?


The per-cpu changes broke lockdep, Tejun already posted patches for
this:

http://lkml.org/lkml/2010/3/10/76
http://lkml.org/lkml/2010/3/10/79



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 1/3] lockdep: Move lock events under lockdep recursion protection
  2010-03-09 23:45     ` Frederic Weisbecker
@ 2010-03-10 15:55       ` Jens Axboe
  0 siblings, 0 replies; 51+ messages in thread
From: Jens Axboe @ 2010-03-10 15:55 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Ingo Molnar, LKML, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Steven Rostedt, Paul Mackerras, Hitoshi Mitake, Li Zefan,
	Lai Jiangshan, Masami Hiramatsu

On Wed, Mar 10 2010, Frederic Weisbecker wrote:
> On Tue, Mar 09, 2010 at 09:34:10AM +0100, Jens Axboe wrote:
> > I went to try this on 2.6.34-rc1 to see how much it would improve things
> > here. With 2.6.34-rc1, a
> > 
> > $ time sudo perf lock record ls /dev/shm
> > 
> > essentially hangs the box, it ends up crashing hard (not just live
> > locked like before). With the patch in place, it does eventually finish:
> > 
> > real    0m21.301s
> > user    0m0.030s
> > sys     0m21.040s
> > 
> > The directory is empty.
> 
> 
> Hehe :-)
> 
> That said, you are probably missing a part of the puzzle.
> This patch avoids the scary recursions we had. But there
> is another separate patch that fixes the buffers multiplexing
> dependency we had. Buffering is now done per cpu.
> 
> You need this patch:
> 
> 	commit b67577dfb45580c498bfdb1bc76c00c3b2ad6310
> 	Author: Frederic Weisbecker <fweisbec@gmail.com>
> 	Date:   Wed Feb 3 09:09:33 2010 +0100
> 
> 	perf lock: Drop the buffers multiplexing dependency
> 
> But this is not in -rc1; you need to check out tip/perf/core
> 
> Well, the best would be to actually pull this on top of -rc1:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing.git
> 	perf/core
> 
> so that you get the buffers multiplexing dependency fix that is in -tip,
> and you also get the recursion fixes, not yet in -tip (but they should
> make it soon).
> 
> I bet you still won't have magic results after that, but it should
> still be better.

OK, I will give it a spin and see what happens. I had seen the
de-multiplexing patch, but just assumed it was in -rc1 already.

> We still have further projects in mind to improve the scalability,
> like the event injection thing (which avoids the string copy).
> 
> Again, you are my eyes on this, I'm still blind with my poor dual
> laptop or my atom testbox.
> 
> Thanks for your testing.

Not a problem, thank you for trying to get some nicer lock profiling in
the kernel :-)

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 1/3] lockdep: Move lock events under lockdep recursion protection
  2010-03-10  0:05       ` Frederic Weisbecker
  2010-03-10 15:45         ` Peter Zijlstra
@ 2010-03-10 15:55         ` Jens Axboe
  1 sibling, 0 replies; 51+ messages in thread
From: Jens Axboe @ 2010-03-10 15:55 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Peter Zijlstra, Greg KH, Ingo Molnar, LKML,
	Arnaldo Carvalho de Melo, Steven Rostedt, Paul Mackerras,
	Hitoshi Mitake, Li Zefan, Lai Jiangshan, Masami Hiramatsu

On Wed, Mar 10 2010, Frederic Weisbecker wrote:
> On Tue, Mar 09, 2010 at 09:35:37AM +0100, Jens Axboe wrote:
> > Which, btw, throws a new lockdep warning fest:
> > 
> > [   42.247718] scsi7 : ioc0: LSISAS1068E B3, FwRev=011b0300h, Ports=1, MaxQ=483, IRQ=26
> > [   42.281125] BUG: key ffff880c7cb75250 not in .data!
> > [   42.288346] ------------[ cut here ]------------
> > [   42.294490] WARNING: at kernel/lockdep.c:2706 lockdep_init_map+0x545/0x5f0()
> > [   42.304013] Hardware name: QSSC-S4R
> > [   42.309674] Modules linked in: hid_apple usbhid ehci_hcd uhci_hcd usbcore nls_base mptsas(+) mptscsih mptbase scsi_transport_sas igb sg sr_mod cdrom
> > [   42.332072] Pid: 5197, comm: modprobe Not tainted 2.6.34-rc1 #176
> > [   42.340597] Call Trace:
> > [   42.345335]  [<ffffffff8107fb8d>] ? is_module_address+0x2d/0x60
> > [   42.353670]  [<ffffffff81074aa5>] ? lockdep_init_map+0x545/0x5f0
> > [   42.362154]  [<ffffffff81044648>] warn_slowpath_common+0x78/0xd0
> > [   42.371913]  [<ffffffff810446af>] warn_slowpath_null+0xf/0x20
> > [   42.380109]  [<ffffffff81074aa5>] lockdep_init_map+0x545/0x5f0
> 
> 
> 
> This doesn't look related to my patch; it also happens in -rc1, right?

It's not related, it does happen in -rc1 plain as well.

> > [   42.388402]  [<ffffffff81156636>] ? sysfs_new_dirent+0x76/0x120
> 
> 
> This calls:
> 
> #define sysfs_dirent_init_lockdep(sd)				\
> do {								\
> 	struct attribute *attr = sd->s_attr.attr;		\
> 	struct lock_class_key *key = attr->key;			\
> 	if (!key)						\
> 		key = &attr->skey;				\
> 								\
> 	lockdep_init_map(&sd->dep_map, "s_active", key, 0);	\
> } while(0)
> 
> 
> Looks like the struct lock_class_key is dynamically allocated
> and lockdep doesn't like that.

I saw someone else report the same warning, but from a different path
than mine. I'll try and bisect it.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 1/3] lockdep: Move lock events under lockdep recursion protection
  2010-03-10 15:45         ` Peter Zijlstra
@ 2010-03-10 15:56           ` Jens Axboe
  0 siblings, 0 replies; 51+ messages in thread
From: Jens Axboe @ 2010-03-10 15:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, Greg KH, Ingo Molnar, LKML,
	Arnaldo Carvalho de Melo, Steven Rostedt, Paul Mackerras,
	Hitoshi Mitake, Li Zefan, Lai Jiangshan, Masami Hiramatsu

On Wed, Mar 10 2010, Peter Zijlstra wrote:
> On Wed, 2010-03-10 at 01:05 +0100, Frederic Weisbecker wrote:
> > On Tue, Mar 09, 2010 at 09:35:37AM +0100, Jens Axboe wrote:
> > > Which, btw, throws a new lockdep warning fest:
> > > 
> > > [   42.247718] scsi7 : ioc0: LSISAS1068E B3, FwRev=011b0300h, Ports=1, MaxQ=483, IRQ=26
> > > [   42.281125] BUG: key ffff880c7cb75250 not in .data!
> > > [   42.288346] ------------[ cut here ]------------
> > > [   42.294490] WARNING: at kernel/lockdep.c:2706 lockdep_init_map+0x545/0x5f0()
> > > [   42.304013] Hardware name: QSSC-S4R
> > > [   42.309674] Modules linked in: hid_apple usbhid ehci_hcd uhci_hcd usbcore nls_base mptsas(+) mptscsih mptbase scsi_transport_sas igb sg sr_mod cdrom
> > > [   42.332072] Pid: 5197, comm: modprobe Not tainted 2.6.34-rc1 #176
> > > [   42.340597] Call Trace:
> > > [   42.345335]  [<ffffffff8107fb8d>] ? is_module_address+0x2d/0x60
> > > [   42.353670]  [<ffffffff81074aa5>] ? lockdep_init_map+0x545/0x5f0
> > > [   42.362154]  [<ffffffff81044648>] warn_slowpath_common+0x78/0xd0
> > > [   42.371913]  [<ffffffff810446af>] warn_slowpath_null+0xf/0x20
> > > [   42.380109]  [<ffffffff81074aa5>] lockdep_init_map+0x545/0x5f0
> > 
> > 
> > 
> > This doesn't look related to my patch; it also happens in -rc1, right?
> 
> 
> The per-cpu changes broke lockdep, Tejun already posted patches for
> this:
> 
> http://lkml.org/lkml/2010/3/10/76
> http://lkml.org/lkml/2010/3/10/79

Ah good, saves me the trouble of locating it, thanks!

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 51+ messages in thread

end of thread, other threads:[~2010-03-10 15:56 UTC | newest]

Thread overview: 51+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-03-03  6:54 [GIT PULL] perf updates Frederic Weisbecker
2010-03-03  6:55 ` [PATCH 1/3] lockdep: Move lock events under lockdep recursion protection Frederic Weisbecker
2010-03-09  7:18   ` Hitoshi Mitake
2010-03-10  0:10     ` Frederic Weisbecker
2010-03-09  8:34   ` Jens Axboe
2010-03-09  8:35     ` Jens Axboe
2010-03-10  0:05       ` Frederic Weisbecker
2010-03-10 15:45         ` Peter Zijlstra
2010-03-10 15:56           ` Jens Axboe
2010-03-10 15:55         ` Jens Axboe
2010-03-09 23:45     ` Frederic Weisbecker
2010-03-10 15:55       ` Jens Axboe
2010-03-03  6:55 ` [RFC][PATCH 2/3] perf: Take a hot regs snapshot for trace events Frederic Weisbecker
2010-03-03  8:38   ` Peter Zijlstra
2010-03-03 20:07     ` Frederic Weisbecker
2010-03-04 19:01     ` Frederic Weisbecker
2010-03-05  3:08     ` [PATCH 0/2] Perf " Frederic Weisbecker
2010-03-05  3:08     ` [PATCH 1/2] perf: Introduce new perf_save_regs() for hot regs snapshot Frederic Weisbecker
2010-03-05  3:08       ` Frederic Weisbecker
2010-03-05  3:08       ` Frederic Weisbecker
2010-03-05 15:08       ` Masami Hiramatsu
2010-03-05 16:38         ` Frederic Weisbecker
2010-03-05 17:08           ` Masami Hiramatsu
2010-03-05 17:17             ` Frederic Weisbecker
2010-03-05 20:55             ` [PATCH 1/2] perf: Introduce new perf_fetch_caller_regs() " Frederic Weisbecker
2010-03-05 20:55               ` Frederic Weisbecker
2010-03-05 20:55               ` Frederic Weisbecker
2010-03-05 20:55             ` [PATCH 2/2] perf: Take a hot regs snapshot for trace events Frederic Weisbecker
2010-03-05 20:55               ` Frederic Weisbecker
2010-03-05 20:55               ` Frederic Weisbecker
2010-03-05  3:08     ` Frederic Weisbecker
2010-03-05  3:08       ` Frederic Weisbecker
2010-03-05  3:08       ` Frederic Weisbecker
2010-03-03 16:06   ` [RFC][PATCH 2/3] " Steven Rostedt
2010-03-03 16:37     ` Peter Zijlstra
2010-03-03 17:07       ` Steven Rostedt
2010-03-03 17:16         ` Peter Zijlstra
2010-03-03 17:45           ` Steven Rostedt
2010-03-03 20:37             ` Frederic Weisbecker
2010-03-04 11:25           ` Ingo Molnar
2010-03-04 15:16             ` Steven Rostedt
2010-03-04 15:36               ` Ingo Molnar
2010-03-04 15:55                 ` Steven Rostedt
2010-03-04 21:17                   ` Ingo Molnar
2010-03-04 21:30                     ` Steven Rostedt
2010-03-04 21:37                       ` Frederic Weisbecker
2010-03-04 21:52                         ` Thomas Gleixner
2010-03-04 22:01                           ` Frederic Weisbecker
2010-03-04 22:02                         ` Steven Rostedt
2010-03-04 22:09                           ` Frederic Weisbecker
2010-03-03  6:55 ` [PATCH 3/3] perf/x86-64: Use frame pointer to walk on irq and process stacks Frederic Weisbecker
