All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/4] trace: add ability to collect call chains of non-current task.
@ 2011-09-26 15:55 Andrew Vagin
  2011-09-26 15:55 ` [PATCH 1/4] perf: fix counter of ftrace events Andrew Vagin
                   ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: Andrew Vagin @ 2011-09-26 15:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: Steven Rostedt, Frederic Weisbecker, Ingo Molnar, Peter Zijlstra,
	Paul Mackerras, Arnaldo Carvalho de Melo, avagin

Without these patches, an event may only save call chain of a current
task, but some events may require call chains of non-current task.

E.g. You may want to know where and how long a task is sleeping. A time
slice is known when the task wakes up.  We send event in this moment,
but a target task isn't "current".

Usually when the system begin stuttering, it often waits a lock or an
operation for a long time. For investigation such situation you probaly
send sysrq-t and analyze backtraces, but perf may do it more effectively.

A macto __perf_task() is set task for wich a call chain will be collected.

Now this functionality is suitable for sched_stat_sleep and sched_stat_iowait.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 1/4] perf: fix counter of ftrace events
  2011-09-26 15:55 [PATCH 0/4] trace: add ability to collect call chains of non-current task Andrew Vagin
@ 2011-09-26 15:55 ` Andrew Vagin
  2011-10-14 19:11   ` Arun Sharma
  2011-09-26 15:55 ` [PATCH 2/4] trace: prepare to collect call chains of non-current task Andrew Vagin
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 17+ messages in thread
From: Andrew Vagin @ 2011-09-26 15:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: Steven Rostedt, Frederic Weisbecker, Ingo Molnar, Peter Zijlstra,
	Paul Mackerras, Arnaldo Carvalho de Melo, avagin

Each event adds some points to its counters. By default it adds 1,
and a number of points may be transmited in event's parameters.

E.g. sched:sched_stat_runtime adds how long process has been running.

But this functionality was broken by v2.6.31-rc5-392-gf413cdb
and now the event's parameters doesn't affect on a number of points.

TP_perf_assign isn't defined, so __perf_count(c) isn't executed and
__count is always equal to 1.

Signed-off-by: Andrew Vagin <avagin@openvz.org>
---
 include/trace/ftrace.h |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h
index 533c49f..7697249 100644
--- a/include/trace/ftrace.h
+++ b/include/trace/ftrace.h
@@ -711,6 +711,9 @@ __attribute__((section("_ftrace_events"))) *__event_##call = &event_##call
 #undef __perf_count
 #define __perf_count(c) __count = (c)
 
+#undef TP_perf_assign
+#define TP_perf_assign(args...) args
+
 #undef DECLARE_EVENT_CLASS
 #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
 static notrace void							\
-- 
1.7.1


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 2/4] trace: prepare to collect call chains of non-current task.
  2011-09-26 15:55 [PATCH 0/4] trace: add ability to collect call chains of non-current task Andrew Vagin
  2011-09-26 15:55 ` [PATCH 1/4] perf: fix counter of ftrace events Andrew Vagin
@ 2011-09-26 15:55 ` Andrew Vagin
  2011-09-27 14:02   ` Peter Zijlstra
  2011-10-14 19:12   ` Arun Sharma
  2011-09-26 15:55 ` [PATCH 3/4] trace: add ability to collect call chain " Andrew Vagin
  2011-09-26 15:55 ` [PATCH 4/4] events: sched_stat_template saves call chains of a target task Andrew Vagin
  3 siblings, 2 replies; 17+ messages in thread
From: Andrew Vagin @ 2011-09-26 15:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: Steven Rostedt, Frederic Weisbecker, Ingo Molnar, Peter Zijlstra,
	Paul Mackerras, Arnaldo Carvalho de Melo, avagin

This patch adds argument "task" to all functions between perf_tp_event()
and perf_callchain(). All logic is in the next patch.

We need "task" and "regs" simultaneously, because regs contains
state of current task, before it entered in perf code. Task contains
pointer to task_struct and it will be used for collecting call chains
and filling event data, if task isn't current.

Signed-off-by: Andrew Vagin <avagin@openvz.org>
---
 arch/arm/kernel/perf_event.c              |    6 +++-
 arch/mips/kernel/perf_event.c             |   13 ++++++---
 arch/powerpc/kernel/perf_callchain.c      |    6 +++-
 arch/sh/kernel/perf_callchain.c           |    6 +++-
 arch/sparc/kernel/perf_event.c            |    5 +++-
 arch/x86/kernel/cpu/perf_event.c          |    6 +++-
 arch/x86/kernel/cpu/perf_event_intel_ds.c |    2 +-
 include/linux/ftrace_event.h              |    2 +-
 include/linux/perf_event.h                |    8 +++--
 include/trace/ftrace.h                    |    4 +-
 kernel/events/core.c                      |   43 ++++++++++++++++-------------
 11 files changed, 66 insertions(+), 35 deletions(-)

diff --git a/arch/arm/kernel/perf_event.c b/arch/arm/kernel/perf_event.c
index 53c9c26..0cb8d44 100644
--- a/arch/arm/kernel/perf_event.c
+++ b/arch/arm/kernel/perf_event.c
@@ -769,10 +769,14 @@ callchain_trace(struct stackframe *fr,
 }
 
 void
-perf_callchain_kernel(struct perf_callchain_entry *entry, struct pt_regs *regs)
+perf_callchain_kernel(struct perf_callchain_entry *entry,
+			struct task_struct *tsk, struct pt_regs *regs)
 {
 	struct stackframe fr;
 
+	if (!regs)
+		return;
+
 	fr.fp = regs->ARM_fp;
 	fr.sp = regs->ARM_sp;
 	fr.lr = regs->ARM_lr;
diff --git a/arch/mips/kernel/perf_event.c b/arch/mips/kernel/perf_event.c
index 0aee944..c216182 100644
--- a/arch/mips/kernel/perf_event.c
+++ b/arch/mips/kernel/perf_event.c
@@ -559,12 +559,17 @@ static void save_raw_perf_callchain(struct perf_callchain_entry *entry,
 }
 
 void perf_callchain_kernel(struct perf_callchain_entry *entry,
-		      struct pt_regs *regs)
+			struct task_struct *tsk, struct pt_regs *regs)
 {
-	unsigned long sp = regs->regs[29];
+	unsigned long sp, ra, pc;
+
+	if (!regs)
+		return;
+
+	sp = regs->regs[29];
 #ifdef CONFIG_KALLSYMS
-	unsigned long ra = regs->regs[31];
-	unsigned long pc = regs->cp0_epc;
+	ra = regs->regs[31];
+	pc = regs->cp0_epc;
 
 	if (raw_show_trace || !__kernel_text_address(pc)) {
 		unsigned long stack_page =
diff --git a/arch/powerpc/kernel/perf_callchain.c b/arch/powerpc/kernel/perf_callchain.c
index 564c1d8..a39369c 100644
--- a/arch/powerpc/kernel/perf_callchain.c
+++ b/arch/powerpc/kernel/perf_callchain.c
@@ -47,7 +47,8 @@ static int valid_next_sp(unsigned long sp, unsigned long prev_sp)
 }
 
 void
-perf_callchain_kernel(struct perf_callchain_entry *entry, struct pt_regs *regs)
+perf_callchain_kernel(struct perf_callchain_entry *entry,
+			struct task_struct *tsk, struct pt_regs *regs)
 {
 	unsigned long sp, next_sp;
 	unsigned long next_ip;
@@ -55,6 +56,9 @@ perf_callchain_kernel(struct perf_callchain_entry *entry, struct pt_regs *regs)
 	long level = 0;
 	unsigned long *fp;
 
+	if (!regs)
+		return;
+
 	lr = regs->link;
 	sp = regs->gpr[1];
 	perf_callchain_store(entry, regs->nip);
diff --git a/arch/sh/kernel/perf_callchain.c b/arch/sh/kernel/perf_callchain.c
index cc80b61..d755c0e 100644
--- a/arch/sh/kernel/perf_callchain.c
+++ b/arch/sh/kernel/perf_callchain.c
@@ -33,8 +33,12 @@ static const struct stacktrace_ops callchain_ops = {
 };
 
 void
-perf_callchain_kernel(struct perf_callchain_entry *entry, struct pt_regs *regs)
+perf_callchain_kernel(struct perf_callchain_entry *entry,
+			struct task_struct *tsk, struct pt_regs *regs)
 {
+	if (!regs)
+		return;
+
 	perf_callchain_store(entry, regs->pc);
 
 	unwind_stack(NULL, regs, NULL, &callchain_ops, entry);
diff --git a/arch/sparc/kernel/perf_event.c b/arch/sparc/kernel/perf_event.c
index 614da62..e0bf3d3 100644
--- a/arch/sparc/kernel/perf_event.c
+++ b/arch/sparc/kernel/perf_event.c
@@ -1370,13 +1370,16 @@ int __init init_hw_perf_events(void)
 early_initcall(init_hw_perf_events);
 
 void perf_callchain_kernel(struct perf_callchain_entry *entry,
-			   struct pt_regs *regs)
+			struct task_struct *tsk, struct pt_regs *regs)
 {
 	unsigned long ksp, fp;
 #ifdef CONFIG_FUNCTION_GRAPH_TRACER
 	int graph = 0;
 #endif
 
+	if (!regs)
+		return;
+
 	stack_trace_flush();
 
 	perf_callchain_store(entry, regs->tpc);
diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index cfa62ec..c3d229d 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1836,8 +1836,12 @@ static const struct stacktrace_ops backtrace_ops = {
 };
 
 void
-perf_callchain_kernel(struct perf_callchain_entry *entry, struct pt_regs *regs)
+perf_callchain_kernel(struct perf_callchain_entry *entry,
+			struct task_struct *tsk, struct pt_regs *regs)
 {
+	if (!regs)
+		return;
+
 	if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
 		/* TODO: We don't support guest os callchain now */
 		return;
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 1b1ef3a..2631ebb 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -338,7 +338,7 @@ static int intel_pmu_drain_bts_buffer(void)
 	 * We will overwrite the from and to address before we output
 	 * the sample.
 	 */
-	perf_prepare_sample(&header, &data, event, &regs);
+	perf_prepare_sample(&header, &data, event, current, &regs);
 
 	if (perf_output_begin(&handle, event, header.size * (top - at)))
 		return 1;
diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
index 96efa67..5097f9a 100644
--- a/include/linux/ftrace_event.h
+++ b/include/linux/ftrace_event.h
@@ -294,7 +294,7 @@ static inline void
 perf_trace_buf_submit(void *raw_data, int size, int rctx, u64 addr,
 		       u64 count, struct pt_regs *regs, void *head)
 {
-	perf_tp_event(addr, count, raw_data, size, regs, head, rctx);
+	perf_tp_event(addr, count, raw_data, size, current, regs, head, rctx);
 }
 #endif
 
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index c816075..db5acb3 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1003,6 +1003,7 @@ extern void perf_output_sample(struct perf_output_handle *handle,
 extern void perf_prepare_sample(struct perf_event_header *header,
 				struct perf_sample_data *data,
 				struct perf_event *event,
+				struct task_struct *tsk,
 				struct pt_regs *regs);
 
 extern int perf_event_overflow(struct perf_event *event,
@@ -1089,7 +1090,8 @@ extern void perf_event_fork(struct task_struct *tsk);
 DECLARE_PER_CPU(struct perf_callchain_entry, perf_callchain_entry);
 
 extern void perf_callchain_user(struct perf_callchain_entry *entry, struct pt_regs *regs);
-extern void perf_callchain_kernel(struct perf_callchain_entry *entry, struct pt_regs *regs);
+extern void perf_callchain_kernel(struct perf_callchain_entry *entry,
+			struct task_struct *tsk, struct pt_regs *regs);
 
 static inline void perf_callchain_store(struct perf_callchain_entry *entry, u64 ip)
 {
@@ -1121,8 +1123,8 @@ static inline bool perf_paranoid_kernel(void)
 }
 
 extern void perf_event_init(void);
-extern void perf_tp_event(u64 addr, u64 count, void *record,
-			  int entry_size, struct pt_regs *regs,
+extern void perf_tp_event(u64 addr, u64 count, void *record, int entry_size,
+			  struct task_struct *tsk, struct pt_regs *regs,
 			  struct hlist_head *head, int rctx);
 extern void perf_bp_event(struct perf_event *event, void *data);
 
diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h
index 7697249..5e4c72a 100644
--- a/include/trace/ftrace.h
+++ b/include/trace/ftrace.h
@@ -750,8 +750,8 @@ perf_trace_##call(void *__data, proto)					\
 	{ assign; }							\
 									\
 	head = this_cpu_ptr(event_call->perf_events);			\
-	perf_trace_buf_submit(entry, __entry_size, rctx, __addr,	\
-		__count, &__regs, head);				\
+	perf_tp_event(__addr, __count, entry, __entry_size,		\
+				current, &__regs, head, rctx);		\
 }
 
 /*
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 0f85778..41ce4db 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2584,7 +2584,7 @@ struct callchain_cpus_entries *callchain_cpus_entries;
 
 
 __weak void perf_callchain_kernel(struct perf_callchain_entry *entry,
-				  struct pt_regs *regs)
+			struct task_struct *tsk, struct pt_regs *regs)
 {
 }
 
@@ -2742,12 +2742,12 @@ put_callchain_entry(int rctx)
 	put_recursion_context(__get_cpu_var(callchain_recursion), rctx);
 }
 
-static struct perf_callchain_entry *perf_callchain(struct pt_regs *regs)
+static struct perf_callchain_entry
+		*perf_callchain(struct task_struct *tsk, struct pt_regs *regs)
 {
 	int rctx;
 	struct perf_callchain_entry *entry;
 
-
 	entry = get_callchain_entry(&rctx);
 	if (rctx == -1)
 		return NULL;
@@ -2759,7 +2759,7 @@ static struct perf_callchain_entry *perf_callchain(struct pt_regs *regs)
 
 	if (!user_mode(regs)) {
 		perf_callchain_store(entry, PERF_CONTEXT_KERNEL);
-		perf_callchain_kernel(entry, regs);
+		perf_callchain_kernel(entry, tsk, regs);
 		if (current->mm)
 			regs = task_pt_regs(current);
 		else
@@ -3991,6 +3991,7 @@ void perf_output_sample(struct perf_output_handle *handle,
 void perf_prepare_sample(struct perf_event_header *header,
 			 struct perf_sample_data *data,
 			 struct perf_event *event,
+			 struct task_struct *tsk,
 			 struct pt_regs *regs)
 {
 	u64 sample_type = event->attr.sample_type;
@@ -4009,7 +4010,7 @@ void perf_prepare_sample(struct perf_event_header *header,
 	if (sample_type & PERF_SAMPLE_CALLCHAIN) {
 		int size = 1;
 
-		data->callchain = perf_callchain(regs);
+		data->callchain = perf_callchain(tsk, regs);
 
 		if (data->callchain)
 			size += data->callchain->nr;
@@ -4032,6 +4033,7 @@ void perf_prepare_sample(struct perf_event_header *header,
 
 static void perf_event_output(struct perf_event *event,
 				struct perf_sample_data *data,
+				struct task_struct *tsk,
 				struct pt_regs *regs)
 {
 	struct perf_output_handle handle;
@@ -4040,7 +4042,7 @@ static void perf_event_output(struct perf_event *event,
 	/* protect the callchain buffers */
 	rcu_read_lock();
 
-	perf_prepare_sample(&header, data, event, regs);
+	perf_prepare_sample(&header, data, event, tsk, regs);
 
 	if (perf_output_begin(&handle, event, header.size))
 		goto exit;
@@ -4619,6 +4621,7 @@ static void perf_log_throttle(struct perf_event *event, int enable)
 
 static int __perf_event_overflow(struct perf_event *event,
 				   int throttle, struct perf_sample_data *data,
+				   struct task_struct *tsk,
 				   struct pt_regs *regs)
 {
 	int events = atomic_read(&event->event_limit);
@@ -4667,7 +4670,7 @@ static int __perf_event_overflow(struct perf_event *event,
 	if (event->overflow_handler)
 		event->overflow_handler(event, data, regs);
 	else
-		perf_event_output(event, data, regs);
+		perf_event_output(event, data, tsk, regs);
 
 	if (event->fasync && event->pending_kill) {
 		event->pending_wakeup = 1;
@@ -4681,7 +4684,7 @@ int perf_event_overflow(struct perf_event *event,
 			  struct perf_sample_data *data,
 			  struct pt_regs *regs)
 {
-	return __perf_event_overflow(event, 1, data, regs);
+	return __perf_event_overflow(event, 1, data, current, regs);
 }
 
 /*
@@ -4731,6 +4734,7 @@ again:
 
 static void perf_swevent_overflow(struct perf_event *event, u64 overflow,
 				    struct perf_sample_data *data,
+				    struct task_struct *tsk,
 				    struct pt_regs *regs)
 {
 	struct hw_perf_event *hwc = &event->hw;
@@ -4745,7 +4749,7 @@ static void perf_swevent_overflow(struct perf_event *event, u64 overflow,
 
 	for (; overflow; overflow--) {
 		if (__perf_event_overflow(event, throttle,
-					    data, regs)) {
+					    data, tsk, regs)) {
 			/*
 			 * We inhibit the overflow from happening when
 			 * hwc->interrupts == MAX_INTERRUPTS.
@@ -4758,25 +4762,26 @@ static void perf_swevent_overflow(struct perf_event *event, u64 overflow,
 
 static void perf_swevent_event(struct perf_event *event, u64 nr,
 			       struct perf_sample_data *data,
+			       struct task_struct *tsk,
 			       struct pt_regs *regs)
 {
 	struct hw_perf_event *hwc = &event->hw;
 
 	local64_add(nr, &event->count);
 
-	if (!regs)
+	if (!tsk && !regs)
 		return;
 
 	if (!is_sampling_event(event))
 		return;
 
 	if (nr == 1 && hwc->sample_period == 1 && !event->attr.freq)
-		return perf_swevent_overflow(event, 1, data, regs);
+		return perf_swevent_overflow(event, 1, data, tsk, regs);
 
 	if (local64_add_negative(nr, &hwc->period_left))
 		return;
 
-	perf_swevent_overflow(event, 0, data, regs);
+	perf_swevent_overflow(event, 0, data, tsk, regs);
 }
 
 static int perf_exclude_event(struct perf_event *event,
@@ -4880,7 +4885,7 @@ static void do_perf_sw_event(enum perf_type_id type, u32 event_id,
 
 	hlist_for_each_entry_rcu(event, node, head, hlist_entry) {
 		if (perf_swevent_match(event, type, event_id, data, regs))
-			perf_swevent_event(event, nr, data, regs);
+			perf_swevent_event(event, nr, data, current, regs);
 	}
 end:
 	rcu_read_unlock();
@@ -5127,8 +5132,7 @@ static int perf_tp_filter_match(struct perf_event *event,
 }
 
 static int perf_tp_event_match(struct perf_event *event,
-				struct perf_sample_data *data,
-				struct pt_regs *regs)
+				struct perf_sample_data *data)
 {
 	if (event->hw.state & PERF_HES_STOPPED)
 		return 0;
@@ -5145,7 +5149,8 @@ static int perf_tp_event_match(struct perf_event *event,
 }
 
 void perf_tp_event(u64 addr, u64 count, void *record, int entry_size,
-		   struct pt_regs *regs, struct hlist_head *head, int rctx)
+			struct task_struct *tsk, struct pt_regs *regs,
+			struct hlist_head *head, int rctx)
 {
 	struct perf_sample_data data;
 	struct perf_event *event;
@@ -5160,8 +5165,8 @@ void perf_tp_event(u64 addr, u64 count, void *record, int entry_size,
 	data.raw = &raw;
 
 	hlist_for_each_entry_rcu(event, node, head, hlist_entry) {
-		if (perf_tp_event_match(event, &data, regs))
-			perf_swevent_event(event, count, &data, regs);
+		if (perf_tp_event_match(event, &data))
+			perf_swevent_event(event, count, &data, tsk, regs);
 	}
 
 	perf_swevent_put_recursion_context(rctx);
@@ -5254,7 +5259,7 @@ void perf_bp_event(struct perf_event *bp, void *data)
 	perf_sample_data_init(&sample, bp->attr.bp_addr);
 
 	if (!bp->hw.state && !perf_exclude_event(bp, regs))
-		perf_swevent_event(bp, 1, &sample, regs);
+		perf_swevent_event(bp, 1, &sample, current, regs);
 }
 #endif
 
-- 
1.7.1


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 3/4] trace: add ability to collect call chain of non-current task.
  2011-09-26 15:55 [PATCH 0/4] trace: add ability to collect call chains of non-current task Andrew Vagin
  2011-09-26 15:55 ` [PATCH 1/4] perf: fix counter of ftrace events Andrew Vagin
  2011-09-26 15:55 ` [PATCH 2/4] trace: prepare to collect call chains of non-current task Andrew Vagin
@ 2011-09-26 15:55 ` Andrew Vagin
  2011-09-27 14:05   ` Peter Zijlstra
  2011-09-26 15:55 ` [PATCH 4/4] events: sched_stat_template saves call chains of a target task Andrew Vagin
  3 siblings, 1 reply; 17+ messages in thread
From: Andrew Vagin @ 2011-09-26 15:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: Steven Rostedt, Frederic Weisbecker, Ingo Molnar, Peter Zijlstra,
	Paul Mackerras, Arnaldo Carvalho de Melo, avagin

An event may requires call chain of non-current task.  E.g. You may want
to know where and how long a task is sleeping. A time slice is known
when the task wakes up.  We can send event in this moment, but a target
task isn't "current".

The macro __perf_task set a task for which a call chain will be
collected.

Know issues:
* Now call chains for non-current tasks are collected on x86 only,
  but it may be done for other architectures simply.
* It collects only kernel call chains, because we can't get direct
  access to memory of other processes and this operation should be
  fast enough.

Signed-off-by: Andrew Vagin <avagin@openvz.org>
---
 arch/x86/kernel/cpu/perf_event.c |    8 +++-----
 include/trace/ftrace.h           |   11 +++++++++--
 kernel/events/core.c             |   30 ++++++++++++++++++------------
 3 files changed, 30 insertions(+), 19 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index c3d229d..18817a8 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1839,17 +1839,15 @@ void
 perf_callchain_kernel(struct perf_callchain_entry *entry,
 			struct task_struct *tsk, struct pt_regs *regs)
 {
-	if (!regs)
-		return;
-
 	if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
 		/* TODO: We don't support guest os callchain now */
 		return;
 	}
 
-	perf_callchain_store(entry, regs->ip);
+	if (regs)
+		perf_callchain_store(entry, regs->ip);
 
-	dump_trace(NULL, regs, NULL, 0, &backtrace_ops, entry);
+	dump_trace(tsk, regs, NULL, 0, &backtrace_ops, entry);
 }
 
 #ifdef CONFIG_COMPAT
diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h
index 5e4c72a..d03e926 100644
--- a/include/trace/ftrace.h
+++ b/include/trace/ftrace.h
@@ -711,6 +711,9 @@ __attribute__((section("_ftrace_events"))) *__event_##call = &event_##call
 #undef __perf_count
 #define __perf_count(c) __count = (c)
 
+#undef __perf_task
+#define __perf_task(t) __task = (t)
+
 #undef TP_perf_assign
 #define TP_perf_assign(args...) args
 
@@ -722,7 +725,8 @@ perf_trace_##call(void *__data, proto)					\
 	struct ftrace_event_call *event_call = __data;			\
 	struct ftrace_data_offsets_##call __maybe_unused __data_offsets;\
 	struct ftrace_raw_##call *entry;				\
-	struct pt_regs __regs;						\
+	struct pt_regs __regs, *__pregs = &__regs;			\
+	struct task_struct *__task = current;				\
 	u64 __addr = 0, __count = 1;					\
 	struct hlist_head *head;					\
 	int __entry_size;						\
@@ -749,9 +753,12 @@ perf_trace_##call(void *__data, proto)					\
 									\
 	{ assign; }							\
 									\
+	if (__task != current)						\
+		__pregs = NULL;						\
+									\
 	head = this_cpu_ptr(event_call->perf_events);			\
 	perf_tp_event(__addr, __count, entry, __entry_size,		\
-				current, &__regs, head, rctx);		\
+				__task, __pregs, head, rctx);		\
 }
 
 /*
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 41ce4db..2dce0f9 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2757,16 +2757,16 @@ static struct perf_callchain_entry
 
 	entry->nr = 0;
 
-	if (!user_mode(regs)) {
+	if (!regs || !user_mode(regs)) {
 		perf_callchain_store(entry, PERF_CONTEXT_KERNEL);
 		perf_callchain_kernel(entry, tsk, regs);
-		if (current->mm)
-			regs = task_pt_regs(current);
+		if (tsk->mm)
+			regs = task_pt_regs(tsk);
 		else
 			regs = NULL;
 	}
 
-	if (regs) {
+	if (tsk == current && regs) {
 		perf_callchain_store(entry, PERF_CONTEXT_USER);
 		perf_callchain_user(entry, regs);
 	}
@@ -3746,7 +3746,8 @@ EXPORT_SYMBOL_GPL(perf_unregister_guest_info_callbacks);
 
 static void __perf_event_header__init_id(struct perf_event_header *header,
 					 struct perf_sample_data *data,
-					 struct perf_event *event)
+					 struct perf_event *event,
+					 struct task_struct *tsk)
 {
 	u64 sample_type = event->attr.sample_type;
 
@@ -3755,8 +3756,8 @@ static void __perf_event_header__init_id(struct perf_event_header *header,
 
 	if (sample_type & PERF_SAMPLE_TID) {
 		/* namespace issues */
-		data->tid_entry.pid = perf_event_pid(event, current);
-		data->tid_entry.tid = perf_event_tid(event, current);
+		data->tid_entry.pid = perf_event_pid(event, tsk);
+		data->tid_entry.tid = perf_event_tid(event, tsk);
 	}
 
 	if (sample_type & PERF_SAMPLE_TIME)
@@ -3779,7 +3780,7 @@ void perf_event_header__init_id(struct perf_event_header *header,
 				struct perf_event *event)
 {
 	if (event->attr.sample_id_all)
-		__perf_event_header__init_id(header, data, event);
+		__perf_event_header__init_id(header, data, event, current);
 }
 
 static void __perf_event__output_id_sample(struct perf_output_handle *handle,
@@ -4000,12 +4001,17 @@ void perf_prepare_sample(struct perf_event_header *header,
 	header->size = sizeof(*header) + event->header_size;
 
 	header->misc = 0;
-	header->misc |= perf_misc_flags(regs);
 
-	__perf_event_header__init_id(header, data, event);
+	__perf_event_header__init_id(header, data, event, tsk);
 
-	if (sample_type & PERF_SAMPLE_IP)
-		data->ip = perf_instruction_pointer(regs);
+	if (regs) {
+		header->misc |= perf_misc_flags(regs);
+		if (sample_type & PERF_SAMPLE_IP)
+			data->ip = perf_instruction_pointer(regs);
+	} else {
+		header->misc |= PERF_RECORD_MISC_KERNEL;
+		data->ip = 0;
+	}
 
 	if (sample_type & PERF_SAMPLE_CALLCHAIN) {
 		int size = 1;
-- 
1.7.1


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 4/4] events: sched_stat_template saves call chains of a target task
  2011-09-26 15:55 [PATCH 0/4] trace: add ability to collect call chains of non-current task Andrew Vagin
                   ` (2 preceding siblings ...)
  2011-09-26 15:55 ` [PATCH 3/4] trace: add ability to collect call chain " Andrew Vagin
@ 2011-09-26 15:55 ` Andrew Vagin
  3 siblings, 0 replies; 17+ messages in thread
From: Andrew Vagin @ 2011-09-26 15:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: Steven Rostedt, Frederic Weisbecker, Ingo Molnar, Peter Zijlstra,
	Paul Mackerras, Arnaldo Carvalho de Melo, avagin

It's more useful, because

sched:sched_stat_iowait:
Before we can see that block layer notifies that i/o request completed.
Now we can see which code wanted to read/write data.

sched:sched_stat_sleep:
Before we can see that delayed_work_timer_fn wakes up a task.
Now we see where a task sleeps.

Signed-off-by: Andrew Vagin <avagin@openvz.org>
---
 include/trace/events/sched.h |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index f633478..7468c7a 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -296,6 +296,7 @@ DECLARE_EVENT_CLASS(sched_stat_template,
 		__entry->delay	= delay;
 	)
 	TP_perf_assign(
+		__perf_task(tsk);
 		__perf_count(delay);
 	),
 
-- 
1.7.1


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/4] trace: prepare to collect call chains of non-current task.
  2011-09-26 15:55 ` [PATCH 2/4] trace: prepare to collect call chains of non-current task Andrew Vagin
@ 2011-09-27 14:02   ` Peter Zijlstra
  2011-10-14 19:12   ` Arun Sharma
  1 sibling, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2011-09-27 14:02 UTC (permalink / raw)
  To: Andrew Vagin
  Cc: linux-kernel, Steven Rostedt, Frederic Weisbecker, Ingo Molnar,
	Paul Mackerras, Arnaldo Carvalho de Melo

On Mon, 2011-09-26 at 19:55 +0400, Andrew Vagin wrote:
> +perf_callchain_kernel(struct perf_callchain_entry *entry,
> +                       struct task_struct *tsk, struct pt_regs *regs) 

Hmm, we have event->ctx->task, but I guess for per-cpu events that's
useless..



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/4] trace: add ability to collect call chain of non-current task.
  2011-09-26 15:55 ` [PATCH 3/4] trace: add ability to collect call chain " Andrew Vagin
@ 2011-09-27 14:05   ` Peter Zijlstra
  2011-09-27 20:55     ` Frederic Weisbecker
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2011-09-27 14:05 UTC (permalink / raw)
  To: Andrew Vagin
  Cc: linux-kernel, Steven Rostedt, Frederic Weisbecker, Ingo Molnar,
	Paul Mackerras, Arnaldo Carvalho de Melo

On Mon, 2011-09-26 at 19:55 +0400, Andrew Vagin wrote:
> 
> Know issues:
> * Now call chains for non-current tasks are collected on x86 only,
>   but it may be done for other architectures simply.
> * It collects only kernel call chains, because we can't get direct
>   access to memory of other processes and this operation should be
>   fast enough. 

Also, it changes the semantics of tracepoint, normally you return the
callchain leading to the tracepoint, this changes that.


I'm not entirely against it, as I can see the use, but I would like to
solicit other opinions.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/4] trace: add ability to collect call chain of non-current task.
  2011-09-27 14:05   ` Peter Zijlstra
@ 2011-09-27 20:55     ` Frederic Weisbecker
  2011-09-28 13:53       ` Andrew Vagin
  2011-12-08  2:02       ` Arun Sharma
  0 siblings, 2 replies; 17+ messages in thread
From: Frederic Weisbecker @ 2011-09-27 20:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Vagin, linux-kernel, Steven Rostedt, Ingo Molnar,
	Paul Mackerras, Arnaldo Carvalho de Melo

On Tue, Sep 27, 2011 at 04:05:51PM +0200, Peter Zijlstra wrote:
> On Mon, 2011-09-26 at 19:55 +0400, Andrew Vagin wrote:
> > 
> > Know issues:
> > * Now call chains for non-current tasks are collected on x86 only,
> >   but it may be done for other architectures simply.
> > * It collects only kernel call chains, because we can't get direct
> >   access to memory of other processes and this operation should be
> >   fast enough. 
> 
> Also, it changes the semantics of tracepoint, normally you return the
> callchain leading to the tracepoint, this changes that.
> 
> 
> I'm not entirely against it, as I can see the use, but I would like to
> solicit other opinions.

There is also a problem in perf tools because we are dealing with a
callchain that doesn't belong to the current thread memory mapping and symbol
space. I believe that's a problem once we deal with the userspace part of
the callchain.

That and the fact there are other ways to get that callchain like taking
the one of the last sched_switch event from the waked thread.

Perhaps we should use the sched_switch event and a post_sched_switch
event, compute the time difference as a weight and use the callchain of any of
those. Perhaps as a plugin in perf report?

In any case that rather sounds like something that should be done in post-processing
in userspace, in perf tools.

We should probably avoid the remote callchains, sounds like asking for complications
everywhere.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/4] trace: add ability to collect call chain of non-current task.
  2011-09-27 20:55     ` Frederic Weisbecker
@ 2011-09-28 13:53       ` Andrew Vagin
  2011-12-08  2:02       ` Arun Sharma
  1 sibling, 0 replies; 17+ messages in thread
From: Andrew Vagin @ 2011-09-28 13:53 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Peter Zijlstra, Andrew Vagin, linux-kernel, Steven Rostedt,
	Ingo Molnar, Paul Mackerras, Arnaldo Carvalho de Melo

On 09/28/2011 12:55 AM, Frederic Weisbecker wrote:
> On Tue, Sep 27, 2011 at 04:05:51PM +0200, Peter Zijlstra wrote:
>> On Mon, 2011-09-26 at 19:55 +0400, Andrew Vagin wrote:
>>> Know issues:
>>> * Now call chains for non-current tasks are collected on x86 only,
>>>    but it may be done for other architectures simply.
>>> * It collects only kernel call chains, because we can't get direct
>>>    access to memory of other processes and this operation should be
>>>    fast enough.
>> Also, it changes the semantics of tracepoint, normally you return the
>> callchain leading to the tracepoint, this changes that.
>>
>>
>> I'm not entirely against it, as I can see the use, but I would like to
>> solicit other opinions.
> There is also a problem in perf tools because we are dealing with a
> callchain that doesn't belong to the current thread memory mapping and symbol
> space. I believe that's a problem once we deal with the userspace part of
> the callchain.
>
> That and the fact there are other ways to get that callchain like taking
> the one of the last sched_switch event from the waked thread.
>
> Perhaps we should use the sched_switch event and a post_sched_switch
> event, compute the time difference as a weight and use the callchain of any of
> those. Perhaps as a plugin in perf report?
>
> In any case that rather sounds like something that should be done in post-processing
> in userspace, in perf tools.
>
> We should probably avoid the remote callchains, sounds like asking for complications
> everywhere.
Thank you for comments and suggestions. I take timeout to think over them.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/4] perf: fix counter of ftrace events
  2011-09-26 15:55 ` [PATCH 1/4] perf: fix counter of ftrace events Andrew Vagin
@ 2011-10-14 19:11   ` Arun Sharma
  0 siblings, 0 replies; 17+ messages in thread
From: Arun Sharma @ 2011-10-14 19:11 UTC (permalink / raw)
  To: Andrew Vagin
  Cc: linux-kernel, Steven Rostedt, Frederic Weisbecker, Ingo Molnar,
	Peter Zijlstra, Paul Mackerras, Arnaldo Carvalho de Melo

On Mon, Sep 26, 2011 at 07:55:32PM +0400, Andrew Vagin wrote:
> Each event adds some points to its counters. By default it adds 1,
> and a number of points may be transmited in event's parameters.
> 
> E.g. sched:sched_stat_runtime adds how long process has been running.
> 
> But this functionality was broken by v2.6.31-rc5-392-gf413cdb
> and now the event's parameters doesn't affect on a number of points.
> 
> TP_perf_assign isn't defined, so __perf_count(c) isn't executed and
> __count is always equal to 1.
> 
> Signed-off-by: Andrew Vagin <avagin@openvz.org>

Acked-by: Arun Sharma <asharma@fb.com>

 -Arun

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/4] trace: prepare to collect call chains of non-current task.
  2011-09-26 15:55 ` [PATCH 2/4] trace: prepare to collect call chains of non-current task Andrew Vagin
  2011-09-27 14:02   ` Peter Zijlstra
@ 2011-10-14 19:12   ` Arun Sharma
  1 sibling, 0 replies; 17+ messages in thread
From: Arun Sharma @ 2011-10-14 19:12 UTC (permalink / raw)
  To: Andrew Vagin
  Cc: linux-kernel, Steven Rostedt, Frederic Weisbecker, Ingo Molnar,
	Peter Zijlstra, Paul Mackerras, Arnaldo Carvalho de Melo

On Mon, Sep 26, 2011 at 07:55:33PM +0400, Andrew Vagin wrote:
> This patch adds argument "task" to all functions between perf_tp_event()
> and perf_callchain(). All logic is in the next patch.
> 
> We need "task" and "regs" simultaneously, because regs contains
> state of current task, before it entered in perf code. Task contains
> pointer to task_struct and it will be used for collecting call chains
> and filling event data, if task isn't current.
> 
> Signed-off-by: Andrew Vagin <avagin@openvz.org>

Acked-by: Arun Sharma <asharma@fb.com>


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Re: [PATCH 3/4] trace: add ability to collect call chain of non-current task.
  2011-09-27 20:55     ` Frederic Weisbecker
  2011-09-28 13:53       ` Andrew Vagin
@ 2011-12-08  2:02       ` Arun Sharma
  2011-12-09 11:07         ` Andrey Vagin
  1 sibling, 1 reply; 17+ messages in thread
From: Arun Sharma @ 2011-12-08  2:02 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Peter Zijlstra, Andrew Vagin, linux-kernel, Steven Rostedt,
	Ingo Molnar, Paul Mackerras, Arnaldo Carvalho de Melo

[-- Attachment #1: Type: text/plain, Size: 1219 bytes --]

On 7/22/64 12:06 PM, Frederic Weisbecker wrote:
[..]
>
> That and the fact there are other ways to get that callchain like taking
> the one of the last sched_switch event from the waked thread.
>
> Perhaps we should use the sched_switch event and a post_sched_switch
> event, compute the time difference as a weight and use the callchain of any of
> those. Perhaps as a plugin in perf report?
>
> In any case that rather sounds like something that should be done in post-processing
> in userspace, in perf tools.
>
> We should probably avoid the remote callchains, sounds like asking for complications
> everywhere.

Agreed on remote callchains and maintaining consistency about what the 
tracepoints mean.

As I said on the other thread, post-processing in userspace has the 
issue that we collect more info than we actually need and under load, 
perf record can't keep up.

Attached is an alternative approach that does what you allude to above.

perf record -agPe sched:sched_switch --filter "delay > 1000000" -- sleep 1

allows us to collect a lot less. For some reason, "perf script" shows 
the correct delay field, but the sample period still contains 1 (i.e 
__perf_count() hint is not working for me).

  -Arun

[-- Attachment #2: sched-switch-delay.patch --]
[-- Type: text/plain, Size: 4650 bytes --]

commit c58595bd2899685a2b9d94374203708bf279c2d9
Author: Arun Sharma <asharma@fb.com>
Date:   Wed Dec 7 16:17:00 2011 -0800

    Make sleep/iowait delay available at the sched_switch tracepoint.
    Also move trace_sched_switch() to finish_task_switch so the event
    applies to next rather than prev.
  
    If such a move doesn't make sense, we might have to consider adding
    a new event that's called something else.

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 959ff18..f67cf35 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -113,9 +113,10 @@ static inline long __trace_sched_switch_state(struct task_struct *p)
 TRACE_EVENT(sched_switch,
 
 	TP_PROTO(struct task_struct *prev,
-		 struct task_struct *next),
+		 struct task_struct *next,
+		 unsigned long now),
 
-	TP_ARGS(prev, next),
+	TP_ARGS(prev, next, now),
 
 	TP_STRUCT__entry(
 		__array(	char,	prev_comm,	TASK_COMM_LEN	)
@@ -125,6 +126,7 @@ TRACE_EVENT(sched_switch,
 		__array(	char,	next_comm,	TASK_COMM_LEN	)
 		__field(	pid_t,	next_pid			)
 		__field(	int,	next_prio			)
+		__field(	long,	delay				)
 	),
 
 	TP_fast_assign(
@@ -135,9 +137,20 @@ TRACE_EVENT(sched_switch,
 		memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
 		__entry->next_pid	= next->pid;
 		__entry->next_prio	= next->prio;
+#ifdef CONFIG_SCHEDSTATS
+ 		__entry->delay		= next->se.statistics.block_start ? next->se.statistics.block_start
+ 					  : next->se.statistics.sleep_start ? next->se.statistics.sleep_start : 0;
+ 		__entry->delay = __entry->delay ? now - __entry->delay : 0;
+#else
+ 		__entry->delay = 0;
+#endif
+	)
+
+	TP_perf_assign(
+		__perf_count(__entry->delay);
 	),
 
-	TP_printk("prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s ==> next_comm=%s next_pid=%d next_prio=%d",
+	TP_printk("prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s ==> next_comm=%s next_pid=%d next_prio=%d delay=%ld",
 		__entry->prev_comm, __entry->prev_pid, __entry->prev_prio,
 		__entry->prev_state & (TASK_STATE_MAX-1) ?
 		  __print_flags(__entry->prev_state & (TASK_STATE_MAX-1), "|",
@@ -145,7 +158,7 @@ TRACE_EVENT(sched_switch,
 				{ 16, "Z" }, { 32, "X" }, { 64, "x" },
 				{ 128, "W" }) : "R",
 		__entry->prev_state & TASK_STATE_MAX ? "+" : "",
-		__entry->next_comm, __entry->next_pid, __entry->next_prio)
+		__entry->next_comm, __entry->next_pid, __entry->next_prio, __entry->delay)
 );
 
 /*
diff --git a/kernel/sched.c b/kernel/sched.c
index df2452b..c1e64f9 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3156,7 +3156,6 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
 	fire_sched_out_preempt_notifiers(prev, next);
 	prepare_lock_switch(rq, next);
 	prepare_arch_switch(next);
-	trace_sched_switch(prev, next);
 }
 
 /**
@@ -3180,6 +3179,7 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
 	struct mm_struct *mm = rq->prev_mm;
 	long prev_state;
 
+	trace_sched_switch(prev, current, rq->clock);
 	rq->prev_mm = NULL;
 
 	/*
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index fc6a8e9..58d4bdc 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -880,7 +880,6 @@ static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
 		if (unlikely(delta > se->statistics.sleep_max))
 			se->statistics.sleep_max = delta;
 
-		se->statistics.sleep_start = 0;
 		se->statistics.sum_sleep_runtime += delta;
 
 		if (tsk) {
@@ -897,7 +896,6 @@ static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
 		if (unlikely(delta > se->statistics.block_max))
 			se->statistics.block_max = delta;
 
-		se->statistics.block_start = 0;
 		se->statistics.sum_sleep_runtime += delta;
 
 		if (tsk) {
@@ -1070,8 +1068,12 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 
 			if (tsk->state & TASK_INTERRUPTIBLE)
 				se->statistics.sleep_start = rq_of(cfs_rq)->clock;
+			else
+				se->statistics.sleep_start = 0;
 			if (tsk->state & TASK_UNINTERRUPTIBLE)
 				se->statistics.block_start = rq_of(cfs_rq)->clock;
+			else
+				se->statistics.block_start = 0;
 		}
 #endif
 	}
diff --git a/kernel/trace/trace_sched_switch.c b/kernel/trace/trace_sched_switch.c
index 7e62c0a..b8b98d4 100644
--- a/kernel/trace/trace_sched_switch.c
+++ b/kernel/trace/trace_sched_switch.c
@@ -50,7 +50,7 @@ tracing_sched_switch_trace(struct trace_array *tr,
 }
 
 static void
-probe_sched_switch(void *ignore, struct task_struct *prev, struct task_struct *next)
+probe_sched_switch(void *ignore, struct task_struct *prev, struct task_struct *next, unsigned long now)
 {
 	struct trace_array_cpu *data;
 	unsigned long flags;

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/4] trace: add ability to collect call chain of non-current task.
  2011-12-08  2:02       ` Arun Sharma
@ 2011-12-09 11:07         ` Andrey Vagin
  2011-12-14 20:14           ` Arun Sharma
  0 siblings, 1 reply; 17+ messages in thread
From: Andrey Vagin @ 2011-12-09 11:07 UTC (permalink / raw)
  To: Arun Sharma
  Cc: Frederic Weisbecker, Peter Zijlstra, Andrew Vagin, linux-kernel,
	Steven Rostedt, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo

Hello Arun,
>
> Agreed on remote callchains and maintaining consistency about what the
> tracepoints mean.
>
> As I said on the other thread, post-processing in userspace has the
> issue that we collect more info than we actually need and under load,
> perf record can't keep up.
>
> Attached is an alternative approach that does what you allude to above.

* Your method doesn't work for rt scheduler.
* It doesn't distinguish blocking time and sleeping time.
* This patch does a bit mess between subsystems...

Yes, this method may have the right to life. Could you correct this 
patch and send it in lkml as a separate mail?

>
> perf record -agPe sched:sched_switch --filter "delay > 1000000" -- sleep 1

Why do you need the option "-a" ?

>
> allows us to collect a lot less. For some reason, "perf script" shows

> the correct delay field, but the sample period still contains 1 (i.e
> __perf_count() hint is not working for me).

Which kernel do you use? Does it contain "[PATCH] event: don't divide 
events if it has field period"? It works fine with my kernel...

>
> -Arun


> +#ifdef CONFIG_SCHEDSTATS
> + 		__entry->delay		= next->se.statistics.block_start ? next->se.statistics.block_start
> + 					  : next->se.statistics.sleep_start ? next->se.statistics.sleep_start : 0;
The previous code is hard to read...
> + 		__entry->delay = __entry->delay ? now - __entry->delay : 0;
> +#else
> + 		__entry->delay = 0;
> +#endif
next->se.statistics.{block,sleep}_start should be zeroized here, 
otherwise a next sched_switch will report non-zero delay again.
> +	)

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/4] trace: add ability to collect call chain of non-current task.
  2011-12-09 11:07         ` Andrey Vagin
@ 2011-12-14 20:14           ` Arun Sharma
  2011-12-15 15:19             ` Andrew Vagin
  0 siblings, 1 reply; 17+ messages in thread
From: Arun Sharma @ 2011-12-14 20:14 UTC (permalink / raw)
  To: avagin
  Cc: Frederic Weisbecker, Peter Zijlstra, linux-kernel,
	Steven Rostedt, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo



On 12/9/11 3:07 AM, Andrey Vagin wrote:

>> Attached is an alternative approach that does what you allude to above.
>
> * Your method doesn't work for rt scheduler.

This is no worse than sched_stat_{iowait,sleep} which are also CFS specific?

> * It doesn't distinguish blocking time and sleeping time.

Can't we distinguish based on next->state?

> * This patch does a bit mess between subsystems...

Which subsystems?

>> perf record -agPe sched:sched_switch --filter "delay > 1000000" --
>> sleep 1
>
> Why do you need the option "-a" ?
>

I'd love to see it work without the -a (so normal users can use it).

I guess I'll need to move the trace_sched_switch() tracepoint to be 
after perf_event_task_sched_in() in finish_task_switch().


>>
>> allows us to collect a lot less. For some reason, "perf script" shows
>
>> the correct delay field, but the sample period still contains 1 (i.e
>> __perf_count() hint is not working for me).
>
> Which kernel do you use? Does it contain "[PATCH] event: don't divide
> events if it has field period"? It works fine with my kernel...
>

I was using 3.2-rc2. I picked up the patch you reference above.

# perf report -D | grep SAMPLE.*period:
3 1437330416084 0x26d908 [0x250]: PERF_RECORD_SAMPLE(IP, 1): 7276/7806: 
0xffffffff8102c0a7 period: 140736819516096 addr: 0
3 1437330422290 0x26db58 [0x250]: PERF_RECORD_SAMPLE(IP, 1): 7276/7806: 
0xffffffff8102c0a7 period: 140736819516096 addr: 0
3 1437330428485 0x26dda8 [0x250]: PERF_RECORD_SAMPLE(IP, 1): 7276/7806: 
0xffffffff8102c0a7 period: 140736819516096 addr: 0

# perf script | grep next_comm=test.*delay | head -3
             test 13070 [023]  1437.320529: sched_switch: 
prev_comm=kworker/0:1 prev_pid=0 prev_prio=120 prev_state=R ==> 
next_comm=test next_pid=13070 next_prio=120 delay=1096877
             test 13070 [023]  1437.320531: sched_switch: 
prev_comm=kworker/0:1 prev_pid=0 prev_prio=120 prev_state=R ==> 
next_comm=test next_pid=13070 next_prio=120 delay=1096877
             test 13070 [023]  1437.320532: sched_switch: 
prev_comm=kworker/0:1 prev_pid=0 prev_prio=120 prev_state=R ==> 
next_comm=test next_pid=13070 next_prio=120 delay=1096877

The period still looks bogus i.e. doesn't match delay.

Thanks for reviewing. I'll clean up some of these issues and post a new 
patch in a couple of days.

  -Arun


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/4] trace: add ability to collect call chain of non-current task.
  2011-12-14 20:14           ` Arun Sharma
@ 2011-12-15 15:19             ` Andrew Vagin
  2011-12-15 19:08               ` Arun Sharma
  0 siblings, 1 reply; 17+ messages in thread
From: Andrew Vagin @ 2011-12-15 15:19 UTC (permalink / raw)
  To: Arun Sharma
  Cc: avagin, Frederic Weisbecker, Peter Zijlstra, linux-kernel,
	Steven Rostedt, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo

On 12/15/2011 12:14 AM, Arun Sharma wrote:
>
>
> On 12/9/11 3:07 AM, Andrey Vagin wrote:
>
>>> Attached is an alternative approach that does what you allude to above.
>>
>> * Your method doesn't work for rt scheduler.
>
> This is no worse than sched_stat_{iowait,sleep} which are also CFS 
> specific?
Sorry. You are right. It's the same.
>
>> * It doesn't distinguish blocking time and sleeping time.
>
> Can't we distinguish based on next->state?
It will be "Running". We can distinguish based on block_start of sleep_start
>
>> * This patch does a bit mess between subsystems...
>
> Which subsystems?
sched_switch belongs to scheduler.
sched_stat belongs to sched_fair.

I am not sure, that it's bad. Send this patches in LKML.

>
>>> perf record -agPe sched:sched_switch --filter "delay > 1000000" --
>>> sleep 1
>>
>> Why do you need the option "-a" ?
>>
>
> I'd love to see it work without the -a (so normal users can use it).
>
> I guess I'll need to move the trace_sched_switch() tracepoint to be 
> after perf_event_task_sched_in() in finish_task_switch().
>
>
>>>
>>> allows us to collect a lot less. For some reason, "perf script" shows
>>
>>> the correct delay field, but the sample period still contains 1 (i.e
>>> __perf_count() hint is not working for me).
>>
>> Which kernel do you use? Does it contain "[PATCH] event: don't divide
>> events if it has field period"? It works fine with my kernel...
>>
>
> I was using 3.2-rc2. I picked up the patch you reference above.
>
> # perf report -D | grep SAMPLE.*period:
> 3 1437330416084 0x26d908 [0x250]: PERF_RECORD_SAMPLE(IP, 1): 
> 7276/7806: 0xffffffff8102c0a7 period: 140736819516096 addr: 0
> 3 1437330422290 0x26db58 [0x250]: PERF_RECORD_SAMPLE(IP, 1): 
> 7276/7806: 0xffffffff8102c0a7 period: 140736819516096 addr: 0
> 3 1437330428485 0x26dda8 [0x250]: PERF_RECORD_SAMPLE(IP, 1): 
> 7276/7806: 0xffffffff8102c0a7 period: 140736819516096 addr: 0
>
> # perf script | grep next_comm=test.*delay | head -3
>             test 13070 [023]  1437.320529: sched_switch: 
> prev_comm=kworker/0:1 prev_pid=0 prev_prio=120 prev_state=R ==> 
> next_comm=test next_pid=13070 next_prio=120 delay=1096877
>             test 13070 [023]  1437.320531: sched_switch: 
> prev_comm=kworker/0:1 prev_pid=0 prev_prio=120 prev_state=R ==> 
> next_comm=test next_pid=13070 next_prio=120 delay=1096877
>             test 13070 [023]  1437.320532: sched_switch: 
> prev_comm=kworker/0:1 prev_pid=0 prev_prio=120 prev_state=R ==> 
> next_comm=test next_pid=13070 next_prio=120 delay=1096877
>
> The period still looks bogus i.e. doesn't match delay.

I pushed my tree on github git://github.com/avagin/perf.git. Look at 
branch "arun" https://github.com/avagin/perf/tree/arun

It works for me. You can find output of commands bellow.

# ./perf record -e sched:sched_switch -agP
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.091 MB perf.data (~3954 samples) ]

# ./perf report -D | grep SAMPLE.*period:
1 1539849049077 0x10d78 [0xc8]: PERF_RECORD_SAMPLE(IP, 1): 34/34: 
0xffffffff810527c4 period: 7714488 addr: 0

#./perf script
  kworker/1:2    34 [001]  1539.849049: sched_switch: prev_comm=perf 
prev_pid=2700 prev_prio=120 prev_state=R ==> next_comm=kworker/1:2 
next_pid=34 next_prio=120 delay=7714488
> Thanks for reviewing. I'll clean up some of these issues and post a 
> new patch in a couple of days.
>
>  -Arun
>
> -- 
> To unsubscribe from this list: send the line "unsubscribe 
> linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/4] trace: add ability to collect call chain of non-current task.
  2011-12-15 15:19             ` Andrew Vagin
@ 2011-12-15 19:08               ` Arun Sharma
  2011-12-16  0:15                 ` Arun Sharma
  0 siblings, 1 reply; 17+ messages in thread
From: Arun Sharma @ 2011-12-15 19:08 UTC (permalink / raw)
  To: avagin
  Cc: avagin, Frederic Weisbecker, Peter Zijlstra, linux-kernel,
	Steven Rostedt, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo



On 12/15/11 7:19 AM, Andrew Vagin wrote:

>>> * It doesn't distinguish blocking time and sleeping time.
>>
>> Can't we distinguish based on next->state?
> It will be "Running". We can distinguish based on block_start of
> sleep_start

Sounds good.

>>
>>> * This patch does a bit mess between subsystems...
>>
>> Which subsystems?
> sched_switch belongs to scheduler.
> sched_stat belongs to sched_fair.
>

My reading of the code is that 'struct sched_statistics' is independent 
of the scheduling class. But only sched_fair.c is actually populating 
the stats. Other classes could populate those fields if they choose to.

> It works for me. You can find output of commands bellow.
>
> # ./perf record -e sched:sched_switch -agP

Thanks. I forgot to use the -P option you added on my cmdline.

I still need to investigate why per-process mode (command line without 
-a) doesn't work. Will send an updated patch once that's fixed.

  -Arun


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/4] trace: add ability to collect call chain of non-current task.
  2011-12-15 19:08               ` Arun Sharma
@ 2011-12-16  0:15                 ` Arun Sharma
  0 siblings, 0 replies; 17+ messages in thread
From: Arun Sharma @ 2011-12-16  0:15 UTC (permalink / raw)
  To: avagin
  Cc: avagin, Frederic Weisbecker, Peter Zijlstra, linux-kernel,
	Steven Rostedt, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo

On 12/15/11 11:08 AM, Arun Sharma wrote:

> I still need to investigate why per-process mode (command line without
> -a) doesn't work. Will send an updated patch once that's fixed.
>

The per-process mode works fine after moving the trace point to be after:

         perf_event_task_sched_in(prev, current);

App developers using this feature here say they're fine not having the 
block vs sleep distinction because they intend to use it with "-g" and 
the callstack makes it obvious which one it was.

If we really need this info, we should be able to derive it in user 
space via perf inject.

  -Arun

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2011-12-16  0:16 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-09-26 15:55 [PATCH 0/4] trace: add ability to collect call chains of non-current task Andrew Vagin
2011-09-26 15:55 ` [PATCH 1/4] perf: fix counter of ftrace events Andrew Vagin
2011-10-14 19:11   ` Arun Sharma
2011-09-26 15:55 ` [PATCH 2/4] trace: prepare to collect call chains of non-current task Andrew Vagin
2011-09-27 14:02   ` Peter Zijlstra
2011-10-14 19:12   ` Arun Sharma
2011-09-26 15:55 ` [PATCH 3/4] trace: add ability to collect call chain " Andrew Vagin
2011-09-27 14:05   ` Peter Zijlstra
2011-09-27 20:55     ` Frederic Weisbecker
2011-09-28 13:53       ` Andrew Vagin
2011-12-08  2:02       ` Arun Sharma
2011-12-09 11:07         ` Andrey Vagin
2011-12-14 20:14           ` Arun Sharma
2011-12-15 15:19             ` Andrew Vagin
2011-12-15 19:08               ` Arun Sharma
2011-12-16  0:15                 ` Arun Sharma
2011-09-26 15:55 ` [PATCH 4/4] events: sched_stat_template saves call chains of a target task Andrew Vagin

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.