* Profiling sleep times?
@ 2011-10-03 19:38 Arun Sharma
2011-10-03 20:17 ` Peter Zijlstra
0 siblings, 1 reply; 23+ messages in thread
From: Arun Sharma @ 2011-10-03 19:38 UTC (permalink / raw)
To: linux-perf-users; +Cc: acme, Peter Zijlstra, mingo, Stephane Eranian
Some of our users want to use perf to profile not just the code that
consumes cycles, but also the code that ends up waiting for I/O -
otherwise known as wall clock profiling.
I could not find ways of getting this info from the perf tool as-is.
Wondering if a software event such as PERF_COUNT_SW_SLEEP_CLOCK below
makes sense.
The idea is, if a task sleeps for 1ms, it should generate 1000x more
samples vs a task that sleeps for 1us. Also, the callchain emitted
should be the user stack.
If such an event is useful to a larger set of users, I could try to work
out the details of how to get to event->attr.freq in the context switch
path with low overhead and run some tests to verify that the profile
that comes out of "perf report" looks sane.
We'll also need ways of combining PERF_COUNT_SW_TASK_CLOCK and
PERF_COUNT_SW_SLEEP_CLOCK (in userspace?) to get the full picture.
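The combining step can be sketched in a few lines (illustrative Python, not perf code; the function and histogram names are hypothetical): a wall-clock profile is just the per-callchain sum of on-CPU time (PERF_COUNT_SW_TASK_CLOCK) and sleep time (the proposed PERF_COUNT_SW_SLEEP_CLOCK):

```python
def wall_clock_profile(cpu_ns, sleep_ns):
    """Merge two {callchain: nanoseconds} histograms into one.

    cpu_ns   -- time attributed by PERF_COUNT_SW_TASK_CLOCK
    sleep_ns -- time attributed by the proposed PERF_COUNT_SW_SLEEP_CLOCK
    """
    total = dict(cpu_ns)
    for chain, ns in sleep_ns.items():
        total[chain] = total.get(chain, 0) + ns
    return total

# A task that burns 1ms of CPU and sleeps 4ms in select() should show
# select() dominating the wall-clock profile.
prof = wall_clock_profile({"burn_cpu": 1_000_000}, {"select": 4_000_000})
```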
-Arun
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index c2da40d..a3e2fb4 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -106,6 +106,7 @@ enum perf_sw_ids {
PERF_COUNT_SW_PAGE_FAULTS_MAJ = 6,
PERF_COUNT_SW_ALIGNMENT_FAULTS = 7,
PERF_COUNT_SW_EMULATION_FAULTS = 8,
+ PERF_COUNT_SW_SLEEP_CLOCK = 9,
PERF_COUNT_SW_MAX, /* non-ABI */
};
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 7406f36..e973862 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -877,8 +877,10 @@ static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
se->statistics.sum_sleep_runtime += delta;
if (tsk) {
+ u64 freq = 1000000; /* XXX: Use event->attr.freq? */
account_scheduler_latency(tsk, delta >> 10, 1);
trace_sched_stat_sleep(tsk, delta);
+ perf_sw_event(PERF_COUNT_SW_SLEEP_CLOCK, delta/freq, 0, NULL, 0);
}
}
if (se->statistics.block_start) {
^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: Profiling sleep times?
2011-10-03 19:38 Profiling sleep times? Arun Sharma
@ 2011-10-03 20:17 ` Peter Zijlstra
2011-10-03 21:53 ` Arun Sharma
0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2011-10-03 20:17 UTC (permalink / raw)
To: Arun Sharma; +Cc: linux-perf-users, acme, mingo, Stephane Eranian
On Mon, 2011-10-03 at 12:38 -0700, Arun Sharma wrote:
> trace_sched_stat_sleep(tsk, delta);
That one should be accessible as a tracepoint and will add delay to
count on each occurrence.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Profiling sleep times?
2011-10-03 20:17 ` Peter Zijlstra
@ 2011-10-03 21:53 ` Arun Sharma
2011-10-04 8:34 ` Peter Zijlstra
0 siblings, 1 reply; 23+ messages in thread
From: Arun Sharma @ 2011-10-03 21:53 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: linux-perf-users, acme, mingo, Stephane Eranian
On 10/3/11 1:17 PM, Peter Zijlstra wrote:
> On Mon, 2011-10-03 at 12:38 -0700, Arun Sharma wrote:
>> trace_sched_stat_sleep(tsk, delta);
>
> That one should be accessible as a tracepoint and will add delay to
> count on each occurrence.
Right - the tracepoint seems to work. I was looking for the user stack
trace as well. For example:
# cat foo.c
#include <unistd.h>
#include <time.h>
#include <sys/select.h>
main()
{
struct timespec ts1;
struct timeval tv1;
int i;
for (i = 0; i < 1000; i++) {
ts1.tv_sec = 0;
ts1.tv_nsec = 1000000;
nanosleep(&ts1, NULL);
tv1.tv_sec = 0;
tv1.tv_usec = 4000;
select(0, NULL, NULL, NULL, &tv1);
}
}
I want something that gives me 4 times as many samples on select() as
nanosleep().
perf record -g -e sched:sched_stat_sleep -- ./foo
doesn't seem to do it. Additionally, if I have a burn_1us_cpu() loop in
there, I'd expect to see as many samples as nanosleep() in the wall
clock profile.
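The expectation can be stated numerically (an illustrative sketch; the timings come from foo.c above): samples should be attributed in proportion to time slept, not to the number of sleep events:

```python
nanosleep_ns = 1_000_000  # ts1.tv_nsec in foo.c
select_ns = 4_000_000     # tv1.tv_usec * 1000 in foo.c

# With delay-weighted samples, select() should get 4x the samples of
# nanosleep(), i.e. ~80% of the sleep-time profile.
ratio = select_ns / nanosleep_ns
select_share = select_ns / (select_ns + nanosleep_ns)
```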
-Arun
* Re: Profiling sleep times?
2011-10-03 21:53 ` Arun Sharma
@ 2011-10-04 8:34 ` Peter Zijlstra
2011-10-06 21:56 ` Arun Sharma
0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2011-10-04 8:34 UTC (permalink / raw)
To: Arun Sharma; +Cc: linux-perf-users, acme, mingo, Stephane Eranian
On Mon, 2011-10-03 at 14:53 -0700, Arun Sharma wrote:
> On 10/3/11 1:17 PM, Peter Zijlstra wrote:
> > On Mon, 2011-10-03 at 12:38 -0700, Arun Sharma wrote:
> >> trace_sched_stat_sleep(tsk, delta);
> >
> > That one should be accessible as a tracepoint and will add delay to
> > count on each occurrence.
>
> Right - the tracepoint seems to work. Was looking for the user stack
> trace as well. For eg:
>
> # cat foo.c
> #include <unistd.h>
> #include <time.h>
> #include <sys/select.h>
>
> main()
> {
> struct timespec ts1;
> struct timeval tv1;
> int i;
>
>
> for (i = 0; i < 1000; i++) {
> ts1.tv_sec = 0;
> ts1.tv_nsec = 1000000;
> nanosleep(&ts1, NULL);
>
> tv1.tv_sec = 0;
> tv1.tv_usec = 4000;
> select(0, NULL, NULL, NULL, &tv1);
> }
> }
>
> I want something that gives me 4 times as many samples on select() as
> nanosleep().
>
> perf record -g -e sched:sched_stat_sleep -- ./foo
>
> doesn't seem to do it. Additionally, if I have a burn_1us_cpu() loop in
> there, I'd expect to see as many samples as nanosleep() in the wall
> clock profile.
Would you per-chance be suffering from this:
http://lkml.kernel.org/r/1317052535-1765247-2-git-send-email-avagin@openvz.org
* Re: Profiling sleep times?
2011-10-04 8:34 ` Peter Zijlstra
@ 2011-10-06 21:56 ` Arun Sharma
2011-10-07 0:05 ` Arun Sharma
` (2 more replies)
0 siblings, 3 replies; 23+ messages in thread
From: Arun Sharma @ 2011-10-06 21:56 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-perf-users, acme, mingo, Stephane Eranian,
Frederic Weisbecker, Andrew Vagin
On 10/4/11 1:34 AM, Peter Zijlstra wrote:
>> # cat foo.c
>> #include<unistd.h>
>> #include<time.h>
>> #include<sys/select.h>
>>
>> main()
>> {
>> struct timespec ts1;
>> struct timeval tv1;
>> int i;
>>
>>
>> for (i = 0; i< 1000; i++) {
>> ts1.tv_sec = 0;
>> ts1.tv_nsec = 1000000;
>> nanosleep(&ts1, NULL);
>>
>> tv1.tv_sec = 0;
>> tv1.tv_usec = 4000;
>> select(0, NULL, NULL, NULL,&tv1);
>> }
>> }
>>
> Would you per-chance be suffering from this:
>
> http://lkml.kernel.org/r/1317052535-1765247-2-git-send-email-avagin@openvz.org
>
Thanks for the pointer. Yes - Andrew's patches help. But it looks like
we need more user-space plumbing, as Frederic noted.
perf record -ge sched:sched_stat_sleep -- ./foo
doesn't quite work.
perf record -age sched:sched_stat_sleep -- ./foo
gives me:
58.62% foo [unknown] [k] 0
|
--- schedule
|
|--54.99%-- schedule_hrtimeout_range_clock
| schedule_hrtimeout_range
| poll_schedule_timeout
| do_select
| core_sys_select
| sys_select
| system_call_fastpath
|
|--44.81%-- do_nanosleep
| hrtimer_nanosleep
| sys_nanosleep
| system_call_fastpath
--0.20%-- [...]
i.e. the weighting is off: select() should be weighted 4x vs nanosleep(), as the raw delays confirm:
perf script | grep comm=foo
foo 15516 [006] 2291.187831: sched_stat_sleep: comm=foo pid=15516 delay=4054262 [ns]
foo 15516 [006] 2291.187832: sched_stat_sleep: comm=foo pid=15516 delay=4054262 [ns]
foo 15516 [006] 2291.188895: sched_stat_sleep: comm=foo pid=15516 delay=1053565 [ns]
foo 15516 [006] 2291.188896: sched_stat_sleep: comm=foo pid=15516 delay=1053565 [ns]
foo 15516 [006] 2291.188897: sched_stat_sleep: comm=foo pid=15516 delay=1053565 [ns]
Andrew, are you already working on user space patches?
-Arun
* Re: Profiling sleep times?
2011-10-06 21:56 ` Arun Sharma
@ 2011-10-07 0:05 ` Arun Sharma
2011-10-07 1:30 ` Peter Zijlstra
2011-10-08 1:45 ` avagin
2 siblings, 0 replies; 23+ messages in thread
From: Arun Sharma @ 2011-10-07 0:05 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-perf-users, acme, mingo, Stephane Eranian,
Frederic Weisbecker, Andrew Vagin
[-- Attachment #1: Type: text/plain, Size: 1979 bytes --]
On 10/6/11 2:56 PM, Arun Sharma wrote:
> i.e. select() should be weighted by 4x vs nanosleep() as confirmed via:
>
> perf script | grep comm=foo
>
> foo 15516 [006] 2291.187831: sched_stat_sleep: comm=foo pid=15516
> delay=4054262 [ns]
> foo 15516 [006] 2291.187832: sched_stat_sleep: comm=foo pid=15516
> delay=4054262 [ns]
> foo 15516 [006] 2291.188895: sched_stat_sleep: comm=foo pid=15516
> delay=1053565 [ns]
> foo 15516 [006] 2291.188896: sched_stat_sleep: comm=foo pid=15516
> delay=1053565 [ns]
> foo 15516 [006] 2291.188897: sched_stat_sleep: comm=foo pid=15516
> delay=1053565 [ns]
>
> Andrew, are you already working on user space patches?
Attached is a quick hack (mainly meant as a RFC to get the discussion
going).
After the patch I get the expected result:
0.16% foo [unknown] [k] 0
|
--- schedule
|
|--79.26%-- schedule_hrtimeout_range_clock
| schedule_hrtimeout_range
| poll_schedule_timeout
| do_select
| core_sys_select
| sys_select
| system_call_fastpath
|
|--16.87%-- do_nanosleep
| hrtimer_nanosleep
| sys_nanosleep
| system_call_fastpath
|
--3.88%-- pipe_wait
pipe_read
do_sync_read
vfs_read
sys_read
system_call_fastpath
We still have the issue that per-process profiling doesn't work.
-Arun
[-- Attachment #2: 0001-Use-delay-instead-of-sample-period.patch --]
[-- Type: text/plain, Size: 3309 bytes --]
From a2d6e0698eebb038b2d8e845137008dc3478cc51 Mon Sep 17 00:00:00 2001
From: Arun Sharma <asharma@fb.com>
Date: Thu, 6 Oct 2011 16:48:44 -0700
Subject: [PATCH] Use delay instead of sample->period
---
tools/perf/builtin-report.c | 6 ++++++
tools/perf/util/probe-finder.c | 6 ++++--
tools/perf/util/trace-event-parse.c | 20 ++++++++++++++++++++
tools/perf/util/trace-event.h | 1 +
4 files changed, 31 insertions(+), 2 deletions(-)
diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index d7ff277..6821597 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -25,6 +25,7 @@
#include "util/evsel.h"
#include "util/header.h"
#include "util/session.h"
+#include "util/trace-event.h"
#include "util/parse-options.h"
#include "util/parse-events.h"
@@ -111,6 +112,11 @@ static int process_sample_event(union perf_event *event,
struct perf_session *session)
{
struct addr_location al;
+ struct perf_event_attr *attr = &evsel->attr;
+
+ if (attr->type == PERF_TYPE_TRACEPOINT) {
+ sample->period = trace_get_delay(sample->raw_data, sample->raw_size);
+ }
if (perf_event__preprocess_sample(event, session, &al, sample,
annotate_init) < 0) {
diff --git a/tools/perf/util/probe-finder.c b/tools/perf/util/probe-finder.c
index 5d73262..e7179ff 100644
--- a/tools/perf/util/probe-finder.c
+++ b/tools/perf/util/probe-finder.c
@@ -19,9 +19,11 @@
*
*/
+#define _GNU_SOURCE
#include <sys/utsname.h>
#include <sys/types.h>
#include <sys/stat.h>
+#include <ctype.h>
#include <fcntl.h>
#include <errno.h>
#include <stdio.h>
@@ -30,13 +32,13 @@
#include <stdlib.h>
#include <string.h>
#include <stdarg.h>
-#include <ctype.h>
#include <dwarf-regs.h>
+#undef _GNU_SOURCE
#include <linux/bitops.h>
+#include "util.h"
#include "event.h"
#include "debug.h"
-#include "util.h"
#include "symbol.h"
#include "probe-finder.h"
diff --git a/tools/perf/util/trace-event-parse.c b/tools/perf/util/trace-event-parse.c
index 0a7ed5b..78b7194 100644
--- a/tools/perf/util/trace-event-parse.c
+++ b/tools/perf/util/trace-event-parse.c
@@ -2944,6 +2944,26 @@ void print_trace_event(int cpu, void *data, int size)
pretty_print(data, size, event);
}
+size_t trace_get_delay(void *data, int size __unused)
+{
+ struct event *event;
+ int type;
+ struct format_field *field;
+
+ type = trace_parse_common_type(data);
+
+ event = trace_find_event(type);
+ if (!event) {
+ warning("ug! no event found for type %d", type);
+ return 0;
+ }
+
+ field = find_field(event, "delay");
+ if (!field)
+ die("can't find delay field in trace");
+ return read_size(data + field->offset, field->size);
+}
+
static void print_fields(struct print_flag_sym *field)
{
printf("{ %s, %s }", field->value, field->str);
diff --git a/tools/perf/util/trace-event.h b/tools/perf/util/trace-event.h
index f674dda..587b553 100644
--- a/tools/perf/util/trace-event.h
+++ b/tools/perf/util/trace-event.h
@@ -264,6 +264,7 @@ unsigned long long eval_flag(const char *flag);
int read_tracing_data(int fd, struct list_head *pattrs);
ssize_t read_tracing_data_size(int fd, struct list_head *pattrs);
+size_t trace_get_delay(void *data, int size);
/* taken from kernel/trace/trace.h */
enum trace_flag_type {
--
1.7.4
* Re: Profiling sleep times?
2011-10-06 21:56 ` Arun Sharma
2011-10-07 0:05 ` Arun Sharma
@ 2011-10-07 1:30 ` Peter Zijlstra
2011-10-07 5:42 ` avagin
2011-10-07 17:58 ` Arun Sharma
2011-10-08 1:45 ` avagin
2 siblings, 2 replies; 23+ messages in thread
From: Peter Zijlstra @ 2011-10-07 1:30 UTC (permalink / raw)
To: Arun Sharma
Cc: linux-perf-users, acme, mingo, Stephane Eranian,
Frederic Weisbecker, Andrew Vagin
On Thu, 2011-10-06 at 14:56 -0700, Arun Sharma wrote:
> foo 15516 [006] 2291.187831: sched_stat_sleep: comm=foo pid=15516 delay=4054262 [ns]
> foo 15516 [006] 2291.187832: sched_stat_sleep: comm=foo pid=15516 delay=4054262 [ns]
> foo 15516 [006] 2291.188895: sched_stat_sleep: comm=foo pid=15516 delay=1053565 [ns]
> foo 15516 [006] 2291.188896: sched_stat_sleep: comm=foo pid=15516 delay=1053565 [ns]
> foo 15516 [006] 2291.188897: sched_stat_sleep: comm=foo pid=15516 delay=1053565 [ns]
But the idea of the __perf_count() thing:
DECLARE_EVENT_CLASS(sched_stat_template,
TP_PROTO(struct task_struct *tsk, u64 delay),
TP_ARGS(tsk, delay),
TP_STRUCT__entry(
__array( char, comm, TASK_COMM_LEN )
__field( pid_t, pid )
__field( u64, delay )
),
TP_fast_assign(
memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
__entry->pid = tsk->pid;
__entry->delay = delay;
)
TP_perf_assign(
__perf_count(delay);
),
TP_printk("comm=%s pid=%d delay=%Lu [ns]",
__entry->comm, __entry->pid,
(unsigned long long)__entry->delay)
);
is that the counter is incremented with the delay, so the event should
get weighted right.
So having to get the delay out of the raw tracepoint data shouldn't be
needed.
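Peter's point can be sketched numerically (illustrative Python, not perf code): with __perf_count(delay), the event counter advances by the whole delay, so at a fixed sample period the number of samples emitted is proportional to the time slept:

```python
def period_samples(count_increment, period):
    """Samples emitted when the event counter advances by count_increment
    and a sample is taken every `period` counter units (the carried-over
    remainder is ignored in this sketch)."""
    return count_increment // period

# __perf_count(delay): a 4ms sleep advances the counter 4x as far as a
# 1ms sleep, so it crosses 4x as many sample periods.
```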
* Re: Profiling sleep times?
2011-10-07 1:30 ` Peter Zijlstra
@ 2011-10-07 5:42 ` avagin
2011-10-07 9:33 ` Peter Zijlstra
2011-10-07 17:58 ` Arun Sharma
1 sibling, 1 reply; 23+ messages in thread
From: avagin @ 2011-10-07 5:42 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Arun Sharma, linux-perf-users, acme, mingo, Stephane Eranian,
Frederic Weisbecker
On 10/07/2011 05:30 AM, Peter Zijlstra wrote:
> On Thu, 2011-10-06 at 14:56 -0700, Arun Sharma wrote:
>> foo 15516 [006] 2291.187831: sched_stat_sleep: comm=foo pid=15516 delay=4054262 [ns]
>> foo 15516 [006] 2291.187832: sched_stat_sleep: comm=foo pid=15516 delay=4054262 [ns]
>> foo 15516 [006] 2291.188895: sched_stat_sleep: comm=foo pid=15516 delay=1053565 [ns]
>> foo 15516 [006] 2291.188896: sched_stat_sleep: comm=foo pid=15516 delay=1053565 [ns]
>> foo 15516 [006] 2291.188897: sched_stat_sleep: comm=foo pid=15516 delay=1053565 [ns]
>
> But the idea of the __perf_count() thing:
__perf_count() doesn't work at the moment, and I have sent the patch
which fixes it. Could you commit it? Its subject is:
[PATCH 1/4] perf: fix counter of ftrace events
but that's not all.
By default perf doesn't use SAMPLE_PERIOD; it relies on the event count.
Each "trace" event is split into a number of "perf" events in
perf_swevent_overflow(). The number of perf events should be
proportional to __perf_count(), but it is capped at
sysctl_perf_event_sample_rate / HZ. In my case that's 100 events.
....
if (hwc->interrupts != MAX_INTERRUPTS) {
hwc->interrupts++;
if (HZ * hwc->interrupts > (u64)sysctl_perf_event_sample_rate) {
....
So Arun Sharma's perf.data probably contains batches of events, where
each batch contains 100 events irrespective of "delay".
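The throttle described here can be modeled in a few lines (an illustrative sketch, not the kernel code; the limit of 100 follows from a sysctl_perf_event_sample_rate of 100000 with HZ=1000, matching "In my case it's 100 events"):

```python
def throttled_samples(delay_ns, period_ns, sample_rate, hz):
    """Naive model of the perf_swevent_overflow() throttle: the wanted
    number of samples is delay/period, but at most sample_rate/HZ
    interrupts are allowed per tick."""
    wanted = delay_ns // period_ns
    limit = sample_rate // hz
    return min(wanted, limit)

# With an effective period of 1, every sleep wants millions of samples,
# but both a 4ms and a 1ms sleep are clipped to the same 100 samples,
# so the intended 4:1 weighting is lost.
```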
>
> DECLARE_EVENT_CLASS(sched_stat_template,
>
> TP_PROTO(struct task_struct *tsk, u64 delay),
>
> TP_ARGS(tsk, delay),
>
> TP_STRUCT__entry(
> __array( char, comm, TASK_COMM_LEN )
> __field( pid_t, pid )
> __field( u64, delay )
> ),
>
> TP_fast_assign(
> memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
> __entry->pid = tsk->pid;
> __entry->delay = delay;
> )
> TP_perf_assign(
> __perf_count(delay);
> ),
>
> TP_printk("comm=%s pid=%d delay=%Lu [ns]",
> __entry->comm, __entry->pid,
> (unsigned long long)__entry->delay)
> );
>
>
> is that the counter is incremented with the delay, so the event should
> get weighted right.
>
> So having to get the delay out of the raw tracepoint data shouldn't be
> needed.
* Re: Profiling sleep times?
2011-10-07 5:42 ` avagin
@ 2011-10-07 9:33 ` Peter Zijlstra
0 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2011-10-07 9:33 UTC (permalink / raw)
To: avagin
Cc: Arun Sharma, linux-perf-users, acme, mingo, Stephane Eranian,
Frederic Weisbecker
On Fri, 2011-10-07 at 09:42 +0400, avagin@gmail.com wrote:
> Could you commit it? It's subject:
> [PATCH 1/4] perf: fix counter of ftrace events
>
tip/master:
commit 92e51938f5d005026ba4bb5b1fae5a86dc195b86
Author: Andrew Vagin <avagin@openvz.org>
Date: Mon Sep 26 19:55:32 2011 +0400
perf: Fix counter of ftrace events
* Re: Profiling sleep times?
2011-10-07 1:30 ` Peter Zijlstra
2011-10-07 5:42 ` avagin
@ 2011-10-07 17:58 ` Arun Sharma
2011-10-07 23:16 ` avagin
1 sibling, 1 reply; 23+ messages in thread
From: Arun Sharma @ 2011-10-07 17:58 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-perf-users, acme, mingo, Stephane Eranian,
Frederic Weisbecker, Andrew Vagin
On 10/6/11 6:30 PM, Peter Zijlstra wrote:
>
> But the idea of the __perf_count() thing:
> [..]
> TP_perf_assign(
> __perf_count(delay);
> ),
>
> TP_printk("comm=%s pid=%d delay=%Lu [ns]",
> __entry->comm, __entry->pid,
> (unsigned long long)__entry->delay)
> );
>
>
> is that the counter is incremented with the delay, so the event should
> get weighted right.
>
> So having to get the delay out of the raw tracepoint data shouldn't be
> needed.
How does this work for dynamic tracepoints? Or tracepoints with multiple
dimensions where the user might want to query each dimension separately?
Andrew: what do you think about generalizing my patch to accept a
command line option(s) to specify which fields to use for the purpose of
computing the histogram?
For static tracepoints, TP_perf_assign() could act as a hint on which
field to default to.
-Arun
* Re: Profiling sleep times?
2011-10-07 17:58 ` Arun Sharma
@ 2011-10-07 23:16 ` avagin
0 siblings, 0 replies; 23+ messages in thread
From: avagin @ 2011-10-07 23:16 UTC (permalink / raw)
To: Arun Sharma
Cc: Peter Zijlstra, linux-perf-users, acme, mingo, Stephane Eranian,
Frederic Weisbecker
Hello Arun,
> Andrew: what do you think about generalizing my patch to accept a
> command line option(s) to specify which fields to use for the purpose of
> computing the histogram?
I have no objection, but I don't think we really need that.
We don't have a real use case yet. All the events I've seen have at
most one parameter that could be used as a weight, and for that one we
already have the "period" parameter.
As I already said, you have trouble with sched_stat_sleep due to some
issues in the kernel.
The first issue is that __perf_count() doesn't work, and I sent the
patch which fixes it. ([PATCH] perf: fix counter of ftrace events)
The second issue is that trace events are split into a number of perf
events, and their number is restricted by sysctl_perf_event_sample_rate/HZ.
I'm not sure we should generate more than one "perf" event for each
"trace" event. I think the better way is to use SAMPLE_PERIOD, and I'm
now thinking in that direction.
If you want to fix the bug with sched_stat_sleep, you need to fix the
second issue. I found a workaround for your case:
#./perf record -age sched:sched_stat_sleep --filter="comm == foo" -c 100000 -F 100000 -- ~/foo
# ./perf report
# Events: 5K sched:sched_stat_sleep
#
# Overhead Command Shared Object Symbol
# ........ ....... ................. ......
#
99.98% foo [unknown] [k] 0
|
--- schedule
|
|--80.20%-- schedule_hrtimeout_range_clock
| schedule_hrtimeout_range
| poll_schedule_timeout
| do_select
| core_sys_select
| sys_select
| system_call_fastpath
|
--19.80%-- do_nanosleep
hrtimer_nanosleep
sys_nanosleep
system_call_fastpath
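The workaround's numbers line up (illustrative arithmetic, using the delays from the perf script output earlier in the thread): with a fixed period of 100000, each sleep contributes delay/period samples and the throttle no longer bites, which reproduces the ~80/20 split in the report above:

```python
period = 100_000          # from -c 100000
select_delay = 4_054_262  # ns, from the sched_stat_sleep trace
sleep_delay = 1_053_565   # ns

select_samples = select_delay // period  # 40 samples per select() sleep
sleep_samples = sleep_delay // period    # 10 samples per nanosleep()
share = select_samples / (select_samples + sleep_samples)
```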
>
> For static tracepoints, TP_perf_assign() could act as a hint on which
> field to default to.
>
> -Arun
* Re: Profiling sleep times?
2011-10-06 21:56 ` Arun Sharma
2011-10-07 0:05 ` Arun Sharma
2011-10-07 1:30 ` Peter Zijlstra
@ 2011-10-08 1:45 ` avagin
2011-10-10 18:50 ` Arun Sharma
2 siblings, 1 reply; 23+ messages in thread
From: avagin @ 2011-10-08 1:45 UTC (permalink / raw)
To: Arun Sharma
Cc: Peter Zijlstra, linux-perf-users, acme, mingo, Stephane Eranian,
Frederic Weisbecker
[-- Attachment #1: Type: text/plain, Size: 416 bytes --]
> Andrew, are you already working on user space patches?
Yes, I'm working. I've attached the draft version.
The example of usage:
#./perf record -ag -e sched:sched_switch --filter "prev_state == 1" -e sched:sched_process_exit -e sched:sched_stat_sleep --filter "comm == foo" ~/foo
#./perf inject -s -i perf.data -o perf.data.d
#./perf report -i perf.data.d
I will be glad to receive any comments.
>
> -Arun
[-- Attachment #2: 0001-perf-teach-perf-inject-to-work-with-files.patch --]
[-- Type: text/plain, Size: 2745 bytes --]
From c2cd0d73c11687fbc67884aba337f497e13890f9 Mon Sep 17 00:00:00 2001
From: Andrew Vagin <avagin@openvz.org>
Date: Tue, 4 Oct 2011 16:10:23 +0400
Subject: [PATCH 1/3] perf: teach "perf inject" to work with files
Before this patch "perf inject" could only handle data from a pipe.
I want to use "perf inject" for reworking events; see my following patch.
Signed-off-by: Andrew Vagin <avagin@openvz.org>
---
tools/perf/builtin-inject.c | 33 +++++++++++++++++++++++++++++++--
1 files changed, 31 insertions(+), 2 deletions(-)
diff --git a/tools/perf/builtin-inject.c b/tools/perf/builtin-inject.c
index 8dfc12b..8df8b71 100644
--- a/tools/perf/builtin-inject.c
+++ b/tools/perf/builtin-inject.c
@@ -13,7 +13,12 @@
#include "util/parse-options.h"
-static char const *input_name = "-";
+static char const *input_name = "-";
+static const char *output_name = "-";
+static int pipe_output = 0;
+static int output;
+static u64 bytes_written = 0;
+
static bool inject_build_ids;
static int perf_event__repipe_synth(union perf_event *event,
@@ -25,12 +30,14 @@ static int perf_event__repipe_synth(union perf_event *event,
size = event->header.size;
while (size) {
- int ret = write(STDOUT_FILENO, buf, size);
+ int ret = write(output, buf, size);
if (ret < 0)
return -errno;
size -= ret;
buf += ret;
+
+ bytes_written += ret;
}
return 0;
@@ -213,8 +220,14 @@ static int __cmd_inject(void)
if (session == NULL)
return -ENOMEM;
+ if (!pipe_output)
+ lseek(output, session->header.data_offset, SEEK_SET);
ret = perf_session__process_events(session, &inject_ops);
+ if (!pipe_output) {
+ session->header.data_size += bytes_written;
+ perf_session__write_header(session, session->evlist, output, true);
+ }
perf_session__delete(session);
return ret;
@@ -228,6 +241,10 @@ static const char * const report_usage[] = {
static const struct option options[] = {
OPT_BOOLEAN('b', "build-ids", &inject_build_ids,
"Inject build-ids into the output stream"),
+ OPT_STRING('i', "input", &input_name, "file",
+ "input file name"),
+ OPT_STRING('o', "output", &output_name, "file",
+ "output file name"),
OPT_INCR('v', "verbose", &verbose,
"be more verbose (show build ids, etc)"),
OPT_END()
@@ -243,6 +260,18 @@ int cmd_inject(int argc, const char **argv, const char *prefix __used)
if (argc)
usage_with_options(report_usage, options);
+ if (!strcmp(output_name, "-")) {
+ pipe_output = 1;
+ output = STDOUT_FILENO;
+ } else {
+ output = open(output_name, O_CREAT| O_WRONLY | O_TRUNC,
+ S_IRUSR | S_IWUSR);
+ if (output < 0) {
+ perror("failed to create output file");
+ exit(-1);
+ }
+ }
+
if (symbol__init() < 0)
return -1;
--
1.7.1
[-- Attachment #3: 0003-perf-add-scripts-for-collecting-D-state-statistics.patch --]
[-- Type: text/plain, Size: 1420 bytes --]
From e08809ba075d92e60f669fc62c48128e06c834fb Mon Sep 17 00:00:00 2001
From: Andrew Vagin <avagin@openvz.org>
Date: Thu, 6 Oct 2011 12:18:44 +0400
Subject: [PATCH 3/3] perf: add scripts for collecting D-state statistics
Signed-off-by: Andrew Vagin <avagin@openvz.org>
---
.../perf/scripts/python/bin/task-in-d-state-record | 2 ++
.../perf/scripts/python/bin/task-in-d-state-report | 6 ++++++
2 files changed, 8 insertions(+), 0 deletions(-)
create mode 100644 tools/perf/scripts/python/bin/task-in-d-state-record
create mode 100644 tools/perf/scripts/python/bin/task-in-d-state-report
diff --git a/tools/perf/scripts/python/bin/task-in-d-state-record b/tools/perf/scripts/python/bin/task-in-d-state-record
new file mode 100644
index 0000000..d70bed0
--- /dev/null
+++ b/tools/perf/scripts/python/bin/task-in-d-state-record
@@ -0,0 +1,2 @@
+#!/bin/bash
+perf record -ag -e sched:sched_switch --filter "prev_state == 2" -e sched:sched_stat_iowait -e sched:sched_process_exit $@
diff --git a/tools/perf/scripts/python/bin/task-in-d-state-report b/tools/perf/scripts/python/bin/task-in-d-state-report
new file mode 100644
index 0000000..f1ab71e
--- /dev/null
+++ b/tools/perf/scripts/python/bin/task-in-d-state-report
@@ -0,0 +1,6 @@
+#!/bin/bash
+# description: D-state statistics
+# args:
+perf inject -s -i perf.data -o perf.data.d || exit
+perf report -i perf.data.d || exit
+unlink perf.data.d
--
1.7.1
[-- Attachment #4: 0002-perf-teach-perf-inject-to-merge-sched_stat_-and-sche.patch --]
[-- Type: text/plain, Size: 4355 bytes --]
From 7a3152c03d07d30f47ab4fe9295212330bf821e4 Mon Sep 17 00:00:00 2001
From: Andrew Vagin <avagin@openvz.org>
Date: Tue, 4 Oct 2011 16:54:15 +0400
Subject: [PATCH 2/3] perf: teach perf inject to merge sched_stat_* and sched_switch events
You may want to know where and for how long a task sleeps. The callchain
can be found in sched_switch and the time slice in sched_stat_iowait, so
I add a handler in perf inject for merging these events.
My code saves the last sched_switch event for each process, and when it
meets a sched_stat_iowait event it reports the saved sched_switch event,
because that one contains the correct callchain. In other words, it
replaces every sched_stat_iowait event with the proper sched_switch event.
My code doesn't modify the events themselves.
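The merge logic in this patch can be sketched as follows (an illustrative Python model of the C handler, with made-up event dicts; it is not the patch itself):

```python
def inject_sched_stat(events):
    """Cache the last sched_switch per pid; when a sched_stat_* event for
    that pid arrives, emit the cached sched_switch (which carries the
    useful callchain) in its place. Other events pass through."""
    last_switch = {}
    out = []
    for ev in events:
        name, pid = ev["name"], ev["pid"]
        if name in ("sched_switch", "sched_process_exit"):
            last_switch.pop(pid, None)       # drop any stale entry
            if name == "sched_switch":
                last_switch[pid] = ev        # remember for this pid
        elif name.startswith("sched_stat_"):
            if pid in last_switch:
                out.append(last_switch[pid])  # re-emit with callchain
        else:
            out.append(ev)
    return out
```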
Signed-off-by: Andrew Vagin <avagin@openvz.org>
---
tools/perf/builtin-inject.c | 87 +++++++++++++++++++++++++++++++++++++++++++
1 files changed, 87 insertions(+), 0 deletions(-)
diff --git a/tools/perf/builtin-inject.c b/tools/perf/builtin-inject.c
index 8df8b71..10bdd65 100644
--- a/tools/perf/builtin-inject.c
+++ b/tools/perf/builtin-inject.c
@@ -12,6 +12,8 @@
#include "util/debug.h"
#include "util/parse-options.h"
+#include "util/trace-event.h"
+
static char const *input_name = "-";
static const char *output_name = "-";
@@ -20,6 +22,7 @@ static int output;
static u64 bytes_written = 0;
static bool inject_build_ids;
+static bool inject_sched_stat;
static int perf_event__repipe_synth(union perf_event *event,
struct perf_session *session __used)
@@ -179,6 +182,85 @@ repipe:
return 0;
}
+struct event_entry
+{
+ struct list_head list;
+ u32 pid;
+ struct perf_event_header header;
+};
+
+#define ENT_SIZE(size) ((size) + offsetof(struct event_entry, header))
+
+static LIST_HEAD(samples);
+
+static int perf_event__sched_stat(union perf_event *event,
+ struct perf_sample *sample,
+ struct perf_evsel *evsel __used,
+ struct perf_session *session)
+{
+ int type;
+ struct event *e;
+ const char *evname = NULL;
+ uint32_t size;
+ void *buf = event;
+ struct event_entry *ent;
+ union perf_event *event_sw = NULL;
+ struct perf_sample sample_sw;
+ int sched_process_exit;
+
+ size = event->header.size;
+
+ type = trace_parse_common_type(sample->raw_data);
+ e = trace_find_event(type);
+ if (e)
+ evname = e->name;
+
+ sched_process_exit = !strcmp(evname, "sched_process_exit");
+
+ if (!strcmp(evname, "sched_switch") || sched_process_exit) {
+ list_for_each_entry(ent, &samples, list)
+ if (sample->pid == ent->pid)
+ break;
+
+ if (&ent->list != &samples) {
+ list_del(&ent->list);
+ free(ent);
+ }
+
+ if (sched_process_exit)
+ return 0;
+
+ ent = malloc(size + offsetof(struct event_entry, header));
+ ent->pid = sample->pid;
+ memcpy(&ent->header, buf, size);
+ list_add(&ent->list, &samples);
+ return 0;
+
+ } else if (!strncmp(evname, "sched_stat_", 11)) {
+ u32 pid;
+
+ pid = raw_field_value(e, "pid", sample->raw_data);
+
+ list_for_each_entry(ent, &samples, list) {
+ if (pid == ent->pid)
+ break;
+ }
+
+ if (&ent->list == &samples) {
+ pr_warning("Could not find sched_switch for pid %u\n", pid);
+ return 0;
+ }
+
+ event_sw = (union perf_event *) &ent->header;
+ perf_session__parse_sample(session, event_sw, &sample_sw);
+ perf_event__repipe(event_sw, &sample_sw, session);
+ return 0;
+ }
+
+ perf_event__repipe(event, sample, session);
+
+ return 0;
+}
struct perf_event_ops inject_ops = {
.sample = perf_event__repipe_sample,
.mmap = perf_event__repipe,
@@ -214,6 +296,9 @@ static int __cmd_inject(void)
inject_ops.mmap = perf_event__repipe_mmap;
inject_ops.fork = perf_event__repipe_task;
inject_ops.tracing_data = perf_event__repipe_tracing_data;
+ } else if (inject_sched_stat) {
+ inject_ops.sample = perf_event__sched_stat;
+ inject_ops.ordered_samples = true;
}
session = perf_session__new(input_name, O_RDONLY, false, true, &inject_ops);
@@ -241,6 +326,8 @@ static const char * const report_usage[] = {
static const struct option options[] = {
OPT_BOOLEAN('b', "build-ids", &inject_build_ids,
"Inject build-ids into the output stream"),
+ OPT_BOOLEAN('s', "sched-stat", &inject_sched_stat,
+ "correct call-chains for sched_stat_*"),
OPT_STRING('i', "input", &input_name, "file",
"input file name"),
OPT_STRING('o', "output", &output_name, "file",
--
1.7.1
* Re: Profiling sleep times?
2011-10-08 1:45 ` avagin
@ 2011-10-10 18:50 ` Arun Sharma
2011-10-12 7:41 ` Ingo Molnar
0 siblings, 1 reply; 23+ messages in thread
From: Arun Sharma @ 2011-10-10 18:50 UTC (permalink / raw)
To: avagin
Cc: Peter Zijlstra, linux-perf-users, acme, mingo, Stephane Eranian,
Frederic Weisbecker
On 10/7/11 6:45 PM, avagin@gmail.com wrote:
>> Andrew, are you already working on user space patches?
>
> Yes, I'm working. I've attached the draft version.
Generally I'm happy with the direction of these patches. For doing wall
time based profiling we'll need support in perf inject to merge sleep +
cycles events.
Re: examples of why one might need histograms based on various fields in
a dynamic kernel probe:
I might want to add a probe such as:
perf probe --add 'sys_read fd=%di'
perf probe --add 'sys_read%return bytes=$retval'
and expect to find stack traces that are responsible for most I/O calls
based on bytes vs fd vs number of syscalls.
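What Arun asks for amounts to letting the user pick which payload field weights the histogram (an illustrative sketch with made-up events; the field names mirror the probes above):

```python
def histogram(events, weight_field=None):
    """Aggregate events per callchain, weighted by an arbitrary payload
    field, or by plain event count when no field is given."""
    hist = {}
    for ev in events:
        w = ev[weight_field] if weight_field else 1
        hist[ev["chain"]] = hist.get(ev["chain"], 0) + w
    return hist

# Hypothetical sys_read%return samples: weighting by call count vs by
# bytes returned tells very different stories.
reads = [
    {"chain": "read_log", "bytes": 4096},
    {"chain": "read_cfg", "bytes": 64},
    {"chain": "read_cfg", "bytes": 64},
]
```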
-Arun
* Re: Profiling sleep times?
2011-10-10 18:50 ` Arun Sharma
@ 2011-10-12 7:41 ` Ingo Molnar
2011-10-13 5:39 ` Andrew Vagin
2011-10-14 21:19 ` Arun Sharma
0 siblings, 2 replies; 23+ messages in thread
From: Ingo Molnar @ 2011-10-12 7:41 UTC (permalink / raw)
To: Arun Sharma
Cc: avagin, Peter Zijlstra, linux-perf-users, acme, Stephane Eranian,
Frederic Weisbecker
* Arun Sharma <asharma@fb.com> wrote:
> On 10/7/11 6:45 PM, avagin@gmail.com wrote:
> >>Andrew, are you already working on user space patches?
> >
> >Yes, I'm working. I've attached the draft version.
>
> Generally I'm happy with the direction of these patches. [...]
Ok, will someone please send an updated series of all missing
patches, with all acks included, for potential v3.2 pickup?
Sleep time profiling is obviously a very hot feature ...
Thanks,
Ingo
* Re: Profiling sleep times?
2011-10-12 7:41 ` Ingo Molnar
@ 2011-10-13 5:39 ` Andrew Vagin
2011-10-14 21:19 ` Arun Sharma
1 sibling, 0 replies; 23+ messages in thread
From: Andrew Vagin @ 2011-10-13 5:39 UTC (permalink / raw)
To: Ingo Molnar
Cc: Arun Sharma, Peter Zijlstra, linux-perf-users, acme,
Stephane Eranian, Frederic Weisbecker
> Ok, will someone please send an updated series of all missing
> patches, with all acks included, for potential v3.2 pickup?
I'm on vacation now, so I'm going to send my patches in two or three weeks.
Sorry for the delay.
>
> Sleep time profiling is obviously a very hot feature ...
>
> Thanks,
>
> Ingo
* Re: Profiling sleep times?
2011-10-12 7:41 ` Ingo Molnar
2011-10-13 5:39 ` Andrew Vagin
@ 2011-10-14 21:19 ` Arun Sharma
2011-10-15 17:00 ` Ingo Molnar
1 sibling, 1 reply; 23+ messages in thread
From: Arun Sharma @ 2011-10-14 21:19 UTC (permalink / raw)
To: Ingo Molnar
Cc: avagin, Peter Zijlstra, linux-perf-users, acme, Stephane Eranian,
Frederic Weisbecker
On 10/12/11 12:41 AM, Ingo Molnar wrote:
>> Generally I'm happy with the direction of these patches. [...]
>
> Ok, will someone please send an updated series of all missing
> patches, with all acks included, for potential v3.2 pickup?
>
> Sleep time profiling is obviously a very hot feature ...
>
I acked two of Andrew's patches. The third one is the most important
one. I haven't had a chance to review that carefully yet - although it's
working well for me.
We still need to figure out what's the best way to munge all this data
in user space to implement wall clock profiling.
-Arun
* Re: Profiling sleep times?
2011-10-14 21:19 ` Arun Sharma
@ 2011-10-15 17:00 ` Ingo Molnar
2011-10-15 19:22 ` Peter Zijlstra
0 siblings, 1 reply; 23+ messages in thread
From: Ingo Molnar @ 2011-10-15 17:00 UTC (permalink / raw)
To: Arun Sharma
Cc: avagin, Peter Zijlstra, linux-perf-users, acme, Stephane Eranian,
Frederic Weisbecker
* Arun Sharma <asharma@fb.com> wrote:
> On 10/12/11 12:41 AM, Ingo Molnar wrote:
> >>Generally I'm happy with the direction of these patches. [...]
> >
> >Ok, will someone please send an updated series of all missing
> >patches, with all acks included, for potential v3.2 pickup?
> >
> >Sleep time profiling is obviously a very hot feature ...
> >
>
> I acked two of Andrew's patches. The third one is the most
> important one. I haven't had a chance to review that carefully yet
> - although it's working well for me.
>
> We still need to figure out what's the best way to munge all this
> data in user space to implement wall clock profiling.
Well, if 'perf report' can show sleep-time-ordered entries just fine
by default (given a proper perf record run), and that also works in
call-graph recording mode, then i'm a happy camper.
Sleep time should really just be a different notion of 'cost of the
function/callchain' and fit into the existing scheme, right?
Thanks,
Ingo
* Re: Profiling sleep times?
2011-10-15 17:00 ` Ingo Molnar
@ 2011-10-15 19:22 ` Peter Zijlstra
2011-10-15 19:29 ` Peter Zijlstra
0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2011-10-15 19:22 UTC (permalink / raw)
To: Ingo Molnar
Cc: Arun Sharma, avagin, linux-perf-users, acme, Stephane Eranian,
Frederic Weisbecker
On Sat, 2011-10-15 at 19:00 +0200, Ingo Molnar wrote:
> * Arun Sharma <asharma@fb.com> wrote:
>
> > On 10/12/11 12:41 AM, Ingo Molnar wrote:
> > >>Generally I'm happy with the direction of these patches. [...]
> > >
> > >Ok, will someone please send an updated series of all missing
> > >patches, with all acks included, for potential v3.2 pickup?
> > >
> > >Sleep time profiling is obviously a very hot feature ...
> > >
> >
> > I acked two of Andrew's patches. The third one is the most
> > important one. I haven't had a chance to review that carefully yet
> > - although it's working well for me.
> >
> > We still need to figure out what's the best way to munge all this
> > data in user space to implement wall clock profiling.
>
> Well, if 'perf report' can show sleep time ordered entries just fine
> by default (if a proper perf record was done), which also works in
> call graph recording mode then i'm a happy camper.
>
> Sleep time should really just be a different notion of 'cost of the
> function/callchain' and fit into the existing scheme, right?
The problem with Andrew's patches is that they wreck the callchain
semantics. The waittime tracepoint is in the wakeup path (and hence
generates the waker's callchain), whereas they really want the callchain
of the woken task to show where it spent time.
* Re: Profiling sleep times?
2011-10-15 19:22 ` Peter Zijlstra
@ 2011-10-15 19:29 ` Peter Zijlstra
2011-10-18 1:07 ` Arun Sharma
0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2011-10-15 19:29 UTC (permalink / raw)
To: Ingo Molnar
Cc: Arun Sharma, avagin, linux-perf-users, acme, Stephane Eranian,
Frederic Weisbecker
On Sat, 2011-10-15 at 21:22 +0200, Peter Zijlstra wrote:
> > Sleep time should really just be a different notion of 'cost of the
> > function/callchain' and fit into the existing scheme, right?
>
> The problem with Andrew's patches is that they wreck the callchain
> semantics. The waittime tracepoint is in the wakeup path (and hence
> generates the waker's callchain), whereas they really want the callchain
> of the woken task to show where it spent time.
We could of course try to move the tracepoint into the schedule path, so
we issue it the first time the task gets scheduled after the wakeup, but
I suspect that will just add more overhead, and we really could do
without that.
* Re: Profiling sleep times?
2011-10-15 19:29 ` Peter Zijlstra
@ 2011-10-18 1:07 ` Arun Sharma
2011-10-22 10:49 ` Frederic Weisbecker
0 siblings, 1 reply; 23+ messages in thread
From: Arun Sharma @ 2011-10-18 1:07 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, avagin, linux-perf-users, acme, Stephane Eranian,
Frederic Weisbecker
On 10/15/11 12:29 PM, Peter Zijlstra wrote:
> On Sat, 2011-10-15 at 21:22 +0200, Peter Zijlstra wrote:
>
>>> Sleep time should really just be a different notion of 'cost of the
>>> function/callchain' and fit into the existing scheme, right?
>>
>> The problem with Andrew's patches is that they wreck the callchain
>> semantics. The waittime tracepoint is in the wakeup path (and hence
>> generates the waker's callchain), whereas they really want the callchain
>> of the woken task to show where it spent time.
>
> We could of course try to move the tracepoint into the schedule path, so
> we issue it the first time the task gets scheduled after the wakeup, but
> I suspect that will just add more overhead, and we really could do
> without that.
Do we need to define new tracepoints? I suspect we could make the
existing ones:
trace_sched_stat_wait()
trace_sched_stat_sleep()
work for this purpose. The length of time the task was not on the CPU
could then be computed as sleep + wait. The downside is that the
complexity moves to user space.
perf record -e sched:sched_stat_sleep,sched:sched_stat_wait,...
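As a rough sketch of that user-space side, assuming the tracepoint samples have already been parsed out of perf.data into simple tuples (the real record layout differs and is only stood in for here):

```python
from collections import defaultdict

def off_cpu_time(events):
    """Sum sleep and wait durations (ns) per task.

    `events` is a list of (comm, kind, delta_ns) tuples, a stand-in
    for parsed sched_stat_sleep / sched_stat_wait samples.
    """
    totals = defaultdict(int)
    for comm, kind, delta in events:
        if kind in ("sleep", "wait"):
            totals[comm] += delta  # off-CPU time = sleep + wait
    return dict(totals)

events = [
    ("foo", "sleep", 2_000_000),
    ("foo", "wait",    500_000),
    ("bar", "sleep", 1_000_000),
]
print(off_cpu_time(events))  # {'foo': 2500000, 'bar': 1000000}
```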
Re: changing the semantics of tracepoint callchains
Yeah - this could be surprising. Luckily, most tracepoints retain their
semantics, but a few special ones don't. I guess we just need to
document the new behavior.
-Arun
* Re: Profiling sleep times?
2011-10-18 1:07 ` Arun Sharma
@ 2011-10-22 10:49 ` Frederic Weisbecker
2011-10-22 16:22 ` Andrew Wagin
2011-10-23 0:27 ` Arun Sharma
0 siblings, 2 replies; 23+ messages in thread
From: Frederic Weisbecker @ 2011-10-22 10:49 UTC (permalink / raw)
To: Arun Sharma
Cc: Peter Zijlstra, Ingo Molnar, avagin, linux-perf-users, acme,
Stephane Eranian
On Mon, Oct 17, 2011 at 06:07:00PM -0700, Arun Sharma wrote:
> On 10/15/11 12:29 PM, Peter Zijlstra wrote:
> >On Sat, 2011-10-15 at 21:22 +0200, Peter Zijlstra wrote:
> >
> >>>Sleep time should really just be a different notion of 'cost of the
> >>>function/callchain' and fit into the existing scheme, right?
> >>
> >>The problem with Andrew's patches is that they wreck the callchain
> >>semantics. The waittime tracepoint is in the wakeup path (and hence
> >>generates the waker's callchain), whereas they really want the callchain
> >>of the woken task to show where it spent time.
> >
> >We could of course try to move the tracepoint into the schedule path, so
> >we issue it the first time the task gets scheduled after the wakeup, but
> >I suspect that will just add more overhead, and we really could do
> >without that.
>
> Do we need to define new tracepoints? I suspect we could make the
> existing ones:
>
> trace_sched_stat_wait()
> trace_sched_stat_sleep()
>
> work for this purpose. The length of time the task was not on the
> cpu could then be computed as: sleep+wait. The downside is that the
> complexity moves to user space.
>
> perf record -e sched:sched_stat_sleep,sched:sched_stat_wait,...
>
> Re: changing the semantics of tracepoint callchains
>
> Yeah - this could be surprising. Luckily, most tracepoints retain
> their semantics, but a few special ones don't. I guess we just need
> to document the new behavior.
That's not only a problem of semantics, although that alone is a problem:
people will seldom read the documentation for corner cases. We should
really stay consistent here: if remote callchains are really needed, we
want a specific interface for that, not an abuse of the existing one that
would only confuse people.
Now I still think doing remote callchains is asking for trouble: we need to
ensure the target is really sleeping and is not going to be scheduled
concurrently, otherwise you might get weird or stale results. So the user needs
to know which tracepoints are safe for this.
Then comes the problem of dealing with remote callchains in userspace: the event
comes from one task but the callchain is from another. You need the perf tools
to handle remote DSOs/mappings/symbols, etc.
That's a lot of unnecessary complications.
I think we should use something like a perf report plugin: perhaps something
that can create a virtual event on top of real ones: process the sched:sched_switch
events, find the time tasks spend sleeping, and create virtual sleep events on top
of that, with a period weighted by the sleep time.
Just a thought.
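A minimal sketch of that idea, under the simplifying assumption that the sched:sched_switch records have already been reduced to (timestamp, prev_pid, next_pid) tuples:

```python
def synthesize_sleep_events(switch_events):
    """Pair switch-out/switch-in per pid and emit virtual 'sleep'
    samples whose period is the time slept, as suggested above.

    `switch_events` is a simplified stand-in for walking the
    sched:sched_switch records in perf.data:
    (timestamp_ns, prev_pid, next_pid) per context switch.
    """
    last_off = {}   # pid -> timestamp at which it was switched out
    virtual = []
    for ts, prev_pid, next_pid in switch_events:
        last_off[prev_pid] = ts
        if next_pid in last_off:
            slept = ts - last_off.pop(next_pid)
            virtual.append((next_pid, slept))  # period = sleep time
    return virtual

switches = [
    (100, 1, 2),   # pid 1 off, pid 2 on
    (250, 2, 1),   # pid 1 back on: it slept 150 ns
    (400, 1, 2),   # pid 2 back on: it slept 150 ns
]
print(synthesize_sleep_events(switches))  # [(1, 150), (2, 150)]
```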
* Re: Profiling sleep times?
2011-10-22 10:49 ` Frederic Weisbecker
@ 2011-10-22 16:22 ` Andrew Wagin
2011-10-23 0:27 ` Arun Sharma
1 sibling, 0 replies; 23+ messages in thread
From: Andrew Wagin @ 2011-10-22 16:22 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: Arun Sharma, Peter Zijlstra, Ingo Molnar, linux-perf-users, acme,
Stephane Eranian
Hi, all.
Sorry for the late response; I was on vacation.
I see the miscommunication. You explained to me that remote callchains were
not a good idea, and now I think the same. I added a plugin to "perf
inject".
An example of usage:
#./perf record -ag -e sched:sched_switch --filter "prev_state == 1" -e
sched:sched_process_exit -e sched:sched_stat_sleep --filter "comm ==
foo" ~/foo
#./perf inject -s -i perf.data -o perf.data.d
#./perf report -i perf.data.d
I'm going to send patches soon.
2011/10/22 Frederic Weisbecker <fweisbec@gmail.com>:
> On Mon, Oct 17, 2011 at 06:07:00PM -0700, Arun Sharma wrote:
>> On 10/15/11 12:29 PM, Peter Zijlstra wrote:
>> >On Sat, 2011-10-15 at 21:22 +0200, Peter Zijlstra wrote:
>> >
>> >>>Sleep time should really just be a different notion of 'cost of the
>> >>>function/callchain' and fit into the existing scheme, right?
>> >>
>> >>The problem with Andrew's patches is that they wreck the callchain
>> >>semantics. The waittime tracepoint is in the wakeup path (and hence
>> >>generates the waker's callchain), whereas they really want the callchain
>> >>of the woken task to show where it spent time.
>> >
>> >We could of course try to move the tracepoint into the schedule path, so
>> >we issue it the first time the task gets scheduled after the wakeup, but
>> >I suspect that will just add more overhead, and we really could do
>> >without that.
>>
>> Do we need to define new tracepoints? I suspect we could make the
>> existing ones:
>>
>> trace_sched_stat_wait()
>> trace_sched_stat_sleep()
>>
>> work for this purpose. The length of time the task was not on the
>> cpu could then be computed as: sleep+wait. The downside is that the
>> complexity moves to user space.
>>
>> perf record -e sched:sched_stat_sleep,sched:sched_stat_wait,...
>>
>> Re: changing the semantics of tracepoint callchains
>>
>> Yeah - this could be surprising. Luckily, most tracepoints retain
>> their semantics, but a few special ones don't. I guess we just need
>> to document the new behavior.
>
> That's not only a problem of semantics, although that alone is a problem:
> people will seldom read the documentation for corner cases. We should
> really stay consistent here: if remote callchains are really needed, we
> want a specific interface for that, not an abuse of the existing one that
> would only confuse people.
>
> Now I still think doing remote callchains is asking for trouble: we need to
> ensure the target is really sleeping and is not going to be scheduled
> concurrently, otherwise you might get weird or stale results. So the user needs
> to know which tracepoints are safe for this.
> Then comes the problem of dealing with remote callchains in userspace: the event
> comes from one task but the callchain is from another. You need the perf tools
> to handle remote DSOs/mappings/symbols, etc.
>
> That's a lot of unnecessary complications.
>
> I think we should use something like a perf report plugin: perhaps something
> that can create a virtual event on top of real ones: process the sched:sched_switch
> events, find the time tasks spend sleeping, and create virtual sleep events on top
> of that, with a period weighted by the sleep time.
> Just a thought.
>
* Re: Profiling sleep times?
2011-10-22 10:49 ` Frederic Weisbecker
2011-10-22 16:22 ` Andrew Wagin
@ 2011-10-23 0:27 ` Arun Sharma
1 sibling, 0 replies; 23+ messages in thread
From: Arun Sharma @ 2011-10-23 0:27 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: Peter Zijlstra, Ingo Molnar, avagin, linux-perf-users, acme,
Stephane Eranian
On 10/22/11 3:49 AM, Frederic Weisbecker wrote:
> That's not only a problem of semantics, although that alone is a problem:
> people will seldom read the documentation for corner cases. We should
> really stay consistent here: if remote callchains are really needed, we
> want a specific interface for that, not an abuse of the existing one that
> would only confuse people.
A separate interface sounds good.
>
> Now I still think doing remote callchains is asking for trouble: we need to
> ensure the target is really sleeping and is not going to be scheduled
> concurrently, otherwise you might get weird or stale results. So the user needs
> to know which tracepoints are safe for this.
I expect this interface to be used in a small number of well known
places in the kernel.
[...]
>
> I think we should use something like a perf report plugin: perhaps something
> that can create a virtual event on top of real ones: process the sched:sched_switch
> events, find the time tasks spend sleeping, and create virtual sleep events on top
> of that, with a period weighted by the sleep time.
> Just a thought.
Right - whether we're doing wall-clock profiling or sleep profiling,
it'll involve looking at multiple real events.
I still see one problem with doing: perf record -ag -e <bunch of events>
and trying to sort through what happened. Unprivileged users who don't
have permission to do system-wide profiling, but do have privileges to
profile their own processes, get locked out of this feature.
-Arun
Thread overview: 23+ messages
2011-10-03 19:38 Profiling sleep times? Arun Sharma
2011-10-03 20:17 ` Peter Zijlstra
2011-10-03 21:53 ` Arun Sharma
2011-10-04 8:34 ` Peter Zijlstra
2011-10-06 21:56 ` Arun Sharma
2011-10-07 0:05 ` Arun Sharma
2011-10-07 1:30 ` Peter Zijlstra
2011-10-07 5:42 ` avagin
2011-10-07 9:33 ` Peter Zijlstra
2011-10-07 17:58 ` Arun Sharma
2011-10-07 23:16 ` avagin
2011-10-08 1:45 ` avagin
2011-10-10 18:50 ` Arun Sharma
2011-10-12 7:41 ` Ingo Molnar
2011-10-13 5:39 ` Andrew Vagin
2011-10-14 21:19 ` Arun Sharma
2011-10-15 17:00 ` Ingo Molnar
2011-10-15 19:22 ` Peter Zijlstra
2011-10-15 19:29 ` Peter Zijlstra
2011-10-18 1:07 ` Arun Sharma
2011-10-22 10:49 ` Frederic Weisbecker
2011-10-22 16:22 ` Andrew Wagin
2011-10-23 0:27 ` Arun Sharma