All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC] [PATCH 0/5] Teach perf tool to profile sleep times (V4)
@ 2012-06-04  8:09 Andrew Vagin
  2012-06-04  8:09 ` [PATCH 1/5] sched: event: add trace events when a task starts executing on a cpu Andrew Vagin
                   ` (6 more replies)
  0 siblings, 7 replies; 15+ messages in thread
From: Andrew Vagin @ 2012-06-04  8:09 UTC (permalink / raw)
  Cc: Arun Sharma, Oleg Strikov, Steven Rostedt, Frederic Weisbecker,
	Ingo Molnar, Peter Zijlstra, linux-kernel

Do you want to know where your code waits locks for a long time?
Yes! It's for you. This feature helps you to find bottlenecks.

It's not artificial task. Once one of my colleague was investigating a
scalability problem. He pressed sysrq-t some times and tried to merge
call-chains by hand. But perf can do that better.

Here is fourth attempt to add this functionality in the kernel.

Previous attempts:
* [PATCH 0/4] trace: add ability to collect call chains of non-current task.

* [PATCH rh6 0/7] Teach perf tool to collect D-state statistics
This version doesn't work for non-root users.

* "sched/tracing: Add a new tracepoint for sleeptime" by Arun Sharma
It has been committed, but then reverted due to performance reason.

This series resolves a problem with events filtering and non-root users.
Before I tried to use sched_stat* events, but it occurs in a context of
another task. For this case logic of perf_event_context* doesn't work.

This series adds one more event sched:sched_switch_finish, it's sent
when a task starts executing on a cpu. When a task goes out from cpu,
sched:sched_switch is sent.

The time difference between sched_switch and sched_switch_finish is a time
when a task wasn't being executed on a cpu.

For profiling sleep time, events of both types should be collected and
filtered by a defined state (interruptible of uninterruptible sleep).

Then events should be post-processed with help "perf inject". It saves the time
of sched_switch event for each process and when it meets stat_switch_finish,
it sets time difference in sample->period and saves this new sample
in the new perf.data.

./perf record -e sched:sched_switch -g --filter 'prev_state == 1' \
	-e sched:sched_switch_finish --filter 'state == 1' -P  ~/foo

./perf inject -i ~/perf.data -o perf.data.new -s

A test program:
#include<unistd.h>
#include<time.h>
#include<sys/select.h>

int main()
{
          struct timespec ts1;
          struct timeval tv1;
          int i;
        long s;


          for (i = 0; i<  2; i++) {
                  ts1.tv_sec = 0;
                  ts1.tv_nsec = 10000000;
                  nanosleep(&ts1, NULL);

                  tv1.tv_sec = 0;
                  tv1.tv_usec = 40000;
                  select(0, NULL, NULL, NULL,&tv1);
                if (i % 100 == 0) {
                        s = time(NULL);
                        while (time(NULL) - s < 4);
                }
          }
        return 1;
}

Profiling results:
# Overhead  Command      Shared Object                  Symbol
# ........  .......  .................  ......................
#
   100.00%      foo  [kernel.kallsyms]  [k] finish_task_switch
                |
                --- finish_task_switch
                    __schedule
                    schedule
                   |          
                   |--79.85%-- schedule_hrtimeout_range_clock
                   |          schedule_hrtimeout_range
                   |          poll_schedule_timeout
                   |          do_select
                   |          core_sys_select
                   |          sys_select
                   |          system_call
                   |          __select
                   |          __libc_start_main
                   |          
                    --20.15%-- do_nanosleep
                              hrtimer_nanosleep
                              sys_nanosleep
                              system_call
                              __GI___libc_nanosleep
                              __libc_start_main

Andrew Vagin (5):
  event: add an event sched_switch_finish
  sched: send the event sched_switch before perf_event_task_sched_out
  sched: save a previous state on task_struct
  perf: teach "perf inject" to work with files
  perf: teach perf inject to merge sched_switch* events

 include/linux/sched.h        |    3 +
 include/trace/events/sched.h |   29 ++++++++++
 kernel/sched/core.c          |    3 +-
 tools/perf/builtin-inject.c  |  119 +++++++++++++++++++++++++++++++++++++++---
 4 files changed, 146 insertions(+), 8 deletions(-)


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/5] sched: event: add trace events when a task starts executing on a cpu
  2012-06-04  8:09 [RFC] [PATCH 0/5] Teach perf tool to profile sleep times (V4) Andrew Vagin
@ 2012-06-04  8:09 ` Andrew Vagin
  2012-06-04  8:09 ` [PATCH 2/5] sched: send the event sched_switch before perf_event_task_sched_out Andrew Vagin
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 15+ messages in thread
From: Andrew Vagin @ 2012-06-04  8:09 UTC (permalink / raw)
  Cc: Arun Sharma, Oleg Strikov, Steven Rostedt, Frederic Weisbecker,
	Ingo Molnar, Peter Zijlstra, linux-kernel

sched_switch_finish is sent when a task starts executing on a cpu.
It sends in context of this task, that's why sched_switch can't be used.
sched_switch is send when a task goes out from a cpu.

This two events will be needed for profiling sleep time.

Signed-off-by: Andrew Vagin <avagin@openvz.org>
---
 include/trace/events/sched.h |   19 +++++++++++++++++++
 1 files changed, 19 insertions(+), 0 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index ea7a203..b2219d8 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -149,6 +149,25 @@ TRACE_EVENT(sched_switch,
 		__entry->next_comm, __entry->next_pid, __entry->next_prio)
 );
 
+TRACE_EVENT(sched_switch_finish,
+
+	TP_PROTO(struct task_struct *p),
+
+	TP_ARGS(p),
+
+	TP_STRUCT__entry(
+		__array(	char,	comm,	TASK_COMM_LEN	)
+		__field(	pid_t,	pid			)
+	),
+
+	TP_fast_assign(
+		__entry->pid	= p->pid;
+		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
+	),
+
+	TP_printk("comm=%s pid=%d", __entry->comm, __entry->pid)
+);
+
 /*
  * Tracepoint for a task being migrated:
  */
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 2/5] sched: send the event sched_switch before perf_event_task_sched_out
  2012-06-04  8:09 [RFC] [PATCH 0/5] Teach perf tool to profile sleep times (V4) Andrew Vagin
  2012-06-04  8:09 ` [PATCH 1/5] sched: event: add trace events when a task starts executing on a cpu Andrew Vagin
@ 2012-06-04  8:09 ` Andrew Vagin
  2012-06-04  8:09 ` [PATCH 3/5] sched: save a previous state on task_struct Andrew Vagin
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 15+ messages in thread
From: Andrew Vagin @ 2012-06-04  8:09 UTC (permalink / raw)
  Cc: Arun Sharma, Oleg Strikov, Steven Rostedt, Frederic Weisbecker,
	Ingo Molnar, Peter Zijlstra, linux-kernel

Otherwice they can't be filtered for a defined task.

perf record -e sched:sched_switch ./foo

This command doesn't report any event without this patch.

Signed-off-by: Andrew Vagin <avagin@openvz.org>
---
 kernel/sched/core.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 39eb601..e17a3ec 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1911,12 +1911,12 @@ static inline void
 prepare_task_switch(struct rq *rq, struct task_struct *prev,
 		    struct task_struct *next)
 {
+	trace_sched_switch(prev, next);
 	sched_info_switch(prev, next);
 	perf_event_task_sched_out(prev, next);
 	fire_sched_out_preempt_notifiers(prev, next);
 	prepare_lock_switch(rq, next);
 	prepare_arch_switch(next);
-	trace_sched_switch(prev, next);
 }
 
 /**
@@ -1976,6 +1976,7 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
 		kprobe_flush_task(prev);
 		put_task_struct(prev);
 	}
+	trace_sched_switch_finish(current);
 }
 
 #ifdef CONFIG_SMP
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 3/5] sched: save a previous state on task_struct
  2012-06-04  8:09 [RFC] [PATCH 0/5] Teach perf tool to profile sleep times (V4) Andrew Vagin
  2012-06-04  8:09 ` [PATCH 1/5] sched: event: add trace events when a task starts executing on a cpu Andrew Vagin
  2012-06-04  8:09 ` [PATCH 2/5] sched: send the event sched_switch before perf_event_task_sched_out Andrew Vagin
@ 2012-06-04  8:09 ` Andrew Vagin
  2012-06-04  8:09 ` [PATCH 4/5] perf: teach "perf inject" to work with files Andrew Vagin
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 15+ messages in thread
From: Andrew Vagin @ 2012-06-04  8:09 UTC (permalink / raw)
  Cc: Arun Sharma, Oleg Strikov, Steven Rostedt, Frederic Weisbecker,
	Ingo Molnar, Peter Zijlstra, linux-kernel

prev_state is a state of task, when it stops executing on cpu.

It's needed for filtering sched:sched_switch_finish.
For example If we want to track when a task is sleeping:

./perf record -e sched:sched_switch -g --filter 'prev_state == 1' \
	-e sched:sched_switch_finish --filter 'state == 1' -P  foo

The previous state is set in sched:sched_switch, so it works when
both events sched:sched_switch and sched:sched_switch_finish are
enabled.

Signed-off-by: Andrew Vagin <avagin@openvz.org>
---
 include/linux/sched.h        |    3 +++
 include/trace/events/sched.h |   12 +++++++++++-
 2 files changed, 14 insertions(+), 1 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f45c0b2..172bbe3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1576,6 +1576,9 @@ struct task_struct {
 	struct uprobe_task *utask;
 	int uprobe_srcu_id;
 #endif
+#ifdef CONFIG_EVENT_TRACING
+	volatile long prev_state;
+#endif
 };
 
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index b2219d8..c1b70b1 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -133,6 +133,7 @@ TRACE_EVENT(sched_switch,
 		__entry->prev_pid	= prev->pid;
 		__entry->prev_prio	= prev->prio;
 		__entry->prev_state	= __trace_sched_switch_state(prev);
+		prev->prev_state	= __trace_sched_switch_state(prev);
 		memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
 		__entry->next_pid	= next->pid;
 		__entry->next_prio	= next->prio;
@@ -158,14 +159,23 @@ TRACE_EVENT(sched_switch_finish,
 	TP_STRUCT__entry(
 		__array(	char,	comm,	TASK_COMM_LEN	)
 		__field(	pid_t,	pid			)
+		__field(	long,	state			)
 	),
 
 	TP_fast_assign(
 		__entry->pid	= p->pid;
+		__entry->state	= p->prev_state;
 		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
 	),
 
-	TP_printk("comm=%s pid=%d", __entry->comm, __entry->pid)
+	TP_printk("comm=%s pid=%d state=%s%s",
+		__entry->comm, __entry->pid,
+		__entry->state & (TASK_STATE_MAX-1) ?
+		  __print_flags(__entry->state & (TASK_STATE_MAX-1), "|",
+				{ 1, "S"} , { 2, "D" }, { 4, "T" }, { 8, "t" },
+				{ 16, "Z" }, { 32, "X" }, { 64, "x" },
+				{ 128, "W" }) : "R",
+		__entry->state & TASK_STATE_MAX ? "+" : "")
 );
 
 /*
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 4/5] perf: teach "perf inject" to work with files
  2012-06-04  8:09 [RFC] [PATCH 0/5] Teach perf tool to profile sleep times (V4) Andrew Vagin
                   ` (2 preceding siblings ...)
  2012-06-04  8:09 ` [PATCH 3/5] sched: save a previous state on task_struct Andrew Vagin
@ 2012-06-04  8:09 ` Andrew Vagin
  2012-06-04  8:09 ` [PATCH 5/5] perf: teach perf inject to merge sched_switch* events for profiling sleep times Andrew Vagin
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 15+ messages in thread
From: Andrew Vagin @ 2012-06-04  8:09 UTC (permalink / raw)
  Cc: Arun Sharma, Oleg Strikov, Steven Rostedt, Frederic Weisbecker,
	Ingo Molnar, Peter Zijlstra, linux-kernel

Before this patch "perf inject" can only handle data from pipe.

I want to use "perf inject" for reworking events. Look at my following patch.

Signed-off-by: Andrew Vagin <avagin@openvz.org>
---
 tools/perf/builtin-inject.c |   33 +++++++++++++++++++++++++++++++--
 1 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/tools/perf/builtin-inject.c b/tools/perf/builtin-inject.c
index 3beab48..bc30a40 100644
--- a/tools/perf/builtin-inject.c
+++ b/tools/perf/builtin-inject.c
@@ -14,7 +14,12 @@
 
 #include "util/parse-options.h"
 
-static char		const *input_name = "-";
+static char		const *input_name	= "-";
+static const char	*output_name		= "-";
+static int		pipe_output		= 0;
+static int		output;
+static u64		bytes_written		= 0;
+
 static bool		inject_build_ids;
 
 static int perf_event__repipe_synth(struct perf_tool *tool __used,
@@ -27,12 +32,14 @@ static int perf_event__repipe_synth(struct perf_tool *tool __used,
 	size = event->header.size;
 
 	while (size) {
-		int ret = write(STDOUT_FILENO, buf, size);
+		int ret = write(output, buf, size);
 		if (ret < 0)
 			return -errno;
 
 		size -= ret;
 		buf += ret;
+
+		bytes_written += ret;
 	}
 
 	return 0;
@@ -244,8 +251,14 @@ static int __cmd_inject(void)
 	if (session == NULL)
 		return -ENOMEM;
 
+	if (!pipe_output)
+		lseek(output, session->header.data_offset, SEEK_SET);
 	ret = perf_session__process_events(session, &perf_inject);
 
+	if (!pipe_output) {
+		session->header.data_size = bytes_written;
+		perf_session__write_header(session, session->evlist, output, true);
+	}
 	perf_session__delete(session);
 
 	return ret;
@@ -259,6 +272,10 @@ static const char * const report_usage[] = {
 static const struct option options[] = {
 	OPT_BOOLEAN('b', "build-ids", &inject_build_ids,
 		    "Inject build-ids into the output stream"),
+	OPT_STRING('i', "input", &input_name, "file",
+		    "input file name"),
+	OPT_STRING('o', "output", &output_name, "file",
+		    "output file name"),
 	OPT_INCR('v', "verbose", &verbose,
 		 "be more verbose (show build ids, etc)"),
 	OPT_END()
@@ -274,6 +291,18 @@ int cmd_inject(int argc, const char **argv, const char *prefix __used)
 	if (argc)
 		usage_with_options(report_usage, options);
 
+	if (!strcmp(output_name, "-")) {
+		pipe_output = 1;
+		output = STDOUT_FILENO;
+	} else {
+		output = open(output_name, O_CREAT| O_WRONLY | O_TRUNC,
+							S_IRUSR | S_IWUSR);
+		if (output < 0) {
+			perror("failed to create output file");
+			exit(-1);
+		}
+	}
+
 	if (symbol__init() < 0)
 		return -1;
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 5/5] perf: teach perf inject to merge sched_switch* events for profiling sleep times
  2012-06-04  8:09 [RFC] [PATCH 0/5] Teach perf tool to profile sleep times (V4) Andrew Vagin
                   ` (3 preceding siblings ...)
  2012-06-04  8:09 ` [PATCH 4/5] perf: teach "perf inject" to work with files Andrew Vagin
@ 2012-06-04  8:09 ` Andrew Vagin
  2012-06-04 12:40 ` [RFC] [PATCH 0/5] Teach perf tool to profile sleep times (V4) Peter Zijlstra
  2012-06-04 17:03 ` Arun Sharma
  6 siblings, 0 replies; 15+ messages in thread
From: Andrew Vagin @ 2012-06-04  8:09 UTC (permalink / raw)
  Cc: Arun Sharma, Oleg Strikov, Steven Rostedt, Frederic Weisbecker,
	Ingo Molnar, Peter Zijlstra, linux-kernel

The time difference between sched_switch and sched_switch_finish is a time
when a task wasn't being executed on a cpu.

My code saves the time of sched_switch event for each process and
when it meets stat_switch_finish, it sets time difference in sample->period
and saves this new sample in the new perf.data.

As a result we can see where and how long task sleeps.

Signed-off-by: Andrew Vagin <avagin@openvz.org>
---
 tools/perf/builtin-inject.c |   88 ++++++++++++++++++++++++++++++++++++++++--
 1 files changed, 83 insertions(+), 5 deletions(-)

diff --git a/tools/perf/builtin-inject.c b/tools/perf/builtin-inject.c
index bc30a40..631051e 100644
--- a/tools/perf/builtin-inject.c
+++ b/tools/perf/builtin-inject.c
@@ -13,6 +13,8 @@
 #include "util/debug.h"
 
 #include "util/parse-options.h"
+#include "util/trace-event.h"
+
 
 static char		const *input_name	= "-";
 static const char	*output_name		= "-";
@@ -21,6 +23,9 @@ static int		output;
 static u64		bytes_written		= 0;
 
 static bool		inject_build_ids;
+static bool		inject_sched_stat;
+
+struct perf_session	*session;
 
 static int perf_event__repipe_synth(struct perf_tool *tool __used,
 				    union perf_event *event,
@@ -47,7 +52,7 @@ static int perf_event__repipe_synth(struct perf_tool *tool __used,
 
 static int perf_event__repipe_op2_synth(struct perf_tool *tool,
 					union perf_event *event,
-					struct perf_session *session __used)
+					struct perf_session *s __used)
 {
 	return perf_event__repipe_synth(tool, event, NULL);
 }
@@ -59,7 +64,7 @@ static int perf_event__repipe_event_type_synth(struct perf_tool *tool,
 }
 
 static int perf_event__repipe_tracing_data_synth(union perf_event *event,
-						 struct perf_session *session __used)
+						 struct perf_session *s __used)
 {
 	return perf_event__repipe_synth(NULL, event, NULL);
 }
@@ -119,12 +124,12 @@ static int perf_event__repipe_task(struct perf_tool *tool,
 }
 
 static int perf_event__repipe_tracing_data(union perf_event *event,
-					   struct perf_session *session)
+					   struct perf_session *s)
 {
 	int err;
 
 	perf_event__repipe_synth(NULL, event, NULL);
-	err = perf_event__process_tracing_data(event, session);
+	err = perf_event__process_tracing_data(event, s);
 
 	return err;
 }
@@ -210,6 +215,75 @@ repipe:
 	return 0;
 }
 
+struct event_entry
+{
+	struct list_head list;
+	u32 pid;
+	u64 time;
+};
+
+static LIST_HEAD(samples);
+
+static int perf_event__sched_stat(struct perf_tool *tool,
+				      union perf_event *event,
+				      struct perf_sample *sample,
+				      struct perf_evsel *evsel __used,
+				      struct machine *machine)
+{
+	int type;
+	struct event_format *e;
+	const char *evname = NULL;
+	struct event_entry *ent;
+
+	type = trace_parse_common_type(sample->raw_data);
+	e = trace_find_event(type);
+	if (e)
+		evname = e->name;
+
+	if (!strcmp(evname, "sched_switch")) {
+		list_for_each_entry(ent, &samples, list)
+			if (sample->pid == ent->pid)
+				break;
+
+		if (&ent->list != &samples) {
+			list_del(&ent->list);
+			free(ent);
+		}
+
+		ent = malloc(sizeof(struct event_entry));
+		ent->pid = sample->pid;
+		ent->time = sample->time;
+		list_add(&ent->list, &samples);
+		return 0;
+
+	} else if (!strcmp(evname, "sched_switch_finish")) {
+		union perf_event *event_sw;
+
+		list_for_each_entry(ent, &samples, list)
+			if (sample->pid == ent->pid)
+				break;
+
+		if (&ent->list == &samples) {
+			pr_debug("Could not find sched_switch for pid %u\n", sample->pid);
+			return 0;
+		}
+
+		event_sw = malloc(event->header.size);
+		if (event_sw == NULL)
+			return -1;
+
+		memcpy(event_sw, event, event->header.size);
+		sample->period = sample->time - ent->time;
+		perf_session__synthesize_sample(session, event_sw, sample);
+		perf_event__repipe(tool, event_sw, sample, machine);
+		free(event_sw);
+		return 0;
+	}
+
+	perf_event__repipe(tool, event, sample, machine);
+
+	return 0;
+}
 struct perf_tool perf_inject = {
 	.sample		= perf_event__repipe_sample,
 	.mmap		= perf_event__repipe,
@@ -235,7 +309,6 @@ static void sig_handler(int sig __attribute__((__unused__)))
 
 static int __cmd_inject(void)
 {
-	struct perf_session *session;
 	int ret = -EINVAL;
 
 	signal(SIGINT, sig_handler);
@@ -245,6 +318,9 @@ static int __cmd_inject(void)
 		perf_inject.mmap	 = perf_event__repipe_mmap;
 		perf_inject.fork	 = perf_event__repipe_task;
 		perf_inject.tracing_data = perf_event__repipe_tracing_data;
+	} else if (inject_sched_stat) {
+		perf_inject.sample	= perf_event__sched_stat;
+		perf_inject.ordered_samples = true;
 	}
 
 	session = perf_session__new(input_name, O_RDONLY, false, true, &perf_inject);
@@ -272,6 +348,8 @@ static const char * const report_usage[] = {
 static const struct option options[] = {
 	OPT_BOOLEAN('b', "build-ids", &inject_build_ids,
 		    "Inject build-ids into the output stream"),
+	OPT_BOOLEAN('s', "sched-switch", &inject_sched_stat,
+		    "merge sched-switch* for profiling sleep time"),
 	OPT_STRING('i', "input", &input_name, "file",
 		    "input file name"),
 	OPT_STRING('o', "output", &output_name, "file",
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [RFC] [PATCH 0/5] Teach perf tool to profile sleep times (V4)
  2012-06-04  8:09 [RFC] [PATCH 0/5] Teach perf tool to profile sleep times (V4) Andrew Vagin
                   ` (4 preceding siblings ...)
  2012-06-04  8:09 ` [PATCH 5/5] perf: teach perf inject to merge sched_switch* events for profiling sleep times Andrew Vagin
@ 2012-06-04 12:40 ` Peter Zijlstra
  2012-06-04 12:50   ` Peter Zijlstra
                     ` (4 more replies)
  2012-06-04 17:03 ` Arun Sharma
  6 siblings, 5 replies; 15+ messages in thread
From: Peter Zijlstra @ 2012-06-04 12:40 UTC (permalink / raw)
  To: Andrew Vagin
  Cc: Arun Sharma, Oleg Strikov, Steven Rostedt, Frederic Weisbecker,
	Ingo Molnar, linux-kernel

On Mon, 2012-06-04 at 12:09 +0400, Andrew Vagin wrote:
> This series resolves a problem with events filtering and non-root users.
> Before I tried to use sched_stat* events, but it occurs in a context of
> another task. For this case logic of perf_event_context* doesn't work.
> 
> This series adds one more event sched:sched_switch_finish, it's sent
> when a task starts executing on a cpu. When a task goes out from cpu,
> sched:sched_switch is sent.
> 
> The time difference between sched_switch and sched_switch_finish is a time
> when a task wasn't being executed on a cpu.


So I really don't like this stuff either.. :-)

Mostly because it adds very similar stuff to what's already there and
then doesn't try and consolidate things.

And partly because you're measuring the 'wrong' thing, the delay between
sched-out and sched-in includes a number of things, only one part of
that is the delay due to blocking.

And as a result you need that hideous prev_state crap.


Would something like the below work?

The one thing I'm not entirely sure of is if this is a sekjoerity issue
or not.. anybody? I would think a task was entitled to know who woke it
and wherefrom etc..

---
 include/linux/ftrace_event.h |    5 +++--
 include/linux/perf_event.h   |    3 ++-
 include/trace/events/sched.h |    4 ++++
 include/trace/ftrace.h       |    6 +++++-
 kernel/events/core.c         |   19 ++++++++++++++++++-
 kernel/sched/core.c          |    2 +-
 6 files changed, 33 insertions(+), 6 deletions(-)

diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
index 176a939..55b96e3 100644
--- a/include/linux/ftrace_event.h
+++ b/include/linux/ftrace_event.h
@@ -303,9 +303,10 @@ extern void *perf_trace_buf_prepare(int size, unsigned short type,
 
 static inline void
 perf_trace_buf_submit(void *raw_data, int size, int rctx, u64 addr,
-		       u64 count, struct pt_regs *regs, void *head)
+		       u64 count, struct pt_regs *regs, void *head,
+		       struct task_struct *task)
 {
-	perf_tp_event(addr, count, raw_data, size, regs, head, rctx);
+	perf_tp_event(addr, count, raw_data, size, regs, head, rctx, task);
 }
 #endif
 
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index f325786..a3d7b90 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1269,7 +1269,8 @@ static inline bool perf_paranoid_kernel(void)
 extern void perf_event_init(void);
 extern void perf_tp_event(u64 addr, u64 count, void *record,
 			  int entry_size, struct pt_regs *regs,
-			  struct hlist_head *head, int rctx);
+			  struct hlist_head *head, int rctx,
+			  struct task_struct *task);
 extern void perf_bp_event(struct perf_event *event, void *data);
 
 #ifndef perf_misc_flags
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index ea7a203..31da561 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -75,6 +75,10 @@ DECLARE_EVENT_CLASS(sched_wakeup_template,
 		__entry->target_cpu	= task_cpu(p);
 	),
 
+	TP_perf_assign(
+		__perf_task(p);
+	),
+
 	TP_printk("comm=%s pid=%d prio=%d success=%d target_cpu=%03d",
 		  __entry->comm, __entry->pid, __entry->prio,
 		  __entry->success, __entry->target_cpu)
diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h
index 7697249..db14daf 100644
--- a/include/trace/ftrace.h
+++ b/include/trace/ftrace.h
@@ -711,6 +711,9 @@ __attribute__((section("_ftrace_events"))) *__event_##call = &event_##call
 #undef __perf_count
 #define __perf_count(c) __count = (c)
 
+#undef __perf_task
+#define __perf_task(t) __task = (t)
+
 #undef TP_perf_assign
 #define TP_perf_assign(args...) args
 
@@ -724,6 +727,7 @@ perf_trace_##call(void *__data, proto)					\
 	struct ftrace_raw_##call *entry;				\
 	struct pt_regs __regs;						\
 	u64 __addr = 0, __count = 1;					\
+	struct task_struct *__task = NULL;				\
 	struct hlist_head *head;					\
 	int __entry_size;						\
 	int __data_size;						\
@@ -751,7 +755,7 @@ perf_trace_##call(void *__data, proto)					\
 									\
 	head = this_cpu_ptr(event_call->perf_events);			\
 	perf_trace_buf_submit(entry, __entry_size, rctx, __addr,	\
-		__count, &__regs, head);				\
+		__count, &__regs, head, __task);			\
 }
 
 /*
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 5b06cbb..cb84cbd6 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5204,7 +5204,8 @@ static int perf_tp_event_match(struct perf_event *event,
 }
 
 void perf_tp_event(u64 addr, u64 count, void *record, int entry_size,
-		   struct pt_regs *regs, struct hlist_head *head, int rctx)
+		   struct pt_regs *regs, struct hlist_head *head, int rctx,
+		   struct task_struct *task)
 {
 	struct perf_sample_data data;
 	struct perf_event *event;
@@ -5223,6 +5224,22 @@ void perf_tp_event(u64 addr, u64 count, void *record, int entry_size,
 			perf_swevent_event(event, count, &data, regs);
 	}
 
+	/*
+	 * If we got specified a target task, also iterate its context and
+	 * deliver this event there too.
+	 */
+	if (task) {
+		struct perf_event_context *ctx;
+
+		rcu_read_lock();
+		ctx = rcu_dereference(task->perf_event_ctxp[perf_sw_context]);
+		list_for_each_entry_rcu(event, &ctx->event_list, event_entry) {
+			if (perf_tp_event_match(event, &data, regs))
+				perf_swevent_event(event, count, &data, regs);
+		}
+		rcu_read_unlock();
+	}
+
 	perf_swevent_put_recursion_context(rctx);
 }
 EXPORT_SYMBOL_GPL(perf_tp_event);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 420e118..8ac193c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1912,11 +1912,11 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
 		    struct task_struct *next)
 {
 	sched_info_switch(prev, next);
+	trace_sched_switch(prev, next); /* before we flip the trace context */
 	perf_event_task_sched_out(prev, next);
 	fire_sched_out_preempt_notifiers(prev, next);
 	prepare_lock_switch(rq, next);
 	prepare_arch_switch(next);
-	trace_sched_switch(prev, next);
 }
 
 /**


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [RFC] [PATCH 0/5] Teach perf tool to profile sleep times (V4)
  2012-06-04 12:40 ` [RFC] [PATCH 0/5] Teach perf tool to profile sleep times (V4) Peter Zijlstra
@ 2012-06-04 12:50   ` Peter Zijlstra
  2012-06-04 13:46   ` Steven Rostedt
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 15+ messages in thread
From: Peter Zijlstra @ 2012-06-04 12:50 UTC (permalink / raw)
  To: Andrew Vagin
  Cc: Arun Sharma, Oleg Strikov, Steven Rostedt, Frederic Weisbecker,
	Ingo Molnar, linux-kernel

On Mon, 2012-06-04 at 14:40 +0200, Peter Zijlstra wrote:
> +       if (task) {
> +               struct perf_event_context *ctx;
> +
> +               rcu_read_lock();
> +               ctx = rcu_dereference(task->perf_event_ctxp[perf_sw_context]);
> +               list_for_each_entry_rcu(event, &ctx->event_list, event_entry) {

Ah, this needs a little something like the below:

			if (!(event->attr.type == PERF_TYPE_TRACEPOINT &&
			      event->attr.config == $something))
				continue;

Not exactly sure where we hid the tracepoint_id at this point..


> +                       if (perf_tp_event_match(event, &data, regs))
> +                               perf_swevent_event(event, count, &data, regs);
> +               }
> +               rcu_read_unlock();
> +       } 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] [PATCH 0/5] Teach perf tool to profile sleep times (V4)
  2012-06-04 12:40 ` [RFC] [PATCH 0/5] Teach perf tool to profile sleep times (V4) Peter Zijlstra
  2012-06-04 12:50   ` Peter Zijlstra
@ 2012-06-04 13:46   ` Steven Rostedt
  2012-06-04 17:01   ` Arun Sharma
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 15+ messages in thread
From: Steven Rostedt @ 2012-06-04 13:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Vagin, Arun Sharma, Oleg Strikov, Frederic Weisbecker,
	Ingo Molnar, linux-kernel

On Mon, 2012-06-04 at 14:40 +0200, Peter Zijlstra wrote:

> The one thing I'm not entirely sure of is if this is a sekjoerity issue
> or not.. anybody? I would think a task was entitled to know who woke it
> and wherefrom etc..

"sekjoerity"? Sure, play games with us native English speakers, who
would pronounce that as "seek-joe-rity" and be totally confused :-p
Who's joe, and why are we seeking him?

Anyway, the answer is yes it is. Well, that's because *everything* in
the kernel is a security issue. Now the real question is, can someone
use it to do harm. Well, yes. But can they use it to do more harm than
they can with other methods that exist today? Probably not.

An attacker with an unprivileged account could probably analyze a system
with just 'ps', to figure out what they can and cannot do. Perhaps they
could use perf to analyze what other things are happening, and even set
up their tools to use perf to time attacks. A wakeup can tell a user if
they were blocked on a mutex, and who just let go of that mutex to wake
the user up.

Can this information be used to continue some other kind of attack?
Maybe. But is it a big enough risk that it outweighs the usefulness of
the tool? Probably not.

As I said earlier, all kernel issues deal with finding joe. But the
question is a simple matter of risk vs usability. If you want your
system to be really secure, then lock it in a vault and do not allow
anything to connect to it. That is minimizing risk, but at the cost of
usability.

If a sysadmin doesn't want this open, then just have them keep the
paranoid level up for activating perf, and do not let users run it.

-- Steve



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] [PATCH 0/5] Teach perf tool to profile sleep times (V4)
  2012-06-04 12:40 ` [RFC] [PATCH 0/5] Teach perf tool to profile sleep times (V4) Peter Zijlstra
  2012-06-04 12:50   ` Peter Zijlstra
  2012-06-04 13:46   ` Steven Rostedt
@ 2012-06-04 17:01   ` Arun Sharma
  2012-06-04 20:52     ` Peter Zijlstra
  2012-06-06  8:32   ` Ingo Molnar
  2012-06-12 13:15   ` Andrew Wagin
  4 siblings, 1 reply; 15+ messages in thread
From: Arun Sharma @ 2012-06-04 17:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Vagin, Oleg Strikov, Steven Rostedt, Frederic Weisbecker,
	Ingo Molnar, linux-kernel

On 6/4/12 5:40 AM, Peter Zijlstra wrote:

> And as a result you need that hideous prev_state crap.
>
>
> Would something like the below work?
>
> The one thing I'm not entirely sure of is if this is a sekjoerity issue
> or not.. anybody? I would think a task was entitled to know who woke it
> and wherefrom etc..

Frederic had some comments on this topic in the last round:

http://thread.gmane.org/gmane.linux.kernel/1195368/focus=1195959

 > We should probably avoid the remote callchains, sounds like asking 
for complications everywhere.

Do they still apply? Frederic?

  -Arun

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] [PATCH 0/5] Teach perf tool to profile sleep times (V4)
  2012-06-04  8:09 [RFC] [PATCH 0/5] Teach perf tool to profile sleep times (V4) Andrew Vagin
                   ` (5 preceding siblings ...)
  2012-06-04 12:40 ` [RFC] [PATCH 0/5] Teach perf tool to profile sleep times (V4) Peter Zijlstra
@ 2012-06-04 17:03 ` Arun Sharma
  6 siblings, 0 replies; 15+ messages in thread
From: Arun Sharma @ 2012-06-04 17:03 UTC (permalink / raw)
  To: Andrew Vagin
  Cc: Oleg Strikov, Steven Rostedt, Frederic Weisbecker, Ingo Molnar,
	Peter Zijlstra, linux-kernel

On 6/4/12 1:09 AM, Andrew Vagin wrote:
>
> * "sched/tracing: Add a new tracepoint for sleeptime" by Arun Sharma
> It has been committed, but then reverted due to performance reason.

Correctness actually. The first sample could potentially be bogus and 
needs to be discarded in userspace. Hasn't been a problem in practice 
for us.

  -Arun

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] [PATCH 0/5] Teach perf tool to profile sleep times (V4)
  2012-06-04 17:01   ` Arun Sharma
@ 2012-06-04 20:52     ` Peter Zijlstra
  0 siblings, 0 replies; 15+ messages in thread
From: Peter Zijlstra @ 2012-06-04 20:52 UTC (permalink / raw)
  To: Arun Sharma
  Cc: Andrew Vagin, Oleg Strikov, Steven Rostedt, Frederic Weisbecker,
	Ingo Molnar, linux-kernel

On Mon, 2012-06-04 at 10:01 -0700, Arun Sharma wrote:
> On 6/4/12 5:40 AM, Peter Zijlstra wrote:
> 
> > And as a result you need that hideous prev_state crap.
> >
> >
> > Would something like the below work?
> >
> > The one thing I'm not entirely sure of is if this is a sekjoerity issue
> > or not.. anybody? I would think a task was entitled to know who woke it
> > and wherefrom etc..
> 
> Frederic had some comments on this topic in the last round:
> 
> http://thread.gmane.org/gmane.linux.kernel/1195368/focus=1195959
> 
>  > We should probably avoid the remote callchains, sounds like asking 
> for complications everywhere.
> 
> Do they still apply? Frederic?

Probably, but note that the approaches are slightly different. The
previous thing would wreck the callchain for everybody, the proposed
thing will make more events.

Also, you don't need the callchain on the wakeup side of things. You
only want the callchain of the sched_switch site.

We could even dis-allow the callchain unwind when event->ctx->task &&
event->ctx->task != current -- pretend the unwind failed by writing 0
entries.

That also avoids the 'difficultly' of exposing and trying to interpret
another task's callchain.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] [PATCH 0/5] Teach perf tool to profile sleep times (V4)
  2012-06-04 12:40 ` [RFC] [PATCH 0/5] Teach perf tool to profile sleep times (V4) Peter Zijlstra
                     ` (2 preceding siblings ...)
  2012-06-04 17:01   ` Arun Sharma
@ 2012-06-06  8:32   ` Ingo Molnar
  2012-06-12 13:15   ` Andrew Wagin
  4 siblings, 0 replies; 15+ messages in thread
From: Ingo Molnar @ 2012-06-06  8:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Vagin, Arun Sharma, Oleg Strikov, Steven Rostedt,
	Frederic Weisbecker, linux-kernel


* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> The one thing I'm not entirely sure of is if this is a 
> sekjoerity issue or not.. anybody? I would think a task was 
> entitled to know who woke it and wherefrom etc..

I wouldn't worry about that really in the first approximation. 
/proc already gives out much of that information already.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] [PATCH 0/5] Teach perf tool to profile sleep times (V4)
  2012-06-04 12:40 ` [RFC] [PATCH 0/5] Teach perf tool to profile sleep times (V4) Peter Zijlstra
                     ` (3 preceding siblings ...)
  2012-06-06  8:32   ` Ingo Molnar
@ 2012-06-12 13:15   ` Andrew Wagin
  2012-06-12 17:46     ` Peter Zijlstra
  4 siblings, 1 reply; 15+ messages in thread
From: Andrew Wagin @ 2012-06-12 13:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Arun Sharma, Oleg Strikov, Steven Rostedt, Frederic Weisbecker,
	Ingo Molnar, linux-kernel

Hello,

> So I really don't like this stuff either.. :-)
>
> Mostly because it adds very similar stuff to what's already there and
> then doesn't try and consolidate things.
>
> And partly because you're measuring the 'wrong' thing, the delay between
> sched-out and sched-in includes a number of things, only one part of
> that is the delay due to blocking.
>
> And as a result you need that hideous prev_state crap.
>

Yes, you are right.

>
>
> Would something like the below work?

Yes, I like it. Are you going to commit this? Thank for this work.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] [PATCH 0/5] Teach perf tool to profile sleep times (V4)
  2012-06-12 13:15   ` Andrew Wagin
@ 2012-06-12 17:46     ` Peter Zijlstra
  0 siblings, 0 replies; 15+ messages in thread
From: Peter Zijlstra @ 2012-06-12 17:46 UTC (permalink / raw)
  To: Andrew Wagin
  Cc: Arun Sharma, Oleg Strikov, Steven Rostedt, Frederic Weisbecker,
	Ingo Molnar, linux-kernel

On Tue, 2012-06-12 at 17:15 +0400, Andrew Wagin wrote:
> Yes, I like it. Are you going to commit this? 

No its unfinished and untested.. not even sure its seen a compiler.

Also, in another email I wrote:

> Also, you don't need the callchain on the wakeup side of things. You
> only want the callchain of the sched_switch site.
> 
> We could even dis-allow the callchain unwind when event->ctx->task &&
> event->ctx->task != current -- pretend the unwind failed by writing 0
> entries.
> 
> That also avoids the 'difficultly' of exposing and trying to interpret
> another task's callchain.

That probably wants to be done as well.

If you have the time, please look into this. Hoping of course Frederic
and others don't object to this.. :-)

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2012-06-12 17:46 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-06-04  8:09 [RFC] [PATCH 0/5] Teach perf tool to profile sleep times (V4) Andrew Vagin
2012-06-04  8:09 ` [PATCH 1/5] sched: event: add trace events when a task starts executing on a cpu Andrew Vagin
2012-06-04  8:09 ` [PATCH 2/5] sched: send the event sched_switch before perf_event_task_sched_out Andrew Vagin
2012-06-04  8:09 ` [PATCH 3/5] sched: save a previous state on task_struct Andrew Vagin
2012-06-04  8:09 ` [PATCH 4/5] perf: teach "perf inject" to work with files Andrew Vagin
2012-06-04  8:09 ` [PATCH 5/5] perf: teach perf inject to merge sched_switch* events for profiling sleep times Andrew Vagin
2012-06-04 12:40 ` [RFC] [PATCH 0/5] Teach perf tool to profile sleep times (V4) Peter Zijlstra
2012-06-04 12:50   ` Peter Zijlstra
2012-06-04 13:46   ` Steven Rostedt
2012-06-04 17:01   ` Arun Sharma
2012-06-04 20:52     ` Peter Zijlstra
2012-06-06  8:32   ` Ingo Molnar
2012-06-12 13:15   ` Andrew Wagin
2012-06-12 17:46     ` Peter Zijlstra
2012-06-04 17:03 ` Arun Sharma

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.