All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] sched: split sched_switch trace event into two
@ 2015-06-24 23:19 Cong Wang
  2015-06-25  1:05 ` Steven Rostedt
  2015-06-25  7:48 ` Peter Zijlstra
  0 siblings, 2 replies; 5+ messages in thread
From: Cong Wang @ 2015-06-24 23:19 UTC (permalink / raw)
  To: linux-kernel
  Cc: Cong Wang, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Cong Wang

Currently we only have one sched_switch trace event
for task switching, which is generated very early during
task switch. When we try to monitor per-container events,
this is not what we expect.

For example, we have a process A which is in the cgroup
we monitor, and process B which isn't, when kernel switches
from B to A, the sched_switch event is not recorded for this
cgroup since it belongs to B (current process is still B
util we finish the switch), but we require this event to
signal that process A in this cgroup gets scheduled. This is
crucial for calculating schedule latency.

So split the sched_switch event into two: sched_in event
before we perform the switch, and sched_out event after we
perform the switch.

For compatibility, the sched_switch event is not touched.

Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Cong Wang <cwang@twopensource.com>
---
 include/trace/events/sched.h | 51 +++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/core.c          |  2 ++
 2 files changed, 52 insertions(+), 1 deletion(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index d57a575..c31f1e0 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -112,8 +112,57 @@ static inline long __trace_sched_switch_state(struct task_struct *p)
 #endif /* CREATE_TRACE_POINTS */
 
 /*
- * Tracepoint for task switches, performed by the scheduler:
+ * Tracepoints for task switches, performed by the scheduler:
  */
+TRACE_EVENT(sched_out,
+
+	TP_PROTO(struct task_struct *curr),
+
+	TP_ARGS(curr),
+
+	TP_STRUCT__entry(
+		__array(	char,	comm,	TASK_COMM_LEN	)
+		__field(	int,	prio			)
+		__field(	long,	state			)
+	),
+
+	TP_fast_assign(
+		__entry->prio	= curr->prio;
+		__entry->state	= __trace_sched_switch_state(curr);
+		memcpy(__entry->comm, curr->comm, TASK_COMM_LEN);
+	),
+
+	TP_printk("comm=%s prio=%d state=%s%s",
+		__entry->comm, __entry->prio,
+		__entry->state & (TASK_STATE_MAX-1) ?
+		  __print_flags(__entry->state & (TASK_STATE_MAX-1), "|",
+				{ 1, "S"} , { 2, "D" }, { 4, "T" }, { 8, "t" },
+				{ 16, "Z" }, { 32, "X" }, { 64, "x" },
+				{ 128, "K" }, { 256, "W" }, { 512, "P" },
+				{ 1024, "N" }) : "R",
+		__entry->state & TASK_STATE_MAX ? "+" : "")
+);
+
+TRACE_EVENT(sched_in,
+
+	TP_PROTO(struct task_struct *next),
+
+	TP_ARGS(next),
+
+	TP_STRUCT__entry(
+		__array(	char,	comm,	TASK_COMM_LEN	)
+		__field(	int,	prio			)
+	),
+
+	TP_fast_assign(
+		memcpy(__entry->comm, next->comm, TASK_COMM_LEN);
+		__entry->prio	= next->prio;
+	),
+
+	TP_printk("comm=%s prio=%d",
+		__entry->comm, __entry->prio)
+);
+
 TRACE_EVENT(sched_switch,
 
 	TP_PROTO(struct task_struct *prev,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c86935a..681fc50 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2219,6 +2219,7 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
 		    struct task_struct *next)
 {
 	trace_sched_switch(prev, next);
+	trace_sched_out(prev);
 	sched_info_switch(rq, prev, next);
 	perf_event_task_sched_out(prev, next);
 	fire_sched_out_preempt_notifiers(prev, next);
@@ -2288,6 +2289,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 	}
 
 	tick_nohz_task_switch(current);
+	trace_sched_in(current);
 	return rq;
 }
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH] sched: split sched_switch trace event into two
  2015-06-24 23:19 [PATCH] sched: split sched_switch trace event into two Cong Wang
@ 2015-06-25  1:05 ` Steven Rostedt
  2015-06-25 17:04   ` Cong Wang
  2015-06-25  7:48 ` Peter Zijlstra
  1 sibling, 1 reply; 5+ messages in thread
From: Steven Rostedt @ 2015-06-25  1:05 UTC (permalink / raw)
  To: Cong Wang; +Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Cong Wang

On Wed, 24 Jun 2015 16:19:33 -0700
Cong Wang <xiyou.wangcong@gmail.com> wrote:

> For compatibility, the sched_switch event is not touched.

Yes, and sched_out() should not be added.

> 
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> Signed-off-by: Cong Wang <cwang@twopensource.com>
> ---
>  include/trace/events/sched.h | 51 +++++++++++++++++++++++++++++++++++++++++++-
>  kernel/sched/core.c          |  2 ++
>  2 files changed, 52 insertions(+), 1 deletion(-)
> 
> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index d57a575..c31f1e0 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -112,8 +112,57 @@ static inline long __trace_sched_switch_state(struct task_struct *p)
>  #endif /* CREATE_TRACE_POINTS */
>  
>  /*
> - * Tracepoint for task switches, performed by the scheduler:
> + * Tracepoints for task switches, performed by the scheduler:
>   */
> +TRACE_EVENT(sched_out,
> +
> +	TP_PROTO(struct task_struct *curr),
> +
> +	TP_ARGS(curr),
> +
> +	TP_STRUCT__entry(
> +		__array(	char,	comm,	TASK_COMM_LEN	)
> +		__field(	int,	prio			)
> +		__field(	long,	state			)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->prio	= curr->prio;
> +		__entry->state	= __trace_sched_switch_state(curr);
> +		memcpy(__entry->comm, curr->comm, TASK_COMM_LEN);
> +	),
> +
> +	TP_printk("comm=%s prio=%d state=%s%s",
> +		__entry->comm, __entry->prio,
> +		__entry->state & (TASK_STATE_MAX-1) ?
> +		  __print_flags(__entry->state & (TASK_STATE_MAX-1), "|",
> +				{ 1, "S"} , { 2, "D" }, { 4, "T" }, { 8, "t" },
> +				{ 16, "Z" }, { 32, "X" }, { 64, "x" },
> +				{ 128, "K" }, { 256, "W" }, { 512, "P" },
> +				{ 1024, "N" }) : "R",
> +		__entry->state & TASK_STATE_MAX ? "+" : "")
> +);
> +
> +TRACE_EVENT(sched_in,
> +
> +	TP_PROTO(struct task_struct *next),
> +
> +	TP_ARGS(next),
> +
> +	TP_STRUCT__entry(
> +		__array(	char,	comm,	TASK_COMM_LEN	)
> +		__field(	int,	prio			)
> +	),
> +
> +	TP_fast_assign(
> +		memcpy(__entry->comm, next->comm, TASK_COMM_LEN);
> +		__entry->prio	= next->prio;
> +	),
> +
> +	TP_printk("comm=%s prio=%d",
> +		__entry->comm, __entry->prio)
> +);
> +
>  TRACE_EVENT(sched_switch,
>  
>  	TP_PROTO(struct task_struct *prev,
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index c86935a..681fc50 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2219,6 +2219,7 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
>  		    struct task_struct *next)
>  {
>  	trace_sched_switch(prev, next);
> +	trace_sched_out(prev);

Tracepoints are low overhead, but they do take up space. This is a
useless tracepoint. If anything, I'll work on adding an alias or
something. But please don't add a tracepoint next to a tracepoint that
encompasses the data.

>  	sched_info_switch(rq, prev, next);
>  	perf_event_task_sched_out(prev, next);
>  	fire_sched_out_preempt_notifiers(prev, next);
> @@ -2288,6 +2289,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
>  	}
>  
>  	tick_nohz_task_switch(current);
> +	trace_sched_in(current);

Why not have a:

	sched_switch_post(prev, current);

That way, the hook can be useful for other tools.

-- Steve

>  	return rq;
>  }
>  


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] sched: split sched_switch trace event into two
  2015-06-24 23:19 [PATCH] sched: split sched_switch trace event into two Cong Wang
  2015-06-25  1:05 ` Steven Rostedt
@ 2015-06-25  7:48 ` Peter Zijlstra
  2015-06-25 17:17   ` Cong Wang
  1 sibling, 1 reply; 5+ messages in thread
From: Peter Zijlstra @ 2015-06-25  7:48 UTC (permalink / raw)
  To: Cong Wang; +Cc: linux-kernel, Steven Rostedt, Ingo Molnar, Cong Wang

On Wed, Jun 24, 2015 at 04:19:33PM -0700, Cong Wang wrote:
> Currently we only have one sched_switch trace event
> for task switching, which is generated very early during
> task switch. When we try to monitor per-container events,
> this is not what we expect.

Adjust your expectations?

> For example, we have a process A which is in the cgroup
> we monitor, and process B which isn't, when kernel switches
> from B to A, the sched_switch event is not recorded for this
> cgroup since it belongs to B (current process is still B
> util we finish the switch), but we require this event to
> signal that process A in this cgroup gets scheduled. This is
> crucial for calculating schedule latency.

I don't get it. This is global data in the root pid-space.

The switch data includes both the previous and the next task. Just look
up their corresponding cgroups and be done with it.

If you cannot get what you want from it, you're doing it wrong.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] sched: split sched_switch trace event into two
  2015-06-25  1:05 ` Steven Rostedt
@ 2015-06-25 17:04   ` Cong Wang
  0 siblings, 0 replies; 5+ messages in thread
From: Cong Wang @ 2015-06-25 17:04 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: LKML, Ingo Molnar, Peter Zijlstra, Cong Wang

On Wed, Jun 24, 2015 at 6:05 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index c86935a..681fc50 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -2219,6 +2219,7 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
>>                   struct task_struct *next)
>>  {
>>       trace_sched_switch(prev, next);
>> +     trace_sched_out(prev);
>
> Tracepoints are low overhead, but they do take up space. This is a
> useless tracepoint. If anything, I'll work on adding an alias or
> something. But please don't add a tracepoint next to a tracepoint that
> encompasses the data.


Good point, agreed.

The reason why I picked sched_in and sched_out is that their names are
much better than sched_switch and sched_switch_post as you proposed.

Before you add tracepoint alias, adding sched_switch_post is better,
we probably can just add sched_in and sched_out as aliases to them after
your work is done. ;)

Thanks.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] sched: split sched_switch trace event into two
  2015-06-25  7:48 ` Peter Zijlstra
@ 2015-06-25 17:17   ` Cong Wang
  0 siblings, 0 replies; 5+ messages in thread
From: Cong Wang @ 2015-06-25 17:17 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: LKML, Steven Rostedt, Ingo Molnar, Cong Wang

On Thu, Jun 25, 2015 at 12:48 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, Jun 24, 2015 at 04:19:33PM -0700, Cong Wang wrote:
>> Currently we only have one sched_switch trace event
>> for task switching, which is generated very early during
>> task switch. When we try to monitor per-container events,
>> this is not what we expect.
>
> Adjust your expectations?
>
>> For example, we have a process A which is in the cgroup
>> we monitor, and process B which isn't, when kernel switches
>> from B to A, the sched_switch event is not recorded for this
>> cgroup since it belongs to B (current process is still B
>> util we finish the switch), but we require this event to
>> signal that process A in this cgroup gets scheduled. This is
>> crucial for calculating schedule latency.
>
> I don't get it. This is global data in the root pid-space.
>
> The switch data includes both the previous and the next task. Just look
> up their corresponding cgroups and be done with it.


This is exactly how I _workaround_ this issue for now. :)

But it is not that easy, because we need to track new processes
and dying processes to maintain the pid list.

>
> If you cannot get what you want from it, you're doing it wrong.

The problem is why user-space has to do that (as mentioned
above) just to track their cgroup since we already have perf_event
cgroup.

All what I want is tracking several sched events within a given
perf_event cgroup and calculating the sched latency (like how
`perf sched` does), and sched_switch is an exception since
it is related with two processes which can be in two different
perf_event cgroups, and since it is generated early during
task switch, all sched_switch events belong to 'prev' task, we
lose the track of the sched_switch for 'next' task.

Fixing it in kernel is easier than workaround in user-space,
this is why I come up with this patch.

Thanks.

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2015-06-25 17:18 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-06-24 23:19 [PATCH] sched: split sched_switch trace event into two Cong Wang
2015-06-25  1:05 ` Steven Rostedt
2015-06-25 17:04   ` Cong Wang
2015-06-25  7:48 ` Peter Zijlstra
2015-06-25 17:17   ` Cong Wang

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.