linux-kernel.vger.kernel.org archive mirror
* [PATCH] trace: reset sleep/block start time on task switch
@ 2012-01-20  2:20 Arun Sharma
  2012-01-23 11:34 ` Peter Zijlstra
  0 siblings, 1 reply; 16+ messages in thread
From: Arun Sharma @ 2012-01-20  2:20 UTC (permalink / raw)
  To: linux-kernel
  Cc: Arun Sharma, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Andrew Vagin, Frederic Weisbecker, Ingo Molnar

Without this patch, the first sample we get on a
task might be bad because of a stale sleep_start
value that wasn't reset at the last task switch
because the tracepoint was not active.

The problem can be worked around in practice via
perf record --filter "sleeptime < some-large-number",
and it's not clear whether the added code in the
context switch path is worth it.

I'm posting this patch regardless, just in case
more people start noticing this and start wondering
where the bogus numbers came from.

Signed-off-by: Arun Sharma <asharma@fb.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Andrew Vagin <avagin@openvz.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: linux-kernel@vger.kernel.org
---
 include/trace/events/sched.h |    3 ---
 kernel/sched/core.c          |    3 +++
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 6ba596b..814cdf1 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -378,9 +378,6 @@ static inline u64 trace_get_sleeptime(struct task_struct *tsk)
 
 	block = tsk->se.statistics.block_start;
 	sleep = tsk->se.statistics.sleep_start;
-	tsk->se.statistics.block_start = 0;
-	tsk->se.statistics.sleep_start = 0;
-
 	return block ? block : sleep ? sleep : 0;
 #else
 	return 0;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 457c881..6349cee 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1937,7 +1937,10 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
 	local_irq_enable();
 #endif /* __ARCH_WANT_INTERRUPTS_ON_CTXSW */
 	finish_lock_switch(rq, prev);
+
 	trace_sched_stat_sleeptime(current, rq->clock);
+	current->se.statistics.block_start = 0;
+	current->se.statistics.sleep_start = 0;
 
 	fire_sched_in_preempt_notifiers(current);
 	if (mm)
-- 
1.7.4


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH] trace: reset sleep/block start time on task switch
  2012-01-20  2:20 [PATCH] trace: reset sleep/block start time on task switch Arun Sharma
@ 2012-01-23 11:34 ` Peter Zijlstra
  2012-01-23 18:41   ` Arun Sharma
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2012-01-23 11:34 UTC (permalink / raw)
  To: Arun Sharma
  Cc: linux-kernel, Arnaldo Carvalho de Melo, Andrew Vagin,
	Frederic Weisbecker, Ingo Molnar

On Thu, 2012-01-19 at 18:20 -0800, Arun Sharma wrote:
> Without this patch, the first sample we get on a
> task might be bad because of a stale sleep_start
> value that wasn't reset at the last task switch
> because the tracepoint was not active.
> 
> The problem can be worked around via perf record
> --filter "sleeptime < some-large-number" in practice
> and it's not clear if the added code to the context
> switch path is worth it.
> 
> I'm posting this patch regardless, just in case
> more people start noticing this and start wondering
> where the bogus numbers came from.
> 
> Signed-off-by: Arun Sharma <asharma@fb.com>
> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
> Cc: Andrew Vagin <avagin@openvz.org>
> Cc: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: Ingo Molnar <mingo@elte.hu>
> Cc: linux-kernel@vger.kernel.org
> ---
>  include/trace/events/sched.h |    3 ---
>  kernel/sched/core.c          |    3 +++
>  2 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index 6ba596b..814cdf1 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -378,9 +378,6 @@ static inline u64 trace_get_sleeptime(struct task_struct *tsk)
>  
>  	block = tsk->se.statistics.block_start;
>  	sleep = tsk->se.statistics.sleep_start;
> -	tsk->se.statistics.block_start = 0;
> -	tsk->se.statistics.sleep_start = 0;
> -
>  	return block ? block : sleep ? sleep : 0;
>  #else
>  	return 0;
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 457c881..6349cee 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1937,7 +1937,10 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
>  	local_irq_enable();
>  #endif /* __ARCH_WANT_INTERRUPTS_ON_CTXSW */
>  	finish_lock_switch(rq, prev);
> +
>  	trace_sched_stat_sleeptime(current, rq->clock);
> +	current->se.statistics.block_start = 0;
> +	current->se.statistics.sleep_start = 0;
>  
>  	fire_sched_in_preempt_notifiers(current);
>  	if (mm)


It's not just your tracepoint data being wrong, it'll wreck all the
related stats :/

This'll fail to compile for !CONFIG_SCHEDSTAT, I guess. I should have
paid more attention to the initial patch; a tracepoint having
side-effects is a big no-no.

Having unconditional writes there is somewhat sad, but I suspect putting
a conditional around it isn't going to help much. Bah, can we
restructure things so we don't need this?




* Re: [PATCH] trace: reset sleep/block start time on task switch
  2012-01-23 11:34 ` Peter Zijlstra
@ 2012-01-23 18:41   ` Arun Sharma
  2012-01-23 21:03     ` Peter Zijlstra
  0 siblings, 1 reply; 16+ messages in thread
From: Arun Sharma @ 2012-01-23 18:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Arnaldo Carvalho de Melo, Andrew Vagin,
	Frederic Weisbecker, Ingo Molnar, Steven Rostedt

On 1/23/12 3:34 AM, Peter Zijlstra wrote:

> It's not just your tracepoint data being wrong, it'll wreck all related
> stats :/

I see: these stats are also used in sched_debug.c.

>
> This'll fail to compile for !CONFIG_SCHEDSTAT I guess.. I should have
> paid more attention to the initial patch, that tracepoint having
> side-effects is a big no-no.
>
> Having unconditional writes there is somewhat sad, but I suspect putting
> a conditional around it isn't going to help much..

For performance reasons?


> bah can we
> restructure things so we don't need this?
>

We can go back to the old code, where these values were getting reset in 
{en,de}queue_sleeper(). But we'll have to do it conditionally, so the 
values are preserved till context switch time, when we need them there.

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1003,6 +1003,8 @@ static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
                 if (unlikely(delta > se->statistics.sleep_max))
                         se->statistics.sleep_max = delta;

+               if (!trace_sched_stat_sleeptime_enabled())
+                       se->statistics.sleep_start = 0;
                 se->statistics.sum_sleep_runtime += delta;

                 if (tsk) {
@@ -1019,6 +1021,8 @@ static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
                 if (unlikely(delta > se->statistics.block_max))
                         se->statistics.block_max = delta;

+               if (!trace_sched_stat_sleeptime_enabled())
+                       se->statistics.block_start = 0;
                 se->statistics.sum_sleep_runtime += delta;

                 if (tsk) {

This looks pretty ugly too; I don't know how to check for a tracepoint 
being active (Steven?). The only advantage of this approach is that it's 
in the sleep/wakeup path, rather than the context switch path.

Conceptually, the following seems to be the simplest:

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1939,8 +1939,10 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
         finish_lock_switch(rq, prev);

         trace_sched_stat_sleeptime(current, rq->clock);
+#ifdef CONFIG_SCHEDSTAT
         current->se.statistics.block_start = 0;
         current->se.statistics.sleep_start = 0;
+#endif /* CONFIG_SCHEDSTAT */

Perhaps we can reorder fields in sched_statistics so we touch one 
cacheline here instead of two?

  -Arun


* Re: [PATCH] trace: reset sleep/block start time on task switch
  2012-01-23 18:41   ` Arun Sharma
@ 2012-01-23 21:03     ` Peter Zijlstra
  2012-01-23 23:02       ` Arun Sharma
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2012-01-23 21:03 UTC (permalink / raw)
  To: Arun Sharma
  Cc: linux-kernel, Arnaldo Carvalho de Melo, Andrew Vagin,
	Frederic Weisbecker, Ingo Molnar, Steven Rostedt

On Mon, 2012-01-23 at 10:41 -0800, Arun Sharma wrote:
> On 1/23/12 3:34 AM, Peter Zijlstra wrote:

> > This'll fail to compile for !CONFIG_SCHEDSTAT I guess.. I should have
> > paid more attention to the initial patch, that tracepoint having
> > side-effects is a big no-no.
> >
> > Having unconditional writes there is somewhat sad, but I suspect putting
> > a conditional around it isn't going to help much..
> 
> For performance reasons?

Yep.

> 
> > bah can we
> > restructure things so we don't need this?
> >
> 
> We can go back to the old code, where these values were getting reset in 
> {en,de}queue_sleeper(). But we'll have to do it conditionally, so the 
> values are preserved till context switch time when we need it there.
> 
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1003,6 +1003,8 @@ static void enqueue_sleeper(struct cfs_rq *cfs_rq, 
> struct sched_entity *se)
>                  if (unlikely(delta > se->statistics.sleep_max))
>                          se->statistics.sleep_max = delta;
> 
> +               if (!trace_sched_stat_sleeptime_enabled())
> +                       se->statistics.sleep_start = 0;
>                  se->statistics.sum_sleep_runtime += delta;
> 
>                  if (tsk) {
> @@ -1019,6 +1021,8 @@ static void enqueue_sleeper(struct cfs_rq *cfs_rq, 
> struct sched_entity *se)
>                  if (unlikely(delta > se->statistics.block_max))
>                          se->statistics.block_max = delta;
> 
> +               if (!trace_sched_stat_sleeptime_enabled())
> +                       se->statistics.block_start = 0;
>                  se->statistics.sum_sleep_runtime += delta;
> 
>                  if (tsk) {
> 
> This looks pretty ugly too,

Agreed, it still violates the principle that tracepoints shouldn't
actually affect the code.

>  I don't know how to check for a tracepoint 
> being active (Steven?). The only advantage of this approach is that it's 
> in the sleep/wakeup path, rather than the context switch path.

Right, thus avoiding the stores for preemptions.

> Conceptually, the following seems to be the simplest:
> 
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1939,8 +1939,10 @@ static void finish_task_switch(struct rq *rq, 
> struct task_struct *prev)
>          finish_lock_switch(rq, prev);
> 
>          trace_sched_stat_sleeptime(current, rq->clock);
> +#ifdef CONFIG_SCHEDSTAT
>          current->se.statistics.block_start = 0;
>          current->se.statistics.sleep_start = 0;
> +#endif /* CONFIG_SCHEDSTAT */
> 
> Perhaps we can reorder fields in sched_statistics so we touch one 
> cacheline here instead of two?

That might help some, but no stores for preemptions would still be best.

I was thinking something along the lines of the below: since enqueue
uses non-zero to mean it's set, and we need the values to still be
available in schedule, dequeue is the last moment we can clear them.

This would limit the stores to the blocking case; your suggestion of
moving them to the same cacheline will then get us back where we started
in terms of performance.

Or did I miss something?


---
 kernel/sched/fair.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 84adb2d..60f9ab9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1191,6 +1191,9 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 		if (entity_is_task(se)) {
 			struct task_struct *tsk = task_of(se);
 
+			se->statistics.sleep_start = 0;
+			se->statistics.block_start = 0;
+
 			if (tsk->state & TASK_INTERRUPTIBLE)
 				se->statistics.sleep_start = rq_of(cfs_rq)->clock;
 			if (tsk->state & TASK_UNINTERRUPTIBLE)



* Re: [PATCH] trace: reset sleep/block start time on task switch
  2012-01-23 21:03     ` Peter Zijlstra
@ 2012-01-23 23:02       ` Arun Sharma
  2012-01-24 14:27         ` Peter Zijlstra
  0 siblings, 1 reply; 16+ messages in thread
From: Arun Sharma @ 2012-01-23 23:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Arnaldo Carvalho de Melo, Andrew Vagin,
	Frederic Weisbecker, Ingo Molnar, Steven Rostedt

On 1/23/12 1:03 PM, Peter Zijlstra wrote:

> This would limit the stores to the blocking case, your suggestion of
> moving them to the same cacheline will then get us back where we started
> in terms of performance.
>
> Or did I miss something?
>
>
> ---
>   kernel/sched/fair.c |    3 +++
>   1 files changed, 3 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 84adb2d..60f9ab9 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1191,6 +1191,9 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>   		if (entity_is_task(se)) {
>   			struct task_struct *tsk = task_of(se);
>
> +			se->statistics.sleep_start = 0;
> +			se->statistics.block_start = 0;
> +

We might still need some additional logic to ignore sleep_start if the 
last context switch was a preemption. Test case Andrew Vagin posted on 
12/21:

nanosleep();
s = time(NULL);
while (time(NULL) - s < 4);

During the busy wait while loop, sleep_start is non-zero and the first 
sample from sched_stat_sleeptime() and anyone else doing the (now - 
sleep_start) computation would get a bogus value until the next dequeue.

I can't think of an obvious way to pass an extra parameter (bool 
preempted) to the tracepoint. Would something like this be too
intrusive?


diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index df00cb0..b69bb9a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1908,7 +1908,8 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
  * with the lock held can cause deadlocks; see schedule() for
  * details.)
  */
-static void finish_task_switch(struct rq *rq, struct task_struct *prev)
+static void finish_task_switch(struct rq *rq, struct task_struct *prev,
+			       bool preempted)
	__releases(rq->lock)
{
	struct mm_struct *mm = rq->prev_mm;
@@ -1997,7 +1998,7 @@ asmlinkage void schedule_tail(struct task_struct *prev)
{
	struct rq *rq = this_rq();

-	finish_task_switch(rq, prev);
+	finish_task_switch(rq, prev, false);

	/*
	 * FIXME: do we need to worry about rq being invalidated by the
@@ -2019,7 +2020,7 @@ asmlinkage void schedule_tail(struct task_struct *prev)
  */
static inline void
context_switch(struct rq *rq, struct task_struct *prev,
-	       struct task_struct *next)
+	       struct task_struct *next, bool preempted)
{
	struct mm_struct *mm, *oldmm;

@@ -2064,7 +2065,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
	 * CPUs since it called schedule(), thus the 'rq' on its stack
	 * frame will be invalid.
	 */
-	finish_task_switch(this_rq(), prev);
+	finish_task_switch(this_rq(), prev, preempted);
}

/*
@@ -3155,9 +3156,9 @@ pick_next_task(struct rq *rq)
static void __sched __schedule(void)
{
	struct task_struct *prev, *next;
-	unsigned long *switch_count;
	struct rq *rq;
	int cpu;
+	bool preempted;

need_resched:
	preempt_disable();
@@ -3173,7 +3174,7 @@ need_resched:

	raw_spin_lock_irq(&rq->lock);

-	switch_count = &prev->nivcsw;
+	preempted = true;
	if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
		if (unlikely(signal_pending_state(prev->state, prev))) {
			prev->state = TASK_RUNNING;
@@ -3194,7 +3195,7 @@ need_resched:
					try_to_wake_up_local(to_wakeup);
			}
		}
-		switch_count = &prev->nvcsw;
+		preempted = false;
	}

	pre_schedule(rq, prev);
@@ -3210,9 +3211,12 @@ need_resched:
	if (likely(prev != next)) {
		rq->nr_switches++;
		rq->curr = next;
-		++*switch_count;
+		if (preempted)
+			 prev->nivcsw++;
+		else
+			 prev->nvcsw++;

-		context_switch(rq, prev, next); /* unlocks the rq */
+		context_switch(rq, prev, next, preempted); /* unlocks the rq */
		/*
		 * The context switch have flipped the stack from under us
		 * and restored the local variables which were saved when

  -Arun


* Re: [PATCH] trace: reset sleep/block start time on task switch
  2012-01-23 23:02       ` Arun Sharma
@ 2012-01-24 14:27         ` Peter Zijlstra
  2012-01-24 21:46           ` Arun Sharma
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2012-01-24 14:27 UTC (permalink / raw)
  To: Arun Sharma
  Cc: linux-kernel, Arnaldo Carvalho de Melo, Andrew Vagin,
	Frederic Weisbecker, Ingo Molnar, Steven Rostedt

On Mon, 2012-01-23 at 15:02 -0800, Arun Sharma wrote:
> > +++ b/kernel/sched/fair.c
> > @@ -1191,6 +1191,9 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> >               if (entity_is_task(se)) {
> >                       struct task_struct *tsk = task_of(se);
> >
> > +                     se->statistics.sleep_start = 0;
> > +                     se->statistics.block_start = 0;
> > +
> 
> We might still need some additional logic to ignore sleep_start if the 
> last context switch was a preemption. Test case Andrew Vagin posted on 
> 12/21:
> 
> nanosleep();
> s = time(NULL);
> while (time(NULL) - s < 4);
> 
> During the busy wait while loop, sleep_start is non-zero and the first 
> sample from sched_stat_sleeptime() and anyone else doing the (now - 
> sleep_start) computation would get a bogus value until the next dequeue. 

Bah, you're right. Also, yes, your proposal is too intrusive, but that
can be fixed; I actually did, but then I noticed it's broken too: it
doesn't matter whether the schedule that schedules a task back in
preempted another task or not, what matters is whether the task we're
scheduling back in was itself preempted or recently woken. And we simply
don't know.

I'm tempted to revert 1ac9bc69 for now, userspace will simply have to
correlate trace_sched_switch() and trace_sched_stat_{sleep,blocked}(),
which shouldn't be too hard.
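As a rough sketch of the userspace correlation this would require, something like the following could work; the event-tuple layout and field names are invented for illustration, and real perf output would need parsing first:

```python
def correlate(events):
    """Match each sched_stat_{sleep,blocked} sample to the most recent
    sched_switch that switched the same task out. Events are hypothetical
    tuples: (timestamp, event_name, pid, payload_dict)."""
    last_switch_out = {}  # pid -> timestamp the task was switched out
    matched = []
    for ts, name, pid, payload in sorted(events):
        if name == "sched_switch":
            # prev_pid identifies the task leaving the CPU
            last_switch_out[payload["prev_pid"]] = ts
        elif name in ("sched_stat_sleep", "sched_stat_blocked"):
            matched.append((pid, last_switch_out.get(pid), payload["delay"]))
    return matched

# Synthetic example: task 42 blocks at t=100 and its sleep time is
# reported at t=150.
events = [
    (100, "sched_switch", 7, {"prev_pid": 42}),
    (150, "sched_stat_sleep", 42, {"delay": 50}),
]
print(correlate(events))  # [(42, 100, 50)]
```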





* Re: [PATCH] trace: reset sleep/block start time on task switch
  2012-01-24 14:27         ` Peter Zijlstra
@ 2012-01-24 21:46           ` Arun Sharma
  2012-01-25  9:20             ` Frederic Weisbecker
  0 siblings, 1 reply; 16+ messages in thread
From: Arun Sharma @ 2012-01-24 21:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Arnaldo Carvalho de Melo, Andrew Vagin,
	Frederic Weisbecker, Ingo Molnar, Steven Rostedt

On 1/24/12 6:27 AM, Peter Zijlstra wrote:

> Bah, you're right. Also yes your proposal is too intrusive, but that can
> be fixed, I actually did, but then I noticed it's broken too, it doesn't
> matter if the schedule that schedules a task back in preempted another
> task or not, what matters is if the task we're scheduling back in was
> itself preempted or recently woken. And we simply don't know.

Yes - we'd need an extra bit in the task_struct to do this right.

> I'm tempted to revert 1ac9bc69 for now, userspace will simply have to
> correlate trace_sched_switch() and trace_sched_stat_{sleep,blocked}(),
> which shouldn't be too hard.

We tried it and it didn't work very well, especially when used with perf 
record -g. There are too many uninteresting trace_sched_switch() events.

Other possibilities: make a copy of {sleep,block}_start somewhere else in 
the perf_events subsystem in the sleep/wakeup path and leave 
sched_statistics untouched.

  -Arun




* Re: [PATCH] trace: reset sleep/block start time on task switch
  2012-01-24 21:46           ` Arun Sharma
@ 2012-01-25  9:20             ` Frederic Weisbecker
  2012-01-25 19:50               ` Arun Sharma
  0 siblings, 1 reply; 16+ messages in thread
From: Frederic Weisbecker @ 2012-01-25  9:20 UTC (permalink / raw)
  To: Arun Sharma
  Cc: Peter Zijlstra, linux-kernel, Arnaldo Carvalho de Melo,
	Andrew Vagin, Ingo Molnar, Steven Rostedt

On Tue, Jan 24, 2012 at 01:46:52PM -0800, Arun Sharma wrote:
> On 1/24/12 6:27 AM, Peter Zijlstra wrote:
> 
> >Bah, you're right. Also yes your proposal is too intrusive, but that can
> >be fixed, I actually did, but then I noticed it's broken too, it doesn't
> >matter if the schedule that schedules a task back in preempted another
> >task or not, what matters is if the task we're scheduling back in was
> >itself preempted or recently woken. And we simply don't know.
> 
> Yes - we'd need an extra bit in the task_struct to do this right.
> 
> >I'm tempted to revert 1ac9bc69 for now, userspace will simply have to
> >correlate trace_sched_switch() and trace_sched_stat_{sleep,blocked}(),
> >which shouldn't be too hard.
> 
> We tried it and it didn't work very well. Especially when used with
> perf record -g. There are too many uninteresting
> trace_sched_switch() events.

You mean context switches happening when the prev task doesn't need
to block, as happens with preemption for example?

In this case you can use filters to drop context switches for
which the prev state is not S or D.

> 
> Other possibilites: make a copy of {sleep,block}_start somewhere
> else in the perf_events subsystem in the sleep/wakeup path and leave
> sched_statistics untouched.
> 
>  -Arun
> 
> 


* Re: [PATCH] trace: reset sleep/block start time on task switch
  2012-01-25  9:20             ` Frederic Weisbecker
@ 2012-01-25 19:50               ` Arun Sharma
  2012-01-25 20:15                 ` Steven Rostedt
                                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Arun Sharma @ 2012-01-25 19:50 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Peter Zijlstra, linux-kernel, Arnaldo Carvalho de Melo,
	Andrew Vagin, Ingo Molnar, Steven Rostedt

On 1/25/12 1:20 AM, Frederic Weisbecker wrote:

>>> I'm tempted to revert 1ac9bc69 for now, userspace will simply have to
>>> correlate trace_sched_switch() and trace_sched_stat_{sleep,blocked}(),
>>> which shouldn't be too hard.
>>
>> We tried it and it didn't work very well. Especially when used with
>> perf record -g. There are too many uninteresting
>> trace_sched_switch() events.
>
> You mean context switches happening when the prev task doesn't need
> to block or so? As it happens with preemption for example?
>
> In this case you can use filters to drop context switches for
> which the prev state is not S or D.

We had these filters and still couldn't keep up:

# perf record -agP -e sched:sched_switch --filter "prev_state == 1 || prev_state == 2" -e  sched:sched_stat_sleep -- sleep 3
[ perf record: Woken up 107 times to write data ]
[ perf record: Captured and wrote 32.629 MB perf.data (~1425585 samples) ]
Warning:
Processed 104066 events and lost 4 chunks!

Check IO/CPU overload!

  -Arun


* Re: [PATCH] trace: reset sleep/block start time on task switch
  2012-01-25 19:50               ` Arun Sharma
@ 2012-01-25 20:15                 ` Steven Rostedt
  2012-01-25 22:29                   ` Arun Sharma
  2012-01-26  2:21                 ` Frederic Weisbecker
  2012-02-10 18:43                 ` Peter Zijlstra
  2 siblings, 1 reply; 16+ messages in thread
From: Steven Rostedt @ 2012-01-25 20:15 UTC (permalink / raw)
  To: Arun Sharma
  Cc: Frederic Weisbecker, Peter Zijlstra, linux-kernel,
	Arnaldo Carvalho de Melo, Andrew Vagin, Ingo Molnar

On Wed, 2012-01-25 at 11:50 -0800, Arun Sharma wrote:

> # perf record -agP -e sched:sched_switch --filter "prev_state == 1 || 
> prev_state == 2" -e  sched:sched_stat_sleep -- sleep 3
> [ perf record: Woken up 107 times to write data ]
> [ perf record: Captured and wrote 32.629 MB perf.data (~1425585 samples) ]
> Warning:
> Processed 104066 events and lost 4 chunks!
> 
> Check IO/CPU overload!

Just for kicks, could you try trace-cmd?

git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/trace-cmd.git

trace-cmd record -e sched_switch -f 'prev_state == 1 || prev_state == 2' -e sched_stat_sleep sleep 3

I'm not sure the filter is even needed, and it should still keep up
fine.

-- Steve




* Re: [PATCH] trace: reset sleep/block start time on task switch
  2012-01-25 20:15                 ` Steven Rostedt
@ 2012-01-25 22:29                   ` Arun Sharma
  2012-01-26  2:27                     ` Frederic Weisbecker
  0 siblings, 1 reply; 16+ messages in thread
From: Arun Sharma @ 2012-01-25 22:29 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Frederic Weisbecker, Peter Zijlstra, linux-kernel,
	Arnaldo Carvalho de Melo, Andrew Vagin, Ingo Molnar

On 1/25/12 12:15 PM, Steven Rostedt wrote:

> trace-cmd record -e sched_switch -f 'prev_state == 1 || prev_state == 2' -e sched_stat_sleep sleep 3
>
> I'm not sure the filter is even needed, and it should still keep up
> fine.

Better - I didn't see any overruns with trace-cmd.

Even if we resolve the sampling-rate-related problems, there is the 
issue of: can we trust that a sampled sched_switch event and a sampled 
sched_stat_sleep event actually match each other?

There is consensus here that touching an extra cacheline in the context 
switch path is a good trade-off given the usefulness of sleep profiling 
in troubleshooting latency problems.

Since it may not be a good trade-off for everyone, we just need to 
figure out where to store some per-task state. To recap two potential 
paths to investigate:

* store sleep_start someplace else that no one other than 
sched_stat_sleeptime() knows about.

* store state in task struct that remembers whether the last context 
switch was a preemption or not.

  -Arun


* Re: [PATCH] trace: reset sleep/block start time on task switch
  2012-01-25 19:50               ` Arun Sharma
  2012-01-25 20:15                 ` Steven Rostedt
@ 2012-01-26  2:21                 ` Frederic Weisbecker
  2012-02-10 18:43                 ` Peter Zijlstra
  2 siblings, 0 replies; 16+ messages in thread
From: Frederic Weisbecker @ 2012-01-26  2:21 UTC (permalink / raw)
  To: Arun Sharma
  Cc: Peter Zijlstra, linux-kernel, Arnaldo Carvalho de Melo,
	Andrew Vagin, Ingo Molnar, Steven Rostedt

On Wed, Jan 25, 2012 at 11:50:53AM -0800, Arun Sharma wrote:
> On 1/25/12 1:20 AM, Frederic Weisbecker wrote:
> 
> >>>I'm tempted to revert 1ac9bc69 for now, userspace will simply have to
> >>>correlate trace_sched_switch() and trace_sched_stat_{sleep,blocked}(),
> >>>which shouldn't be too hard.
> >>
> >>We tried it and it didn't work very well. Especially when used with
> >>perf record -g. There are too many uninteresting
> >>trace_sched_switch() events.
> >
> >You mean context switches happening when the prev task doesn't need
> >to block or so? As it happens with preemption for example?
> >
> >In this case you can use filters to drop context switches for
> >which the prev state is not S or D.
> 
> We had these filters and still couldn't keep up:
> 
> # perf record -agP -e sched:sched_switch --filter "prev_state == 1
> || prev_state == 2" -e  sched:sched_stat_sleep -- sleep 3
> [ perf record: Woken up 107 times to write data ]
> [ perf record: Captured and wrote 32.629 MB perf.data (~1425585 samples) ]
> Warning:
> Processed 104066 events and lost 4 chunks!
> 
> Check IO/CPU overload!

Have you tried to tweak the -m options to increase the size of the buffer?

> 
>  -Arun


* Re: [PATCH] trace: reset sleep/block start time on task switch
  2012-01-25 22:29                   ` Arun Sharma
@ 2012-01-26  2:27                     ` Frederic Weisbecker
  2012-01-26 19:13                       ` Arun Sharma
  0 siblings, 1 reply; 16+ messages in thread
From: Frederic Weisbecker @ 2012-01-26  2:27 UTC (permalink / raw)
  To: Arun Sharma
  Cc: Steven Rostedt, Peter Zijlstra, linux-kernel,
	Arnaldo Carvalho de Melo, Andrew Vagin, Ingo Molnar

On Wed, Jan 25, 2012 at 02:29:14PM -0800, Arun Sharma wrote:
> On 1/25/12 12:15 PM, Steven Rostedt wrote:
> 
> >trace-cmd record -e sched_switch -f 'prev_state == 1 || prev_state == 2' -e sched_stat_sleep sleep 3
> >
> >I'm not sure the filter is even needed, and it should still keep up
> >fine.
> 
> Better - I didn't see any overruns with trace-cmd.
> 
> Even if we resolve the sampling rate related problems, there is the
> issue of: can we trust that a sampled sched_switch event and a
> sampled sched_stat_sleep event actually match each other?

Well, a sched_stat_sleep event should match the sched_switch with
prev as the last targeted task.

Or am I missing something?

> 
> There is consensus here that touching an extra cacheline in the
> context switch path is a good trade-off given the usefulness of
> sleep profiling in troubleshooting latency problems.
> 
> Since it may not be a good trade-off for everyone, we just need to
> figure out where to store some per-task state. To recap two
> potential paths to investigate:
> 
> * store sleep_start someplace else that no one other than
> sched_stat_sleeptime() knows about.
> 
> * store state in task struct that remembers whether the last context
> switch was a preemption or not.
> 
>  -Arun


* Re: [PATCH] trace: reset sleep/block start time on task switch
  2012-01-26  2:27                     ` Frederic Weisbecker
@ 2012-01-26 19:13                       ` Arun Sharma
  0 siblings, 0 replies; 16+ messages in thread
From: Arun Sharma @ 2012-01-26 19:13 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Steven Rostedt, Peter Zijlstra, linux-kernel,
	Arnaldo Carvalho de Melo, Andrew Vagin, Ingo Molnar

On 1/25/12 6:27 PM, Frederic Weisbecker wrote:

>> Even if we resolve the sampling rate related problems, there is the
>> issue of: can we trust that a sampled sched_switch event and a
>> sampled sched_stat_sleep event actually match each other?
>
> Well, a sched_stat_sleep event should match the sched_switch with
> prev as the last targeted task.
>
> Or am I missing something?

I was thinking about large systems with 100k+ sched_switch events, where 
the user wants to just sample 1k events/sec.

perf record -F 1000 -e sched:sched_switch -a -- sleep 1

It doesn't look like we're using sampling for tracepoint events? The rest 
of this makes sense only if we're sampling tracepoint events (e.g. to 
limit the impact on the system being profiled/traced).

On a system with 50k sched_switch and 10k sched_stat_sleep events, if we 
sample at 1000 events/sec, we may lose an event of interest either due 
to finite buffer sizes or to sampling, which is why single-event-based 
sleep profiling is of interest to us.

On 1/25/12 6:21 PM, Frederic Weisbecker wrote:
 >
 > Have you tried to tweak the -m options to increase the size of the buffer?

-m4:

Processed 276875 events and lost 4772 chunks!

Check IO/CPU overload!

-m5:

   Fatal: failed to mmap with 22 (Invalid argument)

  -Arun


* Re: [PATCH] trace: reset sleep/block start time on task switch
  2012-01-25 19:50               ` Arun Sharma
  2012-01-25 20:15                 ` Steven Rostedt
  2012-01-26  2:21                 ` Frederic Weisbecker
@ 2012-02-10 18:43                 ` Peter Zijlstra
  2012-02-10 20:07                   ` Arun Sharma
  2 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2012-02-10 18:43 UTC (permalink / raw)
  To: Arun Sharma
  Cc: Frederic Weisbecker, linux-kernel, Arnaldo Carvalho de Melo,
	Andrew Vagin, Ingo Molnar, Steven Rostedt

On Wed, 2012-01-25 at 11:50 -0800, Arun Sharma wrote:
> # perf record -agP -e sched:sched_switch --filter "prev_state == 1 || 
> prev_state == 2" -e  sched:sched_stat_sleep -- sleep 3 

Problem with that is it'll try and do a backtrace for all events, not
only the sched_switch but also sched_stat_sleep, which is completely
superfluous (and expensive).

Currently the perf tool doesn't support individual -g flags, though, and
it's slightly non-trivial to make it so. The easiest work-around would be
two records and then merging the two files.



* Re: [PATCH] trace: reset sleep/block start time on task switch
  2012-02-10 18:43                 ` Peter Zijlstra
@ 2012-02-10 20:07                   ` Arun Sharma
  0 siblings, 0 replies; 16+ messages in thread
From: Arun Sharma @ 2012-02-10 20:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, linux-kernel, Arnaldo Carvalho de Melo,
	Andrew Vagin, Ingo Molnar, Steven Rostedt

On 2/10/12 10:43 AM, Peter Zijlstra wrote:
> On Wed, 2012-01-25 at 11:50 -0800, Arun Sharma wrote:
>> # perf record -agP -e sched:sched_switch --filter "prev_state == 1 ||
>> prev_state == 2" -e  sched:sched_stat_sleep -- sleep 3
>
> Problem with that is it'll try and do a backtrace for all events, not
> only the sched_switch but also sched_stat_sleep, which is completely
> superfluous (and expensive).
>
> Currently perf tool doesn't support individual -g flags though and it's
> slightly non-trivial to make it so. Easiest work-around would be two
> records and then merge the two files.

If we have a sequence of:

SL1, SW1, SL2, SW2, SL3, SW3

I think you're suggesting:

perf.data1: SL1, SL2, SL3 (without -g)
perf.data2: SW1, SW2, SW3 (with -g)

and then munge the data to create perf.data. But making sure that all 
matching sleep and switch events are captured seems hard.

  -Arun



