* RE:[PATCH] sched: Add trace for task wake up latency and leave running time
@ 2020-09-02 22:35 gengdongjiu
  2020-09-03  7:42 ` peterz
  0 siblings, 1 reply; 4+ messages in thread
From: gengdongjiu @ 2020-09-02 22:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, thara.gopinath, pauld, vincent.donnefort,
	rdunlap, linux-kernel

Hi Peter,
    Sorry for the late response.

> 
> > diff --git a/include/linux/sched.h b/include/linux/sched.h index
> > 93ecd930efd3..edb622c40a90 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1324,6 +1324,13 @@ struct task_struct {
> >  	/* CPU-specific state of this task: */
> >  	struct thread_struct		thread;
> >
> > +	/* Task wake up time stamp */
> > +	u64				ts_wakeup;
> > +	/* Previous task switch out time stamp */
> > +	u64				pre_ts_end;
> > +	/* Next task switch in time stamp */
> > +	u64				next_ts_start;
> > +	bool				wakeup_state;
> >  	/*
> >  	 * WARNING: on x86, 'thread_struct' contains a variable-sized
> >  	 * structure.  It *MUST* be at the end of 'task_struct'.
> 
> ^^^ did you read that comment?
   Sorry for my carelessness.

> 
> > diff --git a/include/trace/events/sched.h
> > b/include/trace/events/sched.h index fec25b9cfbaf..e99c6d573a42 100644
> > --- a/include/trace/events/sched.h
> > +++ b/include/trace/events/sched.h
> > @@ -183,6 +183,72 @@ TRACE_EVENT(sched_switch,
> >  		__entry->next_comm, __entry->next_pid, __entry->next_prio)  );
> >
> > +DECLARE_EVENT_CLASS(sched_latency_template,
> > +
> > +	TP_PROTO(bool preempt,
> > +		 struct task_struct *prev,
> > +		 struct task_struct *next,
> > +		 u64 time),
> > +
> > +	TP_ARGS(preempt, prev, next, time),
> > +
> > +	TP_STRUCT__entry(
> > +		__array(	char,	prev_comm,	TASK_COMM_LEN	)
> > +		__field(	pid_t,	prev_pid			)
> > +		__field(	int,	prev_prio			)
> > +		__field(	long,	prev_state			)
> > +		__array(	char,	next_comm,	TASK_COMM_LEN	)
> > +		__field(	pid_t,	next_pid			)
> > +		__field(	int,	next_prio			)
> > +		__field(	u64,	time				)
> > +	),
> > +
> > +	TP_fast_assign(
> > +		memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
> > +		__entry->prev_pid	= prev->pid;
> > +		__entry->prev_prio	= prev->prio;
> > +		__entry->prev_state	= __trace_sched_switch_state(preempt, prev);
> > +		memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
> > +		__entry->next_pid	= next->pid;
> > +		__entry->next_prio	= next->prio;
> > +		__entry->time		= time;
> > +		/* XXX SCHED_DEADLINE */
> > +	),
> > +
> > +	TP_printk("prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s ==> next_comm=%s next_pid=%d next_prio=%d passed
> time=%llu (ns)",
> > +		__entry->prev_comm, __entry->prev_pid, __entry->prev_prio,
> > +
> > +		(__entry->prev_state & (TASK_REPORT_MAX - 1)) ?
> > +		  __print_flags(__entry->prev_state & (TASK_REPORT_MAX - 1), "|",
> > +				{ TASK_INTERRUPTIBLE, "S" },
> > +				{ TASK_UNINTERRUPTIBLE, "D" },
> > +				{ __TASK_STOPPED, "T" },
> > +				{ __TASK_TRACED, "t" },
> > +				{ EXIT_DEAD, "X" },
> > +				{ EXIT_ZOMBIE, "Z" },
> > +				{ TASK_PARKED, "P" },
> > +				{ TASK_DEAD, "I" }) :
> > +		  "R",
> > +
> > +		__entry->prev_state & TASK_REPORT_MAX ? "+" : "",
> > +		__entry->next_comm, __entry->next_pid, __entry->next_prio,
> > +		__entry->time)
> > +);
> 
> NAK, that tracepoint is already broken, we don't want to proliferate the broken.
  
Sorry, what do you mean by the tracepoint already being broken?
Maybe I should explain why I added the two tracepoints.
When using the perf tool or the ftrace interface to capture task wake-up latency and the time a task spends off the run queue, the trace data is usually huge and CPU utilization is high because of all the disk writes. Sometimes the disk even fills up before the issue reproduces, that is, before either of those two times exceeds a certain threshold. So I added two tracepoints; with a filter we can record only the abnormal events whose wakeup latency or off-runqueue time exceeds a threshold.
Or do you have a better solution?
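
For example, with the two proposed events the idea would roughly be the
following tracefs usage (event and field names are taken from the patch
above; the 1 ms threshold is only an example):

  # cd /sys/kernel/tracing
  # echo 'time > 1000000' > events/sched/sched_wakeup_latency/filter
  # echo 'time > 1000000' > events/sched/sched_leave_running_time/filter
  # echo 1 > events/sched/sched_wakeup_latency/enable
  # echo 1 > events/sched/sched_leave_running_time/enable
  # cat trace_pipe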

> 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c index
> > 8471a0f7eb32..b5a1928dc948 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -2464,6 +2464,8 @@ static void ttwu_do_wakeup(struct rq *rq, struct
> > task_struct *p, int wake_flags,  {
> >  	check_preempt_curr(rq, p, wake_flags);
> >  	p->state = TASK_RUNNING;
> > +	p->ts_wakeup = local_clock();
> > +	p->wakeup_state = true;
> >  	trace_sched_wakeup(p);
> >
> >  #ifdef CONFIG_SMP
> 
> NAK, userless overhead.

 At sched switch time we do not know the next task's previous state or its wakeup timestamp, so I record them when the task is woken from sleep.
 The wakeup latency can then be calculated at the task switch.


> 
> > @@ -2846,6 +2848,8 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> >  		success = 1;
> >  		trace_sched_waking(p);
> >  		p->state = TASK_RUNNING;
> > +		p->ts_wakeup = local_clock();
> > +		p->wakeup_state = true;
> >  		trace_sched_wakeup(p);
> >  		goto out;
> >  	}
> 
> idem
> 
> 
> > @@ -3355,6 +3362,8 @@ void wake_up_new_task(struct task_struct *p)
> >  	post_init_entity_util_avg(p);
> >
> >  	activate_task(rq, p, ENQUEUE_NOCLOCK);
> > +	p->ts_wakeup = local_clock();
> > +	p->wakeup_state = true;
> >  	trace_sched_wakeup_new(p);
> >  	check_preempt_curr(rq, p, WF_FORK);
> >  #ifdef CONFIG_SMP
> 
> idem
> 
> > @@ -4521,8 +4530,19 @@ static void __sched notrace __schedule(bool
> > preempt)
> >
> >  		psi_sched_switch(prev, next, !task_on_rq_queued(prev));
> >
> > +		next->next_ts_start = prev->pre_ts_end = local_clock();
> > +
> >  		trace_sched_switch(preempt, prev, next);
> >
> > +		if (next->ts_wakeup && next->wakeup_state) {
> > +			trace_sched_wakeup_latency(preempt, prev, next,
> > +				next->next_ts_start - next->ts_wakeup);
> > +			next->wakeup_state = false;
> > +		}
> > +
> > +		trace_sched_leave_running_time(preempt, prev,
> > +			next, next->next_ts_start - next->pre_ts_end);
> > +
> >  		/* Also unlocks the rq: */
> >  		rq = context_switch(rq, prev, next, &rf);
> >  	} else {
> 
> This really isn't ever going to happen.


* Re: RE:[PATCH] sched: Add trace for task wake up latency and leave running time
  2020-09-02 22:35 RE:[PATCH] sched: Add trace for task wake up latency and leave running time gengdongjiu
@ 2020-09-03  7:42 ` peterz
  2020-09-03  8:49   ` RE:[PATCH] " Mel Gorman
  0 siblings, 1 reply; 4+ messages in thread
From: peterz @ 2020-09-03  7:42 UTC (permalink / raw)
  To: gengdongjiu
  Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, thara.gopinath, pauld, vincent.donnefort,
	rdunlap, linux-kernel

On Wed, Sep 02, 2020 at 10:35:34PM +0000, gengdongjiu wrote:

> > NAK, that tracepoint is already broken, we don't want to proliferate the broken.
>   
> Sorry, what do you mean by the tracepoint already being broken?

Just that, the tracepoint is crap. But we can't fix it because ABI. Did
I tell you I utterly hate tracepoints?

> Maybe I should explain why I added the two tracepoints.
> When using the perf tool or the ftrace interface to capture task wake-up latency and the time a task spends off the run queue, the trace data is usually huge and CPU utilization is high because of all the disk writes. Sometimes the disk even fills up before the issue reproduces, that is, before either of those two times exceeds a certain threshold. So I added two tracepoints; with a filter we can record only the abnormal events whose wakeup latency or off-runqueue time exceeds a threshold.
> Or do you have a better solution?

Learn to use a MUA and wrap your lines at 78 chars like normal people.

Yes, use ftrace synthetic events, or bpf or really anything other than
this.
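
With bpf it does not even need kernel changes; a rough bpftrace sketch of
the switch-out to switch-in time (the map name and @offcpu_us are made up
here, the fields are the existing sched_switch tracepoint ones):

  # bpftrace -e 'tracepoint:sched:sched_switch {
        if (@out[args->next_pid]) {
                @offcpu_us = hist((nsecs - @out[args->next_pid]) / 1000);
                delete(@out[args->next_pid]);
        }
        @out[args->prev_pid] = nsecs;
  }'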

> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c index
> > > 8471a0f7eb32..b5a1928dc948 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -2464,6 +2464,8 @@ static void ttwu_do_wakeup(struct rq *rq, struct
> > > task_struct *p, int wake_flags,  {
> > >  	check_preempt_curr(rq, p, wake_flags);
> > >  	p->state = TASK_RUNNING;
> > > +	p->ts_wakeup = local_clock();
> > > +	p->wakeup_state = true;
> > >  	trace_sched_wakeup(p);
> > >
> > >  #ifdef CONFIG_SMP
> > 
> > NAK, userless overhead.
> 
>  At sched switch time we do not know the next task's previous state or
>  its wakeup timestamp, so I record them when the task is woken from
>  sleep.  The wakeup latency can then be calculated at the task switch.

I don't care. You're making things slower.


* Re: RE:[PATCH] sched: Add trace for task wake up latency and leave running time
  2020-09-03  7:42 ` peterz
@ 2020-09-03  8:49   ` Mel Gorman
  2020-09-15 22:41     ` Steven Rostedt
  0 siblings, 1 reply; 4+ messages in thread
From: Mel Gorman @ 2020-09-03  8:49 UTC (permalink / raw)
  To: peterz
  Cc: gengdongjiu, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, thara.gopinath, pauld,
	vincent.donnefort, rdunlap, linux-kernel

On Thu, Sep 03, 2020 at 09:42:32AM +0200, peterz@infradead.org wrote:
> > Maybe I should explain why I added the two tracepoints.
> > When using the perf tool or the ftrace interface to capture task wake-up latency and the time a task spends off the run queue, the trace data is usually huge and CPU utilization is high because of all the disk writes. Sometimes the disk even fills up before the issue reproduces, that is, before either of those two times exceeds a certain threshold. So I added two tracepoints; with a filter we can record only the abnormal events whose wakeup latency or off-runqueue time exceeds a threshold.
> > Or do you have a better solution?
> 
> <SNIP>
> 
> Yes, use ftrace synthetic events, or bpf or really anything other than
> this.
> 

In practice, this is what ends up being required. FYI, perf probe is handy
for creating a tracepoint in an arbitrary location. If BPF is not an option
(e.g. older kernel where BPF cannot do what you want or newer kernel
where BPF has regressed in some way) then systemtap is the other option.
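
As a rough sketch only (whether the probe can be placed at all depends on
the kernel version and on whether the function of interest was inlined):

  # perf probe --add ttwu_do_wakeup
  # perf record -e probe:ttwu_do_wakeup -e sched:sched_switch -a -- sleep 10
  # perf script
  # perf probe --del ttwu_do_wakeup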

> > > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c index
> > > > 8471a0f7eb32..b5a1928dc948 100644
> > > > --- a/kernel/sched/core.c
> > > > +++ b/kernel/sched/core.c
> > > > @@ -2464,6 +2464,8 @@ static void ttwu_do_wakeup(struct rq *rq, struct
> > > > task_struct *p, int wake_flags,  {
> > > >  	check_preempt_curr(rq, p, wake_flags);
> > > >  	p->state = TASK_RUNNING;
> > > > +	p->ts_wakeup = local_clock();
> > > > +	p->wakeup_state = true;
> > > >  	trace_sched_wakeup(p);
> > > >
> > > >  #ifdef CONFIG_SMP
> > > 
> > > NAK, userless overhead.
> > 
> >  At sched switch time we do not know the next task's previous state or
> >  its wakeup timestamp, so I record them when the task is woken from
> >  sleep.  The wakeup latency can then be calculated at the task switch.
> 
> I don't care. You're making things slower.

Which sucks on its own. The other problem is that you cannot trace older
kernels that do not have the updated tracepoints so it's worth at least
pointing out what the options are. In general, trying to accumulate
state inside a tracepoint is a curse because changes in the internal
implementation can render the tracepoint redundant, or worse, misleading.
It *can* be an option when developing a patch and you want detailed
trace information to guide development but be prepared to throw away the
tracepoints before submitting.

If state is to be accumulated between multiple tracepoints then the
primary options are BPF, systemtap, trace-cmd for offline analysis or
opening trace_pipe with a script, parsing the events and accumulate the
state that way. This can lead to a concern that the script must be running
from system start to track the data and yes, this is exactly what you
have to do.  It's not that different to having to monitor /proc/vmstat
over time to get some interesting vm stats as the accumulated state since
the system started is often useless.

Online analysis of trace_pipe is the most expensive for a lot of reasons
but it has the advantage of almost always working regardless of kernel
version as long as the script interprets the tracepoints correctly. It
may be too expensive for production but it's useful when figuring out
what tracepoints are needed and how that state should be accumulated.
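
A minimal sketch of that approach (the field positions assume the default
trace header layout, and the 100 microsecond threshold is arbitrary):

  # cd /sys/kernel/tracing
  # echo 1 > events/sched/sched_waking/enable
  # echo 1 > events/sched/sched_switch/enable
  # cat trace_pipe | awk '
      / sched_waking:/ { for (i = 1; i <= NF; i++) if ($i ~ /^pid=/) p = substr($i, 5);
                         t = $4; sub(/:$/, "", t); wake[p] = t }
      / sched_switch:/ { for (i = 1; i <= NF; i++) if ($i ~ /^next_pid=/) p = substr($i, 10);
                         if (p in wake) { t = $4; sub(/:$/, "", t);
                                          lat = (t - wake[p]) * 1000000;
                                          if (lat > 100) print "pid " p " woke after " lat " us";
                                          delete wake[p] } }'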

-- 
Mel Gorman
SUSE Labs


* Re: RE:[PATCH] sched: Add trace for task wake up latency and leave running time
  2020-09-03  8:49   ` RE:[PATCH] " Mel Gorman
@ 2020-09-15 22:41     ` Steven Rostedt
  0 siblings, 0 replies; 4+ messages in thread
From: Steven Rostedt @ 2020-09-15 22:41 UTC (permalink / raw)
  To: Mel Gorman
  Cc: peterz, gengdongjiu, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, bsegall, thara.gopinath, pauld,
	vincent.donnefort, rdunlap, linux-kernel

On Thu, 3 Sep 2020 09:49:25 +0100
Mel Gorman <mgorman@suse.de> wrote:

> If state is to be accumulated between multiple tracepoints then the
> primary options are BPF, systemtap, trace-cmd for offline analysis or

Note, ftrace now has histograms which use the "synthetic events" that Peter
mentioned, which can give you state between two different trace points. For
example:

  # cd /sys/kernel/tracing
  # echo latency u64 lat pid_t pid int prio > synthetic_events
  # echo 'hist:keys=pid:ts0=common_timestamp.usecs' > events/sched/sched_waking/trigger
  # echo 'hist:keys=next_pid:lat=common_timestamp.usecs-$ts0:onmatch(sched.sched_waking).trace(latency,$lat,next_pid,next_prio) if next_prio < 100' > events/sched/sched_switch/trigger

Which will trigger the synthetic event "latency" when the sched_switch
"next_pid" matches the sched_waking "pid" only if "next_prio" is less than
100 (note lower inside the kernel means higher priority, so these are only
real time tasks).

  # echo 1 > events/synthetic/latency/enable

  # cat trace
            sshd-1049  [000] d...411  3739.316424: latency: lat=3 pid=277 prio=49
          <idle>-0     [001] d...4..  3739.316466: latency: lat=30 pid=275 prio=49
            grep-9054  [000] d...4..  3739.316564: latency: lat=3 pid=277 prio=49
            grep-9054  [000] d...411  3739.316671: latency: lat=3 pid=277 prio=49
          <idle>-0     [001] d...4..  3739.316709: latency: lat=31 pid=275 prio=49
            grep-9054  [000] d...4..  3739.316759: latency: lat=3 pid=277 prio=49
            sshd-1049  [000] d...411  3739.316835: latency: lat=4 pid=277 prio=49
          <idle>-0     [001] d...4..  3739.316893: latency: lat=50 pid=275 prio=49
            sshd-1049  [000] d...4..  3739.316968: latency: lat=4 pid=277 prio=49
            sshd-1049  [000] d...4..  3739.317082: latency: lat=6 pid=277 prio=49
          <idle>-0     [001] d...4..  3739.317115: latency: lat=29 pid=275 prio=49
            sshd-1049  [000] d...411  3739.317194: latency: lat=4 pid=277 prio=49

  # echo 'hist:keys=prio,lat:sort=lat' > events/synthetic/latency/trigger
  # cat events/synthetic/latency/hist

# event histogram
#
# trigger info: hist:keys=prio,lat:vals=hitcount:sort=lat:size=2048 [active]
#

{ prio:         98, lat:          2 } hitcount:          1
{ prio:         98, lat:          3 } hitcount:          1
{ prio:          0, lat:          3 } hitcount:          2
{ prio:         98, lat:          4 } hitcount:          1
{ prio:         98, lat:          5 } hitcount:          3
{ prio:         98, lat:          6 } hitcount:          1
{ prio:         49, lat:          8 } hitcount:          2
{ prio:         49, lat:          9 } hitcount:          1
{ prio:         49, lat:         10 } hitcount:          1
{ prio:         49, lat:         11 } hitcount:          1
{ prio:         98, lat:         22 } hitcount:          2
{ prio:         98, lat:         25 } hitcount:          1
{ prio:         49, lat:         25 } hitcount:          1
{ prio:         98, lat:         26 } hitcount:          1
{ prio:         98, lat:         28 } hitcount:          2
{ prio:         49, lat:         30 } hitcount:          2
{ prio:         49, lat:         31 } hitcount:          1
{ prio:         98, lat:         33 } hitcount:          1
{ prio:         49, lat:         34 } hitcount:          1
{ prio:         49, lat:         35 } hitcount:          1
{ prio:         49, lat:         41 } hitcount:          1
{ prio:         49, lat:         44 } hitcount:          2
{ prio:         49, lat:         54 } hitcount:          1
{ prio:          0, lat:        172 } hitcount:          1

Totals:
    Hits: 32
    Entries: 24
    Dropped: 0

Just FYI ;-)

-- Steve


> opening trace_pipe with a script, parsing the events and accumulate the
> state that way. This can lead to a concern that the script must be running
> from system start to track the data and yes, this is exactly what you
> have to do.  It's not that different to having to monitor /proc/vmstat
> over time to get some interesting vm stats as the accumulated state since
> the system started is often useless.


