[3/3] sched/fair: add tracepoints for cfs throttle
diff mbox series

Message ID 20180522062017.5193-4-xiyou.wangcong@gmail.com
State New, archived
Headers show
Series
  • sched/fair: improve CFS throttle
Related show

Commit Message

Cong Wang May 22, 2018, 6:20 a.m. UTC
This patch introduces two sched tracepoints to track
CFS throttle and unthrottle. The use case is to measure
how long each throttle is and to track when a CPU cgroup
is throttled and unthrottled.

Sample output:

     cfs-722   [000] dN.3    51.477702: sched_cfs_throttle: path=/test cpu=0 runtime_remaining=0
  <idle>-0     [000] d.h4    51.536659: sched_cfs_unthrottle: path=/test cpu=0 runtime_remaining=1

Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 include/trace/events/sched.h | 42 ++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c          |  2 ++
 2 files changed, 44 insertions(+)

Comments

Cong Wang May 23, 2018, 12:34 a.m. UTC | #1
On Mon, May 21, 2018 at 11:35 PM, Ivan Babrou <ibobrik@gmail.com> wrote:
>
>
> On Mon, May 21, 2018 at 11:20 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>>
>> +
>> +       TP_printk("path=%s cpu=%d runtime_remaining=%lld",
>> __get_str(cfs_path),
>> +                 __entry->cpu, __entry->runtime_remaining)
>
>
> Can you add "[ns]" as the unit to runtime_remaining? We have it in
> "sched:sched_stat_runtime".

Hmm, although it is indeed ns, this ->runtime_remaining can be negative,
unlike sched_stat_runtime. So I am not sure if I should marked it as "[ns]".
Peter Zijlstra May 23, 2018, 9:09 a.m. UTC | #2
On Mon, May 21, 2018 at 11:35:38PM -0700, Ivan Babrou wrote:
> > +       TP_printk("path=%s cpu=%d runtime_remaining=%lld",
> > __get_str(cfs_path),
> > +                 __entry->cpu, __entry->runtime_remaining)
> >
> 
> Can you add "[ns]" as the unit to runtime_remaining? We have it in
> "sched:sched_stat_runtime".

Yeah, don't worry. I hate tracepoints, I regret the existing ones, no
new ones will happen.
Cong Wang May 24, 2018, 4:40 a.m. UTC | #3
On Wed, May 23, 2018 at 2:09 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, May 21, 2018 at 11:35:38PM -0700, Ivan Babrou wrote:
>> > +       TP_printk("path=%s cpu=%d runtime_remaining=%lld",
>> > __get_str(cfs_path),
>> > +                 __entry->cpu, __entry->runtime_remaining)
>> >
>>
>> Can you add "[ns]" as the unit to runtime_remaining? We have it in
>> "sched:sched_stat_runtime".
>
> Yeah, don't worry. I hate tracepoints, I regret the existing ones, no
> new ones will happen.

No matter you hate them or not, they are useful:

https://github.com/iovisor/bcc/blob/master/tools/runqlat.py

Unlike in sched, tracepoints are welcome in networking:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/include/trace/events/tcp.h

I don't want to change your mind, just want to show the facts.
Peter Zijlstra May 24, 2018, 7:11 a.m. UTC | #4
On Wed, May 23, 2018 at 09:40:47PM -0700, Cong Wang wrote:
> On Wed, May 23, 2018 at 2:09 AM, Peter Zijlstra <peterz@infradead.org> wrote:

> > Yeah, don't worry. I hate tracepoints, I regret the existing ones, no
> > new ones will happen.
> 
> No matter you hate them or not, they are useful:

No argument there.

> Unlike in sched, tracepoints are welcome in networking:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/include/trace/events/tcp.h

They are also absolutely disallowed in the vfs..

> I don't want to change your mind, just want to show the facts.

The problem with tracepoints is that they can become ABI and you cannot
change them without breaking tools. This is a crap situation and I'm fed
up with it.
Cong Wang May 24, 2018, 10:23 p.m. UTC | #5
On Thu, May 24, 2018 at 12:11 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> The problem with tracepoints is that they can become ABI and you cannot
> change them without breaking tools. This is a crap situation and I'm fed
> up with it.

Yeah, I once used perf_event_open() to parse the trace events, I
understand how painful it is. Fortunately tools like bcc mostly hides
the ABI from end user. To be fair, kprobe has the same problem if not
worse, different compilers could inline different kernel functions, not
to mention kernel functions could change at any time.

OTOH, the tracepoint text interface is sufficient to help people to inspect
kernel internals, like in this case I can observe when a cpu cgroup is
throttled or unthrottled. Of course they are not friendly for programming.

Thanks.

Patch
diff mbox series

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index bc01e06bc716..3d9e00972e92 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -570,6 +570,48 @@  TRACE_EVENT(sched_wake_idle_without_ipi,
 
 	TP_printk("cpu=%d", __entry->cpu)
 );
+
+#ifdef CONFIG_CFS_BANDWIDTH
+DECLARE_EVENT_CLASS(sched_fair,
+
+	TP_PROTO(struct cfs_rq *cfs_rq),
+
+	TP_ARGS(cfs_rq),
+
+	TP_STRUCT__entry(
+		__field(	s64,		runtime_remaining	)
+		__field(	int,		cpu			)
+		__dynamic_array(char,		cfs_path,
+				cgroup_path(cfs_rq->tg->css.cgroup, NULL, 0) + 1)
+	),
+
+	TP_fast_assign(
+		__entry->runtime_remaining = cfs_rq->runtime_remaining;
+		__entry->cpu = cpu_of(cfs_rq->rq);
+		cgroup_path(cfs_rq->tg->css.cgroup,
+			    __get_dynamic_array(cfs_path),
+			    __get_dynamic_array_len(cfs_path));
+	),
+
+	TP_printk("path=%s cpu=%d runtime_remaining=%lld", __get_str(cfs_path),
+		  __entry->cpu, __entry->runtime_remaining)
+);
+
+DEFINE_EVENT(sched_fair, sched_cfs_throttle,
+
+	TP_PROTO(struct cfs_rq *cfs_rq),
+
+	TP_ARGS(cfs_rq)
+);
+
+DEFINE_EVENT(sched_fair, sched_cfs_unthrottle,
+
+	TP_PROTO(struct cfs_rq *cfs_rq),
+
+	TP_ARGS(cfs_rq)
+);
+#endif /* CONFIG_CFS_BANDWIDTH */
+
 #endif /* _TRACE_SCHED_H */
 
 /* This part must be outside protection */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8a13ee006f39..3bcc40f2f272 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4736,6 +4736,7 @@  static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
 
 	cfs_rq->throttled = 1;
 	cfs_rq->throttled_clock = rq_clock(rq);
+	trace_sched_cfs_throttle(cfs_rq);
 	raw_spin_lock(&cfs_b->lock);
 	empty = list_empty(&cfs_b->throttled_cfs_rq);
 
@@ -4766,6 +4767,7 @@  void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 	se = cfs_rq->tg->se[cpu_of(rq)];
 
 	cfs_rq->throttled = 0;
+	trace_sched_cfs_unthrottle(cfs_rq);
 
 	update_rq_clock(rq);