All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/4] Tracepoints for deadline scheduler
@ 2016-02-22 17:08 Daniel Bristot de Oliveira
  2016-02-22 17:08 ` [PATCH 1/4] tracing: Add __print_ns_to_secs() and __print_ns_without_secs() helpers Daniel Bristot de Oliveira
                   ` (3 more replies)
  0 siblings, 4 replies; 22+ messages in thread
From: Daniel Bristot de Oliveira @ 2016-02-22 17:08 UTC (permalink / raw)
  To: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Thomas Gleixner, Juri Lelli
  Cc: Arnaldo Carvalho de Melo, LKML, linux-rt-users

Deadline tasks behave differently of other tasks because deadline
task's also depend on their period, deadline and runtime.

Hence, the well known sched:sched_wakeup and sched:sched_switch
tracepoints are not always enough to precisely explain the behavior of a
deadline task with respect to the task's period, deadline and runtime
consumption.

This patch series implements tracepoints for the following deadline
scheduler points of interest:

- sched:sched_deadline_replenish: Informs the periodic runtime
replenishment of a deadline task.

- sched:sched_deadline_yield: Informs that a deadline task called
sched_yield(), and will wait for the next period.

- sched:sched_deadline_throttle: Informs that a task consumed all its
available runtime and was throttled.

- sched:sched_deadline_block: Informs that a deadline task went to sleep
waiting to be awakened by another task.

Daniel Bristot de Oliveira (3):
  sched: Move deadline container_of() helper functions into sched.h
  sched/deadline: Tracepoints for deadline scheduler
  tools lib traceevent: Implements '%' operation

Steven Rostedt (Red Hat) (1):
  tracing: Add __print_ns_to_secs() and __print_ns_without_secs()
    helpers

 include/trace/events/sched.h       | 87 ++++++++++++++++++++++++++++++++++++++
 include/trace/trace_events.h       | 25 +++++++++++
 kernel/sched/deadline.c            | 34 +++++++--------
 kernel/sched/sched.h               | 18 ++++++++
 tools/lib/traceevent/event-parse.c |  4 ++
 5 files changed, 149 insertions(+), 19 deletions(-)

-- 
2.5.0

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 1/4] tracing: Add __print_ns_to_secs() and __print_ns_without_secs() helpers
  2016-02-22 17:08 [PATCH 0/4] Tracepoints for deadline scheduler Daniel Bristot de Oliveira
@ 2016-02-22 17:08 ` Daniel Bristot de Oliveira
  2016-02-22 17:08 ` [PATCH 2/4] sched: Move deadline container_of() helper functions into sched.h Daniel Bristot de Oliveira
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 22+ messages in thread
From: Daniel Bristot de Oliveira @ 2016-02-22 17:08 UTC (permalink / raw)
  To: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Thomas Gleixner, Juri Lelli
  Cc: Arnaldo Carvalho de Melo, LKML, linux-rt-users

From: "Steven Rostedt (Red Hat)" <rostedt@goodmis.org>

To have nanosecond output displayed in a more human readable format, its
nicer to convert it to a seconds format (XXX.YYYYYYYYY). The problem is that
to do so, the numbers must be divided by NSEC_PER_SEC, and moded too. But as
these numbers are 64 bit, this can not be done simply with '/' and '%'
operators, but must use do_div() instead.

Instead of performing the expensive do_div() in the hot path of the
tracepoint, it is more efficient to perform it during the output phase. But
passing in do_div() can confuse the parser, and do_div() doesn't work
exactly like a normal C function. It modifies the number in place, and we
don't want to modify the actual values in the ring buffer.

Two helper functions are now created:

  __print_ns_to_secs() and __print_ns_without_secs()

They both take a value of nanoseconds, and the former will return that
number divided by NSEC_PER_SEC, and the latter will mod it with NSEC_PER_SEC
giving a way to print a nice human readable format:

 __print_fmt("time=%llu.%09u",
	__print_ns_to_secs(REC->nsec_val),
	__print_ns_without_secs(REC->nsec_val))

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 include/trace/trace_events.h | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/include/trace/trace_events.h b/include/trace/trace_events.h
index 170c93b..23a5822 100644
--- a/include/trace/trace_events.h
+++ b/include/trace/trace_events.h
@@ -283,6 +283,21 @@ TRACE_MAKE_SYSTEM_STR();
 		trace_print_symbols_seq(p, value, symbols);		\
 	})
 
+#undef __print_ns_to_secs
+#define __print_ns_to_secs(value)			\
+	({						\
+		u64 ____val = (u64)value;		\
+		do_div(____val, NSEC_PER_SEC);		\
+		____val;				\
+	})
+
+#undef __print_ns_without_secs
+#define __print_ns_without_secs(value)			\
+	({						\
+		u64 ____val = (u64)value;		\
+		do_div(____val, NSEC_PER_SEC);		\
+	})
+
 #undef __print_symbolic_u64
 #if BITS_PER_LONG == 32
 #define __print_symbolic_u64(value, symbol_array...)			\
@@ -720,6 +735,16 @@ static inline void ftrace_test_probe_##call(void)			\
 #undef __get_bitmask
 #undef __print_array
 
+/*
+ * The below is not executed in the kernel. It is only what is
+ * displayed in the print format for userspace to parse.
+ */
+#undef __print_ns_to_secs
+#define __print_ns_to_secs(val) val / 1000000000UL
+
+#undef __print_ns_without_secs
+#define __print_ns_without_secs(val) val % 1000000000UL
+
 #undef TP_printk
 #define TP_printk(fmt, args...) "\"" fmt "\", "  __stringify(args)
 
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 2/4] sched: Move deadline container_of() helper functions into sched.h
  2016-02-22 17:08 [PATCH 0/4] Tracepoints for deadline scheduler Daniel Bristot de Oliveira
  2016-02-22 17:08 ` [PATCH 1/4] tracing: Add __print_ns_to_secs() and __print_ns_without_secs() helpers Daniel Bristot de Oliveira
@ 2016-02-22 17:08 ` Daniel Bristot de Oliveira
  2016-02-22 17:08 ` [PATCH 3/4] sched/deadline: Tracepoints for deadline scheduler Daniel Bristot de Oliveira
  2016-02-22 17:08 ` [PATCH 4/4] tools lib traceevent: Implements '%' operation Daniel Bristot de Oliveira
  3 siblings, 0 replies; 22+ messages in thread
From: Daniel Bristot de Oliveira @ 2016-02-22 17:08 UTC (permalink / raw)
  To: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Thomas Gleixner, Juri Lelli
  Cc: Arnaldo Carvalho de Melo, LKML, linux-rt-users

Move dl_task_of(), dl_rq_of_se() and rq_of_dl_rq() helper functions
from kernel/sched/deadline.c to kernel/sched/sched.h, so they
can be used on other scheduler files.

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
---
 kernel/sched/deadline.c | 18 ------------------
 kernel/sched/sched.h    | 18 ++++++++++++++++++
 2 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index cd64c97..2f6d25a 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -20,24 +20,6 @@
 
 struct dl_bandwidth def_dl_bandwidth;
 
-static inline struct task_struct *dl_task_of(struct sched_dl_entity *dl_se)
-{
-	return container_of(dl_se, struct task_struct, dl);
-}
-
-static inline struct rq *rq_of_dl_rq(struct dl_rq *dl_rq)
-{
-	return container_of(dl_rq, struct rq, dl);
-}
-
-static inline struct dl_rq *dl_rq_of_se(struct sched_dl_entity *dl_se)
-{
-	struct task_struct *p = dl_task_of(dl_se);
-	struct rq *rq = task_rq(p);
-
-	return &rq->dl;
-}
-
 static inline int on_dl_rq(struct sched_dl_entity *dl_se)
 {
 	return !RB_EMPTY_NODE(&dl_se->rb_node);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 10f1637..b6e55e1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -749,6 +749,24 @@ static inline void rq_clock_skip_update(struct rq *rq, bool skip)
 		rq->clock_skip_update &= ~RQCF_REQ_SKIP;
 }
 
+static inline struct task_struct *dl_task_of(struct sched_dl_entity *dl_se)
+{
+	return container_of(dl_se, struct task_struct, dl);
+}
+
+static inline struct rq *rq_of_dl_rq(struct dl_rq *dl_rq)
+{
+	return container_of(dl_rq, struct rq, dl);
+}
+
+static inline struct dl_rq *dl_rq_of_se(struct sched_dl_entity *dl_se)
+{
+	struct task_struct *p = dl_task_of(dl_se);
+	struct rq *rq = task_rq(p);
+
+	return &rq->dl;
+}
+
 #ifdef CONFIG_NUMA
 enum numa_topology_type {
 	NUMA_DIRECT,
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 3/4] sched/deadline: Tracepoints for deadline scheduler
  2016-02-22 17:08 [PATCH 0/4] Tracepoints for deadline scheduler Daniel Bristot de Oliveira
  2016-02-22 17:08 ` [PATCH 1/4] tracing: Add __print_ns_to_secs() and __print_ns_without_secs() helpers Daniel Bristot de Oliveira
  2016-02-22 17:08 ` [PATCH 2/4] sched: Move deadline container_of() helper functions into sched.h Daniel Bristot de Oliveira
@ 2016-02-22 17:08 ` Daniel Bristot de Oliveira
  2016-02-22 17:32   ` Peter Zijlstra
  2016-02-22 17:48   ` kbuild test robot
  2016-02-22 17:08 ` [PATCH 4/4] tools lib traceevent: Implements '%' operation Daniel Bristot de Oliveira
  3 siblings, 2 replies; 22+ messages in thread
From: Daniel Bristot de Oliveira @ 2016-02-22 17:08 UTC (permalink / raw)
  To: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Thomas Gleixner, Juri Lelli
  Cc: Arnaldo Carvalho de Melo, LKML, linux-rt-users

Deadline tasks behave differently of other tasks because deadline
task's also depend on their period, deadline and runtime.

Hence, the well known sched:sched_wakeup and sched:sched_switch
tracepoints are not always enough to precisely explain the behavior of a
deadline task with respect to the task's period, deadline, and runtime
consumption and replenishment.

In order to provide more precise information about the scheduling of
deadline tasks, this patch implements the following tracepoints:

- sched:sched_deadline_replenish: Informs the runtime replenishment of
a deadline task. Trace example:

  <idle>-0     [010] d.h.   268.428878: sched_deadline_replenish: \
	comm=y pid=1608 now=268.428876113 deadline=268.458863627  \
	runtime=20000000

The task y-1608 was replenished with 20000000 ns at 268.428876113
to be used until the deadline at 268.458863627.

- sched:sched_deadline_yield: Informs that a deadline task called
sched_yield(), and will wait for the next period. Trace example:

  y-1608  [010] d...   268.428892: sched_deadline_yield: 	\
	now=268.428891932 deadline=268.458863627		\
	remaining_runtime=19995653

The task y-1608 yielded before its deadline, with 19995653 ns of
remaining runtime.

- sched:sched_deadline_throttle: Informs that a task consumed all its
available runtime and was throttled. Trace example:

  t-1613  [011] d.h.   280.419823: sched_deadline_throttle:	\
	now=280.419823282 deadline=280.430683206 		\
	remaining_runtime=-13280

The task t-1613 overrun its runtime by 13280 ns and was throttled.

- sched:sched_deadline_block: Informs that a deadline task went to sleep
waiting to be awakened by another task. Trace example:

  b-1611  [011] d...   276.228298: sched_deadline_block: 	\
	now=276.228295889 deadline=276.258262555
	remaining_runtime=19966666

The task b-1611 blocked waiting for an external event. Its deadline is at
276.258262555, and it stills have 19966666 ns of remaining runtime on the
current period.

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
---
 include/trace/events/sched.h | 87 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/deadline.c      | 16 +++++++-
 2 files changed, 102 insertions(+), 1 deletion(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 9b90c57..1847c29 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -562,6 +562,93 @@ TRACE_EVENT(sched_wake_idle_without_ipi,
 
 	TP_printk("cpu=%d", __entry->cpu)
 );
+
+/*
+ * Tracepoint for runtime replenishment of deadline tasks:
+ */
+TRACE_EVENT(sched_deadline_replenish,
+
+	TP_PROTO(struct sched_dl_entity *dl_se),
+
+	TP_ARGS(dl_se),
+
+	TP_STRUCT__entry(
+		__array(	char,	comm,	TASK_COMM_LEN	)
+		__field(	pid_t,	pid			)
+		__field(	u64,	now			)
+		__field(	u64,	deadline		)
+		__field(	s64,	runtime			)
+	),
+
+	TP_fast_assign(
+		memcpy(__entry->comm, dl_task_of(dl_se)->comm, TASK_COMM_LEN);
+		__entry->now		= rq_clock(rq_of_dl_rq(dl_rq_of_se(dl_se)));
+		__entry->pid		= dl_task_of(dl_se)->pid;
+		__entry->deadline	= dl_se->deadline;
+		__entry->runtime	= dl_se->runtime;
+	),
+
+	TP_printk("comm=%s pid=%d now=%llu.%09u deadline=%llu.%09u runtime=%lld",
+		  __entry->comm,
+		  __entry->pid,
+		  __print_ns_to_secs(__entry->now),
+		  __print_ns_without_secs(__entry->now),
+		  __print_ns_to_secs(__entry->deadline),
+		  __print_ns_without_secs(__entry->deadline),
+		  __entry->runtime)
+);
+
+DECLARE_EVENT_CLASS(sched_deadline_template,
+
+	TP_PROTO(struct sched_dl_entity *dl_se),
+
+	TP_ARGS(dl_se),
+
+	TP_STRUCT__entry(
+		__field(	u64,	now			)
+		__field(	u64,	deadline		)
+		__field(	s64,	runtime			)
+	),
+
+	TP_fast_assign(
+		__entry->now		= rq_clock(rq_of_dl_rq(dl_rq_of_se(dl_se)));
+		__entry->deadline	= dl_se->deadline;
+		__entry->runtime	= dl_se->runtime;
+	),
+
+	TP_printk("now=%llu.%09u deadline=%llu.%09u remaining_runtime=%lld",
+		  __print_ns_to_secs(__entry->now),
+		  __print_ns_without_secs(__entry->now),
+		  __print_ns_to_secs(__entry->deadline),
+		  __print_ns_without_secs(__entry->deadline),
+		  __entry->runtime)
+);
+
+/*
+ * Tracepoint for sched_yield() of a deadline task (the task
+ * went to sleep waiting for the next period)
+ */
+DEFINE_EVENT(sched_deadline_template, sched_deadline_yield,
+	     TP_PROTO(struct sched_dl_entity *dl_se),
+	     TP_ARGS(dl_se));
+
+/*
+ * Tracepoint for throttling of a deadline task that consumed all its
+ * runtime
+ */
+DEFINE_EVENT_CONDITION(sched_deadline_template, sched_deadline_throttle,
+	     TP_PROTO(struct sched_dl_entity *dl_se),
+	     TP_ARGS(dl_se),
+	     TP_CONDITION(!dl_se->dl_yielded));
+
+/*
+ * Tracepoint for blocking of a deadline task. The deadline task was
+ * dequeued, but neither by being throttled nor yielding.
+ */
+DEFINE_EVENT_CONDITION(sched_deadline_template, sched_deadline_block,
+	     TP_PROTO(struct sched_dl_entity *dl_se),
+	     TP_ARGS(dl_se),
+	     TP_CONDITION(!dl_se->dl_yielded && !dl_se->dl_throttled));
 #endif /* _TRACE_SCHED_H */
 
 /* This part must be outside protection */
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 2f6d25a..56d0d09 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -17,6 +17,7 @@
 #include "sched.h"
 
 #include <linux/slab.h>
+#include <trace/events/sched.h>
 
 struct dl_bandwidth def_dl_bandwidth;
 
@@ -344,6 +345,7 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se,
 	dl_se->deadline = rq_clock(rq) + pi_se->dl_deadline;
 	dl_se->runtime = pi_se->dl_runtime;
 	dl_se->dl_new = 0;
+	trace_sched_deadline_replenish(dl_se);
 }
 
 /*
@@ -407,6 +409,8 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se,
 		dl_se->runtime = pi_se->dl_runtime;
 	}
 
+	trace_sched_deadline_replenish(dl_se);
+
 	if (dl_se->dl_yielded)
 		dl_se->dl_yielded = 0;
 	if (dl_se->dl_throttled)
@@ -495,6 +499,7 @@ static void update_dl_entity(struct sched_dl_entity *dl_se,
 	    dl_entity_overflow(dl_se, pi_se, rq_clock(rq))) {
 		dl_se->deadline = rq_clock(rq) + pi_se->dl_deadline;
 		dl_se->runtime = pi_se->dl_runtime;
+		trace_sched_deadline_replenish(dl_se);
 	}
 }
 
@@ -734,6 +739,7 @@ static void update_curr_dl(struct rq *rq)
 	dl_se->runtime -= dl_se->dl_yielded ? 0 : delta_exec;
 	if (dl_runtime_exceeded(dl_se)) {
 		dl_se->dl_throttled = 1;
+		trace_sched_deadline_throttle(dl_se);
 		__dequeue_task_dl(rq, curr, 0);
 		if (unlikely(dl_se->dl_boosted || !start_dl_timer(curr)))
 			enqueue_task_dl(rq, curr, ENQUEUE_REPLENISH);
@@ -910,6 +916,7 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se,
 static void dequeue_dl_entity(struct sched_dl_entity *dl_se)
 {
 	__dequeue_dl_entity(dl_se);
+	trace_sched_deadline_block(dl_se);
 }
 
 static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
@@ -978,6 +985,7 @@ static void yield_task_dl(struct rq *rq)
 {
 	struct task_struct *p = rq->curr;
 
+	update_rq_clock(rq);
 	/*
 	 * We make the task go to sleep until its current deadline by
 	 * forcing its runtime to zero. This way, update_curr_dl() stops
@@ -986,9 +994,15 @@ static void yield_task_dl(struct rq *rq)
 	 */
 	if (p->dl.runtime > 0) {
 		rq->curr->dl.dl_yielded = 1;
+		if (trace_sched_deadline_yield_enabled()) {
+			u64 delta_exec = rq_clock_task(rq) - p->se.exec_start;
+			/* Subtract the last run till now */
+			if (likely((s64)delta_exec > 0))
+				p->dl.runtime -= delta_exec;
+			trace_sched_deadline_yield(&p->dl);
+		}
 		p->dl.runtime = 0;
 	}
-	update_rq_clock(rq);
 	update_curr_dl(rq);
 	/*
 	 * Tell update_rq_clock() that we've just updated,
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 4/4] tools lib traceevent: Implements '%' operation
  2016-02-22 17:08 [PATCH 0/4] Tracepoints for deadline scheduler Daniel Bristot de Oliveira
                   ` (2 preceding siblings ...)
  2016-02-22 17:08 ` [PATCH 3/4] sched/deadline: Tracepoints for deadline scheduler Daniel Bristot de Oliveira
@ 2016-02-22 17:08 ` Daniel Bristot de Oliveira
  2016-02-22 20:23   ` Steven Rostedt
  2016-02-25  5:41   ` [tip:perf/core] tools lib traceevent: Implement " tip-bot for Daniel Bristot de Oliveira
  3 siblings, 2 replies; 22+ messages in thread
From: Daniel Bristot de Oliveira @ 2016-02-22 17:08 UTC (permalink / raw)
  To: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Thomas Gleixner, Juri Lelli
  Cc: Arnaldo Carvalho de Melo, LKML, linux-rt-users

The operation '%' is not implemented on event-parse.c, causing
an error on the parse of events with '%' operation on its
printk format. For example,

# perf record -e sched:sched_deadline_yield ~/y
  Warning: [sched:sched_deadline_yield] unknown op '%'
....
# perf script
  Warning: [sched:sched_deadline_yield] unknown op '%'
	y  1641 [006]  3364.109319: sched:sched_deadline_yield: \
			[FAILED TO PARSE] now=3364109314595 	\
			deadline=3364139295135 runtime=19975597

This patch implements the '%' operation. With this patch, we see the
correct output:

# perf record -e sched:sched_deadline_yield ~/y
  No Warning

# perf script
	y  4005 [001]  4623.650978: sched:sched_deadline_yield: \
		now=4623.650974050				\
		deadline=4623.680957364 remaining_runtime=19979611

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
---
 tools/lib/traceevent/event-parse.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/tools/lib/traceevent/event-parse.c b/tools/lib/traceevent/event-parse.c
index c3bd294..575e751 100644
--- a/tools/lib/traceevent/event-parse.c
+++ b/tools/lib/traceevent/event-parse.c
@@ -1951,6 +1951,7 @@ process_op(struct event_format *event, struct print_arg *arg, char **tok)
 		   strcmp(token, "*") == 0 ||
 		   strcmp(token, "^") == 0 ||
 		   strcmp(token, "/") == 0 ||
+		   strcmp(token, "%") == 0 ||
 		   strcmp(token, "<") == 0 ||
 		   strcmp(token, ">") == 0 ||
 		   strcmp(token, "<=") == 0 ||
@@ -3689,6 +3690,9 @@ eval_num_arg(void *data, int size, struct event_format *event, struct print_arg
 		case '/':
 			val = left / right;
 			break;
+		case '%':
+			val = left % right;
+			break;
 		case '*':
 			val = left * right;
 			break;
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/4] sched/deadline: Tracepoints for deadline scheduler
  2016-02-22 17:08 ` [PATCH 3/4] sched/deadline: Tracepoints for deadline scheduler Daniel Bristot de Oliveira
@ 2016-02-22 17:32   ` Peter Zijlstra
  2016-02-22 17:48     ` Steven Rostedt
  2016-02-22 17:48   ` kbuild test robot
  1 sibling, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2016-02-22 17:32 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: Steven Rostedt, Ingo Molnar, Thomas Gleixner, Juri Lelli,
	Arnaldo Carvalho de Melo, LKML, linux-rt-users

On Mon, Feb 22, 2016 at 02:08:21PM -0300, Daniel Bristot de Oliveira wrote:
> Deadline tasks behave differently of other tasks because deadline
> task's also depend on their period, deadline and runtime.
> 
> Hence, the well known sched:sched_wakeup and sched:sched_switch
> tracepoints are not always enough to precisely explain the behavior of a
> deadline task with respect to the task's period, deadline, and runtime
> consumption and replenishment.
> 
> In order to provide more precise information about the scheduling of
> deadline tasks, this patch implements the following tracepoints:
> 
> - sched:sched_deadline_replenish: Informs the runtime replenishment of
> a deadline task. Trace example:
> 
>   <idle>-0     [010] d.h.   268.428878: sched_deadline_replenish: \
> 	comm=y pid=1608 now=268.428876113 deadline=268.458863627  \
> 	runtime=20000000
> 
> The task y-1608 was replenished with 20000000 ns at 268.428876113
> to be used until the deadline at 268.458863627.
> 
> - sched:sched_deadline_yield: Informs that a deadline task called
> sched_yield(), and will wait for the next period. Trace example:
> 
>   y-1608  [010] d...   268.428892: sched_deadline_yield: 	\
> 	now=268.428891932 deadline=268.458863627		\
> 	remaining_runtime=19995653
> 
> The task y-1608 yielded before its deadline, with 19995653 ns of
> remaining runtime.
> 
> - sched:sched_deadline_throttle: Informs that a task consumed all its
> available runtime and was throttled. Trace example:
> 
>   t-1613  [011] d.h.   280.419823: sched_deadline_throttle:	\
> 	now=280.419823282 deadline=280.430683206 		\
> 	remaining_runtime=-13280
> 
> The task t-1613 overrun its runtime by 13280 ns and was throttled.
> 
> - sched:sched_deadline_block: Informs that a deadline task went to sleep
> waiting to be awakened by another task. Trace example:
> 
>   b-1611  [011] d...   276.228298: sched_deadline_block: 	\
> 	now=276.228295889 deadline=276.258262555
> 	remaining_runtime=19966666
> 
> The task b-1611 blocked waiting for an external event. Its deadline is at
> 276.258262555, and it stills have 19966666 ns of remaining runtime on the
> current period.

So I'm a bit allergic to tracepoints and this is very flimsy on reasons
why I would want to do this.

As it stands, the existing tracepoint have already been an ABI
trainwreck, why would I want to add more?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/4] sched/deadline: Tracepoints for deadline scheduler
  2016-02-22 17:08 ` [PATCH 3/4] sched/deadline: Tracepoints for deadline scheduler Daniel Bristot de Oliveira
  2016-02-22 17:32   ` Peter Zijlstra
@ 2016-02-22 17:48   ` kbuild test robot
  1 sibling, 0 replies; 22+ messages in thread
From: kbuild test robot @ 2016-02-22 17:48 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: kbuild-all, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Juri Lelli, Arnaldo Carvalho de Melo, LKML,
	linux-rt-users

[-- Attachment #1: Type: text/plain, Size: 7068 bytes --]

Hi Daniel,

[auto build test WARNING on tip/sched/core]
[also build test WARNING on v4.5-rc5 next-20160222]
[if your patch is applied to the wrong git tree, please drop us a note to help improving the system]

url:    https://github.com/0day-ci/linux/commits/Daniel-Bristot-de-Oliveira/Tracepoints-for-deadline-scheduler/20160223-011407
config: i386-randconfig-s0-201608 (attached as .config)
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All warnings (new ones prefixed by >>):

   In file included from include/trace/define_trace.h:95:0,
                    from include/trace/events/sched.h:655,
                    from kernel/sched/core.c:91:
   include/trace/events/sched.h: In function 'trace_raw_output_sched_deadline_replenish':
>> include/trace/events/sched.h:591:12: warning: format '%u' expects argument of type 'unsigned int', but argument 6 has type 'long unsigned int' [-Wformat=]
     TP_printk("comm=%s pid=%d now=%llu.%09u deadline=%llu.%09u runtime=%lld",
               ^
   include/trace/trace_events.h:342:22: note: in definition of macro 'DECLARE_EVENT_CLASS'
     trace_seq_printf(s, print);     \
                         ^
   include/trace/trace_events.h:65:9: note: in expansion of macro 'PARAMS'
            PARAMS(print));         \
            ^
   include/trace/events/sched.h:569:1: note: in expansion of macro 'TRACE_EVENT'
    TRACE_EVENT(sched_deadline_replenish,
    ^
>> include/trace/events/sched.h:591:2: note: in expansion of macro 'TP_printk'
     TP_printk("comm=%s pid=%d now=%llu.%09u deadline=%llu.%09u runtime=%lld",
     ^
   include/trace/events/sched.h:591:12: warning: format '%u' expects argument of type 'unsigned int', but argument 8 has type 'long unsigned int' [-Wformat=]
     TP_printk("comm=%s pid=%d now=%llu.%09u deadline=%llu.%09u runtime=%lld",
               ^
   include/trace/trace_events.h:342:22: note: in definition of macro 'DECLARE_EVENT_CLASS'
     trace_seq_printf(s, print);     \
                         ^
   include/trace/trace_events.h:65:9: note: in expansion of macro 'PARAMS'
            PARAMS(print));         \
            ^
   include/trace/events/sched.h:569:1: note: in expansion of macro 'TRACE_EVENT'
    TRACE_EVENT(sched_deadline_replenish,
    ^
>> include/trace/events/sched.h:591:2: note: in expansion of macro 'TP_printk'
     TP_printk("comm=%s pid=%d now=%llu.%09u deadline=%llu.%09u runtime=%lld",
     ^
   include/trace/events/sched.h: In function 'trace_raw_output_sched_deadline_template':
   include/trace/events/sched.h:619:12: warning: format '%u' expects argument of type 'unsigned int', but argument 4 has type 'long unsigned int' [-Wformat=]
     TP_printk("now=%llu.%09u deadline=%llu.%09u remaining_runtime=%lld",
               ^
   include/trace/trace_events.h:342:22: note: in definition of macro 'DECLARE_EVENT_CLASS'
     trace_seq_printf(s, print);     \
                         ^
   include/trace/events/sched.h:619:2: note: in expansion of macro 'TP_printk'
     TP_printk("now=%llu.%09u deadline=%llu.%09u remaining_runtime=%lld",
     ^
   include/trace/events/sched.h:619:12: warning: format '%u' expects argument of type 'unsigned int', but argument 6 has type 'long unsigned int' [-Wformat=]
     TP_printk("now=%llu.%09u deadline=%llu.%09u remaining_runtime=%lld",
               ^
   include/trace/trace_events.h:342:22: note: in definition of macro 'DECLARE_EVENT_CLASS'
     trace_seq_printf(s, print);     \
                         ^
   include/trace/events/sched.h:619:2: note: in expansion of macro 'TP_printk'
     TP_printk("now=%llu.%09u deadline=%llu.%09u remaining_runtime=%lld",
     ^

vim +591 include/trace/events/sched.h

   563		TP_printk("cpu=%d", __entry->cpu)
   564	);
   565	
   566	/*
   567	 * Tracepoint for runtime replenishment of deadline tasks:
   568	 */
 > 569	TRACE_EVENT(sched_deadline_replenish,
   570	
   571		TP_PROTO(struct sched_dl_entity *dl_se),
   572	
   573		TP_ARGS(dl_se),
   574	
   575		TP_STRUCT__entry(
   576			__array(	char,	comm,	TASK_COMM_LEN	)
   577			__field(	pid_t,	pid			)
   578			__field(	u64,	now			)
   579			__field(	u64,	deadline		)
   580			__field(	s64,	runtime			)
   581		),
   582	
   583		TP_fast_assign(
   584			memcpy(__entry->comm, dl_task_of(dl_se)->comm, TASK_COMM_LEN);
   585			__entry->now		= rq_clock(rq_of_dl_rq(dl_rq_of_se(dl_se)));
   586			__entry->pid		= dl_task_of(dl_se)->pid;
   587			__entry->deadline	= dl_se->deadline;
   588			__entry->runtime	= dl_se->runtime;
   589		),
   590	
 > 591		TP_printk("comm=%s pid=%d now=%llu.%09u deadline=%llu.%09u runtime=%lld",
   592			  __entry->comm,
   593			  __entry->pid,
   594			  __print_ns_to_secs(__entry->now),
   595			  __print_ns_without_secs(__entry->now),
   596			  __print_ns_to_secs(__entry->deadline),
   597			  __print_ns_without_secs(__entry->deadline),
   598			  __entry->runtime)
   599	);
   600	
   601	DECLARE_EVENT_CLASS(sched_deadline_template,
   602	
   603		TP_PROTO(struct sched_dl_entity *dl_se),
   604	
   605		TP_ARGS(dl_se),
   606	
   607		TP_STRUCT__entry(
   608			__field(	u64,	now			)
   609			__field(	u64,	deadline		)
   610			__field(	s64,	runtime			)
   611		),
   612	
   613		TP_fast_assign(
   614			__entry->now		= rq_clock(rq_of_dl_rq(dl_rq_of_se(dl_se)));
   615			__entry->deadline	= dl_se->deadline;
   616			__entry->runtime	= dl_se->runtime;
   617		),
   618	
   619		TP_printk("now=%llu.%09u deadline=%llu.%09u remaining_runtime=%lld",
   620			  __print_ns_to_secs(__entry->now),
   621			  __print_ns_without_secs(__entry->now),
   622			  __print_ns_to_secs(__entry->deadline),
   623			  __print_ns_without_secs(__entry->deadline),
   624			  __entry->runtime)
   625	);
   626	
   627	/*
   628	 * Tracepoint for sched_yield() of a deadline task (the task
   629	 * went to sleep waiting for the next period)
   630	 */
   631	DEFINE_EVENT(sched_deadline_template, sched_deadline_yield,
   632		     TP_PROTO(struct sched_dl_entity *dl_se),
   633		     TP_ARGS(dl_se));
   634	
   635	/*
   636	 * Tracepoint for throttling of a deadline task that consumed all its
   637	 * runtime
   638	 */
   639	DEFINE_EVENT_CONDITION(sched_deadline_template, sched_deadline_throttle,
   640		     TP_PROTO(struct sched_dl_entity *dl_se),
   641		     TP_ARGS(dl_se),
   642		     TP_CONDITION(!dl_se->dl_yielded));
   643	
   644	/*
   645	 * Tracepoint for blocking of a deadline task. The deadline task was
   646	 * dequeued, but neither by being throttled nor yielding.
   647	 */
   648	DEFINE_EVENT_CONDITION(sched_deadline_template, sched_deadline_block,
   649		     TP_PROTO(struct sched_dl_entity *dl_se),
   650		     TP_ARGS(dl_se),
   651		     TP_CONDITION(!dl_se->dl_yielded && !dl_se->dl_throttled));
   652	#endif /* _TRACE_SCHED_H */
   653	
   654	/* This part must be outside protection */
 > 655	#include <trace/define_trace.h>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 30800 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/4] sched/deadline: Tracepoints for deadline scheduler
  2016-02-22 17:32   ` Peter Zijlstra
@ 2016-02-22 17:48     ` Steven Rostedt
  2016-02-22 20:11       ` Daniel Bristot de Oliveira
  2016-02-22 21:30       ` Peter Zijlstra
  0 siblings, 2 replies; 22+ messages in thread
From: Steven Rostedt @ 2016-02-22 17:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Daniel Bristot de Oliveira, Ingo Molnar, Thomas Gleixner,
	Juri Lelli, Arnaldo Carvalho de Melo, LKML, linux-rt-users

On Mon, 22 Feb 2016 18:32:59 +0100
Peter Zijlstra <peterz@infradead.org> wrote:


> So I'm a bit allergic to tracepoints and this is very flimsy on reasons
> why I would want to do this.

Because there's no way to know if SCHED_DEADLINE tasks are doing what
they suppose to without hacking the kernel and adding your own
tracepoints.

> 
> As it stands, the existing tracepoint have already been an ABI
> trainwreck, why would I want to add more?

Yes, this may become a type of ABI, but even the sched switch
tracepoints haven't been that bad. Has it really prevented us from
changing anything?

"trainwreck" is a harsh word, and the information of scheduling
tracepoints have been crucial to finding bugs and such. And has been
tremendously useful in loads of cases.

But let me ask, what would you recommend to finding out if the kernel
has really given your tasks the recommended runtime within a given
period? We can't expect users of SCHED_DEADLINE to be modifying the
kernel themselves.

-- Steve

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/4] sched/deadline: Tracepoints for deadline scheduler
  2016-02-22 17:48     ` Steven Rostedt
@ 2016-02-22 20:11       ` Daniel Bristot de Oliveira
  2016-02-22 21:30       ` Peter Zijlstra
  1 sibling, 0 replies; 22+ messages in thread
From: Daniel Bristot de Oliveira @ 2016-02-22 20:11 UTC (permalink / raw)
  To: Steven Rostedt, Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Juri Lelli,
	Arnaldo Carvalho de Melo, LKML, linux-rt-users



On 02/22/2016 02:48 PM, Steven Rostedt wrote:
> On Mon, 22 Feb 2016 18:32:59 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> 
>> > So I'm a bit allergic to tracepoints and this is very flimsy on reasons
>> > why I would want to do this.
> Because there's no way to know if SCHED_DEADLINE tasks are doing what
> they suppose to without hacking the kernel and adding your own
> tracepoints.

That is the point. A potential deadline user would have to become a
sched deadline developer to be able to debug its deadline tasks.

Given that many deadline scheduler users are expected from application
field, e.g. automation, and that these users generally do not even know
how to compile a kernel, having a "ready to use" way to observe sched
deadline specific points of interest is essential for deadline
scheduler adoption.

For example, the runtime is an estimated value because there is no
deterministic way define it, so be able to observe the remaining runtime
at the end of an activation is very useful to define a tighter runtime.
On the other hand, it is possible to defined that runtime was under
estimated just by observing that a task was throttled.

Another example: be able to see that a task is blocking in the middle of
an activation is useful on the identification of an unexpected behavior
of a deadline task.

Although it is possible to guess many things by doing user-space
measurements, by using a set of other tracepoints, or even by using both
together. The debug process will not be as easy and precise as using
these four tracepoints.

-- Daniel

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/4] tools lib traceevent: Implements '%' operation
  2016-02-22 17:08 ` [PATCH 4/4] tools lib traceevent: Implements '%' operation Daniel Bristot de Oliveira
@ 2016-02-22 20:23   ` Steven Rostedt
  2016-02-23 14:38     ` Arnaldo Carvalho de Melo
  2016-02-25  5:41   ` [tip:perf/core] tools lib traceevent: Implement " tip-bot for Daniel Bristot de Oliveira
  1 sibling, 1 reply; 22+ messages in thread
From: Steven Rostedt @ 2016-02-22 20:23 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, Juri Lelli,
	Arnaldo Carvalho de Melo, LKML, linux-rt-users

On Mon, 22 Feb 2016 14:08:22 -0300
Daniel Bristot de Oliveira <bristot@redhat.com> wrote:

> The operation '%' is not implemented on event-parse.c, causing
> an error on the parse of events with '%' operation on its
> printk format. For example,
> 
> # perf record -e sched:sched_deadline_yield ~/y
>   Warning: [sched:sched_deadline_yield] unknown op '%'
> ....
> # perf script
>   Warning: [sched:sched_deadline_yield] unknown op '%'
> 	y  1641 [006]  3364.109319: sched:sched_deadline_yield: \
> 			[FAILED TO PARSE] now=3364109314595 	\
> 			deadline=3364139295135 runtime=19975597
> 
> This patch implements the '%' operation. With this patch, we see the
> correct output:
> 
> # perf record -e sched:sched_deadline_yield ~/y
>   No Warning
> 
> # perf script
> 	y  4005 [001]  4623.650978: sched:sched_deadline_yield: \
> 		now=4623.650974050				\
> 		deadline=4623.680957364 remaining_runtime=19979611
> 
> Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>

Arnaldo,

This patch is not dependent on the rest of the series and looks like a
good general fix.

Can you pull this one into your tree, and please add my:

Reviewed-by: Steven Rostedt <rsotedt@goodmis.org>

Thanks!

-- Steve

> ---
>  tools/lib/traceevent/event-parse.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/tools/lib/traceevent/event-parse.c b/tools/lib/traceevent/event-parse.c
> index c3bd294..575e751 100644
> --- a/tools/lib/traceevent/event-parse.c
> +++ b/tools/lib/traceevent/event-parse.c
> @@ -1951,6 +1951,7 @@ process_op(struct event_format *event, struct print_arg *arg, char **tok)
>  		   strcmp(token, "*") == 0 ||
>  		   strcmp(token, "^") == 0 ||
>  		   strcmp(token, "/") == 0 ||
> +		   strcmp(token, "%") == 0 ||
>  		   strcmp(token, "<") == 0 ||
>  		   strcmp(token, ">") == 0 ||
>  		   strcmp(token, "<=") == 0 ||
> @@ -3689,6 +3690,9 @@ eval_num_arg(void *data, int size, struct event_format *event, struct print_arg
>  		case '/':
>  			val = left / right;
>  			break;
> +		case '%':
> +			val = left % right;
> +			break;
>  		case '*':
>  			val = left * right;
>  			break;

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/4] sched/deadline: Tracepoints for deadline scheduler
  2016-02-22 17:48     ` Steven Rostedt
  2016-02-22 20:11       ` Daniel Bristot de Oliveira
@ 2016-02-22 21:30       ` Peter Zijlstra
  2016-02-22 22:30         ` Steven Rostedt
  1 sibling, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2016-02-22 21:30 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Daniel Bristot de Oliveira, Ingo Molnar, Thomas Gleixner,
	Juri Lelli, Arnaldo Carvalho de Melo, LKML, linux-rt-users

On Mon, Feb 22, 2016 at 12:48:54PM -0500, Steven Rostedt wrote:

> > As it stands, the existing tracepoint have already been an ABI
> > trainwreck, why would I want to add more?
> 
> Yes, this may become a type of ABI, but even the sched switch
> tracepoints haven't been that bad. Has it really prevented us from
> changing anything?

The whole wakeup thing where we _still_ have a dummy argument, and have
been lying about the value for a long time really stinks.

> But let me ask, what would you recommend to finding out if the kernel
> has really given your tasks the recommended runtime within a given
> period? We can't expect users of SCHED_DEADLINE to be modifying the
> kernel themselves.

So why are these deadline specific tracepoint? Why not extend the ones
we already have?

Also, will these tracepoints still work if we implement SCHED_DEADLINE
using Least-Laxity-First or Pfair or some other exotic algorithm? Or
will be forever be bound to EDF just because tracepoint ABI shite?

Worse, the proposed tracepoints are atrocious, look at crap like this:

> +		if (trace_sched_deadline_yield_enabled()) {
> +			u64 delta_exec = rq_clock_task(rq) - p->se.exec_start;
> +			/* Subtract the last run till now */
> +			if (likely((s64)delta_exec > 0))
> +				p->dl.runtime -= delta_exec;
> +			trace_sched_deadline_yield(&p->dl);
> +		}

tracepoints should _NEVER_ change state, ever.

And there's the whole COND tracepoint muck, which also doesn't win any
prices.

So tell me why these specific tracepoints and why the existing ones
could not be extended to include this information. For example, why a
trace_sched_dealine_yield, and not a generic trace_sched_yield() that
works for all classes.

Tell me that the information presented does not pin the implementation.

And clean up the crap.

Then I might maybe consider this.

But do not present me with a bunch of random arse, hacked together
tracepoints and tell me they might be useful, maybe.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/4] sched/deadline: Tracepoints for deadline scheduler
  2016-02-22 21:30       ` Peter Zijlstra
@ 2016-02-22 22:30         ` Steven Rostedt
  2016-02-23 10:40           ` Juri Lelli
  2016-02-23 10:44           ` Peter Zijlstra
  0 siblings, 2 replies; 22+ messages in thread
From: Steven Rostedt @ 2016-02-22 22:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Daniel Bristot de Oliveira, Ingo Molnar, Thomas Gleixner,
	Juri Lelli, Arnaldo Carvalho de Melo, LKML, linux-rt-users

On Mon, 22 Feb 2016 22:30:17 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Mon, Feb 22, 2016 at 12:48:54PM -0500, Steven Rostedt wrote:
> 
> > > As it stands, the existing tracepoint have already been an ABI
> > > trainwreck, why would I want to add more?  
> > 
> > Yes, this may become a type of ABI, but even the sched switch
> > tracepoints haven't been that bad. Has it really prevented us from
> > changing anything?  
> 
> The whole wakeup thing where we _still_ have a dummy argument, and have
> been lying about the value for a long time really stinks.

Having a dummy argument is not that bad. Yes, it's inconvenient, and
I'm not sure who even uses it (can we delete it without breaking
anything?) But it doesn't prevent us from going forward with
development.

> 
> > But let me ask, what would you recommend to finding out if the kernel
> > has really given your tasks the recommended runtime within a given
> > period? We can't expect users of SCHED_DEADLINE to be modifying the
> > kernel themselves.  
> 
> So why are these deadline specific tracepoint? Why not extend the ones
> we already have?

I'm not sure how to do that and be able to report when a period has
elapsed, and when the next period is coming.

> 
> Also, will these tracepoints still work if we implement SCHED_DEADLINE
> using Least-Laxity-First or Pfair or some other exotic algorithm? Or
> will be forever be bound to EDF just because tracepoint ABI shite?

Can we come up with generic numbers? I mean, the user that asks for
their task to have a specific runtime within a specific
period/deadline, these seem to be generic already. I'll have to read up
on those that you mention, but do that not have a "replenish" for when
the period starts again? And then a yield, showing the task has given
up its remaining time, or a block, where a task is scheduled out
because it blocked on a lock?

> 
> Worse, the proposed tracepoints are atrocious, look at crap like this:
> 
> > +		if (trace_sched_deadline_yield_enabled()) {
> > +			u64 delta_exec = rq_clock_task(rq) - p->se.exec_start;
> > +			/* Subtract the last run till now */
> > +			if (likely((s64)delta_exec > 0))
> > +				p->dl.runtime -= delta_exec;
> > +			trace_sched_deadline_yield(&p->dl);
> > +		}  
> 
> tracepoints should _NEVER_ change state, ever.

Heh, it's not really changing state. The code directly after this is:

	p->dl.runtime = 0;

Without updating dl.runtime, you the tracepoint would report inaccurate
remaining time. Without that, you would get reports of yielding with
full runtimes, making it look like you never ran at all.


> 
> And there's the whole COND tracepoint muck, which also doesn't win any
> prices.

It keeps us from adding flags that may end up going away and us
maintaining a dummy field forever ;-)

> 
> So tell me why these specific tracepoints and why the existing ones
> could not be extended to include this information. For example, why a
> trace_sched_dealine_yield, and not a generic trace_sched_yield() that
> works for all classes.

But what about reporting actual runtime, and when the next period will
come. That only matters for deadline.

> 
> Tell me that the information presented does not pin the implementation.
> 
> And clean up the crap.
> 
> Then I might maybe consider this.
> 
> But do not present me with a bunch of random arse, hacked together
> tracepoints and tell me they might be useful, maybe.


They ARE useful. These are the tracepoints I'm currently using to
debug the deadline scheduler with. They have been indispensable for my
current work.

-- Steve

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/4] sched/deadline: Tracepoints for deadline scheduler
  2016-02-22 22:30         ` Steven Rostedt
@ 2016-02-23 10:40           ` Juri Lelli
  2016-02-23 10:44           ` Peter Zijlstra
  1 sibling, 0 replies; 22+ messages in thread
From: Juri Lelli @ 2016-02-23 10:40 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Daniel Bristot de Oliveira, Ingo Molnar,
	Thomas Gleixner, Juri Lelli, Arnaldo Carvalho de Melo, LKML,
	linux-rt-users

Hi,

On 22/02/16 17:30, Steven Rostedt wrote:
> On Mon, 22 Feb 2016 22:30:17 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Mon, Feb 22, 2016 at 12:48:54PM -0500, Steven Rostedt wrote:
> > 

[...]

> > 
> > > But let me ask, what would you recommend to finding out if the kernel
> > > has really given your tasks the recommended runtime within a given
> > > period? We can't expect users of SCHED_DEADLINE to be modifying the
> > > kernel themselves.  
> > 
> > So why are these deadline specific tracepoint? Why not extend the ones
> > we already have?
> 
> I'm not sure how to do that and be able to report when a period has
> elapsed, and when the next period is coming.
> 
> > 
> > Also, will these tracepoints still work if we implement SCHED_DEADLINE
> > using Least-Laxity-First or Pfair or some other exotic algorithm? Or
> > will be forever be bound to EDF just because tracepoint ABI shite?
> 
> Can we come up with generic numbers? I mean, the user that asks for
> their task to have a specific runtime within a specific
> period/deadline, these seem to be generic already. I'll have to read up
> on those that you mention, but do that not have a "replenish" for when
> the period starts again? And then a yield, showing the task has given
> up its remaining time, or a block, where a task is scheduled out
> because it blocked on a lock?
> 

AFAICT throttle, yield and block seem fairly generic. How we make the
definition generic w.r.t. the arguments we want to print is a different
matter, though. :-/

I should refresh my memory about the above mentioned algorithms, and
"replenish" might be particular to the current implementation. However,
shouldn't any algo that has a "throttle" event also have a corresponding
"un-throttle (replenish)" event? I guess being able to throttle
misbehaving tasks is a property that we will always desire.

> > 
> > Worse, the proposed tracepoints are atrocious, look at crap like this:
> > 
> > > +		if (trace_sched_deadline_yield_enabled()) {
> > > +			u64 delta_exec = rq_clock_task(rq) - p->se.exec_start;
> > > +			/* Subtract the last run till now */
> > > +			if (likely((s64)delta_exec > 0))
> > > +				p->dl.runtime -= delta_exec;
> > > +			trace_sched_deadline_yield(&p->dl);
> > > +		}  
> > 
> > tracepoints should _NEVER_ change state, ever.
> 
> Heh, it's not really changing state. The code directly after this is:
> 
> 	p->dl.runtime = 0;
> 
> Without updating dl.runtime, you the tracepoint would report inaccurate
> remaining time. Without that, you would get reports of yielding with
> full runtimes, making it look like you never ran at all.
> 

Right, I guess that might be useful to understand if you over-
dimensioned the reservation. Can't we make this a macro or do the
computation local to the tracepoint itself so that code looks nicer?

[...]

> > 
> > So tell me why these specific tracepoints and why the existing ones
> > could not be extended to include this information. For example, why a
> > trace_sched_dealine_yield, and not a generic trace_sched_yield() that
> > works for all classes.
> 
> But what about reporting actual runtime, and when the next period will
> come. That only matters for deadline.
> 

As said above, the event looks generic enough to me. Not sure how to
make printed arguments generic, though.

> > 
> > Tell me that the information presented does not pin the implementation.
> > 
> > And clean up the crap.
> > 
> > Then I might maybe consider this.
> > 
> > But do not present me with a bunch of random arse, hacked together
> > tracepoints and tell me they might be useful, maybe.
> 
> 
> They ARE useful. These are the tracepoints I'm currently using to
> debug the deadline scheduler with. They have been indispensable for my
> current work.
> 

I also think they are very useful. I always had some sort of tracepoints
that I continue forward porting during the development of
SCHED_DEADLINE, and I couldn't really understand what was going on
without those. There is value if we can agree on the mainline
incarnation of such tracepoints now, IMHO.

Best,

- Juri

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/4] sched/deadline: Tracepoints for deadline scheduler
  2016-02-22 22:30         ` Steven Rostedt
  2016-02-23 10:40           ` Juri Lelli
@ 2016-02-23 10:44           ` Peter Zijlstra
  2016-02-23 13:10             ` Steven Rostedt
                               ` (3 more replies)
  1 sibling, 4 replies; 22+ messages in thread
From: Peter Zijlstra @ 2016-02-23 10:44 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Daniel Bristot de Oliveira, Ingo Molnar, Thomas Gleixner,
	Juri Lelli, Arnaldo Carvalho de Melo, LKML, linux-rt-users

On Mon, Feb 22, 2016 at 05:30:43PM -0500, Steven Rostedt wrote:
> On Mon, 22 Feb 2016 22:30:17 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Mon, Feb 22, 2016 at 12:48:54PM -0500, Steven Rostedt wrote:
> > 
> > > > As it stands, the existing tracepoint have already been an ABI
> > > > trainwreck, why would I want to add more?  
> > > 
> > > Yes, this may become a type of ABI, but even the sched switch
> > > tracepoints haven't been that bad. Has it really prevented us from
> > > changing anything?  
> > 
> > The whole wakeup thing where we _still_ have a dummy argument, and have
> > been lying about the value for a long time really stinks.
> 
> Having a dummy argument is not that bad. Yes, it's inconvenient, and
> I'm not sure who even uses it (can we delete it without breaking
> anything?) But it doesn't prevent us from going forward with
> development.

No it very much illustrates the problem and is a very clear indication
that tracepoints are an ABI.

Yes, we more or less got away with it this time, but it does prohibit us
from doing sane things, see below.

> > > But let me ask, what would you recommend to finding out if the kernel
> > > has really given your tasks the recommended runtime within a given
> > > period? We can't expect users of SCHED_DEADLINE to be modifying the
> > > kernel themselves.  
> > 
> > So why are these deadline specific tracepoint? Why not extend the ones
> > we already have?
> 
> I'm not sure how to do that and be able to report when a period has
> elapsed, and when the next period is coming.

Seriously? Sit still for a minute and think perhaps?

> > Also, will these tracepoints still work if we implement SCHED_DEADLINE
> > using Least-Laxity-First or Pfair or some other exotic algorithm? Or
> > will be forever be bound to EDF just because tracepoint ABI shite?
> 
> Can we come up with generic numbers? I mean, the user that asks for
> their task to have a specific runtime within a specific
> period/deadline, these seem to be generic already. I'll have to read up
> on those that you mention, but do that not have a "replenish" for when
> the period starts again? And then a yield, showing the task has given
> up its remaining time, or a block, where a task is scheduled out
> because it blocked on a lock?

You tell me; why do you expect me to think for you? These should all
have been questions you asked yourself before you even considered
posting tracepoints.

ABI remember? We're careful with that.

> > Worse, the proposed tracepoints are atrocious, look at crap like this:
> > 
> > > +		if (trace_sched_deadline_yield_enabled()) {
> > > +			u64 delta_exec = rq_clock_task(rq) - p->se.exec_start;
> > > +			/* Subtract the last run till now */
> > > +			if (likely((s64)delta_exec > 0))
> > > +				p->dl.runtime -= delta_exec;
> > > +			trace_sched_deadline_yield(&p->dl);
> > > +		}  
> > 
> > tracepoints should _NEVER_ change state, ever.
> 
> Heh, it's not really changing state. The code directly after this is:
> 
> 	p->dl.runtime = 0;

Yes, it more or less 'works', but its still atrocious shite. Its the
worst kind of anti pattern possible.

Suppose someone comes and removes that line, and ignores the tracepoint
stuff, because, hell its a tracepoint, those don't modify stuff.

Its just really, utterly bad practice.

You've done this tracing code long enough, you really should _KNOW_
this.

> > So tell me why these specific tracepoints and why the existing ones
> > could not be extended to include this information. For example, why a
> > trace_sched_dealine_yield, and not a generic trace_sched_yield() that
> > works for all classes.
> 
> But what about reporting actual runtime, and when the next period will
> come. That only matters for deadline.

How is that an answer to the question? Are you implying a generic
trace_sched_yield() call could not do this?

> > But do not present me with a bunch of random arse, hacked together
> > tracepoints and tell me they might be useful, maybe.
> 
> 
> They ARE useful. These are the tracepoints I'm currently using to
> debug the deadline scheduler with. They have been indispensable for my
> current work.

They are, most obviously, a hacked together debug session for sure. This
is _NOT_ what you commit.

Now ideally we'd do something like the below, but because trainwreck, we
cannot actually do this I think :-(

It gets you about half of what your patch does, but shows how to also
do a generic sched_yield(). The replenish might have to remain special,
although both CFS and RT also have replenishes, albeit significantly
different.

I really dislike tracepoints, and I'm >.< close to proposing a patch
removing them all from the scheduler.

---
 include/trace/events/sched.h | 97 +++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 86 insertions(+), 11 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 9b90c57517a9..b902eb71830b 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -103,9 +103,15 @@ DEFINE_EVENT(sched_wakeup_template, sched_wakeup_new,
 	     TP_PROTO(struct task_struct *p),
 	     TP_ARGS(p));
 
+#define TASK_STATE_PREEMPT	(TASK_STATE_MAX << 0)
+#define TASK_STATE_THROTTLED	(TASK_STATE_MAX << 1)
+#define TASK_STATE_YIELDED	(TASK_STATE_MAX << 2)
+
 #ifdef CREATE_TRACE_POINTS
 static inline long __trace_sched_switch_state(bool preempt, struct task_struct *p)
 {
+	long state = p->state;
+
 #ifdef CONFIG_SCHED_DEBUG
 	BUG_ON(p != current);
 #endif /* CONFIG_SCHED_DEBUG */
@@ -114,10 +120,49 @@ static inline long __trace_sched_switch_state(bool preempt, struct task_struct *
 	 * Preemption ignores task state, therefore preempted tasks are always
 	 * RUNNING (we will not have dequeued if state != RUNNING).
 	 */
-	return preempt ? TASK_RUNNING | TASK_STATE_MAX : p->state;
+	if (preempt) {
+		state = TASK_RUNNING | TASK_STATE_MAX;
+	} else if (dl_task(p)) {
+		if (p->dl.dl_throttled)
+			state |= TASK_STATE_THROTTLED;
+		else if (p->dl.dl_yielded)
+			state |= TASK_STATE_YIELDED;
+	}
+
+	return state;
 }
 #endif /* CREATE_TRACE_POINTS */
 
+#define __trace_sched_switch_fields(name)	do {			\
+	__entry->name##_policy = name->policy;				\
+	switch (name->policy) {						\
+	case SCHED_IDLE:						\
+	case SCHED_BATCH:						\
+	case SCHED_NORMAL:						\
+		__entry->name##_f1 = PRIO_TO_NICE(name->static_prio);	\
+		__entry->name##_f2 = 0;					\
+		__entry->name##_f3 = 0;					\
+		break;							\
+	case SCHED_RR:							\
+	case SCHED_FIFO:						\
+		__entry->name##_f1 = USER_PRIO(name->normal_prio);	\
+		__entry->name##_f2 = 0;					\
+		__entry->name##_f3 = 0;					\
+		break;							\
+	case SCHED_DEADLINE:						\
+		__entry->name##_f1 = name->dl.runtime;			\
+		__entry->name##_f2 = name->dl.deadline;			\
+		__entry->name##_f3 = name->dl.dl_period;		\
+		break;							\
+	default:							\
+		__entry->name##_f1 = 0;					\
+		__entry->name##_f2 = 0;					\
+		__entry->name##_f3 = 0;					\
+		break;							\
+	}								\
+} while (0)
+
+
 /*
  * Tracepoint for task switches, performed by the scheduler:
  */
@@ -132,33 +177,63 @@ TRACE_EVENT(sched_switch,
 	TP_STRUCT__entry(
 		__array(	char,	prev_comm,	TASK_COMM_LEN	)
 		__field(	pid_t,	prev_pid			)
-		__field(	int,	prev_prio			)
+		__field(	int,	prev_policy			)
+		__field(	s64,	prev_f1				)
+		__field(	u64,	prev_f2				)
+		__field(	u64,	prev_f3				)
 		__field(	long,	prev_state			)
+
 		__array(	char,	next_comm,	TASK_COMM_LEN	)
 		__field(	pid_t,	next_pid			)
-		__field(	int,	next_prio			)
+		__field(	int,	next_policy			)
+		__field(	s64,	next_f1				)
+		__field(	u64,	next_f2				)
+		__field(	u64,	next_f3				)
 	),
 
 	TP_fast_assign(
-		memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
+		memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
 		__entry->prev_pid	= prev->pid;
-		__entry->prev_prio	= prev->prio;
+		__trace_sched_switch_fields(prev);
 		__entry->prev_state	= __trace_sched_switch_state(preempt, prev);
-		memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
+
+		memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
 		__entry->next_pid	= next->pid;
-		__entry->next_prio	= next->prio;
+		__trace_sched_switch_fields(next);
 	),
 
-	TP_printk("prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s ==> next_comm=%s next_pid=%d next_prio=%d",
-		__entry->prev_comm, __entry->prev_pid, __entry->prev_prio,
+	TP_printk("prev_comm=%s prev_pid=%d prev=%s:{%Ld,%Lu,%Lu} prev_state=%s%s ==> next_comm=%s next_pid=%d next=%s:{%Ld,%Lu,%Lu}",
+		__entry->prev_comm, __entry->prev_pid,
+		__print_symbolic(__entry->prev_policy,
+				{SCHED_IDLE,	"IDLE"},
+				{SCHED_BATCH,	"BATCH"},
+				{SCHED_NORMAL,	"NORMAL"},
+				{SCHED_RR,	"RR"},
+				{SCHED_FIFO,	"FIFO"},
+				{SCHED_DEADLINE,"DEADLINE"}),
+		__entry->prev_f1, __entry->prev_f2, __entry->prev_f3,
+
 		__entry->prev_state & (TASK_STATE_MAX-1) ?
 		  __print_flags(__entry->prev_state & (TASK_STATE_MAX-1), "|",
 				{ 1, "S"} , { 2, "D" }, { 4, "T" }, { 8, "t" },
 				{ 16, "Z" }, { 32, "X" }, { 64, "x" },
 				{ 128, "K" }, { 256, "W" }, { 512, "P" },
 				{ 1024, "N" }) : "R",
-		__entry->prev_state & TASK_STATE_MAX ? "+" : "",
-		__entry->next_comm, __entry->next_pid, __entry->next_prio)
+		__print_flags(__entry->prev_state & ~(TASK_STATE_MAX-1), "|",
+				{ TASK_STATE_PREEMPT,   "+" },
+				{ TASK_STATE_THROTTLED, "t" },
+				{ TASK_STATE_YIELDED,	"y" }),
+
+		__entry->next_comm, __entry->next_pid,
+		__print_symbolic(__entry->next_policy,
+				{SCHED_IDLE,	"IDLE"},
+				{SCHED_BATCH,	"BATCH"},
+				{SCHED_NORMAL,	"NORMAL"},
+				{SCHED_RR,	"RR"},
+				{SCHED_FIFO,	"FIFO"},
+				{SCHED_DEADLINE,"DEADLINE"}),
+		__entry->next_f1, __entry->next_f2, __entry->next_f3,
+		)
 );
 
 /*

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/4] sched/deadline: Tracepoints for deadline scheduler
  2016-02-23 10:44           ` Peter Zijlstra
@ 2016-02-23 13:10             ` Steven Rostedt
  2016-02-24  8:48               ` Ingo Molnar
  2016-02-23 14:27             ` Peter Zijlstra
                               ` (2 subsequent siblings)
  3 siblings, 1 reply; 22+ messages in thread
From: Steven Rostedt @ 2016-02-23 13:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Daniel Bristot de Oliveira, Ingo Molnar, Thomas Gleixner,
	Juri Lelli, Arnaldo Carvalho de Melo, LKML, linux-rt-users

On Tue, 23 Feb 2016 11:44:08 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> No it very much illustrates the problem and is a very clear indication
> that tracepoints are an ABI.

Yes they are. But note, they can change if nobody notices ;-)


> > 
> > Heh, it's not really changing state. The code directly after this is:
> > 
> > 	p->dl.runtime = 0;  
> 
> Yes, it more or less 'works', but its still atrocious shite. Its the
> worst kind of anti pattern possible.
> 
> Suppose someone comes and removes that line, and ignores the tracepoint
> stuff, because, hell its a tracepoint, those don't modify stuff.
> 
> Its just really, utterly bad practice.
> 
> You've done this tracing code long enough, you really should _KNOW_
> this.

You're right. I got too caught up in the cleverness of the hack to
acknowledge it is a hack.

But Daniel has a patch to clean up the yield code which would also help
in making this hack unnecessary for this tracepoint.

> 
> > > So tell me why these specific tracepoints and why the existing ones
> > > could not be extended to include this information. For example, why a
> > > trace_sched_dealine_yield, and not a generic trace_sched_yield() that
> > > works for all classes.  
> > 
> > But what about reporting actual runtime, and when the next period will
> > come. That only matters for deadline.  
> 
> How is that an answer to the question? Are you implying a generic
> trace_sched_yield() call could not do this?
> 
> > > But do not present me with a bunch of random arse, hacked together
> > > tracepoints and tell me they might be useful, maybe.  
> > 
> > 
> > They ARE useful. These are the tracepoints I'm currently using to
> > debug the deadline scheduler with. They have been indispensable for my
> > current work.  
> 
> They are, most obviously, a hacked together debug session for sure. This
> is _NOT_ what you commit.
> 
> Now ideally we'd do something like the below, but because trainwreck, we
> cannot actually do this I think :-(
> 
> It gets you about half of what your patch does, but shows how to also
> do a generic sched_yield(). The replenish might have to remain special,
> although both CFS and RT also have replenishes, albeit significantly
> different.

OK, I admit. I was very single focused on deadline scheduler. I wasn't
looking at how this could work with the rest of the scheduler. I'll
take a look at the patches you posted.

Thanks!

-- Steve

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/4] sched/deadline: Tracepoints for deadline scheduler
  2016-02-23 10:44           ` Peter Zijlstra
  2016-02-23 13:10             ` Steven Rostedt
@ 2016-02-23 14:27             ` Peter Zijlstra
  2016-02-23 16:19             ` Daniel Bristot de Oliveira
  2016-02-24  2:29             ` Daniel Bristot de Oliveira
  3 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2016-02-23 14:27 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Daniel Bristot de Oliveira, Ingo Molnar, Thomas Gleixner,
	Juri Lelli, Arnaldo Carvalho de Melo, LKML, linux-rt-users

On Tue, Feb 23, 2016 at 11:44:08AM +0100, Peter Zijlstra wrote:
>  include/trace/events/sched.h | 97 +++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 86 insertions(+), 11 deletions(-)
> 
> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index 9b90c57517a9..b902eb71830b 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -103,9 +103,15 @@ DEFINE_EVENT(sched_wakeup_template, sched_wakeup_new,
>  	     TP_PROTO(struct task_struct *p),
>  	     TP_ARGS(p));
>  
> +#define TASK_STATE_PREEMPT	(TASK_STATE_MAX << 0)
> +#define TASK_STATE_THROTTLED	(TASK_STATE_MAX << 1)
> +#define TASK_STATE_YIELDED	(TASK_STATE_MAX << 2)
> +
>  #ifdef CREATE_TRACE_POINTS
>  static inline long __trace_sched_switch_state(bool preempt, struct task_struct *p)
>  {
> +	long state = p->state;
> +
>  #ifdef CONFIG_SCHED_DEBUG
>  	BUG_ON(p != current);
>  #endif /* CONFIG_SCHED_DEBUG */
> @@ -114,10 +120,49 @@ static inline long __trace_sched_switch_state(bool preempt, struct task_struct *
>  	 * Preemption ignores task state, therefore preempted tasks are always
>  	 * RUNNING (we will not have dequeued if state != RUNNING).
>  	 */
> -	return preempt ? TASK_RUNNING | TASK_STATE_MAX : p->state;
> +	if (preempt) {
> +		state = TASK_RUNNING | TASK_STATE_MAX;
> +	} else if (dl_task(p)) {
> +		if (p->dl.dl_throttled)
> +			state |= TASK_STATE_THROTTLED;
> +		else if (p->dl.dl_yielded)
> +			state |= TASK_STATE_YIELDED;

These might want to be inverted, that is, yielded will have throttled
set, so yield should take precedence.

> +	}
> +
> +	return state;
>  }
>  #endif /* CREATE_TRACE_POINTS */
>  
> +#define __trace_sched_switch_fields(name)	do {			\
> +	__entry->name##_policy = name->policy;				\
> +	switch (name->policy) {						\

Sadly we cannot use policy, for that isn't updated on PI. The best we
can do is dl_task(), rt_task() else ...

> +	case SCHED_IDLE:						\

This doesn't have nice, so should go with default;

> +	case SCHED_BATCH:						\
> +	case SCHED_NORMAL:						\
> +		__entry->name##_f1 = PRIO_TO_NICE(name->static_prio);	\
> +		__entry->name##_f2 = 0;					\
> +		__entry->name##_f3 = 0;					\
> +		break;							\
> +	case SCHED_RR:							\
> +	case SCHED_FIFO:						\
> +		__entry->name##_f1 = USER_PRIO(name->normal_prio);	\
> +		__entry->name##_f2 = 0;					\
> +		__entry->name##_f3 = 0;					\
> +		break;							\
> +	case SCHED_DEADLINE:						\
> +		__entry->name##_f1 = name->dl.runtime;			\
> +		__entry->name##_f2 = name->dl.deadline;			\
> +		__entry->name##_f3 = name->dl.dl_period;		\
> +		break;							\
> +	default:							\
> +		__entry->name##_f1 = 0;					\
> +		__entry->name##_f2 = 0;					\
> +		__entry->name##_f3 = 0;					\
> +		break;							\
> +	}								\
> +} while (0)

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/4] tools lib traceevent: Implements '%' operation
  2016-02-22 20:23   ` Steven Rostedt
@ 2016-02-23 14:38     ` Arnaldo Carvalho de Melo
  2016-02-23 14:44       ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 22+ messages in thread
From: Arnaldo Carvalho de Melo @ 2016-02-23 14:38 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Daniel Bristot de Oliveira, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Juri Lelli, LKML, linux-rt-users

Em Mon, Feb 22, 2016 at 03:23:03PM -0500, Steven Rostedt escreveu:
> On Mon, 22 Feb 2016 14:08:22 -0300
> Daniel Bristot de Oliveira <bristot@redhat.com> wrote:
> 
> > The operation '%' is not implemented on event-parse.c, causing
> > an error on the parse of events with '%' operation on its
> > printk format. For example,
> > 
> > # perf record -e sched:sched_deadline_yield ~/y
> >   Warning: [sched:sched_deadline_yield] unknown op '%'
> > ....
> > # perf script
> >   Warning: [sched:sched_deadline_yield] unknown op '%'
> > 	y  1641 [006]  3364.109319: sched:sched_deadline_yield: \
> > 			[FAILED TO PARSE] now=3364109314595 	\
> > 			deadline=3364139295135 runtime=19975597
> > 
> > This patch implements the '%' operation. With this patch, we see the
> > correct output:
> > 
> > # perf record -e sched:sched_deadline_yield ~/y
> >   No Warning
> > 
> > # perf script
> > 	y  4005 [001]  4623.650978: sched:sched_deadline_yield: \
> > 		now=4623.650974050				\
> > 		deadline=4623.680957364 remaining_runtime=19979611
> > 
> > Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
> 
> Arnaldo,
> 
> This patch is not dependent on the rest of the series and looks like a
> good general fix.
> 
> Can you pull this one into your tree, and please add my:
> 
> Reviewed-by: Steven Rostedt <rsotedt@goodmis.org>

Done, thanks.

- Arnaldo

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/4] tools lib traceevent: Implements '%' operation
  2016-02-23 14:38     ` Arnaldo Carvalho de Melo
@ 2016-02-23 14:44       ` Arnaldo Carvalho de Melo
  0 siblings, 0 replies; 22+ messages in thread
From: Arnaldo Carvalho de Melo @ 2016-02-23 14:44 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Juri Lelli, LKML, acme, linux-rt-users

Em Tue, Feb 23, 2016 at 11:38:19AM -0300, Arnaldo Carvalho de Melo escreveu:
> Em Mon, Feb 22, 2016 at 03:23:03PM -0500, Steven Rostedt escreveu:
> > On Mon, 22 Feb 2016 14:08:22 -0300
> > Daniel Bristot de Oliveira <bristot@redhat.com> wrote:
> > 
> > > The operation '%' is not implemented on event-parse.c, causing
> > > an error on the parse of events with '%' operation on its
> > > printk format. For example,
> > > 
> > > # perf record -e sched:sched_deadline_yield ~/y
> > >   Warning: [sched:sched_deadline_yield] unknown op '%'
> > > ....
> > > # perf script

Daniel, please add two spaces before examples, specially examples with
the '#' root prompt.

Had I not done that when applying this patch, those lines would have
been removed by git-am, as it would interpret those as comments.

Thanks,

- Arnaldo

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/4] sched/deadline: Tracepoints for deadline scheduler
  2016-02-23 10:44           ` Peter Zijlstra
  2016-02-23 13:10             ` Steven Rostedt
  2016-02-23 14:27             ` Peter Zijlstra
@ 2016-02-23 16:19             ` Daniel Bristot de Oliveira
  2016-02-24  2:29             ` Daniel Bristot de Oliveira
  3 siblings, 0 replies; 22+ messages in thread
From: Daniel Bristot de Oliveira @ 2016-02-23 16:19 UTC (permalink / raw)
  To: Peter Zijlstra, Steven Rostedt
  Cc: Ingo Molnar, Thomas Gleixner, Juri Lelli,
	Arnaldo Carvalho de Melo, LKML, linux-rt-users



On 02/23/2016 07:44 AM, Peter Zijlstra wrote:
>>> Worse, the proposed tracepoints are atrocious, look at crap like this:
>>> > > 
>>>> > > > +		if (trace_sched_deadline_yield_enabled()) {
>>>> > > > +			u64 delta_exec = rq_clock_task(rq) - p->se.exec_start;
>>>> > > > +			/* Subtract the last run till now */
>>>> > > > +			if (likely((s64)delta_exec > 0))
>>>> > > > +				p->dl.runtime -= delta_exec;
>>>> > > > +			trace_sched_deadline_yield(&p->dl);
>>>> > > > +		}  
>>> > > 
>>> > > tracepoints should _NEVER_ change state, ever.
>> > 
>> > Heh, it's not really changing state. The code directly after this is:
>> > 
>> > 	p->dl.runtime = 0;
> Yes, it more or less 'works', but its still atrocious shite. Its the
> worst kind of anti pattern possible.
> 
> Suppose someone comes and removes that line, and ignores the tracepoint
> stuff, because, hell its a tracepoint, those don't modify stuff.
> 
> Its just really, utterly bad practice.

It is possible to clean up this by removing the "p->dl.runtime = 0;"
from yield_task_dl(), to carry the dl_runtime until update_curr_dl().
We end up not proposing this change because we thought it could be
too much intrusive. But seems that it was the correct approach.

Btw, your patch for "Always calculate end of period on sched_yield()"
already do the necessary clean up, so in the possible next version of
these tracepoint the hack will disappear. 

>>> > > So tell me why these specific tracepoints and why the existing ones
>>> > > could not be extended to include this information. For example, why a
>>> > > trace_sched_dealine_yield, and not a generic trace_sched_yield() that
>>> > > works for all classes.
>> > 
>> > But what about reporting actual runtime, and when the next period will
>> > come. That only matters for deadline.
> How is that an answer to the question? Are you implying a generic
> trace_sched_yield() call could not do this?

I agree that a trace_sched_yield() could partially do this. But each
scheduler would have its own specific data to be printed, and it is hard
to define how many and the type of these data.

>>> > > But do not present me with a bunch of random arse, hacked together
>>> > > tracepoints and tell me they might be useful, maybe.
>> > 
>> > 
>> > They ARE useful. These are the tracepoints I'm currently using to
>> > debug the deadline scheduler with. They have been indispensable for my
>> > current work.
> They are, most obviously, a hacked together debug session for sure. This
> is _NOT_ what you commit.
> 
> Now ideally we'd do something like the below, but because trainwreck, we
> cannot actually do this I think :-(

I do liked this patch, but I agree that bad things can happen on
applications that already use sched:sched_switch :-(.
 
> I really dislike tracepoints, and I'm >.< close to proposing a patch
> removing them all from the scheduler.

Scheduler tracepoints become such a standard not only for debugging the
kernel, but also for understanding performance issues and finding
optimizations for user-space tasks, e.g. finding the best way to spread
task among processors on a large NUMA box. Many people would cry
if sched tracepoints disappears :'-(.

> @@ -132,33 +177,63 @@ TRACE_EVENT(sched_switch,
>  	TP_STRUCT__entry(
>  		__array(	char,	prev_comm,	TASK_COMM_LEN	)
>  		__field(	pid_t,	prev_pid			)
> -		__field(	int,	prev_prio			)
> +		__field(	int,	prev_policy			)
> +		__field(	s64,	prev_f1				)
> +		__field(	u64,	prev_f2				)
> +		__field(	u64,	prev_f3				)

This helps me to explain why did I chose to create deadline specific
tracepoints.

It is not possible to define in advance a common set of parameters to be
traced/printed for all future (exotic) scheduler, because each scheduler
will use its own set of parameters. Hence, the number and type of the
parameter can vary. How many fields do we need? and which are the types
of these fields? What if a scheduler needs two s64 fields and no u64?

Moreover, it is not possible to define the changes that will be needed
on classical schedulers algorithms for them to fit on Linux's
abstractions and restrictions. Therefore, it is not safe to mention
that LLF would (theoretically) use the same set of parameters of a EDF
scheduler, because the theoretical LLF would need to be modified
to fit on Linux. For example, there is no throttling concept on the
classical deadline scheduler. So, the proposed tracepoints are only valid
for Linux's sched deadline scheduler (deadline scheduler + CBS).

That is why I believe that it would be a better approach to have
scheduler specific tracepoints for the sched_*_entity particularities
and sched_* specific points interest.

Finally, I was based in the same approach used on the fair scheduler's
sched:sched_stat* tracepoints: they report sched_entity data on fair.c,
and my patches report sched_deadline_entity data on deadline.c.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/4] sched/deadline: Tracepoints for deadline scheduler
  2016-02-23 10:44           ` Peter Zijlstra
                               ` (2 preceding siblings ...)
  2016-02-23 16:19             ` Daniel Bristot de Oliveira
@ 2016-02-24  2:29             ` Daniel Bristot de Oliveira
  3 siblings, 0 replies; 22+ messages in thread
From: Daniel Bristot de Oliveira @ 2016-02-24  2:29 UTC (permalink / raw)
  To: Peter Zijlstra, Steven Rostedt
  Cc: Ingo Molnar, Thomas Gleixner, Juri Lelli,
	Arnaldo Carvalho de Melo, LKML, linux-rt-users



On 02/23/2016 07:44 AM, Peter Zijlstra wrote:
> Now ideally we'd do something like the below, but because trainwreck, we
> cannot actually do this I think :-(

Some other considerations:

1) The majority of tasks run on NORMAL scheduler with default nice. So,
prev=NORMAL:{0,0,0} and next=NORMAL:{0,0,0} does not help too much.
The tracepoint size increase from 64 to 112, but this difference will not
be useful for the majority of the user/use-cases.

2) The sched:switch can be wider than this 212 characters output:
"              m-1604  [004] d...  1940.393070: sched_switch: prev_comm=m prev_pid=1604 prev=DEADLINE:{-967616,1940402119572,30000000} prev_state=Rt ==> next_comm=kworker/4:2H next_pid=1507 next=NORMAL:{-20,0,0}"

3) On the other tracepoints, rostedt suggested me to also include the
now= field because the clock printed in the trace prefix is
not the same clock used by the deadline scheduler.

4) Although neither perf record/script nor trace-cmd record/report broke,
both printed only the old fields. So, both would need to be updated and
carry both the old and the new version.

Well, I bet there will be many other applications that were not well
written as perf and trace-cmd :-(

5) A new scheduler will probably require modifications on this
tracepoint, hence on all other tools that use this tracepoint too.

6) I liked all the code to compose the output string! But it is not
fair to say that the "format" is intuitive. One may need to put the
finger in the monitor and use a calculator to see that the filter
for a yielded tasks is "prev_state == 8192".

Btw, it only works for deadline task yield, and the yield call of all
other classes will need to be changed to support this. The question
is, is it worth for other classes?

7) This patch is currently showing the runtime zeroed on yield.
So, it needs the clean up on yield_task_dl() to properly
inform the unused runtime of the task's activation.

More than it, to be precise, it must push the zeroing of the
runtime of an yielded task to the replenishment code. Your
patch for the sched yield already do it :-) but once there,
the zeroing can't be moved back without break this
tracepoint :-(.

8) we still need to define a tracepoint for replenishment, but
as it takes place on the following points:

a) sched_setattr()
b) periodic replenishment
c) on the wakeup after a blocking.

There is no a "single tracepoint" already available to "attach" this
information. So seems the replenishment tracepoint is still needed.

Moreover, although other scheduler may have this concept, I am
not sure if they take place under the same conditions.

So, AFAICS, this integrate approach is neither adding more information,
nor preventing ABI problems now and possibly in the future.

On the other hand, although a per-scheduler approach would increase
the number of tracepoints, the tracepoints tends to be simpler, confined
to the scheduler's .c and more coherent to the scheduler's abstractions.

I fear that the mix of the deadline scheduler and other schedulers
information, and the adaptation on other tracepoints to make the deadline
scheduler points of interest fit on current tracepoints, would add
undesired noise/imprecision on the interpretation of the sched deadline
actions/abstractions :-(.

-- Daniel

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/4] sched/deadline: Tracepoints for deadline scheduler
  2016-02-23 13:10             ` Steven Rostedt
@ 2016-02-24  8:48               ` Ingo Molnar
  0 siblings, 0 replies; 22+ messages in thread
From: Ingo Molnar @ 2016-02-24  8:48 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Daniel Bristot de Oliveira, Ingo Molnar,
	Thomas Gleixner, Juri Lelli, Arnaldo Carvalho de Melo, LKML,
	linux-rt-users


* Steven Rostedt <rostedt@goodmis.org> wrote:

> On Tue, 23 Feb 2016 11:44:08 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > No it very much illustrates the problem and is a very clear indication
> > that tracepoints are an ABI.
> 
> Yes they are. But note, they can change if nobody notices ;-)

Which is rather seldom for something as high profile as core scheduler
tracepoints ...

You can be cavalier about tracepoints in subsystems where the tracepoints are at 
most used by the maintainers with specialized, closely maintained tooling that is 
easy to change. Not so much about scheduler tracepoints which are being widely 
utilized by various projects.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [tip:perf/core] tools lib traceevent: Implement '%' operation
  2016-02-22 17:08 ` [PATCH 4/4] tools lib traceevent: Implements '%' operation Daniel Bristot de Oliveira
  2016-02-22 20:23   ` Steven Rostedt
@ 2016-02-25  5:41   ` tip-bot for Daniel Bristot de Oliveira
  1 sibling, 0 replies; 22+ messages in thread
From: tip-bot for Daniel Bristot de Oliveira @ 2016-02-25  5:41 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mingo, rostedt, linux-rt-users, juri.lelli, acme, tglx,
	linux-kernel, bristot, hpa, peterz

Commit-ID:  0e47b38dcd24c78d0699b42f28d5986154d2aa11
Gitweb:     http://git.kernel.org/tip/0e47b38dcd24c78d0699b42f28d5986154d2aa11
Author:     Daniel Bristot de Oliveira <bristot@redhat.com>
AuthorDate: Mon, 22 Feb 2016 14:08:22 -0300
Committer:  Arnaldo Carvalho de Melo <acme@redhat.com>
CommitDate: Tue, 23 Feb 2016 11:40:51 -0300

tools lib traceevent: Implement '%' operation

The operation '%' is not implemented on event-parse.c, causing an error
when parsing events with '%' the operation in its printk format. For
example,

  # perf record -e sched:sched_deadline_yield ~/yield-test
    Warning: [sched:sched_deadline_yield] unknown op '%'
  ....
  # perf script
    Warning: [sched:sched_deadline_yield] unknown op '%'
        test  1641 [006]  3364.109319: sched:sched_deadline_yield: \
                        [FAILED TO PARSE] now=3364109314595        \
                        deadline=3364139295135 runtime=19975597

This patch implements the '%' operation. With this patch, we see the
correct output:

  # perf record -e sched:sched_deadline_yield ~/yield-test
    No Warning

  # perf script
        yield-test  4005 [001]  4623.650978: sched:sched_deadline_yield: \
                now=4623.650974050                                       \
                deadline=4623.680957364 remaining_runtime=19979611

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Juri Lelli <juri.lelli@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-rt-users <linux-rt-users@vger.kernel.org>
Link: http://lkml.kernel.org/r/5c96a395c56cea6d3d13d949051bdece86cc26e0.1456157869.git.bristot@redhat.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/lib/traceevent/event-parse.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/tools/lib/traceevent/event-parse.c b/tools/lib/traceevent/event-parse.c
index c3bd294..575e751 100644
--- a/tools/lib/traceevent/event-parse.c
+++ b/tools/lib/traceevent/event-parse.c
@@ -1951,6 +1951,7 @@ process_op(struct event_format *event, struct print_arg *arg, char **tok)
 		   strcmp(token, "*") == 0 ||
 		   strcmp(token, "^") == 0 ||
 		   strcmp(token, "/") == 0 ||
+		   strcmp(token, "%") == 0 ||
 		   strcmp(token, "<") == 0 ||
 		   strcmp(token, ">") == 0 ||
 		   strcmp(token, "<=") == 0 ||
@@ -3689,6 +3690,9 @@ eval_num_arg(void *data, int size, struct event_format *event, struct print_arg
 		case '/':
 			val = left / right;
 			break;
+		case '%':
+			val = left % right;
+			break;
 		case '*':
 			val = left * right;
 			break;

^ permalink raw reply related	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2016-02-25  6:30 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-02-22 17:08 [PATCH 0/4] Tracepoints for deadline scheduler Daniel Bristot de Oliveira
2016-02-22 17:08 ` [PATCH 1/4] tracing: Add __print_ns_to_secs() and __print_ns_without_secs() helpers Daniel Bristot de Oliveira
2016-02-22 17:08 ` [PATCH 2/4] sched: Move deadline container_of() helper functions into sched.h Daniel Bristot de Oliveira
2016-02-22 17:08 ` [PATCH 3/4] sched/deadline: Tracepoints for deadline scheduler Daniel Bristot de Oliveira
2016-02-22 17:32   ` Peter Zijlstra
2016-02-22 17:48     ` Steven Rostedt
2016-02-22 20:11       ` Daniel Bristot de Oliveira
2016-02-22 21:30       ` Peter Zijlstra
2016-02-22 22:30         ` Steven Rostedt
2016-02-23 10:40           ` Juri Lelli
2016-02-23 10:44           ` Peter Zijlstra
2016-02-23 13:10             ` Steven Rostedt
2016-02-24  8:48               ` Ingo Molnar
2016-02-23 14:27             ` Peter Zijlstra
2016-02-23 16:19             ` Daniel Bristot de Oliveira
2016-02-24  2:29             ` Daniel Bristot de Oliveira
2016-02-22 17:48   ` kbuild test robot
2016-02-22 17:08 ` [PATCH 4/4] tools lib traceevent: Implements '%' operation Daniel Bristot de Oliveira
2016-02-22 20:23   ` Steven Rostedt
2016-02-23 14:38     ` Arnaldo Carvalho de Melo
2016-02-23 14:44       ` Arnaldo Carvalho de Melo
2016-02-25  5:41   ` [tip:perf/core] tools lib traceevent: Implement " tip-bot for Daniel Bristot de Oliveira

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.