linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH 1/6] sched/fair: check_preempt_wakeup: Fix assumption on the default policy
@ 2016-09-16 21:09 Julien Desfossez
  2016-09-16 21:09 ` [RFC PATCH 2/6] sched: get effective policy and rt_prio Julien Desfossez
                   ` (5 more replies)
  0 siblings, 6 replies; 10+ messages in thread
From: Julien Desfossez @ 2016-09-16 21:09 UTC (permalink / raw)
  To: peterz, tglx, rostedt, mingo, daolivei
  Cc: mathieu.desnoyers, linux-kernel, Julien Desfossez

Tasks with RT or deadline scheduling class may inherit from a task with
a "fair" scheduling class. This priority inheritance changes the
scheduling class, but not the task "policy" field.

Therefore, the fair scheduler should not assume that policy !=
SCHED_NORMAL is the same as (policy == SCHED_BATCH || policy ==
SCHED_IDLE), because the policy could also be SCHED_RR, SCHED_FIFO, or
SCHED_DEADLINE.

The incorrect comparison in check_preempt_wakeup makes RR, FIFO and
DEADLINE tasks which inherit from a fair task behave as if they were
IDLE or BATCH tasks, thus awaiting the following tick before preempting
the current task.
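
As a minimal userspace sketch (not kernel code; the policy constants
are copied from include/uapi/linux/sched.h), the old and new checks
compare as follows for a boosted waiter whose policy field still reads
SCHED_FIFO:

#include <stdbool.h>
#include <stdio.h>

/* Policy values from include/uapi/linux/sched.h */
#define SCHED_NORMAL	0
#define SCHED_FIFO	1
#define SCHED_RR	2
#define SCHED_BATCH	3
#define SCHED_IDLE	5

/* Old check: everything that is not SCHED_NORMAL defers to the tick. */
static bool old_skip_preempt(int policy)
{
	return policy != SCHED_NORMAL;
}

/* New check: only BATCH and IDLE tasks defer preemption to the tick. */
static bool new_skip_preempt(int policy)
{
	return policy == SCHED_BATCH || policy == SCHED_IDLE;
}

int main(void)
{
	printf("old: %d new: %d\n",
	       old_skip_preempt(SCHED_FIFO),	/* 1: wrongly deferred */
	       new_skip_preempt(SCHED_FIFO));	/* 0: preempts promptly */
	return 0;
}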

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt (Red Hat) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Julien Desfossez <jdesfossez@efficios.com>
---
 kernel/sched/fair.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 07aaa7f..f3aef21 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5669,7 +5669,8 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 	 * Batch and idle tasks do not preempt non-idle tasks (their preemption
 	 * is driven by the tick):
 	 */
-	if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
+	if (unlikely(p->policy == SCHED_BATCH || p->policy == SCHED_IDLE) ||
+			!sched_feat(WAKEUP_PREEMPTION))
 		return;
 
 	find_matching_se(&se, &pse);
-- 
1.9.1


* [RFC PATCH 2/6] sched: get effective policy and rt_prio
  2016-09-16 21:09 [RFC PATCH 1/6] sched/fair: check_preempt_wakeup: Fix assumption on the default policy Julien Desfossez
@ 2016-09-16 21:09 ` Julien Desfossez
  2016-09-16 21:09 ` [RFC PATCH 3/6] tracing: add sched_update_prio Julien Desfossez
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 10+ messages in thread
From: Julien Desfossez @ 2016-09-16 21:09 UTC (permalink / raw)
  To: peterz, tglx, rostedt, mingo, daolivei
  Cc: mathieu.desnoyers, linux-kernel, Julien Desfossez

Add helper functions to get the effective policy and rt_priority from
the prio and policy values. This is useful in PI situations because
these fields are not updated in the boosted task; only its sched_class
is temporarily modified.
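
For illustration, a hypothetical caller (for instance a tracepoint
probe, as later patches in this series do) would use the helpers as
follows; "p" is assumed to be a valid struct task_struct pointer:

	int policy = rt_mutex_get_effective_policy(p->policy, p->prio);
	int rtprio = rt_mutex_get_effective_rt_prio(p->prio);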

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt (Red Hat) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Julien Desfossez <jdesfossez@efficios.com>
---
 include/linux/sched/rt.h | 10 ++++++++++
 kernel/locking/rtmutex.c | 36 ++++++++++++++++++++++++++++++++++++
 2 files changed, 46 insertions(+)

diff --git a/include/linux/sched/rt.h b/include/linux/sched/rt.h
index a30b172..42b37ce 100644
--- a/include/linux/sched/rt.h
+++ b/include/linux/sched/rt.h
@@ -25,6 +25,8 @@ static inline bool tsk_is_pi_blocked(struct task_struct *tsk)
 {
 	return tsk->pi_blocked_on != NULL;
 }
+extern int rt_mutex_get_effective_policy(int policy, int prio);
+extern int rt_mutex_get_effective_rt_prio(int prio);
 #else
 static inline int rt_mutex_getprio(struct task_struct *p)
 {
@@ -46,6 +48,14 @@ static inline bool tsk_is_pi_blocked(struct task_struct *tsk)
 {
 	return false;
 }
+static inline int rt_mutex_get_effective_policy(int policy, int prio)
+{
+	return policy;
+}
+static inline int rt_mutex_get_effective_rt_prio(int prio)
+{
+	return prio < MAX_RT_PRIO ? MAX_RT_PRIO - 1 - prio : 0;
+}
 #endif
 
 extern void normalize_rt_tasks(void);
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 1ec0f48..b66954f 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -294,6 +294,42 @@ int rt_mutex_get_effective_prio(struct task_struct *task, int newprio)
 }
 
 /*
+ * Get the effective policy based on the current prio value.
+ */
+int rt_mutex_get_effective_policy(int policy, int prio)
+{
+	if (dl_prio(prio))
+		return SCHED_DEADLINE;
+
+	/* With RT, the default class is SCHED_FIFO. */
+	if (rt_prio(prio)) {
+		if (policy == SCHED_RR)
+			return SCHED_RR;
+		return SCHED_FIFO;
+	}
+
+	/* With fair, the default class is SCHED_NORMAL. */
+	switch (policy) {
+	case SCHED_NORMAL:
+	case SCHED_IDLE:
+	case SCHED_BATCH:
+		return policy;
+	}
+	return SCHED_NORMAL;
+}
+
+/*
+ * Get the effective rt priority based on the current prio value.
+ */
+int rt_mutex_get_effective_rt_prio(int prio)
+{
+	if (!rt_prio(prio))
+		return 0;
+
+	return MAX_RT_PRIO - 1 - prio;
+}
+
+/*
  * Adjust the priority of a task, after its pi_waiters got modified.
  *
  * This can be both boosting and unboosting. task->pi_lock must be held.
-- 
1.9.1


* [RFC PATCH 3/6] tracing: add sched_update_prio
  2016-09-16 21:09 [RFC PATCH 1/6] sched/fair: check_preempt_wakeup: Fix assumption on the default policy Julien Desfossez
  2016-09-16 21:09 ` [RFC PATCH 2/6] sched: get effective policy and rt_prio Julien Desfossez
@ 2016-09-16 21:09 ` Julien Desfossez
  2016-09-16 21:09 ` [RFC PATCH 4/6] tracing: add TRACE_EVENT_MAP Julien Desfossez
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 10+ messages in thread
From: Julien Desfossez @ 2016-09-16 21:09 UTC (permalink / raw)
  To: peterz, tglx, rostedt, mingo, daolivei
  Cc: mathieu.desnoyers, linux-kernel, Julien Desfossez

This tracepoint makes it possible to track all explicit priority changes
of a task. It outputs the scheduling policy, the nice value, the
rt_priority and the deadline-related attributes (dl_runtime, dl_deadline
and dl_period).

It is emitted in the code path of the sched_setscheduler, sched_setattr,
sched_setparam, and nice system calls.

This allows the analysis of real-time scheduling delays based on the
configured scheduling priorities and policies, which cannot be performed
with the current instrumentation in sched_switch. Also, instead of
exposing the internal kernel prio field, this tracepoint only outputs
the user-visible priority attributes.

The effective priority of running threads can also be temporarily
changed in the PI code, but a dedicated tracepoint is already in place
to cover this case.

Here are a few output examples:
After fork of a normal task:
sched_update_prio: comm=bash pid=2104, policy=SCHED_NORMAL, nice=0,
	rt_priority=0, dl_runtime=0, dl_deadline=0, dl_period=0

renice -n 10 of a normal task:
sched_update_prio: comm=sleep pid=2130, policy=SCHED_NORMAL, nice=10,
	rt_priority=0, dl_runtime=0, dl_deadline=0, dl_period=0

SCHED_FIFO 60:
sched_update_prio: comm=chrt pid=2105, policy=SCHED_FIFO, nice=0,
	rt_priority=60, dl_runtime=0, dl_deadline=0, dl_period=0

SCHED_RR 60:
sched_update_prio: comm=chrt pid=2109, policy=SCHED_RR, nice=0,
	rt_priority=60, dl_runtime=0, dl_deadline=0, dl_period=0

SCHED_DEADLINE:
sched_update_prio: comm=b pid=2110, policy=SCHED_DEADLINE, nice=0,
	rt_priority=0, dl_runtime=10000000, dl_deadline=30000000,
	dl_period=30000000

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt (Red Hat) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Julien Desfossez <jdesfossez@efficios.com>
---
 include/trace/events/sched.h | 68 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/core.c          |  3 ++
 2 files changed, 71 insertions(+)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 9b90c57..bc695e4 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -8,6 +8,34 @@
 #include <linux/tracepoint.h>
 #include <linux/binfmts.h>
 
+#define SCHEDULING_POLICY				\
+	EM( SCHED_NORMAL,	"SCHED_NORMAL")		\
+	EM( SCHED_FIFO,		"SCHED_FIFO")		\
+	EM( SCHED_RR,		"SCHED_RR")		\
+	EM( SCHED_BATCH,	"SCHED_BATCH")		\
+	EM( SCHED_IDLE,		"SCHED_IDLE")		\
+	EMe(SCHED_DEADLINE,	"SCHED_DEADLINE")
+
+/*
+ * First define the enums in the above macros to be exported to userspace
+ * via TRACE_DEFINE_ENUM().
+ */
+#undef EM
+#undef EMe
+#define EM(a, b)	TRACE_DEFINE_ENUM(a);
+#define EMe(a, b)	TRACE_DEFINE_ENUM(a);
+
+SCHEDULING_POLICY
+
+/*
+ * Now redefine the EM() and EMe() macros to map the enums to the strings
+ * that will be printed in the output.
+ */
+#undef EM
+#undef EMe
+#define EM(a, b)	{a, b},
+#define EMe(a, b)	{a, b}
+
 /*
  * Tracepoint for calling kthread_stop, performed to end a kthread:
  */
@@ -562,6 +590,46 @@ static inline long __trace_sched_switch_state(bool preempt, struct task_struct *
 
 	TP_printk("cpu=%d", __entry->cpu)
 );
+
+/*
+ * Tracepoint for showing scheduling priority changes.
+ */
+TRACE_EVENT(sched_update_prio,
+
+	TP_PROTO(struct task_struct *tsk),
+
+	TP_ARGS(tsk),
+
+	TP_STRUCT__entry(
+		__array( char,	comm,	TASK_COMM_LEN	)
+		__field( pid_t,	pid			)
+		__field( unsigned int,	policy		)
+		__field( int,	nice			)
+		__field( unsigned int,	rt_priority	)
+		__field( u64,	dl_runtime		)
+		__field( u64,	dl_deadline		)
+		__field( u64,	dl_period		)
+	),
+
+	TP_fast_assign(
+		memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
+		__entry->pid		= tsk->pid;
+		__entry->policy		= tsk->policy;
+		__entry->nice		= task_nice(tsk);
+		__entry->rt_priority	= tsk->rt_priority;
+		__entry->dl_runtime	= tsk->dl.dl_runtime;
+		__entry->dl_deadline	= tsk->dl.dl_deadline;
+		__entry->dl_period	= tsk->dl.dl_period;
+	),
+
+	TP_printk("comm=%s pid=%d, policy=%s, nice=%d, rt_priority=%u, "
+			"dl_runtime=%Lu, dl_deadline=%Lu, dl_period=%Lu",
+			__entry->comm, __entry->pid,
+			__print_symbolic(__entry->policy, SCHEDULING_POLICY),
+			__entry->nice, __entry->rt_priority,
+			__entry->dl_runtime, __entry->dl_deadline,
+			__entry->dl_period)
+);
 #endif /* _TRACE_SCHED_H */
 
 /* This part must be outside protection */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f5f7b3c..979a635 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3751,6 +3751,7 @@ void set_user_nice(struct task_struct *p, long nice)
 			resched_curr(rq);
 	}
 out_unlock:
+	trace_sched_update_prio(p);
 	task_rq_unlock(rq, p, &rf);
 }
 EXPORT_SYMBOL(set_user_nice);
@@ -3955,6 +3956,8 @@ static void __setscheduler(struct rq *rq, struct task_struct *p,
 		p->sched_class = &rt_sched_class;
 	else
 		p->sched_class = &fair_sched_class;
+
+	trace_sched_update_prio(p);
 }
 
 static void
-- 
1.9.1


* [RFC PATCH 4/6] tracing: add TRACE_EVENT_MAP
  2016-09-16 21:09 [RFC PATCH 1/6] sched/fair: check_preempt_wakeup: Fix assumption on the default policy Julien Desfossez
  2016-09-16 21:09 ` [RFC PATCH 2/6] sched: get effective policy and rt_prio Julien Desfossez
  2016-09-16 21:09 ` [RFC PATCH 3/6] tracing: add sched_update_prio Julien Desfossez
@ 2016-09-16 21:09 ` Julien Desfossez
  2016-09-16 21:09 ` [RFC PATCH 5/6] tracing: extend scheduling tracepoints Julien Desfossez
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 10+ messages in thread
From: Julien Desfossez @ 2016-09-16 21:09 UTC (permalink / raw)
  To: peterz, tglx, rostedt, mingo, daolivei
  Cc: mathieu.desnoyers, linux-kernel, Julien Desfossez

This macro creates an alias of an existing TRACE_EVENT. A
TRACE_EVENT_MAP connects a new probe to an existing tracepoint, so it
can be used to produce another output of the same tracepoint without
changing the instrumented code.

This makes it possible to create alternate versions of existing
tracepoints that output additional or different fields only in specific
use cases, rather than all the time (which could break existing tools
and/or bloat the trace with many useless fields).

The usage is the same as the TRACE_EVENT macro, with the addition of the
"map" parameter, which is the name of the alias; the "name" field is the
name of the original tracepoint:
TRACE_EVENT_MAP(name, map, proto, args, tstruct, assign, print)
DEFINE_EVENT_MAP(template, name, map, proto, args)
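
As a hypothetical illustration (the alias name sched_waking_min and its
field layout are invented here, not part of this series), a trimmed-down
alias of the existing sched_waking tracepoint could be declared as:

/*
 * Hypothetical alias of the existing sched_waking tracepoint. The
 * instrumented code and the original event are left untouched; only
 * a second probe with its own output is attached.
 */
TRACE_EVENT_MAP(sched_waking, sched_waking_min,

	TP_PROTO(struct task_struct *p),

	TP_ARGS(p),

	TP_STRUCT__entry(
		__field( pid_t,	pid	)
	),

	TP_fast_assign(
		__entry->pid = p->pid;
	),

	TP_printk("pid=%d", __entry->pid)
);

The original sched_waking event keeps its existing output; the alias can
be enabled independently and only records the PID of the woken task.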

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt (Red Hat) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Julien Desfossez <jdesfossez@efficios.com>
---
 include/linux/trace_events.h | 14 ++++++++++++-
 include/linux/tracepoint.h   | 11 +++++++++-
 include/trace/define_trace.h |  4 ++++
 include/trace/perf.h         |  7 +++++++
 include/trace/trace_events.h | 50 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/trace/trace_events.c  | 15 +++++++++----
 6 files changed, 95 insertions(+), 6 deletions(-)

diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index be00761..1f7e0ec 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -217,6 +217,7 @@ enum {
 	TRACE_EVENT_FL_TRACEPOINT_BIT,
 	TRACE_EVENT_FL_KPROBE_BIT,
 	TRACE_EVENT_FL_UPROBE_BIT,
+	TRACE_EVENT_FL_MAP_BIT,
 };
 
 /*
@@ -231,6 +232,7 @@ enum {
  *  TRACEPOINT    - Event is a tracepoint
  *  KPROBE        - Event is a kprobe
  *  UPROBE        - Event is a uprobe
+ *  MAP           - Event maps to a tracepoint as an alias
  */
 enum {
 	TRACE_EVENT_FL_FILTERED		= (1 << TRACE_EVENT_FL_FILTERED_BIT),
@@ -241,10 +243,16 @@ enum {
 	TRACE_EVENT_FL_TRACEPOINT	= (1 << TRACE_EVENT_FL_TRACEPOINT_BIT),
 	TRACE_EVENT_FL_KPROBE		= (1 << TRACE_EVENT_FL_KPROBE_BIT),
 	TRACE_EVENT_FL_UPROBE		= (1 << TRACE_EVENT_FL_UPROBE_BIT),
+	TRACE_EVENT_FL_MAP		= (1 << TRACE_EVENT_FL_MAP_BIT),
 };
 
 #define TRACE_EVENT_FL_UKPROBE (TRACE_EVENT_FL_KPROBE | TRACE_EVENT_FL_UPROBE)
 
+struct trace_event_map {
+	struct tracepoint	*tp;
+	char			*name;
+};
+
 struct trace_event_call {
 	struct list_head	list;
 	struct trace_event_class *class;
@@ -252,6 +260,8 @@ struct trace_event_call {
 		char			*name;
 		/* Set TRACE_EVENT_FL_TRACEPOINT flag when using "tp" */
 		struct tracepoint	*tp;
+		/* Set TRACE_EVENT_FL_MAP flag when using "map" instead */
+		struct trace_event_map	*map;
 	};
 	struct trace_event	event;
 	char			*print_fmt;
@@ -282,7 +292,9 @@ struct trace_event_call {
 static inline const char *
 trace_event_name(struct trace_event_call *call)
 {
-	if (call->flags & TRACE_EVENT_FL_TRACEPOINT)
+	if (call->flags & TRACE_EVENT_FL_MAP)
+		return call->map->name;
+	else if (call->flags & TRACE_EVENT_FL_TRACEPOINT)
 		return call->tp ? call->tp->name : NULL;
 	else
 		return call->name;
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index be586c6..b8ab12a 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -276,6 +276,7 @@ static inline void tracepoint_synchronize_unregister(void)
 
 #define DEFINE_TRACE_FN(name, reg, unreg)
 #define DEFINE_TRACE(name)
+#define DEFINE_TRACE_MAP(name, map)
 #define EXPORT_TRACEPOINT_SYMBOL_GPL(name)
 #define EXPORT_TRACEPOINT_SYMBOL(name)
 
@@ -466,6 +467,13 @@ static inline void tracepoint_synchronize_unregister(void)
  *
  * A set of (un)registration functions can be passed to the variant
  * TRACE_EVENT_FN to perform any (un)registration work.
+ *
+ * TRACE_EVENT_MAP can be used to create alternate versions of a
+ * TRACE_EVENT without modifying the instrumented code. It connects
+ * a different probe to an existing tracepoint, so other fields can be
+ * extracted. The "name" field is the name of the original TRACE_EVENT,
+ * the "map" field is the name of the alias. They can be enabled
+ * independently.
  */
 
 #define DECLARE_EVENT_CLASS(name, proto, args, tstruct, assign, print)
@@ -493,9 +501,10 @@ static inline void tracepoint_synchronize_unregister(void)
 			      struct, assign, print)		\
 	DECLARE_TRACE_CONDITION(name, PARAMS(proto),		\
 				PARAMS(args), PARAMS(cond))
-
 #define TRACE_EVENT_FLAGS(event, flag)
 
 #define TRACE_EVENT_PERF_PERM(event, expr...)
 
+#define TRACE_EVENT_MAP(name, map, proto, args, struct, assign, print)
+
 #endif /* ifdef TRACE_EVENT (see note above) */
diff --git a/include/trace/define_trace.h b/include/trace/define_trace.h
index 6e3945f..fdd8845 100644
--- a/include/trace/define_trace.h
+++ b/include/trace/define_trace.h
@@ -26,6 +26,9 @@
 #define TRACE_EVENT(name, proto, args, tstruct, assign, print)	\
 	DEFINE_TRACE(name)
 
+#undef TRACE_EVENT_MAP
+#define TRACE_EVENT_MAP(name, map, proto, args, tstruct, assign, print)
+
 #undef TRACE_EVENT_CONDITION
 #define TRACE_EVENT_CONDITION(name, proto, args, cond, tstruct, assign, print) \
 	TRACE_EVENT(name,						\
@@ -100,6 +103,7 @@
 #undef TRACE_EVENT_FN
 #undef TRACE_EVENT_FN_COND
 #undef TRACE_EVENT_CONDITION
+#undef TRACE_EVENT_MAP
 #undef DECLARE_EVENT_CLASS
 #undef DEFINE_EVENT
 #undef DEFINE_EVENT_FN
diff --git a/include/trace/perf.h b/include/trace/perf.h
index 04fe68bb..563d441 100644
--- a/include/trace/perf.h
+++ b/include/trace/perf.h
@@ -86,5 +86,12 @@
 #define DEFINE_EVENT_PRINT(template, name, proto, args, print)	\
 	DEFINE_EVENT(template, name, PARAMS(proto), PARAMS(args))
 
+#undef DEFINE_EVENT_MAP
+#define DEFINE_EVENT_MAP(template, call, map, proto, args)		\
+static inline void perf_test_probe_##map(void)				\
+{									\
+	check_trace_callback_type_##call(perf_trace_##template);	\
+}
+
 #include TRACE_INCLUDE(TRACE_INCLUDE_FILE)
 #endif /* CONFIG_PERF_EVENTS */
diff --git a/include/trace/trace_events.h b/include/trace/trace_events.h
index 467e12f..deb5bc6 100644
--- a/include/trace/trace_events.h
+++ b/include/trace/trace_events.h
@@ -65,6 +65,15 @@
 			     PARAMS(print));		       \
 	DEFINE_EVENT(name, name, PARAMS(proto), PARAMS(args));
 
+#undef TRACE_EVENT_MAP
+#define TRACE_EVENT_MAP(name, map, proto, args, tstruct, assign, print) \
+	DECLARE_EVENT_CLASS(map,			       \
+			     PARAMS(proto),		       \
+			     PARAMS(args),		       \
+			     PARAMS(tstruct),		       \
+			     PARAMS(assign),		       \
+			     PARAMS(print));		       \
+	DEFINE_EVENT_MAP(map, name, map, PARAMS(proto), PARAMS(args));
 
 #undef __field
 #define __field(type, item)		type	item;
@@ -108,6 +117,11 @@
 	static struct trace_event_call	__used		\
 	__attribute__((__aligned__(4))) event_##name
 
+#undef DEFINE_EVENT_MAP
+#define DEFINE_EVENT_MAP(template, name, map, proto, args)	\
+	static struct trace_event_call	__used		\
+	__attribute__((__aligned__(4))) event_##map
+
 #undef DEFINE_EVENT_FN
 #define DEFINE_EVENT_FN(template, name, proto, args, reg, unreg)	\
 	DEFINE_EVENT(template, name, PARAMS(proto), PARAMS(args))
@@ -191,6 +205,9 @@
 #undef DEFINE_EVENT
 #define DEFINE_EVENT(template, name, proto, args)
 
+#undef DEFINE_EVENT_MAP
+#define DEFINE_EVENT_MAP(template, name, map, proto, args)
+
 #undef DEFINE_EVENT_PRINT
 #define DEFINE_EVENT_PRINT(template, name, proto, args, print)	\
 	DEFINE_EVENT(template, name, PARAMS(proto), PARAMS(args))
@@ -426,6 +443,9 @@
 #undef DEFINE_EVENT
 #define DEFINE_EVENT(template, name, proto, args)
 
+#undef DEFINE_EVENT_MAP
+#define DEFINE_EVENT_MAP(template, name, map, proto, args)
+
 #undef DEFINE_EVENT_PRINT
 #define DEFINE_EVENT_PRINT(template, name, proto, args, print)	\
 	DEFINE_EVENT(template, name, PARAMS(proto), PARAMS(args))
@@ -506,6 +526,9 @@
 #undef DEFINE_EVENT
 #define DEFINE_EVENT(template, name, proto, args)
 
+#undef DEFINE_EVENT_MAP
+#define DEFINE_EVENT_MAP(template, name, map, proto, args)
+
 #undef DEFINE_EVENT_PRINT
 #define DEFINE_EVENT_PRINT(template, name, proto, args, print)	\
 	DEFINE_EVENT(template, name, PARAMS(proto), PARAMS(args))
@@ -700,6 +723,13 @@
 	check_trace_callback_type_##call(trace_event_raw_event_##template); \
 }
 
+#undef DEFINE_EVENT_MAP
+#define DEFINE_EVENT_MAP(template, call, map, proto, args)		\
+static inline void ftrace_test_probe_##map(void)			\
+{									\
+	check_trace_callback_type_##call(trace_event_raw_event_##template); \
+}
+
 #undef DEFINE_EVENT_PRINT
 #define DEFINE_EVENT_PRINT(template, name, proto, args, print)
 
@@ -749,6 +779,26 @@
 static struct trace_event_call __used					\
 __attribute__((section("_ftrace_events"))) *__event_##call = &event_##call
 
+#undef DEFINE_EVENT_MAP
+#define DEFINE_EVENT_MAP(_template, _call, _map, _proto, _args)		\
+									\
+static struct trace_event_map event_map_##_map = {			\
+	.tp = &__tracepoint_##_call,					\
+	.name = #_map,							\
+};									\
+									\
+static struct trace_event_call __used event_##_map = {			\
+	.class			= &event_class_##_template,		\
+	{								\
+		.map			= &event_map_##_map,		\
+	},								\
+	.event.funcs		= &trace_event_type_funcs_##_template,	\
+	.print_fmt		= print_fmt_##_template,		\
+	.flags			= TRACE_EVENT_FL_TRACEPOINT | TRACE_EVENT_FL_MAP,\
+};									\
+static struct trace_event_call __used					\
+__attribute__((section("_ftrace_events"))) *__event_##_map = &event_##_map
+
 #undef DEFINE_EVENT_PRINT
 #define DEFINE_EVENT_PRINT(template, call, proto, args, print)		\
 									\
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index 03c0a48..8b46dd8 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -327,26 +327,33 @@ int trace_event_reg(struct trace_event_call *call,
 		    enum trace_reg type, void *data)
 {
 	struct trace_event_file *file = data;
+	struct tracepoint *tp;
 
 	WARN_ON(!(call->flags & TRACE_EVENT_FL_TRACEPOINT));
+
+	if (call->flags & TRACE_EVENT_FL_MAP)
+		tp = call->map->tp;
+	else
+		tp = call->tp;
+
 	switch (type) {
 	case TRACE_REG_REGISTER:
-		return tracepoint_probe_register(call->tp,
+		return tracepoint_probe_register(tp,
 						 call->class->probe,
 						 file);
 	case TRACE_REG_UNREGISTER:
-		tracepoint_probe_unregister(call->tp,
+		tracepoint_probe_unregister(tp,
 					    call->class->probe,
 					    file);
 		return 0;
 
 #ifdef CONFIG_PERF_EVENTS
 	case TRACE_REG_PERF_REGISTER:
-		return tracepoint_probe_register(call->tp,
+		return tracepoint_probe_register(tp,
 						 call->class->perf_probe,
 						 call);
 	case TRACE_REG_PERF_UNREGISTER:
-		tracepoint_probe_unregister(call->tp,
+		tracepoint_probe_unregister(tp,
 					    call->class->perf_probe,
 					    call);
 		return 0;
-- 
1.9.1


* [RFC PATCH 5/6] tracing: extend scheduling tracepoints
  2016-09-16 21:09 [RFC PATCH 1/6] sched/fair: check_preempt_wakeup: Fix assumption on the default policy Julien Desfossez
                   ` (2 preceding siblings ...)
  2016-09-16 21:09 ` [RFC PATCH 4/6] tracing: add TRACE_EVENT_MAP Julien Desfossez
@ 2016-09-16 21:09 ` Julien Desfossez
  2016-09-16 21:09 ` [RFC PATCH 6/6] tracing: extend sched_pi_setprio Julien Desfossez
  2016-09-20 20:49 ` [RFC PATCH 1/6] sched/fair: check_preempt_wakeup: Fix assumption on the default policy Thomas Gleixner
  5 siblings, 0 replies; 10+ messages in thread
From: Julien Desfossez @ 2016-09-16 21:09 UTC (permalink / raw)
  To: peterz, tglx, rostedt, mingo, daolivei
  Cc: mathieu.desnoyers, linux-kernel, Julien Desfossez

Create alternate versions of the sched_switch, sched_waking and
sched_process_fork tracepoint probes to output priority-related fields
and the PI top waiter, if any.

This uses the TRACE_EVENT_MAP macro, so the instrumented code and the
already existing tracepoints are untouched.

We only expose the priority-related fields visible from userspace,
leaving out the "prio" value, which is really a kernel-internal
representation of the task priority and should be expected to be
eventually deprecated.

We also output the comm and PID of the process blocked by the task if it
is in a PI situation. These fields make it possible to quickly identify
PI situations without keeping track of all the
sched_pi_setprio/sched_pi_update_prio events and state.

The values traced are the effective values, which may differ from the
thread's normal values in PI scenarios.

Here is an example of the output from these new probes:
sched_process_fork_prio: comm=bash, pid=1988, child_comm=bash,
	child_pid=2129, child_policy=SCHED_NORMAL, child_nice=0,
	child_rt_priority=0, child_dl_runtime=0,
	child_dl_deadline=0, child_dl_period=0

No PI:
sched_switch_prio: prev_comm=swapper/6, prev_pid=0, prev_policy=SCHED_NORMAL,
	prev_nice=0, prev_rt_priority=0, prev_dl_runtime=0,
	prev_dl_deadline=0, prev_dl_period=0, prev_state=R,
	prev_top_waiter_comm=, prev_top_waiter_pid=-1 ==>
	next_comm=bash, next_pid=3817, next_policy=SCHED_NORMAL,
	next_nice=0, next_rt_priority=0, next_dl_runtime=0,
	next_dl_deadline=0, next_dl_period=0, next_top_waiter_comm=,
	next_top_waiter_pid=-1

sched_waking_prio: comm=migration/6, pid=38, target_cpu=006,
	policy=SCHED_FIFO, nice=0, rt_priority=99, dl_runtime=0,
	dl_deadline=0, dl_period=0, top_waiter_comm=, top_waiter_pid=-1

PI:
sched_switch_prio: prev_comm=swapper/1, prev_pid=0, prev_policy=SCHED_NORMAL,
	prev_nice=0, prev_rt_priority=0, prev_dl_runtime=0,
	prev_dl_deadline=0, prev_dl_period=0, prev_state=R,
	prev_top_waiter_comm=, prev_top_waiter_pid=-1 ==>
	next_comm=lowprio1, next_pid=3818, next_policy=SCHED_NORMAL,
	next_nice=0, next_rt_priority=90, next_dl_runtime=0,
	next_dl_deadline=0, next_dl_period=0,
	next_top_waiter_comm=highprio0, next_top_waiter_pid=3820

sched_waking_prio: comm=lowprio1, pid=3818, target_cpu=001, policy=SCHED_FIFO,
	  nice=0, rt_priority=90, dl_runtime=0, dl_deadline=0,
	  dl_period=0, top_waiter_comm=highprio0, top_waiter_pid=3820

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt (Red Hat) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Julien Desfossez <jdesfossez@efficios.com>
---
 include/trace/events/sched.h | 222 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 222 insertions(+)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index bc695e4..11b3358 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -131,6 +131,64 @@
 	     TP_PROTO(struct task_struct *p),
 	     TP_ARGS(p));
 
+TRACE_EVENT_MAP(sched_waking, sched_waking_prio,
+
+	TP_PROTO(struct task_struct *p),
+
+	TP_ARGS(p),
+
+	TP_STRUCT__entry(
+		__array(	char,	comm,	TASK_COMM_LEN	)
+		__field(	pid_t,	pid			)
+		__field(	int,	target_cpu		)
+		__field( 	unsigned int,	policy		)
+		__field( 	int,	nice			)
+		__field( 	unsigned int,	rt_priority	)
+		__field( 	u64,	dl_runtime		)
+		__field( 	u64,	dl_deadline		)
+		__field( 	u64,	dl_period		)
+		__array(	char,	top_waiter_comm,	TASK_COMM_LEN	)
+		__field(	pid_t,	top_waiter_pid		)
+	),
+
+	TP_fast_assign(
+		struct task_struct *top_waiter = rt_mutex_get_top_task(p);
+
+		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
+		__entry->pid		= p->pid;
+		__entry->target_cpu	= task_cpu(p);
+		__entry->policy		= rt_mutex_get_effective_policy(
+						p->policy, p->prio);
+		__entry->nice		= task_nice(p);
+		__entry->rt_priority	= rt_mutex_get_effective_rt_prio(
+						p->prio);
+		__entry->dl_runtime	= dl_prio(p->prio) ?
+						p->dl.dl_runtime : 0;
+		__entry->dl_deadline	= dl_prio(p->prio) ?
+						p->dl.dl_deadline : 0;
+		__entry->dl_period	= dl_prio(p->prio) ?
+						p->dl.dl_period : 0;
+		if (top_waiter) {
+			memcpy(__entry->top_waiter_comm, top_waiter->comm,
+					TASK_COMM_LEN);
+			__entry->top_waiter_pid	= top_waiter->pid;
+		} else {
+			__entry->top_waiter_comm[0] = '\0';
+			__entry->top_waiter_pid	= -1;
+		}
+	),
+
+	TP_printk("comm=%s, pid=%d, target_cpu=%03d, policy=%s, "
+			"nice=%d, rt_priority=%u, dl_runtime=%Lu, "
+			"dl_deadline=%Lu, dl_period=%Lu, "
+			"top_waiter_comm=%s, top_waiter_pid=%d",
+		  __entry->comm, __entry->pid, __entry->target_cpu,
+		  __print_symbolic(__entry->policy, SCHEDULING_POLICY),
+		  __entry->nice, __entry->rt_priority, __entry->dl_runtime,
+		  __entry->dl_deadline, __entry->dl_period,
+		  __entry->top_waiter_comm, __entry->top_waiter_pid)
+);
+
 #ifdef CREATE_TRACE_POINTS
 static inline long __trace_sched_switch_state(bool preempt, struct task_struct *p)
 {
@@ -190,6 +248,121 @@ static inline long __trace_sched_switch_state(bool preempt, struct task_struct *
 );
 
 /*
+ * Tracepoint for task switches, performed by the scheduler:
+ */
+TRACE_EVENT_MAP(sched_switch, sched_switch_prio,
+	TP_PROTO(bool preempt,
+		 struct task_struct *prev,
+		 struct task_struct *next),
+
+	TP_ARGS(preempt, prev, next),
+
+	TP_STRUCT__entry(
+		__array(	char,	prev_comm,	TASK_COMM_LEN	)
+		__field(	pid_t,	prev_pid			)
+		__field(	long,	prev_state			)
+		__field( 	unsigned int,	prev_policy		)
+		__field( 	int,	prev_nice			)
+		__field( 	unsigned int,	prev_rt_priority	)
+		__field( 	u64,	prev_dl_runtime			)
+		__field( 	u64,	prev_dl_deadline		)
+		__field( 	u64,	prev_dl_period			)
+		__array(	char,	prev_top_waiter_comm,	TASK_COMM_LEN	)
+		__field(	pid_t,	prev_top_waiter_pid		)
+		__array(	char,	next_comm,	TASK_COMM_LEN	)
+		__field(	pid_t,	next_pid			)
+		__field( 	unsigned int,	next_policy		)
+		__field( 	int,	next_nice			)
+		__field( 	unsigned int,	next_rt_priority	)
+		__field( 	u64,	next_dl_runtime			)
+		__field( 	u64,	next_dl_deadline		)
+		__field( 	u64,	next_dl_period			)
+		__array(	char,	next_top_waiter_comm,	TASK_COMM_LEN	)
+		__field(	pid_t,	next_top_waiter_pid		)
+	),
+
+	TP_fast_assign(
+		struct task_struct *prev_top = rt_mutex_get_top_task(prev);
+		struct task_struct *next_top = rt_mutex_get_top_task(next);
+
+		memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
+		__entry->prev_pid		= prev->pid;
+		__entry->prev_state		= __trace_sched_switch_state(
+							preempt, prev);
+		__entry->prev_policy		= rt_mutex_get_effective_policy(
+							prev->policy, prev->prio);
+		__entry->prev_nice		= task_nice(prev);
+		__entry->prev_rt_priority	= rt_mutex_get_effective_rt_prio(
+							prev->prio);
+		__entry->prev_dl_runtime	= dl_prio(prev->prio) ?
+							prev->dl.dl_runtime : 0;
+		__entry->prev_dl_deadline	= dl_prio(prev->prio) ?
+							prev->dl.dl_deadline : 0;
+		__entry->prev_dl_period		= dl_prio(prev->prio) ?
+							prev->dl.dl_period : 0;
+		if (prev_top) {
+			memcpy(__entry->prev_top_waiter_comm, prev_top->comm,
+					TASK_COMM_LEN);
+			__entry->prev_top_waiter_pid	= prev_top->pid;
+		} else {
+			__entry->prev_top_waiter_comm[0] = '\0';
+			__entry->prev_top_waiter_pid	= -1;
+		}
+
+		memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
+		__entry->next_pid		= next->pid;
+		__entry->next_policy		= rt_mutex_get_effective_policy(
+							next->policy, next->prio);
+		__entry->next_nice		= task_nice(next);
+		__entry->next_rt_priority	= rt_mutex_get_effective_rt_prio(
+							next->prio);
+		__entry->next_dl_runtime	= dl_prio(next->prio) ?
+							next->dl.dl_runtime : 0;
+		__entry->next_dl_deadline	= dl_prio(next->prio) ?
+							next->dl.dl_deadline : 0;
+		__entry->next_dl_period		= dl_prio(next->prio) ?
+							next->dl.dl_period : 0;
+		if (next_top) {
+			memcpy(__entry->next_top_waiter_comm, next_top->comm,
+					TASK_COMM_LEN);
+			__entry->next_top_waiter_pid	= next_top->pid;
+		} else {
+			__entry->next_top_waiter_comm[0] = '\0';
+			__entry->next_top_waiter_pid	= -1;
+		}
+	),
+
+	TP_printk("prev_comm=%s, prev_pid=%d, prev_policy=%s, prev_nice=%d, "
+			"prev_rt_priority=%u, prev_dl_runtime=%Lu, "
+			"prev_dl_deadline=%Lu, prev_dl_period=%Lu, "
+			"prev_state=%s%s, prev_top_waiter_comm=%s, "
+			"prev_top_waiter_pid=%d ==> next_comm=%s, next_pid=%d, "
+			"next_policy=%s, next_nice=%d, next_rt_priority=%u, "
+			"next_dl_runtime=%Lu, next_dl_deadline=%Lu, "
+			"next_dl_period=%Lu, next_top_waiter_comm=%s, "
+			"next_top_waiter_pid=%d",
+		__entry->prev_comm, __entry->prev_pid,
+		__print_symbolic(__entry->prev_policy, SCHEDULING_POLICY),
+		__entry->prev_nice, __entry->prev_rt_priority,
+		__entry->prev_dl_runtime, __entry->prev_dl_deadline,
+		__entry->prev_dl_period,
+		__entry->prev_state & (TASK_STATE_MAX-1) ?
+		  __print_flags(__entry->prev_state & (TASK_STATE_MAX-1), "|",
+				{ 1, "S"} , { 2, "D" }, { 4, "T" }, { 8, "t" },
+				{ 16, "Z" }, { 32, "X" }, { 64, "x" },
+				{ 128, "K" }, { 256, "W" }, { 512, "P" },
+				{ 1024, "N" }) : "R",
+		__entry->prev_state & TASK_STATE_MAX ? "+" : "",
+		__entry->prev_top_waiter_comm, __entry->prev_top_waiter_pid,
+		__entry->next_comm, __entry->next_pid,
+		__print_symbolic(__entry->next_policy, SCHEDULING_POLICY),
+		__entry->next_nice, __entry->next_rt_priority,
+		__entry->next_dl_runtime, __entry->next_dl_deadline,
+		__entry->next_dl_period, __entry->next_top_waiter_comm,
+		__entry->next_top_waiter_pid)
+);
+
+/*
  * Tracepoint for a task being migrated:
  */
 TRACE_EVENT(sched_migrate_task,
@@ -316,6 +489,55 @@ static inline long __trace_sched_switch_state(bool preempt, struct task_struct *
 		__entry->child_comm, __entry->child_pid)
 );
 
+TRACE_EVENT_MAP(sched_process_fork, sched_process_fork_prio,
+
+	TP_PROTO(struct task_struct *parent, struct task_struct *child),
+
+	TP_ARGS(parent, child),
+
+	TP_STRUCT__entry(
+		__array(	char,	parent_comm,	TASK_COMM_LEN	)
+		__field(	pid_t,	parent_pid			)
+		__array(	char,	child_comm,	TASK_COMM_LEN	)
+		__field(	pid_t,	child_pid			)
+		__field( 	unsigned int,	child_policy		)
+		__field( 	int,	child_nice			)
+		__field( 	unsigned int,	child_rt_priority	)
+		__field( 	u64,	child_dl_runtime		)
+		__field( 	u64,	child_dl_deadline		)
+		__field( 	u64,	child_dl_period			)
+	),
+
+	TP_fast_assign(
+		memcpy(__entry->parent_comm, parent->comm, TASK_COMM_LEN);
+		__entry->parent_pid	= parent->pid;
+		memcpy(__entry->child_comm, child->comm, TASK_COMM_LEN);
+		__entry->child_pid	= child->pid;
+		__entry->child_policy		= rt_mutex_get_effective_policy(
+							child->policy, child->prio);
+		__entry->child_nice		= task_nice(child);
+		__entry->child_rt_priority	= rt_mutex_get_effective_rt_prio(
+							child->prio);
+		__entry->child_dl_runtime	= dl_prio(child->prio) ?
+							child->dl.dl_runtime : 0;
+		__entry->child_dl_deadline	= dl_prio(child->prio) ?
+							child->dl.dl_deadline : 0;
+		__entry->child_dl_period	= dl_prio(child->prio) ?
+							child->dl.dl_period : 0;
+	),
+
+	TP_printk("comm=%s, pid=%d, child_comm=%s, child_pid=%d, "
+			"child_policy=%s, child_nice=%d, "
+			"child_rt_priority=%u, child_dl_runtime=%Lu, "
+			"child_dl_deadline=%Lu, child_dl_period=%Lu",
+		__entry->parent_comm, __entry->parent_pid,
+		__entry->child_comm, __entry->child_pid,
+		__print_symbolic(__entry->child_policy, SCHEDULING_POLICY),
+		__entry->child_nice, __entry->child_rt_priority,
+		__entry->child_dl_runtime, __entry->child_dl_deadline,
+		__entry->child_dl_period)
+);
+
 /*
  * Tracepoint for exec:
  */
-- 
1.9.1


* [RFC PATCH 6/6] tracing: extend sched_pi_setprio
  2016-09-16 21:09 [RFC PATCH 1/6] sched/fair: check_preempt_wakeup: Fix assumption on the default policy Julien Desfossez
                   ` (3 preceding siblings ...)
  2016-09-16 21:09 ` [RFC PATCH 5/6] tracing: extend scheduling tracepoints Julien Desfossez
@ 2016-09-16 21:09 ` Julien Desfossez
  2016-09-20 20:49 ` [RFC PATCH 1/6] sched/fair: check_preempt_wakeup: Fix assumption on the default policy Thomas Gleixner
  5 siblings, 0 replies; 10+ messages in thread
From: Julien Desfossez @ 2016-09-16 21:09 UTC (permalink / raw)
  To: peterz, tglx, rostedt, mingo, daolivei
  Cc: mathieu.desnoyers, linux-kernel, Julien Desfossez

Use the TRACE_EVENT_MAP macro to extend sched_pi_setprio into
sched_pi_update_prio. The pre-existing event is untouched. This gets rid
of the old/new prio fields and instead outputs the scheduling update
based on the top waiter of the rtmutex.

Boosting:
sched_pi_update_prio: comm=lowprio1, pid=3818, old_policy=SCHED_NORMAL,
	old_nice=0, old_rt_priority=0, old_dl_runtime=0,
	old_dl_deadline=0, old_dl_period=0, top_waiter_comm=highprio0,
	top_waiter_pid=3820, new_policy=SCHED_FIFO, new_nice=0,
	new_rt_priority=90, new_dl_runtime=0, new_dl_deadline=0,
	new_dl_period=0

Unboosting:
sched_pi_update_prio: comm=lowprio1, pid=3818, old_policy=SCHED_FIFO,
	old_nice=0, old_rt_priority=90, old_dl_runtime=0, old_dl_deadline=0,
	old_dl_period=0, top_waiter_comm=, top_waiter_pid=-1,
	new_policy=SCHED_NORMAL, new_nice=0, new_rt_priority=0,
	new_dl_runtime=0, new_dl_deadline=0, new_dl_period=0

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt (Red Hat) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Julien Desfossez <jdesfossez@efficios.com>
---
 include/trace/events/sched.h | 96 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 96 insertions(+)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 11b3358..c69ee74 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -686,6 +686,102 @@ static inline long __trace_sched_switch_state(bool preempt, struct task_struct *
 			__entry->oldprio, __entry->newprio)
 );
 
+/*
+ * Extract the complete scheduling information from the before
+ * and after the change of priority.
+ */
+TRACE_EVENT_MAP(sched_pi_setprio, sched_pi_update_prio,
+
+	TP_PROTO(struct task_struct *tsk, int newprio),
+
+	TP_ARGS(tsk, newprio),
+
+	TP_STRUCT__entry(
+		__array( char,	comm,	TASK_COMM_LEN	)
+		__field( pid_t,	pid			)
+		__field( unsigned int,	old_policy		)
+		__field( int,	old_nice			)
+		__field( unsigned int,	old_rt_priority	)
+		__field( u64,	old_dl_runtime			)
+		__field( u64,	old_dl_deadline		)
+		__field( u64,	old_dl_period			)
+		__array( char,	top_waiter_comm,	TASK_COMM_LEN	)
+		__field( pid_t,	top_waiter_pid			)
+		__field( unsigned int,	new_policy		)
+		__field( int,	new_nice			)
+		__field( unsigned int,	new_rt_priority	)
+		__field( u64,	new_dl_runtime		)
+		__field( u64,	new_dl_deadline		)
+		__field( u64,	new_dl_period			)
+	),
+
+	TP_fast_assign(
+		struct task_struct *top_waiter = rt_mutex_get_top_task(tsk);
+
+		memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
+		__entry->pid			= tsk->pid;
+		__entry->old_policy		= rt_mutex_get_effective_policy(
+							tsk->policy, tsk->prio);
+		__entry->old_nice		= task_nice(tsk);
+		__entry->old_rt_priority	= rt_mutex_get_effective_rt_prio(
+							tsk->prio);
+		__entry->old_dl_runtime	= dl_prio(tsk->prio) ?
+							tsk->dl.dl_runtime : 0;
+		__entry->old_dl_deadline	= dl_prio(tsk->prio) ?
+							tsk->dl.dl_deadline : 0;
+		__entry->old_dl_period		= dl_prio(tsk->prio) ?
+							tsk->dl.dl_period : 0;
+		if (top_waiter) {
+			memcpy(__entry->top_waiter_comm, top_waiter->comm, TASK_COMM_LEN);
+			__entry->top_waiter_pid		= top_waiter->pid;
+			/*
+			 * The effective policy depends on the current policy of
+			 * the target task.
+			 */
+			__entry->new_policy		= rt_mutex_get_effective_policy(
+								tsk->policy, top_waiter->prio);
+			__entry->new_nice		= task_nice(top_waiter);
+			__entry->new_rt_priority	= rt_mutex_get_effective_rt_prio(
+								top_waiter->prio);
+			__entry->new_dl_runtime	= dl_prio(top_waiter->prio) ?
+								top_waiter->dl.dl_runtime : 0;
+			__entry->new_dl_deadline	= dl_prio(top_waiter->prio) ?
+								top_waiter->dl.dl_deadline : 0;
+			__entry->new_dl_period	= dl_prio(top_waiter->prio) ?
+								top_waiter->dl.dl_period : 0;
+		} else {
+			__entry->top_waiter_comm[0]	= '\0';
+			__entry->top_waiter_pid		= -1;
+			__entry->new_policy		= 0;
+			__entry->new_nice		= 0;
+			__entry->new_rt_priority	= 0;
+			__entry->new_dl_runtime	= 0;
+			__entry->new_dl_deadline	= 0;
+			__entry->new_dl_period	= 0;
+		}
+	),
+
+	TP_printk("comm=%s, pid=%d, old_policy=%s, old_nice=%d, "
+			"old_rt_priority=%u, old_dl_runtime=%Lu, "
+			"old_dl_deadline=%Lu, old_dl_period=%Lu, "
+			"top_waiter_comm=%s, top_waiter_pid=%d, new_policy=%s, "
+			"new_nice=%d, new_rt_priority=%u, "
+			"new_dl_runtime=%Lu, new_dl_deadline=%Lu, "
+			"new_dl_period=%Lu",
+		__entry->comm, __entry->pid,
+		__print_symbolic(__entry->old_policy, SCHEDULING_POLICY),
+		__entry->old_nice, __entry->old_rt_priority,
+		__entry->old_dl_runtime, __entry->old_dl_deadline,
+		__entry->old_dl_period,
+		__entry->top_waiter_comm, __entry->top_waiter_pid,
+		__entry->new_policy >= 0 ?
+			__print_symbolic(__entry->new_policy,
+				SCHEDULING_POLICY) : "",
+		__entry->new_nice, __entry->new_rt_priority,
+		__entry->new_dl_runtime, __entry->new_dl_deadline,
+		__entry->new_dl_period)
+);
+
 #ifdef CONFIG_DETECT_HUNG_TASK
 TRACE_EVENT(sched_process_hang,
 	TP_PROTO(struct task_struct *tsk),
-- 
1.9.1


* Re: [RFC PATCH 1/6] sched/fair: check_preempt_wakeup: Fix assumption on the default policy
  2016-09-16 21:09 [RFC PATCH 1/6] sched/fair: check_preempt_wakeup: Fix assumption on the default policy Julien Desfossez
                   ` (4 preceding siblings ...)
  2016-09-16 21:09 ` [RFC PATCH 6/6] tracing: extend sched_pi_setprio Julien Desfossez
@ 2016-09-20 20:49 ` Thomas Gleixner
  2016-09-20 22:04   ` Mathieu Desnoyers
  5 siblings, 1 reply; 10+ messages in thread
From: Thomas Gleixner @ 2016-09-20 20:49 UTC (permalink / raw)
  To: Julien Desfossez
  Cc: peterz, rostedt, mingo, daolivei, mathieu.desnoyers, linux-kernel

On Fri, 16 Sep 2016, Julien Desfossez wrote:

> Tasks with RT or deadline scheduling class may inherit from a task with
> a "fair" scheduling class.

This makes no sense. A RT/DL task can never inherit anything from a sched
fair task. That would be inverted priority inheritance.

> This priority inheritance changes the scheduling class, but not the task
> "policy" field.
>
> Therefore, the fair scheduler should not assume that policy !=
> SCHED_NORMAL is the same as (policy == SCHED_BATCH || policy ==
> SCHED_IDLE), because the policy could also be SCHED_RR, SCHED_FIFO, or
> SCHED_DEADLINE.
> 
> The incorrect comparison in check_preempt_wakeup makes RR, FIFO and
> DEADLINE tasks which inherit from a fair task behave as if they were
> IDLE or BATCH tasks, thus awaiting the following tick before preempting
> the current task.

This is just wrong.

Priority/deadline inheritance elevates a fair task to RR/FIFO/DL, i.e. to
the scheduling class of the task which is blocked on a resource held by the
fair task.

The check_preempt_curr() callback of a scheduling class is only invoked
when the freshly woken task is in the same scheduling class as the task
which is currently on the cpu.

So which problem are you actually solving?

> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Steven Rostedt (Red Hat) <rostedt@goodmis.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Signed-off-by: Julien Desfossez <jdesfossez@efficios.com>

Who wrote the patch? 

Thanks,

	tglx


* Re: [RFC PATCH 1/6] sched/fair: check_preempt_wakeup: Fix assumption on the default policy
  2016-09-20 20:49 ` [RFC PATCH 1/6] sched/fair: check_preempt_wakeup: Fix assumption on the default policy Thomas Gleixner
@ 2016-09-20 22:04   ` Mathieu Desnoyers
  2016-09-20 22:56     ` Thomas Gleixner
  0 siblings, 1 reply; 10+ messages in thread
From: Mathieu Desnoyers @ 2016-09-20 22:04 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Julien Desfossez, Peter Zijlstra, rostedt, Ingo Molnar, daolivei,
	linux-kernel

----- On Sep 20, 2016, at 4:49 PM, Thomas Gleixner tglx@linutronix.de wrote:

> On Fri, 16 Sep 2016, Julien Desfossez wrote:
> 
>> Tasks with RT or deadline scheduling class may inherit from a task with
>> a "fair" scheduling class.
> 
> This makes no sense. A RT/DL task can never inherit anything from a sched
> fair task. That would be inverted priority inheritance.
> 
>> This priority inheritance changes the scheduling class, but not the task
>> "policy" field.
>>
>> Therefore, the fair scheduler should not assume that policy !=
>> SCHED_NORMAL is the same as (policy == SCHED_BATCH || policy ==
>> SCHED_IDLE), because the policy could also be SCHED_RR, SCHED_FIFO, or
>> SCHED_DEADLINE.
>> 
>> The incorrect comparison in check_preempt_wakeup makes RR, FIFO and
>> DEADLINE tasks which inherit from a fair task behave as if they were
>> IDLE or BATCH tasks, thus awaiting the following tick before preempting
>> the current task.
> 
> This is just wrong.
> 
> Priority/deadline inheritance elevates a fair task to RR/FIFO/DL, i.e. to
> the scheduling class of the task which is blocked on a resource held by the
> fair task.
> 
> The check_preempt_curr() callback of a scheduling class is only invoked
> when the freshly woken task is in the same scheduling class as the task
> which is currently on the cpu.
> 
> So which problem are you actually solving?

So what is then puzzling us is this:

rt_mutex_setprio()

        if (dl_prio(prio)) {
                struct task_struct *pi_task = rt_mutex_get_top_task(p);
                if (!dl_prio(p->normal_prio) ||
                    (pi_task && dl_entity_preempt(&pi_task->dl, &p->dl))) {
                        p->dl.dl_boosted = 1;
                        queue_flag |= ENQUEUE_REPLENISH;
                } else
                        p->dl.dl_boosted = 0;
                p->sched_class = &dl_sched_class;
        } else if (rt_prio(prio)) {
                if (dl_prio(oldprio))
                        p->dl.dl_boosted = 0;
                if (oldprio < prio)
                        queue_flag |= ENQUEUE_HEAD;
                p->sched_class = &rt_sched_class;
        } else {
                if (dl_prio(oldprio))
                        p->dl.dl_boosted = 0;
                if (rt_prio(oldprio))
                        p->rt.timeout = 0;
                p->sched_class = &fair_sched_class;
        }

So in the 3rd block, this is the case where we inherit a
new prio which is neither DL nor RT, so it's "fair".

If we try to assign a fair prio to a task of DL or RT
prio, the dl_boosted is set to 0, or the rt timeout is
set to 0. However, we do change the sched_class of the
target task to &fair_sched_class.

This code path seems to imply that a RT or DL task can
change sched_class to "fair". Indeed, it makes no sense,
so I have the feeling we're missing something important
here.


> 
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Steven Rostedt (Red Hat) <rostedt@goodmis.org>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>> Signed-off-by: Julien Desfossez <jdesfossez@efficios.com>
> 
> Who wrote the patch?

Julien is the author.

Thanks,

Mathieu

> 
> Thanks,
> 
> 	tglx

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [RFC PATCH 1/6] sched/fair: check_preempt_wakeup: Fix assumption on the default policy
  2016-09-20 22:04   ` Mathieu Desnoyers
@ 2016-09-20 22:56     ` Thomas Gleixner
  2016-09-21  0:09       ` Julien Desfossez
  0 siblings, 1 reply; 10+ messages in thread
From: Thomas Gleixner @ 2016-09-20 22:56 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Julien Desfossez, Peter Zijlstra, rostedt, Ingo Molnar, daolivei,
	linux-kernel

On Tue, 20 Sep 2016, Mathieu Desnoyers wrote:
> So what is then puzzling us is this:
> 
> rt_mutex_setprio()
> 
>         if (dl_prio(prio)) {
>                 struct task_struct *pi_task = rt_mutex_get_top_task(p);
>                 if (!dl_prio(p->normal_prio) ||
>                     (pi_task && dl_entity_preempt(&pi_task->dl, &p->dl))) {
>                         p->dl.dl_boosted = 1;
>                         queue_flag |= ENQUEUE_REPLENISH;
>                 } else
>                         p->dl.dl_boosted = 0;
>                 p->sched_class = &dl_sched_class;
>         } else if (rt_prio(prio)) {
>                 if (dl_prio(oldprio))
>                         p->dl.dl_boosted = 0;
>                 if (oldprio < prio)
>                         queue_flag |= ENQUEUE_HEAD;
>                 p->sched_class = &rt_sched_class;
>         } else {
>                 if (dl_prio(oldprio))
>                         p->dl.dl_boosted = 0;
>                 if (rt_prio(oldprio))
>                         p->rt.timeout = 0;
>                 p->sched_class = &fair_sched_class;
>         }
> 
> So in the 3rd block, this is the case where we inherit a
> new prio which is neither LD nor RT, so it's "fair".
> 
> If we try to assign a fair prio to a task of DL or RT
> prio, the dl_boosted is set to 0, or the rt timeout is
> set to 0. However, we do change the sched_class of the
> target task to &fair_sched_class.
> 
> This code path seems to imply that a RT or DL task can
> change sched_class to "fair". Indeed, it makes no sense,
> so I have the feeling we're missing something important
> here.

Yes. Look at the call site of this. This is called in two cases:

 - The task blocked on a lock pushes the task owning the lock and in that
   case it is guaranteed that the blocked task is in the same or in a
   higher scheduling class. We have no support for inverse PI (yet).

 - The task which got boosted deboosts itself after releasing the lock. In
   that case it falls back to its original class/priority/bandwidth.

Hope that helps.

> >> Cc: Peter Zijlstra <peterz@infradead.org>
> >> Cc: Steven Rostedt (Red Hat) <rostedt@goodmis.org>
> >> Cc: Thomas Gleixner <tglx@linutronix.de>
> >> Cc: Ingo Molnar <mingo@redhat.com>
> >> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> >> Signed-off-by: Julien Desfossez <jdesfossez@efficios.com>
> > 
> > Who wrote the patch?
> 
> Julien is the author.

So the SOB chain is wrong because you neither authored the patch nor
conveyed it.

Thanks,

	tglx


* Re: [RFC PATCH 1/6] sched/fair: check_preempt_wakeup: Fix assumption on the default policy
  2016-09-20 22:56     ` Thomas Gleixner
@ 2016-09-21  0:09       ` Julien Desfossez
  0 siblings, 0 replies; 10+ messages in thread
From: Julien Desfossez @ 2016-09-21  0:09 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Mathieu Desnoyers, Peter Zijlstra, rostedt, Ingo Molnar,
	daolivei, linux-kernel

On 21-Sep-2016 12:56:32 AM, Thomas Gleixner wrote:
> On Tue, 20 Sep 2016, Mathieu Desnoyers wrote:
> > So what is then puzzling us is this:
> > 
> > rt_mutex_setprio()
> > 
> >         if (dl_prio(prio)) {
> >                 struct task_struct *pi_task = rt_mutex_get_top_task(p);
> >                 if (!dl_prio(p->normal_prio) ||
> >                     (pi_task && dl_entity_preempt(&pi_task->dl, &p->dl))) {
> >                         p->dl.dl_boosted = 1;
> >                         queue_flag |= ENQUEUE_REPLENISH;
> >                 } else
> >                         p->dl.dl_boosted = 0;
> >                 p->sched_class = &dl_sched_class;
> >         } else if (rt_prio(prio)) {
> >                 if (dl_prio(oldprio))
> >                         p->dl.dl_boosted = 0;
> >                 if (oldprio < prio)
> >                         queue_flag |= ENQUEUE_HEAD;
> >                 p->sched_class = &rt_sched_class;
> >         } else {
> >                 if (dl_prio(oldprio))
> >                         p->dl.dl_boosted = 0;
> >                 if (rt_prio(oldprio))
> >                         p->rt.timeout = 0;
> >                 p->sched_class = &fair_sched_class;
> >         }
> > 
> > So in the 3rd block, this is the case where we inherit a
> > new prio which is neither DL nor RT, so it's "fair".
> > 
> > If we try to assign a fair prio to a task of DL or RT
> > prio, the dl_boosted is set to 0, or the rt timeout is
> > set to 0. However, we do change the sched_class of the
> > target task to &fair_sched_class.
> > 
> > This code path seems to imply that a RT or DL task can
> > change sched_class to "fair". Indeed, it makes no sense,
> > so I have the feeling we're missing something important
> > here.
> 
> Yes. Look at the call site of this. This is called in two cases:
> 
>  - The task blocked on a lock pushes the task owning the lock and in that
>    case it is guaranteed that the blocked task is in the same or in a
>    higher scheduling class. We have no support for inverse PI (yet).
> 
>  - The task which got boosted deboosts itself after releasing the lock. In
>    that case it falls back to its original class/priority/bandwidth.
> 
> Hope that helps.

Thanks for clarifying that. Indeed, there is no risk of ambiguity here
between the scheduling class and the policy for fair tasks, so this
patch is useless.

This was the only fix in the patchset; the other patches aim to extract
accurate scheduling information in the trace.

> > >> Cc: Peter Zijlstra <peterz@infradead.org>
> > >> Cc: Steven Rostedt (Red Hat) <rostedt@goodmis.org>
> > >> Cc: Thomas Gleixner <tglx@linutronix.de>
> > >> Cc: Ingo Molnar <mingo@redhat.com>
> > >> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> > >> Signed-off-by: Julien Desfossez <jdesfossez@efficios.com>
> > > 
> > > Who wrote the patch?
> > 
> > Julien is the author.
> 
> So the SOB chain is wrong because you neither authored the patch nor you
> conveyed it.

My bad, sorry about that.

Thanks,

Julien


