* [RFC PATCH v7 0/5] *** Latency histograms ***
From: Binoy Jayan @ 2016-09-20 13:51 UTC
  To: Steven Rostedt (Red Hat), Thomas Gleixner
  Cc: Ingo Molnar, Daniel Wagner, Arnd Bergmann, linux-kernel, Masami,
	Mark Brown, Binoy Jayan

Latency Histograms
=====================

1. Introduction
----------------
This series consists of patches meant for capturing latencies which
occur in a realtime system and rendering them as histograms of meaningful
data. A few of these patches were originally part of the PREEMPT_RT
patchset and the RT Linux tree. They have been adapted to make use of the
kernel event infrastructure and histogram triggers. This patch series aims
at preparing these patches for mainlining. The individual changelogs
describe the specific changes. To check the feasibility of mainlining this
series, please refer to section 5. The original patch set (Latency hist)
may be found under section 6. The patch series also contains an additional
histogram feature and a bug fix.

v6: https://lkml.org/lkml/2016/9/7/253

2. Changes v6 -> v7:
---------------------
 - The patch capturing additional per-process data is now part of the
   wakeup latency patch ('tracing: wakeup latency events and histograms')
 - Removed tracepoint condition '*_enabled' and let tracer core handle it
 - Adjusted the position of the new member in 'struct hrtimer' to reduce
   its footprint
 - Documented new member 'tim_expiry' in 'struct hrtimer'
 - For patch 3, use the timestamp from the tracer itself to avoid repeated
   requests to hardware
 - Removed unwanted type casts 
 - Added comments for new tracepoints

3. Questions which were unanswered in v6
-----------------------------------------
- How is this facility useful on its own?
It captures potential sources of latencies.

- What is the value of adding this?
The goal is not to document the deterministic execution time of an irq
(the "IRQ latency"), but to provide evidence that both the scheduler and
the context switch guarantee that a waiting user-space real-time
application will start no later than the maximum acceptable latency.
This makes it necessary to determine the "preemption latency", sometimes
also referred to as the "total latency".

- How is it used and what valuable debug information can be collected?
The information collected comprises the IRQ latency, the preemption
latency and the total latency.

For more information, please refer to the 'Latency Hist article' in the
reference section below.

4. Switchtime latency
----------------------

The two existing online methods to record system latency (timer and
wakeup latency) do not completely reflect the user-space wakeup time,
since the duration of the context switch is not taken into account. This
patch adds two new histograms - one that records the duration of the
context switch and another one that records the sum of the timer delay,
the wakeup latency and the newly available duration of the context
switch. The latter histogram is probably best suited to determining the
worst-case total preemption latency of a given system.
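
In terms of the quantities above, the two new histograms record:

  switchtime latency       = duration of the context switch
  total preemption latency = timer delay + wakeup latency
                             + switchtime latency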

NB: This patch (see 'Switchtime latency' under section 6) is presently
not part of the PREEMPT_RT patchset and has not yet been adapted to work
with the kernel event infrastructure. That can be done if the other
patches are considered for mainlining.

5. Mainlining and alternatives
-------------------------------
As of now, the preemptirqsoff, timer and wakeup latency histograms are
part of the PREEMPT_RT patch series, but not the switchtime latency,
which is maintained separately in an off-tree repository. The
preemptirqsoff and timer latency histograms alone may not make sense,
for the reasons mentioned in section 3. The implementation of these
includes modifying 'task_struct', which might not sound feasible to many
stakeholders.

The switchtime histogram has also not been adapted for mainlining, for
the same reason. Refer to the 'Latency Hist article' for more details.

An alternative to keeping the timestamp in 'task_struct', as suggested
by Daniel Wagner (for example in the wakeup histogram), is to create
kprobe events for 'wakeup' and 'wakeup_sched_switch' from userland and
use 'hist' triggers to compute the difference between the timestamps of
the two events. This could involve using tracing_maps and creating a
global variable to keep the process's context, but it seems like too
much overhead at runtime. Moreover, it would involve considerable work
from user space, like many of the tracing utilities, and defeats the
purpose of the simplicity of the in-kernel histograms.
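
For illustration, such a userland-driven setup might look roughly like
the following sketch (the probe symbols are only examples and would need
to be checked against kallsyms; the final 'difference' step is exactly
what the current hist triggers cannot express):

cd /sys/kernel/debug/tracing
echo 'p:wakeup ttwu_do_wakeup' >> kprobe_events
echo 'p:wakeup_sched_switch finish_task_switch' >> kprobe_events
echo 'hist:key=common_pid' > events/kprobes/wakeup/trigger
echo 'hist:key=common_pid' > events/kprobes/wakeup_sched_switch/trigger
# missing piece: there is no trigger syntax to subtract the timestamps
# of the two events on a per-pid basis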

6. Reference
-------------
Latency hist:          https://git.kernel.org/cgit/linux/kernel/git/rt/linux-stable-rt.git/commit/?h=v3.14-rt-rebase&id=56d50cc34943bbba12b8c5942ee1ae3b29f73acb
Latency Hist article:  https://www.osadl.org/Single-View.111+M538f5aa49d6.0.html
Switchtime latency:    http://www.spinics.net/lists/linux-rt-users/msg14377.html

-Binoy

Binoy Jayan (3):
  tracing: Add preemptirqsoff timing events
  tracing: Measure delayed hrtimer offset latency
  tracing: wakeup latency events and histograms

Daniel Wagner (2):
  tracing: Dereference pointers without RCU checks
  tracing: Add hist trigger support for generic fields

 include/linux/hrtimer.h             |  4 +++
 include/linux/rculist.h             | 36 +++++++++++++++++++++
 include/linux/sched.h               |  6 ++++
 include/linux/tracepoint.h          |  4 +--
 include/trace/events/latency.h      | 62 +++++++++++++++++++++++++++++++++++++
 include/trace/events/sched.h        | 34 ++++++++++++++++++++
 kernel/time/Kconfig                 |  8 +++++
 kernel/time/hrtimer.c               | 47 ++++++++++++++++++++++++++++
 kernel/trace/Kconfig                | 10 ++++++
 kernel/trace/trace_events.c         | 13 ++++----
 kernel/trace/trace_events_filter.c  |  4 +--
 kernel/trace/trace_events_hist.c    | 36 +++++++++++++++++----
 kernel/trace/trace_events_trigger.c |  6 ++--
 kernel/trace/trace_irqsoff.c        | 43 ++++++++++++++++---------
 kernel/trace/trace_sched_wakeup.c   | 35 +++++++++++++++++++--
 15 files changed, 312 insertions(+), 36 deletions(-)
 create mode 100644 include/trace/events/latency.h

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project

* [RFC PATCH v7 1/5] tracing: Dereference pointers without RCU checks
From: Binoy Jayan @ 2016-09-20 13:51 UTC
  To: Steven Rostedt (Red Hat), Thomas Gleixner
  Cc: Ingo Molnar, Daniel Wagner, Arnd Bergmann, linux-kernel, Masami,
	Mark Brown, Binoy Jayan

From: Daniel Wagner <daniel.wagner@bmw-carit.de>

Tracepoints can't be used in code sections where we are in the middle
of a state transition.

For example, if we place a tracepoint inside start/stop_critical_section(),
lockdep complains with:

[    0.035589] WARNING: CPU: 0 PID: 3 at kernel/locking/lockdep.c:3560 check_flags.part.36+0x1bc/0x210()
[    0.036000] DEBUG_LOCKS_WARN_ON(current->softirqs_enabled)
[    0.036000] Kernel panic - not syncing: panic_on_warn set ...
[    0.036000]
[    0.036000] CPU: 0 PID: 3 Comm: ksoftirqd/0 Not tainted 4.0.0-rc7+ #460
[    0.036000] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153950- 04/01/2014
[    0.036000]  ffffffff81f2463a ffff88007c93bb98 ffffffff81afb317 0000000000000001
[    0.036000]  ffffffff81f212b3 ffff88007c93bc18 ffffffff81af7bc2 ffff88007c93bbb8
[    0.036000]  ffffffff00000008 ffff88007c93bc28 ffff88007c93bbc8 000000000093bbd8
[    0.036000] Call Trace:
[    0.036000]  [<ffffffff81afb317>] dump_stack+0x4f/0x7b
[    0.036000]  [<ffffffff81af7bc2>] panic+0xc0/0x1e9
[    0.036000]  [<ffffffff81b07188>] ? _raw_spin_unlock_irqrestore+0x38/0x80
[    0.036000]  [<ffffffff81052a60>] warn_slowpath_common+0xc0/0xc0
[    0.036000]  [<ffffffff81057116>] ? __local_bh_enable+0x36/0x70
[    0.036000]  [<ffffffff81052aa6>] warn_slowpath_fmt+0x46/0x50
[    0.036000]  [<ffffffff810a468c>] check_flags.part.36+0x1bc/0x210
[    0.036000]  [<ffffffff810a5168>] lock_is_held+0x78/0x90
[    0.036000]  [<ffffffff81057116>] ? __local_bh_enable+0x36/0x70
[    0.036000]  [<ffffffff810588ab>] ? __do_softirq+0x3db/0x500
[    0.036000]  [<ffffffff81134495>] trace_preempt_on+0x255/0x260
[    0.036000]  [<ffffffff81081c0b>] preempt_count_sub+0xab/0xf0
[    0.036000]  [<ffffffff81057116>] __local_bh_enable+0x36/0x70
[    0.036000]  [<ffffffff810588ab>] __do_softirq+0x3db/0x500
[    0.036000]  [<ffffffff810589ef>] run_ksoftirqd+0x1f/0x60
[    0.036000]  [<ffffffff81079653>] smpboot_thread_fn+0x193/0x2a0
[    0.036000]  [<ffffffff810794c0>] ? SyS_setgroups+0x150/0x150
[    0.036000]  [<ffffffff81075312>] kthread+0xf2/0x110
[    0.036000]  [<ffffffff81b026b3>] ? wait_for_completion+0xc3/0x120
[    0.036000]  [<ffffffff81081c0b>] ? preempt_count_sub+0xab/0xf0
[    0.036000]  [<ffffffff81075220>] ? kthread_create_on_node+0x240/0x240
[    0.036000]  [<ffffffff81b07cc8>] ret_from_fork+0x58/0x90
[    0.036000]  [<ffffffff81075220>] ? kthread_create_on_node+0x240/0x240
[    0.036000] ---[ end Kernel panic - not syncing: panic_on_warn set ...

PeterZ was so kind as to explain to me what is happening:

"__local_bh_enable() tests if this is the last SOFTIRQ_OFFSET, if so it
tells lockdep softirqs are enabled with trace_softirqs_on() after that
we go an actually modify the preempt_count with preempt_count_sub().
Then in preempt_count_sub() you call into trace_preempt_on() if this
was the last preempt_count increment but you do that _before_ you
actually change the preempt_count with __preempt_count_sub() at this
point lockdep and preempt_count think the world differs and *boom*"
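
In code, the ordering Peter describes looks roughly like this (a
simplified sketch of kernel/softirq.c, not the verbatim source):

static void __local_bh_enable(unsigned int cnt)
{
	if (softirq_count() == (cnt & SOFTIRQ_MASK))
		trace_softirqs_on(_RET_IP_);	/* lockdep: softirqs now on */

	/*
	 * preempt_count_sub() calls trace_preempt_on() *before*
	 * __preempt_count_sub() updates the preempt count, so a
	 * tracepoint fired from there sees lockdep and preempt_count
	 * disagree about the world - and lockdep's check_flags() warns.
	 */
	preempt_count_sub(cnt);
}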

So the simplest way to avoid this is by disabling the consistency
checks.

We also need to take care of the iteration in trace_events_trigger.c
to avoid a splat in conjunction with the hist trigger.

Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Signed-off-by: Binoy Jayan <binoy.jayan@linaro.org>
---
 include/linux/rculist.h             | 36 ++++++++++++++++++++++++++++++++++++
 include/linux/tracepoint.h          |  4 ++--
 kernel/trace/trace_events_filter.c  |  4 ++--
 kernel/trace/trace_events_trigger.c |  6 +++---
 4 files changed, 43 insertions(+), 7 deletions(-)

diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index 8beb98d..bee836b 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -279,6 +279,24 @@ static inline void list_splice_tail_init_rcu(struct list_head *list,
 	container_of(lockless_dereference(ptr), type, member)
 
 /**
+ * list_entry_rcu_notrace - get the struct for this entry (for tracing)
+ * @ptr:        the &struct list_head pointer.
+ * @type:       the type of the struct this is embedded in.
+ * @member:     the name of the list_head within the struct.
+ *
+ * This primitive may safely run concurrently with the _rcu list-mutation
+ * primitives such as list_add_rcu() as long as it's guarded by rcu_read_lock().
+ *
+ * This is the same as list_entry_rcu() except that it does
+ * not do any RCU debugging or tracing.
+ */
+#define list_entry_rcu_notrace(ptr, type, member) \
+({ \
+	typeof(*ptr) __rcu *__ptr = (typeof(*ptr) __rcu __force *)ptr; \
+	container_of((typeof(ptr))rcu_dereference_raw_notrace(__ptr), type, member); \
+})
+
+/**
  * Where are list_empty_rcu() and list_first_entry_rcu()?
  *
  * Implementing those functions following their counterparts list_empty() and
@@ -391,6 +409,24 @@ static inline void list_splice_tail_init_rcu(struct list_head *list,
 	     pos = list_entry_lockless(pos->member.next, typeof(*pos), member))
 
 /**
+ * list_for_each_entry_rcu_notrace	-	iterate over rcu list of given type (for tracing)
+ * @pos:	the type * to use as a loop cursor.
+ * @head:	the head for your list.
+ * @member:	the name of the list_head within the struct.
+ *
+ * This list-traversal primitive may safely run concurrently with
+ * the _rcu list-mutation primitives such as list_add_rcu()
+ * as long as the traversal is guarded by rcu_read_lock().
+ *
+ * This is the same as list_for_each_entry_rcu() except that it does
+ * not do any RCU debugging or tracing.
+ */
+#define list_for_each_entry_rcu_notrace(pos, head, member) \
+	for (pos = list_entry_rcu_notrace((head)->next, typeof(*pos), member); \
+		&pos->member != (head); \
+		pos = list_entry_rcu_notrace(pos->member.next, typeof(*pos), member))
+
+/**
  * list_for_each_entry_continue_rcu - continue iteration over list of given type
  * @pos:	the type * to use as a loop cursor.
  * @head:	the head for your list.
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index be586c6..d0e2a82 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -138,7 +138,7 @@ extern void syscall_unregfunc(void);
 			return;						\
 		prercu;							\
 		rcu_read_lock_sched_notrace();				\
-		it_func_ptr = rcu_dereference_sched((tp)->funcs);	\
+		it_func_ptr = rcu_dereference_raw_notrace((tp)->funcs);	\
 		if (it_func_ptr) {					\
 			do {						\
 				it_func = (it_func_ptr)->func;		\
@@ -189,7 +189,7 @@ extern void syscall_unregfunc(void);
 				TP_CONDITION(cond),,);			\
 		if (IS_ENABLED(CONFIG_LOCKDEP) && (cond)) {		\
 			rcu_read_lock_sched_notrace();			\
-			rcu_dereference_sched(__tracepoint_##name.funcs);\
+			rcu_dereference_raw_notrace(__tracepoint_##name.funcs);\
 			rcu_read_unlock_sched_notrace();		\
 		}							\
 	}								\
diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index 9daa9b3..a00a8d5 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -608,11 +608,11 @@ int filter_match_preds(struct event_filter *filter, void *rec)
 	/*
 	 * n_preds, root and filter->preds are protect with preemption disabled.
 	 */
-	root = rcu_dereference_sched(filter->root);
+	root = rcu_dereference_raw_notrace(filter->root);
 	if (!root)
 		return 1;
 
-	data.preds = preds = rcu_dereference_sched(filter->preds);
+	data.preds = preds = rcu_dereference_raw_notrace(filter->preds);
 	ret = walk_pred_tree(preds, root, filter_match_preds_cb, &data);
 	WARN_ON(ret);
 	return data.match;
diff --git a/kernel/trace/trace_events_trigger.c b/kernel/trace/trace_events_trigger.c
index a975571..4489fb4 100644
--- a/kernel/trace/trace_events_trigger.c
+++ b/kernel/trace/trace_events_trigger.c
@@ -71,14 +71,14 @@ event_triggers_call(struct trace_event_file *file, void *rec)
 	if (list_empty(&file->triggers))
 		return tt;
 
-	list_for_each_entry_rcu(data, &file->triggers, list) {
+	list_for_each_entry_rcu_notrace(data, &file->triggers, list) {
 		if (data->paused)
 			continue;
 		if (!rec) {
 			data->ops->func(data, rec);
 			continue;
 		}
-		filter = rcu_dereference_sched(data->filter);
+		filter = rcu_dereference_raw_notrace(data->filter);
 		if (filter && !filter_match_preds(filter, rec))
 			continue;
 		if (event_command_post_trigger(data->cmd_ops)) {
@@ -111,7 +111,7 @@ event_triggers_post_call(struct trace_event_file *file,
 {
 	struct event_trigger_data *data;
 
-	list_for_each_entry_rcu(data, &file->triggers, list) {
+	list_for_each_entry_rcu_notrace(data, &file->triggers, list) {
 		if (data->paused)
 			continue;
 		if (data->cmd_ops->trigger_type & tt)
-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project

* [RFC PATCH v7 2/5] tracing: Add hist trigger support for generic fields
From: Binoy Jayan @ 2016-09-20 13:51 UTC
  To: Steven Rostedt (Red Hat), Thomas Gleixner
  Cc: Ingo Molnar, Daniel Wagner, Arnd Bergmann, linux-kernel, Masami,
	Mark Brown

From: Daniel Wagner <daniel.wagner@bmw-carit.de>

Whenever a trace is printed the generic fields (CPU, COMM) are
reconstructed (see trace_print_context()). CPU is taken from the
trace_iterator and COMM is extracted from the savedcmd map (see
__trace_find_cmdline()).

We can't reconstruct this information for hist events. Therefore this
information needs to be stored when a new event is added to the hist
buffer.

There is already support for extracting the COMM for the common_pid
field; for this, the tracing_map_ops infrastructure is used.
Unfortunately, we can't reuse it because it extends an existing
hist_field. That means we first need to add a hist_field before we are
able to reuse tracing_map_ops.

Furthermore, it is quite easy to extend the current code to support
those two fields by adding hist_field_cpu() and hist_field_comm().
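
As an example of what this enables, a hist trigger keyed on the generic
fields could then be set up like this (the event chosen here is only
illustrative):

echo 'hist:key=cpu,comm:val=hitcount:sort=hitcount' > \
        /sys/kernel/debug/tracing/events/sched/sched_switch/trigger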

Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
---
 kernel/trace/trace_events.c      | 13 +++++++------
 kernel/trace/trace_events_hist.c | 36 ++++++++++++++++++++++++++++++------
 2 files changed, 37 insertions(+), 12 deletions(-)

diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index 03c0a48..ea8da30 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -150,9 +150,10 @@ int trace_define_field(struct trace_event_call *call, const char *type,
 }
 EXPORT_SYMBOL_GPL(trace_define_field);
 
-#define __generic_field(type, item, filter_type)			\
+#define __generic_field(type, item, filter_type, size)			\
 	ret = __trace_define_field(&ftrace_generic_fields, #type,	\
-				   #item, 0, 0, is_signed_type(type),	\
+				   #item, 0, size,			\
+				   is_signed_type(type),                \
 				   filter_type);			\
 	if (ret)							\
 		return ret;
@@ -170,10 +171,10 @@ static int trace_define_generic_fields(void)
 {
 	int ret;
 
-	__generic_field(int, CPU, FILTER_CPU);
-	__generic_field(int, cpu, FILTER_CPU);
-	__generic_field(char *, COMM, FILTER_COMM);
-	__generic_field(char *, comm, FILTER_COMM);
+	__generic_field(int, CPU, FILTER_CPU, sizeof(int));
+	__generic_field(int, cpu, FILTER_CPU, sizeof(int));
+	__generic_field(char *, COMM, FILTER_COMM, TASK_COMM_LEN + 1);
+	__generic_field(char *, comm, FILTER_COMM, TASK_COMM_LEN + 1);
 
 	return ret;
 }
diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c
index f3a960e..7ed6743 100644
--- a/kernel/trace/trace_events_hist.c
+++ b/kernel/trace/trace_events_hist.c
@@ -75,6 +75,16 @@ static u64 hist_field_log2(struct hist_field *hist_field, void *event)
 	return (u64) ilog2(roundup_pow_of_two(val));
 }
 
+static u64 hist_field_cpu(struct hist_field *hist_field, void *event)
+{
+	return (u64) smp_processor_id();
+}
+
+static u64 hist_field_comm(struct hist_field *hist_field, void *event)
+{
+	return (u64) (unsigned long) current->comm;
+}
+
 #define DEFINE_HIST_FIELD_FN(type)					\
 static u64 hist_field_##type(struct hist_field *hist_field, void *event)\
 {									\
@@ -119,6 +129,8 @@ enum hist_field_flags {
 	HIST_FIELD_FL_SYSCALL		= 128,
 	HIST_FIELD_FL_STACKTRACE	= 256,
 	HIST_FIELD_FL_LOG2		= 512,
+	HIST_FIELD_FL_CPU		= 1024,
+	HIST_FIELD_FL_COMM		= 2048,
 };
 
 struct hist_trigger_attrs {
@@ -374,7 +386,13 @@ static struct hist_field *create_hist_field(struct ftrace_event_field *field,
 	if (WARN_ON_ONCE(!field))
 		goto out;
 
-	if (is_string_field(field)) {
+	if (field->filter_type == FILTER_CPU) {
+		flags |= HIST_FIELD_FL_CPU;
+		hist_field->fn = hist_field_cpu;
+	} else if (field->filter_type == FILTER_COMM) {
+		flags |= HIST_FIELD_FL_COMM;
+		hist_field->fn = hist_field_comm;
+	} else if (is_string_field(field)) {
 		flags |= HIST_FIELD_FL_STRING;
 
 		if (field->filter_type == FILTER_STATIC_STRING)
@@ -748,7 +766,8 @@ static int create_tracing_map_fields(struct hist_trigger_data *hist_data)
 
 			if (hist_field->flags & HIST_FIELD_FL_STACKTRACE)
 				cmp_fn = tracing_map_cmp_none;
-			else if (is_string_field(field))
+			else if (is_string_field(field) ||
+				 hist_field->flags & HIST_FIELD_FL_COMM)
 				cmp_fn = tracing_map_cmp_string;
 			else
 				cmp_fn = tracing_map_cmp_num(field->size,
@@ -856,11 +875,9 @@ static inline void add_to_key(char *compound_key, void *key,
 			      struct hist_field *key_field, void *rec)
 {
 	size_t size = key_field->size;
+	struct ftrace_event_field *field = key_field->field;
 
 	if (key_field->flags & HIST_FIELD_FL_STRING) {
-		struct ftrace_event_field *field;
-
-		field = key_field->field;
 		if (field->filter_type == FILTER_DYN_STRING)
 			size = *(u32 *)(rec + field->offset) >> 16;
 		else if (field->filter_type == FILTER_PTR_STRING)
@@ -871,6 +888,8 @@ static inline void add_to_key(char *compound_key, void *key,
 		/* ensure NULL-termination */
 		if (size > key_field->size - 1)
 			size = key_field->size - 1;
+	} else if (field->filter_type == FILTER_STATIC_STRING) {
+		size = field->size;
 	}
 
 	memcpy(compound_key + key_field->offset, key, size);
@@ -906,7 +925,8 @@ static void event_hist_trigger(struct event_trigger_data *data, void *rec)
 			key = entries;
 		} else {
 			field_contents = key_field->fn(key_field, rec);
-			if (key_field->flags & HIST_FIELD_FL_STRING) {
+			if (key_field->flags & (HIST_FIELD_FL_STRING |
+						HIST_FIELD_FL_COMM)) {
 				key = (void *)(unsigned long)field_contents;
 				use_compound_key = true;
 			} else
@@ -1004,6 +1024,10 @@ hist_trigger_entry_print(struct seq_file *m,
 		} else if (key_field->flags & HIST_FIELD_FL_STRING) {
 			seq_printf(m, "%s: %-50s", key_field->field->name,
 				   (char *)(key + key_field->offset));
+		} else if (key_field->flags & HIST_FIELD_FL_COMM) {
+			seq_printf(m, "%s: %-16s",
+				   key_field->field->name,
+				   (char *)(key + key_field->offset));
 		} else {
 			uval = *(u64 *)(key + key_field->offset);
 			seq_printf(m, "%s: %10llu", key_field->field->name,
-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project

* [RFC PATCH v7 3/5] tracing: Add preemptirqsoff timing events
From: Binoy Jayan @ 2016-09-20 13:51 UTC
  To: Steven Rostedt (Red Hat), Thomas Gleixner
  Cc: Ingo Molnar, Daniel Wagner, Arnd Bergmann, linux-kernel, Masami,
	Mark Brown, Binoy Jayan, Carsten Emde

Potential sources of latencies are code segments where interrupts,
preemption or both are disabled (aka critical sections). To create
histograms of potential sources of latency, the kernel stores the time
stamp at the start of a critical section, determines the time elapsed
when the end of the section is reached, and increments the frequency
counter of that latency value - irrespective of whether any concurrently
running process is affected by latency or not.

With the tracepoints for irqs off, preempt off and critical timing added
at the end of the critical sections, the potential sources of latencies
occurring in these sections can be measured in nanoseconds. With the
hist triggers in place, histogram plots may be generated with a per-cpu
breakdown of the captured events. This is based on the Linux kernel's
event infrastructure.

The following hist triggers may be used with the histograms:

'hist:key=latency.log2:val=hitcount:sort=latency'
'hist:key=ltype,latency:val=hitcount:sort=latency if cpu==1'
'hist:key=ltype:val=latency:sort=ltype if ltype==0 && cpu==2'

Where ltype is:
1: IRQSOFF latency
2: PREEMPTOFF latency
3: Critical timings

Enable one or more of the following tracers to capture the associated
latencies, i.e. irq, preempt and critical timing.

In '/sys/kernel/debug/tracing'

echo irqsoff > current_tracer - irq and critical time latencies
echo preemptoff > current_tracer - preempt and critical time latencies
echo preemptirqsoff > current_tracer - irq, preempt and critical timing
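
Putting it together, a complete session might look like this (assuming
debugfs is mounted at /sys/kernel/debug and using the trigger from the
first example above):

cd /sys/kernel/debug/tracing
echo preemptirqsoff > current_tracer
echo 'hist:key=latency.log2:val=hitcount:sort=latency' > \
        events/latency/latency_preemptirqsoff/trigger
cat events/latency/latency_preemptirqsoff/hist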

[
- Initial work and idea by Carsten as part of the PREEMPT_RT patch series
  Link: https://git.kernel.org/cgit/linux/kernel/git/rt/linux-stable-rt.git/commit/?h=v3.14-rt-rebase&id=56d50cc34943bbba12b8c5942ee1ae3b29f73acb
  No code was taken from the RT patch.
- RFC using the hist infrastructure code by Daniel.
- Rewritten as Daniel suggested; took over authorship.
]

Cc: Carsten Emde <C.Emde@osadl.org>
Cc: Daniel Wagner <daniel.wagner@bmw-carit.de>
Signed-off-by: Binoy Jayan <binoy.jayan@linaro.org>
---
 include/trace/events/latency.h | 62 ++++++++++++++++++++++++++++++++++++++++++
 kernel/trace/trace_irqsoff.c   | 43 +++++++++++++++++++----------
 2 files changed, 91 insertions(+), 14 deletions(-)
 create mode 100644 include/trace/events/latency.h

diff --git a/include/trace/events/latency.h b/include/trace/events/latency.h
new file mode 100644
index 0000000..66442d5
--- /dev/null
+++ b/include/trace/events/latency.h
@@ -0,0 +1,62 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM latency
+
+#if !defined(_TRACE_HIST_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_HIST_H
+
+#include <linux/tracepoint.h>
+
+#ifndef __TRACE_LATENCY_TYPE
+#define __TRACE_LATENCY_TYPE
+
+enum latency_type {
+	LT_NONE,
+	LT_IRQ,
+	LT_PREEMPT,
+	LT_CRITTIME,
+	LT_MAX
+};
+
+TRACE_DEFINE_ENUM(LT_IRQ);
+TRACE_DEFINE_ENUM(LT_PREEMPT);
+TRACE_DEFINE_ENUM(LT_CRITTIME);
+
+#define show_ltype(type)			\
+	__print_symbolic(type,			\
+		{ LT_IRQ,	"IRQ" },	\
+		{ LT_PREEMPT,	"PREEMPT" },	\
+		{ LT_CRITTIME,	"CRIT_TIME" })
+#endif
+
+DECLARE_EVENT_CLASS(latency_template,
+	TP_PROTO(int ltype, u64 latency),
+
+	TP_ARGS(ltype, latency),
+
+	TP_STRUCT__entry(
+		__field(int,		ltype)
+		__field(u64,		latency)
+	),
+
+	TP_fast_assign(
+		__entry->ltype		= ltype;
+		__entry->latency	= latency;
+	),
+
+	TP_printk("ltype=%s(%d), latency=%lu", show_ltype(__entry->ltype),
+		  __entry->ltype, (unsigned long) __entry->latency)
+);
+
+/**
+ * latency_preemptirqsoff - called when a cpu exits state of preemption / irq
+ * @ltype:	type of the critical section. Refer 'show_ltype'
+ * @latency:	latency in nano seconds
+ */
+DEFINE_EVENT(latency_template, latency_preemptirqsoff,
+	    TP_PROTO(int ltype, u64 latency),
+	    TP_ARGS(ltype, latency));
+
+#endif /* _TRACE_HIST_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/kernel/trace/trace_irqsoff.c b/kernel/trace/trace_irqsoff.c
index 03cdff8..4f7442d 100644
--- a/kernel/trace/trace_irqsoff.c
+++ b/kernel/trace/trace_irqsoff.c
@@ -13,9 +13,14 @@
 #include <linux/uaccess.h>
 #include <linux/module.h>
 #include <linux/ftrace.h>
+#include <linux/percpu-defs.h>
+#include <trace/events/sched.h>
 
 #include "trace.h"
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/latency.h>
+
 static struct trace_array		*irqsoff_trace __read_mostly;
 static int				tracer_enabled __read_mostly;
 
@@ -298,11 +303,18 @@ static bool report_latency(struct trace_array *tr, cycle_t delta)
 	return true;
 }
 
+static inline void latency_preemptirqsoff_timing(enum latency_type type,
+						 cycle_t delta)
+{
+	trace_latency_preemptirqsoff(type, (u64) delta);
+}
+
 static void
 check_critical_timing(struct trace_array *tr,
 		      struct trace_array_cpu *data,
 		      unsigned long parent_ip,
-		      int cpu)
+		      int cpu,
+		      enum latency_type ltype)
 {
 	cycle_t T0, T1, delta;
 	unsigned long flags;
@@ -312,6 +324,7 @@ check_critical_timing(struct trace_array *tr,
 	T1 = ftrace_now(cpu);
 	delta = T1-T0;
 
+	latency_preemptirqsoff_timing(ltype, delta);
 	local_save_flags(flags);
 
 	pc = preempt_count();
@@ -351,7 +364,8 @@ out:
 }
 
 static inline void
-start_critical_timing(unsigned long ip, unsigned long parent_ip)
+start_critical_timing(unsigned long ip, unsigned long parent_ip,
+		      enum latency_type ltype)
 {
 	int cpu;
 	struct trace_array *tr = irqsoff_trace;
@@ -387,7 +401,8 @@ start_critical_timing(unsigned long ip, unsigned long parent_ip)
 }
 
 static inline void
-stop_critical_timing(unsigned long ip, unsigned long parent_ip)
+stop_critical_timing(unsigned long ip, unsigned long parent_ip,
+		     enum latency_type ltype)
 {
 	int cpu;
 	struct trace_array *tr = irqsoff_trace;
@@ -414,7 +429,7 @@ stop_critical_timing(unsigned long ip, unsigned long parent_ip)
 
 	local_save_flags(flags);
 	__trace_function(tr, ip, parent_ip, flags, preempt_count());
-	check_critical_timing(tr, data, parent_ip ? : ip, cpu);
+	check_critical_timing(tr, data, parent_ip ? : ip, cpu, ltype);
 	data->critical_start = 0;
 	atomic_dec(&data->disabled);
 }
@@ -423,14 +438,14 @@ stop_critical_timing(unsigned long ip, unsigned long parent_ip)
 void start_critical_timings(void)
 {
 	if (preempt_trace() || irq_trace())
-		start_critical_timing(CALLER_ADDR0, CALLER_ADDR1);
+		start_critical_timing(CALLER_ADDR0, CALLER_ADDR1, LT_CRITTIME);
 }
 EXPORT_SYMBOL_GPL(start_critical_timings);
 
 void stop_critical_timings(void)
 {
 	if (preempt_trace() || irq_trace())
-		stop_critical_timing(CALLER_ADDR0, CALLER_ADDR1);
+		stop_critical_timing(CALLER_ADDR0, CALLER_ADDR1, LT_CRITTIME);
 }
 EXPORT_SYMBOL_GPL(stop_critical_timings);
 
@@ -439,13 +454,13 @@ EXPORT_SYMBOL_GPL(stop_critical_timings);
 void time_hardirqs_on(unsigned long a0, unsigned long a1)
 {
 	if (!preempt_trace() && irq_trace())
-		stop_critical_timing(a0, a1);
+		stop_critical_timing(a0, a1, LT_IRQ);
 }
 
 void time_hardirqs_off(unsigned long a0, unsigned long a1)
 {
 	if (!preempt_trace() && irq_trace())
-		start_critical_timing(a0, a1);
+		start_critical_timing(a0, a1, LT_IRQ);
 }
 
 #else /* !CONFIG_PROVE_LOCKING */
@@ -472,28 +487,28 @@ inline void print_irqtrace_events(struct task_struct *curr)
 void trace_hardirqs_on(void)
 {
 	if (!preempt_trace() && irq_trace())
-		stop_critical_timing(CALLER_ADDR0, CALLER_ADDR1);
+		stop_critical_timing(CALLER_ADDR0, CALLER_ADDR1, LT_IRQ);
 }
 EXPORT_SYMBOL(trace_hardirqs_on);
 
 void trace_hardirqs_off(void)
 {
 	if (!preempt_trace() && irq_trace())
-		start_critical_timing(CALLER_ADDR0, CALLER_ADDR1);
+		start_critical_timing(CALLER_ADDR0, CALLER_ADDR1, LT_IRQ);
 }
 EXPORT_SYMBOL(trace_hardirqs_off);
 
 __visible void trace_hardirqs_on_caller(unsigned long caller_addr)
 {
 	if (!preempt_trace() && irq_trace())
-		stop_critical_timing(CALLER_ADDR0, caller_addr);
+		stop_critical_timing(CALLER_ADDR0, caller_addr, LT_IRQ);
 }
 EXPORT_SYMBOL(trace_hardirqs_on_caller);
 
 __visible void trace_hardirqs_off_caller(unsigned long caller_addr)
 {
 	if (!preempt_trace() && irq_trace())
-		start_critical_timing(CALLER_ADDR0, caller_addr);
+		start_critical_timing(CALLER_ADDR0, caller_addr, LT_IRQ);
 }
 EXPORT_SYMBOL(trace_hardirqs_off_caller);
 
@@ -504,13 +519,13 @@ EXPORT_SYMBOL(trace_hardirqs_off_caller);
 void trace_preempt_on(unsigned long a0, unsigned long a1)
 {
 	if (preempt_trace() && !irq_trace())
-		stop_critical_timing(a0, a1);
+		stop_critical_timing(a0, a1, LT_PREEMPT);
 }
 
 void trace_preempt_off(unsigned long a0, unsigned long a1)
 {
 	if (preempt_trace() && !irq_trace())
-		start_critical_timing(a0, a1);
+		start_critical_timing(a0, a1, LT_PREEMPT);
 }
 #endif /* CONFIG_PREEMPT_TRACER */
 
-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project

* [RFC PATCH v7 4/5] tracing: Measure delayed hrtimer offset latency
From: Binoy Jayan @ 2016-09-20 13:51 UTC
  To: Steven Rostedt (Red Hat), Thomas Gleixner
  Cc: Ingo Molnar, Daniel Wagner, Arnd Bergmann, linux-kernel, Masami,
	Mark Brown, Binoy Jayan, Carsten Emde

Measure latencies caused by delayed timer offsets, in nanoseconds. This
captures the latency due to a delayed timer expiry event, which happens
for example when a timer misses its deadline due to disabled interrupts.
A process scheduled as a result of the timer expiration suffers this
latency. The value is used to calculate the total wakeup latency of a
process, which is the sum of the delayed timer offset and the wakeup
latency.
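
Schematically, in terms of the values this series records:

  timer_offset         = actual expiry time - programmed expiry time
  total wakeup latency = timer_offset + wakeup latency

(the sum is computed by the wakeup latency patch that follows).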

[
Initial work and idea by Carsten
Link: https://git.kernel.org/cgit/linux/kernel/git/rt/linux-stable-rt.git/commit/?h=v3.14-rt-rebase&id=56d50cc34943bbba12b8c5942ee1ae3b29f73acb
]

Cc: Carsten Emde <C.Emde@osadl.org>
Signed-off-by: Binoy Jayan <binoy.jayan@linaro.org>
---
 include/linux/hrtimer.h |  4 ++++
 include/linux/sched.h   |  3 +++
 kernel/time/Kconfig     |  8 ++++++++
 kernel/time/hrtimer.c   | 47 +++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 62 insertions(+)

diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index 5e00f80..05d8086 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -90,6 +90,7 @@ enum hrtimer_restart {
  * @is_rel:	Set if the timer was armed relative
  * @start_pid:  timer statistics field to store the pid of the task which
  *		started the timer
+ * @tim_expiry: hrtimer expiry time or 0 in case already expired
  * @start_site:	timer statistics field to store the site where the timer
  *		was started
  * @start_comm: timer statistics field to store the name of the process which
@@ -104,6 +105,9 @@ struct hrtimer {
 	struct hrtimer_clock_base	*base;
 	u8				state;
 	u8				is_rel;
+#ifdef CONFIG_TRACE_DELAYED_TIMER_OFFSETS
+	ktime_t				tim_expiry;
+#endif
 #ifdef CONFIG_TIMER_STATS
 	int				start_pid;
 	void				*start_site;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 62c68e5..7bf67f8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1891,6 +1891,9 @@ struct task_struct {
 	/* bitmask and counter of trace recursion */
 	unsigned long trace_recursion;
 #endif /* CONFIG_TRACING */
+#ifdef CONFIG_TRACE_DELAYED_TIMER_OFFSETS
+	long timer_offset;
+#endif /* CONFIG_TRACE_DELAYED_TIMER_OFFSETS */
 #ifdef CONFIG_KCOV
 	/* Coverage collection mode enabled for this task (0 if disabled). */
 	enum kcov_mode kcov_mode;
diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index 4008d9f..de4793c 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -193,5 +193,13 @@ config HIGH_RES_TIMERS
 	  hardware is not capable then this option only increases
 	  the size of the kernel image.
 
+config TRACE_DELAYED_TIMER_OFFSETS
+	depends on HIGH_RES_TIMERS
+	select GENERIC_TRACER
+	bool "Delayed Timer Offsets"
+	help
+	  Capture offsets of delayed hrtimer in nanoseconds. It is used
+	  to construct wakeup latency histogram.
+
 endmenu
 endif
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 9ba7c82..7048f86 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -56,6 +56,8 @@
 
 #include "tick-internal.h"
 
+static enum hrtimer_restart hrtimer_wakeup(struct hrtimer *timer);
+
 /*
  * The timer bases:
  *
@@ -960,6 +962,47 @@ static inline ktime_t hrtimer_update_lowres(struct hrtimer *timer, ktime_t tim,
 	return tim;
 }
 
+#ifdef CONFIG_TRACE_DELAYED_TIMER_OFFSETS
+static inline void latency_hrtimer_timing_start(struct hrtimer *timer,
+					 struct hrtimer_clock_base *new_base,
+					 ktime_t tim)
+{
+	ktime_t now = new_base->get_time();
+
+	if (ktime_to_ns(tim) < ktime_to_ns(now))
+		timer->tim_expiry = now;
+	else
+		timer->tim_expiry = ktime_set(0, 0);
+}
+
+static inline void latency_hrtimer_timing_stop(struct hrtimer *timer,
+						ktime_t basenow)
+{
+	long latency;
+	struct task_struct *task;
+
+	latency = ktime_to_ns(ktime_sub(basenow,
+			      ktime_to_ns(timer->tim_expiry) ?
+			      timer->tim_expiry : hrtimer_get_expires(timer)));
+	task = timer->function == hrtimer_wakeup ?
+			container_of(timer, struct hrtimer_sleeper,
+				     timer)->task : NULL;
+	if (task && latency > 0)
+		task->timer_offset = latency;
+}
+#else
+static inline void latency_hrtimer_timing_start(struct hrtimer *timer,
+					 struct hrtimer_clock_base *new_base,
+					 ktime_t tim)
+{
+}
+static inline void latency_hrtimer_timing_stop(struct hrtimer *timer,
+						ktime_t basenow)
+{
+}
+
+#endif
+
 /**
  * hrtimer_start_range_ns - (re)start an hrtimer on the current CPU
  * @timer:	the timer to be added
@@ -992,6 +1035,8 @@ void hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
 
 	timer_stats_hrtimer_set_start_info(timer);
 
+	latency_hrtimer_timing_start(timer, new_base, tim);
+
 	leftmost = enqueue_hrtimer(timer, new_base);
 	if (!leftmost)
 		goto unlock;
@@ -1284,6 +1329,8 @@ static void __hrtimer_run_queues(struct hrtimer_cpu_base *cpu_base, ktime_t now)
 
 			timer = container_of(node, struct hrtimer, node);
 
+			latency_hrtimer_timing_stop(timer, basenow);
+
 			/*
 			 * The immediate goal for using the softexpires is
 			 * minimizing wakeups, not running timers at the
-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project

* [RFC PATCH v7 5/5] tracing: wakeup latency events and histograms
From: Binoy Jayan @ 2016-09-20 13:51 UTC
  To: Steven Rostedt (Red Hat), Thomas Gleixner
  Cc: Ingo Molnar, Daniel Wagner, Arnd Bergmann, linux-kernel, Masami,
	Mark Brown, Binoy Jayan, Carsten Emde

These latencies usually occur during the wakeup of a process. To
determine this latency, the kernel stores the time stamp when a process
is scheduled to be woken up, and determines the duration of the wakeup
time shortly before control is passed over to this process. Note that the
apparent latency in user space may be somewhat longer, since the process
may be interrupted after control is passed over to it but before execution
in user space takes place. Simply measuring the interval between enqueuing
and wakeup may also not be appropriate in cases where a process is
scheduled as a result of a timer expiration. The timer may have missed its
deadline, e.g. due to disabled interrupts, but this latency would not be
registered. Therefore, the offsets of missed hrtimers are recorded in the
same histogram. The missed hrtimer offsets and the wakeup latency together
contribute to the total latency. With the histogram triggers in place, the
plots may be generated with a per-cpu breakdown of the captured events,
the latency being measured in nanoseconds.

The following histogram triggers may be used:

'hist:key=cpu,ccomm:val=wakeup_lat,total_lat:sort=cpu'
'hist:key=cpu,ccomm:val=timeroffset,total_lat:sort=cpu'
'hist:key=cpu,ccomm:val=total_lat:sort=cpu'
'hist:key=ccomm:val=total_lat if cpu==0'

Enable the tracer 'wakeup' or 'wakeup_rt' to capture wakeup latencies of
the respective processes.

In '/sys/kernel/debug/tracing'

echo wakeup > current_tracer
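
A complete session might then look like this (the event lives under the
'sched' group, as defined in this patch, using the third trigger from the
list above):

cd /sys/kernel/debug/tracing
echo wakeup > current_tracer
echo 'hist:key=cpu,ccomm:val=total_lat:sort=cpu' > \
        events/sched/latency_wakeup/trigger
cat events/sched/latency_wakeup/hist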

[
Initial work and idea by Carsten
Link: https://git.kernel.org/cgit/linux/kernel/git/rt/linux-stable-rt.git/commit/?h=v3.14-rt-rebase&id=56d50cc34943bbba12b8c5942ee1ae3b29f73acb
]

Cc: Carsten Emde <C.Emde@osadl.org>
Signed-off-by: Binoy Jayan <binoy.jayan@linaro.org>
---
 include/linux/sched.h             |  3 +++
 include/trace/events/sched.h      | 34 ++++++++++++++++++++++++++++++++++
 kernel/trace/Kconfig              | 10 ++++++++++
 kernel/trace/trace_sched_wakeup.c | 35 ++++++++++++++++++++++++++++++++---
 4 files changed, 79 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7bf67f8..82f3b62 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1894,6 +1894,9 @@ struct task_struct {
 #ifdef CONFIG_TRACE_DELAYED_TIMER_OFFSETS
 	long timer_offset;
 #endif /* CONFIG_TRACE_DELAYED_TIMER_OFFSETS */
+#ifdef CONFIG_TRACE_EVENTS_WAKEUP_LATENCY
+	u64 wakeup_timestamp_start;
+#endif
 #ifdef CONFIG_KCOV
 	/* Coverage collection mode enabled for this task (0 if disabled). */
 	enum kcov_mode kcov_mode;
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 9b90c57..c8b81d0 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -562,6 +562,40 @@ TRACE_EVENT(sched_wake_idle_without_ipi,
 
 	TP_printk("cpu=%d", __entry->cpu)
 );
+
+#ifdef CONFIG_TRACE_EVENTS_WAKEUP_LATENCY
+/**
+ * latency_wakeup - Called when process is woken up
+ * @next:	task to be woken up
+ * @wakeup_lat:	process wakeup latency in nano seconds
+ */
+TRACE_EVENT(latency_wakeup,
+
+	TP_PROTO(struct task_struct *next, u64 wakeup_latency),
+	TP_ARGS(next, wakeup_latency),
+
+	TP_STRUCT__entry(
+		__array(char,		ccomm,  TASK_COMM_LEN)
+		__field(int,		cprio)
+		__field(unsigned long,	wakeup_lat)
+		__field(unsigned long,	timeroffset)
+		__field(unsigned long,	total_lat)
+	),
+
+	TP_fast_assign(
+		memcpy(__entry->ccomm, next->comm, TASK_COMM_LEN);
+		__entry->cprio  = next->prio;
+		__entry->wakeup_lat = wakeup_latency;
+		__entry->timeroffset = next->timer_offset;
+		__entry->total_lat = wakeup_latency + next->timer_offset;
+	),
+
+	TP_printk("curr=%s[%d] wakeup_lat=%lu timeroffset=%ld total_lat=%lu",
+		__entry->ccomm, __entry->cprio, __entry->wakeup_lat,
+		__entry->timeroffset, __entry->total_lat)
+);
+#endif
+
 #endif /* _TRACE_SCHED_H */
 
 /* This part must be outside protection */
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index f4b86e8..20cf135 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -634,6 +634,16 @@ config RING_BUFFER_STARTUP_TEST
 
 	 If unsure, say N
 
+config TRACE_EVENTS_WAKEUP_LATENCY
+       bool "Trace wakeup latency events"
+       depends on TRACE_DELAYED_TIMER_OFFSETS
+       depends on SCHED_TRACER
+       help
+        Generate the total wakeup latency of a process. It includes the
+        wakeup latency and the timer offset latency. Wakeup latency is the
+        difference in the time when a process is scheduled to be woken up
+        and when it is actually woken up. It depends on the wakeup tracer.
+
 config TRACE_ENUM_MAP_FILE
        bool "Show enum mappings for trace events"
        depends on TRACING
diff --git a/kernel/trace/trace_sched_wakeup.c b/kernel/trace/trace_sched_wakeup.c
index 9d4399b..ec9e5a8 100644
--- a/kernel/trace/trace_sched_wakeup.c
+++ b/kernel/trace/trace_sched_wakeup.c
@@ -423,6 +423,29 @@ tracing_sched_wakeup_trace(struct trace_array *tr,
 		trace_buffer_unlock_commit(tr, buffer, event, flags, pc);
 }
 
+#ifdef CONFIG_TRACE_EVENTS_WAKEUP_LATENCY
+static inline void latency_wakeup_start(struct task_struct *task, u64 ts)
+{
+	task->wakeup_timestamp_start = ts;
+}
+
+static inline void latency_wakeup_stop(struct task_struct *next, u64 ts)
+{
+	if (next->wakeup_timestamp_start != 0) {
+		trace_latency_wakeup(next, ts - next->wakeup_timestamp_start);
+		next->wakeup_timestamp_start = 0;
+	}
+}
+#else
+static inline void latency_wakeup_start(struct task_struct *task, u64 ts)
+{
+}
+
+static inline void latency_wakeup_stop(struct task_struct *next, u64 ts)
+{
+}
+#endif
+
 static void notrace
 probe_wakeup_sched_switch(void *ignore, bool preempt,
 			  struct task_struct *prev, struct task_struct *next)
@@ -436,6 +459,10 @@ probe_wakeup_sched_switch(void *ignore, bool preempt,
 
 	tracing_record_cmdline(prev);
 
+	cpu = raw_smp_processor_id();
+	T1 = ftrace_now(cpu);
+	latency_wakeup_stop(next, (u64) T1);
+
 	if (unlikely(!tracer_enabled))
 		return;
 
@@ -454,7 +481,6 @@ probe_wakeup_sched_switch(void *ignore, bool preempt,
 	pc = preempt_count();
 
 	/* disable local data, not wakeup_cpu data */
-	cpu = raw_smp_processor_id();
 	disabled = atomic_inc_return(&per_cpu_ptr(wakeup_trace->trace_buffer.data, cpu)->disabled);
 	if (likely(disabled != 1))
 		goto out;
@@ -473,7 +499,6 @@ probe_wakeup_sched_switch(void *ignore, bool preempt,
 	tracing_sched_switch_trace(wakeup_trace, prev, next, flags, pc);
 
 	T0 = data->preempt_timestamp;
-	T1 = ftrace_now(cpu);
 	delta = T1-T0;
 
 	if (!report_latency(wakeup_trace, delta))
@@ -525,6 +550,10 @@ probe_wakeup(void *ignore, struct task_struct *p)
 	unsigned long flags;
 	long disabled;
 	int pc;
+	cycle_t T0;
+
+	T0 = ftrace_now(cpu);
+	latency_wakeup_start(p, (u64) T0);
 
 	if (likely(!tracer_enabled))
 		return;
@@ -580,7 +609,7 @@ probe_wakeup(void *ignore, struct task_struct *p)
 	local_save_flags(flags);
 
 	data = per_cpu_ptr(wakeup_trace->trace_buffer.data, wakeup_cpu);
-	data->preempt_timestamp = ftrace_now(cpu);
+	data->preempt_timestamp = T0;
 	tracing_sched_wakeup_trace(wakeup_trace, p, current, flags, pc);
 
 	/*
-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project

* Re: [RFC PATCH v7 4/5] tracing: Measure delayed hrtimer offset latency
From: Thomas Gleixner @ 2016-09-20 14:19 UTC
  To: Binoy Jayan
  Cc: Steven Rostedt (Red Hat),
	Ingo Molnar, Daniel Wagner, Arnd Bergmann, linux-kernel, Masami,
	Mark Brown, Carsten Emde

On Tue, 20 Sep 2016, Binoy Jayan wrote:
> +#ifdef CONFIG_TRACE_DELAYED_TIMER_OFFSETS
> +static inline void latency_hrtimer_timing_start(struct hrtimer *timer,
> +					 struct hrtimer_clock_base *new_base,
> +					 ktime_t tim)
> +{
> +	ktime_t now = new_base->get_time();
> +
> +	if (ktime_to_ns(tim) < ktime_to_ns(now))
> +		timer->tim_expiry = now;
> +	else
> +		timer->tim_expiry = ktime_set(0, 0);

You still fail to explain why this get_time() magic is required.

This is executed unconditionally when the config switch is enabled and does
not depend on whether the trace functionality is enabled or not. So you are
imposing the extra get_time() call, which can be expensive depending on the
underlying hardware, on every hrtimer start invocation.

Tracing is supposed to have ZERO impact when it is not used and even when
it's in use then the impact should be kept as low as possible. The above
does none of that.

Neither did you provide a proper rationale for this infrastructure in the
changelog.

You can repost that over and over and it will not go anywhere if you don't
start to address the review comments I give you.

Thanks,

	tglx

* Re: [RFC PATCH v7 4/5] tracing: Measure delayed hrtimer offset latency
From: Binoy Jayan @ 2016-09-21  6:06 UTC
  To: Thomas Gleixner, Carsten Emde
  Cc: Steven Rostedt (Red Hat),
	Ingo Molnar, Daniel Wagner, Arnd Bergmann,
	Linux kernel mailing list, Masami, Mark Brown

On 20 September 2016 at 19:49, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Tue, 20 Sep 2016, Binoy Jayan wrote:
>> +#ifdef CONFIG_TRACE_DELAYED_TIMER_OFFSETS
>> +static inline void latency_hrtimer_timing_start(struct hrtimer *timer,
>> +                                      struct hrtimer_clock_base *new_base,
>> +                                      ktime_t tim)
>> +{
>> +     ktime_t now = new_base->get_time();
>> +
>> +     if (ktime_to_ns(tim) < ktime_to_ns(now))
>> +             timer->tim_expiry = now;
>> +     else
>> +             timer->tim_expiry = ktime_set(0, 0);
>
> You still fail to explain why this get_time() magic is required.
>
> This is executed unconditionally when the config switch is enabled and does
> not depend on whether the trace functionality is enabled or not. So you are
> imposing the extra get_time() call, which can be expensive depending on the
> underlying hardware, on every hrtimer start invocation.
>
> Tracing is supposed to have ZERO impact when it is not used and even when
> it's in use then the impact should be kept as low as possible. The above
> does none of that.
>
> Neither did you provide a proper rationale for this infrastructure in the
> changelog.
>
> You can repost that over and over and it will not go anywhere if you don't
> start to address the review comments I give you.

Hi Thomas,

Sorry, I missed addressing this comment from you. From what I understand,
the get_time() is needed to get a more accurate current time when the
hrtimer base is changed (from the cpu on which the timer fired to the cpu
on which it is currently made to run or restarted), where the hrtimer base
needs to be switched to the new cpu, provided it is not running in pinned
mode.

Carsten,

Could you please comment on that?

Thanks,
Binoy

* Re: [RFC PATCH v7 4/5] tracing: Measure delayed hrtimer offset latency
From: Thomas Gleixner @ 2016-09-21 15:58 UTC
  To: Binoy Jayan
  Cc: Carsten Emde, Steven Rostedt (Red Hat),
	Ingo Molnar, Daniel Wagner, Arnd Bergmann,
	Linux kernel mailing list, Masami, Mark Brown

On Wed, 21 Sep 2016, Binoy Jayan wrote:
> On 20 September 2016 at 19:49, Thomas Gleixner <tglx@linutronix.de> wrote:
> > You still fail to explain why this get_time() magic is required.
> 
> Sorry, I missed addressing this comment from you. From what I understand,
> the get_time() is needed to get a more accurate current time when the
> hrtimer base is changed (from the cpu on which the timer fired to the cpu
> on which it is currently made to run or restarted), where the hrtimer base
> needs to be switched to the new cpu, provided it is not running in pinned
> mode.

Sorry. This has nothing to do with changing the hrtimer_base, simply
because the time base is the same on all cpus.

> Carsten, Could you please comment on that?

It's not Carsten's responsibility to explain the nature of the change.

You are submitting that code and so it's your job to provide proper
explanations and justifications. If you can't do that, then how do you
think that the review process, which is a feedback loop between the
reviewer and the submitter, should work?

Answer: It cannot work that way. I hope I don't have to explain why.

Thanks,

	tglx

* Re: [RFC PATCH v7 4/5] tracing: Measure delayed hrtimer offset latency
From: Binoy Jayan @ 2016-09-22  9:14 UTC
  To: Thomas Gleixner
  Cc: Carsten Emde, Steven Rostedt (Red Hat),
	Ingo Molnar, Daniel Wagner, Arnd Bergmann,
	Linux kernel mailing list, Masami, Mark Brown

Hi Thomas,

Thank you for the reply and sharing your insights.

On 21 September 2016 at 21:28, Thomas Gleixner <tglx@linutronix.de> wrote:

> Sorry. This has nothing to do with changing the hrtimer_base, simply
> because the time base is the same on all cpus.

The condition 'ktime_to_ns(tim) < ktime_to_ns(now)' checks if the timer
has already expired w.r.t. the 'soft timeout' value, as that value does
not include the slack 'delta_ns'. In that case 'tim_expiry' is normalized
to the current time. (I was under the impression that this inaccuracy
could be because the timer was initially running on a different cpu. If
that is not the case, I guess we can use the code mentioned below.)

Otherwise it postpones the decision of storing the expiry value
until the hrtimer interrupt. In this case, it calculates the latency
using the hard timeout (which includes the fuzz) as returned by a call
to 'hrtimer_get_expires' in 'latency_hrtimer_timing_stop'.

Since we use the hard timeout value (which includes the slack) for
calculating the latency, and since the actual timeout might have happened
between the soft and the hard timeout, the actual expiry time could be
less than the hard expiry time. This is why the latency is checked for a
negative value before being stored when the tracepoint is hit.

static inline void latency_hrtimer_timing_start(struct hrtimer *timer,
                                                ktime_t tim)
{
        timer->tim_expiry = tim;
}

static inline void latency_hrtimer_timing_stop(struct hrtimer *timer,
                                                ktime_t basenow)
{
        long latency;
        struct task_struct *task;

        latency = ktime_to_ns(basenow) - hrtimer_get_softexpires_tv64(timer);

        task = timer->function == hrtimer_wakeup ?
                        container_of(timer, struct hrtimer_sleeper,
                                     timer)->task : NULL;
        /* now the check for a negative latency may not be needed */
        if (task && latency > 0)
                task->timer_offset = latency;
}

I am using 'hrtimer_get_softexpires_tv64' instead of 'hrtimer_get_expires'
so that 'latency' is never negative. Please let me know if this looks ok.

> It's not Carsten's responsibility to explain the nature of the change.
> You are submitting that code and so it's your job to provide proper
> explanations and justifications. If you can't do that, then how do you
> think that the review process, which is a feedback loop between the
> reviewer and the submitter, should work?
>
> Answer: It cannot work that way. I hope I don't have to explain why.

Sure, I'll avoid this confusion in the future. I think I should have talked
to him only offline and not here.

Thanks,
Binoy

* Re: [RFC PATCH v7 4/5] tracing: Measure delayed hrtimer offset latency
From: Thomas Gleixner @ 2016-09-22 17:28 UTC
  To: Binoy Jayan
  Cc: Carsten Emde, Steven Rostedt (Red Hat),
	Ingo Molnar, Daniel Wagner, Arnd Bergmann,
	Linux kernel mailing list, Masami, Mark Brown

Binoy,

On Thu, 22 Sep 2016, Binoy Jayan wrote:
> The condition 'ktime_to_ns(tim) < ktime_to_ns(now)' checks if the timer
> has already expired w.r.t. the 'soft timeout' value, as that value does
> not include the slack 'delta_ns'. In that case 'tim_expiry' is normalized
> to the current time.

You are halfway to the point of why this is done, but your
conclusion is completely wrong.  The correct keyword is "expired", but
anything else is just random speculation.

> (I was under the impression that this inaccuracy could be because the
> timer was initially running on a different cpu. If that is not the
> case, I guess we can use the code mentioned below.)

We are not playing a guessing game here.
 
> I am using 'hrtimer_get_softexpires_tv64' instead of 'hrtimer_get_expires'
> so that 'latency' is never negative. Please let me know if this looks ok.

No it does not. And it won't look correct until you finally sit down and
decode and understand the functionality behind this code.

I'm not going to continue this, as it's not my job to explain you the code
which you are trying to submit.

I'm not blaming you, but I blame the responsible persons inside your
company who task you with that and expect that I'm going to do their work
of explaining to you how that code works.

@Arnd, @Mark: I'm starting to get seriously grumpy about that.

Thanks,

	tglx

* Re: [RFC PATCH v7 4/5] tracing: Measure delayed hrtimer offset latency
From: Mark Brown @ 2016-09-22 19:23 UTC
  To: Thomas Gleixner
  Cc: Binoy Jayan, Carsten Emde, Steven Rostedt (Red Hat),
	Ingo Molnar, Daniel Wagner, Arnd Bergmann,
	Linux kernel mailing list, Masami

On Thu, Sep 22, 2016 at 07:28:01PM +0200, Thomas Gleixner wrote:

> I'm not blaming you, but I blame the responsible persons inside your
> company who task you with that and expect that I'm going to do their work
> of explaining to you how that code works.

> @Arnd, @Mark: I'm starting to get seriously grumpy about that.

Really sorry about that, obviously that's not the intention.  We'll try
to avoid this happening in future.
