All of lore.kernel.org
* [PATCH v6 0/4] *** Latency histograms - IRQSOFF,PREEMPTOFF, Delayed HRTIMERS ***
@ 2016-09-07 11:13 Binoy Jayan
  2016-09-07 11:13 ` [PATCH v6 1/4] tracing: Dereference pointers without RCU checks Binoy Jayan
                   ` (3 more replies)
  0 siblings, 4 replies; 21+ messages in thread
From: Binoy Jayan @ 2016-09-07 11:13 UTC (permalink / raw)
  To: Steven Rostedt (Red Hat)
  Cc: Ingo Molnar, Daniel Wagner, Arnd Bergmann, linux-kernel, Masami,
	Thomas Gleixner, Binoy Jayan

Hi,

Thank you Daniel and Steven for reviewing the code and for the comments.
This set of patches [v6] captures latency events caused by interrupts and
preemption being disabled in the kernel. The patches are based on the hist
trigger feature developed by Tom Zanussi.

v5: https://lkml.org/lkml/2016/9/2/246
v4: https://lkml.org/lkml/2016/8/30/188
v3: https://lkml.org/lkml/2016/8/29/50
v2: https://lkml.org/lkml/2016/8/24/296

v4 -> v5:
  - Add hist trigger support for generic fields
  - hrtimer latency event moved to hrtimer event headers
  - Cleanup

-Binoy

Binoy Jayan (2):
  tracing: Add trace_irqsoff tracepoints
  tracing: Histogram for delayed hrtimer offsets

Daniel Wagner (2):
  tracing: Dereference pointers without RCU checks
  tracing: Add hist trigger support for generic fields

 include/linux/hrtimer.h             |  3 ++
 include/linux/rculist.h             | 36 ++++++++++++++++++++++++
 include/linux/tracepoint.h          |  4 +--
 include/trace/events/latency.h      | 56 +++++++++++++++++++++++++++++++++++++
 include/trace/events/timer.h        | 25 +++++++++++++++++
 kernel/time/Kconfig                 |  7 +++++
 kernel/time/hrtimer.c               | 52 ++++++++++++++++++++++++++++++++++
 kernel/trace/trace_events.c         | 13 +++++----
 kernel/trace/trace_events_filter.c  |  4 +--
 kernel/trace/trace_events_hist.c    | 36 ++++++++++++++++++++----
 kernel/trace/trace_events_trigger.c |  6 ++--
 kernel/trace/trace_irqsoff.c        | 35 +++++++++++++++++++++++
 12 files changed, 258 insertions(+), 19 deletions(-)
 create mode 100644 include/trace/events/latency.h

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v6 1/4] tracing: Dereference pointers without RCU checks
  2016-09-07 11:13 [PATCH v6 0/4] *** Latency histograms - IRQSOFF,PREEMPTOFF, Delayed HRTIMERS *** Binoy Jayan
@ 2016-09-07 11:13 ` Binoy Jayan
  2016-09-08  6:46   ` Masami Hiramatsu
  2016-09-07 11:13 ` [PATCH v6 2/4] tracing: Add hist trigger support for generic fields Binoy Jayan
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 21+ messages in thread
From: Binoy Jayan @ 2016-09-07 11:13 UTC (permalink / raw)
  To: Steven Rostedt (Red Hat)
  Cc: Ingo Molnar, Daniel Wagner, Arnd Bergmann, linux-kernel, Masami,
	Thomas Gleixner, Binoy Jayan

From: Daniel Wagner <daniel.wagner@bmw-carit.de>

Tracepoints can't be used in code sections where we are in the
middle of a state transition.

For example, if we place a tracepoint inside start/stop_critical_section(),
lockdep complains with:

[    0.035589] WARNING: CPU: 0 PID: 3 at kernel/locking/lockdep.c:3560 check_flags.part.36+0x1bc/0x210()
[    0.036000] DEBUG_LOCKS_WARN_ON(current->softirqs_enabled)
[    0.036000] Kernel panic - not syncing: panic_on_warn set ...
[    0.036000]
[    0.036000] CPU: 0 PID: 3 Comm: ksoftirqd/0 Not tainted 4.0.0-rc7+ #460
[    0.036000] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153950- 04/01/2014
[    0.036000]  ffffffff81f2463a ffff88007c93bb98 ffffffff81afb317 0000000000000001
[    0.036000]  ffffffff81f212b3 ffff88007c93bc18 ffffffff81af7bc2 ffff88007c93bbb8
[    0.036000]  ffffffff00000008 ffff88007c93bc28 ffff88007c93bbc8 000000000093bbd8
[    0.036000] Call Trace:
[    0.036000]  [<ffffffff81afb317>] dump_stack+0x4f/0x7b
[    0.036000]  [<ffffffff81af7bc2>] panic+0xc0/0x1e9
[    0.036000]  [<ffffffff81b07188>] ? _raw_spin_unlock_irqrestore+0x38/0x80
[    0.036000]  [<ffffffff81052a60>] warn_slowpath_common+0xc0/0xc0
[    0.036000]  [<ffffffff81057116>] ? __local_bh_enable+0x36/0x70
[    0.036000]  [<ffffffff81052aa6>] warn_slowpath_fmt+0x46/0x50
[    0.036000]  [<ffffffff810a468c>] check_flags.part.36+0x1bc/0x210
[    0.036000]  [<ffffffff810a5168>] lock_is_held+0x78/0x90
[    0.036000]  [<ffffffff81057116>] ? __local_bh_enable+0x36/0x70
[    0.036000]  [<ffffffff810588ab>] ? __do_softirq+0x3db/0x500
[    0.036000]  [<ffffffff81134495>] trace_preempt_on+0x255/0x260
[    0.036000]  [<ffffffff81081c0b>] preempt_count_sub+0xab/0xf0
[    0.036000]  [<ffffffff81057116>] __local_bh_enable+0x36/0x70
[    0.036000]  [<ffffffff810588ab>] __do_softirq+0x3db/0x500
[    0.036000]  [<ffffffff810589ef>] run_ksoftirqd+0x1f/0x60
[    0.036000]  [<ffffffff81079653>] smpboot_thread_fn+0x193/0x2a0
[    0.036000]  [<ffffffff810794c0>] ? SyS_setgroups+0x150/0x150
[    0.036000]  [<ffffffff81075312>] kthread+0xf2/0x110
[    0.036000]  [<ffffffff81b026b3>] ? wait_for_completion+0xc3/0x120
[    0.036000]  [<ffffffff81081c0b>] ? preempt_count_sub+0xab/0xf0
[    0.036000]  [<ffffffff81075220>] ? kthread_create_on_node+0x240/0x240
[    0.036000]  [<ffffffff81b07cc8>] ret_from_fork+0x58/0x90
[    0.036000]  [<ffffffff81075220>] ? kthread_create_on_node+0x240/0x240
[    0.036000] ---[ end Kernel panic - not syncing: panic_on_warn set ...

PeterZ was so kind as to explain to me what is happening:

"__local_bh_enable() tests if this is the last SOFTIRQ_OFFSET, if so it
tells lockdep softirqs are enabled with trace_softirqs_on() after that
we go an actually modify the preempt_count with preempt_count_sub().
Then in preempt_count_sub() you call into trace_preempt_on() if this
was the last preempt_count increment but you do that _before_ you
actually change the preempt_count with __preempt_count_sub() at this
point lockdep and preempt_count think the world differs and *boom*"

So the simplest way to avoid this is by disabling the consistency
checks.

We also need to take care of the iteration in trace_events_trigger.c
to avoid a splat in conjunction with the hist trigger.

Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Signed-off-by: Binoy Jayan <binoy.jayan@linaro.org>
---
 include/linux/rculist.h             | 36 ++++++++++++++++++++++++++++++++++++
 include/linux/tracepoint.h          |  4 ++--
 kernel/trace/trace_events_filter.c  |  4 ++--
 kernel/trace/trace_events_trigger.c |  6 +++---
 4 files changed, 43 insertions(+), 7 deletions(-)

diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index 8beb98d..bee836b 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -279,6 +279,24 @@ static inline void list_splice_tail_init_rcu(struct list_head *list,
 	container_of(lockless_dereference(ptr), type, member)
 
 /**
+ * list_entry_rcu_notrace - get the struct for this entry (for tracing)
+ * @ptr:        the &struct list_head pointer.
+ * @type:       the type of the struct this is embedded in.
+ * @member:     the name of the list_head within the struct.
+ *
+ * This primitive may safely run concurrently with the _rcu list-mutation
+ * primitives such as list_add_rcu() as long as it's guarded by rcu_read_lock().
+ *
+ * This is the same as list_entry_rcu() except that it does
+ * not do any RCU debugging or tracing.
+ */
+#define list_entry_rcu_notrace(ptr, type, member) \
+({ \
+	typeof(*ptr) __rcu *__ptr = (typeof(*ptr) __rcu __force *)ptr; \
+	container_of((typeof(ptr))rcu_dereference_raw_notrace(__ptr), type, member); \
+})
+
+/**
  * Where are list_empty_rcu() and list_first_entry_rcu()?
  *
  * Implementing those functions following their counterparts list_empty() and
@@ -391,6 +409,24 @@ static inline void list_splice_tail_init_rcu(struct list_head *list,
 	     pos = list_entry_lockless(pos->member.next, typeof(*pos), member))
 
 /**
+ * list_for_each_entry_rcu_notrace	-	iterate over rcu list of given type (for tracing)
+ * @pos:	the type * to use as a loop cursor.
+ * @head:	the head for your list.
+ * @member:	the name of the list_head within the struct.
+ *
+ * This list-traversal primitive may safely run concurrently with
+ * the _rcu list-mutation primitives such as list_add_rcu()
+ * as long as the traversal is guarded by rcu_read_lock().
+ *
+ * This is the same as list_for_each_entry_rcu() except that it does
+ * not do any RCU debugging or tracing.
+ */
+#define list_for_each_entry_rcu_notrace(pos, head, member) \
+	for (pos = list_entry_rcu_notrace((head)->next, typeof(*pos), member); \
+		&pos->member != (head); \
+		pos = list_entry_rcu_notrace(pos->member.next, typeof(*pos), member))
+
+/**
  * list_for_each_entry_continue_rcu - continue iteration over list of given type
  * @pos:	the type * to use as a loop cursor.
  * @head:	the head for your list.
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index be586c6..d0e2a82 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -138,7 +138,7 @@ extern void syscall_unregfunc(void);
 			return;						\
 		prercu;							\
 		rcu_read_lock_sched_notrace();				\
-		it_func_ptr = rcu_dereference_sched((tp)->funcs);	\
+		it_func_ptr = rcu_dereference_raw_notrace((tp)->funcs);	\
 		if (it_func_ptr) {					\
 			do {						\
 				it_func = (it_func_ptr)->func;		\
@@ -189,7 +189,7 @@ extern void syscall_unregfunc(void);
 				TP_CONDITION(cond),,);			\
 		if (IS_ENABLED(CONFIG_LOCKDEP) && (cond)) {		\
 			rcu_read_lock_sched_notrace();			\
-			rcu_dereference_sched(__tracepoint_##name.funcs);\
+			rcu_dereference_raw_notrace(__tracepoint_##name.funcs);\
 			rcu_read_unlock_sched_notrace();		\
 		}							\
 	}								\
diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index 9daa9b3..a00a8d5 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -608,11 +608,11 @@ int filter_match_preds(struct event_filter *filter, void *rec)
 	/*
 	 * n_preds, root and filter->preds are protect with preemption disabled.
 	 */
-	root = rcu_dereference_sched(filter->root);
+	root = rcu_dereference_raw_notrace(filter->root);
 	if (!root)
 		return 1;
 
-	data.preds = preds = rcu_dereference_sched(filter->preds);
+	data.preds = preds = rcu_dereference_raw_notrace(filter->preds);
 	ret = walk_pred_tree(preds, root, filter_match_preds_cb, &data);
 	WARN_ON(ret);
 	return data.match;
diff --git a/kernel/trace/trace_events_trigger.c b/kernel/trace/trace_events_trigger.c
index a975571..4489fb4 100644
--- a/kernel/trace/trace_events_trigger.c
+++ b/kernel/trace/trace_events_trigger.c
@@ -71,14 +71,14 @@ event_triggers_call(struct trace_event_file *file, void *rec)
 	if (list_empty(&file->triggers))
 		return tt;
 
-	list_for_each_entry_rcu(data, &file->triggers, list) {
+	list_for_each_entry_rcu_notrace(data, &file->triggers, list) {
 		if (data->paused)
 			continue;
 		if (!rec) {
 			data->ops->func(data, rec);
 			continue;
 		}
-		filter = rcu_dereference_sched(data->filter);
+		filter = rcu_dereference_raw_notrace(data->filter);
 		if (filter && !filter_match_preds(filter, rec))
 			continue;
 		if (event_command_post_trigger(data->cmd_ops)) {
@@ -111,7 +111,7 @@ event_triggers_post_call(struct trace_event_file *file,
 {
 	struct event_trigger_data *data;
 
-	list_for_each_entry_rcu(data, &file->triggers, list) {
+	list_for_each_entry_rcu_notrace(data, &file->triggers, list) {
 		if (data->paused)
 			continue;
 		if (data->cmd_ops->trigger_type & tt)
-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project


* [PATCH v6 2/4] tracing: Add hist trigger support for generic fields
  2016-09-07 11:13 [PATCH v6 0/4] *** Latency histograms - IRQSOFF,PREEMPTOFF, Delayed HRTIMERS *** Binoy Jayan
  2016-09-07 11:13 ` [PATCH v6 1/4] tracing: Dereference pointers without RCU checks Binoy Jayan
@ 2016-09-07 11:13 ` Binoy Jayan
  2016-09-07 13:39   ` kbuild test robot
  2016-09-07 11:13 ` [PATCH v6 3/4] tracing: Add trace_irqsoff tracepoints Binoy Jayan
  2016-09-07 11:13 ` [PATCH v6 4/4] tracing: Histogram for delayed hrtimer offsets Binoy Jayan
  3 siblings, 1 reply; 21+ messages in thread
From: Binoy Jayan @ 2016-09-07 11:13 UTC (permalink / raw)
  To: Steven Rostedt (Red Hat)
  Cc: Ingo Molnar, Daniel Wagner, Arnd Bergmann, linux-kernel, Masami,
	Thomas Gleixner

From: Daniel Wagner <daniel.wagner@bmw-carit.de>

Whenever a trace is printed, the generic fields (CPU, COMM) are
reconstructed (see trace_print_context()). CPU is taken from the
trace_iterator and COMM is extracted from the savedcmd map (see
__trace_find_cmdline()).

We can't reconstruct this information for hist events. Therefore this
information needs to be stored when a new event is added to the hist
buffer.
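
A minimal user-space sketch of why the capture has to happen at event
time (the names are illustrative only; the pid->comm dict stands in for
the kernel's savedcmd map):

```python
# Hist entries must snapshot comm when the event fires: a later lookup
# through a pid->comm map may see a renamed task or a recycled pid.
events = []

def record_event(pid, saved_cmd):
    # snapshot at event time, as the hist buffer does for COMM
    events.append({'pid': pid, 'comm': saved_cmd[pid]})

saved_cmd = {42: 'worker'}
record_event(42, saved_cmd)
saved_cmd[42] = 'renamed'      # task execs or the pid slot is reused
print(events[0]['comm'])       # -> 'worker', not 'renamed'
```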

There is already support for extracting the COMM for the common_pid
field; for this, the tracing_map_ops infrastructure is used.
Unfortunately, we can't reuse it here because it extends an existing
hist_field, which means we would first need to add a hist_field before
being able to reuse tracing_map_ops.

Instead, it is quite easy to extend the current code to support these
two fields by adding hist_field_cpu() and hist_field_comm().

Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
---
 kernel/trace/trace_events.c      | 13 +++++++------
 kernel/trace/trace_events_hist.c | 36 ++++++++++++++++++++++++++++++------
 2 files changed, 37 insertions(+), 12 deletions(-)

diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index 03c0a48..ea8da30 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -150,9 +150,10 @@ int trace_define_field(struct trace_event_call *call, const char *type,
 }
 EXPORT_SYMBOL_GPL(trace_define_field);
 
-#define __generic_field(type, item, filter_type)			\
+#define __generic_field(type, item, filter_type, size)			\
 	ret = __trace_define_field(&ftrace_generic_fields, #type,	\
-				   #item, 0, 0, is_signed_type(type),	\
+				   #item, 0, size,			\
+				   is_signed_type(type),                \
 				   filter_type);			\
 	if (ret)							\
 		return ret;
@@ -170,10 +171,10 @@ static int trace_define_generic_fields(void)
 {
 	int ret;
 
-	__generic_field(int, CPU, FILTER_CPU);
-	__generic_field(int, cpu, FILTER_CPU);
-	__generic_field(char *, COMM, FILTER_COMM);
-	__generic_field(char *, comm, FILTER_COMM);
+	__generic_field(int, CPU, FILTER_CPU, sizeof(int));
+	__generic_field(int, cpu, FILTER_CPU, sizeof(int));
+	__generic_field(char *, COMM, FILTER_COMM, TASK_COMM_LEN + 1);
+	__generic_field(char *, comm, FILTER_COMM, TASK_COMM_LEN + 1);
 
 	return ret;
 }
diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c
index f3a960e..42a1e36 100644
--- a/kernel/trace/trace_events_hist.c
+++ b/kernel/trace/trace_events_hist.c
@@ -75,6 +75,16 @@ static u64 hist_field_log2(struct hist_field *hist_field, void *event)
 	return (u64) ilog2(roundup_pow_of_two(val));
 }
 
+static u64 hist_field_cpu(struct hist_field *hist_field, void *event)
+{
+	return (u64) smp_processor_id();
+}
+
+static u64 hist_field_comm(struct hist_field *hist_field, void *event)
+{
+	return (u64) current->comm;
+}
+
 #define DEFINE_HIST_FIELD_FN(type)					\
 static u64 hist_field_##type(struct hist_field *hist_field, void *event)\
 {									\
@@ -119,6 +129,8 @@ enum hist_field_flags {
 	HIST_FIELD_FL_SYSCALL		= 128,
 	HIST_FIELD_FL_STACKTRACE	= 256,
 	HIST_FIELD_FL_LOG2		= 512,
+	HIST_FIELD_FL_CPU		= 1024,
+	HIST_FIELD_FL_COMM		= 2048,
 };
 
 struct hist_trigger_attrs {
@@ -374,7 +386,13 @@ static struct hist_field *create_hist_field(struct ftrace_event_field *field,
 	if (WARN_ON_ONCE(!field))
 		goto out;
 
-	if (is_string_field(field)) {
+	if (field->filter_type == FILTER_CPU) {
+		flags |= HIST_FIELD_FL_CPU;
+		hist_field->fn = hist_field_cpu;
+	} else if (field->filter_type == FILTER_COMM) {
+		flags |= HIST_FIELD_FL_COMM;
+		hist_field->fn = hist_field_comm;
+	} else if (is_string_field(field)) {
 		flags |= HIST_FIELD_FL_STRING;
 
 		if (field->filter_type == FILTER_STATIC_STRING)
@@ -748,7 +766,8 @@ static int create_tracing_map_fields(struct hist_trigger_data *hist_data)
 
 			if (hist_field->flags & HIST_FIELD_FL_STACKTRACE)
 				cmp_fn = tracing_map_cmp_none;
-			else if (is_string_field(field))
+			else if (is_string_field(field) ||
+				 hist_field->flags & HIST_FIELD_FL_COMM)
 				cmp_fn = tracing_map_cmp_string;
 			else
 				cmp_fn = tracing_map_cmp_num(field->size,
@@ -856,11 +875,9 @@ static inline void add_to_key(char *compound_key, void *key,
 			      struct hist_field *key_field, void *rec)
 {
 	size_t size = key_field->size;
+	struct ftrace_event_field *field = key_field->field;
 
 	if (key_field->flags & HIST_FIELD_FL_STRING) {
-		struct ftrace_event_field *field;
-
-		field = key_field->field;
 		if (field->filter_type == FILTER_DYN_STRING)
 			size = *(u32 *)(rec + field->offset) >> 16;
 		else if (field->filter_type == FILTER_PTR_STRING)
@@ -871,6 +888,8 @@ static inline void add_to_key(char *compound_key, void *key,
 		/* ensure NULL-termination */
 		if (size > key_field->size - 1)
 			size = key_field->size - 1;
+	} else if (field->filter_type == FILTER_STATIC_STRING) {
+		size = field->size;
 	}
 
 	memcpy(compound_key + key_field->offset, key, size);
@@ -906,7 +925,8 @@ static void event_hist_trigger(struct event_trigger_data *data, void *rec)
 			key = entries;
 		} else {
 			field_contents = key_field->fn(key_field, rec);
-			if (key_field->flags & HIST_FIELD_FL_STRING) {
+			if (key_field->flags & (HIST_FIELD_FL_STRING |
+						HIST_FIELD_FL_COMM)) {
 				key = (void *)(unsigned long)field_contents;
 				use_compound_key = true;
 			} else
@@ -1004,6 +1024,10 @@ hist_trigger_entry_print(struct seq_file *m,
 		} else if (key_field->flags & HIST_FIELD_FL_STRING) {
 			seq_printf(m, "%s: %-50s", key_field->field->name,
 				   (char *)(key + key_field->offset));
+		} else if (key_field->flags & HIST_FIELD_FL_COMM) {
+			seq_printf(m, "%s: %-16s",
+				   key_field->field->name,
+				   (char *)(key + key_field->offset));
 		} else {
 			uval = *(u64 *)(key + key_field->offset);
 			seq_printf(m, "%s: %10llu", key_field->field->name,
-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project


* [PATCH v6 3/4] tracing: Add trace_irqsoff tracepoints
  2016-09-07 11:13 [PATCH v6 0/4] *** Latency histograms - IRQSOFF,PREEMPTOFF, Delayed HRTIMERS *** Binoy Jayan
  2016-09-07 11:13 ` [PATCH v6 1/4] tracing: Dereference pointers without RCU checks Binoy Jayan
  2016-09-07 11:13 ` [PATCH v6 2/4] tracing: Add hist trigger support for generic fields Binoy Jayan
@ 2016-09-07 11:13 ` Binoy Jayan
  2016-09-07 16:42   ` Thomas Gleixner
  2016-09-08  8:06   ` Thomas Gleixner
  2016-09-07 11:13 ` [PATCH v6 4/4] tracing: Histogram for delayed hrtimer offsets Binoy Jayan
  3 siblings, 2 replies; 21+ messages in thread
From: Binoy Jayan @ 2016-09-07 11:13 UTC (permalink / raw)
  To: Steven Rostedt (Red Hat)
  Cc: Ingo Molnar, Daniel Wagner, Arnd Bergmann, linux-kernel, Masami,
	Thomas Gleixner, Binoy Jayan

This work is based on work by Daniel Wagner. A few tracepoints are added
at the end of the critical sections. With the hist trigger in place,
histogram plots may be generated with a per-cpu breakdown of the captured
events. It is based on the Linux kernel's event infrastructure.

The following filters may be used:

'hist:key=latency.log2:val=hitcount:sort=latency'
'hist:key=ltype,latency:val=hitcount:sort=latency if cpu==1'
'hist:key=ltype:val=latency:sort=ltype if ltype==0 && cpu==2'

where ltype is:
0: IRQSOFF latency
1: PREEMPTOFF latency
2: Critical timings

This captures only the latencies introduced by disabled irqs and
preemption. Additional per process data has to be captured to calculate
the effective latencies introduced for individual processes.
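
As a note on the filter examples above: the .log2 modifier buckets each
latency value as ilog2(roundup_pow_of_two(val)) (see hist_field_log2()
in trace_events_hist.c). A user-space sketch of that bucketing, with
illustrative helper names:

```python
def roundup_pow_of_two(val):
    # smallest power of two >= val (mirrors the kernel helper)
    return 1 << (val - 1).bit_length()

def log2_bucket(val):
    # ilog2(roundup_pow_of_two(val)), i.e. the 'latency.log2' hist key
    return roundup_pow_of_two(val).bit_length() - 1

# latencies 5..8 land in bucket 3, 9..16 in bucket 4, and so on
print([log2_bucket(v) for v in (1, 5, 8, 9, 16, 17)])  # -> [0, 3, 3, 4, 4, 5]
```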

Initial work - latency.patch

[1] https://git.kernel.org/cgit/linux/kernel/git/rt/linux-stable-rt.git/commit/?h=v3.14-rt-rebase&id=56d50cc34943bbba12b8c5942ee1ae3b29f73acb

Signed-off-by: Binoy Jayan <binoy.jayan@linaro.org>
---
 include/trace/events/latency.h | 56 ++++++++++++++++++++++++++++++++++++++++++
 kernel/trace/trace_irqsoff.c   | 35 ++++++++++++++++++++++++++
 2 files changed, 91 insertions(+)
 create mode 100644 include/trace/events/latency.h

diff --git a/include/trace/events/latency.h b/include/trace/events/latency.h
new file mode 100644
index 0000000..96c8757
--- /dev/null
+++ b/include/trace/events/latency.h
@@ -0,0 +1,56 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM latency
+
+#if !defined(_TRACE_HIST_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_HIST_H
+
+#include <linux/tracepoint.h>
+
+#ifndef __TRACE_LATENCY_TYPE
+#define __TRACE_LATENCY_TYPE
+
+enum latency_type {
+	LT_IRQ,
+	LT_PREEMPT,
+	LT_CRITTIME,
+	LT_MAX
+};
+
+TRACE_DEFINE_ENUM(LT_IRQ);
+TRACE_DEFINE_ENUM(LT_PREEMPT);
+TRACE_DEFINE_ENUM(LT_CRITTIME);
+
+#define show_ltype(type)			\
+	__print_symbolic(type,			\
+		{ LT_IRQ,	"IRQ" },	\
+		{ LT_PREEMPT,	"PREEMPT" },	\
+		{ LT_CRITTIME,	"CRIT_TIME" })
+#endif
+
+DECLARE_EVENT_CLASS(latency_template,
+	TP_PROTO(int ltype, cycles_t latency),
+
+	TP_ARGS(ltype, latency),
+
+	TP_STRUCT__entry(
+		__field(int,		ltype)
+		__field(cycles_t,	latency)
+	),
+
+	TP_fast_assign(
+		__entry->ltype		= ltype;
+		__entry->latency	= latency;
+	),
+
+	TP_printk("ltype=%s(%d), latency=%lu", show_ltype(__entry->ltype),
+		  __entry->ltype, (unsigned long) __entry->latency)
+);
+
+DEFINE_EVENT(latency_template, latency_preempt,
+	    TP_PROTO(int ltype, cycles_t latency),
+	    TP_ARGS(ltype, latency));
+
+#endif /* _TRACE_HIST_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/kernel/trace/trace_irqsoff.c b/kernel/trace/trace_irqsoff.c
index 03cdff8..60ee660 100644
--- a/kernel/trace/trace_irqsoff.c
+++ b/kernel/trace/trace_irqsoff.c
@@ -13,13 +13,19 @@
 #include <linux/uaccess.h>
 #include <linux/module.h>
 #include <linux/ftrace.h>
+#include <linux/percpu-defs.h>
+#include <trace/events/sched.h>
 
 #include "trace.h"
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/latency.h>
+
 static struct trace_array		*irqsoff_trace __read_mostly;
 static int				tracer_enabled __read_mostly;
 
 static DEFINE_PER_CPU(int, tracing_cpu);
+static DEFINE_PER_CPU(cycle_t, lat_ts[LT_MAX]);
 
 static DEFINE_RAW_SPINLOCK(max_trace_lock);
 
@@ -419,9 +425,23 @@ stop_critical_timing(unsigned long ip, unsigned long parent_ip)
 	atomic_dec(&data->disabled);
 }
 
+static inline void latency_preempt_timing_start(enum latency_type ltype)
+{
+	this_cpu_write(lat_ts[ltype], (cycle_t) trace_clock_local());
+}
+
+static inline void latency_preempt_timing_stop(enum latency_type type)
+{
+	trace_latency_preempt(type,
+		(cycle_t) trace_clock_local() - this_cpu_read(lat_ts[type]));
+}
+
 /* start and stop critical timings used to for stoppage (in idle) */
 void start_critical_timings(void)
 {
+	if (unlikely(trace_latency_preempt_enabled()))
+		latency_preempt_timing_start(LT_CRITTIME);
+
 	if (preempt_trace() || irq_trace())
 		start_critical_timing(CALLER_ADDR0, CALLER_ADDR1);
 }
@@ -431,6 +451,9 @@ void stop_critical_timings(void)
 {
 	if (preempt_trace() || irq_trace())
 		stop_critical_timing(CALLER_ADDR0, CALLER_ADDR1);
+
+	if (unlikely(trace_latency_preempt_enabled()))
+		latency_preempt_timing_stop(LT_CRITTIME);
 }
 EXPORT_SYMBOL_GPL(stop_critical_timings);
 
@@ -438,6 +461,9 @@ EXPORT_SYMBOL_GPL(stop_critical_timings);
 #ifdef CONFIG_PROVE_LOCKING
 void time_hardirqs_on(unsigned long a0, unsigned long a1)
 {
+	if (unlikely(trace_latency_preempt_enabled()))
+		latency_preempt_timing_stop(LT_IRQ);
+
 	if (!preempt_trace() && irq_trace())
 		stop_critical_timing(a0, a1);
 }
@@ -446,6 +472,9 @@ void time_hardirqs_off(unsigned long a0, unsigned long a1)
 {
 	if (!preempt_trace() && irq_trace())
 		start_critical_timing(a0, a1);
+
+	if (unlikely(trace_latency_preempt_enabled()))
+		latency_preempt_timing_start(LT_IRQ);
 }
 
 #else /* !CONFIG_PROVE_LOCKING */
@@ -503,6 +532,9 @@ EXPORT_SYMBOL(trace_hardirqs_off_caller);
 #ifdef CONFIG_PREEMPT_TRACER
 void trace_preempt_on(unsigned long a0, unsigned long a1)
 {
+	if (unlikely(trace_latency_preempt_enabled()))
+		latency_preempt_timing_stop(LT_PREEMPT);
+
 	if (preempt_trace() && !irq_trace())
 		stop_critical_timing(a0, a1);
 }
@@ -511,6 +543,9 @@ void trace_preempt_off(unsigned long a0, unsigned long a1)
 {
 	if (preempt_trace() && !irq_trace())
 		start_critical_timing(a0, a1);
+
+	if (unlikely(trace_latency_preempt_enabled()))
+		latency_preempt_timing_start(LT_PREEMPT);
 }
 #endif /* CONFIG_PREEMPT_TRACER */
 
-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project


* [PATCH v6 4/4] tracing: Histogram for delayed hrtimer offsets
  2016-09-07 11:13 [PATCH v6 0/4] *** Latency histograms - IRQSOFF,PREEMPTOFF, Delayed HRTIMERS *** Binoy Jayan
                   ` (2 preceding siblings ...)
  2016-09-07 11:13 ` [PATCH v6 3/4] tracing: Add trace_irqsoff tracepoints Binoy Jayan
@ 2016-09-07 11:13 ` Binoy Jayan
  2016-09-08  6:56   ` Masami Hiramatsu
  2016-09-08  7:10   ` Thomas Gleixner
  3 siblings, 2 replies; 21+ messages in thread
From: Binoy Jayan @ 2016-09-07 11:13 UTC (permalink / raw)
  To: Steven Rostedt (Red Hat)
  Cc: Ingo Molnar, Daniel Wagner, Arnd Bergmann, linux-kernel, Masami,
	Thomas Gleixner, Binoy Jayan

Generate a histogram of the latencies of delayed timer offsets in
nanoseconds. It shows the latency captured due to a delayed timer expiry
event, which happens for example when a timer misses its deadline due to
disabled interrupts. A process, if scheduled as a result of the timer
expiration, suffers this latency.

The following filters may be used:

'hist:key=common_pid.execname:val=toffset,hitcount'
'hist:key=cpu,tcomm:val=toffset:sort=tcomm'
'hist:key=comm,tcomm:sort=comm,tcomm'
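
The praecox bookkeeping introduced below in latency_hrtimer_timing_start()
and latency_hrtimer_timing_stop() can be sketched in user space as follows
(plain nanosecond integers stand in for ktime_t; the dict-based timer is
illustrative only):

```python
def timing_start(timer, now, expires):
    # on enqueue: if the requested expiry already lies in the past,
    # remember 'now' (timer->praecox) as the reference point
    timer['praecox'] = now if expires < now else 0
    timer['expires'] = expires

def timing_stop(timer, basenow):
    # on expiry: measure against praecox if set, otherwise against the
    # programmed expiry; the tracepoint fires only for a positive offset
    ref = timer['praecox'] or timer['expires']
    latency = basenow - ref
    return latency if latency > 0 else None

t = {}
timing_start(t, now=1000, expires=1500)  # armed for the future
print(timing_stop(t, basenow=1700))      # expired 200 ns late -> 200
```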

Signed-off-by: Binoy Jayan <binoy.jayan@linaro.org>
---
 include/linux/hrtimer.h      |  3 +++
 include/trace/events/timer.h | 25 +++++++++++++++++++++
 kernel/time/Kconfig          |  7 ++++++
 kernel/time/hrtimer.c        | 52 ++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 87 insertions(+)

diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index 5e00f80..9146842 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -104,6 +104,9 @@ struct hrtimer {
 	struct hrtimer_clock_base	*base;
 	u8				state;
 	u8				is_rel;
+#ifdef CONFIG_DELAYED_TIMER_OFFSETS_HIST
+	ktime_t				praecox;
+#endif
 #ifdef CONFIG_TIMER_STATS
 	int				start_pid;
 	void				*start_site;
diff --git a/include/trace/events/timer.h b/include/trace/events/timer.h
index 28c5da6..ee45aed 100644
--- a/include/trace/events/timer.h
+++ b/include/trace/events/timer.h
@@ -382,6 +382,31 @@ TRACE_EVENT(tick_stop,
 );
 #endif
 
+#ifdef CONFIG_DELAYED_TIMER_OFFSETS_HIST
+TRACE_EVENT(latency_hrtimer_interrupt,
+
+	TP_PROTO(long long toffset, struct task_struct *task),
+
+	TP_ARGS(toffset, task),
+
+	TP_STRUCT__entry(
+		__field(long long,	toffset)
+		__array(char,		tcomm,	TASK_COMM_LEN)
+		__field(int,		tprio)
+	),
+
+	TP_fast_assign(
+		__entry->toffset = toffset;
+		memcpy(__entry->tcomm, task != NULL ? task->comm : "<none>",
+			task != NULL ? TASK_COMM_LEN : 7);
+		__entry->tprio  = task != NULL ? task->prio : -1;
+	),
+
+	TP_printk("toffset=%lld thread=%s[%d]",
+		__entry->toffset, __entry->tcomm, __entry->tprio)
+);
+#endif
+
 #endif /*  _TRACE_TIMER_H */
 
 /* This part must be outside protection */
diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index 4008d9f..8ff19dd 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -193,5 +193,12 @@ config HIGH_RES_TIMERS
 	  hardware is not capable then this option only increases
 	  the size of the kernel image.
 
+config DELAYED_TIMER_OFFSETS_HIST
+	depends on HIGH_RES_TIMERS
+	select GENERIC_TRACER
+	bool "Delayed Timer Offsets Histogram"
+	help
+	  Generate a histogram of delayed timer offsets in nanoseconds.
+
 endmenu
 endif
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 9ba7c82..432d49a 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -56,6 +56,8 @@
 
 #include "tick-internal.h"
 
+static enum hrtimer_restart hrtimer_wakeup(struct hrtimer *timer);
+
 /*
  * The timer bases:
  *
@@ -960,6 +962,52 @@ static inline ktime_t hrtimer_update_lowres(struct hrtimer *timer, ktime_t tim,
 	return tim;
 }
 
+#ifdef CONFIG_DELAYED_TIMER_OFFSETS_HIST
+static inline void latency_hrtimer_timing_start(struct hrtimer *timer,
+					 struct hrtimer_clock_base *new_base,
+					 ktime_t tim)
+{
+	if (unlikely(trace_latency_hrtimer_interrupt_enabled())) {
+		ktime_t now = new_base->get_time();
+
+		if (ktime_to_ns(tim) < ktime_to_ns(now))
+			timer->praecox = now;
+		else
+			timer->praecox = ktime_set(0, 0);
+	}
+}
+
+static inline void latency_hrtimer_timing_stop(struct hrtimer *timer,
+						ktime_t basenow)
+{
+	long latency;
+	struct task_struct *task;
+
+	if (likely(!trace_latency_hrtimer_interrupt_enabled()))
+		return;
+
+	latency = ktime_to_ns(ktime_sub(basenow,
+			      ktime_to_ns(timer->praecox) ?
+			      timer->praecox : hrtimer_get_expires(timer)));
+	task = timer->function == hrtimer_wakeup ?
+			container_of(timer, struct hrtimer_sleeper,
+				     timer)->task : NULL;
+	if (latency > 0)
+		trace_latency_hrtimer_interrupt((u64) latency, task);
+}
+#else
+static inline void latency_hrtimer_timing_start(struct hrtimer *timer,
+					 struct hrtimer_clock_base *new_base,
+					 ktime_t tim)
+{
+}
+static inline void latency_hrtimer_timing_stop(struct hrtimer *timer,
+						ktime_t basenow)
+{
+}
+
+#endif
+
 /**
  * hrtimer_start_range_ns - (re)start an hrtimer on the current CPU
  * @timer:	the timer to be added
@@ -992,6 +1040,8 @@ void hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
 
 	timer_stats_hrtimer_set_start_info(timer);
 
+	latency_hrtimer_timing_start(timer, new_base, tim);
+
 	leftmost = enqueue_hrtimer(timer, new_base);
 	if (!leftmost)
 		goto unlock;
@@ -1284,6 +1334,8 @@ static void __hrtimer_run_queues(struct hrtimer_cpu_base *cpu_base, ktime_t now)
 
 			timer = container_of(node, struct hrtimer, node);
 
+			latency_hrtimer_timing_stop(timer, basenow);
+
 			/*
 			 * The immediate goal for using the softexpires is
 			 * minimizing wakeups, not running timers at the
-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project


* Re: [PATCH v6 2/4] tracing: Add hist trigger support for generic fields
  2016-09-07 11:13 ` [PATCH v6 2/4] tracing: Add hist trigger support for generic fields Binoy Jayan
@ 2016-09-07 13:39   ` kbuild test robot
  0 siblings, 0 replies; 21+ messages in thread
From: kbuild test robot @ 2016-09-07 13:39 UTC (permalink / raw)
  To: Binoy Jayan
  Cc: kbuild-all, Steven Rostedt (Red Hat),
	Ingo Molnar, Daniel Wagner, Arnd Bergmann, linux-kernel, Masami,
	Thomas Gleixner

[-- Attachment #1: Type: text/plain, Size: 2057 bytes --]

Hi Daniel,

[auto build test WARNING on tip/perf/core]
[also build test WARNING on v4.8-rc5 next-20160907]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
[Suggest to use git(>=2.9.0) format-patch --base=<commit> (or --base=auto for convenience) to record what (public, well-known) commit your patch series was built on]
[Check https://git-scm.com/docs/git-format-patch for more information]

url:    https://github.com/0day-ci/linux/commits/Binoy-Jayan/tracing-Deference-pointers-without-RCU-checks/20160907-205003
config: i386-randconfig-x003-201636 (attached as .config)
compiler: gcc-6 (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All warnings (new ones prefixed by >>):

   kernel/trace/trace_events_hist.c: In function 'hist_field_comm':
>> kernel/trace/trace_events_hist.c:85:9: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
     return (u64) current->comm;
            ^

vim +85 kernel/trace/trace_events_hist.c

    69	}
    70	
    71	static u64 hist_field_log2(struct hist_field *hist_field, void *event)
    72	{
    73		u64 val = *(u64 *)(event + hist_field->field->offset);
    74	
    75		return (u64) ilog2(roundup_pow_of_two(val));
    76	}
    77	
    78	static u64 hist_field_cpu(struct hist_field *hist_field, void *event)
    79	{
    80		return (u64) smp_processor_id();
    81	}
    82	
    83	static u64 hist_field_comm(struct hist_field *hist_field, void *event)
    84	{
  > 85		return (u64) current->comm;
    86	}
    87	
    88	#define DEFINE_HIST_FIELD_FN(type)					\
    89	static u64 hist_field_##type(struct hist_field *hist_field, void *event)\
    90	{									\
    91		type *addr = (type *)(event + hist_field->field->offset);	\
    92										\
    93		return (u64)(unsigned long)*addr;				\

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 21900 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v6 3/4] tracing: Add trace_irqsoff tracepoints
  2016-09-07 11:13 ` [PATCH v6 3/4] tracing: Add trace_irqsoff tracepoints Binoy Jayan
@ 2016-09-07 16:42   ` Thomas Gleixner
  2016-09-08  5:20     ` Daniel Wagner
  2016-09-08  8:06   ` Thomas Gleixner
  1 sibling, 1 reply; 21+ messages in thread
From: Thomas Gleixner @ 2016-09-07 16:42 UTC (permalink / raw)
  To: Binoy Jayan
  Cc: Steven Rostedt (Red Hat),
	Ingo Molnar, Daniel Wagner, Arnd Bergmann, linux-kernel, Masami



On Wed, 7 Sep 2016, Binoy Jayan wrote:

> This work is based on work by Daniel Wagner. A few tracepoints are added

Interesting. The patch you are referring to was written by Carsten
Emde. Can you please get these attributions right? They want to be
reflected in the Tags and if you did not rewrite the whole thing from
scratch then original authorship must be preserved.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v6 3/4] tracing: Add trace_irqsoff tracepoints
  2016-09-07 16:42   ` Thomas Gleixner
@ 2016-09-08  5:20     ` Daniel Wagner
  2016-09-08  6:31       ` Thomas Gleixner
  0 siblings, 1 reply; 21+ messages in thread
From: Daniel Wagner @ 2016-09-08  5:20 UTC (permalink / raw)
  To: Thomas Gleixner, Binoy Jayan
  Cc: Steven Rostedt (Red Hat),
	Ingo Molnar, Arnd Bergmann, linux-kernel, Masami

On 09/07/2016 06:42 PM, Thomas Gleixner wrote:
> On Wed, 7 Sep 2016, Binoy Jayan wrote:
>
>> This work is based on work by Daniel Wagner. A few tracepoints are added
>
> Interesting. The patch you are referring to was written by Carsten
> Emde. Can you please get these attributions right? They want to be
> reflected in the Tags and if you did not rewrite the whole thing from
> scratch then original authorship must be preserved.

This is probably my bad. I didn't point out to Binoy that the original 
idea and patches were done by Carsten. Binoy's work is based on my RFC 
patches from 2015. Thanks for pointing out. Let's add Carsten here as well.

Sorry about that.

cheers,
daniel

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v6 3/4] tracing: Add trace_irqsoff tracepoints
  2016-09-08  5:20     ` Daniel Wagner
@ 2016-09-08  6:31       ` Thomas Gleixner
  0 siblings, 0 replies; 21+ messages in thread
From: Thomas Gleixner @ 2016-09-08  6:31 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Binoy Jayan, Steven Rostedt (Red Hat),
	Ingo Molnar, Arnd Bergmann, linux-kernel, Masami

On Thu, 8 Sep 2016, Daniel Wagner wrote:
> On 09/07/2016 06:42 PM, Thomas Gleixner wrote:
> > On Wed, 7 Sep 2016, Binoy Jayan wrote:
> > 
> > > This work is based on work by Daniel Wagner. A few tracepoints are added
> > 
> > Interesting. The patch you are referring to was written by Carsten
> > Emde. Can you please get these attributions right? They want to be
> > reflected in the Tags and if you did not rewrite the whole thing from
> > scratch then original authorship must be preserved.
> 
> This is probably my bad. I didn't point out to Binoy that the original idea
> and patches were done by Carsten. Binoy's work is based on my RFC patches from
> 2015. Thanks for pointing out. Let's add Carsten here as well.

Ditto for the hrtimer stuff .....

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v6 1/4] tracing: Deference pointers without RCU checks
  2016-09-07 11:13 ` [PATCH v6 1/4] tracing: Deference pointers without RCU checks Binoy Jayan
@ 2016-09-08  6:46   ` Masami Hiramatsu
  0 siblings, 0 replies; 21+ messages in thread
From: Masami Hiramatsu @ 2016-09-08  6:46 UTC (permalink / raw)
  To: Binoy Jayan
  Cc: Steven Rostedt (Red Hat),
	Ingo Molnar, Daniel Wagner, Arnd Bergmann,
	Linux kernel mailing list, Thomas Gleixner

2016-09-07 20:13 GMT+09:00 Binoy Jayan <binoy.jayan@linaro.org>:
>
> From: Daniel Wagner <daniel.wagner@bmw-carit.de>
>
> The tracepoint can't be used in a code section where we are in the
> middle of a state transition.
>
> For example if we place a tracepoint inside start/stop_critical_section(),
> lockdep complains with
>
> [    0.035589] WARNING: CPU: 0 PID: 3 at kernel/locking/lockdep.c:3560 \
> check_flags.part.36+0x1bc/0x210() [    0.036000] \
> DEBUG_LOCKS_WARN_ON(current->softirqs_enabled) [    0.036000] Kernel panic - not \
> syncing: panic_on_warn set ... [    0.036000]
> [    0.036000] CPU: 0 PID: 3 Comm: ksoftirqd/0 Not tainted 4.0.0-rc7+ #460
> [    0.036000] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS \
> 1.7.5-20140709_153950- 04/01/2014 [    0.036000]  ffffffff81f2463a ffff88007c93bb98 \
> ffffffff81afb317 0000000000000001 [    0.036000]  ffffffff81f212b3 ffff88007c93bc18 \
> ffffffff81af7bc2 ffff88007c93bbb8 [    0.036000]  ffffffff00000008 ffff88007c93bc28 \
> ffff88007c93bbc8 000000000093bbd8 [    0.036000] Call Trace:
> [    0.036000]  [<ffffffff81afb317>] dump_stack+0x4f/0x7b
> [    0.036000]  [<ffffffff81af7bc2>] panic+0xc0/0x1e9
> [    0.036000]  [<ffffffff81b07188>] ? _raw_spin_unlock_irqrestore+0x38/0x80
> [    0.036000]  [<ffffffff81052a60>] warn_slowpath_common+0xc0/0xc0
> [    0.036000]  [<ffffffff81057116>] ? __local_bh_enable+0x36/0x70
> [    0.036000]  [<ffffffff81052aa6>] warn_slowpath_fmt+0x46/0x50
> [    0.036000]  [<ffffffff810a468c>] check_flags.part.36+0x1bc/0x210
> [    0.036000]  [<ffffffff810a5168>] lock_is_held+0x78/0x90
> [    0.036000]  [<ffffffff81057116>] ? __local_bh_enable+0x36/0x70
> [    0.036000]  [<ffffffff810588ab>] ? __do_softirq+0x3db/0x500
> [    0.036000]  [<ffffffff81134495>] trace_preempt_on+0x255/0x260
> [    0.036000]  [<ffffffff81081c0b>] preempt_count_sub+0xab/0xf0
> [    0.036000]  [<ffffffff81057116>] __local_bh_enable+0x36/0x70
> [    0.036000]  [<ffffffff810588ab>] __do_softirq+0x3db/0x500
> [    0.036000]  [<ffffffff810589ef>] run_ksoftirqd+0x1f/0x60
> [    0.036000]  [<ffffffff81079653>] smpboot_thread_fn+0x193/0x2a0
> [    0.036000]  [<ffffffff810794c0>] ? SyS_setgroups+0x150/0x150
> [    0.036000]  [<ffffffff81075312>] kthread+0xf2/0x110
> [    0.036000]  [<ffffffff81b026b3>] ? wait_for_completion+0xc3/0x120
> [    0.036000]  [<ffffffff81081c0b>] ? preempt_count_sub+0xab/0xf0
> [    0.036000]  [<ffffffff81075220>] ? kthread_create_on_node+0x240/0x240
> [    0.036000]  [<ffffffff81b07cc8>] ret_from_fork+0x58/0x90
> [    0.036000]  [<ffffffff81075220>] ? kthread_create_on_node+0x240/0x240
> [    0.036000] ---[ end Kernel panic - not syncing: panic_on_warn set ...
>
> PeterZ was so kind to explain it to me what is happening:
>
> "__local_bh_enable() tests if this is the last SOFTIRQ_OFFSET, if so it
> tells lockdep softirqs are enabled with trace_softirqs_on() after that
> we go an actually modify the preempt_count with preempt_count_sub().
> Then in preempt_count_sub() you call into trace_preempt_on() if this
> was the last preempt_count increment but you do that _before_ you
> actually change the preempt_count with __preempt_count_sub() at this
> point lockdep and preempt_count think the world differs and *boom*"
>
> So the simplest way to avoid this is by disabling the consistency
> checks.
>
> We also need to take care of the iterating in trace_events_trigger.c
> to avoid a splatter in conjunction with the hist trigger.
>

Looks good to me :)

Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>

Thanks,

> Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
> Signed-off-by: Binoy Jayan <binoy.jayan@linaro.org>
> ---
>  include/linux/rculist.h             | 36 ++++++++++++++++++++++++++++++++++++
>  include/linux/tracepoint.h          |  4 ++--
>  kernel/trace/trace_events_filter.c  |  4 ++--
>  kernel/trace/trace_events_trigger.c |  6 +++---
>  4 files changed, 43 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/rculist.h b/include/linux/rculist.h
> index 8beb98d..bee836b 100644
> --- a/include/linux/rculist.h
> +++ b/include/linux/rculist.h
> @@ -279,6 +279,24 @@ static inline void list_splice_tail_init_rcu(struct list_head *list,
>         container_of(lockless_dereference(ptr), type, member)
>
>  /**
> + * list_entry_rcu_notrace - get the struct for this entry (for tracing)
> + * @ptr:        the &struct list_head pointer.
> + * @type:       the type of the struct this is embedded in.
> + * @member:     the name of the list_head within the struct.
> + *
> + * This primitive may safely run concurrently with the _rcu list-mutation
> + * primitives such as list_add_rcu() as long as it's guarded by rcu_read_lock().
> + *
> + * This is the same as list_entry_rcu() except that it does
> + * not do any RCU debugging or tracing.
> + */
> +#define list_entry_rcu_notrace(ptr, type, member) \
> +({ \
> +       typeof(*ptr) __rcu *__ptr = (typeof(*ptr) __rcu __force *)ptr; \
> +       container_of((typeof(ptr))rcu_dereference_raw_notrace(__ptr), type, member); \
> +})
> +
> +/**
>   * Where are list_empty_rcu() and list_first_entry_rcu()?
>   *
>   * Implementing those functions following their counterparts list_empty() and
> @@ -391,6 +409,24 @@ static inline void list_splice_tail_init_rcu(struct list_head *list,
>              pos = list_entry_lockless(pos->member.next, typeof(*pos), member))
>
>  /**
> + * list_for_each_entry_rcu_notrace     -       iterate over rcu list of given type (for tracing)
> + * @pos:       the type * to use as a loop cursor.
> + * @head:      the head for your list.
> + * @member:    the name of the list_head within the struct.
> + *
> + * This list-traversal primitive may safely run concurrently with
> + * the _rcu list-mutation primitives such as list_add_rcu()
> + * as long as the traversal is guarded by rcu_read_lock().
> + *
> + * This is the same as list_for_each_entry_rcu() except that it does
> + * not do any RCU debugging or tracing.
> + */
> +#define list_for_each_entry_rcu_notrace(pos, head, member) \
> +       for (pos = list_entry_rcu_notrace((head)->next, typeof(*pos), member); \
> +               &pos->member != (head); \
> +               pos = list_entry_rcu_notrace(pos->member.next, typeof(*pos), member))
> +
> +/**
>   * list_for_each_entry_continue_rcu - continue iteration over list of given type
>   * @pos:       the type * to use as a loop cursor.
>   * @head:      the head for your list.
> diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
> index be586c6..d0e2a82 100644
> --- a/include/linux/tracepoint.h
> +++ b/include/linux/tracepoint.h
> @@ -138,7 +138,7 @@ extern void syscall_unregfunc(void);
>                         return;                                         \
>                 prercu;                                                 \
>                 rcu_read_lock_sched_notrace();                          \
> -               it_func_ptr = rcu_dereference_sched((tp)->funcs);       \
> +               it_func_ptr = rcu_dereference_raw_notrace((tp)->funcs); \
>                 if (it_func_ptr) {                                      \
>                         do {                                            \
>                                 it_func = (it_func_ptr)->func;          \
> @@ -189,7 +189,7 @@ extern void syscall_unregfunc(void);
>                                 TP_CONDITION(cond),,);                  \
>                 if (IS_ENABLED(CONFIG_LOCKDEP) && (cond)) {             \
>                         rcu_read_lock_sched_notrace();                  \
> -                       rcu_dereference_sched(__tracepoint_##name.funcs);\
> +                       rcu_dereference_raw_notrace(__tracepoint_##name.funcs);\
>                         rcu_read_unlock_sched_notrace();                \
>                 }                                                       \
>         }                                                               \
> diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
> index 9daa9b3..a00a8d5 100644
> --- a/kernel/trace/trace_events_filter.c
> +++ b/kernel/trace/trace_events_filter.c
> @@ -608,11 +608,11 @@ int filter_match_preds(struct event_filter *filter, void *rec)
>         /*
>          * n_preds, root and filter->preds are protect with preemption disabled.
>          */
> -       root = rcu_dereference_sched(filter->root);
> +       root = rcu_dereference_raw_notrace(filter->root);
>         if (!root)
>                 return 1;
>
> -       data.preds = preds = rcu_dereference_sched(filter->preds);
> +       data.preds = preds = rcu_dereference_raw_notrace(filter->preds);
>         ret = walk_pred_tree(preds, root, filter_match_preds_cb, &data);
>         WARN_ON(ret);
>         return data.match;
> diff --git a/kernel/trace/trace_events_trigger.c b/kernel/trace/trace_events_trigger.c
> index a975571..4489fb4 100644
> --- a/kernel/trace/trace_events_trigger.c
> +++ b/kernel/trace/trace_events_trigger.c
> @@ -71,14 +71,14 @@ event_triggers_call(struct trace_event_file *file, void *rec)
>         if (list_empty(&file->triggers))
>                 return tt;
>
> -       list_for_each_entry_rcu(data, &file->triggers, list) {
> +       list_for_each_entry_rcu_notrace(data, &file->triggers, list) {
>                 if (data->paused)
>                         continue;
>                 if (!rec) {
>                         data->ops->func(data, rec);
>                         continue;
>                 }
> -               filter = rcu_dereference_sched(data->filter);
> +               filter = rcu_dereference_raw_notrace(data->filter);
>                 if (filter && !filter_match_preds(filter, rec))
>                         continue;
>                 if (event_command_post_trigger(data->cmd_ops)) {
> @@ -111,7 +111,7 @@ event_triggers_post_call(struct trace_event_file *file,
>  {
>         struct event_trigger_data *data;
>
> -       list_for_each_entry_rcu(data, &file->triggers, list) {
> +       list_for_each_entry_rcu_notrace(data, &file->triggers, list) {
>                 if (data->paused)
>                         continue;
>                 if (data->cmd_ops->trigger_type & tt)
> --
> The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
> a Linux Foundation Collaborative Project
>



-- 
Masami Hiramatsu

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v6 4/4] tracing: Histogram for delayed hrtimer offsets
  2016-09-07 11:13 ` [PATCH v6 4/4] tracing: Histogram for delayed hrtimer offsets Binoy Jayan
@ 2016-09-08  6:56   ` Masami Hiramatsu
  2016-09-08  7:10   ` Thomas Gleixner
  1 sibling, 0 replies; 21+ messages in thread
From: Masami Hiramatsu @ 2016-09-08  6:56 UTC (permalink / raw)
  To: Binoy Jayan
  Cc: Steven Rostedt (Red Hat),
	Ingo Molnar, Daniel Wagner, Arnd Bergmann,
	Linux kernel mailing list, Thomas Gleixner

2016-09-07 20:13 GMT+09:00 Binoy Jayan <binoy.jayan@linaro.org>:
> Generate a histogram of the latencies of delayed timer offsets in
> nanoseconds. It shows the latency captured due to a delayed timer expiry
> event. It happens, for example, when a timer misses its deadline due to
> disabled interrupts. A process, if scheduled as a result of the timer
> expiration, suffers this latency.
>
> The following filter(s) may be used
>
> 'hist:key=common_pid.execname:val=toffset,hitcount'
> 'hist:key=cpu,tcomm:val=toffset:sort=tcomm'
> 'hist:key=comm,tcomm:sort=comm,tcomm'
>
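
For reference, such triggers are applied by writing to the event's trigger file in tracefs. The exact path below assumes the event lands under events/timer/ as in this series; a sketch, not verified against a built kernel:

```sh
# Path and event name are assumptions based on this patch series.
cd /sys/kernel/debug/tracing
echo 'hist:key=common_pid.execname:val=toffset' \
	> events/timer/latency_hrtimer_interrupt/trigger
# Read back the accumulated histogram:
cat events/timer/latency_hrtimer_interrupt/hist
```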

Ok, so this event is hit only when the hrtimer is delayed.

I just have a comment on the event name:

> +TRACE_EVENT(latency_hrtimer_interrupt,

Since this event will be hit only when the hrtimer is delayed,
would it be better to add "delayed" to the event name?
e.g. "latency_delayed_hrtimer"

Others are OK for me.

Thanks,

> Signed-off-by: Binoy Jayan <binoy.jayan@linaro.org>
> ---
>  include/linux/hrtimer.h      |  3 +++
>  include/trace/events/timer.h | 25 +++++++++++++++++++++
>  kernel/time/Kconfig          |  7 ++++++
>  kernel/time/hrtimer.c        | 52 ++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 87 insertions(+)
>
> diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
> index 5e00f80..9146842 100644
> --- a/include/linux/hrtimer.h
> +++ b/include/linux/hrtimer.h
> @@ -104,6 +104,9 @@ struct hrtimer {
>         struct hrtimer_clock_base       *base;
>         u8                              state;
>         u8                              is_rel;
> +#ifdef CONFIG_DELAYED_TIMER_OFFSETS_HIST
> +       ktime_t                         praecox;
> +#endif
>  #ifdef CONFIG_TIMER_STATS
>         int                             start_pid;
>         void                            *start_site;
> diff --git a/include/trace/events/timer.h b/include/trace/events/timer.h
> index 28c5da6..ee45aed 100644
> --- a/include/trace/events/timer.h
> +++ b/include/trace/events/timer.h
> @@ -382,6 +382,31 @@ TRACE_EVENT(tick_stop,
>  );
>  #endif
>
> +#ifdef CONFIG_DELAYED_TIMER_OFFSETS_HIST
> +TRACE_EVENT(latency_hrtimer_interrupt,
> +
> +       TP_PROTO(long long toffset, struct task_struct *task),
> +
> +       TP_ARGS(toffset, task),
> +
> +       TP_STRUCT__entry(
> +               __field(long long,      toffset)
> +               __array(char,           tcomm,  TASK_COMM_LEN)
> +               __field(int,            tprio)
> +       ),
> +
> +       TP_fast_assign(
> +               __entry->toffset = toffset;
> +               memcpy(__entry->tcomm, task != NULL ? task->comm : "<none>",
> +                       task != NULL ? TASK_COMM_LEN : 7);
> +               __entry->tprio  = task != NULL ? task->prio : -1;
> +       ),
> +
> +       TP_printk("toffset=%lld thread=%s[%d]",
> +               __entry->toffset, __entry->tcomm, __entry->tprio)
> +);
> +#endif
> +
>  #endif /*  _TRACE_TIMER_H */
>
>  /* This part must be outside protection */
> diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
> index 4008d9f..8ff19dd 100644
> --- a/kernel/time/Kconfig
> +++ b/kernel/time/Kconfig
> @@ -193,5 +193,12 @@ config HIGH_RES_TIMERS
>           hardware is not capable then this option only increases
>           the size of the kernel image.
>
> +config DELAYED_TIMER_OFFSETS_HIST
> +       depends on HIGH_RES_TIMERS
> +       select GENERIC_TRACER
> +       bool "Delayed Timer Offsets Histogram"
> +       help
> +         Generate a histogram of delayed timer offsets in nanoseconds.
> +
>  endmenu
>  endif
> diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
> index 9ba7c82..432d49a 100644
> --- a/kernel/time/hrtimer.c
> +++ b/kernel/time/hrtimer.c
> @@ -56,6 +56,8 @@
>
>  #include "tick-internal.h"
>
> +static enum hrtimer_restart hrtimer_wakeup(struct hrtimer *timer);
> +
>  /*
>   * The timer bases:
>   *
> @@ -960,6 +962,52 @@ static inline ktime_t hrtimer_update_lowres(struct hrtimer *timer, ktime_t tim,
>         return tim;
>  }
>
> +#ifdef CONFIG_DELAYED_TIMER_OFFSETS_HIST
> +static inline void latency_hrtimer_timing_start(struct hrtimer *timer,
> +                                        struct hrtimer_clock_base *new_base,
> +                                        ktime_t tim)
> +{
> +       if (unlikely(trace_latency_hrtimer_interrupt_enabled())) {
> +               ktime_t now = new_base->get_time();
> +
> +               if (ktime_to_ns(tim) < ktime_to_ns(now))
> +                       timer->praecox = now;
> +               else
> +                       timer->praecox = ktime_set(0, 0);
> +       }
> +}
> +
> +static inline void latency_hrtimer_timing_stop(struct hrtimer *timer,
> +                                               ktime_t basenow)
> +{
> +       long latency;
> +       struct task_struct *task;
> +
> +       if (likely(!trace_latency_hrtimer_interrupt_enabled()))
> +               return;
> +
> +       latency = ktime_to_ns(ktime_sub(basenow,
> +                             ktime_to_ns(timer->praecox) ?
> +                             timer->praecox : hrtimer_get_expires(timer)));
> +       task = timer->function == hrtimer_wakeup ?
> +                       container_of(timer, struct hrtimer_sleeper,
> +                                    timer)->task : NULL;
> +       if (latency > 0)
> +               trace_latency_hrtimer_interrupt((u64) latency, task);
> +}
> +#else
> +static inline void latency_hrtimer_timing_start(struct hrtimer *timer,
> +                                        struct hrtimer_clock_base *new_base,
> +                                        ktime_t tim)
> +{
> +}
> +static inline void latency_hrtimer_timing_stop(struct hrtimer *timer,
> +                                               ktime_t basenow)
> +{
> +}
> +
> +#endif
> +
>  /**
>   * hrtimer_start_range_ns - (re)start an hrtimer on the current CPU
>   * @timer:     the timer to be added
> @@ -992,6 +1040,8 @@ void hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
>
>         timer_stats_hrtimer_set_start_info(timer);
>
> +       latency_hrtimer_timing_start(timer, new_base, tim);
> +
>         leftmost = enqueue_hrtimer(timer, new_base);
>         if (!leftmost)
>                 goto unlock;
> @@ -1284,6 +1334,8 @@ static void __hrtimer_run_queues(struct hrtimer_cpu_base *cpu_base, ktime_t now)
>
>                         timer = container_of(node, struct hrtimer, node);
>
> +                       latency_hrtimer_timing_stop(timer, basenow);
> +
>                         /*
>                          * The immediate goal for using the softexpires is
>                          * minimizing wakeups, not running timers at the
> --
> The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
> a Linux Foundation Collaborative Project
>



-- 
Masami Hiramatsu

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v6 4/4] tracing: Histogram for delayed hrtimer offsets
  2016-09-07 11:13 ` [PATCH v6 4/4] tracing: Histogram for delayed hrtimer offsets Binoy Jayan
  2016-09-08  6:56   ` Masami Hiramatsu
@ 2016-09-08  7:10   ` Thomas Gleixner
  2016-09-08  9:39     ` Binoy Jayan
  1 sibling, 1 reply; 21+ messages in thread
From: Thomas Gleixner @ 2016-09-08  7:10 UTC (permalink / raw)
  To: Binoy Jayan
  Cc: Steven Rostedt (Red Hat),
	Ingo Molnar, Daniel Wagner, Arnd Bergmann, linux-kernel, Masami

On Wed, 7 Sep 2016, Binoy Jayan wrote:
> diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
> index 5e00f80..9146842 100644
> --- a/include/linux/hrtimer.h
> +++ b/include/linux/hrtimer.h
> @@ -104,6 +104,9 @@ struct hrtimer {
>  	struct hrtimer_clock_base	*base;
>  	u8				state;
>  	u8				is_rel;
> +#ifdef CONFIG_DELAYED_TIMER_OFFSETS_HIST
> +	ktime_t				praecox;
> +#endif

Throwing a new struct member at some random place optimizes the struct
footprint, right?

And of course documenting new struct members is optional, correct? I'm
really looking forward for the explanation of that variable name.

> +#ifdef CONFIG_DELAYED_TIMER_OFFSETS_HIST
> +TRACE_EVENT(latency_hrtimer_interrupt,
> +
> +       TP_PROTO(long long toffset, struct task_struct *task),
> +
> +       TP_ARGS(toffset, task),
> +
> +       TP_STRUCT__entry(
> +               __field(long long,      toffset)
> +               __array(char,           tcomm,  TASK_COMM_LEN)
> +               __field(int,            tprio)
> +       ),
> +
> +       TP_fast_assign(
> +               __entry->toffset = toffset;
> +               memcpy(__entry->tcomm, task != NULL ? task->comm : "<none>",
> +                       task != NULL ? TASK_COMM_LEN : 7);
> +               __entry->tprio  = task != NULL ? task->prio : -1;
> +       ),

What's the value of storing prio? None, if the task is using the DL
scheduler.

> +#ifdef CONFIG_DELAYED_TIMER_OFFSETS_HIST
> +static inline void latency_hrtimer_timing_start(struct hrtimer *timer,
> +					 struct hrtimer_clock_base *new_base,
> +					 ktime_t tim)
> +{
> +	if (unlikely(trace_latency_hrtimer_interrupt_enabled())) {
> +		ktime_t now = new_base->get_time();
> +
> +		if (ktime_to_ns(tim) < ktime_to_ns(now))
> +			timer->praecox = now;
> +		else
> +			timer->praecox = ktime_set(0, 0);

What's the whole point of this? You're adding an extra get_time() call into
that path. What for? Comments in the code are overrated, right?

> +	}
> +}
> +
> +static inline void latency_hrtimer_timing_stop(struct hrtimer *timer,
> +						ktime_t basenow)
> +{
> +	long latency;
> +	struct task_struct *task;
> +
> +	if (likely(!trace_latency_hrtimer_interrupt_enabled()))
> +		return;
> +
> +	latency = ktime_to_ns(ktime_sub(basenow,
> +			      ktime_to_ns(timer->praecox) ?
> +			      timer->praecox : hrtimer_get_expires(timer)));
> +	task = timer->function == hrtimer_wakeup ?
> +			container_of(timer, struct hrtimer_sleeper,
> +				     timer)->task : NULL;

This is a complete horrible hack. You're tying the task evaluation into a
single instance of hrtimer users. What's the justification for this and why
do you need task at all?

> +	if (latency > 0)
> +		trace_latency_hrtimer_interrupt((u64) latency, task);

And how should latency become < 0 ever? hrtimer interrupt guarantees to
never expire timers prematurely.

Neither the changelog nor the code contain any information about how that
thing is to be used and what the actual value of the stored information
is.

No way that this ad hoc statistics hackery which we carry in RT for a well
known reason is going to go upstream without proper justification and a
well thought out and documented functionality.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v6 3/4] tracing: Add trace_irqsoff tracepoints
  2016-09-07 11:13 ` [PATCH v6 3/4] tracing: Add trace_irqsoff tracepoints Binoy Jayan
  2016-09-07 16:42   ` Thomas Gleixner
@ 2016-09-08  8:06   ` Thomas Gleixner
  2016-09-12 13:50     ` Steven Rostedt
  2016-09-14  6:52     ` Binoy Jayan
  1 sibling, 2 replies; 21+ messages in thread
From: Thomas Gleixner @ 2016-09-08  8:06 UTC (permalink / raw)
  To: Binoy Jayan
  Cc: Steven Rostedt (Red Hat),
	Ingo Molnar, Daniel Wagner, Arnd Bergmann, linux-kernel, Masami

On Wed, 7 Sep 2016, Binoy Jayan wrote:
> This captures only the latencies introduced by disabled irqs and
> preemption. Additional per process data has to be captured to calculate
> the effective latencies introduced for individual processes.

And what is the additional per process data and how is it captured and
used?

>  static DEFINE_PER_CPU(int, tracing_cpu);
> +static DEFINE_PER_CPU(cycle_t, lat_ts[LT_MAX]);
>  
>  static DEFINE_RAW_SPINLOCK(max_trace_lock);
>  
> @@ -419,9 +425,23 @@ stop_critical_timing(unsigned long ip, unsigned long parent_ip)
>  	atomic_dec(&data->disabled);
>  }
>  
> +static inline void latency_preempt_timing_start(enum latency_type ltype)
> +{
> +	this_cpu_write(lat_ts[ltype], (cycle_t) trace_clock_local());

What is this silly type cast for? Why can't you just use u64 ?

> +}
> +
> +static inline void latency_preempt_timing_stop(enum latency_type type)
> +{
> +	trace_latency_preempt(type,
> +		(cycle_t) trace_clock_local() - this_cpu_read(lat_ts[type]));

And then of course you use a completely different data type in the trace
itself.
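
Stripped of the tracing machinery, the pattern under discussion — write a per-CPU start stamp, compute the delta at stop — looks like this in plain C, with u64 used consistently as suggested (names mirror the patch, but nothing here is kernel API):

```c
#include <stdint.h>

#define NR_CPUS 4

enum latency_type { LT_IRQ, LT_PREEMPT, LT_CRITTIME, LT_MAX };

/* per-CPU start timestamps, one slot per latency type */
static uint64_t lat_ts[NR_CPUS][LT_MAX];

/* record the start timestamp for this latency type on this CPU */
static void timing_start(int cpu, enum latency_type t, uint64_t now)
{
	lat_ts[cpu][t] = now;
}

/* return the time elapsed since the matching timing_start() */
static uint64_t timing_stop(int cpu, enum latency_type t, uint64_t now)
{
	return now - lat_ts[cpu][t];
}
```

In the kernel the stamps come from trace_clock_local() and the stop-side delta is what the trace event records; the point above is that the same u64 type should be used end to end.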

> +DECLARE_EVENT_CLASS(latency_template,
> +	TP_PROTO(int ltype, cycles_t latency),

Are you sure that you know what you are doing here? If yes, then please
explain it in form of comments so mere mortals can understand it as well.

>  /* start and stop critical timings used to for stoppage (in idle) */
>  void start_critical_timings(void)
>  {
> +	if (unlikely(trace_latency_preempt_enabled()))
> +		latency_preempt_timing_start(LT_CRITTIME);

I doubt that this conditional is less expensive than a simple
unconditional store to a per cpu variable.

> +
>  	if (preempt_trace() || irq_trace())
>  		start_critical_timing(CALLER_ADDR0, CALLER_ADDR1);
>  }
> @@ -431,6 +451,9 @@ void stop_critical_timings(void)
>  {
>  	if (preempt_trace() || irq_trace())
>  		stop_critical_timing(CALLER_ADDR0, CALLER_ADDR1);
> +
> +	if (unlikely(trace_latency_preempt_enabled()))
> +		latency_preempt_timing_stop(LT_CRITTIME);

And this is silly as well. You can put the whole evaluation into the trace
event assignement so the tracer core will handle that conditional.

Aside of that it is silly to evaluate trace_clock_local() for the actual
tracepoint simply because that time is already stored in the tracepoint
itself. The flow here is:

	event = trace_event_buffer_lock_reserve();
	entry = ring_buffer_event_data(event);
       	{ <assign>; }  <-- Here we assign the entries by the __field and
                           __array macros.


So you should talk to Steven about having a way to retrieve that time from
entry itself in a generic way.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v6 4/4] tracing: Histogram for delayed hrtimer offsets
  2016-09-08  7:10   ` Thomas Gleixner
@ 2016-09-08  9:39     ` Binoy Jayan
  2016-09-09 13:30       ` Thomas Gleixner
  2016-09-09 17:10       ` Mark Brown
  0 siblings, 2 replies; 21+ messages in thread
From: Binoy Jayan @ 2016-09-08  9:39 UTC (permalink / raw)
  To: Thomas Gleixner, Carsten Emde
  Cc: Steven Rostedt (Red Hat),
	Ingo Molnar, Daniel Wagner, Arnd Bergmann,
	Linux kernel mailing list, Masami

[Adding Carsten to Cc]

Thank you Thomas for reviewing this and providing insights.

On 8 September 2016 at 12:40, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Wed, 7 Sep 2016, Binoy Jayan wrote:
>> diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
>> index 5e00f80..9146842 100644
>> --- a/include/linux/hrtimer.h
>> +++ b/include/linux/hrtimer.h
>> @@ -104,6 +104,9 @@ struct hrtimer {
>>       struct hrtimer_clock_base       *base;
>>       u8                              state;
>>       u8                              is_rel;
>> +#ifdef CONFIG_DELAYED_TIMER_OFFSETS_HIST
>> +     ktime_t                         praecox;
>> +#endif
>
> Throwing a new struct member at some random place optimizes the struct
> footprint, right?

My bad, I'll move the member two items up, just below the pointer 'base'.

> And of course documenting new struct members is optional, correct? I'm
> really looking forward for the explanation of that variable name.

It marks the start time when a process is scheduled to be woken up as
the result of the expiry of the hrtimer. I will mention it in the comments.
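
A kernel-doc line along these lines could carry that explanation (the wording is only a suggestion, and reflects my reading of the code; "praecox" is Latin for "premature"):

```c
/*
 * @praecox: timestamp taken when the timer is (re)started with an
 *           expiry time already in the past; the delayed-timer
 *           histogram then measures lateness from this start time
 *           instead of the programmed expiry, so delay introduced
 *           by the caller itself is not counted.
 */
ktime_t praecox;
```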

>> +
>> +       TP_fast_assign(
>> +               __entry->toffset = toffset;
>> +               memcpy(__entry->tcomm, task != NULL ? task->comm : "<none>",
>> +                       task != NULL ? TASK_COMM_LEN : 7);
>> +               __entry->tprio  = task != NULL ? task->prio : -1;
>> +       ),
>
> What's the value of storing prio? None, if the task is using the DL
> scheduler.
>
>> +#ifdef CONFIG_DELAYED_TIMER_OFFSETS_HIST
>> +static inline void latency_hrtimer_timing_start(struct hrtimer *timer,
>> +                                      struct hrtimer_clock_base *new_base,
>> +                                      ktime_t tim)
>> +{
>> +     if (unlikely(trace_latency_hrtimer_interrupt_enabled())) {
>> +             ktime_t now = new_base->get_time();
>> +
>> +             if (ktime_to_ns(tim) < ktime_to_ns(now))
>> +                     timer->praecox = now;
>> +             else
>> +                     timer->praecox = ktime_set(0, 0);
>
> What's the whole point of this? You're adding an extra get_time() call into
> that path. What for? Comments in the code are overrated, right?

Will add comments here.

>> +     }
>> +}
>> +
>> +static inline void latency_hrtimer_timing_stop(struct hrtimer *timer,
>> +                                             ktime_t basenow)
>> +{
>> +     long latency;
>> +     struct task_struct *task;
>> +
>> +     if (likely(!trace_latency_hrtimer_interrupt_enabled()))
>> +             return;
>> +
>> +     latency = ktime_to_ns(ktime_sub(basenow,
>> +                           ktime_to_ns(timer->praecox) ?
>> +                           timer->praecox : hrtimer_get_expires(timer)));
>> +     task = timer->function == hrtimer_wakeup ?
>> +                     container_of(timer, struct hrtimer_sleeper,
>> +                                  timer)->task : NULL;
>
> This is a complete horrible hack. You're tying the task evaluation into a
> single instance of hrtimer users. What's the justification for this and why
> do you need task at all?

If I understand it correctly, I was trying to mark the time when an
hrtimer is started or restarted, and the time when the same timer expires.
At expiry, the current time is compared against either the recorded start
time or the programmed expiry. The task indicates the task woken up as a
result of the timer expiry.
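In userspace terms, the computation amounts to something like this (a
simplified sketch, not the kernel code: ktime_t is reduced to a plain s64
nanosecond count, and the hrtimer/hrtimer_sleeper plumbing is left out):

```c
#include <assert.h>

typedef long long s64;

/* Simplified model of the patch: praecox records "now" only when the
 * timer is armed with an expiry that is already in the past; otherwise
 * it stays 0. */
struct timer_model {
	s64 expires;
	s64 praecox;
};

static void timing_start(struct timer_model *t, s64 expires, s64 now)
{
	t->expires = expires;
	t->praecox = (expires < now) ? now : 0;
}

/* Latency is measured from praecox when it is set (late arming),
 * otherwise from the programmed expiry. */
static s64 timing_stop(const struct timer_model *t, s64 basenow)
{
	s64 ref = t->praecox ? t->praecox : t->expires;
	return basenow - ref;
}
```

So a timer armed 100 ns in the past and handled 50 ns after arming reports
50 ns of latency, not 150 ns.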

>
>> +     if (latency > 0)
>> +             trace_latency_hrtimer_interrupt((u64) latency, task);
>
> And how should latency become < 0 ever? hrtimer interrupt guarantees to
> never expire timers prematurely.

Not sure why, but I have seen some negative values here.

> Neither the changelog nor the code contain any information about how that
> thing is to be used and what the actual value of the stored information
> is.
>
> No way that this ad hoc statistics hackery which we carry in RT for a well
> known reason is going to go upstream without proper justification and a
> well thought out and documented functionality.
>

As Carsten has mentioned in his patch, this latency alone is not useful
enough without the process latencies which occur due to disabled
interrupts/preemption or because of a timer missing its deadline. Since the
process latency histogram needed tracepoints in the scheduler code, which
is discouraged, I haven't gotten around to doing it yet. I've been trying
to calculate the latencies by making use of kprobe events and tracing_map,
but I was finding it a little tricky.

Thanks,
Binoy

* Re: [PATCH v6 4/4] tracing: Histogram for delayed hrtimer offsets
  2016-09-08  9:39     ` Binoy Jayan
@ 2016-09-09 13:30       ` Thomas Gleixner
  2016-09-23 14:20         ` Carsten Emde
  2016-09-09 17:10       ` Mark Brown
  1 sibling, 1 reply; 21+ messages in thread
From: Thomas Gleixner @ 2016-09-09 13:30 UTC (permalink / raw)
  To: Binoy Jayan
  Cc: Carsten Emde, Steven Rostedt (Red Hat),
	Ingo Molnar, Daniel Wagner, Arnd Bergmann,
	Linux kernel mailing list, Masami

On Thu, 8 Sep 2016, Binoy Jayan wrote:
> On 8 September 2016 at 12:40, Thomas Gleixner <tglx@linutronix.de> wrote:
> >> +#ifdef CONFIG_DELAYED_TIMER_OFFSETS_HIST
> >> +     ktime_t                         praecox;
> >> +#endif
> > 
> > And of course documenting new struct members is optional, correct? I'm
> > really looking forward for the explanation of that variable name.
> 
> It marks the start time when a process is scheduled to be woken up as the
> result of expiry of the hrtimer. Will be mentioning it in the comments.

That's wrong. It only contains the start time when the timer was not
expired at the point where it is started. Otherwise it's set to 0.

Random explanations are worse than no explanation at all.
 
> >> +#ifdef CONFIG_DELAYED_TIMER_OFFSETS_HIST
> >> +static inline void latency_hrtimer_timing_start(struct hrtimer *timer,
> >> +                                      struct hrtimer_clock_base *new_base,
> >> +                                      ktime_t tim)
> >> +{
> >> +     if (unlikely(trace_latency_hrtimer_interrupt_enabled())) {
> >> +             ktime_t now = new_base->get_time();
> >> +
> >> +             if (ktime_to_ns(tim) < ktime_to_ns(now))
> >> +                     timer->praecox = now;
> >> +             else
> >> +                     timer->praecox = ktime_set(0, 0);
> >
> > What's the whole point of this? You're adding an extra get_time() call into
> > that path. What for? Comments in the code are overrated, right?
> 
> Will add comments here.

I would rather get an explanation in this mail thread so it can be
discussed before you send out the next version.

> >> +     latency = ktime_to_ns(ktime_sub(basenow,
> >> +                           ktime_to_ns(timer->praecox) ?
> >> +                           timer->praecox : hrtimer_get_expires(timer)));
> >> +     task = timer->function == hrtimer_wakeup ?
> >> +                     container_of(timer, struct hrtimer_sleeper,
> >> +                                  timer)->task : NULL;
> >
> > This is a complete horrible hack. You're tying the task evaluation into a
> > single instance of hrtimer users. What's the justification for this and why
> > do you need task at all?
> 
> If I am understanding it not wrongly, I was trying to mark the time when
> and hrtimer is started or restarted and the time when the same expires.
> The expiry time is compared against the time now and the actual expiry.
> The task indicates the task woken up as a result of the timer expiry.

You are explaining what the code does and you are explaining it wrong. But
you don't answer my question at all.

> >
> >> +     if (latency > 0)
> >> +             trace_latency_hrtimer_interrupt((u64) latency, task);
> >
> > And how should latency become < 0 ever? hrtimer interrupt guarantees to
> > never expire timers prematurely.
> 
> Not sure why, but I have seen some negative values here.

Then you had better investigate that matter instead of adding band-aids to
the code.
 
> > Neither the changelog nor the code contain any information about how that
> > thing is to be used and what the actual value of the stored information
> > is.
> >
> > No way that this ad hoc statistics hackery which we carry in RT for a well
> > known reason is going to go upstream without proper justification and a
> > well thought out and documented functionality.
> >
> 
> As Carsten has mentioned in his patch, this latency alone is not useful enough
> without the process latency which occur due to the disabled interrupts/preemtion
> or because of a timer missing its deadline. Since the process latency histogram
> needed tracepoints in scheduler code which is discouraged, I haven't gotten
> around to do it yet. I've been trying to calculate latencies by making
> use of kprobes
> events and tracing_maps but I was finding it little tricky.

How is that related to my question?

If that thing has no value on its own, why are you trying to push it?

I don't care if you find it tricky. If you want this patch merged, then you
have to supply information

    - What it does

    - Why it is useful

    - How it can be used

    - What is the output and how can it be analyzed

    - What are the limitations

Thanks,

	tglx

* Re: [PATCH v6 4/4] tracing: Histogram for delayed hrtimer offsets
  2016-09-08  9:39     ` Binoy Jayan
  2016-09-09 13:30       ` Thomas Gleixner
@ 2016-09-09 17:10       ` Mark Brown
  2016-09-10  6:10         ` Thomas Gleixner
  1 sibling, 1 reply; 21+ messages in thread
From: Mark Brown @ 2016-09-09 17:10 UTC (permalink / raw)
  To: Binoy Jayan
  Cc: Thomas Gleixner, Carsten Emde, Steven Rostedt (Red Hat),
	Ingo Molnar, Daniel Wagner, Arnd Bergmann,
	Linux kernel mailing list, Masami

On Thu, Sep 08, 2016 at 03:09:36PM +0530, Binoy Jayan wrote:
> On 8 September 2016 at 12:40, Thomas Gleixner <tglx@linutronix.de> wrote:
> > On Wed, 7 Sep 2016, Binoy Jayan wrote:

> >> +#ifdef CONFIG_DELAYED_TIMER_OFFSETS_HIST
> >> +     ktime_t                         praecox;
> >> +#endif

> > And of course documenting new struct members is optional, correct? I'm
> > really looking forward for the explanation of that variable name.

> It marks the start time when a process is scheduled to be woken up as
> the result
> of expiry of the hrtimer. Will be mentioning it in the comments.

Even with an appropriate comment I'm not sure that the term praecox is
sufficiently widely understood to be a clear variable name.


* Re: [PATCH v6 4/4] tracing: Histogram for delayed hrtimer offsets
  2016-09-09 17:10       ` Mark Brown
@ 2016-09-10  6:10         ` Thomas Gleixner
  0 siblings, 0 replies; 21+ messages in thread
From: Thomas Gleixner @ 2016-09-10  6:10 UTC (permalink / raw)
  To: Mark Brown
  Cc: Binoy Jayan, Carsten Emde, Steven Rostedt (Red Hat),
	Ingo Molnar, Daniel Wagner, Arnd Bergmann,
	Linux kernel mailing list, Masami

On Fri, 9 Sep 2016, Mark Brown wrote:
> On Thu, Sep 08, 2016 at 03:09:36PM +0530, Binoy Jayan wrote:
> > On 8 September 2016 at 12:40, Thomas Gleixner <tglx@linutronix.de> wrote:
> > > On Wed, 7 Sep 2016, Binoy Jayan wrote:
> 
> > >> +#ifdef CONFIG_DELAYED_TIMER_OFFSETS_HIST
> > >> +     ktime_t                         praecox;
> > >> +#endif
> 
> > > And of course documenting new struct members is optional, correct? I'm
> > > really looking forward for the explanation of that variable name.
> 
> > It marks the start time when a process is scheduled to be woken up as
> > the result
> > of expiry of the hrtimer. Will be mentioning it in the comments.
> 
> Even with an appropriate comment I'm not sure that the term praecox is
> sufficiently widely understood to be a clear variable name.

For those who understand it and know the main use case of this term it
certainly has entertainment value.

Thanks,

	tglx

* Re: [PATCH v6 3/4] tracing: Add trace_irqsoff tracepoints
  2016-09-08  8:06   ` Thomas Gleixner
@ 2016-09-12 13:50     ` Steven Rostedt
  2016-09-14  6:52     ` Binoy Jayan
  1 sibling, 0 replies; 21+ messages in thread
From: Steven Rostedt @ 2016-09-12 13:50 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Binoy Jayan, Ingo Molnar, Daniel Wagner, Arnd Bergmann,
	linux-kernel, Masami

On Thu, 8 Sep 2016 10:06:13 +0200 (CEST)
Thomas Gleixner <tglx@linutronix.de> wrote:


> > +
> >  	if (preempt_trace() || irq_trace())
> >  		start_critical_timing(CALLER_ADDR0, CALLER_ADDR1);
> >  }
> > @@ -431,6 +451,9 @@ void stop_critical_timings(void)
> >  {
> >  	if (preempt_trace() || irq_trace())
> >  		stop_critical_timing(CALLER_ADDR0, CALLER_ADDR1);
> > +
> > +	if (unlikely(trace_latency_preempt_enabled()))
> > +		latency_preempt_timing_stop(LT_CRITTIME);  
> 
> And this is silly as well. You can put the whole evaluation into the trace
> event assignement so the tracer core will handle that conditional.
> 
> Aside of that it is silly to evaluate trace_clock_local() for the actual
> tracepoint simply because that time is already stored in the tracepoint
> itself. The flow here is:
> 
> 	event = trace_event_buffer_lock_reserve();
> 	entry = ring_buffer_event_data(event);
>        	{ <assign>; }  <-- Here we assign the entries by the __field and
>                            __array macros.
> 
> 
> So you should talk to Steven about having a way to retrieve that time from
> entry itself in a generic way.

Note, the time stamp in the ring buffer is not guaranteed to be
trace_local_clock(), it could be changed to be something like
x86_64-tsc.

That said, the timestamp that is saved in the event is an offset from
the previous event (to save space in the buffer). It won't be efficient
to calculate the timestamp at record time because of this. The last
stamp recorded is, however, saved in the buffer descriptor. But it could
be racy to capture it, as NMIs and interrupts could update it.
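The delta encoding can be modeled roughly like this (a userspace sketch
with made-up names, not the actual ring-buffer code): each event carries
only the offset from its predecessor, so an absolute timestamp has to be
reconstructed by summing deltas from a known base.

```c
#include <assert.h>

typedef unsigned long long u64;

/* Each event stores only the time delta from the previous event,
 * which keeps the per-event storage small. */
struct event {
	u64 delta;
};

/* Recovering an absolute timestamp means walking the deltas from the
 * buffer's base time -- fine at read time, expensive at record time. */
static u64 event_timestamp(u64 base, const struct event *evs, int idx)
{
	u64 ts = base;
	int i;

	for (i = 0; i <= idx; i++)
		ts += evs[i].delta;
	return ts;
}
```

This is why handing an absolute time back to the event at record time is
not free: only the delta is at hand, and the last absolute stamp kept in
the buffer descriptor can change under NMIs and interrupts.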

-- Steve

* Re: [PATCH v6 3/4] tracing: Add trace_irqsoff tracepoints
  2016-09-08  8:06   ` Thomas Gleixner
  2016-09-12 13:50     ` Steven Rostedt
@ 2016-09-14  6:52     ` Binoy Jayan
  2016-09-14  9:42       ` Thomas Gleixner
  1 sibling, 1 reply; 21+ messages in thread
From: Binoy Jayan @ 2016-09-14  6:52 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Steven Rostedt (Red Hat),
	Ingo Molnar, Daniel Wagner, Arnd Bergmann,
	Linux kernel mailing list, Masami, Carsten Emde

Hi Thomas,

Sorry for the late reply. I was trying out a way to do it as you
suggested, and tried to reach Carsten regarding the hrtimer latency patch,
but was unable to.

On 8 September 2016 at 13:36, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Wed, 7 Sep 2016, Binoy Jayan wrote:
>> This captures only the latencies introduced by disabled irqs and
>> preemption. Additional per process data has to be captured to calculate
>> the effective latencies introduced for individual processes.
>
> And what is the additional per process data and how is it captured and
> used?

That is the patch which would have to touch the scheduler code, which I
did not want to do. I was trying to achieve the same using kprobes but did
not get around to making it work with the histograms.

>> +static inline void latency_preempt_timing_start(enum latency_type ltype)
>> +{
>> +     this_cpu_write(lat_ts[ltype], (cycle_t) trace_clock_local());
>
> What is this silly type cast for? Why can't you just use u64 ?

>> +static inline void latency_preempt_timing_stop(enum latency_type type)
>> +{
>> +     trace_latency_preempt(type,
>> +             (cycle_t) trace_clock_local() - this_cpu_read(lat_ts[type]));
>
> And then of course you use a completely different data type in the trace
> itself.

This has been changed.

>> +DECLARE_EVENT_CLASS(latency_template,
>> +     TP_PROTO(int ltype, cycles_t latency),
>
> Are you sure, that you know what you are doing here? If yes, then please
> explain it in form of comments so mere mortals can understand it as well.

Added comments for the same.

>>  /* start and stop critical timings used to for stoppage (in idle) */
>>  void start_critical_timings(void)
>>  {
>> +     if (unlikely(trace_latency_preempt_enabled()))
>> +             latency_preempt_timing_start(LT_CRITTIME);
>
> I doubt, that this conditional is less expensive than a simple
> unconditional store to a per cpu variable.

This is changed as well.

>> @@ -431,6 +451,9 @@ void stop_critical_timings(void)
>>  {
>>       if (preempt_trace() || irq_trace())
>>               stop_critical_timing(CALLER_ADDR0, CALLER_ADDR1);
>> +
>> +     if (unlikely(trace_latency_preempt_enabled()))
>> +             latency_preempt_timing_stop(LT_CRITTIME);
>
> And this is silly as well. You can put the whole evaluation into the trace
> event assignement so the tracer core will handle that conditional.

I hope this can be done for "time_hardirqs_off" and "trace_preempt_off"
as well.

> Aside of that it is silly to evaluate trace_clock_local() for the actual
> tracepoint simply because that time is already stored in the tracepoint
> itself. The flow here is:
>
>         event = trace_event_buffer_lock_reserve();
>         entry = ring_buffer_event_data(event);
>         { <assign>; }  <-- Here we assign the entries by the __field and
>                            __array macros.
>
>
> So you should talk to Steven about having a way to retrieve that time from
> entry itself in a generic way.
>

Steven mentioned that the timestamp in the ring buffer is in offset form,
and also that it may not be equivalent to trace_clock_local() time. I also
tried using the timestamp from the per-cpu trace data, but it did not seem
to provide the correct value for the timestamp:

struct trace_array_cpu *data =
	per_cpu_ptr(irqsoff_trace->trace_buffer.data, cpu);
data->preempt_timestamp

Thanks,
Binoy

* Re: [PATCH v6 3/4] tracing: Add trace_irqsoff tracepoints
  2016-09-14  6:52     ` Binoy Jayan
@ 2016-09-14  9:42       ` Thomas Gleixner
  0 siblings, 0 replies; 21+ messages in thread
From: Thomas Gleixner @ 2016-09-14  9:42 UTC (permalink / raw)
  To: Binoy Jayan
  Cc: Steven Rostedt (Red Hat),
	Ingo Molnar, Daniel Wagner, Arnd Bergmann,
	Linux kernel mailing list, Masami, Carsten Emde

Binoy,

On Wed, 14 Sep 2016, Binoy Jayan wrote:
> On 8 September 2016 at 13:36, Thomas Gleixner <tglx@linutronix.de> wrote:
> > On Wed, 7 Sep 2016, Binoy Jayan wrote:
> >> This captures only the latencies introduced by disabled irqs and
> >> preemption. Additional per process data has to be captured to calculate
> >> the effective latencies introduced for individual processes.
> >
> > And what is the additional per process data and how is it captured and
> > used?
> 
> That is the patch which would have to touch the scheduler code, which I
> did not want to do.  I was trying to achieve the same using kprobes but
> did not get around to making it work with the histograms.

So the questions remain:

   - How is this facility useful on its own?
   - What is the value of adding this?
   - How is it used and what valuable debug information can be collected?

And these questions need to be answered independently of additional data,
which you are obviously unable to sample.
    
> > Aside of that it is silly to evaluate trace_clock_local() for the actual
> > tracepoint simply because that time is already stored in the tracepoint
> > itself. The flow here is:
> >
> >         event = trace_event_buffer_lock_reserve();
> >         entry = ring_buffer_event_data(event);
> >         { <assign>; }  <-- Here we assign the entries by the __field and
> >                            __array macros.
> >
> >
> > So you should talk to Steven about having a way to retrieve that time from
> > entry itself in a generic way.
> >
> Steven mentioned that the timestamp in the ring buffer is in the offset
> form and also it may not be equivalent to trace_clock_local() time. I
> also tried using the timestamp from the per cpu trace data but it did not
> seem to provide correct value for the timestamp.

That's what we have today. Now if we want to add something like this then
we really want to avoid reading timestamps twice. And it's irrelevant
whether the timestamp comes from trace_clock_local() or
ktime_get_[mono|raw]_fast_ns because you can do the histogram based on that
as well.

Fact is that the tracer core itself takes a timestamp already and it should
not be rocket science to make the core store it as a full timestamp when
requested by the caller. Such a store is way cheaper than accessing
hardware over and over.
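A rough sketch of that idea (hypothetical names, purely illustrative, not
an existing tracer API): the core reads the clock once per event and the
stored entry reuses that value, instead of the caller reading hardware a
second time.

```c
#include <assert.h>

typedef unsigned long long u64;

static int clock_reads;		/* counts hardware clock accesses */

static u64 read_clock(void)	/* stand-in for trace_clock_local() */
{
	clock_reads++;
	return 42000ULL + (u64)clock_reads;
}

struct entry {
	u64 ts;			/* full timestamp stored by the core */
	u64 payload;
};

/* The core takes exactly one timestamp per event; the entry keeps the
 * full value, so a latency histogram can reuse it for free. */
static struct entry record_event(u64 payload)
{
	struct entry e = { .ts = read_clock(), .payload = payload };

	return e;
}
```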

Thanks,

	tglx

* Re: [PATCH v6 4/4] tracing: Histogram for delayed hrtimer offsets
  2016-09-09 13:30       ` Thomas Gleixner
@ 2016-09-23 14:20         ` Carsten Emde
  0 siblings, 0 replies; 21+ messages in thread
From: Carsten Emde @ 2016-09-23 14:20 UTC (permalink / raw)
  To: Linux kernel mailing list
  Cc: Thomas Gleixner, Binoy Jayan, Steven Rostedt (Red Hat),
	Ingo Molnar, Daniel Wagner, Arnd Bergmann, Masami

On 09/09/2016 03:30 PM, Thomas Gleixner wrote:
> On Thu, 8 Sep 2016, Binoy Jayan wrote:
>> On 8 September 2016 at 12:40, Thomas Gleixner <tglx@linutronix.de> wrote:
> [..]
>>>> +     latency = ktime_to_ns(ktime_sub(basenow,
>>>> +                           ktime_to_ns(timer->praecox) ?
>>>> +                           timer->praecox : hrtimer_get_expires(timer)));
>>>> +     task = timer->function == hrtimer_wakeup ?
>>>> +                     container_of(timer, struct hrtimer_sleeper,
>>>> +                                  timer)->task : NULL;
>>>
>>> This is a complete horrible hack. You're tying the task evaluation into a
>>> single instance of hrtimer users. What's the justification for this and why
>>> do you need task at all?
This only makes sense, if at all, if wakeup latency histograms that store
the task with the highest latency so far are available. Initially, I wanted
to give a short explanation here of why such internal histograms may be
useful, but I found it easier in a format where pictures of example latency
plots and recordings can be shown. So I wrote a short article on
"Preemption latency of real-time Linux systems" ->
https://www.osadl.org/Single-View.111+M5246ebc55e3.0.html.

Hope this helps,

	-Carsten.

end of thread, other threads:[~2016-09-23 15:04 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-09-07 11:13 [PATCH v6 0/4] *** Latency histograms - IRQSOFF,PREEMPTOFF, Delayed HRTIMERS *** Binoy Jayan
2016-09-07 11:13 ` [PATCH v6 1/4] tracing: Deference pointers without RCU checks Binoy Jayan
2016-09-08  6:46   ` Masami Hiramatsu
2016-09-07 11:13 ` [PATCH v6 2/4] tracing: Add hist trigger support for generic fields Binoy Jayan
2016-09-07 13:39   ` kbuild test robot
2016-09-07 11:13 ` [PATCH v6 3/4] tracing: Add trace_irqsoff tracepoints Binoy Jayan
2016-09-07 16:42   ` Thomas Gleixner
2016-09-08  5:20     ` Daniel Wagner
2016-09-08  6:31       ` Thomas Gleixner
2016-09-08  8:06   ` Thomas Gleixner
2016-09-12 13:50     ` Steven Rostedt
2016-09-14  6:52     ` Binoy Jayan
2016-09-14  9:42       ` Thomas Gleixner
2016-09-07 11:13 ` [PATCH v6 4/4] tracing: Histogram for delayed hrtimer offsets Binoy Jayan
2016-09-08  6:56   ` Masami Hiramatsu
2016-09-08  7:10   ` Thomas Gleixner
2016-09-08  9:39     ` Binoy Jayan
2016-09-09 13:30       ` Thomas Gleixner
2016-09-23 14:20         ` Carsten Emde
2016-09-09 17:10       ` Mark Brown
2016-09-10  6:10         ` Thomas Gleixner
