linux-kernel.vger.kernel.org archive mirror
* [PATCH v5 0/4] Some new features for the preempt/irqsoff tracers
@ 2019-09-03 13:25 Viktor Rosendahl
  2019-09-03 13:25 ` [PATCH v5 1/4] ftrace: Implement fs notification for tracing_max_latency Viktor Rosendahl
                   ` (3 more replies)
  0 siblings, 4 replies; 19+ messages in thread
From: Viktor Rosendahl @ 2019-09-03 13:25 UTC (permalink / raw)
  To: Steven Rostedt, Ingo Molnar, linux-kernel
  Cc: Joel Fernandes, Peter Zijlstra, Viktor Rosendahl

Hello all,

Changes in v5:
- [PATCH 1/4]:
  * trace_disable_fsnotify() => latency_fsnotify_disable(), in order to avoid
    naming conflicts with trace points.

  * Using more appropriate operations for the percpu variables.

  * Moved around the calls to latency_fsnotify_enable/disable() a bit.

  * Added static branches to reduce overhead when the facility is not in use.

  * Does not add any trace points anymore. Instead, explicit function calls
    are used from the sched and idle code.

  * Unfortunately still adds some gunk to sched and idle code. I do not know
    how to avoid this.

This series is meant to address two issues with the latency tracing.

The first three patches provide a method to trace latencies that always
occur very close to each other, and to differentiate between them, in spite
of the fact that the latency tracers work in overwrite mode.

[PATCH 1/4] This implements fs notification for tracing_max_latency. It
makes it possible for userspace to detect when a new latency is
available.

[PATCH 2/4] This extends the preemptirq_delay_test module so that it can be
used to generate a burst of closely occurring latencies.

[PATCH 3/4] This adds a user space program to the tools directory that
utilizes the fs notification feature and a randomized algorithm to print out
any of the latencies in a burst with approximately equal probability.

The last patch is not directly connected to the others. Earlier versions
did not apply cleanly on their own, but this one does, so in principle it
could be applied separately from the rest of the series.

[PATCH 4/4] This adds the option console-latency to the trace options. This
makes it possible to enable tracing of console latencies.
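
As a usage sketch (assuming the new flag shows up under the tracefs
options directory like other trace options), console latency tracing
would then be enabled with:

echo 1 > /sys/kernel/debug/tracing/options/console-latency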

best regards,

Viktor

Viktor Rosendahl (4):
  ftrace: Implement fs notification for tracing_max_latency
  preemptirq_delay_test: Add the burst feature and a sysfs trigger
  Add the latency-collector to tools
  ftrace: Add an option for tracing console latencies

 include/linux/ftrace.h               |   31 +
 include/linux/irqflags.h             |   21 +
 kernel/printk/printk.c               |    6 +-
 kernel/sched/core.c                  |    3 +
 kernel/sched/idle.c                  |    3 +
 kernel/sched/sched.h                 |    1 +
 kernel/trace/Kconfig                 |    6 +-
 kernel/trace/preemptirq_delay_test.c |  145 ++-
 kernel/trace/trace.c                 |  112 ++-
 kernel/trace/trace.h                 |   23 +
 kernel/trace/trace_hwlat.c           |    4 +-
 kernel/trace/trace_irqsoff.c         |   16 +
 kernel/trace/trace_sched_wakeup.c    |    4 +
 tools/Makefile                       |   14 +-
 tools/trace/Makefile                 |   20 +
 tools/trace/latency-collector.c      | 1211 ++++++++++++++++++++++++++
 16 files changed, 1587 insertions(+), 33 deletions(-)
 create mode 100644 tools/trace/Makefile
 create mode 100644 tools/trace/latency-collector.c

-- 
2.17.1


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v5 1/4] ftrace: Implement fs notification for tracing_max_latency
  2019-09-03 13:25 [PATCH v5 0/4] Some new features for the preempt/irqsoff tracers Viktor Rosendahl
@ 2019-09-03 13:25 ` Viktor Rosendahl
  2019-09-04  4:00   ` Joel Fernandes
                     ` (2 more replies)
  2019-09-03 13:26 ` [PATCH v5 2/4] preemptirq_delay_test: Add the burst feature and a sysfs trigger Viktor Rosendahl
                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 19+ messages in thread
From: Viktor Rosendahl @ 2019-09-03 13:25 UTC (permalink / raw)
  To: Steven Rostedt, Ingo Molnar, linux-kernel
  Cc: Joel Fernandes, Peter Zijlstra, Viktor Rosendahl

This patch makes the tracing_max_latency file, e.g.
/sys/kernel/debug/tracing/tracing_max_latency, receive notifications
through the fsnotify framework when a new latency is available.
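
For illustration, a minimal user space sketch (not part of this series)
that waits for such notifications could look roughly as follows; the
debugfs mount point is assumed to be /sys/kernel/debug and error
handling is mostly omitted:

#include <stdio.h>
#include <sys/inotify.h>
#include <unistd.h>

int main(void)
{
	char buf[4096];
	int fd = inotify_init();

	if (fd < 0)
		return 1;
	if (inotify_add_watch(fd, "/sys/kernel/debug/tracing/tracing_max_latency",
			      IN_MODIFY) < 0)
		return 1;
	for (;;) {
		/* Blocks until tracing_max_latency is modified again */
		if (read(fd, buf, sizeof(buf)) <= 0)
			break;
		printf("new max latency available\n");
		/* ...open and read /sys/kernel/debug/tracing/trace here... */
	}
	return 0;
}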

One particularly interesting use of this facility is when enabling
threshold tracing, through /sys/kernel/debug/tracing/tracing_thresh,
together with the preempt/irqsoff tracers. This makes it possible to
implement a user space program that can, with equal probability,
obtain traces of latencies that occur immediately after each other in
spite of the fact that the preempt/irqsoff tracers operate in overwrite
mode.

This facility works with the hwlat, preempt/irqsoff, and wakeup
tracers.

This patch also adds some unfortunate calls from __schedule() and
do_idle(). These calls to latency_fsnotify_disable/enable() are
needed because we cannot wake up the workqueue from these critical
sections without risking a deadlock. Similar problems would also arise
if we tried to schedule a tasklet, raise a softirq, or wake up a kernel
thread. If a notification event happens in the forbidden sections,
we schedule the fsnotify work as soon as we have exited them.

There was a suggestion to remove this latency_fsnotify_enable/disable()
gunk, or at least to combine it with the start_critical_timings() and
stop_critical_timings(). I have however not been able to come up with
a way to do it.

It seems like it would be possible to simply replace the calls to
latency_fsnotify_enable/disable() with calls to
start/stop_critical_timings(). However, the main problem is that this
would not work for the wakeup tracer. The wakeup tracer needs a
facility that postpones the notifications, not one that prevents the
measurements, because all its measurements take place in the middle
of __schedule(). On the other hand, in some places, like in idle and
the console code, we need start/stop functions that prevent the
measurements from being made.

Signed-off-by: Viktor Rosendahl <viktor.rosendahl@gmail.com>
---
 include/linux/ftrace.h            |  31 +++++++++
 kernel/sched/core.c               |   3 +
 kernel/sched/idle.c               |   3 +
 kernel/sched/sched.h              |   1 +
 kernel/trace/trace.c              | 112 +++++++++++++++++++++++++++++-
 kernel/trace/trace.h              |  22 ++++++
 kernel/trace/trace_hwlat.c        |   4 +-
 kernel/trace/trace_irqsoff.c      |   4 ++
 kernel/trace/trace_sched_wakeup.c |   4 ++
 9 files changed, 181 insertions(+), 3 deletions(-)

diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index 8a8cb3c401b2..b4d9700ef917 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -907,4 +907,35 @@ unsigned long arch_syscall_addr(int nr);
 
 #endif /* CONFIG_FTRACE_SYSCALLS */
 
+#if (defined(CONFIG_TRACER_MAX_TRACE) || defined(CONFIG_HWLAT_TRACER)) && \
+	defined(CONFIG_FSNOTIFY)
+
+DECLARE_PER_CPU(int, latency_notify_disable);
+DECLARE_STATIC_KEY_FALSE(latency_notify_key);
+
+void latency_fsnotify_process(void);
+
+/*
+ * Disable/enable fsnotify while in scheduler and idle code. Trying to wake
+ * anything up from there, such as calling queue_work() is prone to deadlock.
+ */
+static inline void latency_fsnotify_disable(void)
+{
+	this_cpu_inc(latency_notify_disable);
+}
+
+static inline void latency_fsnotify_enable(void)
+{
+	this_cpu_dec(latency_notify_disable);
+	if (static_branch_unlikely(&latency_notify_key))
+		latency_fsnotify_process();
+}
+
+#else
+
+#define latency_fsnotify_disable() do { } while (0)
+#define latency_fsnotify_enable()  do { } while (0)
+
+#endif
+
 #endif /* _LINUX_FTRACE_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 010d578118d6..e3c1dc801073 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3198,6 +3198,7 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
 	 */
 
 	rq = finish_task_switch(prev);
+	latency_fsnotify_enable();
 	balance_callback(rq);
 	preempt_enable();
 
@@ -3820,6 +3821,7 @@ static void __sched notrace __schedule(bool preempt)
 
 	local_irq_disable();
 	rcu_note_context_switch(preempt);
+	latency_fsnotify_disable();
 
 	/*
 	 * Make sure that signal_pending_state()->signal_pending() below
@@ -3883,6 +3885,7 @@ static void __sched notrace __schedule(bool preempt)
 		rq_unlock_irq(rq, &rf);
 	}
 
+	latency_fsnotify_enable();
 	balance_callback(rq);
 }
 
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 80940939b733..5fc87d99a407 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -236,6 +236,7 @@ static void do_idle(void)
 
 	__current_set_polling();
 	tick_nohz_idle_enter();
+	latency_fsnotify_disable();
 
 	while (!need_resched()) {
 		check_pgt_cache();
@@ -265,6 +266,8 @@ static void do_idle(void)
 		arch_cpu_idle_exit();
 	}
 
+	latency_fsnotify_enable();
+
 	/*
 	 * Since we fell out of the loop above, we know TIF_NEED_RESCHED must
 	 * be set, propagate it into PREEMPT_NEED_RESCHED.
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 802b1f3405f2..467d6ad03f16 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -46,6 +46,7 @@
 #include <linux/debugfs.h>
 #include <linux/delayacct.h>
 #include <linux/energy_model.h>
+#include <linux/ftrace.h>
 #include <linux/init_task.h>
 #include <linux/kprobes.h>
 #include <linux/kthread.h>
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 563e80f9006a..a622263a69e4 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -44,6 +44,10 @@
 #include <linux/trace.h>
 #include <linux/sched/clock.h>
 #include <linux/sched/rt.h>
+#include <linux/fsnotify.h>
+#include <linux/workqueue.h>
+#include <trace/events/power.h>
+#include <trace/events/sched.h>
 
 #include "trace.h"
 #include "trace_output.h"
@@ -1480,6 +1484,110 @@ static ssize_t trace_seq_to_buffer(struct trace_seq *s, void *buf, size_t cnt)
 
 unsigned long __read_mostly	tracing_thresh;
 
+#if (defined(CONFIG_TRACER_MAX_TRACE) || defined(CONFIG_HWLAT_TRACER)) && \
+	defined(CONFIG_FSNOTIFY)
+
+static const struct file_operations tracing_max_lat_fops;
+static struct workqueue_struct *fsnotify_wq;
+static DEFINE_PER_CPU(struct llist_head, notify_list);
+
+DEFINE_PER_CPU(int, latency_notify_disable);
+DEFINE_STATIC_KEY_FALSE(latency_notify_key);
+
+static void latency_fsnotify_workfn(struct work_struct *work)
+{
+	struct trace_array *tr = container_of(work, struct trace_array,
+					      fsnotify_work);
+	fsnotify(tr->d_max_latency->d_inode, FS_MODIFY,
+		 tr->d_max_latency->d_inode, FSNOTIFY_EVENT_INODE, NULL, 0);
+}
+
+static void trace_create_maxlat_file(struct trace_array *tr,
+				     struct dentry *d_tracer)
+{
+	INIT_WORK(&tr->fsnotify_work, latency_fsnotify_workfn);
+	atomic_set(&tr->notify_pending, 0);
+	tr->d_max_latency = trace_create_file("tracing_max_latency", 0644,
+					      d_tracer, &tr->max_latency,
+					      &tracing_max_lat_fops);
+}
+
+void latency_fsnotify_stop(void)
+{
+	/* Make sure all CPUs see caller's previous actions to stop tracer */
+	smp_wmb();
+	static_branch_disable(&latency_notify_key);
+	latency_fsnotify_process();
+}
+
+void latency_fsnotify_start(void)
+{
+	static_branch_enable(&latency_notify_key);
+	/* Make sure all CPUs see key value before caller continue */
+	smp_wmb();
+}
+
+void latency_fsnotify_process(void)
+{
+	struct trace_array *tr;
+	struct llist_head *list;
+	struct llist_node *node;
+
+	if (this_cpu_read(latency_notify_disable))
+		return;
+
+	list = this_cpu_ptr(&notify_list);
+	for (node = llist_del_first(list); node != NULL;
+	     node = llist_del_first(list)) {
+		tr = llist_entry(node, struct trace_array, notify_ll);
+		atomic_set(&tr->notify_pending, 0);
+		queue_work(fsnotify_wq, &tr->fsnotify_work);
+	}
+}
+
+__init static int latency_fsnotify_init(void)
+{
+	fsnotify_wq = alloc_workqueue("tr_max_lat_wq",
+				      WQ_UNBOUND | WQ_HIGHPRI, 0);
+	if (!fsnotify_wq) {
+		pr_err("Unable to allocate tr_max_lat_wq\n");
+		return -ENOMEM;
+	}
+	return 0;
+}
+
+late_initcall_sync(latency_fsnotify_init);
+
+void latency_fsnotify(struct trace_array *tr)
+{
+	if (!fsnotify_wq)
+		return;
+
+	if (!this_cpu_read(latency_notify_disable))
+		queue_work(fsnotify_wq, &tr->fsnotify_work);
+	else {
+		/*
+		 * notify_pending prevents us from adding the same entry to
+		 * more than one notify_list. It will get queued in
+		 * latency_fsnotify_process()
+		 */
+		if (!atomic_xchg(&tr->notify_pending, 1))
+			llist_add(&tr->notify_ll, this_cpu_ptr(&notify_list));
+	}
+}
+
+/*
+ * (defined(CONFIG_TRACER_MAX_TRACE) || defined(CONFIG_HWLAT_TRACER)) && \
+ *  defined(CONFIG_FSNOTIFY)
+ */
+#else
+
+#define trace_create_maxlat_file(tr, d_tracer)				\
+	trace_create_file("tracing_max_latency", 0644, d_tracer,	\
+			  &tr->max_latency, &tracing_max_lat_fops)
+
+#endif
+
 #ifdef CONFIG_TRACER_MAX_TRACE
 /*
  * Copy the new maximum trace into the separate maximum-trace
@@ -1518,6 +1626,7 @@ __update_max_tr(struct trace_array *tr, struct task_struct *tsk, int cpu)
 
 	/* record this tasks comm */
 	tracing_record_cmdline(tsk);
+	latency_fsnotify(tr);
 }
 
 /**
@@ -8550,8 +8659,7 @@ init_tracer_tracefs(struct trace_array *tr, struct dentry *d_tracer)
 	create_trace_options_dir(tr);
 
 #if defined(CONFIG_TRACER_MAX_TRACE) || defined(CONFIG_HWLAT_TRACER)
-	trace_create_file("tracing_max_latency", 0644, d_tracer,
-			&tr->max_latency, &tracing_max_lat_fops);
+	trace_create_maxlat_file(tr, d_tracer);
 #endif
 
 	if (ftrace_create_function_files(tr, d_tracer))
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 005f08629b8b..d9f83b2aaa71 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -16,6 +16,7 @@
 #include <linux/trace_events.h>
 #include <linux/compiler.h>
 #include <linux/glob.h>
+#include <linux/workqueue.h>
 
 #ifdef CONFIG_FTRACE_SYSCALLS
 #include <asm/unistd.h>		/* For NR_SYSCALLS	     */
@@ -264,6 +265,12 @@ struct trace_array {
 #endif
 #if defined(CONFIG_TRACER_MAX_TRACE) || defined(CONFIG_HWLAT_TRACER)
 	unsigned long		max_latency;
+#ifdef CONFIG_FSNOTIFY
+	struct dentry		*d_max_latency;
+	struct work_struct	fsnotify_work;
+	atomic_t		notify_pending;
+	struct llist_node	notify_ll;
+#endif
 #endif
 	struct trace_pid_list	__rcu *filtered_pids;
 	/*
@@ -785,6 +792,21 @@ void update_max_tr_single(struct trace_array *tr,
 			  struct task_struct *tsk, int cpu);
 #endif /* CONFIG_TRACER_MAX_TRACE */
 
+#if (defined(CONFIG_TRACER_MAX_TRACE) || defined(CONFIG_HWLAT_TRACER)) && \
+	defined(CONFIG_FSNOTIFY)
+
+void latency_fsnotify(struct trace_array *tr);
+void latency_fsnotify_start(void);
+void latency_fsnotify_stop(void);
+
+#else
+
+#define latency_fsnotify(tr)     do { } while (0)
+#define latency_fsnotify_start() do { } while (0)
+#define latency_fsnotify_stop()  do { } while (0)
+
+#endif
+
 #ifdef CONFIG_STACKTRACE
 void __trace_stack(struct trace_array *tr, unsigned long flags, int skip,
 		   int pc);
diff --git a/kernel/trace/trace_hwlat.c b/kernel/trace/trace_hwlat.c
index fa95139445b2..9c379261ee89 100644
--- a/kernel/trace/trace_hwlat.c
+++ b/kernel/trace/trace_hwlat.c
@@ -254,8 +254,10 @@ static int get_sample(void)
 		trace_hwlat_sample(&s);
 
 		/* Keep a running maximum ever recorded hardware latency */
-		if (sample > tr->max_latency)
+		if (sample > tr->max_latency) {
 			tr->max_latency = sample;
+			latency_fsnotify(tr);
+		}
 	}
 
 out:
diff --git a/kernel/trace/trace_irqsoff.c b/kernel/trace/trace_irqsoff.c
index a745b0cee5d3..29403a83a5f0 100644
--- a/kernel/trace/trace_irqsoff.c
+++ b/kernel/trace/trace_irqsoff.c
@@ -557,6 +557,7 @@ static int __irqsoff_tracer_init(struct trace_array *tr)
 	if (irqsoff_busy)
 		return -EBUSY;
 
+	latency_fsnotify_start();
 	save_flags = tr->trace_flags;
 
 	/* non overwrite screws up the latency tracers */
@@ -591,16 +592,19 @@ static void __irqsoff_tracer_reset(struct trace_array *tr)
 	ftrace_reset_array_ops(tr);
 
 	irqsoff_busy = false;
+	latency_fsnotify_stop();
 }
 
 static void irqsoff_tracer_start(struct trace_array *tr)
 {
+	latency_fsnotify_start();
 	tracer_enabled = 1;
 }
 
 static void irqsoff_tracer_stop(struct trace_array *tr)
 {
 	tracer_enabled = 0;
+	latency_fsnotify_stop();
 }
 
 #ifdef CONFIG_IRQSOFF_TRACER
diff --git a/kernel/trace/trace_sched_wakeup.c b/kernel/trace/trace_sched_wakeup.c
index 743b2b520d34..3dc90d9f605b 100644
--- a/kernel/trace/trace_sched_wakeup.c
+++ b/kernel/trace/trace_sched_wakeup.c
@@ -669,6 +669,7 @@ static bool wakeup_busy;
 
 static int __wakeup_tracer_init(struct trace_array *tr)
 {
+	latency_fsnotify_start();
 	save_flags = tr->trace_flags;
 
 	/* non overwrite screws up the latency tracers */
@@ -727,10 +728,12 @@ static void wakeup_tracer_reset(struct trace_array *tr)
 	set_tracer_flag(tr, TRACE_ITER_OVERWRITE, overwrite_flag);
 	ftrace_reset_array_ops(tr);
 	wakeup_busy = false;
+	latency_fsnotify_stop();
 }
 
 static void wakeup_tracer_start(struct trace_array *tr)
 {
+	latency_fsnotify_start();
 	wakeup_reset(tr);
 	tracer_enabled = 1;
 }
@@ -738,6 +741,7 @@ static void wakeup_tracer_start(struct trace_array *tr)
 static void wakeup_tracer_stop(struct trace_array *tr)
 {
 	tracer_enabled = 0;
+	latency_fsnotify_stop();
 }
 
 static struct tracer wakeup_tracer __read_mostly =
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v5 2/4] preemptirq_delay_test: Add the burst feature and a sysfs trigger
  2019-09-03 13:25 [PATCH v5 0/4] Some new features for the preempt/irqsoff tracers Viktor Rosendahl
  2019-09-03 13:25 ` [PATCH v5 1/4] ftrace: Implement fs notification for tracing_max_latency Viktor Rosendahl
@ 2019-09-03 13:26 ` Viktor Rosendahl
  2019-09-04 11:42   ` Steven Rostedt
  2019-09-03 13:26 ` [PATCH v5 3/4] Add the latency-collector to tools Viktor Rosendahl
  2019-09-03 13:26 ` [PATCH v5 4/4] ftrace: Add an option for tracing console latencies Viktor Rosendahl
  3 siblings, 1 reply; 19+ messages in thread
From: Viktor Rosendahl @ 2019-09-03 13:26 UTC (permalink / raw)
  To: Steven Rostedt, Ingo Molnar, linux-kernel
  Cc: Joel Fernandes, Peter Zijlstra, Viktor Rosendahl

This burst feature enables the user to generate a burst of
preempt/irqsoff latencies. This makes it possible to test whether we
are able to detect latencies that systematically occur very close to
each other.

The maximum burst size is 10. We also create 10 identical test
functions, so that we get 10 different backtraces; this is useful
when we want to test whether we can detect all the latencies in a
burst. Otherwise, there would be no easy way of telling which latency
in a burst was captured by the tracer.

In addition, there is a sysfs trigger, so that it's not necessary to
reload the module to repeat the test. The trigger will appear as
/sys/kernel/preemptirq_delay_test/trigger in sysfs.
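
For example, after loading the module, a new burst can be started at any
time by writing to the trigger file (the written value is ignored; any
write starts a new test run):

modprobe preemptirq_delay_test test_mode=alternate delay=500 burst_size=10
echo 1 > /sys/kernel/preemptirq_delay_test/trigger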

Signed-off-by: Viktor Rosendahl <viktor.rosendahl@gmail.com>
---
 kernel/trace/Kconfig                 |   6 +-
 kernel/trace/preemptirq_delay_test.c | 145 +++++++++++++++++++++++----
 2 files changed, 129 insertions(+), 22 deletions(-)

diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 98da8998c25c..5ff573d844e4 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -751,9 +751,9 @@ config PREEMPTIRQ_DELAY_TEST
 	  configurable delay. The module busy waits for the duration of the
 	  critical section.
 
-	  For example, the following invocation forces a one-time irq-disabled
-	  critical section for 500us:
-	  modprobe preemptirq_delay_test test_mode=irq delay=500000
+	  For example, the following invocation generates a burst of three
+	  irq-disabled critical sections for 500us:
+	  modprobe preemptirq_delay_test test_mode=irq delay=500 burst_size=3
 
 	  If unsure, say N
 
diff --git a/kernel/trace/preemptirq_delay_test.c b/kernel/trace/preemptirq_delay_test.c
index d8765c952fab..dc281fa75198 100644
--- a/kernel/trace/preemptirq_delay_test.c
+++ b/kernel/trace/preemptirq_delay_test.c
@@ -3,6 +3,7 @@
  * Preempt / IRQ disable delay thread to test latency tracers
  *
  * Copyright (C) 2018 Joel Fernandes (Google) <joel@joelfernandes.org>
+ * Copyright (C) 2018, 2019 BMW Car IT GmbH
  */
 
 #include <linux/trace_clock.h>
@@ -10,18 +11,25 @@
 #include <linux/interrupt.h>
 #include <linux/irq.h>
 #include <linux/kernel.h>
+#include <linux/kobject.h>
 #include <linux/kthread.h>
 #include <linux/module.h>
 #include <linux/printk.h>
 #include <linux/string.h>
+#include <linux/sysfs.h>
 
 static ulong delay = 100;
-static char test_mode[10] = "irq";
+static char test_mode[12] = "irq";
+static uint burst_size = 1;
 
-module_param_named(delay, delay, ulong, S_IRUGO);
-module_param_string(test_mode, test_mode, 10, S_IRUGO);
-MODULE_PARM_DESC(delay, "Period in microseconds (100 uS default)");
-MODULE_PARM_DESC(test_mode, "Mode of the test such as preempt or irq (default irq)");
+module_param_named(delay, delay, ulong, 0444);
+module_param_string(test_mode, test_mode, 12, 0444);
+module_param_named(burst_size, burst_size, uint, 0444);
+MODULE_PARM_DESC(delay, "Period in microseconds (100 us default)");
+MODULE_PARM_DESC(test_mode, "Mode of the test such as preempt, irq, or alternate (default irq)");
+MODULE_PARM_DESC(burst_size, "The size of a burst (default 1)");
+
+#define MIN(x, y) ((x) < (y) ? (x) : (y))
 
 static void busy_wait(ulong time)
 {
@@ -34,37 +42,136 @@ static void busy_wait(ulong time)
 	} while ((end - start) < (time * 1000));
 }
 
-static int preemptirq_delay_run(void *data)
+static __always_inline void irqoff_test(void)
 {
 	unsigned long flags;
+	local_irq_save(flags);
+	busy_wait(delay);
+	local_irq_restore(flags);
+}
 
-	if (!strcmp(test_mode, "irq")) {
-		local_irq_save(flags);
-		busy_wait(delay);
-		local_irq_restore(flags);
-	} else if (!strcmp(test_mode, "preempt")) {
-		preempt_disable();
-		busy_wait(delay);
-		preempt_enable();
+static __always_inline void preemptoff_test(void)
+{
+	preempt_disable();
+	busy_wait(delay);
+	preempt_enable();
+}
+
+static void execute_preemptirqtest(int idx)
+{
+	if (!strcmp(test_mode, "irq"))
+		irqoff_test();
+	else if (!strcmp(test_mode, "preempt"))
+		preemptoff_test();
+	else if (!strcmp(test_mode, "alternate")) {
+		if (idx % 2 == 0)
+			irqoff_test();
+		else
+			preemptoff_test();
 	}
+}
+
+#define DECLARE_TESTFN(POSTFIX)				\
+	static void preemptirqtest_##POSTFIX(int idx)	\
+	{						\
+		execute_preemptirqtest(idx);		\
+	}						\
 
+/*
+ * We create 10 different functions, so that we can get 10 different
+ * backtraces.
+ */
+DECLARE_TESTFN(0)
+DECLARE_TESTFN(1)
+DECLARE_TESTFN(2)
+DECLARE_TESTFN(3)
+DECLARE_TESTFN(4)
+DECLARE_TESTFN(5)
+DECLARE_TESTFN(6)
+DECLARE_TESTFN(7)
+DECLARE_TESTFN(8)
+DECLARE_TESTFN(9)
+
+static void (*testfuncs[])(int)  = {
+	preemptirqtest_0,
+	preemptirqtest_1,
+	preemptirqtest_2,
+	preemptirqtest_3,
+	preemptirqtest_4,
+	preemptirqtest_5,
+	preemptirqtest_6,
+	preemptirqtest_7,
+	preemptirqtest_8,
+	preemptirqtest_9,
+};
+
+#define NR_TEST_FUNCS ARRAY_SIZE(testfuncs)
+
+static int preemptirq_delay_run(void *data)
+{
+	int i;
+	int s = MIN(burst_size, NR_TEST_FUNCS);
+
+	for (i = 0; i < s; i++)
+		(testfuncs[i])(i);
 	return 0;
 }
 
-static int __init preemptirq_delay_init(void)
+static struct task_struct *preemptirq_start_test(void)
 {
 	char task_name[50];
-	struct task_struct *test_task;
 
 	snprintf(task_name, sizeof(task_name), "%s_test", test_mode);
+	return kthread_run(preemptirq_delay_run, NULL, task_name);
+}
+
+
+static ssize_t trigger_store(struct kobject *kobj, struct kobj_attribute *attr,
+			 const char *buf, size_t count)
+{
+	preemptirq_start_test();
+	return count;
+}
+
+static struct kobj_attribute trigger_attribute =
+	__ATTR(trigger, 0200, NULL, trigger_store);
+
+static struct attribute *attrs[] = {
+	&trigger_attribute.attr,
+	NULL,
+};
+
+static struct attribute_group attr_group = {
+	.attrs = attrs,
+};
+
+static struct kobject *preemptirq_delay_kobj;
+
+static int __init preemptirq_delay_init(void)
+{
+	struct task_struct *test_task;
+	int retval;
+
+	test_task = preemptirq_start_test();
+	retval = PTR_ERR_OR_ZERO(test_task);
+	if (retval != 0)
+		return retval;
+
+	preemptirq_delay_kobj = kobject_create_and_add("preemptirq_delay_test",
+						       kernel_kobj);
+	if (!preemptirq_delay_kobj)
+		return -ENOMEM;
+
+	retval = sysfs_create_group(preemptirq_delay_kobj, &attr_group);
+	if (retval)
+		kobject_put(preemptirq_delay_kobj);
 
-	test_task = kthread_run(preemptirq_delay_run, NULL, task_name);
-	return PTR_ERR_OR_ZERO(test_task);
+	return retval;
 }
 
 static void __exit preemptirq_delay_exit(void)
 {
-	return;
+	kobject_put(preemptirq_delay_kobj);
 }
 
 module_init(preemptirq_delay_init)
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v5 3/4] Add the latency-collector to tools
  2019-09-03 13:25 [PATCH v5 0/4] Some new features for the preempt/irqsoff tracers Viktor Rosendahl
  2019-09-03 13:25 ` [PATCH v5 1/4] ftrace: Implement fs notification for tracing_max_latency Viktor Rosendahl
  2019-09-03 13:26 ` [PATCH v5 2/4] preemptirq_delay_test: Add the burst feature and a sysfs trigger Viktor Rosendahl
@ 2019-09-03 13:26 ` Viktor Rosendahl
  2019-09-03 13:26 ` [PATCH v5 4/4] ftrace: Add an option for tracing console latencies Viktor Rosendahl
  3 siblings, 0 replies; 19+ messages in thread
From: Viktor Rosendahl @ 2019-09-03 13:26 UTC (permalink / raw)
  To: Steven Rostedt, Ingo Molnar, linux-kernel
  Cc: Joel Fernandes, Peter Zijlstra, Viktor Rosendahl

This is a tool that is intended to work around the fact that the
preemptoff, irqsoff, and preemptirqsoff tracers only work in
overwrite mode. The idea is to act randomly in such a way that we
do not systematically lose any latencies, so that if enough testing
is done, all latencies will be captured. If the same burst of
latencies is repeated, then sooner or later we will have captured all
the latencies.

It also works with the wakeup_dl, wakeup_rt, wakeup, and hwlat tracers.
However, in that case it is probably not useful to use the random
sleep functionality.

The reason why it may be desirable to catch all latencies with a long
test campaign is that for some organizations, it's necessary to test
the kernel in the field and not practical for developers to work
iteratively with field testers. Because of cost and project schedules
it is not possible to start a new test campaign every time a latency
problem has been fixed.

It uses inotify to detect changes to
/sys/kernel/debug/tracing/tracing_max_latency. When a latency is
detected, it will either sleep or print immediately, depending on a
function that acts as an unfair coin toss.

If immediate print is chosen, it means that we open
/sys/kernel/debug/tracing/trace and thereby cause a blackout period
that will hide any subsequent latencies.

If sleep is chosen, it means that we wait before opening
/sys/kernel/debug/tracing/trace, by default for 1000 ms, to see if
there is another latency during this period. If there is, then we will
lose the previous latency. The coin will be tossed again with a
different probability, and we will either print the new latency, or
possibly a subsequent one.

The probability for the unfair coin toss is chosen so that there
is an equal probability of obtaining any of the latencies in a burst.
However, this requires an assumption about how many latencies there
can be. By default, the program assumes that there are no more than 2
latencies in a burst, so the probabilities of immediate printout will
be:

1/2 and 1

Thus, the probability of getting each of the two latencies will be 1/2.

If we ever find that there is more than one latency in a series,
meaning that we reach the probability of 1, then the table will be
expanded to:

1/3, 1/2, and 1

Thus, we assume that there are no more than three latencies, each
with a probability of 1/3 of being captured. If the probability of 1
is reached in the new table, that is, if we see more than two closely
occurring latencies, then the table will again be extended, and so
on.
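
As a worked example of why this yields equal probabilities, take the
table 1/3, 1/2, 1 and a burst of three latencies. The chance that the
n-th latency is the one we end up printing is the chance of sleeping
through the earlier ones times the chance of printing the n-th:

P(print 1st) = 1/3
P(print 2nd) = (1 - 1/3) * 1/2           = 1/3
P(print 3rd) = (1 - 1/3) * (1 - 1/2) * 1 = 1/3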

On my systems, it seems like this scheme works fairly well, as
long as the latencies we trace are long enough; 300 us seems to be
enough. This userspace program receives the inotify event at the end
of a latency, and it has until the end of the next latency
to react, that is, to open /sys/kernel/debug/tracing/trace. Thus,
if we trace latencies that are >300 us, then we have at least 300 us
to react.

The minimum latency will of course not be 300 us on all systems, it
will depend on the hardware, kernel version, workload and
configuration.

Example output:

root@myhost:~# echo 100 > /sys/kernel/debug/tracing/tracing_thresh
root@myhost:~# echo preemptirqsoff > /sys/kernel/debug/tracing/current_tracer
root@myhost:~# latency-collector -rvv -a 11 &
[1] 4915
root@myhost:~# Running with policy rr and priority 99. Using 3 print threads.
/sys/kernel/debug/tracing/trace will be printed with random delay
Start size of the probability table:                    11
Print a message when prob. table table changes size:    true
Print a warning when an event has been lost:            true
Sleep time is:                                          1000 ms
Initializing probability table to 11
Thread 4916 runs with policy rr and priority 99
Thread 4917 runs with policy rr and priority 99
Thread 4918 runs with policy rr and priority 99
Thread 4919 runs with policy rr and priority 99

root@myhost:~# modprobe preemptirq_delay_test delay=105 test_mode=alternate burst_size=10
2850.638308 Latency 1 printout skipped due to random delay
2850.638383 Latency 2 printout skipped due to random delay
2850.638497 Latency 3 immediate print
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> BEGIN <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
alternat-4922    6d...    0us!: execute_preemptirqtest <-preemptirqtest_2
alternat-4922    6d...  106us : execute_preemptirqtest <-preemptirqtest_2
alternat-4922    6d...  106us : tracer_hardirqs_on <-preemptirqtest_2
alternat-4922    6d...  108us : <stack trace>
 => preemptirqtest_2
 => preemptirq_delay_run
 => kthread
 => ret_from_fork
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> END <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

In the example above, we happened, randomly, to get the third latency
in a burst of 10. If the experiment is repeated enough times,
we will get all 10.

Sometimes, there will be lost latencies, so that we get:

3054.078294 Latency 44 printout skipped due to inotify loop
..or:

3211.957685 Latency 112 printout skipped due to print loop

In my experience, these are not frequent enough to be a problem.

Also, we do not really lose any latency in these cases; it is rather
that we get a different latency than the one intended by the
design.

Signed-off-by: Viktor Rosendahl <viktor.rosendahl@gmail.com>
---
 tools/Makefile                  |   14 +-
 tools/trace/Makefile            |   20 +
 tools/trace/latency-collector.c | 1211 +++++++++++++++++++++++++++++++
 3 files changed, 1239 insertions(+), 6 deletions(-)
 create mode 100644 tools/trace/Makefile
 create mode 100644 tools/trace/latency-collector.c

diff --git a/tools/Makefile b/tools/Makefile
index 68defd7ecf5d..2b3738e66280 100644
--- a/tools/Makefile
+++ b/tools/Makefile
@@ -30,6 +30,7 @@ help:
 	@echo '  spi                    - spi tools'
 	@echo '  objtool                - an ELF object analysis tool'
 	@echo '  tmon                   - thermal monitoring and tuning tool'
+	@echo '  trace                  - misc tracing tools'
 	@echo '  turbostat              - Intel CPU idle stats and freq reporting tool'
 	@echo '  usb                    - USB testing tools'
 	@echo '  virtio                 - vhost test module'
@@ -63,7 +64,7 @@ acpi: FORCE
 cpupower: FORCE
 	$(call descend,power/$@)
 
-cgroup firewire hv guest spi usb virtio vm bpf iio gpio objtool leds wmi pci firmware debugging: FORCE
+cgroup firewire hv guest spi usb virtio vm bpf iio gpio objtool leds wmi pci firmware debugging trace: FORCE
 	$(call descend,$@)
 
 liblockdep: FORCE
@@ -99,7 +100,7 @@ all: acpi cgroup cpupower gpio hv firewire liblockdep \
 		perf selftests spi turbostat usb \
 		virtio vm bpf x86_energy_perf_policy \
 		tmon freefall iio objtool kvm_stat wmi \
-		pci debugging
+		pci debugging trace
 
 acpi_install:
 	$(call descend,power/$(@:_install=),install)
@@ -107,7 +108,7 @@ acpi_install:
 cpupower_install:
 	$(call descend,power/$(@:_install=),install)
 
-cgroup_install firewire_install gpio_install hv_install iio_install perf_install spi_install usb_install virtio_install vm_install bpf_install objtool_install wmi_install pci_install debugging_install:
+cgroup_install firewire_install gpio_install hv_install iio_install perf_install spi_install usb_install virtio_install vm_install bpf_install objtool_install wmi_install pci_install debugging_install trace_install:
 	$(call descend,$(@:_install=),install)
 
 liblockdep_install:
@@ -133,7 +134,8 @@ install: acpi_install cgroup_install cpupower_install gpio_install \
 		perf_install selftests_install turbostat_install usb_install \
 		virtio_install vm_install bpf_install x86_energy_perf_policy_install \
 		tmon_install freefall_install objtool_install kvm_stat_install \
-		wmi_install pci_install debugging_install intel-speed-select_install
+		wmi_install pci_install debugging_install intel-speed-select_install \
+		trace_install
 
 acpi_clean:
 	$(call descend,power/acpi,clean)
@@ -141,7 +143,7 @@ acpi_clean:
 cpupower_clean:
 	$(call descend,power/cpupower,clean)
 
-cgroup_clean hv_clean firewire_clean spi_clean usb_clean virtio_clean vm_clean wmi_clean bpf_clean iio_clean gpio_clean objtool_clean leds_clean pci_clean firmware_clean debugging_clean:
+cgroup_clean hv_clean firewire_clean spi_clean usb_clean virtio_clean vm_clean wmi_clean bpf_clean iio_clean gpio_clean objtool_clean leds_clean pci_clean firmware_clean debugging_clean trace_clean:
 	$(call descend,$(@:_clean=),clean)
 
 liblockdep_clean:
@@ -180,6 +182,6 @@ clean: acpi_clean cgroup_clean cpupower_clean hv_clean firewire_clean \
 		vm_clean bpf_clean iio_clean x86_energy_perf_policy_clean tmon_clean \
 		freefall_clean build_clean libbpf_clean libsubcmd_clean liblockdep_clean \
 		gpio_clean objtool_clean leds_clean wmi_clean pci_clean firmware_clean debugging_clean \
-		intel-speed-select_clean
+		intel-speed-select_clean trace_clean
 
 .PHONY: FORCE
diff --git a/tools/trace/Makefile b/tools/trace/Makefile
new file mode 100644
index 000000000000..59cd483ab01f
--- /dev/null
+++ b/tools/trace/Makefile
@@ -0,0 +1,20 @@
+# SPDX-License-Identifier: GPL-2.0
+# Makefile for trace tools
+#
+TARGETS = latency-collector
+CFLAGS = -Wall -Wextra -g -O2
+LDFLAGS = -lpthread
+
+all: $(TARGETS)
+
+%: %.c
+	$(CC) $(CFLAGS) -o $@ $< $(LDFLAGS)
+
+clean:
+	$(RM) latency-collector
+
+sbindir ?= /usr/sbin
+
+install: all
+	install -d $(DESTDIR)$(sbindir)
+	install -m 755 -p $(TARGETS) $(DESTDIR)$(sbindir)
diff --git a/tools/trace/latency-collector.c b/tools/trace/latency-collector.c
new file mode 100644
index 000000000000..8cf3c74d998f
--- /dev/null
+++ b/tools/trace/latency-collector.c
@@ -0,0 +1,1211 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2017, 2018, 2019 BMW Car IT GmbH
+ */
+
+#define _GNU_SOURCE
+#define _POSIX_C_SOURCE 200809L
+
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include <err.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <getopt.h>
+#include <sched.h>
+#include <linux/unistd.h>
+#include <sys/inotify.h>
+#include <unistd.h>
+#include <pthread.h>
+
+static const char *prg_name;
+static const char *prg_unknown = "unknown program name";
+
+static int fd_stdout;
+
+/* These are default values */
+static int sched_policy;
+static bool sched_policy_set;
+
+static int sched_pri;
+static bool sched_pri_set;
+
+static bool trace_enable = true;
+static bool use_random_sleep;
+
+static char inotify_buffer[655360];
+
+#define likely(x)      __builtin_expect(!!(x), 1)
+#define unlikely(x)    __builtin_expect(!!(x), 0)
+#define bool2str(x)    (x ? "true":"false")
+
+#define DEFAULT_NR_PRINTER_THREADS (3)
+static unsigned int nr_threads = DEFAULT_NR_PRINTER_THREADS;
+
+#define DEFAULT_TABLE_SIZE (2)
+static unsigned int table_startsize = DEFAULT_TABLE_SIZE;
+
+static int verbosity;
+
+#define verbose_sizechange() (verbosity >= 1)
+#define verbose_lostevent()  (verbosity >= 2)
+
+static const char *debug_tracefile;
+static const char *debug_tracefile_dflt = "/sys/kernel/debug/tracing/trace";
+static const char *debug_maxlat_file;
+static const char *debug_maxlat_dflt =
+	"/sys/kernel/debug/tracing/tracing_max_latency";
+
+
+#define DEV_URANDOM     "/dev/urandom"
+#define RT_DEFAULT_PRI (99)
+#define DEFAULT_PRI    (0)
+
+#define USEC_PER_MSEC (1000L)
+#define NSEC_PER_USEC (1000L)
+#define NSEC_PER_MSEC (USEC_PER_MSEC * NSEC_PER_USEC)
+
+#define MSEC_PER_SEC (1000L)
+#define USEC_PER_SEC (USEC_PER_MSEC * MSEC_PER_SEC)
+#define NSEC_PER_SEC (NSEC_PER_MSEC * MSEC_PER_SEC)
+
+#define SLEEP_TIME_MS_DEFAULT (1000L)
+
+static long sleep_time = (USEC_PER_MSEC * SLEEP_TIME_MS_DEFAULT);
+
+static char queue_full_warning[] =
+"Could not queue trace for printing. It is likely that events happen faster\n"
+"than what they can be printed. Probably partly because of random sleeping\n";
+
+
+struct policy {
+	const char *name;
+	int policy;
+	int default_pri;
+};
+
+static const struct policy policies[] = {
+	{ "other", SCHED_OTHER, DEFAULT_PRI    },
+	{ "batch", SCHED_BATCH, DEFAULT_PRI    },
+	{ "idle",  SCHED_IDLE,  DEFAULT_PRI    },
+	{ "rr",    SCHED_RR,    RT_DEFAULT_PRI },
+	{ "fifo",  SCHED_FIFO,  RT_DEFAULT_PRI },
+	{ NULL,    0,           DEFAULT_PRI    }
+};
+
+struct entry {
+	int ticket;
+	int ticket_completed_ref;
+};
+
+struct print_state {
+	int ticket_counter;
+	int ticket_completed;
+	pthread_mutex_t mutex;
+	pthread_cond_t cond;
+	int cnt;
+	pthread_mutex_t cnt_mutex;
+};
+
+struct short_msg {
+	char buf[160];
+	int len;
+};
+
+static struct print_state printstate;
+
+#define PROB_TABLE_MAX_SIZE (1000)
+
+int probabilities[PROB_TABLE_MAX_SIZE];
+
+struct sleep_table {
+	int *table;
+	int size;
+	pthread_mutex_t mutex;
+};
+
+static struct sleep_table sleeptable;
+
+#define QUEUE_SIZE (10)
+
+struct queue {
+	struct entry entries[QUEUE_SIZE];
+	int next_prod_idx;
+	int next_cons_idx;
+	pthread_mutex_t mutex;
+	pthread_cond_t cond;
+};
+
+#define MAX_THREADS (40)
+
+struct queue printqueue;
+pthread_t printthread[MAX_THREADS];
+pthread_mutex_t print_mtx = PTHREAD_MUTEX_INITIALIZER;
+#define PRINT_BUFFER_SIZE (16 * 1024 * 1024)
+
+static __always_inline void *malloc_or_die(size_t size)
+{
+	void *ptr = malloc(size);
+
+	if (ptr == NULL)
+		err(0, "malloc() failed");
+	return ptr;
+}
+
+static __always_inline void write_or_die(int fd, const char *buf, size_t count)
+{
+	ssize_t r;
+
+	do {
+		r = write(fd, buf, count);
+		if (r < 0) {
+			if (errno == EINTR)
+				continue;
+			err(0, "write() failed");
+		}
+		count -= r;
+		buf += r;
+	} while (count > 0);
+}
+
+static __always_inline void clock_gettime_or_die(clockid_t clk_id,
+						 struct timespec *tp)
+{
+	int r = clock_gettime(clk_id, tp);
+
+	if (r != 0)
+		err(0, "clock_gettime() failed");
+}
+
+static void open_stdout(void)
+{
+	if (setvbuf(stdout, NULL, _IONBF, 0) != 0)
+		err(0, "setvbuf() failed");
+	fd_stdout = fileno(stdout);
+	if (fd_stdout < 0)
+		err(0, "fileno() failed");
+}
+
+static __always_inline void get_time_in_future(struct timespec *future,
+					       long time_us)
+{
+	long nsec;
+
+	clock_gettime_or_die(CLOCK_MONOTONIC, future);
+	future->tv_sec += time_us / USEC_PER_SEC;
+	nsec = future->tv_nsec + (time_us * NSEC_PER_USEC) % NSEC_PER_SEC;
+	if (nsec >= NSEC_PER_SEC) {
+		future->tv_nsec = nsec % NSEC_PER_SEC;
+		future->tv_sec += 1;
+	}
+}
+
+static __always_inline bool time_has_passed(const struct timespec *time)
+{
+	struct timespec now;
+
+	clock_gettime_or_die(CLOCK_MONOTONIC, &now);
+	if (now.tv_sec > time->tv_sec)
+		return true;
+	if (now.tv_sec < time->tv_sec)
+		return false;
+	return (now.tv_nsec >= time->tv_nsec);
+}
+
+static void init_printstate(void)
+{
+	pthread_condattr_t attr;
+
+	printstate.ticket_counter = 0;
+	printstate.ticket_completed = 0;
+	printstate.cnt = 0;
+
+	if (pthread_mutex_init(&printstate.mutex, NULL) != 0)
+		err(0, "pthread_mutex_init() failed");
+
+	if (pthread_mutex_init(&printstate.cnt_mutex, NULL) != 0)
+		err(0, "pthread_mutex_init() failed");
+
+	if (pthread_condattr_init(&attr) != 0)
+		err(0, "pthread_condattr_init()");
+
+	if (pthread_condattr_setclock(&attr, CLOCK_MONOTONIC) != 0)
+		err(0, "pthread_condattr_setclock()");
+
+	if (pthread_cond_init(&printstate.cond, &attr) != 0)
+		err(0, "pthread_cond_init() failed");
+}
+
+static __always_inline void mutex_lock(pthread_mutex_t *mtx)
+{
+	if (pthread_mutex_lock(mtx) != 0)
+		err(0, "pthread_mutex_lock() failed");
+}
+
+static __always_inline void mutex_unlock(pthread_mutex_t *mtx)
+{
+	if (pthread_mutex_unlock(mtx) != 0)
+		err(0, "pthread_mutex_unlock() failed");
+}
+
+static __always_inline void cond_signal(pthread_cond_t *cond)
+{
+	if (pthread_cond_signal(cond) != 0)
+		err(0, "pthread_cond_signal() failed");
+}
+
+static __always_inline void cond_broadcast(pthread_cond_t *cond)
+{
+	if (pthread_cond_broadcast(cond) != 0)
+		err(0, "pthread_cond_broadcast() failed");
+}
+
+static int printstate_next_ticket(struct entry *req)
+{
+	int r;
+
+	r = ++(printstate.ticket_counter);
+	req->ticket = r;
+	req->ticket_completed_ref = printstate.ticket_completed;
+	cond_broadcast(&printstate.cond);
+	return r;
+}
+
+static __always_inline
+void printstate_mark_req_completed(const struct entry *req)
+{
+	if (req->ticket > printstate.ticket_completed)
+		printstate.ticket_completed = req->ticket;
+}
+
+static __always_inline
+bool printstate_has_new_req_arrived(const struct entry *req)
+{
+	return (printstate.ticket_counter != req->ticket);
+}
+
+static __always_inline int printstate_cnt_inc(void)
+{
+	int value;
+
+	mutex_lock(&printstate.cnt_mutex);
+	value = ++printstate.cnt;
+	mutex_unlock(&printstate.cnt_mutex);
+	return value;
+}
+
+static __always_inline int printstate_cnt_dec(void)
+{
+	int value;
+
+	mutex_lock(&printstate.cnt_mutex);
+	value = --printstate.cnt;
+	mutex_unlock(&printstate.cnt_mutex);
+	return value;
+}
+
+static __always_inline int printstate_cnt_read(void)
+{
+	int value;
+
+	mutex_lock(&printstate.cnt_mutex);
+	value = printstate.cnt;
+	mutex_unlock(&printstate.cnt_mutex);
+	return value;
+}
+
+static __always_inline
+bool prev_req_won_race(const struct entry *req)
+{
+	return (printstate.ticket_completed != req->ticket_completed_ref);
+}
+
+static void sleeptable_resize(int size, bool printout, struct short_msg *msg)
+{
+	int bytes;
+
+	if (printout) {
+		msg->len = 0;
+		if (unlikely(size > PROB_TABLE_MAX_SIZE))
+			bytes = snprintf(msg->buf, sizeof(msg->buf),
+"Cannot increase probability table to %d (maximum size reached)\n", size);
+		else
+			bytes = snprintf(msg->buf, sizeof(msg->buf),
+"Increasing probability table to %d\n", size);
+		if (bytes < 0)
+			warn("snprintf() failed");
+		else
+			msg->len = bytes;
+	}
+
+	if (unlikely(size < 0)) {
+		/* Should never happen */
+		errx(0, "Bad program state at %s:%d", __FILE__, __LINE__);
+		return;
+	}
+	sleeptable.size = size;
+	sleeptable.table = &probabilities[PROB_TABLE_MAX_SIZE - size];
+}
+
+static void init_probabilities(void)
+{
+	int i;
+	int j = 1000;
+
+	for (i = 0; i < PROB_TABLE_MAX_SIZE; i++) {
+		probabilities[i] = 1000 / j;
+		j--;
+	}
+	if (pthread_mutex_init(&sleeptable.mutex, NULL) != 0)
+		err(0, "pthread_mutex_init() failed");
+}
+
+static int table_get_probability(const struct entry *req,
+				 struct short_msg *msg)
+{
+	int diff = req->ticket - req->ticket_completed_ref;
+	int rval = 0;
+
+	msg->len = 0;
+	diff--;
+	/* Should never happen...*/
+	if (diff < 0)
+		errx(0, "Programmer assumption error at %s:%d\n", __FILE__,
+		     __LINE__);
+	mutex_lock(&sleeptable.mutex);
+	if (diff >= (sleeptable.size - 1)) {
+		rval = sleeptable.table[sleeptable.size - 1];
+		sleeptable_resize(sleeptable.size + 1, verbose_sizechange(),
+				  msg);
+	} else {
+		rval = sleeptable.table[diff];
+	}
+	mutex_unlock(&sleeptable.mutex);
+	return rval;
+}
+
+static void init_queue(struct queue *q)
+{
+	q->next_prod_idx = 0;
+	q->next_cons_idx = 0;
+	if (pthread_mutex_init(&q->mutex, NULL) != 0)
+		err(0, "pthread_mutex_init() failed");
+	if (pthread_cond_init(&q->cond, NULL) != 0)
+		err(0, "pthread_cond_init() failed");
+}
+
+static __always_inline int queue_len(const struct queue *q)
+{
+	if (q->next_prod_idx >= q->next_cons_idx)
+		return q->next_prod_idx - q->next_cons_idx;
+	else
+		return QUEUE_SIZE - q->next_cons_idx + q->next_prod_idx;
+}
+
+static __always_inline int queue_nr_free(const struct queue *q)
+{
+	int nr_free = QUEUE_SIZE - queue_len(q);
+
+	/*
+	 * If there is only one slot left we will anyway lie and claim that the
+	 * queue is full because adding an element will make it appear empty
+	 */
+	if (nr_free == 1)
+		nr_free = 0;
+	return nr_free;
+}
+
+static __always_inline void queue_idx_inc(int *idx)
+{
+	*idx = (*idx + 1) % QUEUE_SIZE;
+}
+
+static __always_inline void queue_push_to_back(struct queue *q,
+					      const struct entry *e)
+{
+	q->entries[q->next_prod_idx] = *e;
+	queue_idx_inc(&q->next_prod_idx);
+}
+
+static __always_inline struct entry queue_pop_from_front(struct queue *q)
+{
+	struct entry e = q->entries[q->next_cons_idx];
+
+	queue_idx_inc(&q->next_cons_idx);
+	return e;
+}
+
+static __always_inline void queue_cond_signal(struct queue *q)
+{
+	if (pthread_cond_signal(&q->cond) != 0)
+		err(0, "pthread_cond_signal() failed");
+}
+
+static __always_inline void queue_cond_wait(struct queue *q)
+{
+	if (pthread_cond_wait(&q->cond, &q->mutex) != 0)
+		err(0, "pthread_cond_wait() failed");
+}
+
+
+static __always_inline int queue_try_to_add_entry(struct queue *q,
+						  const struct entry *e)
+{
+	int r = 0;
+
+	mutex_lock(&q->mutex);
+	if (queue_nr_free(q) > 0) {
+		queue_push_to_back(q, e);
+		cond_signal(&q->cond);
+	} else
+		r = -1;
+	mutex_unlock(&q->mutex);
+	return r;
+}
+
+static struct entry queue_wait_for_entry(struct queue *q)
+{
+	struct entry e;
+
+	mutex_lock(&q->mutex);
+	while (true) {
+		if (queue_len(&printqueue) > 0) {
+			e = queue_pop_from_front(q);
+			break;
+		}
+		queue_cond_wait(q);
+	}
+	mutex_unlock(&q->mutex);
+
+	return e;
+}
+
+static const struct policy *policy_from_name(const char *name)
+{
+	const struct policy *p = &policies[0];
+
+	while (p->name != NULL) {
+		if (!strcmp(name, p->name))
+			return p;
+		p++;
+	}
+	return NULL;
+}
+
+static const char *policy_name(int policy)
+{
+	const struct policy *p = &policies[0];
+	static const char *rval = "unknown";
+
+	while (p->name != NULL) {
+		if (p->policy == policy)
+			return p->name;
+		p++;
+	}
+	return rval;
+}
+
+static bool toss_coin(struct drand48_data *buffer, unsigned int prob)
+{
+	long r;
+
+	if (lrand48_r(buffer, &r) != 0)
+		err(0, "lrand48_r() failed");
+	r = r % 1000L;
+	if (r < prob)
+		return true;
+	else
+		return false;
+}
+
+
+static long go_to_sleep(const struct entry *req)
+{
+	struct timespec future;
+	long delay = sleep_time;
+
+	get_time_in_future(&future, delay);
+
+	mutex_lock(&printstate.mutex);
+	while (!printstate_has_new_req_arrived(req)) {
+		if (pthread_cond_timedwait(&printstate.cond, &printstate.mutex,
+					   &future) != 0) {
+			if (errno != ETIMEDOUT && errno != 0)
+				err(0, "pthread_cond_timedwait() %d", errno);
+		}
+		if (time_has_passed(&future))
+			break;
+	};
+
+	if (printstate_has_new_req_arrived(req))
+		delay = -1;
+	mutex_unlock(&printstate.mutex);
+
+	return delay;
+}
+
+
+static void set_priority(void)
+{
+	int r;
+	pid_t pid;
+	struct sched_param param;
+
+	memset(&param, 0, sizeof(param));
+	param.sched_priority = sched_pri;
+
+	pid = getpid();
+	r = sched_setscheduler(pid, sched_policy, &param);
+
+	if (r != 0)
+		err(0, "sched_setscheduler() failed");
+}
+
+pid_t latency_collector_gettid(void)
+{
+	return (pid_t) syscall(__NR_gettid);
+}
+
+static void print_priority(void)
+{
+	pid_t tid;
+	int policy;
+	int r;
+	struct sched_param param;
+
+	tid = latency_collector_gettid();
+	r = pthread_getschedparam(pthread_self(), &policy, &param);
+	if (r != 0)
+		err(0, "pthread_getschedparam() failed");
+	mutex_lock(&print_mtx);
+	printf("Thread %d runs with scheduling policy %s and priority %d\n",
+	       tid, policy_name(policy), param.sched_priority);
+	mutex_unlock(&print_mtx);
+}
+
+static __always_inline
+void __print_skipmessage(const struct short_msg *resize_msg,
+			 const struct timespec *timestamp, char *buffer,
+			 size_t bufspace, const struct entry *req, bool excuse,
+			 const char *str)
+{
+	ssize_t bytes = 0;
+	char *p = &buffer[0];
+	long us, sec;
+	int r;
+
+	sec = timestamp->tv_sec;
+	us = timestamp->tv_nsec / 1000;
+
+	if (resize_msg != NULL && resize_msg->len > 0) {
+		strncpy(p, resize_msg->buf, resize_msg->len);
+		bytes += resize_msg->len;
+		p += resize_msg->len;
+		bufspace -= resize_msg->len;
+	}
+
+	if (excuse)
+		r = snprintf(p, bufspace,
+"%ld.%06ld Latency %d printout skipped due to %s\n",
+			     sec, us, req->ticket, str);
+	else
+		r = snprintf(p, bufspace, "%ld.%06ld Latency %d detected\n",
+			    sec, us, req->ticket);
+
+	if (r < 0)
+		warn("snprintf() failed");
+	else
+		bytes += r;
+
+	/* These prints could happen concurrently */
+	mutex_lock(&print_mtx);
+	write_or_die(fd_stdout, buffer, bytes);
+	mutex_unlock(&print_mtx);
+}
+
+static void print_skipmessage(const struct short_msg *resize_msg,
+			      const struct timespec *timestamp, char *buffer,
+			      size_t bufspace, const struct entry *req,
+			      bool excuse)
+{
+	__print_skipmessage(resize_msg, timestamp, buffer, bufspace, req,
+			    excuse, "random delay");
+}
+
+static void print_lostmessage(const struct timespec *timestamp, char *buffer,
+			      size_t bufspace, const struct entry *req,
+			      const char *reason)
+{
+	__print_skipmessage(NULL, timestamp, buffer, bufspace, req, true,
+			    reason);
+}
+
+static void print_tracefile(const struct short_msg *resize_msg,
+			    const struct timespec *timestamp, char *buffer,
+			    size_t bufspace, long slept,
+			    const struct entry *req)
+{
+	static const int reserve = 256;
+	char *p = &buffer[0];
+	ssize_t bytes = 0;
+	ssize_t bytes_tot = 0;
+	long us, sec;
+	long slept_ms;
+	int trace_fd;
+
+	/* Save some space for the final string and final null char */
+	bufspace = bufspace - reserve - 1;
+
+	if (resize_msg != NULL && resize_msg->len > 0) {
+		bytes = resize_msg->len;
+		strncpy(p, resize_msg->buf, bytes);
+		bytes_tot += bytes;
+		p += bytes;
+		bufspace -= bytes;
+	}
+
+	trace_fd = open(debug_tracefile, O_RDONLY);
+
+	if (trace_fd < 0) {
+		warn("open() failed on %s", debug_tracefile);
+		return;
+	}
+
+	sec = timestamp->tv_sec;
+	us = timestamp->tv_nsec / 1000;
+
+	if (slept != 0) {
+		slept_ms = slept / 1000;
+		bytes = snprintf(p, bufspace,
+"%ld.%06ld Latency %d randomly sleep for %ld ms before print\n",
+				 sec, us, req->ticket, slept_ms);
+	} else {
+		bytes = snprintf(p, bufspace,
+				 "%ld.%06ld Latency %d immediate print\n", sec,
+				 us, req->ticket);
+	}
+
+	if (bytes < 0) {
+		warn("snprintf() failed");
+		return;
+	}
+	p += bytes;
+	bufspace -= bytes;
+	bytes_tot += bytes;
+
+	bytes = snprintf(p, bufspace,
+">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> BEGIN <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<\n"
+		);
+
+	if (bytes < 0) {
+		warn("snprintf() failed");
+		return;
+	}
+
+	p += bytes;
+	bufspace -= bytes;
+	bytes_tot += bytes;
+
+	do {
+		bytes = read(trace_fd, p, bufspace);
+		if (bytes < 0) {
+			if (errno == EINTR)
+				continue;
+			warn("read() failed on %s", debug_tracefile);
+			if (close(trace_fd) != 0)
+				warn("close() failed on %s", debug_tracefile);
+			return;
+		}
+		if (bytes == 0)
+			break;
+		p += bytes;
+		bufspace -= bytes;
+		bytes_tot += bytes;
+	} while (true);
+
+	if (close(trace_fd) != 0)
+		warn("close() failed on %s", debug_tracefile);
+
+	printstate_cnt_dec();
+	/* Add the reserve space back to the budget for the final string */
+	bufspace += reserve;
+
+	bytes = snprintf(p, bufspace,
+			 ">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> END <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<\n\n");
+
+	if (bytes < 0) {
+		warn("snprintf() failed");
+		return;
+	}
+
+	bytes_tot += bytes;
+
+	/* These prints could happen concurrently */
+	mutex_lock(&print_mtx);
+	write_or_die(fd_stdout, buffer, bytes_tot);
+	mutex_unlock(&print_mtx);
+}
+
+static void do_jittertest(void)
+{
+	int ifd = inotify_init();
+	int wd;
+	const ssize_t bufsize = sizeof(inotify_buffer);
+	const ssize_t istructsize = sizeof(struct inotify_event);
+	char *buf = &inotify_buffer[0];
+	ssize_t nr_read;
+	char *p;
+	int modified;
+	struct inotify_event *event;
+	struct entry req;
+	char *buffer;
+	const size_t bufspace = PRINT_BUFFER_SIZE;
+	struct timespec timestamp;
+
+	print_priority();
+
+	buffer = malloc_or_die(bufspace);
+
+	if (ifd < 0)
+		err(0, "inotify_init() failed!");
+
+	wd = inotify_add_watch(ifd, debug_maxlat_file, IN_MODIFY);
+	if (wd < 0)
+		err(0, "inotify_add_watch() failed!");
+
+	while (true) {
+		modified = 0;
+		nr_read = read(ifd, buf, bufsize);
+		if (nr_read <= 0)
+			err(0, "read() failed on inotify fd!");
+		if (nr_read == bufsize)
+			warnx("inotify() buffer filled, skipping events");
+		if (nr_read < istructsize)
+			errx(0, "read() returned too few bytes on inotify fd");
+
+		for (p = buf; p < buf + nr_read;) {
+			event = (struct inotify_event *) p;
+			if ((event->mask & IN_MODIFY) != 0)
+				modified++;
+			p += istructsize + event->len;
+		}
+		while (modified > 0) {
+			mutex_lock(&printstate.mutex);
+			printstate_next_ticket(&req);
+			if (printstate_cnt_read() > 0) {
+				printstate_mark_req_completed(&req);
+				mutex_unlock(&printstate.mutex);
+				if (verbose_lostevent()) {
+					clock_gettime_or_die(CLOCK_MONOTONIC,
+							     &timestamp);
+					print_lostmessage(&timestamp, buffer,
+							  bufspace, &req,
+							  "inotify loop");
+				}
+				break;
+			}
+			mutex_unlock(&printstate.mutex);
+			if (queue_try_to_add_entry(&printqueue, &req) != 0) {
+				/* These prints could happen concurrently */
+				mutex_lock(&print_mtx);
+				write_or_die(fd_stdout, queue_full_warning,
+					     sizeof(queue_full_warning));
+				mutex_unlock(&print_mtx);
+			}
+			modified--;
+		}
+	}
+}
+
+static void *do_printloop(void *arg)
+{
+	const size_t bufspace = PRINT_BUFFER_SIZE;
+	char *buffer;
+	long *rseed = (long *) arg;
+	struct drand48_data drandbuf;
+	long slept = 0;
+	struct entry req;
+	int prob = 0;
+	struct timespec timestamp;
+	struct short_msg resize_msg;
+
+	print_priority();
+
+	if (srand48_r(*rseed, &drandbuf) != 0)
+		err(0, "srand48_r() failed!\n");
+
+	buffer = malloc_or_die(bufspace);
+
+	while (true) {
+		req = queue_wait_for_entry(&printqueue);
+		clock_gettime_or_die(CLOCK_MONOTONIC, &timestamp);
+		mutex_lock(&printstate.mutex);
+		if (prev_req_won_race(&req)) {
+			printstate_mark_req_completed(&req);
+			mutex_unlock(&printstate.mutex);
+			if (verbose_lostevent())
+				print_lostmessage(&timestamp, buffer, bufspace,
+						  &req, "print loop");
+			continue;
+		}
+		mutex_unlock(&printstate.mutex);
+
+		/*
+		 * Toss a coin to decide if we want to sleep a random amount
+		 * before printing out the backtrace. The reason for this is
+		 * that opening /sys/kernel/debug/tracing/trace will cause a
+		 * blackout of about 430 ms, where no latencies will be noted
+		 * by the latency tracer. Thus by randomly sleeping a random
+		 * amount we try to avoid missing traces systematically due to
+		 * this. With this option we will sometimes get the first
+		 * latency, some other times some of the later ones, in case of
+		 * closely spaced traces.
+		 */
+		if (trace_enable && use_random_sleep) {
+			slept = 0;
+			prob = table_get_probability(&req, &resize_msg);
+			if (!toss_coin(&drandbuf, prob))
+				slept = go_to_sleep(&req);
+			if (slept >= 0) {
+				/* A print is ongoing */
+				printstate_cnt_inc();
+				/*
+				 * We will do the printout below so we have to
+				 * mark it as completed while we still have the
+				 * mutex.
+				 */
+				mutex_lock(&printstate.mutex);
+				printstate_mark_req_completed(&req);
+				mutex_unlock(&printstate.mutex);
+			}
+		}
+		if (trace_enable) {
+			/*
+			 * slept < 0  means that we detected another
+			 * notification in go_to_sleep() above
+			 */
+			if (slept >= 0)
+				/*
+				 * N.B. printstate_cnt_dec(); will be called
+				 * inside print_tracefile()
+				 */
+				print_tracefile(&resize_msg, &timestamp, buffer,
+						bufspace, slept, &req);
+			else
+				print_skipmessage(&resize_msg, &timestamp,
+						  buffer, bufspace, &req, true);
+		} else {
+			print_skipmessage(&resize_msg, &timestamp, buffer,
+					  bufspace, &req, false);
+		}
+	}
+	return NULL;
+}
+
+static void start_printthread(void)
+{
+	unsigned int i;
+	long *seed;
+	int ufd;
+
+	ufd = open(DEV_URANDOM, O_RDONLY);
+	if (nr_threads > MAX_THREADS) {
+		warnx(
+"Number of requested print threads was %d, max number is %d\n",
+		      nr_threads, MAX_THREADS);
+		nr_threads = MAX_THREADS;
+	}
+	for (i = 0; i < nr_threads; i++) {
+		seed = malloc_or_die(sizeof(*seed));
+		if (ufd <  0 ||
+		    read(ufd, seed, sizeof(*seed)) != sizeof(*seed)) {
+			printf(
+"Warning! Using trivial random nummer seed, since %s not available\n",
+			DEV_URANDOM);
+			fflush(stdout);
+			*seed = i;
+		}
+		if (pthread_create(&printthread[i], NULL, do_printloop, seed)
+		    != 0)
+			err(0, "pthread_create()");
+	}
+	if (ufd > 0)
+		close(ufd);
+}
+
+static void show_usage(void)
+{
+	printf(
+"Usage: %s [OPTION]...\n\n"
+"Collect closely occurring latencies from %s\n"
+"when any of the following tracers are enabled: preemptirqsoff, preemptoff,\n"
+"irqsoff, wakeup_dl, wakeup_rt, wakeup, or hwlat. A tracer can be enabled by\n"
+"doing something like this:\n\n"
+
+"echo 1000 > /sys/kernel/debug/tracing/tracing_thresh\n"
+"echo preemptirqsoff > /sys/kernel/debug/tracing/current_tracer\n\n"
+
+"The occurrence of a latency is detected by monitoring the file\n"
+"%s with inotify.\n\n"
+
+"The following options are supported:\n"
+"-c, --policy POL\tRun the program with scheduling policy POL. POL can be\n"
+"\t\t\tother, batch, idle, rr or fifo. The default is rr. When\n"
+"\t\t\tusing rr or fifo, remember that these policies may cause\n"
+"\t\t\tother tasks to experience latencies.\n\n"
+
+"-p, --priority PRI\tRun the program with priority PRI. The acceptable range\n"
+"\t\t\tof PRI depends on the scheduling policy.\n\n"
+
+"-n, --notrace\t\tIf latency is detected, do not print out the content of\n"
+"\t\t\tthe trace file to standard output\n\n"
+
+"-t, --threads NRTHR\tRun NRTHR threads for printing. Default is %d.\n\n"
+
+"-r, --random\t\tArbitrarily sleep a certain amount of time, default\n"
+"\t\t\t%ld ms, before reading the trace file. The\n"
+"\t\t\tprobabilities for sleep are chosen so that the\n"
+"\t\t\tprobability of obtaining any of a cluster of closely\n"
+"\t\t\toccurring latencies are equal, i.e. we will randomly\n"
+"\t\t\tchoose which one we collect from the trace file.\n\n"
+"\t\t\tThis option is probably only useful with the irqsoff,\n"
+"\t\t\tpreemptoff, and preemptirqsoff tracers.\n\n"
+
+"-a, --nrlat NRLAT\tFor the purpose of arbitrary delay, assume that there\n"
+"\t\t\tare no more than NRLAT clustered latencies. If NRLAT\n"
+"\t\t\tlatencies are detected during a run, this value will\n"
+"\t\t\tautomatically be increased to NRLAT + 1 and then to\n"
+"\t\t\tNRLAT + 2 and so on. The default is %d. This option\n"
+"\t\t\timplies -r. We need to know this number in order to\n"
+"\t\t\tbe able to calculate the probabilities of sleeping.\n"
+"\t\t\tSpecifically, the probabilities of not sleeping, i.e. to\n"
+"\t\t\tdo an immediate printout will be:\n\n"
+"\t\t\t1/NRLAT  1/(NRLAT - 1) ... 1/3  1/2  1\n\n"
+"\t\t\tThe probability of sleeping will be:\n\n"
+"\t\t\t1 - P, where P is from the series above\n\n"
+"\t\t\tThis descending probability will cause us to choose\n"
+"\t\t\tan occurrence at random. Observe that the final\n"
+"\t\t\tprobability is 0, it is when we reach this probability\n"
+"\t\t\tthat we increase NRLAT automatically. As an example,\n"
+"\t\t\twith the default value of 2, the probabilities will be:\n\n"
+"\t\t\t1/2  0\n\n"
+"\t\t\tThis means, when a latency is detected we will sleep\n"
+"\t\t\twith 50%% probability. If we ever detect another latency\n"
+"\t\t\tduring the sleep period, then the probability of sleep\n"
+"\t\t\twill be 0%% and the table will be expanded to:\n\n"
+"\t\t\t1/3  1/2  0\n\n"
+
+"-v, --verbose\t\tIncrease the verbosity. If this option is given once,\n"
+"\t\t\tthen print a message every time that the NRLAT value\n"
+"\t\t\tis automatically increased. If this option is given at\n"
+"\t\t\tleast twice, then also print a warning for lost events.\n\n"
+
+"-u, --time TIME\t\tArbitrarily sleep for a specified time TIME ms before\n"
+"\t\t\tprinting out the trace from the trace file. The default\n"
+"\t\t\tis %ld ms. This option implies -r.\n\n"
+
+"-f, --tracefile FILE\tUse FILE as trace file. The default is\n"
+"\t\t\t%s.\n\n"
+
+"-m, --max-lat FILE\tUse FILE as tracing_max_latency file. The default is\n"
+"\t\t\t%s\n\n"
+,
+prg_name, debug_tracefile_dflt, debug_maxlat_dflt, DEFAULT_NR_PRINTER_THREADS,
+SLEEP_TIME_MS_DEFAULT, DEFAULT_TABLE_SIZE, SLEEP_TIME_MS_DEFAULT,
+debug_tracefile_dflt, debug_maxlat_dflt);
+}
+
+static void scan_arguments(int argc, char *argv[])
+{
+	int c;
+	int option_idx = 0;
+
+	debug_tracefile = debug_tracefile_dflt;
+	debug_maxlat_file = debug_maxlat_dflt;
+
+	static struct option long_options[] = {
+		{ "policy",	required_argument,	0, 'c' },
+		{ "priority",	required_argument,	0, 'p' },
+		{ "help",	no_argument,		0, 'h' },
+		{ "notrace",	no_argument,		0, 'n' },
+		{ "random",	no_argument,		0, 'r' },
+		{ "nrlat",	required_argument,	0, 'a' },
+		{ "threads",	required_argument,	0, 't' },
+		{ "time",	required_argument,	0, 'u' },
+		{ "verbose",	no_argument,		0, 'v' },
+		{ "tracefile",	required_argument,	0, 'f' },
+		{ "max-lat",	required_argument,	0, 'm' },
+		{ 0,		0,			0,  0  }
+	};
+	const struct policy *p;
+	int max, min;
+	int value;
+
+	while (true) {
+		c = getopt_long(argc, argv, "c:p:hnra:t:u:vf:m:", long_options,
+				&option_idx);
+		if (c == -1)
+			break;
+
+		switch (c) {
+		case 'c':
+			p = policy_from_name(optarg);
+			if (p != NULL) {
+				sched_policy = p->policy;
+				sched_policy_set = true;
+				if (!sched_pri_set) {
+					sched_pri = p->default_pri;
+					sched_pri_set = true;
+				}
+			} else {
+				warnx("Unknown scheduling policy %s\n", optarg);
+				show_usage();
+				exit(0);
+			}
+			break;
+		case 'p':
+			sched_pri = atoi(optarg);
+			sched_pri_set = true;
+			break;
+		case 'h':
+			show_usage();
+			exit(0);
+			break;
+		case 'n':
+			trace_enable = false;
+			use_random_sleep = false;
+			break;
+		case 't':
+			value = atoi(optarg);
+			if (value > 0)
+				nr_threads = value;
+			else {
+				warnx("NRTHR must be > 0\n");
+				show_usage();
+				exit(0);
+			}
+			break;
+		case 'u':
+			value = atoi(optarg);
+			if (value < 0) {
+				warnx("TIME must be >= 0\n");
+				show_usage();
+				exit(0);
+			}
+			trace_enable = true;
+			use_random_sleep = true;
+			sleep_time = value * USEC_PER_MSEC;
+			break;
+		case 'v':
+			verbosity++;
+			break;
+		case 'r':
+			trace_enable = true;
+			use_random_sleep = true;
+			break;
+		case 'a':
+			value = atoi(optarg);
+			if (value <= 0) {
+				warnx("NRLAT must be > 0\n");
+				show_usage();
+				exit(0);
+			}
+			trace_enable = true;
+			use_random_sleep = true;
+			table_startsize = value;
+			break;
+		case 'f':
+			debug_tracefile = strdup(optarg);
+			break;
+		case 'm':
+			debug_maxlat_file = strdup(optarg);
+			break;
+		default:
+			show_usage();
+			exit(0);
+			break;
+		}
+	}
+
+	if (!sched_policy_set) {
+		sched_policy = SCHED_RR;
+		sched_policy_set = true;
+		if (!sched_pri_set) {
+			sched_pri = RT_DEFAULT_PRI;
+			sched_pri_set = true;
+		}
+	}
+
+	max = sched_get_priority_max(sched_policy);
+	min = sched_get_priority_min(sched_policy);
+
+	if (sched_pri < min) {
+		printf(
+"ATTENTION: Increasing priority to minimum, which is %d\n", min);
+		sched_pri = min;
+	}
+	if (sched_pri > max) {
+		printf(
+"ATTENTION: Reducing priority to maximum, which is %d\n", max);
+		sched_pri = max;
+	}
+}
+
+static void show_params(void)
+{
+	printf(
+		"Running with scheduling policy %s and priority %d. Using %d print threads.\n",
+		policy_name(sched_policy), sched_pri, nr_threads);
+	if (trace_enable) {
+		if (use_random_sleep) {
+			printf(
+"%s will be printed with random delay\n"
+"Start size of the probability table:\t\t\t%d\n"
+"Print a message when prob. table table changes size:\t%s\n"
+"Print a warning when an event has been lost:\t\t%s\n"
+"Sleep time is:\t\t\t\t\t\t%ld ms\n",
+debug_tracefile,
+table_startsize,
+bool2str(verbose_sizechange()),
+bool2str(verbose_lostevent()),
+sleep_time / USEC_PER_MSEC);
+		} else {
+			printf("%s will be printed immediately\n",
+			       debug_tracefile);
+		}
+	} else {
+		printf("%s will not be printed\n",
+		       debug_tracefile);
+	}
+}
+
+int main(int argc, char *argv[])
+{
+	open_stdout();
+
+	if (argc >= 1)
+		prg_name = argv[0];
+	else
+		prg_name = prg_unknown;
+
+	scan_arguments(argc, argv);
+	show_params();
+
+	init_printstate();
+	if (use_random_sleep) {
+		init_probabilities();
+		if (verbose_sizechange())
+			printf("Initializing probability table to %d\n",
+			       table_startsize);
+		sleeptable_resize(table_startsize, false, NULL);
+	}
+	set_priority();
+	init_queue(&printqueue);
+	start_printthread();
+	do_jittertest();
+
+	return 0;
+}
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v5 4/4] ftrace: Add an option for tracing console latencies
  2019-09-03 13:25 [PATCH v5 0/4] Some new features for the preempt/irqsoff tracers Viktor Rosendahl
                   ` (2 preceding siblings ...)
  2019-09-03 13:26 ` [PATCH v5 3/4] Add the latency-collector to tools Viktor Rosendahl
@ 2019-09-03 13:26 ` Viktor Rosendahl
  2019-09-04  5:02   ` kbuild test robot
  3 siblings, 1 reply; 19+ messages in thread
From: Viktor Rosendahl @ 2019-09-03 13:26 UTC (permalink / raw)
  To: Steven Rostedt, Ingo Molnar, linux-kernel
  Cc: Joel Fernandes, Peter Zijlstra, Viktor Rosendahl

This new trace option "console-latency" will enable the latency
tracers to trace the console latencies. Previously this has always been
implicitly disabled. I guess this is because they are considered
to be well known and unavoidable.

However, for some organizations it may nevertheless be desirable to
trace them. Basically, we want to be able to tell that there are
latencies in the system under test because someone has incorrectly
enabled the serial console.
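
Like the other trace options it can then be toggled at run time, e.g.
(assuming the usual debugfs mount point):

echo 1 > /sys/kernel/debug/tracing/options/console-latency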

Signed-off-by: Viktor Rosendahl <viktor.rosendahl@gmail.com>
---
 include/linux/irqflags.h     | 21 +++++++++++++++++++++
 kernel/printk/printk.c       |  6 ++++--
 kernel/trace/trace.h         |  1 +
 kernel/trace/trace_irqsoff.c | 12 ++++++++++++
 4 files changed, 38 insertions(+), 2 deletions(-)

diff --git a/include/linux/irqflags.h b/include/linux/irqflags.h
index 21619c92c377..34b0424a32dc 100644
--- a/include/linux/irqflags.h
+++ b/include/linux/irqflags.h
@@ -68,9 +68,30 @@ do {						\
 	defined(CONFIG_PREEMPT_TRACER)
  extern void stop_critical_timings(void);
  extern void start_critical_timings(void);
+ extern bool console_tracing_disabled(void);
+
+# define console_stop_critical_timings(flag)		\
+	do {						\
+		typecheck(bool, flag);			\
+		flag = console_tracing_disabled();	\
+		if (flag)				\
+			stop_critical_timings();	\
+	} while (0)
+
+# define console_start_critical_timings(flag)		 \
+	do {						 \
+		typecheck(bool, flag);			 \
+		if (flag)				 \
+			start_critical_timings();	 \
+	} while (0)
+
 #else
 # define stop_critical_timings() do { } while (0)
 # define start_critical_timings() do { } while (0)
+# define console_stop_critical_timings(flag)	\
+	do { typecheck(bool, flag); } while (0)
+# define console_start_critical_timings(flag)	\
+	do { typecheck(bool, flag); } while (0)
 #endif
 
 /*
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 1888f6a3b694..036460a7e4cd 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -2359,6 +2359,7 @@ void console_unlock(void)
 	static char ext_text[CONSOLE_EXT_LOG_MAX];
 	static char text[LOG_LINE_MAX + PREFIX_MAX];
 	unsigned long flags;
+	bool cflag;
 	bool do_cond_resched, retry;
 
 	if (console_suspended) {
@@ -2459,9 +2460,10 @@ void console_unlock(void)
 		 */
 		console_lock_spinning_enable();
 
-		stop_critical_timings();	/* don't trace print latency */
+		/* don't trace print latency if it's disabled */
+		console_stop_critical_timings(cflag);
 		call_console_drivers(ext_text, ext_len, text, len);
-		start_critical_timings();
+		console_start_critical_timings(cflag);
 
 		if (console_lock_spinning_disable_and_check()) {
 			printk_safe_exit_irqrestore(flags);
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index d9f83b2aaa71..8e4c9736a801 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -1266,6 +1266,7 @@ extern int trace_get_user(struct trace_parser *parser, const char __user *ubuf,
 		C(PRINTK_MSGONLY,	"printk-msg-only"),	\
 		C(CONTEXT_INFO,		"context-info"),   /* Print pid/cpu/time */ \
 		C(LATENCY_FMT,		"latency-format"),	\
+		C(CONSOLE_LATENCY,	"console-latency"),	\
 		C(RECORD_CMD,		"record-cmd"),		\
 		C(RECORD_TGID,		"record-tgid"),		\
 		C(OVERWRITE,		"overwrite"),		\
diff --git a/kernel/trace/trace_irqsoff.c b/kernel/trace/trace_irqsoff.c
index 29403a83a5f0..c8d17c4c1c91 100644
--- a/kernel/trace/trace_irqsoff.c
+++ b/kernel/trace/trace_irqsoff.c
@@ -456,6 +456,18 @@ void stop_critical_timings(void)
 EXPORT_SYMBOL_GPL(stop_critical_timings);
 NOKPROBE_SYMBOL(stop_critical_timings);
 
+bool console_tracing_disabled(void)
+{
+	struct trace_array *tr = irqsoff_trace;
+	int pc = preempt_count();
+
+	if (!preempt_trace(pc) && !irq_trace())
+		return false;
+
+	return !(tr->trace_flags & TRACE_ITER_CONSOLE_LATENCY);
+}
+EXPORT_SYMBOL_GPL(console_tracing_disabled);
+
 #ifdef CONFIG_FUNCTION_TRACER
 static bool function_enabled;
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH v5 1/4] ftrace: Implement fs notification for tracing_max_latency
  2019-09-03 13:25 ` [PATCH v5 1/4] ftrace: Implement fs notification for tracing_max_latency Viktor Rosendahl
@ 2019-09-04  4:00   ` Joel Fernandes
  2019-09-04  8:19     ` Peter Zijlstra
  2019-09-04 10:21     ` Paul E. McKenney
  2019-09-04  8:20   ` Peter Zijlstra
  2019-09-04 11:39   ` Steven Rostedt
  2 siblings, 2 replies; 19+ messages in thread
From: Joel Fernandes @ 2019-09-04  4:00 UTC (permalink / raw)
  To: Viktor Rosendahl, paulmck
  Cc: Steven Rostedt, Ingo Molnar, linux-kernel, Peter Zijlstra,
	linux-rt-users

[ Resending since I messed up my last email's headers! ]

On Tue, Sep 03, 2019 at 03:25:59PM +0200, Viktor Rosendahl wrote:
> This patch implements the feature that the tracing_max_latency file,
> e.g. /sys/kernel/debug/tracing/tracing_max_latency will receive
> notifications through the fsnotify framework when a new latency is
> available.
> 
> One particularly interesting use of this facility is when enabling
> threshold tracing, through /sys/kernel/debug/tracing/tracing_thresh,
> together with the preempt/irqsoff tracers. This makes it possible to
> implement a user space program that can, with equal probability,
> obtain traces of latencies that occur immediately after each other in
> spite of the fact that the preempt/irqsoff tracers operate in overwrite
> mode.

Adding Paul since RCU faces similar situations, i.e. raising softirq risks
scheduler deadlock in rcu_read_unlock_special() -- but RCU's solution is to
avoid raising the softirq and instead use irq_work.

I was wondering, if we can rename __raise_softirq_irqoff() to
raise_softirq_irqoff_no_wake() and call that from places where there is risk
of scheduler related deadlocks. Then I think this can be used from Viktor's
code.  Let us discuss - what would happen if the softirq is raised, but
ksoftirqd is not awakened for this latency notification path? Is this really
an issue considering the softirq will execute during the next interrupt exit?

thanks,

 - Joel


> This facility works with the hwlat, preempt/irqsoff, and wakeup
> tracers.
> 
> This patch also adds some unfortunate calls from __schedule() and
> do_idle(). Those calls to the latency_fsnotify_disable/enable() are
> needed because we cannot wake up the workqueue from these critical
> sections without risking a deadlock. Similar problems would also arise
> if we try to schedule a tasklet, raise a softirq, or wake up a kernel
> thread. If a notification event would happen in the forbidden sections,
> we schedule the fsnotify work as soon as we have exited them.
> 
> There was a suggestion to remove this latency_fsnotify_enable/disable()
> gunk, or at least to combine it with the start_critical_timings() and
> stop_critical_timings(). I have however not been able to come up with
> a way to do it.
> 
> It seems like it would be possible to simply replace the calls to
> latency_fsnotify_enable/disable() with calls to
> start/stop_critical_timings(). However, the main problem is that it
> would not work for the wakeup tracer. The wakeup tracer needs a
> facility that postpones the notifications, not one that prevents the
> measurements, because all its measurements take place in the middle
> of __schedule(). On the other hand, in some places, like in idle and
> the console, we need start/stop functions that prevent the
> measurements from being made.
> 
> Signed-off-by: Viktor Rosendahl <viktor.rosendahl@gmail.com>
> ---
>  include/linux/ftrace.h            |  31 +++++++++
>  kernel/sched/core.c               |   3 +
>  kernel/sched/idle.c               |   3 +
>  kernel/sched/sched.h              |   1 +
>  kernel/trace/trace.c              | 112 +++++++++++++++++++++++++++++-
>  kernel/trace/trace.h              |  22 ++++++
>  kernel/trace/trace_hwlat.c        |   4 +-
>  kernel/trace/trace_irqsoff.c      |   4 ++
>  kernel/trace/trace_sched_wakeup.c |   4 ++
>  9 files changed, 181 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
> index 8a8cb3c401b2..b4d9700ef917 100644
> --- a/include/linux/ftrace.h
> +++ b/include/linux/ftrace.h
> @@ -907,4 +907,35 @@ unsigned long arch_syscall_addr(int nr);
>  
>  #endif /* CONFIG_FTRACE_SYSCALLS */
>  
> +#if (defined(CONFIG_TRACER_MAX_TRACE) || defined(CONFIG_HWLAT_TRACER)) && \
> +	defined(CONFIG_FSNOTIFY)
> +
> +DECLARE_PER_CPU(int, latency_notify_disable);
> +DECLARE_STATIC_KEY_FALSE(latency_notify_key);
> +
> +void latency_fsnotify_process(void);
> +
> +/*
> + * Disable/enable fsnotify while in scheduler and idle code. Trying to wake
> + * anything up from there, such as calling queue_work() is prone to deadlock.
> + */
> +static inline void latency_fsnotify_disable(void)
> +{
> +	this_cpu_inc(latency_notify_disable);
> +}
> +
> +static inline void latency_fsnotify_enable(void)
> +{
> +	this_cpu_dec(latency_notify_disable);
> +	if (static_branch_unlikely(&latency_notify_key))
> +		latency_fsnotify_process();
> +}
> +
> +#else
> +
> +#define latency_fsnotify_disable() do { } while (0)
> +#define latency_fsnotify_enable()  do { } while (0)
> +
> +#endif
> +
>  #endif /* _LINUX_FTRACE_H */
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 010d578118d6..e3c1dc801073 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3198,6 +3198,7 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
>  	 */
>  
>  	rq = finish_task_switch(prev);
> +	latency_fsnotify_enable();
>  	balance_callback(rq);
>  	preempt_enable();
>  
> @@ -3820,6 +3821,7 @@ static void __sched notrace __schedule(bool preempt)
>  
>  	local_irq_disable();
>  	rcu_note_context_switch(preempt);
> +	latency_fsnotify_disable();
>  
>  	/*
>  	 * Make sure that signal_pending_state()->signal_pending() below
> @@ -3883,6 +3885,7 @@ static void __sched notrace __schedule(bool preempt)
>  		rq_unlock_irq(rq, &rf);
>  	}
>  
> +	latency_fsnotify_enable();
>  	balance_callback(rq);
>  }
>  
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index 80940939b733..5fc87d99a407 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -236,6 +236,7 @@ static void do_idle(void)
>  
>  	__current_set_polling();
>  	tick_nohz_idle_enter();
> +	latency_fsnotify_disable();
>  
>  	while (!need_resched()) {
>  		check_pgt_cache();
> @@ -265,6 +266,8 @@ static void do_idle(void)
>  		arch_cpu_idle_exit();
>  	}
>  
> +	latency_fsnotify_enable();
> +
>  	/*
>  	 * Since we fell out of the loop above, we know TIF_NEED_RESCHED must
>  	 * be set, propagate it into PREEMPT_NEED_RESCHED.
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 802b1f3405f2..467d6ad03f16 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -46,6 +46,7 @@
>  #include <linux/debugfs.h>
>  #include <linux/delayacct.h>
>  #include <linux/energy_model.h>
> +#include <linux/ftrace.h>
>  #include <linux/init_task.h>
>  #include <linux/kprobes.h>
>  #include <linux/kthread.h>
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index 563e80f9006a..a622263a69e4 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -44,6 +44,10 @@
>  #include <linux/trace.h>
>  #include <linux/sched/clock.h>
>  #include <linux/sched/rt.h>
> +#include <linux/fsnotify.h>
> +#include <linux/workqueue.h>
> +#include <trace/events/power.h>
> +#include <trace/events/sched.h>
>  
>  #include "trace.h"
>  #include "trace_output.h"
> @@ -1480,6 +1484,110 @@ static ssize_t trace_seq_to_buffer(struct trace_seq *s, void *buf, size_t cnt)
>  
>  unsigned long __read_mostly	tracing_thresh;
>  
> +#if (defined(CONFIG_TRACER_MAX_TRACE) || defined(CONFIG_HWLAT_TRACER)) && \
> +	defined(CONFIG_FSNOTIFY)
> +
> +static const struct file_operations tracing_max_lat_fops;
> +static struct workqueue_struct *fsnotify_wq;
> +static DEFINE_PER_CPU(struct llist_head, notify_list);
> +
> +DEFINE_PER_CPU(int, latency_notify_disable);
> +DEFINE_STATIC_KEY_FALSE(latency_notify_key);
> +
> +static void latency_fsnotify_workfn(struct work_struct *work)
> +{
> +	struct trace_array *tr = container_of(work, struct trace_array,
> +					      fsnotify_work);
> +	fsnotify(tr->d_max_latency->d_inode, FS_MODIFY,
> +		 tr->d_max_latency->d_inode, FSNOTIFY_EVENT_INODE, NULL, 0);
> +}
> +
> +static void trace_create_maxlat_file(struct trace_array *tr,
> +				     struct dentry *d_tracer)
> +{
> +	INIT_WORK(&tr->fsnotify_work, latency_fsnotify_workfn);
> +	atomic_set(&tr->notify_pending, 0);
> +	tr->d_max_latency = trace_create_file("tracing_max_latency", 0644,
> +					      d_tracer, &tr->max_latency,
> +					      &tracing_max_lat_fops);
> +}
> +
> +void latency_fsnotify_stop(void)
> +{
> +	/* Make sure all CPUs see caller's previous actions to stop tracer */
> +	smp_wmb();
> +	static_branch_disable(&latency_notify_key);
> +	latency_fsnotify_process();
> +}
> +
> +void latency_fsnotify_start(void)
> +{
> +	static_branch_enable(&latency_notify_key);
> +	/* Make sure all CPUs see key value before caller continue */
> +	smp_wmb();
> +}
> +
> +void latency_fsnotify_process(void)
> +{
> +	struct trace_array *tr;
> +	struct llist_head *list;
> +	struct llist_node *node;
> +
> +	if (this_cpu_read(latency_notify_disable))
> +		return;
> +
> +	list = this_cpu_ptr(&notify_list);
> +	for (node = llist_del_first(list); node != NULL;
> +	     node = llist_del_first(list)) {
> +		tr = llist_entry(node, struct trace_array, notify_ll);
> +		atomic_set(&tr->notify_pending, 0);
> +		queue_work(fsnotify_wq, &tr->fsnotify_work);
> +	}
> +}
> +
> +__init static int latency_fsnotify_init(void)
> +{
> +	fsnotify_wq = alloc_workqueue("tr_max_lat_wq",
> +				      WQ_UNBOUND | WQ_HIGHPRI, 0);
> +	if (!fsnotify_wq) {
> +		pr_err("Unable to allocate tr_max_lat_wq\n");
> +		return -ENOMEM;
> +	}
> +	return 0;
> +}
> +
> +late_initcall_sync(latency_fsnotify_init);
> +
> +void latency_fsnotify(struct trace_array *tr)
> +{
> +	if (!fsnotify_wq)
> +		return;
> +
> +	if (!this_cpu_read(latency_notify_disable))
> +		queue_work(fsnotify_wq, &tr->fsnotify_work);
> +	else {
> +		/*
> +		 * notify_pending prevents us from adding the same entry to
> +		 * more than one notify_list. It will get queued in
> +		 * latency_enable_fsnotify()
> +		 */
> +		if (!atomic_xchg(&tr->notify_pending, 1))
> +			llist_add(&tr->notify_ll, this_cpu_ptr(&notify_list));
> +	}
> +}
> +
> +/*
> + * (defined(CONFIG_TRACER_MAX_TRACE) || defined(CONFIG_HWLAT_TRACER)) && \
> + *  defined(CONFIG_FSNOTIFY)
> + */
> +#else
> +
> +#define trace_create_maxlat_file(tr, d_tracer)				\
> +	trace_create_file("tracing_max_latency", 0644, d_tracer,	\
> +			  &tr->max_latency, &tracing_max_lat_fops)
> +
> +#endif
> +
>  #ifdef CONFIG_TRACER_MAX_TRACE
>  /*
>   * Copy the new maximum trace into the separate maximum-trace
> @@ -1518,6 +1626,7 @@ __update_max_tr(struct trace_array *tr, struct task_struct *tsk, int cpu)
>  
>  	/* record this tasks comm */
>  	tracing_record_cmdline(tsk);
> +	latency_fsnotify(tr);
>  }
>  
>  /**
> @@ -8550,8 +8659,7 @@ init_tracer_tracefs(struct trace_array *tr, struct dentry *d_tracer)
>  	create_trace_options_dir(tr);
>  
>  #if defined(CONFIG_TRACER_MAX_TRACE) || defined(CONFIG_HWLAT_TRACER)
> -	trace_create_file("tracing_max_latency", 0644, d_tracer,
> -			&tr->max_latency, &tracing_max_lat_fops);
> +	trace_create_maxlat_file(tr, d_tracer);
>  #endif
>  
>  	if (ftrace_create_function_files(tr, d_tracer))
> diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
> index 005f08629b8b..d9f83b2aaa71 100644
> --- a/kernel/trace/trace.h
> +++ b/kernel/trace/trace.h
> @@ -16,6 +16,7 @@
>  #include <linux/trace_events.h>
>  #include <linux/compiler.h>
>  #include <linux/glob.h>
> +#include <linux/workqueue.h>
>  
>  #ifdef CONFIG_FTRACE_SYSCALLS
>  #include <asm/unistd.h>		/* For NR_SYSCALLS	     */
> @@ -264,6 +265,12 @@ struct trace_array {
>  #endif
>  #if defined(CONFIG_TRACER_MAX_TRACE) || defined(CONFIG_HWLAT_TRACER)
>  	unsigned long		max_latency;
> +#ifdef CONFIG_FSNOTIFY
> +	struct dentry		*d_max_latency;
> +	struct work_struct	fsnotify_work;
> +	atomic_t		notify_pending;
> +	struct llist_node	notify_ll;
> +#endif
>  #endif
>  	struct trace_pid_list	__rcu *filtered_pids;
>  	/*
> @@ -785,6 +792,21 @@ void update_max_tr_single(struct trace_array *tr,
>  			  struct task_struct *tsk, int cpu);
>  #endif /* CONFIG_TRACER_MAX_TRACE */
>  
> +#if (defined(CONFIG_TRACER_MAX_TRACE) || defined(CONFIG_HWLAT_TRACER)) && \
> +	defined(CONFIG_FSNOTIFY)
> +
> +void latency_fsnotify(struct trace_array *tr);
> +void latency_fsnotify_start(void);
> +void latency_fsnotify_stop(void);
> +
> +#else
> +
> +#define latency_fsnotify(tr)     do { } while (0)
> +#define latency_fsnotify_start() do { } while (0)
> +#define latency_fsnotify_stop()  do { } while (0)
> +
> +#endif
> +
>  #ifdef CONFIG_STACKTRACE
>  void __trace_stack(struct trace_array *tr, unsigned long flags, int skip,
>  		   int pc);
> diff --git a/kernel/trace/trace_hwlat.c b/kernel/trace/trace_hwlat.c
> index fa95139445b2..9c379261ee89 100644
> --- a/kernel/trace/trace_hwlat.c
> +++ b/kernel/trace/trace_hwlat.c
> @@ -254,8 +254,10 @@ static int get_sample(void)
>  		trace_hwlat_sample(&s);
>  
>  		/* Keep a running maximum ever recorded hardware latency */
> -		if (sample > tr->max_latency)
> +		if (sample > tr->max_latency) {
>  			tr->max_latency = sample;
> +			latency_fsnotify(tr);
> +		}
>  	}
>  
>  out:
> diff --git a/kernel/trace/trace_irqsoff.c b/kernel/trace/trace_irqsoff.c
> index a745b0cee5d3..29403a83a5f0 100644
> --- a/kernel/trace/trace_irqsoff.c
> +++ b/kernel/trace/trace_irqsoff.c
> @@ -557,6 +557,7 @@ static int __irqsoff_tracer_init(struct trace_array *tr)
>  	if (irqsoff_busy)
>  		return -EBUSY;
>  
> +	latency_fsnotify_start();
>  	save_flags = tr->trace_flags;
>  
>  	/* non overwrite screws up the latency tracers */
> @@ -591,16 +592,19 @@ static void __irqsoff_tracer_reset(struct trace_array *tr)
>  	ftrace_reset_array_ops(tr);
>  
>  	irqsoff_busy = false;
> +	latency_fsnotify_stop();
>  }
>  
>  static void irqsoff_tracer_start(struct trace_array *tr)
>  {
> +	latency_fsnotify_start();
>  	tracer_enabled = 1;
>  }
>  
>  static void irqsoff_tracer_stop(struct trace_array *tr)
>  {
>  	tracer_enabled = 0;
> +	latency_fsnotify_stop();
>  }
>  
>  #ifdef CONFIG_IRQSOFF_TRACER
> diff --git a/kernel/trace/trace_sched_wakeup.c b/kernel/trace/trace_sched_wakeup.c
> index 743b2b520d34..3dc90d9f605b 100644
> --- a/kernel/trace/trace_sched_wakeup.c
> +++ b/kernel/trace/trace_sched_wakeup.c
> @@ -669,6 +669,7 @@ static bool wakeup_busy;
>  
>  static int __wakeup_tracer_init(struct trace_array *tr)
>  {
> +	latency_fsnotify_start();
>  	save_flags = tr->trace_flags;
>  
>  	/* non overwrite screws up the latency tracers */
> @@ -727,10 +728,12 @@ static void wakeup_tracer_reset(struct trace_array *tr)
>  	set_tracer_flag(tr, TRACE_ITER_OVERWRITE, overwrite_flag);
>  	ftrace_reset_array_ops(tr);
>  	wakeup_busy = false;
> +	latency_fsnotify_stop();
>  }
>  
>  static void wakeup_tracer_start(struct trace_array *tr)
>  {
> +	latency_fsnotify_start();
>  	wakeup_reset(tr);
>  	tracer_enabled = 1;
>  }
> @@ -738,6 +741,7 @@ static void wakeup_tracer_start(struct trace_array *tr)
>  static void wakeup_tracer_stop(struct trace_array *tr)
>  {
>  	tracer_enabled = 0;
> +	latency_fsnotify_stop();
>  }
>  
>  static struct tracer wakeup_tracer __read_mostly =
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v5 4/4] ftrace: Add an option for tracing console latencies
  2019-09-03 13:26 ` [PATCH v5 4/4] ftrace: Add an option for tracing console latencies Viktor Rosendahl
@ 2019-09-04  5:02   ` kbuild test robot
  0 siblings, 0 replies; 19+ messages in thread
From: kbuild test robot @ 2019-09-04  5:02 UTC (permalink / raw)
  To: Viktor Rosendahl
  Cc: kbuild-all, Steven Rostedt, Ingo Molnar, linux-kernel,
	Joel Fernandes, Peter Zijlstra, Viktor Rosendahl

[-- Attachment #1: Type: text/plain, Size: 1496 bytes --]

Hi Viktor,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[cannot apply to v5.3-rc7 next-20190903]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Viktor-Rosendahl/Some-new-features-for-the-preempt-irqsoff-tracers/20190904-083632
config: sh-allmodconfig (attached as .config)
compiler: sh4-linux-gcc (GCC) 7.4.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=7.4.0 make.cross ARCH=sh 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   In file included from kernel/events/hw_breakpoint.c:20:0:
>> include/linux/irqflags.h:71:9: error: unknown type name 'bool'
     extern bool console_tracing_disabled(void);
            ^~~~

vim +/bool +71 include/linux/irqflags.h

    66	
    67	#if defined(CONFIG_IRQSOFF_TRACER) || \
    68		defined(CONFIG_PREEMPT_TRACER)
    69	 extern void stop_critical_timings(void);
    70	 extern void start_critical_timings(void);
  > 71	 extern bool console_tracing_disabled(void);
    72	
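
A likely fix (not something posted in this thread, so treat it as an
assumption) is that include/linux/irqflags.h now uses bool before
<linux/types.h> is guaranteed to be included, so either pulling in the
type definition near the top of include/linux/irqflags.h, e.g.:

	#include <linux/types.h>	/* for bool */

or moving the console_tracing_disabled() declaration to a header that
already includes it, should resolve the error.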

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 51785 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v5 1/4] ftrace: Implement fs notification for tracing_max_latency
  2019-09-04  4:00   ` Joel Fernandes
@ 2019-09-04  8:19     ` Peter Zijlstra
  2019-09-04 13:42       ` Joel Fernandes
  2019-09-04 18:17       ` Viktor Rosendahl
  2019-09-04 10:21     ` Paul E. McKenney
  1 sibling, 2 replies; 19+ messages in thread
From: Peter Zijlstra @ 2019-09-04  8:19 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Viktor Rosendahl, paulmck, Steven Rostedt, Ingo Molnar,
	linux-kernel, linux-rt-users

On Wed, Sep 04, 2019 at 12:00:39AM -0400, Joel Fernandes wrote:
> [ Resending since I messed up my last email's headers! ]
> 
> On Tue, Sep 03, 2019 at 03:25:59PM +0200, Viktor Rosendahl wrote:
> > This patch implements the feature that the tracing_max_latency file,
> > e.g. /sys/kernel/debug/tracing/tracing_max_latency will receive
> > notifications through the fsnotify framework when a new latency is
> > available.
> > 
> > One particularly interesting use of this facility is when enabling
> > threshold tracing, through /sys/kernel/debug/tracing/tracing_thresh,
> > together with the preempt/irqsoff tracers. This makes it possible to
> > implement a user space program that can, with equal probability,
> > obtain traces of latencies that occur immediately after each other in
> > spite of the fact that the preempt/irqsoff tracers operate in overwrite
> > mode.
> 
> Adding Paul since RCU faces similar situations, i.e. raising softirq risks
> scheduler deadlock in rcu_read_unlock_special() -- but RCU's solution is to
> avoid raising the softirq and instead use irq_work.

Which is right.

> I was wondering, if we can rename __raise_softirq_irqoff() to
> raise_softirq_irqoff_no_wake() and call that from places where there is risk
> of scheduler related deadlocks. Then I think this can be used from Viktor's
> code.  Let us discuss - what would happen if the softirq is raised, but
> ksoftirqd is not awakened for this latency notification path? Is this really
> an issue considering the softirq will execute during the next interrupt exit?

You'd get unbounded latency for processing the softirq and warnings on
going idle with softirqs pending.

I really don't see why we should/want to be using softirq here.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v5 1/4] ftrace: Implement fs notification for tracing_max_latency
  2019-09-03 13:25 ` [PATCH v5 1/4] ftrace: Implement fs notification for tracing_max_latency Viktor Rosendahl
  2019-09-04  4:00   ` Joel Fernandes
@ 2019-09-04  8:20   ` Peter Zijlstra
  2019-09-04 11:37     ` Steven Rostedt
  2019-09-04 11:39   ` Steven Rostedt
  2 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2019-09-04  8:20 UTC (permalink / raw)
  To: Viktor Rosendahl
  Cc: Steven Rostedt, Ingo Molnar, linux-kernel, Joel Fernandes

On Tue, Sep 03, 2019 at 03:25:59PM +0200, Viktor Rosendahl wrote:

> It seems like it would be possible to simply replace the calls to
> latency_fsnotify_enable/disable() with calls to
> start/stop_critical_timings(). However, the main problem is that it
> would not work for the wakeup tracer. The wakeup tracer needs a
> facility that postpones the notifications, not one that prevents the
> measurements, because all its measurements take place in the middle
> of __schedule(). On the other hand, in some places, like in idle and
> the console, we need start/stop functions that prevent the
> measurements from being made.

Like Joel already mentioned; you can use irq_work here.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v5 1/4] ftrace: Implement fs notification for tracing_max_latency
  2019-09-04  4:00   ` Joel Fernandes
  2019-09-04  8:19     ` Peter Zijlstra
@ 2019-09-04 10:21     ` Paul E. McKenney
  1 sibling, 0 replies; 19+ messages in thread
From: Paul E. McKenney @ 2019-09-04 10:21 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Viktor Rosendahl, Steven Rostedt, Ingo Molnar, linux-kernel,
	Peter Zijlstra, linux-rt-users

On Wed, Sep 04, 2019 at 12:00:39AM -0400, Joel Fernandes wrote:
> [ Resending since I messed up my last email's headers! ]
> 
> On Tue, Sep 03, 2019 at 03:25:59PM +0200, Viktor Rosendahl wrote:
> > This patch implements the feature that the tracing_max_latency file,
> > e.g. /sys/kernel/debug/tracing/tracing_max_latency will receive
> > notifications through the fsnotify framework when a new latency is
> > available.
> > 
> > One particularly interesting use of this facility is when enabling
> > threshold tracing, through /sys/kernel/debug/tracing/tracing_thresh,
> > together with the preempt/irqsoff tracers. This makes it possible to
> > implement a user space program that can, with equal probability,
> > obtain traces of latencies that occur immediately after each other in
> > spite of the fact that the preempt/irqsoff tracers operate in overwrite
> > mode.
> 
> Adding Paul since RCU faces similar situations, i.e. raising softirq risks
> scheduler deadlock in rcu_read_unlock_special() -- but RCU's solution is to
> avoid raising the softirq and instead use irq_work.

Peter's solution, actually.  ;-)

							Thanx, Paul

> I was wondering, if we can rename __raise_softirq_irqoff() to
> raise_softirq_irqoff_no_wake() and call that from places where there is risk
> of scheduler related deadlocks. Then I think this can be used from Viktor's
> code.  Let us discuss - what would happen if the softirq is raised, but
> ksoftirqd is not awakened for this latency notification path? Is this really
> an issue considering the softirq will execute during the next interrupt exit?
> 
> thanks,
> 
>  - Joel
> 
> 
> > This facility works with the hwlat, preempt/irqsoff, and wakeup
> > tracers.
> > 
> > This patch also adds some unfortunate calls from __schedule() and
> > do_idle(). Those calls to the latency_fsnotify_disable/enable() are
> > needed because we cannot wake up the workqueue from these critical
> > sections without risking a deadlock. Similar problems would also arise
> > if we try to schedule a tasklet, raise a softirq, or wake up a kernel
> > thread. If a notification event would happen in the forbidden sections,
> > we schedule the fsnotify work as soon as we have exited them.
> > 
> > There was a suggestion to remove this latency_fsnotify_enable/disable()
> > gunk, or at least to combine it with the start_critical_timings() and
> > stop_critical_timings(). I have however not been able to come up with
> > a way to do it.
> > 
> > It seems like it would be possible to simply replace the calls to
> > latency_fsnotify_enable/disable() with calls to
> > start/stop_critical_timings(). However, the main problem is that it
> > would not work for the wakeup tracer. The wakeup tracer needs a
> > facility that postpones the notifications, not one that prevents the
> > measurements, because all its measurements take place in the middle
> > of __schedule(). On the other hand, in some places, like in idle and
> > the console, we need start/stop functions that prevent the
> > measurements from being made.
> > 
> > Signed-off-by: Viktor Rosendahl <viktor.rosendahl@gmail.com>
> > ---
> >  include/linux/ftrace.h            |  31 +++++++++
> >  kernel/sched/core.c               |   3 +
> >  kernel/sched/idle.c               |   3 +
> >  kernel/sched/sched.h              |   1 +
> >  kernel/trace/trace.c              | 112 +++++++++++++++++++++++++++++-
> >  kernel/trace/trace.h              |  22 ++++++
> >  kernel/trace/trace_hwlat.c        |   4 +-
> >  kernel/trace/trace_irqsoff.c      |   4 ++
> >  kernel/trace/trace_sched_wakeup.c |   4 ++
> >  9 files changed, 181 insertions(+), 3 deletions(-)
> > 
> > diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
> > index 8a8cb3c401b2..b4d9700ef917 100644
> > --- a/include/linux/ftrace.h
> > +++ b/include/linux/ftrace.h
> > @@ -907,4 +907,35 @@ unsigned long arch_syscall_addr(int nr);
> >  
> >  #endif /* CONFIG_FTRACE_SYSCALLS */
> >  
> > +#if (defined(CONFIG_TRACER_MAX_TRACE) || defined(CONFIG_HWLAT_TRACER)) && \
> > +	defined(CONFIG_FSNOTIFY)
> > +
> > +DECLARE_PER_CPU(int, latency_notify_disable);
> > +DECLARE_STATIC_KEY_FALSE(latency_notify_key);
> > +
> > +void latency_fsnotify_process(void);
> > +
> > +/*
> > + * Disable/enable fsnotify while in scheduler and idle code. Trying to wake
> > + * anything up from there, such as calling queue_work() is prone to deadlock.
> > + */
> > +static inline void latency_fsnotify_disable(void)
> > +{
> > +	this_cpu_inc(latency_notify_disable);
> > +}
> > +
> > +static inline void latency_fsnotify_enable(void)
> > +{
> > +	this_cpu_dec(latency_notify_disable);
> > +	if (static_branch_unlikely(&latency_notify_key))
> > +		latency_fsnotify_process();
> > +}
> > +
> > +#else
> > +
> > +#define latency_fsnotify_disable() do { } while (0)
> > +#define latency_fsnotify_enable()  do { } while (0)
> > +
> > +#endif
> > +
> >  #endif /* _LINUX_FTRACE_H */
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 010d578118d6..e3c1dc801073 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3198,6 +3198,7 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
> >  	 */
> >  
> >  	rq = finish_task_switch(prev);
> > +	latency_fsnotify_enable();
> >  	balance_callback(rq);
> >  	preempt_enable();
> >  
> > @@ -3820,6 +3821,7 @@ static void __sched notrace __schedule(bool preempt)
> >  
> >  	local_irq_disable();
> >  	rcu_note_context_switch(preempt);
> > +	latency_fsnotify_disable();
> >  
> >  	/*
> >  	 * Make sure that signal_pending_state()->signal_pending() below
> > @@ -3883,6 +3885,7 @@ static void __sched notrace __schedule(bool preempt)
> >  		rq_unlock_irq(rq, &rf);
> >  	}
> >  
> > +	latency_fsnotify_enable();
> >  	balance_callback(rq);
> >  }
> >  
> > diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> > index 80940939b733..5fc87d99a407 100644
> > --- a/kernel/sched/idle.c
> > +++ b/kernel/sched/idle.c
> > @@ -236,6 +236,7 @@ static void do_idle(void)
> >  
> >  	__current_set_polling();
> >  	tick_nohz_idle_enter();
> > +	latency_fsnotify_disable();
> >  
> >  	while (!need_resched()) {
> >  		check_pgt_cache();
> > @@ -265,6 +266,8 @@ static void do_idle(void)
> >  		arch_cpu_idle_exit();
> >  	}
> >  
> > +	latency_fsnotify_enable();
> > +
> >  	/*
> >  	 * Since we fell out of the loop above, we know TIF_NEED_RESCHED must
> >  	 * be set, propagate it into PREEMPT_NEED_RESCHED.
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 802b1f3405f2..467d6ad03f16 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -46,6 +46,7 @@
> >  #include <linux/debugfs.h>
> >  #include <linux/delayacct.h>
> >  #include <linux/energy_model.h>
> > +#include <linux/ftrace.h>
> >  #include <linux/init_task.h>
> >  #include <linux/kprobes.h>
> >  #include <linux/kthread.h>
> > diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> > index 563e80f9006a..a622263a69e4 100644
> > --- a/kernel/trace/trace.c
> > +++ b/kernel/trace/trace.c
> > @@ -44,6 +44,10 @@
> >  #include <linux/trace.h>
> >  #include <linux/sched/clock.h>
> >  #include <linux/sched/rt.h>
> > +#include <linux/fsnotify.h>
> > +#include <linux/workqueue.h>
> > +#include <trace/events/power.h>
> > +#include <trace/events/sched.h>
> >  
> >  #include "trace.h"
> >  #include "trace_output.h"
> > @@ -1480,6 +1484,110 @@ static ssize_t trace_seq_to_buffer(struct trace_seq *s, void *buf, size_t cnt)
> >  
> >  unsigned long __read_mostly	tracing_thresh;
> >  
> > +#if (defined(CONFIG_TRACER_MAX_TRACE) || defined(CONFIG_HWLAT_TRACER)) && \
> > +	defined(CONFIG_FSNOTIFY)
> > +
> > +static const struct file_operations tracing_max_lat_fops;
> > +static struct workqueue_struct *fsnotify_wq;
> > +static DEFINE_PER_CPU(struct llist_head, notify_list);
> > +
> > +DEFINE_PER_CPU(int, latency_notify_disable);
> > +DEFINE_STATIC_KEY_FALSE(latency_notify_key);
> > +
> > +static void latency_fsnotify_workfn(struct work_struct *work)
> > +{
> > +	struct trace_array *tr = container_of(work, struct trace_array,
> > +					      fsnotify_work);
> > +	fsnotify(tr->d_max_latency->d_inode, FS_MODIFY,
> > +		 tr->d_max_latency->d_inode, FSNOTIFY_EVENT_INODE, NULL, 0);
> > +}
> > +
> > +static void trace_create_maxlat_file(struct trace_array *tr,
> > +				     struct dentry *d_tracer)
> > +{
> > +	INIT_WORK(&tr->fsnotify_work, latency_fsnotify_workfn);
> > +	atomic_set(&tr->notify_pending, 0);
> > +	tr->d_max_latency = trace_create_file("tracing_max_latency", 0644,
> > +					      d_tracer, &tr->max_latency,
> > +					      &tracing_max_lat_fops);
> > +}
> > +
> > +void latency_fsnotify_stop(void)
> > +{
> > +	/* Make sure all CPUs see caller's previous actions to stop tracer */
> > +	smp_wmb();
> > +	static_branch_disable(&latency_notify_key);
> > +	latency_fsnotify_process();
> > +}
> > +
> > +void latency_fsnotify_start(void)
> > +{
> > +	static_branch_enable(&latency_notify_key);
> > +	/* Make sure all CPUs see key value before caller continue */
> > +	smp_wmb();
> > +}
> > +
> > +void latency_fsnotify_process(void)
> > +{
> > +	struct trace_array *tr;
> > +	struct llist_head *list;
> > +	struct llist_node *node;
> > +
> > +	if (this_cpu_read(latency_notify_disable))
> > +		return;
> > +
> > +	list = this_cpu_ptr(&notify_list);
> > +	for (node = llist_del_first(list); node != NULL;
> > +	     node = llist_del_first(list)) {
> > +		tr = llist_entry(node, struct trace_array, notify_ll);
> > +		atomic_set(&tr->notify_pending, 0);
> > +		queue_work(fsnotify_wq, &tr->fsnotify_work);
> > +	}
> > +}
> > +
> > +__init static int latency_fsnotify_init(void)
> > +{
> > +	fsnotify_wq = alloc_workqueue("tr_max_lat_wq",
> > +				      WQ_UNBOUND | WQ_HIGHPRI, 0);
> > +	if (!fsnotify_wq) {
> > +		pr_err("Unable to allocate tr_max_lat_wq\n");
> > +		return -ENOMEM;
> > +	}
> > +	return 0;
> > +}
> > +
> > +late_initcall_sync(latency_fsnotify_init);
> > +
> > +void latency_fsnotify(struct trace_array *tr)
> > +{
> > +	if (!fsnotify_wq)
> > +		return;
> > +
> > +	if (!this_cpu_read(latency_notify_disable))
> > +		queue_work(fsnotify_wq, &tr->fsnotify_work);
> > +	else {
> > +		/*
> > +		 * notify_pending prevents us from adding the same entry to
> > +		 * more than one notify_list. It will get queued in
> > +		 * latency_enable_fsnotify()
> > +		 */
> > +		if (!atomic_xchg(&tr->notify_pending, 1))
> > +			llist_add(&tr->notify_ll, this_cpu_ptr(&notify_list));
> > +	}
> > +}
> > +
> > +/*
> > + * (defined(CONFIG_TRACER_MAX_TRACE) || defined(CONFIG_HWLAT_TRACER)) && \
> > + *  defined(CONFIG_FSNOTIFY)
> > + */
> > +#else
> > +
> > +#define trace_create_maxlat_file(tr, d_tracer)				\
> > +	trace_create_file("tracing_max_latency", 0644, d_tracer,	\
> > +			  &tr->max_latency, &tracing_max_lat_fops)
> > +
> > +#endif
> > +
> >  #ifdef CONFIG_TRACER_MAX_TRACE
> >  /*
> >   * Copy the new maximum trace into the separate maximum-trace
> > @@ -1518,6 +1626,7 @@ __update_max_tr(struct trace_array *tr, struct task_struct *tsk, int cpu)
> >  
> >  	/* record this tasks comm */
> >  	tracing_record_cmdline(tsk);
> > +	latency_fsnotify(tr);
> >  }
> >  
> >  /**
> > @@ -8550,8 +8659,7 @@ init_tracer_tracefs(struct trace_array *tr, struct dentry *d_tracer)
> >  	create_trace_options_dir(tr);
> >  
> >  #if defined(CONFIG_TRACER_MAX_TRACE) || defined(CONFIG_HWLAT_TRACER)
> > -	trace_create_file("tracing_max_latency", 0644, d_tracer,
> > -			&tr->max_latency, &tracing_max_lat_fops);
> > +	trace_create_maxlat_file(tr, d_tracer);
> >  #endif
> >  
> >  	if (ftrace_create_function_files(tr, d_tracer))
> > diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
> > index 005f08629b8b..d9f83b2aaa71 100644
> > --- a/kernel/trace/trace.h
> > +++ b/kernel/trace/trace.h
> > @@ -16,6 +16,7 @@
> >  #include <linux/trace_events.h>
> >  #include <linux/compiler.h>
> >  #include <linux/glob.h>
> > +#include <linux/workqueue.h>
> >  
> >  #ifdef CONFIG_FTRACE_SYSCALLS
> >  #include <asm/unistd.h>		/* For NR_SYSCALLS	     */
> > @@ -264,6 +265,12 @@ struct trace_array {
> >  #endif
> >  #if defined(CONFIG_TRACER_MAX_TRACE) || defined(CONFIG_HWLAT_TRACER)
> >  	unsigned long		max_latency;
> > +#ifdef CONFIG_FSNOTIFY
> > +	struct dentry		*d_max_latency;
> > +	struct work_struct	fsnotify_work;
> > +	atomic_t		notify_pending;
> > +	struct llist_node	notify_ll;
> > +#endif
> >  #endif
> >  	struct trace_pid_list	__rcu *filtered_pids;
> >  	/*
> > @@ -785,6 +792,21 @@ void update_max_tr_single(struct trace_array *tr,
> >  			  struct task_struct *tsk, int cpu);
> >  #endif /* CONFIG_TRACER_MAX_TRACE */
> >  
> > +#if (defined(CONFIG_TRACER_MAX_TRACE) || defined(CONFIG_HWLAT_TRACER)) && \
> > +	defined(CONFIG_FSNOTIFY)
> > +
> > +void latency_fsnotify(struct trace_array *tr);
> > +void latency_fsnotify_start(void);
> > +void latency_fsnotify_stop(void);
> > +
> > +#else
> > +
> > +#define latency_fsnotify(tr)     do { } while (0)
> > +#define latency_fsnotify_start() do { } while (0)
> > +#define latency_fsnotify_stop()  do { } while (0)
> > +
> > +#endif
> > +
> >  #ifdef CONFIG_STACKTRACE
> >  void __trace_stack(struct trace_array *tr, unsigned long flags, int skip,
> >  		   int pc);
> > diff --git a/kernel/trace/trace_hwlat.c b/kernel/trace/trace_hwlat.c
> > index fa95139445b2..9c379261ee89 100644
> > --- a/kernel/trace/trace_hwlat.c
> > +++ b/kernel/trace/trace_hwlat.c
> > @@ -254,8 +254,10 @@ static int get_sample(void)
> >  		trace_hwlat_sample(&s);
> >  
> >  		/* Keep a running maximum ever recorded hardware latency */
> > -		if (sample > tr->max_latency)
> > +		if (sample > tr->max_latency) {
> >  			tr->max_latency = sample;
> > +			latency_fsnotify(tr);
> > +		}
> >  	}
> >  
> >  out:
> > diff --git a/kernel/trace/trace_irqsoff.c b/kernel/trace/trace_irqsoff.c
> > index a745b0cee5d3..29403a83a5f0 100644
> > --- a/kernel/trace/trace_irqsoff.c
> > +++ b/kernel/trace/trace_irqsoff.c
> > @@ -557,6 +557,7 @@ static int __irqsoff_tracer_init(struct trace_array *tr)
> >  	if (irqsoff_busy)
> >  		return -EBUSY;
> >  
> > +	latency_fsnotify_start();
> >  	save_flags = tr->trace_flags;
> >  
> >  	/* non overwrite screws up the latency tracers */
> > @@ -591,16 +592,19 @@ static void __irqsoff_tracer_reset(struct trace_array *tr)
> >  	ftrace_reset_array_ops(tr);
> >  
> >  	irqsoff_busy = false;
> > +	latency_fsnotify_stop();
> >  }
> >  
> >  static void irqsoff_tracer_start(struct trace_array *tr)
> >  {
> > +	latency_fsnotify_start();
> >  	tracer_enabled = 1;
> >  }
> >  
> >  static void irqsoff_tracer_stop(struct trace_array *tr)
> >  {
> >  	tracer_enabled = 0;
> > +	latency_fsnotify_stop();
> >  }
> >  
> >  #ifdef CONFIG_IRQSOFF_TRACER
> > diff --git a/kernel/trace/trace_sched_wakeup.c b/kernel/trace/trace_sched_wakeup.c
> > index 743b2b520d34..3dc90d9f605b 100644
> > --- a/kernel/trace/trace_sched_wakeup.c
> > +++ b/kernel/trace/trace_sched_wakeup.c
> > @@ -669,6 +669,7 @@ static bool wakeup_busy;
> >  
> >  static int __wakeup_tracer_init(struct trace_array *tr)
> >  {
> > +	latency_fsnotify_start();
> >  	save_flags = tr->trace_flags;
> >  
> >  	/* non overwrite screws up the latency tracers */
> > @@ -727,10 +728,12 @@ static void wakeup_tracer_reset(struct trace_array *tr)
> >  	set_tracer_flag(tr, TRACE_ITER_OVERWRITE, overwrite_flag);
> >  	ftrace_reset_array_ops(tr);
> >  	wakeup_busy = false;
> > +	latency_fsnotify_stop();
> >  }
> >  
> >  static void wakeup_tracer_start(struct trace_array *tr)
> >  {
> > +	latency_fsnotify_start();
> >  	wakeup_reset(tr);
> >  	tracer_enabled = 1;
> >  }
> > @@ -738,6 +741,7 @@ static void wakeup_tracer_start(struct trace_array *tr)
> >  static void wakeup_tracer_stop(struct trace_array *tr)
> >  {
> >  	tracer_enabled = 0;
> > +	latency_fsnotify_stop();
> >  }
> >  
> >  static struct tracer wakeup_tracer __read_mostly =
> > -- 
> > 2.17.1
> > 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v5 1/4] ftrace: Implement fs notification for tracing_max_latency
  2019-09-04  8:20   ` Peter Zijlstra
@ 2019-09-04 11:37     ` Steven Rostedt
  0 siblings, 0 replies; 19+ messages in thread
From: Steven Rostedt @ 2019-09-04 11:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Viktor Rosendahl, Ingo Molnar, linux-kernel, Joel Fernandes

On Wed, 4 Sep 2019 10:20:46 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Tue, Sep 03, 2019 at 03:25:59PM +0200, Viktor Rosendahl wrote:
> 
> > It seems like it would be possible to simply replace the calls to
> > latency_fsnotify_enable/disable() with calls to
> > start/stop_critical_timings(). However, the main problem is that it
> > would not work for the wakeup tracer. The wakeup tracer needs a
> > facility that postpones the notifications, not one that prevents the
> > measurements, because all its measurements take place in the middle
> > of __schedule(). On the other hand, in some places, like in idle and
> > the console, we need start/stop functions that prevent the
> > measurements from being made.
> 
> Like Joel already mentioned; you can use irq_work here.

And I was thinking the exact same thing here too ;-)

-- Steve

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v5 1/4] ftrace: Implement fs notification for tracing_max_latency
  2019-09-03 13:25 ` [PATCH v5 1/4] ftrace: Implement fs notification for tracing_max_latency Viktor Rosendahl
  2019-09-04  4:00   ` Joel Fernandes
  2019-09-04  8:20   ` Peter Zijlstra
@ 2019-09-04 11:39   ` Steven Rostedt
  2019-09-04 19:00     ` Viktor Rosendahl
  2 siblings, 1 reply; 19+ messages in thread
From: Steven Rostedt @ 2019-09-04 11:39 UTC (permalink / raw)
  To: Viktor Rosendahl
  Cc: Ingo Molnar, linux-kernel, Joel Fernandes, Peter Zijlstra

On Tue,  3 Sep 2019 15:25:59 +0200
Viktor Rosendahl <viktor.rosendahl@gmail.com> wrote:

> +void latency_fsnotify_stop(void)
> +{
> +	/* Make sure all CPUs see caller's previous actions to stop tracer */
> +	smp_wmb();

These memory barriers just look wrong. What exactly are you trying to protect here?

Where's the matching rmbs?

-- Steve

> +	static_branch_disable(&latency_notify_key);
> +	latency_fsnotify_process();
> +}
> +
> +void latency_fsnotify_start(void)
> +{
> +	static_branch_enable(&latency_notify_key);
> +	/* Make sure all CPUs see key value before caller continue */
> +	smp_wmb();
> +}
> +

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v5 2/4] preemptirq_delay_test: Add the burst feature and a sysfs trigger
  2019-09-03 13:26 ` [PATCH v5 2/4] preemptirq_delay_test: Add the burst feature and a sysfs trigger Viktor Rosendahl
@ 2019-09-04 11:42   ` Steven Rostedt
  2019-09-04 19:10     ` Viktor Rosendahl
  0 siblings, 1 reply; 19+ messages in thread
From: Steven Rostedt @ 2019-09-04 11:42 UTC (permalink / raw)
  To: Viktor Rosendahl
  Cc: Ingo Molnar, linux-kernel, Joel Fernandes, Peter Zijlstra

On Tue,  3 Sep 2019 15:26:00 +0200
Viktor Rosendahl <viktor.rosendahl@gmail.com> wrote:
> diff --git a/kernel/trace/preemptirq_delay_test.c b/kernel/trace/preemptirq_delay_test.c
> index d8765c952fab..dc281fa75198 100644
> --- a/kernel/trace/preemptirq_delay_test.c
> +++ b/kernel/trace/preemptirq_delay_test.c
> @@ -3,6 +3,7 @@
>   * Preempt / IRQ disable delay thread to test latency tracers
>   *
>   * Copyright (C) 2018 Joel Fernandes (Google) <joel@joelfernandes.org>
> + * Copyright (C) 2018, 2019 BMW Car IT GmbH

A name and what you did should also be attached here. Ideally, we leave
these out as git history is usually enough.

-- Steve

>   */
>  
>  #include <linux/trace_clock.h>
> @@ -10,18 +11,25 @@
>  #include <linux/interrupt.h>
>  #include <linux/irq.h>
>  #include <linux/kernel.h>
> +#include <linux/kobject.h>
>  #include <linux/kthread.h>
>  #include <linux/module.h>
>  #include <linux/printk.h>
>  #include <linux/string.h>
> +#include <linux/sysfs.h>
>  
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v5 1/4] ftrace: Implement fs notification for tracing_max_latency
  2019-09-04  8:19     ` Peter Zijlstra
@ 2019-09-04 13:42       ` Joel Fernandes
  2019-09-04 18:17       ` Viktor Rosendahl
  1 sibling, 0 replies; 19+ messages in thread
From: Joel Fernandes @ 2019-09-04 13:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Viktor Rosendahl, paulmck, Steven Rostedt, Ingo Molnar,
	linux-kernel, linux-rt-users

On Wed, Sep 04, 2019 at 10:19:19AM +0200, Peter Zijlstra wrote:
> On Wed, Sep 04, 2019 at 12:00:39AM -0400, Joel Fernandes wrote:
> > [ Resending since I messed up my last email's headers! ]
> > 
> > On Tue, Sep 03, 2019 at 03:25:59PM +0200, Viktor Rosendahl wrote:
> > > This patch implements the feature that the tracing_max_latency file,
> > > e.g. /sys/kernel/debug/tracing/tracing_max_latency will receive
> > > notifications through the fsnotify framework when a new latency is
> > > available.
> > > 
> > > One particularly interesting use of this facility is when enabling
> > > threshold tracing, through /sys/kernel/debug/tracing/tracing_thresh,
> > > together with the preempt/irqsoff tracers. This makes it possible to
> > > implement a user space program that can, with equal probability,
> > > obtain traces of latencies that occur immediately after each other in
> > > spite of the fact that the preempt/irqsoff tracers operate in overwrite
> > > mode.
> > 
> > Adding Paul since RCU faces similar situations, i.e. raising softirq risks
> > scheduler deadlock in rcu_read_unlock_special() -- but RCU's solution is to
> > avoid raising the softirq and instead use irq_work.
> 
> Which is right.

Cool.

> > I was wondering, if we can rename __raise_softirq_irqoff() to
> > raise_softirq_irqoff_no_wake() and call that from places where there is risk
> > of scheduler related deadlocks. Then I think this can be used from Viktor's
> > code.  Let us discuss - what would happen if the softirq is raised, but
> > ksoftirqd is not awakened for this latency notification path? Is this really
> > an issue considering the softirq will execute during the next interrupt exit?
> 
> You'd get unbounded latency for processing the softirq and warnings on
> going idle with softirqs pending.

Thanks for sharing that.

> I really don't see why we should/want to be using softirq here.

Sure, makes sense.

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v5 1/4] ftrace: Implement fs notification for tracing_max_latency
  2019-09-04  8:19     ` Peter Zijlstra
  2019-09-04 13:42       ` Joel Fernandes
@ 2019-09-04 18:17       ` Viktor Rosendahl
  1 sibling, 0 replies; 19+ messages in thread
From: Viktor Rosendahl @ 2019-09-04 18:17 UTC (permalink / raw)
  To: Peter Zijlstra, Joel Fernandes
  Cc: paulmck, Steven Rostedt, Ingo Molnar, linux-kernel, linux-rt-users

On 9/4/19 10:19 AM, Peter Zijlstra wrote:

>>
>> Adding Paul since RCU faces similar situations, i.e. raising softirq risks
>> scheduler deadlock in rcu_read_unlock_special() -- but RCU's solution is to
>> avoid raising the softirq and instead use irq_work.
> 
> Which is right.
> 

Thanks, Joel and Peter, for suggesting the irq_work facility.

I was not aware that it existed.

It looks like the Holy Grail of deferred work for this particular type of
problem :)

It should make it possible both to avoid the deadlock and to get rid of the
ugly additions to the sched and idle code. In addition, the rest of the code
will become a lot simpler.
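
Roughly what I have in mind (only a sketch with made-up names, untested):

#include <linux/errno.h>
#include <linux/init.h>
#include <linux/irq_work.h>
#include <linux/workqueue.h>

static struct workqueue_struct *fsnotify_wq;
static struct work_struct fsnotify_work;
static struct irq_work fsnotify_irqwork;

/* Process context; here it is safe to call into fsnotify */
static void latency_fsnotify_workfn(struct work_struct *work)
{
	/* fsnotify on the tracing_max_latency dentry would go here */
}

/* Runs in hard irq context shortly after irq_work_queue() below */
static void latency_fsnotify_irqfn(struct irq_work *iwork)
{
	queue_work(fsnotify_wq, &fsnotify_work);
}

/* Should be safe to call even from __schedule() and do_idle() */
void latency_fsnotify(void)
{
	if (!fsnotify_wq)
		return;
	irq_work_queue(&fsnotify_irqwork);
}

static int __init latency_fsnotify_init(void)
{
	fsnotify_wq = alloc_workqueue("tr_max_lat_wq", 0, 0);
	if (!fsnotify_wq)
		return -ENOMEM;
	INIT_WORK(&fsnotify_work, latency_fsnotify_workfn);
	init_irq_work(&fsnotify_irqwork, latency_fsnotify_irqfn);
	return 0;
}
core_initcall(latency_fsnotify_init);

With something like this, the static key, the extra smp_wmb()s and the hooks
in the sched and idle code should no longer be needed.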

best regards,

Viktor

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v5 1/4] ftrace: Implement fs notification for tracing_max_latency
  2019-09-04 11:39   ` Steven Rostedt
@ 2019-09-04 19:00     ` Viktor Rosendahl
  0 siblings, 0 replies; 19+ messages in thread
From: Viktor Rosendahl @ 2019-09-04 19:00 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: Ingo Molnar, linux-kernel, Joel Fernandes, Peter Zijlstra

On 9/4/19 1:39 PM, Steven Rostedt wrote:
> On Tue,  3 Sep 2019 15:25:59 +0200
> Viktor Rosendahl <viktor.rosendahl@gmail.com> wrote:
> 
>> +void latency_fsnotify_stop(void)
>> +{
>> +	/* Make sure all CPUs see caller's previous actions to stop tracer */
>> +	smp_wmb();
> 
> These memory barriers just look wrong. What exactly are you trying to protect here?
> 
> Where's the matching rmbs?
> 

Thanks for reviewing.

However, since these functions will disappear when I take the irq_work
facility into use, we should perhaps not spend too much time discussing
what would have been.

There are no matching rmbs; I was thinking that the smp_wmb() would merely
enforce the order of the memory writes as seen by other CPUs, so that the
tracer would be stopped before the latency fsnotify is disabled.

E.g. in the case of the preemptirqsoff tracer, the idea was that it doesn't
matter exactly when a CPU sees the "tracer_enabled = 0;" in
stop_irqsoff_tracer(), only that it is seen before the writes in
latency_fsnotify_stop().

best regards,

Viktor

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v5 2/4] preemptirq_delay_test: Add the burst feature and a sysfs trigger
  2019-09-04 11:42   ` Steven Rostedt
@ 2019-09-04 19:10     ` Viktor Rosendahl
  2019-09-05 16:52       ` Steven Rostedt
  0 siblings, 1 reply; 19+ messages in thread
From: Viktor Rosendahl @ 2019-09-04 19:10 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: Ingo Molnar, linux-kernel, Joel Fernandes, Peter Zijlstra

On 9/4/19 1:42 PM, Steven Rostedt wrote:
> On Tue,  3 Sep 2019 15:26:00 +0200
> Viktor Rosendahl <viktor.rosendahl@gmail.com> wrote:
>> diff --git a/kernel/trace/preemptirq_delay_test.c b/kernel/trace/preemptirq_delay_test.c
>> index d8765c952fab..dc281fa75198 100644
>> --- a/kernel/trace/preemptirq_delay_test.c
>> +++ b/kernel/trace/preemptirq_delay_test.c
>> @@ -3,6 +3,7 @@
>>    * Preempt / IRQ disable delay thread to test latency tracers
>>    *
>>    * Copyright (C) 2018 Joel Fernandes (Google) <joel@joelfernandes.org>
>> + * Copyright (C) 2018, 2019 BMW Car IT GmbH
> 
> A name and what you did should also be attached here. Ideally, we leave
> these out as git history is usually enough.

I am not so keen to clutter source files with a new copyright message. 
My problem is that git-send-email doesn't work well with my work email 
address, so I am forced to use my private gmail, which may create a 
false impression that I as a private individual would be the copyright 
holder.

best regards,

Viktor

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v5 2/4] preemptirq_delay_test: Add the burst feature and a sysfs trigger
  2019-09-04 19:10     ` Viktor Rosendahl
@ 2019-09-05 16:52       ` Steven Rostedt
  2019-09-05 18:27         ` Viktor Rosendahl
  0 siblings, 1 reply; 19+ messages in thread
From: Steven Rostedt @ 2019-09-05 16:52 UTC (permalink / raw)
  To: Viktor Rosendahl
  Cc: Ingo Molnar, linux-kernel, Joel Fernandes, Peter Zijlstra

On Wed, 4 Sep 2019 21:10:38 +0200
Viktor Rosendahl <viktor.rosendahl@gmail.com> wrote:

> On 9/4/19 1:42 PM, Steven Rostedt wrote:
> > On Tue,  3 Sep 2019 15:26:00 +0200
> > Viktor Rosendahl <viktor.rosendahl@gmail.com> wrote:  
> >> diff --git a/kernel/trace/preemptirq_delay_test.c b/kernel/trace/preemptirq_delay_test.c
> >> index d8765c952fab..dc281fa75198 100644
> >> --- a/kernel/trace/preemptirq_delay_test.c
> >> +++ b/kernel/trace/preemptirq_delay_test.c
> >> @@ -3,6 +3,7 @@
> >>    * Preempt / IRQ disable delay thread to test latency tracers
> >>    *
> >>    * Copyright (C) 2018 Joel Fernandes (Google) <joel@joelfernandes.org>
> >> + * Copyright (C) 2018, 2019 BMW Car IT GmbH  
> > 
> > A name and what you did should also be attached here. Ideally, we leave
> > these out as git history is usually enough.  
> 
> I am not so keen to clutter source files with a new copyright message. 
> My problem is that git-send-email doesn't work well with my work email 
> address, so I am forced to use my private gmail, which may create a 
> false impression that I as a private individual would be the copyright 
> holder.
>

Then do what I do:

  Steven Rostedt (VMware) <rostedt@goodmis.org>

I did "(Red Hat)" when I worked for Red Hat.

-- Steve

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v5 2/4] preemptirq_delay_test: Add the burst feature and a sysfs trigger
  2019-09-05 16:52       ` Steven Rostedt
@ 2019-09-05 18:27         ` Viktor Rosendahl
  0 siblings, 0 replies; 19+ messages in thread
From: Viktor Rosendahl @ 2019-09-05 18:27 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: Ingo Molnar, linux-kernel, Joel Fernandes, Peter Zijlstra

On 9/5/19 6:52 PM, Steven Rostedt wrote:
> On Wed, 4 Sep 2019 21:10:38 +0200
> Viktor Rosendahl <viktor.rosendahl@gmail.com> wrote:
> 
>> On 9/4/19 1:42 PM, Steven Rostedt wrote:
>>
>> I am not so keen to clutter source files with a new copyright message.
>> My problem is that git-send-email doesn't work well with my work email
>> address, so I am forced to use my private gmail, which may create a
>> false impression that I as a private individual would be the copyright
>> holder.
>>
> 
> Then do what I do:
> 
>    Steven Rostedt (VMware) <rostedt@goodmis.org>
> 
> I did "(Red Hat)" when I worked for Red Hat.
> 
> -- Steve
> 

Ok, that seems simple and effective. I will update accordingly for v7 of 
the patches.

best regards,

Viktor

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2019-09-05 18:28 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-09-03 13:25 [PATCH v5 0/4] Some new features for the preempt/irqsoff tracers Viktor Rosendahl
2019-09-03 13:25 ` [PATCH v5 1/4] ftrace: Implement fs notification for tracing_max_latency Viktor Rosendahl
2019-09-04  4:00   ` Joel Fernandes
2019-09-04  8:19     ` Peter Zijlstra
2019-09-04 13:42       ` Joel Fernandes
2019-09-04 18:17       ` Viktor Rosendahl
2019-09-04 10:21     ` Paul E. McKenney
2019-09-04  8:20   ` Peter Zijlstra
2019-09-04 11:37     ` Steven Rostedt
2019-09-04 11:39   ` Steven Rostedt
2019-09-04 19:00     ` Viktor Rosendahl
2019-09-03 13:26 ` [PATCH v5 2/4] preemptirq_delay_test: Add the burst feature and a sysfs trigger Viktor Rosendahl
2019-09-04 11:42   ` Steven Rostedt
2019-09-04 19:10     ` Viktor Rosendahl
2019-09-05 16:52       ` Steven Rostedt
2019-09-05 18:27         ` Viktor Rosendahl
2019-09-03 13:26 ` [PATCH v5 3/4] Add the latency-collector to tools Viktor Rosendahl
2019-09-03 13:26 ` [PATCH v5 4/4] ftrace: Add an option for tracing console latencies Viktor Rosendahl
2019-09-04  5:02   ` kbuild test robot
