* [RFC PATCH 0/5] hwlat improvements and osnoise tracer
@ 2021-04-08 14:13 Daniel Bristot de Oliveira
  2021-04-08 14:13 ` [RFC PATCH 1/5] tracing/hwlat: Add a cpus file specific for hwlat_detector Daniel Bristot de Oliveira
                   ` (4 more replies)
  0 siblings, 5 replies; 25+ messages in thread
From: Daniel Bristot de Oliveira @ 2021-04-08 14:13 UTC
  To: Steven Rostedt, linux-kernel
  Cc: bristot, kcarcia, Jonathan Corbet, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Alexandre Chartre, Clark Williams, John Kacur,
	Juri Lelli, linux-doc

This series proposes a set of improvements and new features for the
tracing subsystem to facilitate the debugging of low latency
deployments.

Currently, hwlat runs on a single CPU at a time, migrating across a
set of CPUs in a round-robin fashion. The first three patches are
changes made to allow hwlat to run on multiple CPUs in parallel,
increasing the chances of detecting a hardware latency.

The fourth patch adds helpers to print a u64 timestamp in
seconds.nanoseconds format on tracepoints.
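
As an illustration of the seconds.nanoseconds format only, here is a
minimal user-space sketch. format_ns() is a hypothetical name chosen for
this example, and the timestamp is assumed to be in nanoseconds; this is
not the implementation of the __print_ns_to_secs() and
__print_ns_without_secs() helpers.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define NSEC_PER_SEC 1000000000ULL

/*
 * Hypothetical helper: format a nanosecond timestamp held in a u64 as
 * "seconds.nanoseconds". Returns what snprintf() returns.
 */
static int format_ns(char *buf, size_t len, uint64_t ns)
{
	uint64_t secs = ns / NSEC_PER_SEC;
	uint32_t rem = (uint32_t)(ns % NSEC_PER_SEC);

	assert(buf != NULL && len > 0);

	/* Zero-pad the remainder to nine digits so 5 ns prints as .000000005 */
	return snprintf(buf, len, "%" PRIu64 ".%09" PRIu32, secs, rem);
}
```

The zero padding is the important detail: without it, 1 s + 5 ns would
print as "1.5" and be read as one and a half seconds.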

The fifth patch proposes a new tracer named osnoise, which aims to help
users of isolcpus= (or a similar method) measure how much noise the
OS and the hardware add to the isolated application. The osnoise tracer
is based on the hwlat detector code. The difference is that, instead of
sampling with interrupts disabled, the osnoise tracer samples the CPU with
interrupts and preemption enabled. In this way, the sampling thread is
exposed to every source of noise from the OS. The detection and
classification of the type of noise are then made by observing the entry
points of NMIs, IRQs, SoftIRQs, and threads. If none of these sources of
noise is detected, the tool attributes the noise to the hardware. The tool
periodically prints a status line with the total noise of the period, the
max single noise observed, and the percentage of CPU available for the
task, along with the counters of each source of noise. To debug the
sources of noise, the tracer also adds a set of tracepoints that print any
NMI, IRQ, SoftIRQ, and thread occurrence. These tracepoints print the
starting time and the noise's net duration at the end of the noise. This
reduces the number of tracepoints (one instead of two) and removes the
need to manually account for the contribution of each noise source.
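
To make the periodic summary concrete, here is a small user-space model
of the accounting, not the tracer code itself. It assumes a busy loop has
already collected a series of timestamps and treats any gap between two
consecutive reads above a threshold as noise. struct noise_summary,
summarize_noise() and the threshold parameter are names invented for this
sketch; the real tracer additionally classifies each noise by hooking the
NMI, IRQ, SoftIRQ and thread entry points.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-period summary, mirroring the fields described above. */
struct noise_summary {
	uint64_t runtime_ns;		/* length of the sampling period */
	uint64_t total_noise_ns;	/* sum of all gaps above the threshold */
	uint64_t max_single_noise_ns;	/* largest single gap */
	double	 cpu_available_pct;	/* share of the period left to the task */
};

/*
 * ts[] holds monotonically increasing timestamps taken by a busy loop.
 * A gap larger than threshold_ns means something else ran in between.
 */
static struct noise_summary
summarize_noise(const uint64_t *ts, size_t n, uint64_t threshold_ns)
{
	struct noise_summary s = { 0 };
	size_t i;

	assert(ts != NULL && n >= 2 && ts[n - 1] > ts[0]);

	s.runtime_ns = ts[n - 1] - ts[0];

	for (i = 1; i < n; i++) {
		uint64_t gap = ts[i] - ts[i - 1];

		if (gap <= threshold_ns)
			continue;

		s.total_noise_ns += gap;
		if (gap > s.max_single_noise_ns)
			s.max_single_noise_ns = gap;
	}

	s.cpu_available_pct = 100.0 *
		(double)(s.runtime_ns - s.total_noise_ns) /
		(double)s.runtime_ns;

	return s;
}
```

For example, with timestamps 0, 1000, 2000, 4000, 5000, 5500, 8000 and a
threshold of 1000, the gaps of 2000 and 2500 count as noise.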

Daniel Bristot de Oliveira (4):
  tracing/hwlat: Add a cpus file specific for hwlat_detector
  tracing/hwlat: Implement the mode config option
  tracing/hwlat: Implement the per-cpu mode
  tracing: Add the osnoise tracer

Steven Rostedt (1):
  tracing: Add __print_ns_to_secs() and __print_ns_without_secs()
    helpers

 Documentation/trace/hwlat_detector.rst |   29 +-
 Documentation/trace/osnoise_tracer.rst |  149 ++
 include/linux/ftrace_irq.h             |   16 +
 include/trace/events/osnoise.h         |  141 ++
 include/trace/trace_events.h           |   25 +
 kernel/trace/Kconfig                   |   34 +
 kernel/trace/Makefile                  |    1 +
 kernel/trace/trace.h                   |    9 +-
 kernel/trace/trace_entries.h           |   27 +
 kernel/trace/trace_hwlat.c             |  445 +++++-
 kernel/trace/trace_osnoise.c           | 1714 ++++++++++++++++++++++++
 kernel/trace/trace_output.c            |   72 +-
 12 files changed, 2604 insertions(+), 58 deletions(-)
 create mode 100644 Documentation/trace/osnoise_tracer.rst
 create mode 100644 include/trace/events/osnoise.h
 create mode 100644 kernel/trace/trace_osnoise.c

-- 
2.30.2



* [RFC PATCH 1/5] tracing/hwlat: Add a cpus file specific for hwlat_detector
  2021-04-08 14:13 [RFC PATCH 0/5] hwlat improvements and osnoise tracer Daniel Bristot de Oliveira
@ 2021-04-08 14:13 ` Daniel Bristot de Oliveira
  2021-04-14 14:10   ` Steven Rostedt
  2021-04-08 14:13 ` [RFC PATCH 2/5] tracing/hwlat: Implement the mode config option Daniel Bristot de Oliveira
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 25+ messages in thread
From: Daniel Bristot de Oliveira @ 2021-04-08 14:13 UTC
  To: Steven Rostedt, linux-kernel
  Cc: bristot, kcarcia, Jonathan Corbet, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Alexandre Chartre, Clark Williams, John Kacur,
	Juri Lelli, linux-doc

Provide a "cpus" interface to the hardware latency detector. By
default, it lists all CPUs, allowing hwlatd threads to run on any online
CPU of the system.

It serves to restrict the execution of hwlatd to the set of CPUs written
via this interface. Note that hwlatd also respects the "tracing_cpumask".
Hence, hwlatd threads will run only on the set of CPUs allowed both here
and in "tracing_cpumask".

Why not keep just "tracing_cpumask"? Because the user might be interested
in tracing what is running on other CPUs. For instance, one might run
hwlatd on one HT CPU while observing what is running on the sibling HT
CPU. The cpu list format is also more intuitive.

This is also in preparation for the per-cpu mode.

Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Alexandre Chartre <alexandre.chartre@oracle.com>
Cc: Clark Williams <williams@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
---
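Note for reviewers: the mask logic from the changelog can be modelled in
user space as below. This is a sketch under assumptions, not the kernel
code: CPU masks are plain 64-bit words instead of struct cpumask, and
allowed_cpus() and next_cpu_round_robin() are hypothetical names. It shows
the intersection of tracing_cpumask, the new "cpus" mask and the online
CPUs, followed by a round-robin pick that is assumed to wrap around.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 64	/* assumption: one bit per CPU in a 64-bit word */

/* hwlatd may run only where tracing, hwlat "cpus" and online CPUs agree. */
static uint64_t allowed_cpus(uint64_t tracing_cpumask, uint64_t hwlat_cpus,
			     uint64_t online)
{
	return tracing_cpumask & hwlat_cpus & online;
}

/*
 * Return the next CPU set in mask after cur, wrapping around to the
 * lowest set CPU. Returns -1 if the mask is empty.
 */
static int next_cpu_round_robin(uint64_t mask, int cur)
{
	int i;

	assert(cur >= 0 && cur < NR_CPUS);

	for (i = 1; i <= NR_CPUS; i++) {
		int cpu = (cur + i) % NR_CPUS;

		if (mask & (UINT64_C(1) << cpu))
			return cpu;
	}

	return -1;
}
```

With tracing_cpumask = 0xff, cpus = 2-5 (0x3c) and CPUs 0-4 online (0x1f),
the thread cycles over CPUs 2, 3 and 4 only.
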
 Documentation/trace/hwlat_detector.rst |  14 +--
 kernel/trace/trace_hwlat.c             | 125 ++++++++++++++++++++++++-
 2 files changed, 131 insertions(+), 8 deletions(-)

diff --git a/Documentation/trace/hwlat_detector.rst b/Documentation/trace/hwlat_detector.rst
index 5739349649c8..86f973a7763c 100644
--- a/Documentation/trace/hwlat_detector.rst
+++ b/Documentation/trace/hwlat_detector.rst
@@ -73,11 +73,13 @@ in /sys/kernel/tracing:
 
  - tracing_threshold	- minimum latency value to be considered (usecs)
  - tracing_max_latency	- maximum hardware latency actually observed (usecs)
- - tracing_cpumask	- the CPUs to move the hwlat thread across
- - hwlat_detector/width	- specified amount of time to spin within window (usecs)
- - hwlat_detector/window	- amount of time between (width) runs (usecs)
+ - hwlat_detector/width - specified amount of time to spin within window (usecs)
+ - hwlat_detector/window        - amount of time between (width) runs (usecs)
+ - hwlat_detector/cpus  - the CPUs to move the hwlat thread across
 
 The hwlat detector's kernel thread will migrate across each CPU specified in
-tracing_cpumask between each window. To limit the migration, either modify
-tracing_cpumask, or modify the hwlat kernel thread (named [hwlatd]) CPU
-affinity directly, and the migration will stop.
+cpus list between each window. The hwlat detector will also obey the
+tracing_cpumask, so the thread will migrate on the set of cpus that is
+both on its cpus list and in the global tracing_cpumask file.
+To limit the migration, either modify cpumask, or modify the hwlat kernel
+thread (named [hwlatd]) CPU affinity directly, and the migration will stop.
diff --git a/kernel/trace/trace_hwlat.c b/kernel/trace/trace_hwlat.c
index 34dc1a712dcb..deecb93f97f2 100644
--- a/kernel/trace/trace_hwlat.c
+++ b/kernel/trace/trace_hwlat.c
@@ -59,6 +59,7 @@ static struct task_struct *hwlat_kthread;
 
 static struct dentry *hwlat_sample_width;	/* sample width us */
 static struct dentry *hwlat_sample_window;	/* sample window us */
+static struct dentry *hwlat_cpumask_dentry;	/* hwlat cpus allowed */
 
 /* Save the previous tracing_thresh value */
 static unsigned long save_tracing_thresh;
@@ -272,6 +273,7 @@ static int get_sample(void)
 	return ret;
 }
 
+static struct cpumask hwlat_cpumask;
 static struct cpumask save_cpumask;
 static bool disable_migrate;
 
@@ -292,7 +294,14 @@ static void move_to_next_cpu(void)
 		goto disable;
 
 	get_online_cpus();
-	cpumask_and(current_mask, cpu_online_mask, tr->tracing_cpumask);
+	/*
+	 * Run only on CPUs in which trace and hwlat are allowed to run.
+	 */
+	cpumask_and(current_mask, tr->tracing_cpumask, &hwlat_cpumask);
+	/*
+	 * And the CPU is online.
+	 */
+	cpumask_and(current_mask, cpu_online_mask, current_mask);
 	next_cpu = cpumask_next(smp_processor_id(), current_mask);
 	put_online_cpus();
 
@@ -368,7 +377,14 @@ static int start_kthread(struct trace_array *tr)
 
 	/* Just pick the first CPU on first iteration */
 	get_online_cpus();
-	cpumask_and(current_mask, cpu_online_mask, tr->tracing_cpumask);
+	/*
+	 * Run only on CPUs in which trace and hwlat are allowed to run.
+	 */
+	cpumask_and(current_mask, tr->tracing_cpumask, &hwlat_cpumask);
+	/*
+	 * And the CPU is online.
+	 */
+	cpumask_and(current_mask, cpu_online_mask, current_mask);
 	put_online_cpus();
 	next_cpu = cpumask_first(current_mask);
 
@@ -402,6 +418,94 @@ static void stop_kthread(void)
 	hwlat_kthread = NULL;
 }
 
+/*
+ * hwlat_cpus_read - Read function for reading the "cpus" file
+ * @filp: The active open file structure
+ * @ubuf: The userspace provided buffer to read value into
+ * @cnt: The maximum number of bytes to read
+ * @ppos: The current "file" position
+ *
+ * Prints the "cpus" output into the user-provided buffer.
+ */
+static ssize_t
+hwlat_cpus_read(struct file *filp, char __user *ubuf, size_t count,
+		   loff_t *ppos)
+{
+	char *mask_str;
+	int len;
+
+	len = snprintf(NULL, 0, "%*pbl\n",
+		       cpumask_pr_args(&hwlat_cpumask)) + 1;
+	mask_str = kmalloc(len, GFP_KERNEL);
+	if (!mask_str)
+		return -ENOMEM;
+
+	len = snprintf(mask_str, len, "%*pbl\n",
+		       cpumask_pr_args(&hwlat_cpumask));
+	if (len >= count) {
+		count = -EINVAL;
+		goto out_err;
+	}
+	count = simple_read_from_buffer(ubuf, count, ppos, mask_str, len);
+
+out_err:
+	kfree(mask_str);
+
+	return count;
+}
+
+/**
+ * hwlat_cpus_write - Write function for "cpus" entry
+ * @filp: The active open file structure
+ * @ubuf: The user buffer that contains the value to write
+ * @count: The maximum number of bytes to write to "file"
+ * @ppos: The current position in @file
+ *
+ * This function provides a write implementation for the "cpus"
+ * interface to the hardware latency detector. By default, it lists all
+ * CPUs, in this way, allowing hwlatd threads to run on any online CPU
+ * of the system. It serves to restrict the execution of hwlatd to the
+ * set of CPUs written via this interface. Note that hwlatd also
+ * respects the "tracing_cpumask." Hence, hwlatd threads will run only
+ * on the set of CPUs allowed here AND on "tracing_cpumask." Why not
+ * have just "tracing_cpumask?" Because the user might be interested
+ * in tracing what is running on other CPUs. For instance, one might
+ * run hwlatd in one HT CPU while observing what is running on the
+ * sibling HT CPU.
+ */
+static ssize_t
+hwlat_cpus_write(struct file *filp, const char __user *ubuf, size_t count,
+		    loff_t *ppos)
+{
+	cpumask_var_t hwlat_cpumask_new;
+	char buf[256] = { 0 };	/* NUL-terminates the copied list */
+	int err;
+
+	if (count >= 256)
+		return -EINVAL;
+
+	if (copy_from_user(buf, ubuf, count))
+		return -EFAULT;
+
+	if (!zalloc_cpumask_var(&hwlat_cpumask_new, GFP_KERNEL))
+		return -ENOMEM;
+
+	err = cpulist_parse(buf, hwlat_cpumask_new);
+	if (err)
+		goto err_free;
+
+	cpumask_copy(&hwlat_cpumask, hwlat_cpumask_new);
+
+	free_cpumask_var(hwlat_cpumask_new);
+
+	return count;
+
+err_free:
+	free_cpumask_var(hwlat_cpumask_new);
+
+	return err;
+}
+
 /*
  * hwlat_read - Wrapper read function for reading both window and width
  * @filp: The active open file structure
@@ -523,6 +627,14 @@ static const struct file_operations window_fops = {
 	.write		= hwlat_window_write,
 };
 
+static const struct file_operations cpus_fops = {
+	.open		= tracing_open_generic,
+	.read		= hwlat_cpus_read,
+	.write		= hwlat_cpus_write,
+	.llseek		= generic_file_llseek,
+};
+
+
 /**
  * init_tracefs - A function to initialize the tracefs interface files
  *
@@ -558,6 +670,13 @@ static int init_tracefs(void)
 	if (!hwlat_sample_width)
 		goto err;
 
+	hwlat_cpumask_dentry = trace_create_file("cpus", 0644,
+						 top_dir,
+						 NULL,
+						 &cpus_fops);
+	if (!hwlat_cpumask_dentry)
+		goto err;
+
 	return 0;
 
  err:
@@ -637,6 +756,8 @@ __init static int init_hwlat_tracer(void)
 	if (ret)
 		return ret;
 
+	cpumask_copy(&hwlat_cpumask, cpu_all_mask);
+
 	init_tracefs();
 
 	return 0;
-- 
2.30.2



* [RFC PATCH 2/5] tracing/hwlat: Implement the mode config option
  2021-04-08 14:13 [RFC PATCH 0/5] hwlat improvements and osnoise tracer Daniel Bristot de Oliveira
  2021-04-08 14:13 ` [RFC PATCH 1/5] tracing/hwlat: Add a cpus file specific for hwlat_detector Daniel Bristot de Oliveira
@ 2021-04-08 14:13 ` Daniel Bristot de Oliveira
  2021-04-08 20:52   ` kernel test robot
  2021-04-14 14:30   ` Steven Rostedt
  2021-04-08 14:13 ` [RFC PATCH 3/5] tracing/hwlat: Implement the per-cpu mode Daniel Bristot de Oliveira
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 25+ messages in thread
From: Daniel Bristot de Oliveira @ 2021-04-08 14:13 UTC
  To: Steven Rostedt, linux-kernel
  Cc: bristot, kcarcia, Jonathan Corbet, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Alexandre Chartre, Clark Williams, John Kacur,
	Juri Lelli, linux-doc

Provide the "mode" config to the hardware latency detector. hwlatd has
two different operation modes. The default mode is "round-robin",
in which a single hwlatd thread runs, migrating among the allowed CPUs in a
round-robin fashion. This is the current behavior.

The "none" mode sets the allowed cpumask for a single hwlatd thread at
startup, but skips the round-robin, letting the scheduler handle the
migration.

This is in preparation for the per-cpu mode.

Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Alexandre Chartre <alexandre.chartre@oracle.com>
Cc: Clark Williams <williams@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>

---
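Note for reviewers: the intended behavior of the "mode" file can be
sketched in user space as below. parse_mode() and show_modes() are
hypothetical stand-ins for the write and seq_file show paths, not the
kernel functions. The sketch shows an unknown mode being rejected with
-EINVAL and the current mode being printed in brackets, with no separator
after the last entry.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

enum { MODE_NONE = 0, MODE_ROUND_ROBIN, MODE_MAX };

static const char *const mode_str[MODE_MAX] = { "none", "round-robin" };

/* Return the mode index for input, or -EINVAL if it names no mode. */
static int parse_mode(const char *input)
{
	char buf[64];
	char *mode = buf;
	size_t len;
	int i;

	assert(input != NULL);

	len = strlen(input);
	if (len >= sizeof(buf))
		return -EINVAL;
	memcpy(buf, input, len + 1);

	/* Trim surrounding whitespace, e.g. the newline added by echo. */
	while (len > 0 && isspace((unsigned char)buf[len - 1]))
		buf[--len] = '\0';
	while (isspace((unsigned char)*mode))
		mode++;

	for (i = 0; i < MODE_MAX; i++) {
		if (strcmp(mode, mode_str[i]) == 0)
			return i;
	}

	return -EINVAL;
}

/* Print all modes into out, bracketing the current one. */
static int show_modes(char *out, size_t len, int current)
{
	size_t used = 0;
	int i;

	for (i = 0; i < MODE_MAX; i++) {
		int sel = (i == current);
		int n = snprintf(out + used, len - used, "%s%s%s%s",
				 sel ? "[" : "", mode_str[i], sel ? "]" : "",
				 i < MODE_MAX - 1 ? " " : "\n");

		if (n < 0 || (size_t)n >= len - used)
			return -ENOSPC;
		used += (size_t)n;
	}

	return (int)used;
}
```

Writing "none\n" selects MODE_NONE, while "bogus" is refused; reading back
with round-robin selected yields "none [round-robin]".
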
 Documentation/trace/hwlat_detector.rst |  21 +++-
 kernel/trace/trace_hwlat.c             | 157 +++++++++++++++++++++++--
 2 files changed, 162 insertions(+), 16 deletions(-)

diff --git a/Documentation/trace/hwlat_detector.rst b/Documentation/trace/hwlat_detector.rst
index 86f973a7763c..f63fdd867598 100644
--- a/Documentation/trace/hwlat_detector.rst
+++ b/Documentation/trace/hwlat_detector.rst
@@ -76,10 +76,19 @@ in /sys/kernel/tracing:
  - hwlat_detector/width - specified amount of time to spin within window (usecs)
  - hwlat_detector/window        - amount of time between (width) runs (usecs)
  - hwlat_detector/cpus  - the CPUs to move the hwlat thread across
+ - hwlat_detector/mode	- the thread mode
+
+By default, the hwlat detector's kernel thread will migrate across each CPU
+specified in cpus at the beginning of a new window, in a round-robin
+fashion. This behavior can be changed by changing the thread mode,
+the available options are:
+
+ - none:        do not force migration
+ - round-robin: migrate across each CPU specified in cpus between each window
+
+By default, hwlat detector will also obey the tracing_cpumask, so the thread
+will be placed only in the set of cpus that is both on the hwlat detector's
+cpus and in the global tracing_cpumask file. The user can overwrite the
+cpumask by setting it manually. Changing the hwlatd affinity externally,
+e.g., via taskset tool, will disable the round-robin migration.
 
-The hwlat detector's kernel thread will migrate across each CPU specified in
-cpus list between each window. The hwlat detector will also obey the
-tracing_cpumask, so the thread will migrate on the set of cpus that is
-both on its cpus list and in the global tracing_cpumask file.
-To limit the migration, either modify cpumask, or modify the hwlat kernel
-thread (named [hwlatd]) CPU affinity directly, and the migration will stop.
diff --git a/kernel/trace/trace_hwlat.c b/kernel/trace/trace_hwlat.c
index deecb93f97f2..3818200c9e24 100644
--- a/kernel/trace/trace_hwlat.c
+++ b/kernel/trace/trace_hwlat.c
@@ -60,6 +60,15 @@ static struct task_struct *hwlat_kthread;
 static struct dentry *hwlat_sample_width;	/* sample width us */
 static struct dentry *hwlat_sample_window;	/* sample window us */
 static struct dentry *hwlat_cpumask_dentry;	/* hwlat cpus allowed */
+static struct dentry *hwlat_thread_mode;	/* hwlat thread mode */
+
+enum {
+	MODE_NONE = 0,
+	MODE_ROUND_ROBIN,
+	MODE_MAX
+};
+
+static char *thread_mode_str[] = { "none", "round-robin" };
 
 /* Save the previous tracing_thresh value */
 static unsigned long save_tracing_thresh;
@@ -97,11 +106,16 @@ static struct hwlat_data {
 	u64	sample_window;		/* total sampling window (on+off) */
 	u64	sample_width;		/* active sampling portion of window */
 
+	int	thread_mode;			/* thread mode */
+
 } hwlat_data = {
 	.sample_window		= DEFAULT_SAMPLE_WINDOW,
 	.sample_width		= DEFAULT_SAMPLE_WIDTH,
+	.thread_mode		= MODE_ROUND_ROBIN
 };
 
+static bool hwlat_busy;
+
 static void trace_hwlat_sample(struct hwlat_sample *sample)
 {
 	struct trace_array *tr = hwlat_trace;
@@ -337,7 +351,8 @@ static int kthread_fn(void *data)
 
 	while (!kthread_should_stop()) {
 
-		move_to_next_cpu();
+		if (hwlat_data.thread_mode == MODE_ROUND_ROBIN)
+			move_to_next_cpu();
 
 		local_irq_disable();
 		get_sample();
@@ -375,6 +390,14 @@ static int start_kthread(struct trace_array *tr)
 	if (hwlat_kthread)
 		return 0;
 
+
+	kthread = kthread_create(kthread_fn, NULL, "hwlatd");
+	if (IS_ERR(kthread)) {
+		pr_err(BANNER "could not start sampling thread\n");
+		return -ENOMEM;
+	}
+
+
 	/* Just pick the first CPU on first iteration */
 	get_online_cpus();
 	/*
@@ -386,16 +409,14 @@ static int start_kthread(struct trace_array *tr)
 	 */
 	cpumask_and(current_mask, cpu_online_mask, current_mask);
 	put_online_cpus();
-	next_cpu = cpumask_first(current_mask);
 
-	kthread = kthread_create(kthread_fn, NULL, "hwlatd");
-	if (IS_ERR(kthread)) {
-		pr_err(BANNER "could not start sampling thread\n");
-		return -ENOMEM;
+	if (hwlat_data.thread_mode == MODE_ROUND_ROBIN) {
+		next_cpu = cpumask_first(current_mask);
+		cpumask_clear(current_mask);
+		cpumask_set_cpu(next_cpu, current_mask);
+
 	}
 
-	cpumask_clear(current_mask);
-	cpumask_set_cpu(next_cpu, current_mask);
 	sched_setaffinity(kthread->pid, current_mask);
 
 	hwlat_kthread = kthread;
@@ -615,6 +636,109 @@ hwlat_window_write(struct file *filp, const char __user *ubuf,
 	return cnt;
 }
 
+static void *s_mode_start(struct seq_file *s, loff_t *pos)
+{
+	int mode = *pos;
+
+	if (mode >= MODE_MAX)
+		return NULL;
+
+	return pos;
+}
+
+static void *s_mode_next(struct seq_file *s, void *v, loff_t *pos)
+{
+	int mode = ++(*pos);
+
+	if (mode >= MODE_MAX)
+		return NULL;
+
+	return pos;
+}
+
+static int s_mode_show(struct seq_file *s, void *v)
+{
+	loff_t *pos = v;
+	int mode = *pos;
+
+	if (mode == hwlat_data.thread_mode)
+		seq_printf(s, "[%s]", thread_mode_str[mode]);
+	else
+		seq_printf(s, "%s", thread_mode_str[mode]);
+
+	if (mode != MODE_MAX - 1)	/* no separator after the last mode */
+		seq_puts(s, " ");
+
+	return 0;
+}
+
+static void s_mode_stop(struct seq_file *s, void *v)
+{
+	seq_puts(s, "\n");
+}
+
+static const struct seq_operations thread_mode_seq_ops = {
+	.start		= s_mode_start,
+	.next		= s_mode_next,
+	.show		= s_mode_show,
+	.stop		= s_mode_stop
+};
+
+static int hwlat_mode_open(struct inode *inode, struct file *file)
+{
+	return seq_open(file, &thread_mode_seq_ops);
+}
+
+/**
+ * hwlat_mode_write - Write function for "mode" entry
+ * @filp: The active open file structure
+ * @ubuf: The user buffer that contains the value to write
+ * @cnt: The maximum number of bytes to write to "file"
+ * @ppos: The current position in @file
+ *
+ * This function provides a write implementation for the "mode" interface
+ * to the hardware latency detector. hwlatd has different operation modes.
+ * The "none" sets the allowed cpumask for a single hwlatd thread at the
+ * startup and lets the scheduler handle the migration. The default mode is
+ * the "round-robin" one, in which a single hwlatd thread runs, migrating
+ * among the allowed CPUs in a round-robin fashion.
+ */
+static ssize_t hwlat_mode_write(struct file *filp, const char __user *ubuf,
+				 size_t cnt, loff_t *ppos)
+{
+	const char *mode;
+	char buf[64];
+	int ret;
+	int i;
+
+	if (hwlat_busy)
+		return -EBUSY;
+
+	if (cnt >= sizeof(buf))
+		return -EINVAL;
+
+	if (copy_from_user(buf, ubuf, cnt))
+		return -EFAULT;
+
+	buf[cnt] = 0;
+
+	mode = strstrip(buf);
+
+	ret = -EINVAL;
+
+	for (i = 0; i < MODE_MAX; i++) {
+		if (strcmp(mode, thread_mode_str[i]) == 0) {
+			hwlat_data.thread_mode = i;
+			ret = cnt;
+		}
+	}
+
+	*ppos += cnt;
+
+	return ret;
+}
+
+
 static const struct file_operations width_fops = {
 	.open		= tracing_open_generic,
 	.read		= hwlat_read,
@@ -634,6 +758,14 @@ static const struct file_operations cpus_fops = {
 	.llseek		= generic_file_llseek,
 };
 
+static const struct file_operations thread_mode_fops = {
+	.open		= hwlat_mode_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= seq_release,
+	.write		= hwlat_mode_write
+
+};
 
 /**
  * init_tracefs - A function to initialize the tracefs interface files
@@ -677,6 +809,13 @@ static int init_tracefs(void)
 	if (!hwlat_cpumask_dentry)
 		goto err;
 
+	hwlat_thread_mode = trace_create_file("mode", 0644,
+					      top_dir,
+					      NULL,
+					      &thread_mode_fops);
+	if (!hwlat_thread_mode)
+		goto err;
+
 	return 0;
 
  err:
@@ -698,8 +837,6 @@ static void hwlat_tracer_stop(struct trace_array *tr)
 	stop_kthread();
 }
 
-static bool hwlat_busy;
-
 static int hwlat_tracer_init(struct trace_array *tr)
 {
 	/* Only allow one instance to enable this */
-- 
2.30.2



* [RFC PATCH 3/5] tracing/hwlat: Implement the per-cpu mode
  2021-04-08 14:13 [RFC PATCH 0/5] hwlat improvements and osnoise tracer Daniel Bristot de Oliveira
  2021-04-08 14:13 ` [RFC PATCH 1/5] tracing/hwlat: Add a cpus file specific for hwlat_detector Daniel Bristot de Oliveira
  2021-04-08 14:13 ` [RFC PATCH 2/5] tracing/hwlat: Implement the mode config option Daniel Bristot de Oliveira
@ 2021-04-08 14:13 ` Daniel Bristot de Oliveira
  2021-04-08 19:39   ` kernel test robot
                     ` (3 more replies)
  2021-04-08 14:13 ` [RFC PATCH 4/5] tracing: Add __print_ns_to_secs() and __print_ns_without_secs() helpers Daniel Bristot de Oliveira
  2021-04-08 14:13 ` [RFC PATCH 5/5] tracing: Add the osnoise tracer Daniel Bristot de Oliveira
  4 siblings, 4 replies; 25+ messages in thread
From: Daniel Bristot de Oliveira @ 2021-04-08 14:13 UTC
  To: Steven Rostedt, linux-kernel
  Cc: bristot, kcarcia, Jonathan Corbet, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Alexandre Chartre, Clark Williams, John Kacur,
	Juri Lelli, linux-doc

Implement the per-cpu mode, in which a sampling thread is created for
each cpu in "cpus" (and tracing_cpumask).

The per-cpu mode has the potential to speed up the hwlat detection by
running on multiple CPUs at the same time.

Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Alexandre Chartre <alexandre.chartre@oracle.com>
Cc: Clark Williams <williams@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>

---
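Note for reviewers: the per-cpu bookkeeping can be modelled in user space
as below. This is a simplified sketch, not the kernel code: a plain array
replaces DEFINE_PER_CPU(), the CPU number and the timestamp are passed in
explicitly, and a boolean stands in for a non-NULL kthread pointer. It
shows NMI time being accumulated only on CPUs that have an active
sampling thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS 4	/* assumption: small fixed CPU count for the model */

enum { MODE_NONE = 0, MODE_ROUND_ROBIN, MODE_PER_CPU };

struct kthread_data {
	bool	 active;	/* stands in for a non-NULL kthread pointer */
	uint64_t nmi_ts_start;
	uint64_t nmi_total_ts;
	int	 nmi_count;
};

static struct kthread_data single_data;
static struct kthread_data per_cpu_data[NR_CPUS];

/* Pick the per-CPU slot in per-cpu mode, the shared slot otherwise. */
static struct kthread_data *get_cpu_data(int mode, int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);

	return mode == MODE_PER_CPU ? &per_cpu_data[cpu] : &single_data;
}

/* Called on NMI entry (enter == true) and exit (enter == false). */
static void nmi_callback(int mode, int cpu, bool enter, uint64_t now)
{
	struct kthread_data *kdata = get_cpu_data(mode, cpu);

	/* Nothing to account if no sampling thread uses this slot. */
	if (!kdata->active)
		return;

	if (enter) {
		kdata->nmi_ts_start = now;
		kdata->nmi_count++;
	} else {
		kdata->nmi_total_ts += now - kdata->nmi_ts_start;
	}
}
```

In this model an NMI on CPU 1 from t=100 to t=130 adds 30 to CPU 1's
total, while an NMI on a CPU without a sampling thread is ignored.
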
 Documentation/trace/hwlat_detector.rst |   6 +-
 kernel/trace/trace_hwlat.c             | 171 +++++++++++++++++++------
 2 files changed, 137 insertions(+), 40 deletions(-)

diff --git a/Documentation/trace/hwlat_detector.rst b/Documentation/trace/hwlat_detector.rst
index f63fdd867598..7a6fab105b29 100644
--- a/Documentation/trace/hwlat_detector.rst
+++ b/Documentation/trace/hwlat_detector.rst
@@ -85,10 +85,12 @@ the available options are:
 
  - none:        do not force migration
  - round-robin: migrate across each CPU specified in cpus between each window
+ - per-cpu:     create a per-cpu thread for each cpu in cpus
 
 By default, hwlat detector will also obey the tracing_cpumask, so the thread
 will be placed only in the set of cpus that is both on the hwlat detector's
 cpus and in the global tracing_cpumask file. The user can overwrite the
 cpumask by setting it manually. Changing the hwlatd affinity externally,
-e.g., via taskset tool, will disable the round-robin migration.
-
+e.g., via taskset tool, will disable the round-robin migration. In the
+per-cpu mode, the per-cpu thread (hwlatd/CPU) will be pinned to its respective
+cpu, and its affinity cannot be changed.
diff --git a/kernel/trace/trace_hwlat.c b/kernel/trace/trace_hwlat.c
index 3818200c9e24..52968ea312df 100644
--- a/kernel/trace/trace_hwlat.c
+++ b/kernel/trace/trace_hwlat.c
@@ -34,7 +34,7 @@
  * Copyright (C) 2008-2009 Jon Masters, Red Hat, Inc. <jcm@redhat.com>
  * Copyright (C) 2013-2016 Steven Rostedt, Red Hat, Inc. <srostedt@redhat.com>
  *
- * Includes useful feedback from Clark Williams <clark@redhat.com>
+ * Includes useful feedback from Clark Williams <williams@redhat.com>
  *
  */
 #include <linux/kthread.h>
@@ -54,9 +54,6 @@ static struct trace_array	*hwlat_trace;
 #define DEFAULT_SAMPLE_WIDTH	500000			/* 0.5s */
 #define DEFAULT_LAT_THRESHOLD	10			/* 10us */
 
-/* sampling thread*/
-static struct task_struct *hwlat_kthread;
-
 static struct dentry *hwlat_sample_width;	/* sample width us */
 static struct dentry *hwlat_sample_window;	/* sample window us */
 static struct dentry *hwlat_cpumask_dentry;	/* hwlat cpus allowed */
@@ -65,19 +62,27 @@ static struct dentry *hwlat_thread_mode;	/* hwlat thread mode */
 enum {
 	MODE_NONE = 0,
 	MODE_ROUND_ROBIN,
+	MODE_PER_CPU,
 	MODE_MAX
 };
 
-static char *thread_mode_str[] = { "none", "round-robin" };
+static char *thread_mode_str[] = { "none", "round-robin", "per-cpu" };
 
 /* Save the previous tracing_thresh value */
 static unsigned long save_tracing_thresh;
 
-/* NMI timestamp counters */
-static u64 nmi_ts_start;
-static u64 nmi_total_ts;
-static int nmi_count;
-static int nmi_cpu;
+/* runtime kthread data */
+struct hwlat_kthread_data {
+	struct task_struct *kthread;
+	/* NMI timestamp counters */
+	u64 nmi_ts_start;
+	u64 nmi_total_ts;
+	int nmi_count;
+	int nmi_cpu;
+};
+
+struct hwlat_kthread_data hwlat_single_cpu_data;
+DEFINE_PER_CPU(struct hwlat_kthread_data, hwlat_per_cpu_data);
 
 /* Tells NMIs to call back to the hwlat tracer to record timestamps */
 bool trace_hwlat_callback_enabled;
@@ -114,6 +119,14 @@ static struct hwlat_data {
 	.thread_mode		= MODE_ROUND_ROBIN
 };
 
+struct hwlat_kthread_data *get_cpu_data(void)
+{
+	if (hwlat_data.thread_mode == MODE_PER_CPU)
+		return this_cpu_ptr(&hwlat_per_cpu_data);
+	else
+		return &hwlat_single_cpu_data;
+}
+
 static bool hwlat_busy;
 
 static void trace_hwlat_sample(struct hwlat_sample *sample)
@@ -151,7 +164,9 @@ static void trace_hwlat_sample(struct hwlat_sample *sample)
 
 void trace_hwlat_callback(bool enter)
 {
-	if (smp_processor_id() != nmi_cpu)
+	struct hwlat_kthread_data *kdata = get_cpu_data();
+
+	if (!kdata->kthread)
 		return;
 
 	/*
@@ -160,13 +175,13 @@ void trace_hwlat_callback(bool enter)
 	 */
 	if (!IS_ENABLED(CONFIG_GENERIC_SCHED_CLOCK)) {
 		if (enter)
-			nmi_ts_start = time_get();
+			kdata->nmi_ts_start = time_get();
 		else
-			nmi_total_ts += time_get() - nmi_ts_start;
+			kdata->nmi_total_ts += time_get() - kdata->nmi_ts_start;
 	}
 
 	if (enter)
-		nmi_count++;
+		kdata->nmi_count++;
 }
 
 /**
@@ -178,6 +193,7 @@ void trace_hwlat_callback(bool enter)
  */
 static int get_sample(void)
 {
+	struct hwlat_kthread_data *kdata = get_cpu_data();
 	struct trace_array *tr = hwlat_trace;
 	struct hwlat_sample s;
 	time_type start, t1, t2, last_t2;
@@ -190,9 +206,8 @@ static int get_sample(void)
 
 	do_div(thresh, NSEC_PER_USEC); /* modifies interval value */
 
-	nmi_cpu = smp_processor_id();
-	nmi_total_ts = 0;
-	nmi_count = 0;
+	kdata->nmi_total_ts = 0;
+	kdata->nmi_count = 0;
 	/* Make sure NMIs see this first */
 	barrier();
 
@@ -262,15 +277,15 @@ static int get_sample(void)
 		ret = 1;
 
 		/* We read in microseconds */
-		if (nmi_total_ts)
-			do_div(nmi_total_ts, NSEC_PER_USEC);
+		if (kdata->nmi_total_ts)
+			do_div(kdata->nmi_total_ts, NSEC_PER_USEC);
 
 		hwlat_data.count++;
 		s.seqnum = hwlat_data.count;
 		s.duration = sample;
 		s.outer_duration = outer_sample;
-		s.nmi_total_ts = nmi_total_ts;
-		s.nmi_count = nmi_count;
+		s.nmi_total_ts = kdata->nmi_total_ts;
+		s.nmi_count = kdata->nmi_count;
 		s.count = count;
 		trace_hwlat_sample(&s);
 
@@ -376,23 +391,43 @@ static int kthread_fn(void *data)
 }
 
 /**
- * start_kthread - Kick off the hardware latency sampling/detector kthread
+ * stop_single_kthread - Inform the hardware latency sampling/detector kthread to stop
+ *
+ * This kicks the running hardware latency sampling/detector kernel thread and
+ * tells it to stop sampling now. Use this on unload and at system shutdown.
+ */
+static void stop_single_kthread(void)
+{
+	struct hwlat_kthread_data *kdata = get_cpu_data();
+	struct task_struct *kthread = kdata->kthread;
+
+	if (!kthread)
+		return;
+
+	kthread_stop(kthread);
+
+	kdata->kthread = NULL;
+}
+
+
+/**
+ * start_single_kthread - Kick off the hardware latency sampling/detector kthread
  *
  * This starts the kernel thread that will sit and sample the CPU timestamp
  * counter (TSC or similar) and look for potential hardware latencies.
  */
-static int start_kthread(struct trace_array *tr)
+static int start_single_kthread(struct trace_array *tr)
 {
+	struct hwlat_kthread_data *kdata = get_cpu_data();
 	struct cpumask *current_mask = &save_cpumask;
 	struct task_struct *kthread;
 	int next_cpu;
 
-	if (hwlat_kthread)
+	if (kdata->kthread)
 		return 0;
 
-
 	kthread = kthread_create(kthread_fn, NULL, "hwlatd");
-	if (IS_ERR(kthread)) {
+	if (IS_ERR(kthread)) {
 		pr_err(BANNER "could not start sampling thread\n");
 		return -ENOMEM;
 	}
@@ -419,24 +454,77 @@ static int start_kthread(struct trace_array *tr)
 
 	sched_setaffinity(kthread->pid, current_mask);
 
-	hwlat_kthread = kthread;
+	kdata->kthread = kthread;
 	wake_up_process(kthread);
 
 	return 0;
 }
 
 /**
- * stop_kthread - Inform the hardware latency samping/detector kthread to stop
+ * stop_per_cpu_kthreads - Inform the hardware latency sampling/detector kthreads to stop
  *
- * This kicks the running hardware latency sampling/detector kernel thread and
+ * This kicks the running hardware latency sampling/detector kernel threads and
  * tells it to stop sampling now. Use this on unload and at system shutdown.
  */
-static void stop_kthread(void)
+static void stop_per_cpu_kthreads(void)
 {
-	if (!hwlat_kthread)
-		return;
-	kthread_stop(hwlat_kthread);
-	hwlat_kthread = NULL;
+	struct task_struct *kthread;
+	int cpu;
+
+	for_each_online_cpu(cpu) {
+		kthread = per_cpu(hwlat_per_cpu_data, cpu).kthread;
+		if (kthread)
+			kthread_stop(kthread);
+	}
+}
+
+/**
+ * start_per_cpu_kthreads - Kick off the hardware latency sampling/detector kthreads
+ *
+ * This starts the kernel threads that will sit on potentially all cpus and
+ * sample the CPU timestamp counter (TSC or similar) and look for potential
+ * hardware latencies.
+ */
+static int start_per_cpu_kthreads(struct trace_array *tr)
+{
+	struct cpumask *current_mask = &save_cpumask;
+	struct cpumask *this_cpumask;
+	struct task_struct *kthread;
+	char comm[24];
+	int cpu;
+
+	if (!alloc_cpumask_var(&this_cpumask, GFP_KERNEL))
+		return -ENOMEM;
+
+	get_online_cpus();
+	/*
+	 * Run only on CPUs in which trace and hwlat are allowed to run.
+	 */
+	cpumask_and(current_mask, tr->tracing_cpumask, &hwlat_cpumask);
+	/*
+	 * And the CPU is online.
+	 */
+	cpumask_and(current_mask, cpu_online_mask, current_mask);
+	put_online_cpus();
+
+	for_each_online_cpu(cpu)
+		per_cpu(hwlat_per_cpu_data, cpu).kthread = NULL;
+
+	for_each_cpu(cpu, current_mask) {
+		snprintf(comm, 24, "hwlatd/%d", cpu);
+
+		kthread = kthread_create_on_cpu(kthread_fn, NULL, cpu, comm);
+		if (IS_ERR(kthread)) {
+			pr_err(BANNER "could not start sampling thread\n");
+			stop_per_cpu_kthreads();
+			return -ENOMEM;
+		}
+
+		per_cpu(hwlat_per_cpu_data, cpu).kthread = kthread;
+		wake_up_process(kthread);
+	}
+
+	return 0;
 }
 
 /*
@@ -701,7 +789,8 @@ static int hwlat_mode_open(struct inode *inode, struct file *file)
  * The "none" sets the allowed cpumask for a single hwlatd thread at the
  * startup and lets the scheduler handle the migration. The default mode is
  * the "round-robin" one, in which a single hwlatd thread runs, migrating
- * among the allowed CPUs in a round-robin fashion.
+ * among the allowed CPUs in a round-robin fashion. The "per-cpu" mode
+ * creates one hwlatd thread per allowed CPU.
  */
 static ssize_t hwlat_mode_write(struct file *filp, const char __user *ubuf,
 				 size_t cnt, loff_t *ppos)
@@ -827,14 +916,20 @@ static void hwlat_tracer_start(struct trace_array *tr)
 {
 	int err;
 
-	err = start_kthread(tr);
+	if (hwlat_data.thread_mode == MODE_PER_CPU)
+		err = start_per_cpu_kthreads(tr);
+	else
+		err = start_single_kthread(tr);
 	if (err)
 		pr_err(BANNER "Cannot start hwlat kthread\n");
 }
 
 static void hwlat_tracer_stop(struct trace_array *tr)
 {
-	stop_kthread();
+	if (hwlat_data.thread_mode == MODE_PER_CPU)
+		stop_per_cpu_kthreads();
+	else
+		stop_single_kthread();
 }
 
 static int hwlat_tracer_init(struct trace_array *tr)
@@ -864,7 +959,7 @@ static int hwlat_tracer_init(struct trace_array *tr)
 
 static void hwlat_tracer_reset(struct trace_array *tr)
 {
-	stop_kthread();
+	hwlat_tracer_stop(tr);
 
 	/* the tracing threshold is static between runs */
 	last_tracing_thresh = tracing_thresh;
-- 
2.30.2



* [RFC PATCH 4/5] tracing: Add __print_ns_to_secs() and __print_ns_without_secs() helpers
  2021-04-08 14:13 [RFC PATCH 0/5] hwlat improvements and osnoise tracer Daniel Bristot de Oliveira
                   ` (2 preceding siblings ...)
  2021-04-08 14:13 ` [RFC PATCH 3/5] tracing/hwlat: Implement the per-cpu mode Daniel Bristot de Oliveira
@ 2021-04-08 14:13 ` Daniel Bristot de Oliveira
  2021-04-08 14:13 ` [RFC PATCH 5/5] tracing: Add the osnoise tracer Daniel Bristot de Oliveira
  4 siblings, 0 replies; 25+ messages in thread
From: Daniel Bristot de Oliveira @ 2021-04-08 14:13 UTC (permalink / raw)
  To: Steven Rostedt, linux-kernel
  Cc: bristot, kcarcia, Jonathan Corbet, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Alexandre Chartre, Clark Williams, John Kacur,
	Juri Lelli, linux-doc

From: Steven Rostedt <rostedt@goodmis.org>

To have nanosecond output displayed in a more human readable format, it's
nicer to convert it to a seconds format (XXX.YYYYYYYYY). The problem is that
to do so, the numbers must be divided by NSEC_PER_SEC, and the remainder
taken too. But as these numbers are 64 bit, this cannot be done simply with
the '/' and '%' operators, but must use do_div() instead.

Instead of performing the expensive do_div() in the hot path of the
tracepoint, it is more efficient to perform it during the output phase. But
passing in do_div() can confuse the parser, and do_div() doesn't work
exactly like a normal C function. It modifies the number in place, and we
don't want to modify the actual values in the ring buffer.

Two helper functions are now created:

  __print_ns_to_secs() and __print_ns_without_secs()

They both take a value of nanoseconds, and the former will return that
number divided by NSEC_PER_SEC, and the latter will mod it with NSEC_PER_SEC
giving a way to print a nice human readable format:

 __print_fmt("time=%llu.%09u",
	__print_ns_to_secs(REC->nsec_val),
	__print_ns_without_secs(REC->nsec_val))
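
For illustration only, the split performed by the two helpers can be
sketched in plain userspace C. The names ns_to_secs(), ns_without_secs()
and format_ns() are hypothetical and are not part of this patch. The
kernel helpers call do_div() on a local copy, while this sketch assumes a
userspace compiler where the native 64-bit '/' and '%' are available:

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical userspace stand-in for __print_ns_to_secs(). */
static uint64_t ns_to_secs(uint64_t ns)
{
	return ns / NSEC_PER_SEC;
}

/* Hypothetical userspace stand-in for __print_ns_without_secs(). */
static uint32_t ns_without_secs(uint64_t ns)
{
	/* The remainder is always below 10^9, so it fits in 32 bits. */
	return (uint32_t)(ns % NSEC_PER_SEC);
}

/* Format a nanosecond value as seconds.nanoseconds into buf. */
static int format_ns(char *buf, size_t len, uint64_t ns)
{
	return snprintf(buf, len, "%" PRIu64 ".%09" PRIu32,
			ns_to_secs(ns), ns_without_secs(ns));
}
```

The input value is never modified, which is the property the helpers
need in order to leave the ring buffer contents untouched.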

Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Alexandre Chartre <alexandre.chartre@oracle.com>
Cc: Clark Williams <williams@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>

---
 include/trace/trace_events.h | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/include/trace/trace_events.h b/include/trace/trace_events.h
index 8268bf747d6f..c60fd1037b91 100644
--- a/include/trace/trace_events.h
+++ b/include/trace/trace_events.h
@@ -33,6 +33,21 @@
 	static const char TRACE_SYSTEM_STRING[] =	\
 		__stringify(TRACE_SYSTEM)
 
+#undef __print_ns_to_secs
+#define __print_ns_to_secs(value)			\
+	({						\
+		u64 ____val = (u64)(value);		\
+		do_div(____val, NSEC_PER_SEC);		\
+		____val;				\
+	})
+
+#undef __print_ns_without_secs
+#define __print_ns_without_secs(value)			\
+	({						\
+		u64 ____val = (u64)(value);		\
+		(u32) do_div(____val, NSEC_PER_SEC);	\
+	})
+
 TRACE_MAKE_SYSTEM_STR();
 
 #undef TRACE_DEFINE_ENUM
@@ -736,6 +751,16 @@ static inline void ftrace_test_probe_##call(void)			\
 #undef __print_array
 #undef __print_hex_dump
 
+/*
+ * The below is not executed in the kernel. It is only what is
+ * displayed in the print format for userspace to parse.
+ */
+#undef __print_ns_to_secs
+#define __print_ns_to_secs(val) (val) / 1000000000UL
+
+#undef __print_ns_without_secs
+#define __print_ns_without_secs(val) (val) % 1000000000UL
+
 #undef TP_printk
 #define TP_printk(fmt, args...) "\"" fmt "\", "  __stringify(args)
 
-- 
2.30.2



* [RFC PATCH 5/5] tracing: Add the osnoise tracer
  2021-04-08 14:13 [RFC PATCH 0/5] hwlat improvements and osnoise tracer Daniel Bristot de Oliveira
                   ` (3 preceding siblings ...)
  2021-04-08 14:13 ` [RFC PATCH 4/5] tracing: Add __print_ns_to_secs() and __print_ns_without_secs() helpers Daniel Bristot de Oliveira
@ 2021-04-08 14:13 ` Daniel Bristot de Oliveira
  2021-04-08 15:58   ` Jonathan Corbet
                     ` (2 more replies)
  4 siblings, 3 replies; 25+ messages in thread
From: Daniel Bristot de Oliveira @ 2021-04-08 14:13 UTC (permalink / raw)
  To: Steven Rostedt, linux-kernel
  Cc: bristot, kcarcia, Jonathan Corbet, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Alexandre Chartre, Clark Williams, John Kacur,
	Juri Lelli, linux-doc

In the context of high-performance computing (HPC), the Operating System
Noise (osnoise) refers to the interference experienced by an application
due to activities inside the operating system. In the context of Linux,
NMIs, IRQs, SoftIRQs, and any other system thread can cause noise to the
system. Moreover, hardware-related jobs can also cause noise, for example,
via SMIs.

hwlat_detector is one of the tools used to identify the most complex
source of noise: hardware noise.

In a nutshell, the hwlat_detector creates a thread that runs
periodically for a given period. At the beginning of a period, the thread
disables interrupts and starts sampling. While running, the hwlatd
thread reads the time in a loop. As interrupts are disabled, threads,
IRQs, and SoftIRQs cannot interfere with the hwlatd thread. Hence, the
cause of any gap between two different reads of the time is either an
NMI or the hardware itself. At the end of the period, hwlatd enables
interrupts and reports the max observed gap between the reads. It also
prints an NMI occurrence counter. If the output does not report NMI
executions, the user can conclude that the hardware is the culprit for
the latency. The hwlat detects the NMI execution by observing
the entry and exit of an NMI.

The osnoise tracer leverages the hwlat_detector by running a
similar loop with preemption, SoftIRQs and IRQs enabled, thus allowing
all the sources of osnoise during its execution. Using the same approach
as hwlat, osnoise takes note of the entry and exit point of any
source of interference, increasing a per-cpu interference counter. The
osnoise tracer also saves an interference counter for each source of
interference. The interference counter for NMI, IRQs, SoftIRQs, and
threads is increased anytime the tool observes these interferences' entry
events. When a noise happens without any interference from the operating
system level, the hardware noise counter increases, pointing to a
hardware-related noise. In this way, osnoise can account for any
source of interference. At the end of the period, the osnoise tracer
prints the sum of all noise, the max single noise, the percentage of CPU
available for the thread, and the counters for the noise sources.
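
As a rough illustration of the accounting described above, the userspace
sketch below walks a series of time reads and treats every gap of at
least a given tolerance as noise. It is not the code in
trace_osnoise.c: struct noise_summary and account_noise() are
hypothetical names, the ts[] array stands in for the values a busy loop
would read from the clock, and the per-source interference counters are
left out:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-period summary, loosely mirroring the tracer output. */
struct noise_summary {
	uint64_t runtime_ns;	/* time covered by the samples */
	uint64_t noise_ns;	/* sum of all gaps counted as noise */
	uint64_t max_single_ns;	/* largest single gap counted as noise */
	uint64_t samples;	/* number of gaps counted as noise */
};

/*
 * Walk consecutive time reads and treat any gap of at least
 * tolerance_ns as noise. Gaps below the tolerance are assumed to be
 * the normal cost of one loop iteration.
 */
static struct noise_summary account_noise(const uint64_t *ts, size_t n,
					  uint64_t tolerance_ns)
{
	struct noise_summary s = { 0 };
	size_t i;

	if (n < 2)
		return s;

	s.runtime_ns = ts[n - 1] - ts[0];

	for (i = 1; i < n; i++) {
		uint64_t delta = ts[i] - ts[i - 1];

		if (delta < tolerance_ns)
			continue;

		s.noise_ns += delta;
		s.samples++;
		if (delta > s.max_single_ns)
			s.max_single_ns = delta;
	}

	return s;
}
```

In the tracer itself, each noise sample is additionally compared with the
interference counter to decide whether the OS or the hardware caused it.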

Usage

Write the ASCII text osnoise into the current_tracer file of the
tracing system (generally mounted at /sys/kernel/tracing or
/sys/kernel/debug/tracing).

For example::

        [root@f32 ~]# cd /sys/kernel/tracing/
        [root@f32 tracing]# echo osnoise > current_tracer

It is possible to follow the trace by reading the trace file::

        [root@f32 tracing]# cat trace
        # tracer: osnoise
        #
        #                                _-----=> irqs-off
        #                               / _----=> need-resched
        #                              | / _---=> hardirq/softirq
        #                              || / _--=> preempt-depth                            MAX
        #                              || /                                             SINGLE     Interference counters:
        #                              ||||               RUNTIME      NOISE   % OF CPU  NOISE    +-----------------------------+
        #           TASK-PID      CPU# ||||   TIMESTAMP    IN US       IN US  AVAILABLE  IN US     HW    NMI    IRQ   SIRQ THREAD
        #              | |         |   ||||      |           |             |    |            |      |      |      |      |      |
                   <...>-859     [000] ....    81.637220: 1000000        190  99.98100       9     18      0   1007     18      1
                   <...>-860     [001] ....    81.638154: 1000000        656  99.93440      74     23      0   1006     16      3
                   <...>-861     [002] ....    81.638193: 1000000       5675  99.43250     202      6      0   1013     25     21
                   <...>-862     [003] ....    81.638242: 1000000        125  99.98750      45      1      0   1011     23      0
                   <...>-863     [004] ....    81.638260: 1000000       1721  99.82790     168      7      0   1002     49     41
                   <...>-864     [005] ....    81.638286: 1000000        263  99.97370      57      6      0   1006     26      2
                   <...>-865     [006] ....    81.638302: 1000000        109  99.98910      21      3      0   1006     18      1
                   <...>-866     [007] ....    81.638326: 1000000       7816  99.21840     107      8      0   1016     39     19

In addition to the regular trace fields (from TASK-PID to TIMESTAMP), the
tracer prints a message at the end of each period for each CPU that is
running an osnoise/ thread. The osnoise specific fields report:

 - The RUNTIME IN US reports the amount of time in microseconds that
   the osnoise thread kept looping reading the time.
 - The NOISE IN US reports the sum of noise in microseconds observed
   by the osnoise tracer during the associated runtime.
 - The % OF CPU AVAILABLE reports the percentage of CPU available for
   the osnoise thread during the runtime window.
 - The MAX SINGLE NOISE IN US reports the maximum single noise observed
   during the runtime window.
 - The Interference counters display how many times each of the
   respective interferences happened during the runtime window.
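
The relation between the RUNTIME, NOISE and % OF CPU AVAILABLE columns in
the sample output above is consistent with available = (runtime - noise)
/ runtime. A minimal sketch of that arithmetic follows. The function name
cpu_available_e5() and the fixed-point scaling are assumptions made for
this example, not something taken from the patch:

```c
#include <stdint.h>

/*
 * Return the share of the runtime left to the thread, in units of
 * 0.00001 percent, so that 9998100 means 99.98100%. Integer math
 * only, as floating point is not an option inside the kernel.
 */
static uint64_t cpu_available_e5(uint64_t runtime_us, uint64_t noise_us)
{
	if (runtime_us == 0 || noise_us > runtime_us)
		return 0;

	return (runtime_us - noise_us) * 10000000ULL / runtime_us;
}
```

For the first line of the output, a runtime of 1000000 us with 190 us of
noise yields 9998100, which matches the 99.98100 shown in the trace.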

Note that the example above shows a high number of HW noise samples.
The reason is that this sample was taken on a virtual machine,
and the host interference is detected as a hardware interference.

Tracer options

The tracer has a set of options inside the osnoise directory. They are:

 - cpus: CPUs on which an osnoise thread will execute.
 - period_us: the period of the osnoise thread.
 - runtime_us: how long an osnoise thread will look for noise.
 - stop_tracing_single_us: stop the system tracing if a single noise
   higher than the configured value happens. Writing 0 disables this
   option.
 - stop_tracing_total_us: stop the system tracing if a NOISE IN US
   higher than the configured value happens. Writing 0 disables this
   option.
 - tolerance_ns: the minimum delta between two time() reads to be
   considered as noise.

Additional Tracing

In addition to the tracer, a set of tracepoints were added to
facilitate the identification of the osnoise source.

 - osnoise:sample_threshold: printed anytime a noise is higher than
   the configurable tolerance_ns.
 - osnoise:nmi_noise: noise from NMI, including the duration.
 - osnoise:irq_noise: noise from an IRQ, including the duration.
 - osnoise:softirq_noise: noise from a SoftIRQ, including the
   duration.
 - osnoise:thread_noise: noise from a thread, including the duration.

Note that all the values are net values. This means that a thread
duration will not contain the duration of the IRQs that happened during
its execution, for example. The same is valid for all duration values.
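
To make the idea of a net value concrete, the sketch below subtracts the
time consumed by nested interferences from a gross duration. This is only
one way to express the arithmetic: net_duration_ns() is a hypothetical
name and the tracer may well compute the net value differently:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Given the gross duration of an interference and the durations of
 * the interferences that preempted it, return the net duration.
 * The result is clamped at zero if the inputs are inconsistent.
 */
static uint64_t net_duration_ns(uint64_t gross_ns,
				const uint64_t *nested_ns, size_t n)
{
	uint64_t nested = 0;
	size_t i;

	for (i = 0; i < n; i++)
		nested += nested_ns[i];

	return nested > gross_ns ? 0 : gross_ns - nested;
}
```

With made-up numbers, a thread that occupied the CPU for 10000 ns while
two IRQs of 2000 ns and 1500 ns ran on top of it has a net duration of
6500 ns.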

Here is one example of the usage of these tracepoints::

       osnoise/8-961     [008] d.h.  5789.857532: irq_noise: local_timer:236 start 5789.857529929 duration 1845 ns
       osnoise/8-961     [008] dNh.  5789.858408: irq_noise: local_timer:236 start 5789.858404871 duration 2848 ns
     migration/8-54      [008] d...  5789.858413: thread_noise: migration/8:54 start 5789.858409300 duration 3068 ns
       osnoise/8-961     [008] ....  5789.858413: sample_threshold: start 5789.858404555 duration 8 us interferences 2

In this example, a noise sample of 8 microseconds was reported in the last
line, pointing to two interferences. Looking backward in the trace, the
two previous entries were about the migration thread running after
a timer IRQ execution. The first event is not part of the noise because
it took place one millisecond before.

It is worth noticing that the sum of the durations reported in the
tracepoints is smaller than the eight us reported in the
sample_threshold. The reason lies in the tracing overhead and in
the overhead of the entry and exit code that happens before and after
any interference execution. This justifies the dual approach: measuring
thread and tracing.
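
The reasoning above can be reproduced with a few lines of C. The sketch
below sums the durations of the events whose start falls inside the
sample window. struct noise_event and sum_in_window() are hypothetical
names used only for this example, and the 8 us window length is taken at
face value even though the tracepoint prints whole microseconds:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record for one *_noise tracepoint entry. */
struct noise_event {
	uint64_t start_ns;
	uint64_t duration_ns;
};

/*
 * Sum the durations of the events that started inside the window
 * [win_start_ns, win_start_ns + win_len_ns) and report how many
 * events matched through *count.
 */
static uint64_t sum_in_window(const struct noise_event *ev, size_t n,
			      uint64_t win_start_ns, uint64_t win_len_ns,
			      size_t *count)
{
	uint64_t sum = 0;
	size_t i;

	*count = 0;
	for (i = 0; i < n; i++) {
		if (ev[i].start_ns < win_start_ns ||
		    ev[i].start_ns >= win_start_ns + win_len_ns)
			continue;

		sum += ev[i].duration_ns;
		(*count)++;
	}

	return sum;
}
```

Fed with the three events from the trace and a window starting at
5789.858404555, the first IRQ is skipped and the remaining two add up to
2848 + 3068 = 5916 ns, which leaves roughly 2 us of the reported 8 us
unaccounted for by the tracepoints.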

Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Alexandre Chartre <alexandre.chartre@oracle.com>
Cc: Clark Williams <williams@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>

---
 Documentation/trace/osnoise_tracer.rst |  149 ++
 include/linux/ftrace_irq.h             |   16 +
 include/trace/events/osnoise.h         |  141 ++
 kernel/trace/Kconfig                   |   34 +
 kernel/trace/Makefile                  |    1 +
 kernel/trace/trace.h                   |    9 +-
 kernel/trace/trace_entries.h           |   27 +
 kernel/trace/trace_osnoise.c           | 1714 ++++++++++++++++++++++++
 kernel/trace/trace_output.c            |   72 +-
 9 files changed, 2159 insertions(+), 4 deletions(-)
 create mode 100644 Documentation/trace/osnoise_tracer.rst
 create mode 100644 include/trace/events/osnoise.h
 create mode 100644 kernel/trace/trace_osnoise.c

diff --git a/Documentation/trace/osnoise_tracer.rst b/Documentation/trace/osnoise_tracer.rst
new file mode 100644
index 000000000000..9a97f557317b
--- /dev/null
+++ b/Documentation/trace/osnoise_tracer.rst
@@ -0,0 +1,149 @@
+==============
+OSNOISE Tracer
+==============
+
+In the context of high-performance computing (HPC), the Operating System
+Noise (*osnoise*) refers to the interference experienced by an application
+due to activities inside the operating system. In the context of Linux,
+NMIs, IRQs, SoftIRQs, and any other system thread can cause noise to the
+system. Moreover, hardware-related jobs can also cause noise, for example,
+via SMIs.
+
+``hwlat_detector`` is one of the tools used to identify the most complex
+source of noise: *hardware noise*.
+
+In a nutshell, the ``hwlat_detector`` creates a thread that runs
+periodically for a given period. At the beginning of a period, the thread
+disables interrupts and starts sampling. While running, the ``hwlatd``
+thread reads the time in a loop. As interrupts are disabled, threads,
+IRQs, and SoftIRQs cannot interfere with the ``hwlatd`` thread. Hence, the
+cause of any gap between two different reads of the time is either an
+NMI or the hardware itself. At the end of the period, ``hwlatd`` enables
+interrupts and reports the max observed gap between the reads. It also
+prints an NMI occurrence counter. If the output does not report NMI
+executions, the user can conclude that the hardware is the culprit for
+the latency. The ``hwlat`` detects the NMI execution by observing
+the entry and exit of an NMI.
+
+The ``osnoise`` tracer leverages the ``hwlat_detector`` by running a
+similar loop with preemption, SoftIRQs and IRQs enabled, thus allowing
+all the sources of *osnoise* during its execution. Using the same approach
+as ``hwlat``, ``osnoise`` takes note of the entry and exit point of any
+source of interference, increasing a per-cpu interference counter. The
+``osnoise`` tracer also saves an interference counter for each source of
+interference. The interference counter for NMI, IRQs, SoftIRQs, and
+threads is increased anytime the tool observes these interferences' entry
+events. When a noise happens without any interference from the operating
+system level, the hardware noise counter increases, pointing to a
+hardware-related noise. In this way, ``osnoise`` can account for any
+source of interference. At the end of the period, the ``osnoise`` tracer
+prints the sum of all noise, the max single noise, the percentage of CPU
+available for the thread, and the counters for the noise sources.
+
+Usage
+-----
+
+Write the ASCII text ``osnoise`` into the ``current_tracer`` file of the
+tracing system (generally mounted at ``/sys/kernel/tracing`` or
+``/sys/kernel/debug/tracing``).
+
+For example::
+
+        [root@f32 ~]# cd /sys/kernel/tracing/
+        [root@f32 tracing]# echo osnoise > current_tracer
+
+It is possible to follow the trace by reading the ``trace`` file::
+
+        [root@f32 tracing]# cat trace
+        # tracer: osnoise
+        #
+        #                                _-----=> irqs-off
+        #                               / _----=> need-resched
+        #                              | / _---=> hardirq/softirq
+        #                              || / _--=> preempt-depth                            MAX
+        #                              || /                                             SINGLE     Interference counters:
+        #                              ||||               RUNTIME      NOISE   % OF CPU  NOISE    +-----------------------------+
+        #           TASK-PID      CPU# ||||   TIMESTAMP    IN US       IN US  AVAILABLE  IN US     HW    NMI    IRQ   SIRQ THREAD
+        #              | |         |   ||||      |           |             |    |            |      |      |      |      |      |
+                   <...>-859     [000] ....    81.637220: 1000000        190  99.98100       9     18      0   1007     18      1
+                   <...>-860     [001] ....    81.638154: 1000000        656  99.93440      74     23      0   1006     16      3
+                   <...>-861     [002] ....    81.638193: 1000000       5675  99.43250     202      6      0   1013     25     21
+                   <...>-862     [003] ....    81.638242: 1000000        125  99.98750      45      1      0   1011     23      0
+                   <...>-863     [004] ....    81.638260: 1000000       1721  99.82790     168      7      0   1002     49     41
+                   <...>-864     [005] ....    81.638286: 1000000        263  99.97370      57      6      0   1006     26      2
+                   <...>-865     [006] ....    81.638302: 1000000        109  99.98910      21      3      0   1006     18      1
+                   <...>-866     [007] ....    81.638326: 1000000       7816  99.21840     107      8      0   1016     39     19
+
+In addition to the regular trace fields (from TASK-PID to TIMESTAMP), the
+tracer prints a message at the end of each period for each CPU that is
+running an ``osnoise/`` thread. The osnoise specific fields report:
+
+ - The ``RUNTIME IN US`` reports the amount of time in microseconds that
+   the ``osnoise`` thread kept looping reading the time.
+ - The ``NOISE IN US`` reports the sum of noise in microseconds observed
+   by the osnoise tracer during the associated runtime.
+ - The ``% OF CPU AVAILABLE`` reports the percentage of CPU available for
+   the ``osnoise`` thread during the ``runtime`` window.
+ - The ``MAX SINGLE NOISE IN US`` reports the maximum single noise observed
+   during the ``runtime`` window.
+ - The ``Interference counters`` display how many times each of the
+   respective interferences happened during the ``runtime`` window.
+
+Note that the example above shows a high number of ``HW noise`` samples.
+The reason is that this sample was taken on a virtual machine,
+and the host interference is detected as a hardware interference.
+
+Tracer options
+--------------
+
+The tracer has a set of options inside the ``osnoise`` directory. They are:
+
+ - ``cpus``: CPUs on which an ``osnoise`` thread will execute.
+ - ``period_us``: the period of the ``osnoise`` thread.
+ - ``runtime_us``: how long an ``osnoise`` thread will look for noise.
+ - ``stop_tracing_single_us``: stop the system tracing if a single noise
+   higher than the configured value happens. Writing ``0`` disables this
+   option.
+ - ``stop_tracing_total_us``: stop the system tracing if a ``NOISE IN US``
+   higher than the configured value happens. Writing ``0`` disables this
+   option.
+ - ``tolerance_ns``: the minimum delta between two time() reads to be
+   considered as noise.
+
+Additional Tracing
+------------------
+
+In addition to the tracer, a set of ``tracepoints`` were added to
+facilitate the identification of the osnoise source.
+
+ - ``osnoise:sample_threshold``: printed anytime a noise is higher than
+   the configurable ``tolerance_ns``.
+ - ``osnoise:nmi_noise``: noise from NMI, including the duration.
+ - ``osnoise:irq_noise``: noise from an IRQ, including the duration.
+ - ``osnoise:softirq_noise``: noise from a SoftIRQ, including the
+   duration.
+ - ``osnoise:thread_noise``: noise from a thread, including the duration.
+
+Note that all the values are *net values*. This means that a *thread*
+duration will not contain the duration of the *IRQs* that happened during
+its execution, for example. The same is valid for all duration values.
+
+Here is one example of the usage of these ``tracepoints``::
+
+       osnoise/8-961     [008] d.h.  5789.857532: irq_noise: local_timer:236 start 5789.857529929 duration 1845 ns
+       osnoise/8-961     [008] dNh.  5789.858408: irq_noise: local_timer:236 start 5789.858404871 duration 2848 ns
+     migration/8-54      [008] d...  5789.858413: thread_noise: migration/8:54 start 5789.858409300 duration 3068 ns
+       osnoise/8-961     [008] ....  5789.858413: sample_threshold: start 5789.858404555 duration 8 us interferences 2
+
+In this example, a noise sample of 8 microseconds was reported in the last
+line, pointing to two interferences. Looking backward in the trace, the
+two previous entries were about the ``migration`` thread running after
+a timer IRQ execution. The first event is not part of the noise because
+it took place one millisecond before.
+
+It is worth noticing that the sum of the durations reported in the
+``tracepoints`` is smaller than the eight us reported in the
+``sample_threshold``. The reason lies in the tracing overhead and in
+the overhead of the entry and exit code that happens before and after
+any interference execution. This justifies the dual approach: measuring
+thread and tracing.
diff --git a/include/linux/ftrace_irq.h b/include/linux/ftrace_irq.h
index 0abd9a1d2852..fd54045980ce 100644
--- a/include/linux/ftrace_irq.h
+++ b/include/linux/ftrace_irq.h
@@ -7,12 +7,24 @@ extern bool trace_hwlat_callback_enabled;
 extern void trace_hwlat_callback(bool enter);
 #endif
 
+/*
+ * XXX: Make it generic
+ */
+#ifdef CONFIG_OSNOISE_TRACER
+extern bool trace_osnoise_callback_enabled;
+extern void trace_osnoise_callback(bool enter);
+#endif
+
 static inline void ftrace_nmi_enter(void)
 {
 #ifdef CONFIG_HWLAT_TRACER
 	if (trace_hwlat_callback_enabled)
 		trace_hwlat_callback(true);
 #endif
+#ifdef CONFIG_OSNOISE_TRACER
+	if (trace_osnoise_callback_enabled)
+		trace_osnoise_callback(true);
+#endif
 }
 
 static inline void ftrace_nmi_exit(void)
@@ -21,6 +33,10 @@ static inline void ftrace_nmi_exit(void)
 	if (trace_hwlat_callback_enabled)
 		trace_hwlat_callback(false);
 #endif
+#ifdef CONFIG_OSNOISE_TRACER
+	if (trace_osnoise_callback_enabled)
+		trace_osnoise_callback(false);
+#endif
 }
 
 #endif /* _LINUX_FTRACE_IRQ_H */
diff --git a/include/trace/events/osnoise.h b/include/trace/events/osnoise.h
new file mode 100644
index 000000000000..81939234814b
--- /dev/null
+++ b/include/trace/events/osnoise.h
@@ -0,0 +1,141 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM osnoise
+
+#if !defined(_OSNOISE_TRACE_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _OSNOISE_TRACE_H
+
+#include <linux/tracepoint.h>
+TRACE_EVENT(thread_noise,
+
+	TP_PROTO(struct task_struct *t, u64 start, u64 duration),
+
+	TP_ARGS(t, start, duration),
+
+	TP_STRUCT__entry(
+		__array(	char,		comm,	TASK_COMM_LEN)
+		__field(	pid_t,		pid	)
+		__field(	u64,		start	)
+		__field(	u64,		duration)
+	),
+
+	TP_fast_assign(
+		memcpy(__entry->comm, t->comm, TASK_COMM_LEN);
+		__entry->pid = t->pid;
+		__entry->start = start;
+		__entry->duration = duration;
+	),
+
+	TP_printk("%8s:%d start %llu.%09u duration %llu ns",
+		__entry->comm,
+		__entry->pid,
+		__print_ns_to_secs(__entry->start),
+		__print_ns_without_secs(__entry->start),
+		__entry->duration)
+);
+
+TRACE_EVENT(softirq_noise,
+
+	TP_PROTO(int vector, u64 start, u64 duration),
+
+	TP_ARGS(vector, start, duration),
+
+	TP_STRUCT__entry(
+		__field(	int,		vector	)
+		__field(	u64,		start	)
+		__field(	u64,		duration)
+	),
+
+	TP_fast_assign(
+		__entry->vector = vector;
+		__entry->start = start;
+		__entry->duration = duration;
+	),
+
+	TP_printk("%8s:%d start %llu.%09u duration %llu ns",
+		show_softirq_name(__entry->vector),
+		__entry->vector,
+		__print_ns_to_secs(__entry->start),
+		__print_ns_without_secs(__entry->start),
+		__entry->duration)
+);
+
+TRACE_EVENT(irq_noise,
+
+	TP_PROTO(int vector, const char *desc, u64 start, u64 duration),
+
+	TP_ARGS(vector, desc, start, duration),
+
+	TP_STRUCT__entry(
+		__string(	desc,		desc    )
+		__field(	int,		vector	)
+		__field(	u64,		start	)
+		__field(	u64,		duration)
+	),
+
+	TP_fast_assign(
+		__assign_str(desc, desc);
+		__entry->vector = vector;
+		__entry->start = start;
+		__entry->duration = duration;
+	),
+
+	TP_printk("%s:%d start %llu.%09u duration %llu ns",
+		__get_str(desc),
+		__entry->vector,
+		__print_ns_to_secs(__entry->start),
+		__print_ns_without_secs(__entry->start),
+		__entry->duration)
+);
+
+TRACE_EVENT(nmi_noise,
+
+	TP_PROTO(u64 start, u64 duration),
+
+	TP_ARGS(start, duration),
+
+	TP_STRUCT__entry(
+		__field(	u64,		start	)
+		__field(	u64,		duration)
+	),
+
+	TP_fast_assign(
+		__entry->start = start;
+		__entry->duration = duration;
+	),
+
+	TP_printk("start %llu.%09u duration %llu ns",
+		__print_ns_to_secs(__entry->start),
+		__print_ns_without_secs(__entry->start),
+		__entry->duration)
+);
+
+TRACE_EVENT(sample_threshold,
+
+	TP_PROTO(u64 start, u64 duration, u64 interference),
+
+	TP_ARGS(start, duration, interference),
+
+	TP_STRUCT__entry(
+		__field(	u64,		start	)
+		__field(	u64,		duration)
+		__field(	u64,		interference)
+	),
+
+	TP_fast_assign(
+		__entry->start = start;
+		__entry->duration = duration;
+		__entry->interference = interference;
+	),
+
+	TP_printk("start %llu.%09u duration %llu us interferences %llu",
+		__print_ns_to_secs(__entry->start),
+		__print_ns_without_secs(__entry->start),
+		__entry->duration,
+		__entry->interference)
+);
+
+#endif /* _OSNOISE_TRACE_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 7fa82778c3e6..41582ae4682b 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -356,6 +356,40 @@ config HWLAT_TRACER
 	 file. Every time a latency is greater than tracing_thresh, it will
 	 be recorded into the ring buffer.
 
+config OSNOISE_TRACER
+	bool "OS Noise tracer"
+	select GENERIC_TRACER
+	help
+	  In the context of high-performance computing (HPC), the Operating
+	  System Noise (osnoise) refers to the interference experienced by an
+	  application due to activities inside the operating system. In the
+	  context of Linux, NMIs, IRQs, SoftIRQs, and any other system thread
+	  can cause noise to the system. Moreover, hardware-related jobs can
+	  also cause noise, for example, via SMIs.
+
+	  The osnoise tracer leverages the hwlat_detector by running a similar
+	  loop with preemption, SoftIRQs and IRQs enabled, thus allowing all
+	  the sources of osnoise during its execution. The osnoise tracer takes
+	  note of the entry and exit point of any source of interferences,
+	  increasing a per-cpu interference counter. It saves an interference
+	  counter for each source of interference. The interference counter for
+	  NMI, IRQs, SoftIRQs, and threads is increased anytime the tool
+	  observes these interferences' entry events. When a noise happens
+	  without any interference from the operating system level, the
+	  hardware noise counter increases, pointing to a hardware-related
+	  noise. In this way, osnoise can account for any source of
+	  interference. At the end of the period, the osnoise tracer prints
+	  the sum of all noise, the max single noise, the percentage of CPU
+	  available for the thread, and the counters for the noise sources.
+
+	  In addition to the tracer, a set of tracepoints were added to
+	  facilitate the identification of the osnoise source.
+
+	  The output will appear in the trace and trace_pipe files.
+
+	  To enable this tracer, echo "osnoise" into the current_tracer
+	  file.
+
 config MMIOTRACE
 	bool "Memory mapped IO tracing"
 	depends on HAVE_MMIOTRACE_SUPPORT && PCI
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index b28d3e5013cd..b1c47ccf4f73 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -58,6 +58,7 @@ obj-$(CONFIG_IRQSOFF_TRACER) += trace_irqsoff.o
 obj-$(CONFIG_PREEMPT_TRACER) += trace_irqsoff.o
 obj-$(CONFIG_SCHED_TRACER) += trace_sched_wakeup.o
 obj-$(CONFIG_HWLAT_TRACER) += trace_hwlat.o
+obj-$(CONFIG_OSNOISE_TRACER) += trace_osnoise.o
 obj-$(CONFIG_NOP_TRACER) += trace_nop.o
 obj-$(CONFIG_STACK_TRACER) += trace_stack.o
 obj-$(CONFIG_MMIOTRACE) += trace_mmiotrace.o
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index a6446c03cfbc..9a66e3a1df6e 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -44,6 +44,7 @@ enum trace_type {
 	TRACE_BLK,
 	TRACE_BPUTS,
 	TRACE_HWLAT,
+	TRACE_OSNOISE,
 	TRACE_RAW_DATA,
 
 	__TRACE_LAST_TYPE,
@@ -285,7 +286,8 @@ struct trace_array {
 	struct array_buffer	max_buffer;
 	bool			allocated_snapshot;
 #endif
-#if defined(CONFIG_TRACER_MAX_TRACE) || defined(CONFIG_HWLAT_TRACER)
+#if defined(CONFIG_TRACER_MAX_TRACE) || defined(CONFIG_HWLAT_TRACER) \
+	|| defined(CONFIG_OSNOISE_TRACER)
 	unsigned long		max_latency;
 #ifdef CONFIG_FSNOTIFY
 	struct dentry		*d_max_latency;
@@ -431,6 +433,7 @@ extern void __ftrace_bad_type(void);
 		IF_ASSIGN(var, ent, struct bprint_entry, TRACE_BPRINT);	\
 		IF_ASSIGN(var, ent, struct bputs_entry, TRACE_BPUTS);	\
 		IF_ASSIGN(var, ent, struct hwlat_entry, TRACE_HWLAT);	\
+		IF_ASSIGN(var, ent, struct osnoise_entry, TRACE_OSNOISE);\
 		IF_ASSIGN(var, ent, struct raw_data_entry, TRACE_RAW_DATA);\
 		IF_ASSIGN(var, ent, struct trace_mmiotrace_rw,		\
 			  TRACE_MMIO_RW);				\
@@ -656,8 +659,8 @@ void update_max_tr_single(struct trace_array *tr,
 			  struct task_struct *tsk, int cpu);
 #endif /* CONFIG_TRACER_MAX_TRACE */
 
-#if (defined(CONFIG_TRACER_MAX_TRACE) || defined(CONFIG_HWLAT_TRACER)) && \
-	defined(CONFIG_FSNOTIFY)
+#if (defined(CONFIG_TRACER_MAX_TRACE) || defined(CONFIG_HWLAT_TRACER) \
+	|| defined(CONFIG_OSNOISE_TRACER)) && defined(CONFIG_FSNOTIFY)
 
 void latency_fsnotify(struct trace_array *tr);
 
diff --git a/kernel/trace/trace_entries.h b/kernel/trace/trace_entries.h
index 4547ac59da61..aed479e510cc 100644
--- a/kernel/trace/trace_entries.h
+++ b/kernel/trace/trace_entries.h
@@ -338,3 +338,30 @@ FTRACE_ENTRY(hwlat, hwlat_entry,
 		 __entry->nmi_total_ts,
 		 __entry->nmi_count)
 );
+
+FTRACE_ENTRY(osnoise, osnoise_entry,
+
+	TRACE_OSNOISE,
+
+	F_STRUCT(
+		__field(	u64,			noise		)
+		__field(	u64,			runtime		)
+		__field(	u64,			max_sample	)
+		__field(	unsigned int,		count		)
+		__field(	unsigned int,		hw_count	)
+		__field(	unsigned int,		nmi_count	)
+		__field(	unsigned int,		irq_count	)
+		__field(	unsigned int,		softirq_count	)
+		__field(	unsigned int,		thread_count	)
+	),
+
+	F_printk("noise:%llu\tmax_sample:%llu\tcount:%u\thw:%u\tnmi:%u\tirq:%u\tsoftirq:%u\tthread:%u\n",
+		 __entry->noise,
+		 __entry->max_sample,
+		 __entry->count,
+		 __entry->hw_count,
+		 __entry->nmi_count,
+		 __entry->irq_count,
+		 __entry->softirq_count,
+		 __entry->thread_count)
+);
diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
new file mode 100644
index 000000000000..e6b793725c96
--- /dev/null
+++ b/kernel/trace/trace_osnoise.c
@@ -0,0 +1,1714 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * OS Noise Tracer: computes the OS Noise suffered by a running thread.
+ *
+ * Based on "hwlat_detector" tracer by:
+ *   Copyright (C) 2008-2009 Jon Masters, Red Hat, Inc. <jcm@redhat.com>
+ *   Copyright (C) 2013-2016 Steven Rostedt, Red Hat, Inc. <srostedt@redhat.com>
+ *   With feedback from Clark Williams <williams@redhat.com>
+ *
+ * And also based on the rtsl tracer presented on:
+ *  DE OLIVEIRA, Daniel Bristot, et al. Demystifying the real-time linux
+ *  scheduling latency. In: 32nd Euromicro Conference on Real-Time Systems
+ *  (ECRTS 2020). Schloss Dagstuhl-Leibniz-Zentrum fur Informatik, 2020.
+ *
+ * Copyright (C) 2021 Daniel Bristot de Oliveira, Red Hat, Inc. <bristot@redhat.com>
+ */
+#include <linux/kthread.h>
+#include <linux/tracefs.h>
+#include <linux/uaccess.h>
+#include <linux/cpumask.h>
+#include <linux/delay.h>
+#include <linux/sched/clock.h>
+#include <linux/sched.h>
+#include "trace.h"
+
+#ifdef CONFIG_X86_LOCAL_APIC
+#include <asm/trace/irq_vectors.h>
+#undef TRACE_INCLUDE_PATH
+#undef TRACE_INCLUDE_FILE
+#endif /* CONFIG_X86_LOCAL_APIC */
+
+#include <trace/events/irq.h>
+#include <trace/events/sched.h>
+
+#define CREATE_TRACE_POINTS
+#include <trace/events/osnoise.h>
+
+static struct trace_array	*osnoise_trace;
+
+/*
+ * Default values.
+ */
+#define U64STR_SIZE		22			/* 20 digits max */
+#define BANNER			"osnoise: "
+#define DEFAULT_SAMPLE_PERIOD	1000000			/* 1s */
+#define DEFAULT_SAMPLE_RUNTIME	1000000			/* 1s */
+#define DEFAULT_TOLERANCE_NS	5000			/* 5 us */
+
+/*
+ * NMI runtime info.
+ */
+struct nmi {
+	u64 count;
+	u64 delta_start;
+	u64 max;
+};
+
+/*
+ * IRQ runtime info.
+ */
+struct irq {
+	u64 count;
+	u64 arrival_time;
+	u64 delta_start;
+};
+
+/*
+ * SoftIRQ runtime info.
+ */
+struct softirq {
+	u64 count;
+	u64 arrival_time;
+	u64 delta_start;
+};
+
+/*
+ * Thread runtime info.
+ */
+struct thread {
+	u64 count;
+	u64 arrival_time;
+	u64 delta_start;
+};
+
+/*
+ * Runtime information: this structure saves the runtime information used by
+ * one sampling thread.
+ */
+struct osnoise_variables {
+	struct task_struct *kthread;
+	bool sampling;
+	pid_t pid;
+	struct nmi nmi;
+	struct irq irq;
+	struct softirq softirq;
+	struct thread thread;
+	local_t int_counter;
+};
+
+/*
+ * Per-cpu runtime information.
+ */
+DEFINE_PER_CPU(struct osnoise_variables, per_cpu_osnoise_var);
+
+/**
+ * this_cpu_osn_var - Return the per-cpu osnoise_variables on its relative CPU
+ */
+static inline struct osnoise_variables *this_cpu_osn_var(void)
+{
+	return this_cpu_ptr(&per_cpu_osnoise_var);
+}
+
+/**
+ * osn_var_reset - Reset the values of the given osnoise_variables
+ */
+static inline void osn_var_reset(struct osnoise_variables *osn_var)
+{
+	/*
+	 * So far, all the values are initialized as 0, so
+	 * zeroing the structure is perfect.
+	 */
+	memset(osn_var, 0, sizeof(struct osnoise_variables));
+}
+
+/**
+ * osn_var_reset_all - Reset the value of all per-cpu osnoise_variables
+ */
+static inline void osn_var_reset_all(void)
+{
+	struct osnoise_variables *osn_var;
+	int cpu;
+
+	for_each_cpu(cpu, cpu_online_mask) {
+		osn_var = per_cpu_ptr(&per_cpu_osnoise_var, cpu);
+		osn_var_reset(osn_var);
+	}
+}
+
+/*
+ * Tells NMIs to call back to the osnoise tracer to record timestamps.
+ */
+bool trace_osnoise_callback_enabled;
+
+/*
+ * osnoise sample structure definition. Used to store the statistics of a
+ * sample run.
+ */
+struct osnoise_sample {
+	u64			runtime;	/* runtime */
+	u64			noise;		/* noise */
+	u64			max_sample;	/* max single noise sample */
+	int			hw_count;	/* # HW (incl. hypervisor) interference */
+	int			nmi_count;	/* # NMIs during this sample */
+	int			irq_count;	/* # IRQs during this sample */
+	int			softirq_count;	/* # SoftIRQs during this sample */
+	int			thread_count;	/* # Threads during this sample */
+	int			count;		/* # of iterations over thresh */
+};
+
+/*
+ * Tracer data.
+ */
+static struct osnoise_data {
+	struct mutex lock;		/* protect changes */
+
+	u64	sample_period;		/* total sampling period */
+	u64	sample_runtime;		/* active sampling portion of period */
+	u64	noise_tolerance_ns;	/* minimum noise to be considered */
+	u64	stop_tracing_single_max;/* stop sampling a CPU if single > */
+	u64	stop_tracing_total_max;	/* stop sampling a CPU if total > */
+} osnoise_data = {
+	.sample_period			= DEFAULT_SAMPLE_PERIOD,
+	.sample_runtime			= DEFAULT_SAMPLE_RUNTIME,
+	.noise_tolerance_ns		= DEFAULT_TOLERANCE_NS,
+	.stop_tracing_single_max	= 0,
+	.stop_tracing_total_max		= 0,
+};
+
+/*
+ * Boolean variable used to inform that the tracer is currently sampling.
+ */
+static bool osnoise_busy;
+
+/*
+ * Print the osnoise header info.
+ */
+static void print_osnoise_headers(struct seq_file *s)
+{
+	seq_puts(s, "#                                _-----=> irqs-off\n");
+	seq_puts(s, "#                               / _----=> need-resched\n");
+	seq_puts(s, "#                              | / _---=> hardirq/softirq\n");
+	seq_puts(s, "#                              || / _--=> preempt-depth     ");
+	seq_puts(s, "                       MAX\n");
+
+	seq_puts(s, "#                              || /                         ");
+	seq_puts(s, "                    SINGLE      Interference counters:\n");
+
+	seq_puts(s, "#                              ||||               RUNTIME   ");
+	seq_puts(s, "   NOISE  %% OF CPU  NOISE    +-----------------------------+\n");
+
+	seq_puts(s, "#           TASK-PID      CPU# ||||   TIMESTAMP    IN US    ");
+	seq_puts(s, "   IN US  AVAILABLE  IN US     HW    NMI    IRQ   SIRQ THREAD\n");
+
+	seq_puts(s, "#              | |         |   ||||      |           |      ");
+	seq_puts(s, "       |    |            |      |      |      |      |      |\n");
+}
+
+/*
+ * Record an osnoise_sample into the tracer buffer.
+ */
+static void trace_osnoise_sample(struct osnoise_sample *sample)
+{
+	struct trace_array *tr = osnoise_trace;
+	struct trace_buffer *buffer = tr->array_buffer.buffer;
+	struct trace_event_call *call = &event_osnoise;
+	struct ring_buffer_event *event;
+	struct osnoise_entry *entry;
+
+	event = trace_buffer_lock_reserve(buffer, TRACE_OSNOISE, sizeof(*entry),
+					  tracing_gen_ctx());
+	if (!event)
+		return;
+	entry	= ring_buffer_event_data(event);
+	entry->runtime		= sample->runtime;
+	entry->noise		= sample->noise;
+	entry->max_sample	= sample->max_sample;
+	entry->hw_count		= sample->hw_count;
+	entry->nmi_count	= sample->nmi_count;
+	entry->irq_count	= sample->irq_count;
+	entry->softirq_count	= sample->softirq_count;
+	entry->thread_count	= sample->thread_count;
+	entry->count		= sample->count;
+
+	if (!call_filter_check_discard(call, entry, buffer, event))
+		trace_buffer_unlock_commit_nostack(buffer, event);
+}
+
+/*
+ * Macros to encapsulate the time capturing infrastructure.
+ */
+#define time_type	u64
+#define time_get()	trace_clock_local()
+#define time_to_us(x)	div_u64(x, 1000)
+#define time_sub(a, b)	((a) - (b))
+
+/**
+ * cond_move_irq_delta_start - Forward the delta_start of a running IRQ
+ *
+ * If an IRQ is preempted by an NMI, its delta_start is pushed forward
+ * to discount the NMI interference.
+ *
+ * See get_int_safe_duration().
+ */
+static inline void
+cond_move_irq_delta_start(struct osnoise_variables *osn_var, u64 duration)
+{
+	if (osn_var->irq.delta_start)
+		osn_var->irq.delta_start += duration;
+}
+
+#ifndef CONFIG_PREEMPT_RT
+/**
+ * cond_move_softirq_delta_start - Forward the delta_start of a running SoftIRQ
+ *
+ * If a SoftIRQ is preempted by an IRQ or NMI, its delta_start is pushed
+ * forward to discount the interference.
+ *
+ * See get_int_safe_duration().
+ */
+static inline void
+cond_move_softirq_delta_start(struct osnoise_variables *osn_var, u64 duration)
+{
+	if (osn_var->softirq.delta_start)
+		osn_var->softirq.delta_start += duration;
+}
+#else /* CONFIG_PREEMPT_RT */
+#define cond_move_softirq_delta_start(osn_var, duration) do {} while (0)
+#endif
+
+/**
+ * cond_move_thread_delta_start - Forward the delta_start of a running thread
+ *
+ * If a noisy thread is preempted by a SoftIRQ, IRQ or NMI, its delta_start
+ * is pushed forward to discount the interference.
+ *
+ * See get_int_safe_duration().
+ */
+static inline void
+cond_move_thread_delta_start(struct osnoise_variables *osn_var, u64 duration)
+{
+	if (osn_var->thread.delta_start)
+		osn_var->thread.delta_start += duration;
+}
+
+/**
+ * get_int_safe_duration - Get the duration of a window
+ *
+ * The irq, softirq and thread variables need to have their duration without
+ * the interference from higher priority interrupts. Instead of keeping a
+ * variable to discount the interrupt interference from these variables, the
+ * starting time of these variables is pushed forward with the interrupt's
+ * duration. In this way, a single variable is used to:
+ *
+ *   - Know if a given window is being measured.
+ *   - Account its duration.
+ *   - Discount the interference.
+ *
+ * To avoid getting inconsistent values, e.g.,:
+ *
+ *	now = time_get()
+ *		--->	interrupt!
+ *			delta_start += int duration;
+ *		<---
+ *	duration = now - delta_start;
+ *
+ *	result: negative duration if the variable duration before the
+ *	interrupt was smaller than the interrupt execution.
+ *
+ * A counter of interrupts is used. If the counter increased, try
+ * to capture an interference safe duration.
+ */
+static inline s64
+get_int_safe_duration(struct osnoise_variables *osn_var, u64 *delta_start)
+{
+	u64 int_counter, now;
+	s64 duration;
+
+	do {
+		int_counter = local_read(&osn_var->int_counter);
+		/* synchronize with interrupts */
+		barrier();
+
+		now = time_get();
+		duration = (now - *delta_start);
+
+		/* synchronize with interrupts */
+		barrier();
+	} while (int_counter != local_read(&osn_var->int_counter));
+
+	/*
+	 * This is evidence of a race condition that causes
+	 * a value to be "discounted" too much.
+	 */
+	if (duration < 0)
+		pr_err("int safe negative!\n");
+
+	*delta_start = 0;
+
+	return duration;
+}
+
+/**
+ *
+ * set_int_safe_time - Save the current time on *time, aware of interference
+ *
+ * Get the time, taking into consideration a possible interference from
+ * higher priority interrupts.
+ *
+ * See get_int_safe_duration() for an explanation.
+ */
+static u64
+set_int_safe_time(struct osnoise_variables *osn_var, u64 *time)
+{
+	u64 int_counter;
+
+	do {
+		int_counter = local_read(&osn_var->int_counter);
+		/* synchronize with interrupts */
+		barrier();
+
+		*time = time_get();
+
+		/* synchronize with interrupts */
+		barrier();
+	} while (int_counter != local_read(&osn_var->int_counter));
+
+	return int_counter;
+}
+
+/**
+ * trace_osnoise_callback - NMI entry/exit callback
+ *
+ * This function is called at NMI entry and exit. The bool enter
+ * distinguishes between the two cases. This function is used to note an NMI
+ * occurrence, compute the noise caused by the NMI, and to remove the noise
+ * it is potentially causing on other interference variables.
+ */
+void trace_osnoise_callback(bool enter)
+{
+	struct osnoise_variables *osn_var = this_cpu_osn_var();
+	u64 duration;
+
+	if (!osn_var->sampling)
+		return;
+
+	/*
+	 * Currently trace_clock_local() calls sched_clock() and the
+	 * generic version is not NMI safe.
+	 */
+	if (!IS_ENABLED(CONFIG_GENERIC_SCHED_CLOCK)) {
+		if (enter) {
+			osn_var->nmi.delta_start = time_get();
+			local_inc(&osn_var->int_counter);
+		} else {
+			duration = time_get() - osn_var->nmi.delta_start;
+
+			trace_nmi_noise(osn_var->nmi.delta_start, duration);
+
+			if (duration > osn_var->nmi.max)
+				osn_var->nmi.max = duration;
+
+			cond_move_irq_delta_start(osn_var, duration);
+			cond_move_softirq_delta_start(osn_var, duration);
+			cond_move_thread_delta_start(osn_var, duration);
+		}
+	}
+
+	if (enter)
+		osn_var->nmi.count++;
+}
+
+/**
+ * __trace_irq_entry - Note the starting of an IRQ
+ *
+ * Save the starting time of an IRQ. As IRQs are non-preemptive to other IRQs,
+ * it is safe to use a single variable (osn_var->irq) to save the statistics.
+ * The arrival_time is used to report... the arrival time. The delta_start
+ * is used to compute the duration at the IRQ exit handler. See
+ * cond_move_irq_delta_start().
+ */
+static inline void __trace_irq_entry(int id)
+{
+	struct osnoise_variables *osn_var = this_cpu_osn_var();
+
+	if (!osn_var->sampling)
+		return;
+	/*
+	 * This value will be used in the report, but not to compute
+	 * the execution time, so it is safe to get it unsafe.
+	 */
+	osn_var->irq.arrival_time = time_get();
+	set_int_safe_time(osn_var, &osn_var->irq.delta_start);
+	osn_var->irq.count++;
+
+	local_inc(&osn_var->int_counter);
+}
+
+/**
+ * __trace_irq_exit - Note the end of an IRQ, save data and trace
+ *
+ * Computes the duration of the IRQ noise and traces it. Also discounts the
+ * interference from the other sources of noise currently being accounted.
+ */
+static inline void __trace_irq_exit(int id, const char *desc)
+{
+	struct osnoise_variables *osn_var = this_cpu_osn_var();
+	s64 duration;
+
+	if (!osn_var->sampling)
+		return;
+
+	duration = get_int_safe_duration(osn_var, &osn_var->irq.delta_start);
+	trace_irq_noise(id, desc, osn_var->irq.arrival_time, duration);
+	osn_var->irq.arrival_time = 0;
+	cond_move_softirq_delta_start(osn_var, duration);
+	cond_move_thread_delta_start(osn_var, duration);
+}
+
+/**
+ * trace_irqentry_callback - Callback to the irq:irq_handler_entry trace event
+ *
+ * Used to note the start of an IRQ occurrence.
+ */
+void trace_irqentry_callback(void *data, int irq, struct irqaction *action)
+{
+	__trace_irq_entry(irq);
+}
+
+/**
+ * trace_irqexit_callback - Callback to the irq:irq_handler_exit trace event
+ *
+ * Used to note the end of an IRQ occurrence.
+ */
+void trace_irqexit_callback(void *data, int irq, struct irqaction *action, int ret)
+{
+	__trace_irq_exit(irq, action->name);
+}
+
+#ifdef CONFIG_X86_LOCAL_APIC
+/**
+ * trace_intel_irq_entry - record intel specific IRQ entry
+ */
+void trace_intel_irq_entry(void *data, int vector)
+{
+	__trace_irq_entry(vector);
+}
+
+/**
+ * trace_intel_irq_exit - record intel specific IRQ exit
+ */
+void trace_intel_irq_exit(void *data, int vector)
+{
+	char *vector_desc = (char *) data;
+
+	__trace_irq_exit(vector, vector_desc);
+}
+
+/**
+ * register_intel_irq_tp - Register intel specific IRQ entry tracepoints
+ */
+static int register_intel_irq_tp(void)
+{
+	int ret;
+
+	ret = register_trace_local_timer_entry(trace_intel_irq_entry, NULL);
+	if (ret)
+		goto out_err;
+
+	ret = register_trace_local_timer_exit(trace_intel_irq_exit, "local_timer");
+	if (ret)
+		goto out_timer_entry;
+
+#ifdef CONFIG_X86_THERMAL_VECTOR
+	ret = register_trace_thermal_apic_entry(trace_intel_irq_entry, NULL);
+	if (ret)
+		goto out_timer_exit;
+
+	ret = register_trace_thermal_apic_exit(trace_intel_irq_exit, "thermal_apic");
+	if (ret)
+		goto out_thermal_entry;
+#endif /* CONFIG_X86_THERMAL_VECTOR */
+
+#ifdef CONFIG_X86_MCE_AMD
+	ret = register_trace_deferred_error_apic_entry(trace_intel_irq_entry, NULL);
+	if (ret)
+		goto out_thermal_exit;
+
+	ret = register_trace_deferred_error_apic_exit(trace_intel_irq_exit, "deferred_error");
+	if (ret)
+		goto out_deferred_entry;
+#endif
+
+#ifdef CONFIG_X86_MCE_THRESHOLD
+	ret = register_trace_threshold_apic_entry(trace_intel_irq_entry, NULL);
+	if (ret)
+		goto out_deferred_exit;
+
+	ret = register_trace_threshold_apic_exit(trace_intel_irq_exit, "threshold_apic");
+	if (ret)
+		goto out_threshold_entry;
+#endif /* CONFIG_X86_MCE_THRESHOLD */
+
+#ifdef CONFIG_SMP
+	ret = register_trace_call_function_single_entry(trace_intel_irq_entry, NULL);
+	if (ret)
+		goto out_threshold_exit;
+
+	ret = register_trace_call_function_single_exit(trace_intel_irq_exit,
+						       "call_function_single");
+	if (ret)
+		goto out_call_function_single_entry;
+
+	ret = register_trace_call_function_entry(trace_intel_irq_entry, NULL);
+	if (ret)
+		goto out_call_function_single_exit;
+
+	ret = register_trace_call_function_exit(trace_intel_irq_exit, "call_function");
+	if (ret)
+		goto out_call_function_entry;
+
+	ret = register_trace_reschedule_entry(trace_intel_irq_entry, NULL);
+	if (ret)
+		goto out_call_function_exit;
+
+	ret = register_trace_reschedule_exit(trace_intel_irq_exit, "reschedule");
+	if (ret)
+		goto out_reschedule_entry;
+#endif /* CONFIG_SMP */
+
+#ifdef CONFIG_IRQ_WORK
+	ret = register_trace_irq_work_entry(trace_intel_irq_entry, NULL);
+	if (ret)
+		goto out_reschedule_exit;
+
+	ret = register_trace_irq_work_exit(trace_intel_irq_exit, "irq_work");
+	if (ret)
+		goto out_irq_work_entry;
+#endif
+
+	ret = register_trace_x86_platform_ipi_entry(trace_intel_irq_entry, NULL);
+	if (ret)
+		goto out_irq_work_exit;
+
+	ret = register_trace_x86_platform_ipi_exit(trace_intel_irq_exit, "x86_platform_ipi");
+	if (ret)
+		goto out_x86_ipi_entry;
+
+	ret = register_trace_error_apic_entry(trace_intel_irq_entry, NULL);
+	if (ret)
+		goto out_x86_ipi_exit;
+
+	ret = register_trace_error_apic_exit(trace_intel_irq_exit, "error_apic");
+	if (ret)
+		goto out_error_apic_entry;
+
+	ret = register_trace_spurious_apic_entry(trace_intel_irq_entry, NULL);
+	if (ret)
+		goto out_error_apic_exit;
+
+	ret = register_trace_spurious_apic_exit(trace_intel_irq_exit, "spurious_apic");
+	if (ret)
+		goto out_spurious_apic_entry;
+
+	return 0;
+
+out_spurious_apic_entry:
+	unregister_trace_spurious_apic_entry(trace_intel_irq_entry, NULL);
+out_error_apic_exit:
+	unregister_trace_error_apic_exit(trace_intel_irq_exit, "error_apic");
+out_error_apic_entry:
+	unregister_trace_error_apic_entry(trace_intel_irq_entry, NULL);
+out_x86_ipi_exit:
+	unregister_trace_x86_platform_ipi_exit(trace_intel_irq_exit, "x86_platform_ipi");
+out_x86_ipi_entry:
+	unregister_trace_x86_platform_ipi_entry(trace_intel_irq_entry, NULL);
+out_irq_work_exit:
+
+#ifdef CONFIG_IRQ_WORK
+	unregister_trace_irq_work_exit(trace_intel_irq_exit, "irq_work");
+out_irq_work_entry:
+	unregister_trace_irq_work_entry(trace_intel_irq_entry, NULL);
+out_reschedule_exit:
+#endif
+
+#ifdef CONFIG_SMP
+	unregister_trace_reschedule_exit(trace_intel_irq_exit, "reschedule");
+out_reschedule_entry:
+	unregister_trace_reschedule_entry(trace_intel_irq_entry, NULL);
+out_call_function_exit:
+	unregister_trace_call_function_exit(trace_intel_irq_exit, "call_function");
+out_call_function_entry:
+	unregister_trace_call_function_entry(trace_intel_irq_entry, NULL);
+out_call_function_single_exit:
+	unregister_trace_call_function_single_exit(trace_intel_irq_exit, "call_function_single");
+out_call_function_single_entry:
+	unregister_trace_call_function_single_entry(trace_intel_irq_entry, NULL);
+out_threshold_exit:
+#endif
+
+#ifdef CONFIG_X86_MCE_THRESHOLD
+	unregister_trace_threshold_apic_exit(trace_intel_irq_exit, "threshold_apic");
+out_threshold_entry:
+	unregister_trace_threshold_apic_entry(trace_intel_irq_entry, NULL);
+out_deferred_exit:
+#endif
+
+#ifdef CONFIG_X86_MCE_AMD
+	unregister_trace_deferred_error_apic_exit(trace_intel_irq_exit, "deferred_error");
+out_deferred_entry:
+	unregister_trace_deferred_error_apic_entry(trace_intel_irq_entry, NULL);
+out_thermal_exit:
+#endif /* CONFIG_X86_MCE_AMD */
+
+#ifdef CONFIG_X86_THERMAL_VECTOR
+	unregister_trace_thermal_apic_exit(trace_intel_irq_exit, "thermal_apic");
+out_thermal_entry:
+	unregister_trace_thermal_apic_entry(trace_intel_irq_entry, NULL);
+out_timer_exit:
+#endif /* CONFIG_X86_THERMAL_VECTOR */
+
+	unregister_trace_local_timer_exit(trace_intel_irq_exit, "local_timer");
+out_timer_entry:
+	unregister_trace_local_timer_entry(trace_intel_irq_entry, NULL);
+out_err:
+	return -EINVAL;
+}
+
+static void unregister_intel_irq_tp(void)
+{
+	unregister_trace_spurious_apic_exit(trace_intel_irq_exit, "spurious_apic");
+	unregister_trace_spurious_apic_entry(trace_intel_irq_entry, NULL);
+	unregister_trace_error_apic_exit(trace_intel_irq_exit, "error_apic");
+	unregister_trace_error_apic_entry(trace_intel_irq_entry, NULL);
+	unregister_trace_x86_platform_ipi_exit(trace_intel_irq_exit, "x86_platform_ipi");
+	unregister_trace_x86_platform_ipi_entry(trace_intel_irq_entry, NULL);
+
+#ifdef CONFIG_IRQ_WORK
+	unregister_trace_irq_work_exit(trace_intel_irq_exit, "irq_work");
+	unregister_trace_irq_work_entry(trace_intel_irq_entry, NULL);
+#endif
+
+#ifdef CONFIG_SMP
+	unregister_trace_reschedule_exit(trace_intel_irq_exit, "reschedule");
+	unregister_trace_reschedule_entry(trace_intel_irq_entry, NULL);
+	unregister_trace_call_function_exit(trace_intel_irq_exit, "call_function");
+	unregister_trace_call_function_entry(trace_intel_irq_entry, NULL);
+	unregister_trace_call_function_single_exit(trace_intel_irq_exit, "call_function_single");
+	unregister_trace_call_function_single_entry(trace_intel_irq_entry, NULL);
+#endif
+
+#ifdef CONFIG_X86_MCE_THRESHOLD
+	unregister_trace_threshold_apic_exit(trace_intel_irq_exit, "threshold_apic");
+	unregister_trace_threshold_apic_entry(trace_intel_irq_entry, NULL);
+#endif
+
+#ifdef CONFIG_X86_MCE_AMD
+	unregister_trace_deferred_error_apic_exit(trace_intel_irq_exit, "deferred_error");
+	unregister_trace_deferred_error_apic_entry(trace_intel_irq_entry, NULL);
+#endif
+
+#ifdef CONFIG_X86_THERMAL_VECTOR
+	unregister_trace_thermal_apic_exit(trace_intel_irq_exit, "thermal_apic");
+	unregister_trace_thermal_apic_entry(trace_intel_irq_entry, NULL);
+#endif /* CONFIG_X86_THERMAL_VECTOR */
+
+	unregister_trace_local_timer_exit(trace_intel_irq_exit, "local_timer");
+	unregister_trace_local_timer_entry(trace_intel_irq_entry, NULL);
+}
+
+#else /* CONFIG_X86_LOCAL_APIC */
+#define register_intel_irq_tp() (0)
+#define unregister_intel_irq_tp() do {} while (0)
+#endif /* CONFIG_X86_LOCAL_APIC */
+
+/**
+ * hook_irq_events - Hook IRQ handling events
+ *
+ * This function hooks the IRQ related callbacks to the respective trace
+ * events.
+ */
+int hook_irq_events(void)
+{
+	int ret;
+
+	ret = register_trace_irq_handler_entry(trace_irqentry_callback, NULL);
+	if (ret)
+		goto out_err;
+
+	ret = register_trace_irq_handler_exit(trace_irqexit_callback, NULL);
+	if (ret)
+		goto out_unregister_entry;
+
+	ret = register_intel_irq_tp();
+	if (ret)
+		goto out_irq_exit;
+
+	return 0;
+
+out_irq_exit:
+	unregister_trace_irq_handler_exit(trace_irqexit_callback, NULL);
+out_unregister_entry:
+	unregister_trace_irq_handler_entry(trace_irqentry_callback, NULL);
+out_err:
+	return -EINVAL;
+}
+
+/**
+ * unhook_irq_events - Unhook IRQ handling events
+ *
+ * This function unhooks the IRQ related callbacks from the respective trace
+ * events.
+ */
+void unhook_irq_events(void)
+{
+	unregister_intel_irq_tp();
+	unregister_trace_irq_handler_exit(trace_irqexit_callback, NULL);
+	unregister_trace_irq_handler_entry(trace_irqentry_callback, NULL);
+}
+
+#ifndef CONFIG_PREEMPT_RT
+/**
+ * trace_softirq_entry_callback - Note the starting of a SoftIRQ
+ *
+ * Save the starting time of a SoftIRQ. As SoftIRQs are non-preemptive to
+ * other SoftIRQs, it is safe to use a single variable (osn_var->softirq)
+ * to save the statistics. The arrival_time is used to report... the
+ * arrival time. The delta_start is used to compute the duration at the
+ * SoftIRQ exit handler. See cond_move_softirq_delta_start().
+ */
+void trace_softirq_entry_callback(void *data, unsigned int vec_nr)
+{
+	struct osnoise_variables *osn_var = this_cpu_osn_var();
+
+	if (!osn_var->sampling)
+		return;
+	/*
+	 * This value will be used in the report, but not to compute
+	 * the execution time, so it is safe to get it unsafe.
+	 */
+	osn_var->softirq.arrival_time = time_get();
+	set_int_safe_time(osn_var, &osn_var->softirq.delta_start);
+	osn_var->softirq.count++;
+
+	local_inc(&osn_var->int_counter);
+}
+
+/**
+ * trace_softirq_exit_callback - Note the end of a SoftIRQ
+ *
+ * Computes the duration of the SoftIRQ noise and traces it. Also discounts the
+ * interference from the other sources of noise currently being accounted.
+ */
+void trace_softirq_exit_callback(void *data, unsigned int vec_nr)
+{
+	struct osnoise_variables *osn_var = this_cpu_osn_var();
+	s64 duration;
+
+	if (!osn_var->sampling)
+		return;
+
+	duration = get_int_safe_duration(osn_var, &osn_var->softirq.delta_start);
+	trace_softirq_noise(vec_nr, osn_var->softirq.arrival_time, duration);
+	cond_move_thread_delta_start(osn_var, duration);
+	osn_var->softirq.arrival_time = 0;
+}
+
+/**
+ * hook_softirq_events - Hook SoftIRQ handling events
+ *
+ * This function hooks the SoftIRQ related callbacks to the respective trace
+ * events.
+ */
+static int hook_softirq_events(void)
+{
+	int ret;
+
+	ret = register_trace_softirq_entry(trace_softirq_entry_callback, NULL);
+	if (ret)
+		goto out_err;
+
+	ret = register_trace_softirq_exit(trace_softirq_exit_callback, NULL);
+	if (ret)
+		goto out_unreg_entry;
+
+	return 0;
+
+out_unreg_entry:
+	unregister_trace_softirq_entry(trace_softirq_entry_callback, NULL);
+out_err:
+	return -EINVAL;
+}
+
+/**
+ * unhook_softirq_events - Unhook SoftIRQ handling events
+ *
+ * This function unhooks the SoftIRQ related callbacks from the respective
+ * trace events.
+ */
+static void unhook_softirq_events(void)
+{
+	unregister_trace_softirq_entry(trace_softirq_entry_callback, NULL);
+	unregister_trace_softirq_exit(trace_softirq_exit_callback, NULL);
+}
+#else /* CONFIG_PREEMPT_RT */
+/*
+ * SoftIRQs are threads in PREEMPT_RT mode.
+ */
+static int hook_softirq_events(void)
+{
+	return 0;
+}
+static void unhook_softirq_events(void)
+{
+}
+#endif
+
+/**
+ * thread_entry - Record the starting of a thread noise window
+ *
+ * It saves the context switch time for a noisy thread, and increments
+ * the interference counters.
+ */
+static void
+thread_entry(struct osnoise_variables *osn_var, struct task_struct *t)
+{
+	if (!osn_var->sampling)
+		return;
+	/*
+	 * The arrival time will be used in the report, but not to compute
+	 * the execution time, so it is safe to get it unsafe.
+	 */
+	osn_var->thread.arrival_time = time_get();
+
+	set_int_safe_time(osn_var, &osn_var->thread.delta_start);
+
+	osn_var->thread.count++;
+	local_inc(&osn_var->int_counter);
+}
+
+/**
+ * thread_exit - Report the end of a thread noise window
+ *
+ * It computes the total noise from a thread, tracing if needed.
+ */
+static void
+thread_exit(struct osnoise_variables *osn_var, struct task_struct *t)
+{
+	s64 duration;
+
+	if (!osn_var->sampling)
+		return;
+
+	duration = get_int_safe_duration(osn_var, &osn_var->thread.delta_start);
+
+	trace_thread_noise(t, osn_var->thread.arrival_time, duration);
+
+	osn_var->thread.arrival_time = 0;
+}
+
+/**
+ * trace_sched_switch_callback - sched:sched_switch trace event handler
+ *
+ * This function is hooked to the sched:sched_switch trace event, and it is
+ * used to record the beginning and to report the end of a thread noise window.
+ */
+void
+trace_sched_switch_callback(void *data, bool preempt, struct task_struct *p,
+			    struct task_struct *n)
+{
+	struct osnoise_variables *osn_var = this_cpu_osn_var();
+
+	if (p->pid != osn_var->pid)
+		thread_exit(osn_var, p);
+
+	if (n->pid != osn_var->pid)
+		thread_entry(osn_var, n);
+}
+
+/**
+ * hook_thread_events - Hook the instrumentation for thread noise
+ *
+ * Hook the osnoise tracer callbacks to handle the noise from other
+ * threads on the necessary kernel events.
+ */
+int hook_thread_events(void)
+{
+	int ret;
+
+	ret = register_trace_sched_switch(trace_sched_switch_callback, NULL);
+	if (ret)
+		return -EINVAL;
+
+	return 0;
+}
+
+/**
+ * unhook_thread_events - Unhook the instrumentation for thread noise
+ *
+ * Unhook the osnoise tracer callbacks to handle the noise from other
+ * threads on the necessary kernel events.
+ */
+void unhook_thread_events(void)
+{
+	unregister_trace_sched_switch(trace_sched_switch_callback, NULL);
+}
+
+/**
+ * save_osn_sample_stats - Save the osnoise_sample statistics
+ *
+ * Save the osnoise_sample statistics before the sampling phase. These
+ * values will be used later to compute the diff between the statistics
+ * before and after the osnoise sampling.
+ */
+void save_osn_sample_stats(struct osnoise_variables *osn_var, struct osnoise_sample *s)
+{
+	s->nmi_count = osn_var->nmi.count;
+	s->irq_count = osn_var->irq.count;
+	s->softirq_count = osn_var->softirq.count;
+	s->thread_count = osn_var->thread.count;
+}
+
+/**
+ * diff_osn_sample_stats - Compute the osnoise_sample statistics
+ *
+ * After a sample period, compute the difference on the osnoise_sample
+ * statistics. The struct osnoise_sample *s contains the statistics saved via
+ * save_osn_sample_stats() before the osnoise sampling.
+ */
+void diff_osn_sample_stats(struct osnoise_variables *osn_var, struct osnoise_sample *s)
+{
+	s->nmi_count = osn_var->nmi.count - s->nmi_count;
+	s->irq_count = osn_var->irq.count - s->irq_count;
+	s->softirq_count = osn_var->softirq.count - s->softirq_count;
+	s->thread_count = osn_var->thread.count - s->thread_count;
+}
+
+/**
+ * run_osnoise - Sample the time and look for osnoise
+ *
+ * Used to capture the time, looking for potential osnoise latency repeatedly.
+ * Different from hwlat_detector, it is called with preemption and interrupts
+ * enabled. This allows irqs, softirqs and threads to run, interfering with
+ * the osnoise sampling thread, as they would do with a regular thread.
+ */
+static int run_osnoise(void)
+{
+	struct osnoise_variables *osn_var = this_cpu_osn_var();
+	u64 noise = 0, sum_noise = 0, max_noise = 0;
+	struct trace_array *tr = osnoise_trace;
+	time_type start, sample, last_sample;
+	u64 last_int_count, int_count;
+	s64 total, last_total = 0;
+	struct osnoise_sample s;
+	int hw_count = 0;
+	int ret = -1;
+
+	/*
+	 * Consider the current thread as the workload.
+	 */
+	osn_var->pid = current->pid;
+
+	/*
+	 * Save the current stats for the diff
+	 */
+	save_osn_sample_stats(osn_var, &s);
+
+	/*
+	 * Make sure NMIs see sampling first
+	 */
+	osn_var->sampling = true;
+	barrier();
+
+	/*
+	 * Start timestamp
+	 */
+	start = time_get();
+
+	/*
+	 * "previous" loop
+	 */
+	last_int_count = set_int_safe_time(osn_var, &last_sample);
+
+	do {
+		/*
+		 * Get sample!
+		 */
+		int_count = set_int_safe_time(osn_var, &sample);
+
+		noise = time_sub(sample, last_sample);
+
+		/*
+		 * This shouldn't happen.
+		 */
+		if (noise < 0) {
+			pr_err(BANNER "time running backwards\n");
+			goto out;
+		}
+
+		/*
+		 * Sample runtime.
+		 */
+		total = time_to_us(time_sub(sample, start));
+
+		/*
+		 * Check for possible overflows.
+		 */
+		if (total < last_total) {
+			pr_err("Time total overflowed\n");
+			break;
+		}
+
+		last_total = total;
+
+		if (noise >= osnoise_data.noise_tolerance_ns) {
+			int interference = int_count - last_int_count;
+			int noise_us = time_to_us(noise);
+
+			if (noise > max_noise)
+				max_noise = noise;
+
+			if (!interference)
+				hw_count++;
+
+			sum_noise += noise;
+
+			trace_sample_threshold(last_sample, noise_us, interference);
+
+			if (osnoise_data.stop_tracing_single_max)
+				if (noise_us > osnoise_data.stop_tracing_single_max)
+					tracing_off();
+		}
+
+		/*
+		 * For the non-preemptive kernel config: let threads run, if
+		 * they so wish.
+		 */
+		cond_resched();
+
+		last_sample = sample;
+		last_int_count = int_count;
+
+	} while (total < osnoise_data.sample_runtime && !kthread_should_stop());
+
+	/*
+	 * Make sure interrupts see the above as finished.
+	 */
+	barrier();
+
+	osn_var->sampling = false;
+
+	/*
+	 * Make sure sampling data is no longer updated.
+	 */
+	barrier();
+
+	/*
+	 * Save noise info.
+	 */
+	s.noise = time_to_us(sum_noise);
+	s.runtime = total;
+	s.max_sample = time_to_us(max_noise);
+	s.hw_count = hw_count;
+
+	/* Save interference stats info */
+	diff_osn_sample_stats(osn_var, &s);
+
+	trace_osnoise_sample(&s);
+
+	/* Keep a running maximum of the os noise "latency" ever recorded */
+	if (max_noise > tr->max_latency) {
+		tr->max_latency = max_noise;
+		latency_fsnotify(tr);
+	}
+
+	if (osnoise_data.stop_tracing_total_max)
+		if (s.noise > osnoise_data.stop_tracing_total_max)
+			tracing_off();
+
+	return 0;
+out:
+	return ret;
+}
+
+static struct cpumask osnoise_cpumask;
+static struct cpumask save_cpumask;
+
+/*
+ * kthread_fn - The osnoise detection kernel thread
+ *
+ * Calls run_osnoise() function to measure the osnoise for the configured runtime,
+ * every period.
+ */
+static int kthread_fn(void *data)
+{
+	s64 interval;
+
+	while (!kthread_should_stop()) {
+
+		run_osnoise();
+
+		mutex_lock(&osnoise_data.lock);
+		interval = osnoise_data.sample_period - osnoise_data.sample_runtime;
+		mutex_unlock(&osnoise_data.lock);
+
+		do_div(interval, USEC_PER_MSEC);
+
+		/*
+		 * Unlike hwlat_detector, the osnoise tracer can run
+		 * without a pause because preemption is on.
+		 */
+		if (interval < 1)
+			continue;
+
+		if (msleep_interruptible(interval))
+			break;
+	}
+
+	return 0;
+}
+
+/**
+ * stop_per_cpu_kthreads - stop per-cpu threads
+ *
+ * Stop the osnoise sampling threads. Use this on unload and at system
+ * shutdown.
+ */
+static void stop_per_cpu_kthreads(void)
+{
+	struct task_struct *kthread;
+	int cpu;
+
+	for_each_online_cpu(cpu) {
+		kthread = per_cpu(per_cpu_osnoise_var, cpu).kthread;
+		if (kthread)
+			kthread_stop(kthread);
+	}
+}
+
+/**
+ * start_per_cpu_kthreads - Kick off per-cpu osnoise sampling kthreads
+ *
+ * This starts the kernel threads that will look for osnoise on the
+ * allowed CPUs.
+ */
+static int start_per_cpu_kthreads(struct trace_array *tr)
+{
+	struct cpumask *current_mask = &save_cpumask;
+	struct task_struct *kthread;
+	char comm[24];
+	int cpu;
+
+	get_online_cpus();
+	/*
+	 * Run only on CPUs in which trace and osnoise are allowed to run.
+	 */
+	cpumask_and(current_mask, tr->tracing_cpumask, &osnoise_cpumask);
+	/*
+	 * And the CPU is online.
+	 */
+	cpumask_and(current_mask, cpu_online_mask, current_mask);
+	put_online_cpus();
+
+	for_each_online_cpu(cpu)
+		per_cpu(per_cpu_osnoise_var, cpu).kthread = NULL;
+
+	for_each_cpu(cpu, current_mask) {
+		snprintf(comm, 24, "osnoise/%d", cpu);
+
+		kthread = kthread_create_on_cpu(kthread_fn, NULL, cpu, comm);
+
+		if (IS_ERR(kthread)) {
+			pr_err(BANNER "could not start sampling thread\n");
+			stop_per_cpu_kthreads();
+			return -ENOMEM;
+		}
+
+		per_cpu(per_cpu_osnoise_var, cpu).kthread = kthread;
+		wake_up_process(kthread);
+	}
+
+	return 0;
+}
+
+/*
+ * osnoise_cpus_read - Read function for reading the "cpus" file
+ * @filp: The active open file structure
+ * @ubuf: The userspace provided buffer to read value into
+ * @count: The maximum number of bytes to read
+ * @ppos: The current "file" position
+ *
+ * Prints the "cpus" output into the user-provided buffer.
+ */
+static ssize_t
+osnoise_cpus_read(struct file *filp, char __user *ubuf, size_t count,
+		  loff_t *ppos)
+{
+	char *mask_str;
+	int len;
+
+	len = snprintf(NULL, 0, "%*pbl\n",
+		       cpumask_pr_args(&osnoise_cpumask)) + 1;
+	mask_str = kmalloc(len, GFP_KERNEL);
+	if (!mask_str)
+		return -ENOMEM;
+
+	len = snprintf(mask_str, len, "%*pbl\n",
+		       cpumask_pr_args(&osnoise_cpumask));
+	if (len >= count) {
+		count = -EINVAL;
+		goto out_err;
+	}
+	count = simple_read_from_buffer(ubuf, count, ppos, mask_str, len);
+
+out_err:
+	kfree(mask_str);
+
+	return count;
+}
+
+/**
+ * osnoise_cpus_write - Write function for "cpus" entry
+ * @filp: The active open file structure
+ * @ubuf: The user buffer that contains the value to write
+ * @count: The maximum number of bytes to write to "file"
+ * @ppos: The current position in @file
+ *
+ * This function provides a write implementation for the "cpus"
+ * interface to the osnoise tracer. By default, it lists all CPUs,
+ * thus allowing osnoised threads to run on any online CPU
+ * of the system. It serves to restrict the execution of osnoise to the
+ * set of CPUs written via this interface. Note that osnoise also
+ * respects the "tracing_cpumask." Hence, osnoised threads will run only
+ * on the set of CPUs allowed here AND on "tracing_cpumask." Why not
+ * have just "tracing_cpumask?" Because the user might be interested
+ * in tracing what is running on other CPUs. For instance, one might
+ * run osnoised in one HT CPU while observing what is running on the
+ * sibling HT CPU.
+ */
+static ssize_t
+osnoise_cpus_write(struct file *filp, const char __user *ubuf, size_t count,
+		   loff_t *ppos)
+{
+	cpumask_var_t osnoise_cpumask_new;
+	char buf[256];
+	int err;
+
+	if (count >= 256)
+		return -EINVAL;
+
+	if (copy_from_user(buf, ubuf, count))
+		return -EFAULT;
+
+	if (!zalloc_cpumask_var(&osnoise_cpumask_new, GFP_KERNEL))
+		return -ENOMEM;
+
+	err = cpulist_parse(buf, osnoise_cpumask_new);
+	if (err)
+		goto err_free;
+
+	cpumask_copy(&osnoise_cpumask, osnoise_cpumask_new);
+
+	free_cpumask_var(osnoise_cpumask_new);
+
+	return count;
+
+err_free:
+	free_cpumask_var(osnoise_cpumask_new);
+
+	return err;
+}
+
+/*
+ * osnoise_read - Wrapper read function for reading %llu value
+ * @filp: The active open file structure
+ * @ubuf: The userspace provided buffer to read value into
+ * @cnt: The maximum number of bytes to read
+ * @ppos: The current "file" position
+ *
+ * This function provides a generic read implementation for the global state
+ * "osnoise_data" structure filesystem entries.
+ */
+static ssize_t osnoise_read(struct file *filp, char __user *ubuf,
+			    size_t cnt, loff_t *ppos)
+{
+	u64 *entry = filp->private_data;
+	char buf[U64STR_SIZE];
+	u64 val;
+	int len;
+
+	if (!entry)
+		return -EFAULT;
+
+	if (cnt > sizeof(buf))
+		cnt = sizeof(buf);
+
+	val = *entry;
+
+	len = snprintf(buf, sizeof(buf), "%llu\n", val);
+
+	return simple_read_from_buffer(ubuf, cnt, ppos, buf, len);
+}
+
+/**
+ * osnoise_runtime_write - Write function for "runtime_us" entry
+ * @filp: The active open file structure
+ * @ubuf: The user buffer that contains the value to write
+ * @cnt: The maximum number of bytes to write to "file"
+ * @ppos: The current position in @file
+ *
+ * This function provides a write implementation for the "runtime" interface
+ * to the osnoise tracer. It can be used to configure for how many us of the
+ * total period us we will actively sample for any noise.
+ */
+static ssize_t
+osnoise_runtime_write(struct file *filp, const char __user *ubuf,
+		  size_t cnt, loff_t *ppos)
+{
+	u64 val;
+	int err;
+
+	err = kstrtoull_from_user(ubuf, cnt, 10, &val);
+	if (err)
+		return err;
+
+	mutex_lock(&osnoise_data.lock);
+	if (val <= osnoise_data.sample_period)
+		osnoise_data.sample_runtime = val;
+	else
+		err = -EINVAL;
+	mutex_unlock(&osnoise_data.lock);
+
+	if (err)
+		return err;
+
+	return cnt;
+}
+
+/**
+ * osnoise_period_write - Write function for "period_us" entry
+ * @filp: The active open file structure
+ * @ubuf: The user buffer that contains the value to write
+ * @cnt: The maximum number of bytes to write to "file"
+ * @ppos: The current position in @file
+ *
+ * This function provides a write implementation for the "period" interface
+ * to the osnoise tracer. The period is the total time in us that will be
+ * considered one sample period. Conceptually, periods occur back-to-back
+ * and contain a sample runtime period during which actual sampling occurs.
+ * Can be used to write a new total period size. It is enforced that any
+ * value written must be greater than or equal to the sample runtime size,
+ * or an error results.
+ */
+static ssize_t
+osnoise_period_write(struct file *filp, const char __user *ubuf,
+		   size_t cnt, loff_t *ppos)
+{
+	u64 val;
+	int err;
+
+	err = kstrtoull_from_user(ubuf, cnt, 10, &val);
+	if (err)
+		return err;
+
+	mutex_lock(&osnoise_data.lock);
+	if (osnoise_data.sample_runtime < val)
+		osnoise_data.sample_period = val;
+	else
+		err = -EINVAL;
+	mutex_unlock(&osnoise_data.lock);
+
+	if (err)
+		return err;
+
+	return cnt;
+}
+
+/**
+ * osnoise_tolerance_write - Write function for "tolerance_ns" entry
+ * @filp: The active open file structure
+ * @ubuf: The user buffer that contains the value to write
+ * @cnt: The maximum number of bytes to write to "file"
+ * @ppos: The current position in @file
+ *
+ * This function provides a write implementation for the "tolerance_ns"
+ * interface.  The tolerance is a lower bound: any "noise" below it is
+ * not considered noise, or is so small that it is tolerable.
+ */
+static ssize_t
+osnoise_tolerance_write(struct file *filp, const char __user *ubuf,
+			size_t cnt, loff_t *ppos)
+{
+	u64 val;
+	int err;
+
+	err = kstrtoull_from_user(ubuf, cnt, 10, &val);
+	if (err)
+		return err;
+
+	mutex_lock(&osnoise_data.lock);
+	if ((val < 500) || (val > 100000000))
+		err = -EINVAL;
+	else
+		osnoise_data.noise_tolerance_ns = val;
+
+	mutex_unlock(&osnoise_data.lock);
+
+	if (err)
+		return err;
+
+	return cnt;
+}
+
+/**
+ * osnoise_stop_tracing_write - Write function for "stop_tracing_*" entry
+ * @filp: The active open file structure
+ * @ubuf: The user buffer that contains the value to write
+ * @cnt: The maximum number of bytes to write to "file"
+ * @ppos: The current position in @file
+ *
+ * This function provides a write implementation for the
+ * "stop_tracing_single_us" and "stop_tracing_total_us" entries. Once a
+ * sample exceeds the respective value written to these entries, tracing
+ * is turned off.
+ */
+static ssize_t
+osnoise_stop_tracing_write(struct file *filp, const char __user *ubuf,
+			size_t cnt, loff_t *ppos)
+{
+	u64 *entry = filp->private_data;
+	u64 val;
+	int err;
+
+	err = kstrtoull_from_user(ubuf, cnt, 10, &val);
+	if (err)
+		return err;
+
+	mutex_lock(&osnoise_data.lock);
+	if (osnoise_data.sample_runtime < val)
+		err = -EINVAL;
+	else
+		*entry = val;
+
+	mutex_unlock(&osnoise_data.lock);
+
+	if (err)
+		return err;
+
+	return cnt;
+}
+
+static const struct file_operations runtime_fops = {
+	.open		= tracing_open_generic,
+	.read		= osnoise_read,
+	.write		= osnoise_runtime_write,
+};
+
+static const struct file_operations period_fops = {
+	.open		= tracing_open_generic,
+	.read		= osnoise_read,
+	.write		= osnoise_period_write,
+};
+
+static const struct file_operations tolerance_fops = {
+	.open		= tracing_open_generic,
+	.read		= osnoise_read,
+	.write		= osnoise_tolerance_write,
+};
+
+static const struct file_operations stop_tracing_fops = {
+	.open		= tracing_open_generic,
+	.read		= osnoise_read,
+	.write		= osnoise_stop_tracing_write,
+};
+
+static const struct file_operations cpus_fops = {
+	.open		= tracing_open_generic,
+	.read		= osnoise_cpus_read,
+	.write		= osnoise_cpus_write,
+	.llseek		= generic_file_llseek,
+};
+
+/**
+ * init_tracefs - A function to initialize the tracefs interface files
+ *
+ * This function creates entries in tracefs for "osnoise". It creates the
+ * "osnoise" directory in the tracing directory, and within that
+ * directory are the period, runtime, tolerance, stop_tracing and cpus
+ * files to view and change those values.
+ */
+static int init_tracefs(void)
+{
+	struct dentry *top_dir;
+	struct dentry *tmp;
+	int ret;
+
+	ret = tracing_init_dentry();
+	if (ret)
+		return -ENOMEM;
+
+	top_dir = tracefs_create_dir("osnoise", NULL);
+	if (!top_dir)
+		return -ENOMEM;
+
+	tmp = tracefs_create_file("period_us", 0640, top_dir,
+				  &osnoise_data.sample_period, &period_fops);
+	if (!tmp)
+		goto err;
+
+	tmp = tracefs_create_file("runtime_us", 0644, top_dir,
+				  &osnoise_data.sample_runtime, &runtime_fops);
+	if (!tmp)
+		goto err;
+
+	tmp = tracefs_create_file("tolerance_ns", 0640, top_dir,
+				  &osnoise_data.noise_tolerance_ns,
+				  &tolerance_fops);
+	if (!tmp)
+		goto err;
+
+	tmp = tracefs_create_file("stop_tracing_single_us", 0640, top_dir,
+				  &osnoise_data.stop_tracing_single_max,
+				  &stop_tracing_fops);
+	if (!tmp)
+		goto err;
+
+	tmp = tracefs_create_file("stop_tracing_total_us", 0640, top_dir,
+				  &osnoise_data.stop_tracing_total_max,
+				  &stop_tracing_fops);
+	if (!tmp)
+		goto err;
+
+
+	tmp = trace_create_file("cpus", 0644, top_dir, NULL, &cpus_fops);
+	if (!tmp)
+		goto err;
+
+	return 0;
+
+ err:
+	tracefs_remove(top_dir);
+	return -ENOMEM;
+}
+
+static void osnoise_tracer_start(struct trace_array *tr)
+{
+	int retval;
+
+	/* Only allow one instance to enable this */
+	if (osnoise_busy)
+		return;
+
+	/*
+	 * Trace is already hooked, we are re-enabling from
+	 * a stop_tracing_*.
+	 */
+	if (trace_osnoise_callback_enabled)
+		return;
+
+	osn_var_reset_all();
+
+	retval = hook_irq_events();
+	if (retval)
+		goto err;
+
+	retval = hook_softirq_events();
+	if (retval)
+		goto out_unhook_irq;
+
+	retval = hook_thread_events();
+
+	if (retval)
+		goto out_unhook_softirq;
+
+	/*
+	 * Make sure NMIs see reset values.
+	 */
+	barrier();
+	trace_osnoise_callback_enabled = true;
+
+	retval = start_per_cpu_kthreads(tr);
+	/*
+	 * all fine!
+	 */
+	if (!retval)
+		return;
+
+	unhook_thread_events();
+out_unhook_softirq:
+	unhook_softirq_events();
+out_unhook_irq:
+	unhook_irq_events();
+err:
+	pr_err(BANNER "Error starting osnoise tracer\n");
+}
+
+static void osnoise_tracer_stop(struct trace_array *tr)
+{
+	/* Only allow one instance to enable this */
+	if (!osnoise_busy)
+		return;
+
+	trace_osnoise_callback_enabled = false;
+	barrier();
+
+	stop_per_cpu_kthreads();
+
+	unhook_irq_events();
+	unhook_softirq_events();
+	unhook_thread_events();
+}
+
+static int osnoise_tracer_init(struct trace_array *tr)
+{
+	/* Only allow one instance to enable this */
+	if (osnoise_busy)
+		return -EBUSY;
+
+	osnoise_trace = tr;
+
+	tr->max_latency = 0;
+
+	if (tracer_tracing_is_on(tr))
+		osnoise_tracer_start(tr);
+
+	osnoise_busy = true;
+
+
+	return 0;
+}
+
+static void osnoise_tracer_reset(struct trace_array *tr)
+{
+	osnoise_tracer_stop(tr);
+
+	osnoise_busy = false;
+}
+
+static struct tracer osnoise_tracer __read_mostly = {
+	.name		= "osnoise",
+	.init		= osnoise_tracer_init,
+	.reset		= osnoise_tracer_reset,
+	.start		= osnoise_tracer_start,
+	.stop		= osnoise_tracer_stop,
+	.print_header	= print_osnoise_headers,
+	.allow_instances = true,
+};
+
+__init static int init_osnoise_tracer(void)
+{
+	int ret;
+
+	mutex_init(&osnoise_data.lock);
+
+	ret = register_tracer(&osnoise_tracer);
+	if (ret)
+		return ret;
+
+	cpumask_copy(&osnoise_cpumask, cpu_all_mask);
+
+	init_tracefs();
+
+	return 0;
+}
+late_initcall(init_osnoise_tracer);
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index 61255bad7e01..edeb127fcdea 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -1189,7 +1189,6 @@ trace_hwlat_print(struct trace_iterator *iter, int flags,
 	return trace_handle_return(s);
 }
 
-
 static enum print_line_t
 trace_hwlat_raw(struct trace_iterator *iter, int flags,
 		struct trace_event *event)
@@ -1219,6 +1218,76 @@ static struct trace_event trace_hwlat_event = {
 	.funcs		= &trace_hwlat_funcs,
 };
 
+/* TRACE_OSNOISE */
+static enum print_line_t
+trace_osnoise_print(struct trace_iterator *iter, int flags,
+		    struct trace_event *event)
+{
+	struct trace_entry *entry = iter->ent;
+	struct trace_seq *s = &iter->seq;
+	struct osnoise_entry *field;
+	u64 ratio, ratio_dec;
+	u64 net_runtime;
+
+	trace_assign_type(field, entry);
+
+	/*
+	 * compute the available % of cpu time.
+	 */
+	net_runtime = field->runtime - field->noise;
+	ratio = net_runtime * 10000000;
+	do_div(ratio, field->runtime);
+	ratio_dec = do_div(ratio, 100000);
+
+	trace_seq_printf(s, "%llu %10llu %3llu.%05llu %7llu",
+			 field->runtime,
+			 field->noise,
+			 ratio, ratio_dec,
+			 field->max_sample);
+
+	trace_seq_printf(s, " %6u", field->hw_count);
+	trace_seq_printf(s, " %6u", field->nmi_count);
+	trace_seq_printf(s, " %6u", field->irq_count);
+	trace_seq_printf(s, " %6u", field->softirq_count);
+	trace_seq_printf(s, " %6u", field->thread_count);
+
+	trace_seq_putc(s, '\n');
+
+	return trace_handle_return(s);
+}
+
+static enum print_line_t
+trace_osnoise_raw(struct trace_iterator *iter, int flags,
+		  struct trace_event *event)
+{
+	struct osnoise_entry *field;
+	struct trace_seq *s = &iter->seq;
+
+	trace_assign_type(field, iter->ent);
+
+	trace_seq_printf(s, "%lld %llu %llu %u %u %u %u %u\n",
+			 field->runtime,
+			 field->noise,
+			 field->max_sample,
+			 field->hw_count,
+			 field->nmi_count,
+			 field->irq_count,
+			 field->softirq_count,
+			 field->thread_count);
+
+	return trace_handle_return(s);
+}
+
+static struct trace_event_functions trace_osnoise_funcs = {
+	.trace		= trace_osnoise_print,
+	.raw		= trace_osnoise_raw,
+};
+
+static struct trace_event trace_osnoise_event = {
+	.type		= TRACE_OSNOISE,
+	.funcs		= &trace_osnoise_funcs,
+};
+
 /* TRACE_BPUTS */
 static enum print_line_t
 trace_bputs_print(struct trace_iterator *iter, int flags,
@@ -1384,6 +1453,7 @@ static struct trace_event *events[] __initdata = {
 	&trace_bprint_event,
 	&trace_print_event,
 	&trace_hwlat_event,
+	&trace_osnoise_event,
 	&trace_raw_data_event,
 	NULL
 };
-- 
2.30.2

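For readers following save_osn_sample_stats() and diff_osn_sample_stats()
in the patch above: the pair implements a snapshot-and-subtract pattern.
The per-CPU counters only ever grow, so the sample stores their values
before the sampling window and later overwrites each stored value with
"now minus then". What follows is a minimal userspace sketch of that
pattern, not the kernel code: the struct and function names ending in
_sketch are hypothetical, and plain uint64_t fields stand in for the
kernel structures.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the ever-growing per-CPU counters. */
struct osn_counters_sketch {
	uint64_t nmi, irq, softirq, thread;
};

/* Hypothetical stand-in for the per-sample statistics. */
struct osn_sample_sketch {
	uint64_t nmi_count, irq_count, softirq_count, thread_count;
};

/* Snapshot the counters before the sampling window starts. */
static void save_stats_sketch(const struct osn_counters_sketch *c,
			      struct osn_sample_sketch *s)
{
	assert(c && s);
	s->nmi_count = c->nmi;
	s->irq_count = c->irq;
	s->softirq_count = c->softirq;
	s->thread_count = c->thread;
}

/* Replace each snapshot with the number of events seen since it was taken. */
static void diff_stats_sketch(const struct osn_counters_sketch *c,
			      struct osn_sample_sketch *s)
{
	assert(c && s);
	s->nmi_count = c->nmi - s->nmi_count;
	s->irq_count = c->irq - s->irq_count;
	s->softirq_count = c->softirq - s->softirq_count;
	s->thread_count = c->thread - s->thread_count;
}
```

Because the fields are unsigned, the subtraction still yields the right
delta if a counter wraps around once between the two calls.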

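The "available % of cpu time" computed in trace_osnoise_print() above is
fixed-point arithmetic: the net runtime is scaled by 10000000 (100 for the
percentage times 10^5 for five decimal places), divided by the runtime, and
then split into an integer part and a five-digit fraction. The sketch below
reproduces that arithmetic in userspace, assuming ordinary 64-bit division
and modulo in place of the kernel's do_div(); the function name
osnoise_available_sketch is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper mirroring the arithmetic in trace_osnoise_print():
 * returns the integer percentage in *pct and the five decimal digits in
 * *pct_dec. Both times are in microseconds.
 */
static void osnoise_available_sketch(uint64_t runtime_us, uint64_t noise_us,
				     uint64_t *pct, uint64_t *pct_dec)
{
	uint64_t net_runtime, ratio;

	assert(runtime_us != 0 && noise_us <= runtime_us);

	net_runtime = runtime_us - noise_us;
	ratio = net_runtime * 10000000ULL;	/* percent scaled by 10^5 */
	ratio /= runtime_us;

	*pct_dec = ratio % 100000;		/* fractional digits */
	*pct = ratio / 100000;			/* whole percent */
}
```

For example, a runtime of 1000000 us with 1234 us of noise gives 99 and
87660, which the "%3llu.%05llu" format prints as " 99.87660".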

* Re: [RFC PATCH 5/5] tracing: Add the osnoise tracer
  2021-04-08 14:13 ` [RFC PATCH 5/5] tracing: Add the osnoise tracer Daniel Bristot de Oliveira
@ 2021-04-08 15:58   ` Jonathan Corbet
  2021-04-09  7:19     ` Daniel Bristot de Oliveira
  2021-04-08 23:57   ` kernel test robot
  2021-04-14 17:14   ` Steven Rostedt
  2 siblings, 1 reply; 25+ messages in thread
From: Jonathan Corbet @ 2021-04-08 15:58 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira, Steven Rostedt, linux-kernel
  Cc: bristot, kcarcia, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Alexandre Chartre, Clark Willaims, John Kacur, Juri Lelli,
	linux-doc

Daniel Bristot de Oliveira <bristot@redhat.com> writes:

A quick nit:

>  Documentation/trace/osnoise_tracer.rst |  149 ++
>  include/linux/ftrace_irq.h             |   16 +
>  include/trace/events/osnoise.h         |  141 ++
>  kernel/trace/Kconfig                   |   34 +
>  kernel/trace/Makefile                  |    1 +
>  kernel/trace/trace.h                   |    9 +-
>  kernel/trace/trace_entries.h           |   27 +
>  kernel/trace/trace_osnoise.c           | 1714 ++++++++++++++++++++++++
>  kernel/trace/trace_output.c            |   72 +-
>  9 files changed, 2159 insertions(+), 4 deletions(-)
>  create mode 100644 Documentation/trace/osnoise_tracer.rst
>  create mode 100644 include/trace/events/osnoise.h
>  create mode 100644 kernel/trace/trace_osnoise.c

When you create a new RST file, you need to add it to an index.rst (or
similar) file so that it gets incorporated into the docs build.

The document itself looks good on a quick read.  If you're making
another pass over it, you might consider reducing the ``markup noise`` a
bit; we try to keep that to a minimum in the kernel docs.  But otherwise
thanks for writing it!

jon


* Re: [RFC PATCH 3/5] tracing/hwlat: Implement the per-cpu mode
  2021-04-08 14:13 ` [RFC PATCH 3/5] tracing/hwlat: Implement the per-cpu mode Daniel Bristot de Oliveira
@ 2021-04-08 19:39   ` kernel test robot
  2021-04-08 21:39   ` kernel test robot
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 25+ messages in thread
From: kernel test robot @ 2021-04-08 19:39 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 3176 bytes --]

Hi Daniel,

[FYI, it's a private test report for your RFC patch.]
[auto build test WARNING on tip/perf/core]
[also build test WARNING on linux/master linus/master v5.12-rc6]
[cannot apply to trace/for-next next-20210408]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Daniel-Bristot-de-Oliveira/hwlat-improvements-and-osnoise-tracer/20210408-221655
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git cface0326a6c2ae5c8f47bd466f07624b3e348a7
config: mips-randconfig-r002-20210408 (attached as .config)
compiler: clang version 13.0.0 (https://github.com/llvm/llvm-project 56ea2e2fdd691136d5e6631fa0e447173694b82c)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install mips cross compiling tool for clang build
        # apt-get install binutils-mips-linux-gnu
        # https://github.com/0day-ci/linux/commit/4e2f5d30c69f77756e8cf223acf55c2aa2657393
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Daniel-Bristot-de-Oliveira/hwlat-improvements-and-osnoise-tracer/20210408-221655
        git checkout 4e2f5d30c69f77756e8cf223acf55c2aa2657393
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=mips 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> kernel/trace/trace_hwlat.c:122:28: warning: no previous prototype for function 'get_cpu_data' [-Wmissing-prototypes]
   struct hwlat_kthread_data *get_cpu_data(void)
                              ^
   kernel/trace/trace_hwlat.c:122:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   struct hwlat_kthread_data *get_cpu_data(void)
   ^
   static 
   kernel/trace/trace_hwlat.c:496:25: error: incompatible pointer types passing 'struct cpumask **' to parameter of type 'cpumask_var_t *' (aka 'struct cpumask (*)[1]') [-Werror,-Wincompatible-pointer-types]
           if (!alloc_cpumask_var(&this_cpumask, GFP_KERNEL))
                                  ^~~~~~~~~~~~~
   include/linux/cpumask.h:767:53: note: passing argument to parameter 'mask' here
   static inline bool alloc_cpumask_var(cpumask_var_t *mask, gfp_t flags)
                                                       ^
   1 warning and 1 error generated.


vim +/get_cpu_data +122 kernel/trace/trace_hwlat.c

   121	
 > 122	struct hwlat_kthread_data *get_cpu_data(void)
   123	{
   124		if (hwlat_data.thread_mode == MODE_PER_CPU)
   125			return this_cpu_ptr(&hwlat_per_cpu_data);
   126		else
   127			return &hwlat_single_cpu_data;
   128	}
   129	
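
Both diagnostics in this report come down to declarations. The note for the
first one already says it: a function used only within its own file can be
declared static. The second is a type mismatch: per the error text, on this
config cpumask_var_t is an array of one struct cpumask, so
alloc_cpumask_var() expects the address of such a variable, not the address
of a struct cpumask pointer. The userspace sketch below mimics both points
with stand-in types. Every name ending in _sketch is hypothetical, and this
is an illustration of the type rules, not the fix that was applied to the
patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Stand-in for struct cpumask: one word of CPU bits. */
struct cpumask_sketch {
	unsigned long bits[1];
};

/* On-stack variant: the "var" type is an array of one mask. */
typedef struct cpumask_sketch cpumask_var_sketch_t[1];

/* Takes a pointer to the array type, i.e. struct cpumask_sketch (*)[1]. */
static bool alloc_cpumask_var_sketch(cpumask_var_sketch_t *mask)
{
	assert(mask != NULL);
	memset(*mask, 0, sizeof(*mask));
	return true;
}

/*
 * File-local helper. Declaring it static is what avoids
 * -Wmissing-prototypes for a function with no external users.
 */
static int count_cpus_sketch(unsigned long allowed)
{
	cpumask_var_sketch_t this_cpumask;	/* not a struct pointer */
	int cpu, count = 0;

	/* &this_cpumask has exactly the parameter type expected. */
	if (!alloc_cpumask_var_sketch(&this_cpumask))
		return -1;

	this_cpumask->bits[0] = allowed;

	for (cpu = 0; cpu < (int)(8 * sizeof(unsigned long)); cpu++)
		if (this_cpumask->bits[0] & (1UL << cpu))
			count++;

	return count;
}
```

Declaring this_cpumask as "struct cpumask_sketch *" instead and passing its
address would reproduce the incompatible-pointer-types error shown above.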

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 31347 bytes --]


* Re: [RFC PATCH 2/5] tracing/hwlat: Implement the mode config option
  2021-04-08 14:13 ` [RFC PATCH 2/5] tracing/hwlat: Implement the mode config option Daniel Bristot de Oliveira
@ 2021-04-08 20:52   ` kernel test robot
  2021-04-14 14:30   ` Steven Rostedt
  1 sibling, 0 replies; 25+ messages in thread
From: kernel test robot @ 2021-04-08 20:52 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 3556 bytes --]

Hi Daniel,

[FYI, it's a private test report for your RFC patch.]
[auto build test WARNING on tip/perf/core]
[also build test WARNING on trace/for-next linux/master linus/master v5.12-rc6 next-20210408]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Daniel-Bristot-de-Oliveira/hwlat-improvements-and-osnoise-tracer/20210408-221655
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git cface0326a6c2ae5c8f47bd466f07624b3e348a7
config: openrisc-randconfig-r013-20210408 (attached as .config)
compiler: or1k-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/a09dc6bb34fa0ac2596763b3a0285717799b57ce
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Daniel-Bristot-de-Oliveira/hwlat-improvements-and-osnoise-tracer/20210408-221655
        git checkout a09dc6bb34fa0ac2596763b3a0285717799b57ce
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=openrisc 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

   kernel/trace/trace_hwlat.c: In function 'hwlat_mode_write':
>> kernel/trace/trace_hwlat.c:711:6: warning: variable 'ret' set but not used [-Wunused-but-set-variable]
     711 |  int ret;
         |      ^~~


vim +/ret +711 kernel/trace/trace_hwlat.c

   691	
   692	/**
   693	 * hwlat_mode_write - Write function for "mode" entry
   694	 * @filp: The active open file structure
   695	 * @ubuf: The user buffer that contains the value to write
   696	 * @cnt: The maximum number of bytes to write to "file"
   697	 * @ppos: The current position in @file
   698	 *
   699	 * This function provides a write implementation for the "mode" interface
   700	 * to the hardware latency detector. hwlatd has different operation modes.
   701	 * The "none" sets the allowed cpumask for a single hwlatd thread at the
   702	 * startup and lets the scheduler handle the migration. The default mode is
   703	 * the "round-robin" one, in which a single hwlatd thread runs, migrating
   704	 * among the allowed CPUs in a round-robin fashion.
   705	 */
   706	static ssize_t hwlat_mode_write(struct file *filp, const char __user *ubuf,
   707					 size_t cnt, loff_t *ppos)
   708	{
   709		const char *mode;
   710		char buf[64];
 > 711		int ret;
   712		int i;
   713	
   714		if (hwlat_busy)
   715			return -EBUSY;
   716	
   717		if (cnt >= sizeof(buf))
   718			return -EINVAL;
   719	
   720		if (copy_from_user(buf, ubuf, cnt))
   721			return -EFAULT;
   722	
   723		buf[cnt] = 0;
   724	
   725		mode = strstrip(buf);
   726	
   727		ret = -EINVAL;
   728	
   729		for (i = 0; i < MODE_MAX; i++) {
   730			if (strcmp(mode, thread_mode_str[i]) == 0) {
   731				hwlat_data.thread_mode = i;
   732				ret = cnt;
   733			}
   734		}
   735	
   736		*ppos += cnt;
   737	
   738		return cnt;
   739	}
   740	
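
The warning points at a real behavior issue in the listing: ret is set to
-EINVAL, updated on a match, and then ignored because the function returns
cnt unconditionally, so an unknown mode string is reported as a successful
write. One possible reading is that the function was meant to return ret.
The userspace sketch below shows that shape under this assumption. The
names ending in _sketch are hypothetical, and the "per-cpu" string is
assumed from the title of patch 3; only "none" and "round-robin" appear in
the quoted comment.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

enum {
	MODE_NONE_SKETCH,
	MODE_ROUND_ROBIN_SKETCH,
	MODE_PER_CPU_SKETCH,
	MODE_MAX_SKETCH,
};

/* Assumed mode names, indexed by the enum above. */
static const char *const mode_str_sketch[MODE_MAX_SKETCH] = {
	"none", "round-robin", "per-cpu",
};

/*
 * Returns cnt when @mode names a known mode (and stores its index in
 * *thread_mode), or -EINVAL otherwise, leaving *thread_mode untouched.
 */
static long mode_write_sketch(const char *mode, size_t cnt, int *thread_mode)
{
	long ret = -EINVAL;
	int i;

	assert(mode && thread_mode);

	for (i = 0; i < MODE_MAX_SKETCH; i++) {
		if (strcmp(mode, mode_str_sketch[i]) == 0) {
			*thread_mode = i;
			ret = (long)cnt;
			break;
		}
	}

	return ret;	/* the value is used, so the warning goes away */
}
```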

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 25278 bytes --]


* Re: [RFC PATCH 3/5] tracing/hwlat: Implement the per-cpu mode
  2021-04-08 14:13 ` [RFC PATCH 3/5] tracing/hwlat: Implement the per-cpu mode Daniel Bristot de Oliveira
  2021-04-08 19:39   ` kernel test robot
@ 2021-04-08 21:39   ` kernel test robot
  2021-04-08 23:54   ` kernel test robot
  2021-04-14 14:41   ` Steven Rostedt
  3 siblings, 0 replies; 25+ messages in thread
From: kernel test robot @ 2021-04-08 21:39 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 7539 bytes --]

Hi Daniel,

[FYI, it's a private test report for your RFC patch.]
[auto build test WARNING on tip/perf/core]
[also build test WARNING on linux/master linus/master v5.12-rc6]
[cannot apply to trace/for-next next-20210408]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Daniel-Bristot-de-Oliveira/hwlat-improvements-and-osnoise-tracer/20210408-221655
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git cface0326a6c2ae5c8f47bd466f07624b3e348a7
config: openrisc-randconfig-r013-20210408 (attached as .config)
compiler: or1k-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/4e2f5d30c69f77756e8cf223acf55c2aa2657393
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Daniel-Bristot-de-Oliveira/hwlat-improvements-and-osnoise-tracer/20210408-221655
        git checkout 4e2f5d30c69f77756e8cf223acf55c2aa2657393
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=openrisc 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> kernel/trace/trace_hwlat.c:122:28: warning: no previous prototype for 'get_cpu_data' [-Wmissing-prototypes]
     122 | struct hwlat_kthread_data *get_cpu_data(void)
         |                            ^~~~~~~~~~~~
   In file included from include/linux/err.h:5,
                    from include/linux/kthread.h:5,
                    from kernel/trace/trace_hwlat.c:40:
   kernel/trace/trace_hwlat.c: In function 'start_per_cpu_kthreads':
   kernel/trace/trace_hwlat.c:496:25: error: passing argument 1 of 'alloc_cpumask_var' from incompatible pointer type [-Werror=incompatible-pointer-types]
     496 |  if (!alloc_cpumask_var(&this_cpumask, GFP_KERNEL))
         |                         ^~~~~~~~~~~~~
         |                         |
         |                         struct cpumask **
   include/linux/compiler.h:58:52: note: in definition of macro '__trace_if_var'
      58 | #define __trace_if_var(cond) (__builtin_constant_p(cond) ? (cond) : __trace_if_value(cond))
         |                                                    ^~~~
   kernel/trace/trace_hwlat.c:496:2: note: in expansion of macro 'if'
     496 |  if (!alloc_cpumask_var(&this_cpumask, GFP_KERNEL))
         |  ^~
   In file included from include/linux/smp.h:13,
                    from include/linux/lockdep.h:14,
                    from include/linux/rcupdate.h:29,
                    from include/linux/rculist.h:11,
                    from include/linux/pid.h:5,
                    from include/linux/sched.h:14,
                    from include/linux/kthread.h:6,
                    from kernel/trace/trace_hwlat.c:40:
   include/linux/cpumask.h:767:53: note: expected 'struct cpumask (*)[1]' but argument is of type 'struct cpumask **'
     767 | static inline bool alloc_cpumask_var(cpumask_var_t *mask, gfp_t flags)
         |                                      ~~~~~~~~~~~~~~~^~~~
   In file included from include/linux/err.h:5,
                    from include/linux/kthread.h:5,
                    from kernel/trace/trace_hwlat.c:40:
   kernel/trace/trace_hwlat.c:496:25: error: passing argument 1 of 'alloc_cpumask_var' from incompatible pointer type [-Werror=incompatible-pointer-types]
     496 |  if (!alloc_cpumask_var(&this_cpumask, GFP_KERNEL))
         |                         ^~~~~~~~~~~~~
         |                         |
         |                         struct cpumask **
   include/linux/compiler.h:58:61: note: in definition of macro '__trace_if_var'
      58 | #define __trace_if_var(cond) (__builtin_constant_p(cond) ? (cond) : __trace_if_value(cond))
         |                                                             ^~~~
   kernel/trace/trace_hwlat.c:496:2: note: in expansion of macro 'if'
     496 |  if (!alloc_cpumask_var(&this_cpumask, GFP_KERNEL))
         |  ^~
   In file included from include/linux/smp.h:13,
                    from include/linux/lockdep.h:14,
                    from include/linux/rcupdate.h:29,
                    from include/linux/rculist.h:11,
                    from include/linux/pid.h:5,
                    from include/linux/sched.h:14,
                    from include/linux/kthread.h:6,
                    from kernel/trace/trace_hwlat.c:40:
   include/linux/cpumask.h:767:53: note: expected 'struct cpumask (*)[1]' but argument is of type 'struct cpumask **'
     767 | static inline bool alloc_cpumask_var(cpumask_var_t *mask, gfp_t flags)
         |                                      ~~~~~~~~~~~~~~~^~~~
   In file included from include/linux/err.h:5,
                    from include/linux/kthread.h:5,
                    from kernel/trace/trace_hwlat.c:40:
   kernel/trace/trace_hwlat.c:496:25: error: passing argument 1 of 'alloc_cpumask_var' from incompatible pointer type [-Werror=incompatible-pointer-types]
     496 |  if (!alloc_cpumask_var(&this_cpumask, GFP_KERNEL))
         |                         ^~~~~~~~~~~~~
         |                         |
         |                         struct cpumask **
   include/linux/compiler.h:69:3: note: in definition of macro '__trace_if_value'
      69 |  (cond) ?     \
         |   ^~~~
   include/linux/compiler.h:56:28: note: in expansion of macro '__trace_if_var'
      56 | #define if(cond, ...) if ( __trace_if_var( !!(cond , ## __VA_ARGS__) ) )
         |                            ^~~~~~~~~~~~~~
   kernel/trace/trace_hwlat.c:496:2: note: in expansion of macro 'if'
     496 |  if (!alloc_cpumask_var(&this_cpumask, GFP_KERNEL))
         |  ^~
   In file included from include/linux/smp.h:13,
                    from include/linux/lockdep.h:14,
                    from include/linux/rcupdate.h:29,
                    from include/linux/rculist.h:11,
                    from include/linux/pid.h:5,
                    from include/linux/sched.h:14,
                    from include/linux/kthread.h:6,
                    from kernel/trace/trace_hwlat.c:40:
   include/linux/cpumask.h:767:53: note: expected 'struct cpumask (*)[1]' but argument is of type 'struct cpumask **'
     767 | static inline bool alloc_cpumask_var(cpumask_var_t *mask, gfp_t flags)
         |                                      ~~~~~~~~~~~~~~~^~~~
   kernel/trace/trace_hwlat.c: In function 'hwlat_mode_write':
   kernel/trace/trace_hwlat.c:800:6: warning: variable 'ret' set but not used [-Wunused-but-set-variable]
     800 |  int ret;
         |      ^~~
   cc1: some warnings being treated as errors


vim +/get_cpu_data +122 kernel/trace/trace_hwlat.c

   121	
 > 122	struct hwlat_kthread_data *get_cpu_data(void)
   123	{
   124		if (hwlat_data.thread_mode == MODE_PER_CPU)
   125			return this_cpu_ptr(&hwlat_per_cpu_data);
   126		else
   127			return &hwlat_single_cpu_data;
   128	}
   129	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 25278 bytes --]


* Re: [RFC PATCH 3/5] tracing/hwlat: Implement the per-cpu mode
  2021-04-08 14:13 ` [RFC PATCH 3/5] tracing/hwlat: Implement the per-cpu mode Daniel Bristot de Oliveira
  2021-04-08 19:39   ` kernel test robot
  2021-04-08 21:39   ` kernel test robot
@ 2021-04-08 23:54   ` kernel test robot
  2021-04-14 14:41   ` Steven Rostedt
  3 siblings, 0 replies; 25+ messages in thread
From: kernel test robot @ 2021-04-08 23:54 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 4566 bytes --]

Hi Daniel,

[FYI, it's a private test report for your RFC patch.]
[auto build test ERROR on tip/perf/core]
[also build test ERROR on linux/master linus/master v5.12-rc6]
[cannot apply to trace/for-next next-20210408]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Daniel-Bristot-de-Oliveira/hwlat-improvements-and-osnoise-tracer/20210408-221655
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git cface0326a6c2ae5c8f47bd466f07624b3e348a7
config: mips-randconfig-r002-20210408 (attached as .config)
compiler: clang version 13.0.0 (https://github.com/llvm/llvm-project 56ea2e2fdd691136d5e6631fa0e447173694b82c)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install mips cross compiling tool for clang build
        # apt-get install binutils-mips-linux-gnu
        # https://github.com/0day-ci/linux/commit/4e2f5d30c69f77756e8cf223acf55c2aa2657393
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Daniel-Bristot-de-Oliveira/hwlat-improvements-and-osnoise-tracer/20210408-221655
        git checkout 4e2f5d30c69f77756e8cf223acf55c2aa2657393
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=mips 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   kernel/trace/trace_hwlat.c:122:28: warning: no previous prototype for function 'get_cpu_data' [-Wmissing-prototypes]
   struct hwlat_kthread_data *get_cpu_data(void)
                              ^
   kernel/trace/trace_hwlat.c:122:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   struct hwlat_kthread_data *get_cpu_data(void)
   ^
   static 
>> kernel/trace/trace_hwlat.c:496:25: error: incompatible pointer types passing 'struct cpumask **' to parameter of type 'cpumask_var_t *' (aka 'struct cpumask (*)[1]') [-Werror,-Wincompatible-pointer-types]
           if (!alloc_cpumask_var(&this_cpumask, GFP_KERNEL))
                                  ^~~~~~~~~~~~~
   include/linux/cpumask.h:767:53: note: passing argument to parameter 'mask' here
   static inline bool alloc_cpumask_var(cpumask_var_t *mask, gfp_t flags)
                                                       ^
   1 warning and 1 error generated.


vim +496 kernel/trace/trace_hwlat.c

   480	
   481	/**
   482	 * start_per_cpu_kthread - Kick off the hardware latency sampling/detector kthreads
   483	 *
   484	 * This starts the kernel threads that will sit on potentially all cpus and
   485	 * sample the CPU timestamp counter (TSC or similar) and look for potential
   486	 * hardware latencies.
   487	 */
   488	static int start_per_cpu_kthreads(struct trace_array *tr)
   489	{
   490		struct cpumask *current_mask = &save_cpumask;
   491		struct cpumask *this_cpumask;
   492		struct task_struct *kthread;
   493		char comm[24];
   494		int cpu;
   495	
 > 496		if (!alloc_cpumask_var(&this_cpumask, GFP_KERNEL))
   497			return -ENOMEM;
   498	
   499		get_online_cpus();
   500		/*
   501		 * Run only on CPUs in which trace and hwlat are allowed to run.
   502		 */
   503		cpumask_and(current_mask, tr->tracing_cpumask, &hwlat_cpumask);
   504		/*
   505		 * And the CPU is online.
   506		 */
   507		cpumask_and(current_mask, cpu_online_mask, current_mask);
   508		put_online_cpus();
   509	
   510		for_each_online_cpu(cpu)
   511			per_cpu(hwlat_per_cpu_data, cpu).kthread = NULL;
   512	
   513		for_each_cpu(cpu, current_mask) {
   514			snprintf(comm, 24, "hwlatd/%d", cpu);
   515	
   516			kthread = kthread_create_on_cpu(kthread_fn, NULL, cpu, comm);
   517			if (IS_ERR(kthread)) {
   518				pr_err(BANNER "could not start sampling thread\n");
   519				stop_per_cpu_kthreads();
   520				return -ENOMEM;
   521			}
   522	
   523			per_cpu(hwlat_per_cpu_data, cpu).kthread = kthread;
   524			wake_up_process(kthread);
   525		}
   526	
   527		return 0;
   528	}
   529	
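
A note on the error above, for anyone reproducing it: cpumask_var_t is a
one-element array of struct cpumask when CONFIG_CPUMASK_OFFSTACK is
disabled and a plain pointer when it is enabled, so a local declared as
'struct cpumask *' only matches the parameter type in the second case.
Declaring the local as cpumask_var_t should satisfy both. The userspace
sketch below models the array case only; struct cpumask, cpumask_var_t and
alloc_cpumask_var() here are simplified stand-ins, not the kernel
definitions, and setup_mask() is a hypothetical caller.

```c
#include <stdbool.h>
#include <string.h>

/* Simplified stand-in for the kernel's struct cpumask (assumes <= 64 CPUs). */
struct cpumask { unsigned long long bits; };

/* Models the !CONFIG_CPUMASK_OFFSTACK case: a one-element array type. */
typedef struct cpumask cpumask_var_t[1];

/*
 * Stand-in for alloc_cpumask_var(). In the array case there is nothing to
 * allocate; this version also zeroes the mask to keep the demo deterministic.
 */
static bool alloc_cpumask_var(cpumask_var_t *mask, unsigned int flags)
{
	(void)flags;
	memset(*mask, 0, sizeof(*mask));
	return true;
}

/* With the local typed as cpumask_var_t, &this_cpumask has the expected type. */
static bool setup_mask(void)
{
	cpumask_var_t this_cpumask;

	if (!alloc_cpumask_var(&this_cpumask, 0))
		return false;

	/* The array decays to struct cpumask * when used or passed to helpers. */
	this_cpumask->bits |= 1ULL << 2;
	return this_cpumask->bits == 4;
}
```

In the kernel, a mask obtained this way would also need a matching
free_cpumask_var() once it is no longer used.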

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 31347 bytes --]


* Re: [RFC PATCH 5/5] tracing: Add the osnoise tracer
  2021-04-08 14:13 ` [RFC PATCH 5/5] tracing: Add the osnoise tracer Daniel Bristot de Oliveira
  2021-04-08 15:58   ` Jonathan Corbet
@ 2021-04-08 23:57   ` kernel test robot
  2021-04-14 17:14   ` Steven Rostedt
  2 siblings, 0 replies; 25+ messages in thread
From: kernel test robot @ 2021-04-08 23:57 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 2151 bytes --]

Hi Daniel,

[FYI, it's a private test report for your RFC patch.]
[auto build test ERROR on tip/perf/core]
[also build test ERROR on linux/master linus/master v5.12-rc6]
[cannot apply to trace/for-next next-20210408]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Daniel-Bristot-de-Oliveira/hwlat-improvements-and-osnoise-tracer/20210408-221655
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git cface0326a6c2ae5c8f47bd466f07624b3e348a7
config: s390-randconfig-c023-20210408 (attached as .config)
compiler: s390-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/8fc23c7171dc536d3ce8c17fb66cfe6fcdbc5fb6
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Daniel-Bristot-de-Oliveira/hwlat-improvements-and-osnoise-tracer/20210408-221655
        git checkout 8fc23c7171dc536d3ce8c17fb66cfe6fcdbc5fb6
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=s390 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   s390-linux-ld: arch/s390/appldata/appldata_base.o: in function `appldata_generic_handler':
   appldata_base.c:(.text+0x1d0): undefined reference to `sysctl_vals'
   s390-linux-ld: kernel/trace/trace.o: in function `__update_max_tr':
   trace.c:(.text+0x6c0e): undefined reference to `latency_fsnotify'
   s390-linux-ld: kernel/trace/trace_hwlat.o: in function `kthread_fn':
>> trace_hwlat.c:(.text+0xc26): undefined reference to `latency_fsnotify'

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 35242 bytes --]


* Re: [RFC PATCH 5/5] tracing: Add the osnoise tracer
  2021-04-08 15:58   ` Jonathan Corbet
@ 2021-04-09  7:19     ` Daniel Bristot de Oliveira
  0 siblings, 0 replies; 25+ messages in thread
From: Daniel Bristot de Oliveira @ 2021-04-09  7:19 UTC (permalink / raw)
  To: Jonathan Corbet, Steven Rostedt, linux-kernel
  Cc: kcarcia, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Alexandre Chartre, Clark Willaims, John Kacur, Juri Lelli,
	linux-doc

On 4/8/21 5:58 PM, Jonathan Corbet wrote:
> Daniel Bristot de Oliveira <bristot@redhat.com> writes:
> 
> A quick nit:
> 
>>  Documentation/trace/osnoise_tracer.rst |  149 ++
>>  include/linux/ftrace_irq.h             |   16 +
>>  include/trace/events/osnoise.h         |  141 ++
>>  kernel/trace/Kconfig                   |   34 +
>>  kernel/trace/Makefile                  |    1 +
>>  kernel/trace/trace.h                   |    9 +-
>>  kernel/trace/trace_entries.h           |   27 +
>>  kernel/trace/trace_osnoise.c           | 1714 ++++++++++++++++++++++++
>>  kernel/trace/trace_output.c            |   72 +-
>>  9 files changed, 2159 insertions(+), 4 deletions(-)
>>  create mode 100644 Documentation/trace/osnoise_tracer.rst
>>  create mode 100644 include/trace/events/osnoise.h
>>  create mode 100644 kernel/trace/trace_osnoise.c
> When you create a new RST file, you need to add it to an index.rst (or
> similar) file so that it gets incorporated into the docs build.


ack!

> 
> The document itself looks good on a quick read.  If you're making
> another pass over it, you might consider reducing the ``markup noise`` a
> bit; we try to keep that to a minimum in the kernel docs.  But otherwise
> thanks for writing it!

Thanks for the review, Jon. I will reduce the `` markup (on this, and on some
other docs that are about to come :-))

-- Daniel
> jon
> 



* Re: [RFC PATCH 1/5] tracing/hwlat: Add a cpus file specific for hwlat_detector
  2021-04-08 14:13 ` [RFC PATCH 1/5] tracing/hwlat: Add a cpus file specific for hwlat_detector Daniel Bristot de Oliveira
@ 2021-04-14 14:10   ` Steven Rostedt
  2021-04-15 13:09     ` Daniel Bristot de Oliveira
  0 siblings, 1 reply; 25+ messages in thread
From: Steven Rostedt @ 2021-04-14 14:10 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: linux-kernel, kcarcia, Jonathan Corbet, Ingo Molnar,
	Peter Zijlstra, Thomas Gleixner, Alexandre Chartre,
	Clark Willaims, John Kacur, Juri Lelli, linux-doc

On Thu,  8 Apr 2021 16:13:19 +0200
Daniel Bristot de Oliveira <bristot@redhat.com> wrote:

> Provides a "cpus" interface to the hardware latency detector. By
> default, it lists all CPUs, allowing hwlatd threads to run on any online
> CPU of the system.
> 
> It serves to restrict the execution of hwlatd to the set of CPUs written
> via this interface. Note that hwlatd also respects the "tracing_cpumask."
> Hence, hwlatd threads will run only on the set of CPUs allowed here AND
> on "tracing_cpumask."
> 
> Why not keep just "tracing_cpumask"? Because the user might be interested
> in tracing what is running on other CPUs. For instance, one might run
> hwlatd in one HT CPU while observing what is running on the sibling HT
> CPU. The cpu list format is also more intuitive.
> 
> This is also in preparation for the per-cpu mode.

OK, I'm still not convinced that you couldn't use tracing_cpumask here.
Because we have instances, and tracing_cpumask is defined per instance, you
could simply do:

 # cd /sys/kernel/tracing
 # mkdir instances/hwlat
 # echo a > instances/hwlat/tracing_cpumask
 # echo hwlat > instances/hwlat/current_tracer

Now the tracing_cpumask above only affects the hwlat tracer.

I'm just reluctant to add more tracing files if the current ones can be
used without too much trouble. For being intuitive, let's make user space
tools hide the nastiness of the kernel interface ;-)
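
For reference, tracing_cpumask takes a hexadecimal bitmask, so the "a"
written above selects CPUs 1 and 3. A small userspace sketch of that
mapping follows; mask_to_cpus() is a hypothetical helper name, and it only
handles masks of up to 64 bits without comma separators.

```c
#include <stdlib.h>

/*
 * Decode a hex CPU mask string (e.g. "a") into a list of CPU numbers.
 * Returns the number of entries written to cpus[] (at most max).
 */
static int mask_to_cpus(const char *hex, int *cpus, int max)
{
	unsigned long long mask = strtoull(hex, NULL, 16);
	int n = 0;

	for (int cpu = 0; cpu < 64 && n < max; cpu++) {
		if (mask & (1ULL << cpu))
			cpus[n++] = cpu;
	}
	return n;
}
```

A user space tool could do the reverse, taking a CPU list from the user
and writing the hex mask into the instance's tracing_cpumask file.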

-- Steve



* Re: [RFC PATCH 2/5] tracing/hwlat: Implement the mode config option
  2021-04-08 14:13 ` [RFC PATCH 2/5] tracing/hwlat: Implement the mode config option Daniel Bristot de Oliveira
  2021-04-08 20:52   ` kernel test robot
@ 2021-04-14 14:30   ` Steven Rostedt
  2021-04-15 13:16     ` Daniel Bristot de Oliveira
  1 sibling, 1 reply; 25+ messages in thread
From: Steven Rostedt @ 2021-04-14 14:30 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: linux-kernel, kcarcia, Jonathan Corbet, Ingo Molnar,
	Peter Zijlstra, Thomas Gleixner, Alexandre Chartre,
	Clark Willaims, John Kacur, Juri Lelli, linux-doc

On Thu,  8 Apr 2021 16:13:20 +0200
Daniel Bristot de Oliveira <bristot@redhat.com> wrote:

> +/**
> + * hwlat_mode_write - Write function for "mode" entry
> + * @filp: The active open file structure
> + * @ubuf: The user buffer that contains the value to write
> + * @cnt: The maximum number of bytes to write to "file"
> + * @ppos: The current position in @file
> + *
> + * This function provides a write implementation for the "mode" interface
> + * to the hardware latency detector. hwlatd has different operation modes.
> + * The "none" sets the allowed cpumask for a single hwlatd thread at the
> + * startup and lets the scheduler handle the migration. The default mode is
> + * the "round-robin" one, in which a single hwlatd thread runs, migrating
> + * among the allowed CPUs in a round-robin fashion.
> + */
> +static ssize_t hwlat_mode_write(struct file *filp, const char __user *ubuf,
> +				 size_t cnt, loff_t *ppos)
> +{
> +	const char *mode;
> +	char buf[64];
> +	int ret;
> +	int i;
> +
> +	if (hwlat_busy)
> +		return -EBUSY;

So we can't switch modes while running?


Also, with this implemented, you can remove the disable_migrate variable,
and just switch the mode to NONE when it's detected that the affinity mask
of the thread has been changed.

-- Steve


> +
> +	if (cnt >= sizeof(buf))
> +		return -EINVAL;
> +
> +	if (copy_from_user(buf, ubuf, cnt))
> +		return -EFAULT;
> +
> +	buf[cnt] = 0;
> +
> +	mode = strstrip(buf);
> +
> +	ret = -EINVAL;
> +
> +	for (i = 0; i < MODE_MAX; i++) {
> +		if (strcmp(mode, thread_mode_str[i]) == 0) {
> +			hwlat_data.thread_mode = i;
> +			ret = cnt;
> +		}
> +	}
> +
> +	*ppos += cnt;
> +
> +	return cnt;
> +}
> +
> +
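
The W=1 report earlier in the thread flags 'ret' as set but not used in
this function: it ends with 'return cnt', so a string that matches no
mode is still reported as a successful write. A userspace sketch of the
lookup with the error propagated is below. parse_thread_mode() is a
hypothetical name, and the mode table simply mirrors the one in this
patch.

```c
#include <errno.h>
#include <string.h>

enum { MODE_NONE = 0, MODE_ROUND_ROBIN, MODE_MAX };

static const char * const thread_mode_str[] = { "none", "round-robin" };

/*
 * Look up a mode name. Returns the mode index on a match, or -EINVAL so
 * the caller can hand the error back to the writer.
 */
static int parse_thread_mode(const char *mode)
{
	for (int i = 0; i < MODE_MAX; i++) {
		if (strcmp(mode, thread_mode_str[i]) == 0)
			return i;
	}
	return -EINVAL;
}
```

The write handler would then return cnt only when the lookup succeeds and
the negative value otherwise.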


* Re: [RFC PATCH 3/5] tracing/hwlat: Implement the per-cpu mode
  2021-04-08 14:13 ` [RFC PATCH 3/5] tracing/hwlat: Implement the per-cpu mode Daniel Bristot de Oliveira
                     ` (2 preceding siblings ...)
  2021-04-08 23:54   ` kernel test robot
@ 2021-04-14 14:41   ` Steven Rostedt
  2021-04-15 13:22     ` Daniel Bristot de Oliveira
  3 siblings, 1 reply; 25+ messages in thread
From: Steven Rostedt @ 2021-04-14 14:41 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: linux-kernel, kcarcia, Jonathan Corbet, Ingo Molnar,
	Peter Zijlstra, Thomas Gleixner, Alexandre Chartre,
	Clark Willaims, John Kacur, Juri Lelli, linux-doc

On Thu,  8 Apr 2021 16:13:21 +0200
Daniel Bristot de Oliveira <bristot@redhat.com> wrote:

> Implements the per-cpu mode in which a sampling thread is created for
> each cpu in the "cpus" (and tracing_mask).
> 
> The per-cpu mode has the potential to speed up the hwlat detection by
> running on multiple CPUs at the same time.

And totally slow down the entire system in the process ;-)

> 
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Alexandre Chartre <alexandre.chartre@oracle.com>
> Cc: Clark Willaims <williams@redhat.com>
> Cc: John Kacur <jkacur@redhat.com>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: linux-doc@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
> 
> ---
>  Documentation/trace/hwlat_detector.rst |   6 +-
>  kernel/trace/trace_hwlat.c             | 171 +++++++++++++++++++------
>  2 files changed, 137 insertions(+), 40 deletions(-)
> 
> diff --git a/Documentation/trace/hwlat_detector.rst b/Documentation/trace/hwlat_detector.rst
> index f63fdd867598..7a6fab105b29 100644
> --- a/Documentation/trace/hwlat_detector.rst
> +++ b/Documentation/trace/hwlat_detector.rst
> @@ -85,10 +85,12 @@ the available options are:
>  
>   - none:        do not force migration
>   - round-robin: migrate across each CPU specified in cpus between each window
> + - per-cpu:     create a per-cpu thread for each cpu in cpus
>  
>  By default, hwlat detector will also obey the tracing_cpumask, so the thread
>  will be placed only in the set of cpus that is both on the hwlat detector's
>  cpus and in the global tracing_cpumask file. The user can overwrite the
>  cpumask by setting it manually. Changing the hwlatd affinity externally,
> -e.g., via taskset tool, will disable the round-robin migration.
> -
> +e.g., via taskset tool, will disable the round-robin migration. In the
> +per-cpu mode, the per-cpu thread (hwlatd/CPU) will be pinned to its relative
> +cpu, and its affinity cannot be changed.
> diff --git a/kernel/trace/trace_hwlat.c b/kernel/trace/trace_hwlat.c
> index 3818200c9e24..52968ea312df 100644
> --- a/kernel/trace/trace_hwlat.c
> +++ b/kernel/trace/trace_hwlat.c
> @@ -34,7 +34,7 @@
>   * Copyright (C) 2008-2009 Jon Masters, Red Hat, Inc. <jcm@redhat.com>
>   * Copyright (C) 2013-2016 Steven Rostedt, Red Hat, Inc. <srostedt@redhat.com>
>   *
> - * Includes useful feedback from Clark Williams <clark@redhat.com>
> + * Includes useful feedback from Clark Williams <williams@redhat.com>

Interesting update ;-)

>   *
>   */
>  #include <linux/kthread.h>
> @@ -54,9 +54,6 @@ static struct trace_array	*hwlat_trace;
>  #define DEFAULT_SAMPLE_WIDTH	500000			/* 0.5s */
>  #define DEFAULT_LAT_THRESHOLD	10			/* 10us */
>  
> -/* sampling thread*/
> -static struct task_struct *hwlat_kthread;
> -
>  static struct dentry *hwlat_sample_width;	/* sample width us */
>  static struct dentry *hwlat_sample_window;	/* sample window us */
>  static struct dentry *hwlat_cpumask_dentry;	/* hwlat cpus allowed */
> @@ -65,19 +62,27 @@ static struct dentry *hwlat_thread_mode;	/* hwlat thread mode */
>  enum {
>  	MODE_NONE = 0,
>  	MODE_ROUND_ROBIN,
> +	MODE_PER_CPU,
>  	MODE_MAX
>  };
>  
> -static char *thread_mode_str[] = { "none", "round-robin" };
> +static char *thread_mode_str[] = { "none", "round-robin", "per-cpu" };
>  
>  /* Save the previous tracing_thresh value */
>  static unsigned long save_tracing_thresh;
>  
> -/* NMI timestamp counters */
> -static u64 nmi_ts_start;
> -static u64 nmi_total_ts;
> -static int nmi_count;
> -static int nmi_cpu;
> +/* runtime kthread data */
> +struct hwlat_kthread_data {
> +	struct task_struct *kthread;
> +	/* NMI timestamp counters */
> +	u64 nmi_ts_start;
> +	u64 nmi_total_ts;
> +	int nmi_count;
> +	int nmi_cpu;
> +};
> +
> +struct hwlat_kthread_data hwlat_single_cpu_data;
> +DEFINE_PER_CPU(struct hwlat_kthread_data, hwlat_per_cpu_data);
>  
>  /* Tells NMIs to call back to the hwlat tracer to record timestamps */
>  bool trace_hwlat_callback_enabled;
> @@ -114,6 +119,14 @@ static struct hwlat_data {
>  	.thread_mode		= MODE_ROUND_ROBIN
>  };
>  
> +struct hwlat_kthread_data *get_cpu_data(void)
> +{
> +	if (hwlat_data.thread_mode == MODE_PER_CPU)
> +		return this_cpu_ptr(&hwlat_per_cpu_data);
> +	else
> +		return &hwlat_single_cpu_data;
> +}
> +
>  static bool hwlat_busy;
>  
>  static void trace_hwlat_sample(struct hwlat_sample *sample)
> @@ -151,7 +164,9 @@ static void trace_hwlat_sample(struct hwlat_sample *sample)
>  
>  void trace_hwlat_callback(bool enter)
>  {
> -	if (smp_processor_id() != nmi_cpu)
> +	struct hwlat_kthread_data *kdata = get_cpu_data();
> +
> +	if (kdata->kthread)
>  		return;
>  
>  	/*
> @@ -160,13 +175,13 @@ void trace_hwlat_callback(bool enter)
>  	 */
>  	if (!IS_ENABLED(CONFIG_GENERIC_SCHED_CLOCK)) {
>  		if (enter)
> -			nmi_ts_start = time_get();
> +			kdata->nmi_ts_start = time_get();
>  		else
> -			nmi_total_ts += time_get() - nmi_ts_start;
> +			kdata->nmi_total_ts += time_get() - kdata->nmi_ts_start;
>  	}
>  
>  	if (enter)
> -		nmi_count++;
> +		kdata->nmi_count++;
>  }
>  
>  /**
> @@ -178,6 +193,7 @@ void trace_hwlat_callback(bool enter)
>   */
>  static int get_sample(void)
>  {
> +	struct hwlat_kthread_data *kdata = get_cpu_data();
>  	struct trace_array *tr = hwlat_trace;
>  	struct hwlat_sample s;
>  	time_type start, t1, t2, last_t2;
> @@ -190,9 +206,8 @@ static int get_sample(void)
>  
>  	do_div(thresh, NSEC_PER_USEC); /* modifies interval value */
>  
> -	nmi_cpu = smp_processor_id();
> -	nmi_total_ts = 0;
> -	nmi_count = 0;
> +	kdata->nmi_total_ts = 0;
> +	kdata->nmi_count = 0;
>  	/* Make sure NMIs see this first */
>  	barrier();
>  
> @@ -262,15 +277,15 @@ static int get_sample(void)
>  		ret = 1;
>  
>  		/* We read in microseconds */
> -		if (nmi_total_ts)
> -			do_div(nmi_total_ts, NSEC_PER_USEC);
> +		if (kdata->nmi_total_ts)
> +			do_div(kdata->nmi_total_ts, NSEC_PER_USEC);
>  
>  		hwlat_data.count++;
>  		s.seqnum = hwlat_data.count;
>  		s.duration = sample;
>  		s.outer_duration = outer_sample;
> -		s.nmi_total_ts = nmi_total_ts;
> -		s.nmi_count = nmi_count;
> +		s.nmi_total_ts = kdata->nmi_total_ts;
> +		s.nmi_count = kdata->nmi_count;
>  		s.count = count;
>  		trace_hwlat_sample(&s);
>  
> @@ -376,23 +391,43 @@ static int kthread_fn(void *data)
>  }
>  
>  /**
> - * start_kthread - Kick off the hardware latency sampling/detector kthread
> + * stop_single_kthread - Inform the hardware latency sampling/detector kthread to stop
> + *
> + * This kicks the running hardware latency sampling/detector kernel thread and
> + * tells it to stop sampling now. Use this on unload and at system shutdown.
> + */
> +static void stop_single_kthread(void)
> +{
> +	struct hwlat_kthread_data *kdata = get_cpu_data();
> +	struct task_struct *kthread = kdata->kthread;
> +
> +	if (!kthread)
> +		return;
> +
> +	kthread_stop(kthread);
> +
> +	kdata->kthread = NULL;
> +}
> +
> +
> +/**
> + * start_single_kthread - Kick off the hardware latency sampling/detector kthread
>   *
>   * This starts the kernel thread that will sit and sample the CPU timestamp
>   * counter (TSC or similar) and look for potential hardware latencies.
>   */
> -static int start_kthread(struct trace_array *tr)
> +static int start_single_kthread(struct trace_array *tr)
>  {
> +	struct hwlat_kthread_data *kdata = get_cpu_data();
>  	struct cpumask *current_mask = &save_cpumask;
>  	struct task_struct *kthread;
>  	int next_cpu;
>  
> -	if (hwlat_kthread)
> +	if (kdata->kthread)
>  		return 0;
>  
> -
>  	kthread = kthread_create(kthread_fn, NULL, "hwlatd");
> -	if (IS_ERR(kthread)) {
> +	if (IS_ERR(kdata->kthread)) {
>  		pr_err(BANNER "could not start sampling thread\n");
>  		return -ENOMEM;
>  	}
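
Worth noting in this hunk: the result of kthread_create() is stored in
the local 'kthread', but the check is made on kdata->kthread. That field
is still NULL here (the early return above covers the non-NULL case), and
IS_ERR(NULL) is false, so a failed creation would go unnoticed. Checking
the local instead is presumably the intent. A small userspace model of
the ERR_PTR()/IS_ERR() encoding shows why; err_ptr() and is_err() are
stand-ins written for this example, not the kernel macros.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <errno.h>

#define MAX_ERRNO 4095

/* Encode a negative errno value in a pointer, as ERR_PTR() does. */
static void *err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

/* True only for pointers in the top MAX_ERRNO addresses, as IS_ERR() is. */
static bool is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}
```

With this encoding, is_err(err_ptr(-ENOMEM)) is true while is_err(NULL)
is false, which is why a check against a still-NULL field cannot catch
the failure.
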
> @@ -419,24 +454,77 @@ static int start_kthread(struct trace_array *tr)
>  
>  	sched_setaffinity(kthread->pid, current_mask);
>  
> -	hwlat_kthread = kthread;
> +	kdata->kthread = kthread;
>  	wake_up_process(kthread);
>  
>  	return 0;
>  }
>  
>  /**
> - * stop_kthread - Inform the hardware latency samping/detector kthread to stop
> + * stop_per_cpu_kthread - Inform the hardware latency samping/detector kthread to stop
>   *
> - * This kicks the running hardware latency sampling/detector kernel thread and
> + * This kicks the running hardware latency sampling/detector kernel threads and
>   * tells it to stop sampling now. Use this on unload and at system shutdown.
>   */
> -static void stop_kthread(void)
> +static void stop_per_cpu_kthreads(void)
>  {
> -	if (!hwlat_kthread)
> -		return;
> -	kthread_stop(hwlat_kthread);
> -	hwlat_kthread = NULL;
> +	struct task_struct *kthread;
> +	int cpu;
> +
> +	for_each_online_cpu(cpu) {
> +		kthread = per_cpu(hwlat_per_cpu_data, cpu).kthread;
> +		if (kthread)
> +			kthread_stop(kthread);

Probably want:

		per_cpu(hwlat_per_cpu_data, cpu).kthread = NULL;

Just to be safe. I don't like to rely on the start doing the job, as things
can change in the future. Having the clearing here as well makes the code
more robust.
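
A userspace sketch of the stop path with the pointer cleared as suggested
above. The types and stop_thread() are stand-ins for the per-cpu data and
kthread_stop(), and NR_CPUS_DEMO is an arbitrary size chosen for the
example.

```c
#include <stddef.h>

#define NR_CPUS_DEMO 4

struct task { int stopped; };			/* stand-in for task_struct */
struct kthread_data { struct task *kthread; };	/* stand-in for the per-cpu data */

static struct kthread_data per_cpu_data[NR_CPUS_DEMO];

/* Stand-in for kthread_stop(). */
static void stop_thread(struct task *t)
{
	t->stopped = 1;
}

/* Stop every registered thread and clear its slot so a later stop is a no-op. */
static void stop_per_cpu_threads(void)
{
	for (int cpu = 0; cpu < NR_CPUS_DEMO; cpu++) {
		struct task *t = per_cpu_data[cpu].kthread;

		if (!t)
			continue;
		stop_thread(t);
		per_cpu_data[cpu].kthread = NULL;
	}
}
```

With the slot cleared in the stop path, calling stop twice in a row never
touches a thread that has already been stopped, regardless of what the
start path does.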


> +	}
> +}
> +
> +/**
> + * start_per_cpu_kthread - Kick off the hardware latency sampling/detector kthreads
> + *
> + * This starts the kernel threads that will sit on potentially all cpus and
> + * sample the CPU timestamp counter (TSC or similar) and look for potential
> + * hardware latencies.
> + */
> +static int start_per_cpu_kthreads(struct trace_array *tr)
> +{
> +	struct cpumask *current_mask = &save_cpumask;
> +	struct cpumask *this_cpumask;
> +	struct task_struct *kthread;
> +	char comm[24];
> +	int cpu;
> +
> +	if (!alloc_cpumask_var(&this_cpumask, GFP_KERNEL))
> +		return -ENOMEM;
> +
> +	get_online_cpus();
> +	/*
> +	 * Run only on CPUs in which trace and hwlat are allowed to run.
> +	 */
> +	cpumask_and(current_mask, tr->tracing_cpumask, &hwlat_cpumask);
> +	/*
> +	 * And the CPU is online.
> +	 */
> +	cpumask_and(current_mask, cpu_online_mask, current_mask);
> +	put_online_cpus();
> +
> +	for_each_online_cpu(cpu)
> +		per_cpu(hwlat_per_cpu_data, cpu).kthread = NULL;
> +
> +	for_each_cpu(cpu, current_mask) {
> +		snprintf(comm, 24, "hwlatd/%d", cpu);
> +
> +		kthread = kthread_create_on_cpu(kthread_fn, NULL, cpu, comm);
> +		if (IS_ERR(kthread)) {
> +			pr_err(BANNER "could not start sampling thread\n");
> +			stop_per_cpu_kthreads();
> +			return -ENOMEM;
> +		}
> +
> +		per_cpu(hwlat_per_cpu_data, cpu).kthread = kthread;
> +		wake_up_process(kthread);
> +	}
> +
> +	return 0;
>  }
>  
>  /*
> @@ -701,7 +789,8 @@ static int hwlat_mode_open(struct inode *inode, struct file *file)
>   * The "none" sets the allowed cpumask for a single hwlatd thread at the
>   * startup and lets the scheduler handle the migration. The default mode is
>   * the "round-robin" one, in which a single hwlatd thread runs, migrating
> - * among the allowed CPUs in a round-robin fashion.
> + * among the allowed CPUs in a round-robin fashion. The "per-cpu" mode
> + * creates one hwlatd thread per allowed CPU.
>   */
>  static ssize_t hwlat_mode_write(struct file *filp, const char __user *ubuf,
>  				 size_t cnt, loff_t *ppos)
> @@ -827,14 +916,20 @@ static void hwlat_tracer_start(struct trace_array *tr)
>  {
>  	int err;
>  
> -	err = start_kthread(tr);
> +	if (hwlat_data.thread_mode == MODE_PER_CPU)
> +		err = start_per_cpu_kthreads(tr);
> +	else
> +		err = start_single_kthread(tr);
>  	if (err)
>  		pr_err(BANNER "Cannot start hwlat kthread\n");
>  }
>  
>  static void hwlat_tracer_stop(struct trace_array *tr)
>  {
> -	stop_kthread();
> +	if (hwlat_data.thread_mode == MODE_PER_CPU)
> +		stop_per_cpu_kthreads();
> +	else
> +		stop_single_kthread();

This explains why you have the "busy" check in the changing of the modes.
But really, I don't see why you can't change the mode. Just stop the
previous mode, and start the new one.
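
A minimal userspace sketch of that suggestion, not kernel code. The
struct and the functions stop_mode(), start_mode() and switch_mode() are
made up for illustration; they only stand in for the start_*/stop_*
kthread helpers in the patch, and "threads" is just a counter:

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the hwlat thread modes in the patch. */
enum thread_mode { MODE_NONE, MODE_ROUND_ROBIN, MODE_PER_CPU };

struct detector {
	enum thread_mode mode;
	bool running;		/* tracer currently started */
	int threads;		/* number of sampling threads alive */
	int nr_cpus;		/* CPUs allowed for per-cpu mode */
};

/* Stand-in for stop_single_kthread()/stop_per_cpu_kthreads(). */
void stop_mode(struct detector *d)
{
	d->threads = 0;
}

/* Stand-in for start_single_kthread()/start_per_cpu_kthreads(). */
void start_mode(struct detector *d)
{
	d->threads = (d->mode == MODE_PER_CPU) ? d->nr_cpus : 1;
}

/*
 * Change the mode without refusing when the tracer is busy: stop the
 * threads of the old mode, record the new mode, then start again.
 * If the tracer is not running, only the mode is recorded.
 */
void switch_mode(struct detector *d, enum thread_mode new_mode)
{
	if (d->mode == new_mode)
		return;

	if (d->running)
		stop_mode(d);

	d->mode = new_mode;

	if (d->running)
		start_mode(d);
}
```

With this shape the mode file would not need a "busy" check at all: a
write while tracing simply restarts the threads in the new mode.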

-- Steve


>  }
>  
>  static int hwlat_tracer_init(struct trace_array *tr)
> @@ -864,7 +959,7 @@ static int hwlat_tracer_init(struct trace_array *tr)
>  
>  static void hwlat_tracer_reset(struct trace_array *tr)
>  {
> -	stop_kthread();
> +	hwlat_tracer_stop(tr);
>  
>  	/* the tracing threshold is static between runs */
>  	last_tracing_thresh = tracing_thresh;



* Re: [RFC PATCH 5/5] tracing: Add the osnoise tracer
  2021-04-08 14:13 ` [RFC PATCH 5/5] tracing: Add the osnoise tracer Daniel Bristot de Oliveira
  2021-04-08 15:58   ` Jonathan Corbet
  2021-04-08 23:57   ` kernel test robot
@ 2021-04-14 17:14   ` Steven Rostedt
  2021-04-15 13:43     ` Daniel Bristot de Oliveira
  2 siblings, 1 reply; 25+ messages in thread
From: Steven Rostedt @ 2021-04-14 17:14 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: linux-kernel, kcarcia, Jonathan Corbet, Ingo Molnar,
	Peter Zijlstra, Thomas Gleixner, Alexandre Chartre,
	Clark Willaims, John Kacur, Juri Lelli, linux-doc

On Thu,  8 Apr 2021 16:13:23 +0200
Daniel Bristot de Oliveira <bristot@redhat.com> wrote:

> In the context of high-performance computing (HPC), the Operating System
> Noise (osnoise) refers to the interference experienced by an application
> due to activities inside the operating system. In the context of Linux,
> NMIs, IRQs, SoftIRQs, and any other system thread can cause noise to the
> system. Moreover, hardware-related jobs can also cause noise, for example,
> via SMIs.
> 
> hwlat_detector is one of the tools used to identify the most complex
> source of noise: hardware noise.
> 
> In a nutshell, the hwlat_detector creates a thread that runs
> periodically for a given period. At the beginning of a period, the thread
> disables interrupt and starts sampling. While running, the hwlatd
> thread reads the time in a loop. As interrupts are disabled, threads,
> IRQs, and SoftIRQs cannot interfere with the hwlatd thread. Hence, the
> cause of any gap between two different reads of the time roots either on
> NMI or in the hardware itself. At the end of the period, hwlatd enables
> interrupts and reports the max observed gap between the reads. It also
> prints an NMI occurrence counter. If the output does not report NMI
> executions, the user can conclude that the hardware is the culprit for
> the latency. The hwlat detects the NMI execution by observing
> the entry and exit of an NMI.
> 
> The osnoise tracer leverages the hwlat_detector by running a
> similar loop with preemption, SoftIRQs and IRQs enabled, thus allowing
> all the sources of osnoise during its execution. Using the same approach
> of hwlat, osnoise takes note of the entry and exit point of any
> source of interferences, increasing a per-cpu interference counter. The
> osnoise tracer also saves an interference counter for each source of
> interference. The interference counter for NMI, IRQs, SoftIRQs, and
> threads is increased anytime the tool observes these interferences' entry
> events. When a noise happens without any interference from the operating
> system level, the hardware noise counter increases, pointing to a
> hardware-related noise. In this way, osnoise can account for any
> source of interference. At the end of the period, the osnoise tracer
> prints the sum of all noise, the max single noise, the percentage of CPU
> available for the thread, and the counters for the noise sources.
> 
> Usage
> 
> Write the ASCII text osnoise into the current_tracer file of the
> tracing system (generally mounted at /sys/kernel/tracing or
> /sys/kernel/debug/tracing).
> 
> For example::
> 
>         [root@f32 ~]# cd /sys/kernel/tracing/
>         [root@f32 tracing]# echo osnoise > current_tracer
> 
> It is possible to follow the trace by reading the trace trace file::
> 
>         [root@f32 tracing]# cat trace
>         # tracer: osnoise
>         #
>         #                                _-----=> irqs-off
>         #                               / _----=> need-resched
>         #                              | / _---=> hardirq/softirq
>         #                              || / _--=> preempt-depth                            MAX
>         #                              || /                                             SINGLE     Interference counters:
>         #                              ||||               RUNTIME      NOISE   % OF CPU  NOISE    +-----------------------------+
>         #           TASK-PID      CPU# ||||   TIMESTAMP    IN US       IN US  AVAILABLE  IN US     HW    NMI    IRQ   SIRQ THREAD
>         #              | |         |   ||||      |           |             |    |            |      |      |      |      |      |
>                    <...>-859     [000] ....    81.637220: 1000000        190  99.98100       9     18      0   1007     18      1
>                    <...>-860     [001] ....    81.638154: 1000000        656  99.93440      74     23      0   1006     16      3
>                    <...>-861     [002] ....    81.638193: 1000000       5675  99.43250     202      6      0   1013     25     21
>                    <...>-862     [003] ....    81.638242: 1000000        125  99.98750      45      1      0   1011     23      0
>                    <...>-863     [004] ....    81.638260: 1000000       1721  99.82790     168      7      0   1002     49     41
>                    <...>-864     [005] ....    81.638286: 1000000        263  99.97370      57      6      0   1006     26      2
>                    <...>-865     [006] ....    81.638302: 1000000        109  99.98910      21      3      0   1006     18      1
>                    <...>-866     [007] ....    81.638326: 1000000       7816  99.21840     107      8      0   1016     39     19
> 
> In addition to the regular trace fields (from TASK-PID to TIMESTAMP), the
> tracer prints a message at the end of each period for each CPU that is
> running an osnoise/ thread. The osnoise specific fields report:
> 
>  - The RUNTIME IN USE reports the amount of time in microseconds that
>    the osnoise thread kept looping reading the time.
>  - The NOISE IN US reports the sum of noise in microseconds observed
>    by the osnoise tracer during the associated runtime.
>  - The % OF CPU AVAILABLE reports the percentage of CPU available for
>    the osnoise thread during the runtime window.
>  - The MAX SINGLE NOISE IN US reports the maximum single noise observed
>    during the runtime window.
>  - The Interference counters display how many each of the respective
>    interference happened during the runtime window.
> 
> Note that the example above shows a high number of HW noise samples.
> The reason being is that this sample was taken on a virtual machine,
> and the host interference is detected as a hardware interference.
> 
> Tracer options
> 
> The tracer has a set of options inside the osnoise directory, they are:
> 
>  - cpus: CPUs at which a osnoise thread will execute.

Again, I think we can reuse the tracing_cpumask.

>  - period_us: the period of the osnoise thread.
>  - runtime_us: how long an osnoise thread will look for noise.

These seem the same as window and width. At a minimum they should probably
share the same code.

>  - stop_tracing_single_us: stop the system tracing of a single noise
>    higher than the configured value is happens. Writing 0 disables this
>    option.
>  - stop_tracing_total_us: stop the system tracing of a NOISE IN USE
>    higher than the configured value is happens. Writing 0 disables this
>    option.
>  - tolerance_ns: the minimum delta between two time() reads to be
>    considered as noise.

You can use tracing_threshold for the tolerance. Do you really need it in
ns?

> 
> Additional Tracing
> 
> In addition to the tracer, a set of tracepoints were added to
> facilitate the identification of the osnoise source.
> 
>  - osnoise:sample_threshold: printed anytime a noise is higher than
>    the configurable tolerance_ns.
>  - osnoise:nmi_noise: noise from NMI, including the duration.
>  - osnoise:irq_noise: noise from an IRQ, including the duration.
>  - osnoise:softirq_noise: noise from a SoftIRQ, including the
>    duration.
>  - osnoise:thread_noise: noise from a thread, including the duration.
> 
> Note that a all the values are net values. This means that a thread
> duration will not contain the duration of the IRQs that happened during
> its execution, for example. The same is valid for all duration values.
> 
> Here is one example of the usage of these tracepoints::
> 
>        osnoise/8-961     [008] d.h.  5789.857532: irq_noise: local_timer:236 start 5789.857529929 duration 1845 ns
>        osnoise/8-961     [008] dNh.  5789.858408: irq_noise: local_timer:236 start 5789.858404871 duration 2848 ns
>      migration/8-54      [008] d...  5789.858413: thread_noise: migration/8:54 start 5789.858409300 duration 3068 ns
>        osnoise/8-961     [008] ....  5789.858413: sample_threshold: start 5789.858404555 duration 8 us interferences 2
> 
> In this example, a noise sample of 8 microseconds was reported in the last
> fine, pointing to two interferences. Looking backward in the trace, the
> two previous entries were about the migration thread running after
> a timer IRQ execution. The first event is not part of the noise because
> it took place one millisecond before.
> 
> It is worth noticing that the sum of the duration reported in the
> tracepoints is smaller than eight us reported in the
> sample_threshold. The reason roots in the tracing overhead and in
> the overhead of the entry and exit code that happens before and after
> any interference execution. This justifies the dual approach: measuring
> thread and tracing.

I'm not sure the tracing overhead had as much to do with it as did the
overhead of entering the interrupt itself. Events are rather fast (usually
less than 200ns depending on the system). You can always enable the
benchmark_event to see what the trace event overhead is. Then again, cold
cache can play into it as well.
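
For reference, the size of the gap in the example can be worked out from
the two interferences that fall inside the sample (the 2848 ns timer IRQ
and the 3068 ns migration thread). The sketch below is plain userspace C;
noise_gap_ns() is a name made up here, and the 8 us sample duration is
treated as exactly 8000 ns even though the tracepoint reports it rounded
to microseconds:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return how much of a noise sample is not explained by the durations
 * reported in the interference tracepoints (all values in ns).
 * Returns 0 if the interferences cover the whole sample.
 */
uint64_t noise_gap_ns(uint64_t sample_ns, const uint64_t *interf_ns, size_t n)
{
	uint64_t sum = 0;

	for (size_t i = 0; i < n; i++)
		sum += interf_ns[i];

	return sample_ns > sum ? sample_ns - sum : 0;
}
```

With the values above, 2848 + 3068 = 5916 ns, so roughly 2 us (2084 ns)
of the 8 us sample is not accounted for by the tracepoints.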

> 
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Alexandre Chartre <alexandre.chartre@oracle.com>
> Cc: Clark Willaims <williams@redhat.com>
> Cc: John Kacur <jkacur@redhat.com>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: linux-doc@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
> 
> ---
>  Documentation/trace/osnoise_tracer.rst |  149 ++
>  include/linux/ftrace_irq.h             |   16 +
>  include/trace/events/osnoise.h         |  141 ++
>  kernel/trace/Kconfig                   |   34 +
>  kernel/trace/Makefile                  |    1 +
>  kernel/trace/trace.h                   |    9 +-
>  kernel/trace/trace_entries.h           |   27 +
>  kernel/trace/trace_osnoise.c           | 1714 ++++++++++++++++++++++++
>  kernel/trace/trace_output.c            |   72 +-
>  9 files changed, 2159 insertions(+), 4 deletions(-)
>  create mode 100644 Documentation/trace/osnoise_tracer.rst
>  create mode 100644 include/trace/events/osnoise.h
>  create mode 100644 kernel/trace/trace_osnoise.c
> 
> diff --git a/Documentation/trace/osnoise_tracer.rst b/Documentation/trace/osnoise_tracer.rst
> new file mode 100644
> index 000000000000..9a97f557317b
> --- /dev/null
> +++ b/Documentation/trace/osnoise_tracer.rst
> @@ -0,0 +1,149 @@
> +==============
> +OSNOISE Tracer
> +==============
> +
> +In the context of high-performance computing (HPC), the Operating System
> +Noise (*osnoise*) refers to the interference experienced by an application
> +due to activities inside the operating system. In the context of Linux,
> +NMIs, IRQs, SoftIRQs, and any other system thread can cause noise to the
> +system. Moreover, hardware-related jobs can also cause noise, for example,
> +via SMIs.
> +
> +``hwlat_detector`` is one of the tools used to identify the most complex
> +source of noise: *hardware noise*.
> +
> +In a nutshell, the ``hwlat_detector`` creates a thread that runs
> +periodically for a given period. At the beginning of a period, the thread
> +disables interrupt and starts sampling. While running, the ``hwlatd``
> +thread reads the time in a loop. As interrupts are disabled, threads,
> +IRQs, and SoftIRQs cannot interfere with the ``hwlatd`` thread. Hence, the
> +cause of any gap between two different reads of the time roots either on
> +NMI or in the hardware itself. At the end of the period, ``hwlatd`` enables
> +interrupts and reports the max observed gap between the reads. It also
> +prints a NMI occurrence counter. If the output does not report NMI
> +executions, the user can conclude that the hardware is the culprit for
> +the latency. The ``hwlat`` detects the NMI execution by observing
> +the entry and exit of a NMI.
> +
> +The ``osnoise`` tracer leverages the ``hwlat_detector`` by running a
> +similar loop with preemption, SoftIRQs and IRQs enabled, thus allowing
> +all the sources of *osnoise* during its execution. Using the same approach
> +of ``hwlat``, ``osnoise`` takes note of the entry and exit point of any
> +source of interferences, increasing a per-cpu interference counter. The
> +``osnoise`` tracer also saves an interference counter for each source of
> +interference. The interference counter for NMI, IRQs, SoftIRQs, and
> +threads is increased anytime the tool observes these interferences' entry
> +events. When a noise happens without any interference from the operating
> +system level, the hardware noise counter increases, pointing to a
> +hardware-related noise. In this way, ``osnoise`` can account for any
> +source of interference. At the end of the period, the ``osnoise`` tracer
> +prints the sum of all noise, the max single noise, the percentage of CPU
> +available for the thread, and the counters for the noise sources.
> +
> +Usage
> +-----
> +
> +Write the ASCII text ``osnoise`` into the ``current_tracer`` file of the
> +tracing system (generally mounted at ``/sys/kernel/tracing`` or
> +``/sys/kernel/debug/tracing``).

I wouldn't even mention the /sys/kernel/debug/tracing path, I'm trying to
deprecate that.

> +
> +For example::
> +
> +        [root@f32 ~]# cd /sys/kernel/tracing/
> +        [root@f32 tracing]# echo osnoise > current_tracer
> +
> +It is possible to follow the trace by reading the ``trace`` trace file::
> +
> +        [root@f32 tracing]# cat trace
> +        # tracer: osnoise
> +        #
> +        #                                _-----=> irqs-off
> +        #                               / _----=> need-resched
> +        #                              | / _---=> hardirq/softirq
> +        #                              || / _--=> preempt-depth                            MAX
> +        #                              || /                                             SINGLE     Interference counters:
> +        #                              ||||               RUNTIME      NOISE   % OF CPU  NOISE    +-----------------------------+
> +        #           TASK-PID      CPU# ||||   TIMESTAMP    IN US       IN US  AVAILABLE  IN US     HW    NMI    IRQ   SIRQ THREAD
> +        #              | |         |   ||||      |           |             |    |            |      |      |      |      |      |
> +                   <...>-859     [000] ....    81.637220: 1000000        190  99.98100       9     18      0   1007     18      1
> +                   <...>-860     [001] ....    81.638154: 1000000        656  99.93440      74     23      0   1006     16      3
> +                   <...>-861     [002] ....    81.638193: 1000000       5675  99.43250     202      6      0   1013     25     21
> +                   <...>-862     [003] ....    81.638242: 1000000        125  99.98750      45      1      0   1011     23      0
> +                   <...>-863     [004] ....    81.638260: 1000000       1721  99.82790     168      7      0   1002     49     41
> +                   <...>-864     [005] ....    81.638286: 1000000        263  99.97370      57      6      0   1006     26      2
> +                   <...>-865     [006] ....    81.638302: 1000000        109  99.98910      21      3      0   1006     18      1
> +                   <...>-866     [007] ....    81.638326: 1000000       7816  99.21840     107      8      0   1016     39     19
> +
> +In addition to the regular trace fields (from TASK-PID to TIMESTAMP), the
> +tracer prints a message at the end of each period for each CPU that is
> +running an ``osnoise/`` thread. The osnoise specific fields report:
> +
> + - The ``RUNTIME IN USE`` reports the amount of time in microseconds that
> +   the ``osnoise`` thread kept looping reading the time.
> + - The ``NOISE IN US`` reports the sum of noise in microseconds observed
> +   by the osnoise tracer during the associated runtime.
> + - The ``% OF CPU AVAILABLE`` reports the percentage of CPU available for
> +   the ``osnoise`` thread during the ``runtime`` window.
> + - The ``MAX SINGLE NOISE IN US`` reports the maximum single noise observed
> +   during the ``runtime`` window.
> + - The ``Interference counters`` display how many each of the respective
> +   interference happened during the ``runtime`` window.
> +
> +Note that the example above shows a high number of ``HW noise`` samples.
> +The reason being is that this sample was taken on a virtual machine,
> +and the host interference is detected as a hardware interference.
> +
> +Tracer options
> +---------------------
> +
> +The tracer has a set of options inside the ``osnoise`` directory, they are:
> +
> + - ``cpus``: CPUs at which a ``osnoise`` thread will execute.
> + - ``period_us``: the period of the ``osnoise`` thread.
> + - ``runtime_us``: how long an ``osnoise`` thread will look for noise.
> + - ``stop_tracing_single_us``: stop the system tracing of a single noise
> +   higher than the configured value is happens. Writing ``0`` disables this
> +   option.
> + - ``stop_tracing_total_us``: stop the system tracing of a ``NOISE IN USE``
> +   higher than the configured value is happens. Writing ``0`` disables this
> +   option.
> + - ``tolerance_ns``: the minimum delta between two time() reads to be
> +   considered as noise.
> +
> +Additional Tracing
> +------------------
> +
> +In addition to the tracer, a set of ``tracepoints`` were added to
> +facilitate the identification of the osnoise source.
> +
> + - ``osnoise:sample_threshold``: printed anytime a noise is higher than
> +   the configurable ``tolerance_ns``.
> + - ``osnoise:nmi_noise``: noise from NMI, including the duration.
> + - ``osnoise:irq_noise``: noise from an IRQ, including the duration.
> + - ``osnoise:softirq_noise``: noise from a SoftIRQ, including the
> +   duration.
> + - ``osnoise:thread_noise``: noise from a thread, including the duration.
> +
> +Note that a all the values are *net values*. This means that a *thread*

   "a all"?

> +duration will not contain the duration of the *IRQs* that happened during
> +its execution, for example. The same is valid for all duration values.

The above is hard to understand. Do you mean individual instances of noise
are not recorded, and only the sum is?

> +
> +Here is one example of the usage of these ``tracepoints``::
> +
> +       osnoise/8-961     [008] d.h.  5789.857532: irq_noise: local_timer:236 start 5789.857529929 duration 1845 ns
> +       osnoise/8-961     [008] dNh.  5789.858408: irq_noise: local_timer:236 start 5789.858404871 duration 2848 ns
> +     migration/8-54      [008] d...  5789.858413: thread_noise: migration/8:54 start 5789.858409300 duration 3068 ns
> +       osnoise/8-961     [008] ....  5789.858413: sample_threshold: start 5789.858404555 duration 8 us interferences 2
> +
> +In this example, a noise sample of 8 microseconds was reported in the last
> +fine, pointing to two interferences. Looking backward in the trace, the

  "fine"?

> +two previous entries were about the ``migration`` thread running after
> +a timer IRQ execution. The first event is not part of the noise because
> +it took place one millisecond before.
> +
> +It is worth noticing that the sum of the duration reported in the
> +``tracepoints`` is smaller than eight us reported in the
> +``sample_threshold``. The reason roots in the tracing overhead and in
> +the overhead of the entry and exit code that happens before and after
> +any interference execution. This justifies the dual approach: measuring
> +thread and tracing.
> diff --git a/include/linux/ftrace_irq.h b/include/linux/ftrace_irq.h
> index 0abd9a1d2852..fd54045980ce 100644
> --- a/include/linux/ftrace_irq.h
> +++ b/include/linux/ftrace_irq.h
> @@ -7,12 +7,24 @@ extern bool trace_hwlat_callback_enabled;
>  extern void trace_hwlat_callback(bool enter);
>  #endif
>  
> +/*
> + * XXX: Make it generic

Yes, this should be the same for both the hwlat detector and for
osnoise.

> + */
> +#ifdef CONFIG_OSNOISE_TRACER
> +extern bool trace_osnoise_callback_enabled;
> +extern void trace_osnoise_callback(bool enter);
> +#endif
> +
>  static inline void ftrace_nmi_enter(void)
>  {
>  #ifdef CONFIG_HWLAT_TRACER
>  	if (trace_hwlat_callback_enabled)
>  		trace_hwlat_callback(true);
>  #endif
> +#ifdef CONFIG_OSNOISE_TRACER
> +	if (trace_osnoise_callback_enabled)
> +		trace_osnoise_callback(true);
> +#endif
>  }
>  
>  static inline void ftrace_nmi_exit(void)
> @@ -21,6 +33,10 @@ static inline void ftrace_nmi_exit(void)
>  	if (trace_hwlat_callback_enabled)
>  		trace_hwlat_callback(false);
>  #endif
> +#ifdef CONFIG_OSNOISE_TRACER
> +	if (trace_osnoise_callback_enabled)
> +		trace_osnoise_callback(false);
> +#endif
>  }
>  
>  #endif /* _LINUX_FTRACE_IRQ_H */
> diff --git a/include/trace/events/osnoise.h b/include/trace/events/osnoise.h
> new file mode 100644
> index 000000000000..81939234814b
> --- /dev/null
> +++ b/include/trace/events/osnoise.h
> @@ -0,0 +1,141 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM osnoise
> +
> +#if !defined(_OSNOISE_TRACE_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _OSNOISE_TRACE_H
> +
> +#include <linux/tracepoint.h>
> +TRACE_EVENT(thread_noise,
> +
> +	TP_PROTO(struct task_struct *t, u64 start, u64 duration),
> +
> +	TP_ARGS(t, start, duration),
> +
> +	TP_STRUCT__entry(
> +		__array(	char,		comm,	TASK_COMM_LEN)
> +		__field(	pid_t,		pid	)

I would place the start and duration first. As pid is 4 bytes, you have a 4
byte "hole" in the structure:

system: osnoise
name: thread_noise
ID: 442
format:
	field:unsigned short common_type;	offset:0;	size:2;	signed:0;
	field:unsigned char common_flags;	offset:2;	size:1;	signed:0;
	field:unsigned char common_preempt_count;	offset:3;	size:1;	signed:0;
	field:int common_pid;	offset:4;	size:4;	signed:1;

	field:char comm[16];	offset:8;	size:16;	signed:1;
	field:pid_t pid;	offset:24;	size:4;	signed:1;

[ 4 bytes of nothing here ]

	field:u64 start;	offset:32;	size:8;	signed:0;
	field:u64 duration;	offset:40;	size:8;	signed:0;
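
The hole can be reproduced in userspace with offsetof(). This is only a
sketch: the two structs below are hypothetical and mirror just the
event-specific fields (the 8-byte common header is left out, which does
not change the alignment), and the result assumes an ABI where a 64-bit
integer is 8-byte aligned, such as x86-64:

```c
#include <stddef.h>
#include <stdint.h>

#define TASK_COMM_LEN 16

/* Field order as in the patch: comm, pid, start, duration. */
struct thread_noise_orig {
	char comm[TASK_COMM_LEN];
	int32_t pid;		/* 4 bytes, followed by 4 bytes of padding */
	uint64_t start;
	uint64_t duration;
};

/* 64-bit fields first, so pid no longer leaves a gap in the middle. */
struct thread_noise_reordered {
	uint64_t start;
	uint64_t duration;
	char comm[TASK_COMM_LEN];
	int32_t pid;
};

/* Bytes of padding between pid and the field that follows it. */
size_t hole_after_pid_orig(void)
{
	return offsetof(struct thread_noise_orig, start) -
	       (offsetof(struct thread_noise_orig, pid) + sizeof(int32_t));
}
```

Note that in plain C sizeof() is the same for both structs because the
compiler adds tail padding to the reordered one; the sketch only shows
where the fields land relative to each other.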


> +		__field(	u64,		start	)
> +		__field(	u64,		duration)
> +	),
> +
> +	TP_fast_assign(
> +		memcpy(__entry->comm, t->comm, TASK_COMM_LEN);
> +		__entry->pid = t->pid;
> +		__entry->start = start;
> +		__entry->duration = duration;
> +	),
> +
> +	TP_printk("%8s:%d start %llu.%09u duration %llu ns",
> +		__entry->comm,
> +		__entry->pid,
> +		__print_ns_to_secs(__entry->start),
> +		__print_ns_without_secs(__entry->start),
> +		__entry->duration)
> +);
> +
> +TRACE_EVENT(softirq_noise,
> +
> +	TP_PROTO(int vector, u64 start, u64 duration),
> +
> +	TP_ARGS(vector, start, duration),
> +
> +	TP_STRUCT__entry(
> +		__field(	int,		vector	)
> +		__field(	u64,		start	)
> +		__field(	u64,		duration)

Same here.

name: softirq_noise
ID: 441
format:
	field:unsigned short common_type;	offset:0;	size:2;	signed:0;
	field:unsigned char common_flags;	offset:2;	size:1;	signed:0;
	field:unsigned char common_preempt_count;	offset:3;	size:1;	signed:0;
	field:int common_pid;	offset:4;	size:4;	signed:1;

	field:int vector;	offset:8;	size:4;	signed:1;

[ 4 bytes of nothing here]

	field:u64 start;	offset:16;	size:8;	signed:0;
	field:u64 duration;	offset:24;	size:8;	signed:0;

> +	),
> +
> +	TP_fast_assign(
> +		__entry->vector = vector;
> +		__entry->start = start;
> +		__entry->duration = duration;
> +	),
> +
> +	TP_printk("%8s:%d start %llu.%09u duration %llu ns",
> +		show_softirq_name(__entry->vector),
> +		__entry->vector,
> +		__print_ns_to_secs(__entry->start),
> +		__print_ns_without_secs(__entry->start),
> +		__entry->duration)
> +);
> +
> +TRACE_EVENT(irq_noise,
> +
> +	TP_PROTO(int vector, const char *desc, u64 start, u64 duration),
> +
> +	TP_ARGS(vector, desc, start, duration),
> +
> +	TP_STRUCT__entry(
> +		__string(	desc,		desc    )
> +		__field(	int,		vector	)

This doesn't have a hole, but I think it should still switch to be
consistent.


> +		__field(	u64,		start	)
> +		__field(	u64,		duration)
> +	),
> +
> +	TP_fast_assign(
> +		__assign_str(desc, desc);
> +		__entry->vector = vector;
> +		__entry->start = start;
> +		__entry->duration = duration;
> +	),
> +
> +	TP_printk("%s:%d start %llu.%09u duration %llu ns",
> +		__get_str(desc),
> +		__entry->vector,
> +		__print_ns_to_secs(__entry->start),
> +		__print_ns_without_secs(__entry->start),
> +		__entry->duration)
> +);
> +
> +TRACE_EVENT(nmi_noise,
> +
> +	TP_PROTO(u64 start, u64 duration),
> +
> +	TP_ARGS(start, duration),
> +
> +	TP_STRUCT__entry(
> +		__field(	u64,		start	)
> +		__field(	u64,		duration)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->start = start;
> +		__entry->duration = duration;
> +	),
> +
> +	TP_printk("start %llu.%09u duration %llu ns",
> +		__print_ns_to_secs(__entry->start),
> +		__print_ns_without_secs(__entry->start),
> +		__entry->duration)
> +);
> +
> +TRACE_EVENT(sample_threshold,
> +
> +	TP_PROTO(u64 start, u64 duration, u64 interference),
> +
> +	TP_ARGS(start, duration, interference),
> +
> +	TP_STRUCT__entry(
> +		__field(	u64,		start	)
> +		__field(	u64,		duration)
> +		__field(	u64,		interference)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->start = start;
> +		__entry->duration = duration;
> +		__entry->interference = interference;
> +	),
> +
> +	TP_printk("start %llu.%09u duration %llu us interferences %llu",
> +		__print_ns_to_secs(__entry->start),
> +		__print_ns_without_secs(__entry->start),
> +		__entry->duration,
> +		__entry->interference)
> +);
> +
> +#endif /* _TRACE_OSNOISE_H */
> +


[..]

> +static void osnoise_tracer_start(struct trace_array *tr)
> +{
> +	int retval;
> +
> +	/* Only allow one instance to enable this */
> +	if (osnoise_busy)
> +		return;

I found that I couldn't start this with:

	trace-cmd start -B foo -p osnoise

> +
> +	/*
> +	 * Trace is already hooked, we are re-enabling from
> +	 * a stop_tracing_*.
> +	 */
> +	if (trace_osnoise_callback_enabled)
> +		return;
> +
> +	osn_var_reset_all();
> +
> +	retval = hook_irq_events();
> +	if (retval)
> +		goto err;
> +
> +	retval = hook_softirq_events();
> +	if (retval)
> +		goto out_unhook_irq;
> +
> +	retval = hook_thread_events();
> +
> +	if (retval)
> +		goto out_unrook_softirq;
> +
> +	/*
> +	 * Make sure NMIs see reseted values.
> +	 */
> +	barrier();
> +	trace_osnoise_callback_enabled = true;
> +
> +	retval = start_per_cpu_kthreads(tr);
> +	/*
> +	 * all fine!
> +	 */
> +	if (!retval)
> +		return;
> +
> +	unhook_thread_events();
> +out_unrook_softirq:
> +	unhook_softirq_events();
> +out_unhook_irq:
> +	unhook_irq_events();
> +err:
> +	pr_err(BANNER "Error starting osnoise tracer\n");
> +}
> +
> +static void osnoise_tracer_stop(struct trace_array *tr)
> +{
> +	/* Only allow one instance to enable this */
> +	if (!osnoise_busy)
> +		return;
> +
> +	trace_osnoise_callback_enabled = false;
> +	barrier();
> +
> +	stop_per_cpu_kthreads();
> +
> +	unhook_irq_events();
> +	unhook_softirq_events();
> +	unhook_thread_events();
> +}
> +
> +static int osnoise_tracer_init(struct trace_array *tr)
> +{
> +	/* Only allow one instance to enable this */
> +	if (osnoise_busy)
> +		return -EBUSY;
> +
> +	osnoise_trace = tr;
> +
> +	tr->max_latency = 0;
> +
> +	if (tracer_tracing_is_on(tr))
> +		osnoise_tracer_start(tr);

That's because trace-cmd will disable tracing when it enables a tracer. And
the above osnoise_tracer_start() is not called.

> +
> +	osnoise_busy = true;

Once this is set, when we enable tracing, osnoise_tracer_start() returns
early and the tracer won't start.
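
The sequence can be modelled in a few lines of userspace C. This is a
simplified sketch, not the kernel code: the struct and the model_*()
functions are invented here, the -EBUSY check in init is left out, and
they only mirror the osnoise_busy handling in the patch:

```c
#include <stdbool.h>

struct osnoise_model {
	bool busy;		/* mirrors osnoise_busy */
	bool tracing_on;	/* mirrors tracer_tracing_is_on(tr) */
	bool started;		/* true once the sampling threads would run */
};

/* Mirrors osnoise_tracer_start(): bails out when busy is already set. */
void model_start(struct osnoise_model *m)
{
	if (m->busy)
		return;
	m->started = true;
}

/* Mirrors osnoise_tracer_init(): start only if tracing is on, then set busy. */
void model_init(struct osnoise_model *m)
{
	if (m->tracing_on)
		model_start(m);
	m->busy = true;
}

/* Mirrors the user turning tracing on, which calls the start callback. */
void model_tracing_on(struct osnoise_model *m)
{
	m->tracing_on = true;
	model_start(m);
}
```

Calling model_init() with tracing off and then model_tracing_on() leaves
"started" false, which matches what happens with trace-cmd.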

-- Steve


> +
> +
> +	return 0;
> +}
> +
> +static void osnoise_tracer_reset(struct trace_array *tr)
> +{
> +	osnoise_tracer_stop(tr);
> +
> +	osnoise_busy = false;
> +}
> +
> +static struct tracer osnoise_tracer __read_mostly = {
> +	.name		= "osnoise",
> +	.init		= osnoise_tracer_init,
> +	.reset		= osnoise_tracer_reset,
> +	.start		= osnoise_tracer_start,
> +	.stop		= osnoise_tracer_stop,
> +	.print_header	= print_osnoise_headers,
> +	.allow_instances = true,
> +};
> +
> +__init static int init_osnoise_tracer(void)
> +{
> +	int ret;
> +
> +	mutex_init(&osnoise_data.lock);
> +
> +	ret = register_tracer(&osnoise_tracer);
> +	if (ret)
> +		return ret;
> +
> +	cpumask_copy(&osnoise_cpumask, cpu_all_mask);
> +
> +	init_tracefs();
> +
> +	return 0;
> +}
> +late_initcall(init_osnoise_tracer);
> diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
> index 61255bad7e01..edeb127fcdea 100644
> --- a/kernel/trace/trace_output.c
> +++ b/kernel/trace/trace_output.c
> @@ -1189,7 +1189,6 @@ trace_hwlat_print(struct trace_iterator *iter, int flags,
>  	return trace_handle_return(s);
>  }
>  
> -
>  static enum print_line_t
>  trace_hwlat_raw(struct trace_iterator *iter, int flags,
>  		struct trace_event *event)
> @@ -1219,6 +1218,76 @@ static struct trace_event trace_hwlat_event = {
>  	.funcs		= &trace_hwlat_funcs,
>  };
>  
> +/* TRACE_OSNOISE */
> +static enum print_line_t
> +trace_osnoise_print(struct trace_iterator *iter, int flags,
> +		    struct trace_event *event)
> +{
> +	struct trace_entry *entry = iter->ent;
> +	struct trace_seq *s = &iter->seq;
> +	struct osnoise_entry *field;
> +	u64 ratio, ratio_dec;
> +	u64 net_runtime;
> +
> +	trace_assign_type(field, entry);
> +
> +	/*
> +	 * compute the available % of cpu time.
> +	 */
> +	net_runtime = field->runtime - field->noise;
> +	ratio = net_runtime * 10000000;
> +	do_div(ratio, field->runtime);
> +	ratio_dec = do_div(ratio, 100000);
> +
> +	trace_seq_printf(s, "%llu %10llu %3llu.%05llu %7llu",
> +			 field->runtime,
> +			 field->noise,
> +			 ratio, ratio_dec,
> +			 field->max_sample);
> +
> +	trace_seq_printf(s, " %6u", field->hw_count);
> +	trace_seq_printf(s, " %6u", field->nmi_count);
> +	trace_seq_printf(s, " %6u", field->irq_count);
> +	trace_seq_printf(s, " %6u", field->softirq_count);
> +	trace_seq_printf(s, " %6u", field->thread_count);
> +
> +	trace_seq_putc(s, '\n');
> +
> +	return trace_handle_return(s);
> +}
> +
> +static enum print_line_t
> +trace_osnoise_raw(struct trace_iterator *iter, int flags,
> +		  struct trace_event *event)
> +{
> +	struct osnoise_entry *field;
> +	struct trace_seq *s = &iter->seq;
> +
> +	trace_assign_type(field, iter->ent);
> +
> +	trace_seq_printf(s, "%lld %llu %llu %u %u %u %u %u\n",
> +			 field->runtime,
> +			 field->noise,
> +			 field->max_sample,
> +			 field->hw_count,
> +			 field->nmi_count,
> +			 field->irq_count,
> +			 field->softirq_count,
> +			 field->thread_count);
> +
> +	return trace_handle_return(s);
> +}
> +
> +static struct trace_event_functions trace_osnoise_funcs = {
> +	.trace		= trace_osnoise_print,
> +	.raw		= trace_osnoise_raw,
> +};
> +
> +static struct trace_event trace_osnoise_event = {
> +	.type		= TRACE_OSNOISE,
> +	.funcs		= &trace_osnoise_funcs,
> +};
> +
>  /* TRACE_BPUTS */
>  static enum print_line_t
>  trace_bputs_print(struct trace_iterator *iter, int flags,
> @@ -1384,6 +1453,7 @@ static struct trace_event *events[] __initdata = {
>  	&trace_bprint_event,
>  	&trace_print_event,
>  	&trace_hwlat_event,
> +	&trace_osnoise_event,
>  	&trace_raw_data_event,
>  	NULL
>  };
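
For reference, a minimal userspace sketch of the percentage arithmetic used in trace_osnoise_print() above. It assumes plain 64-bit division and modulo in place of do_div(), and the helper name split_cpu_available() is hypothetical (it is not part of the patch):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Userspace stand-in for the do_div() sequence in trace_osnoise_print().
 * split_cpu_available() is a hypothetical name used only for illustration.
 */
void split_cpu_available(uint64_t runtime, uint64_t noise,
			 uint64_t *ratio, uint64_t *ratio_dec)
{
	uint64_t scaled;

	/* A zero runtime would divide by zero; noise cannot exceed runtime. */
	assert(runtime != 0 && noise <= runtime);

	/* Scale by 100 (percent) times 100000 (five decimal digits). */
	scaled = (runtime - noise) * UINT64_C(10000000) / runtime;

	*ratio = scaled / 100000;	/* integer part of the percentage */
	*ratio_dec = scaled % 100000;	/* five-digit fractional part */
}
```

With runtime = 1000000 and noise = 90, the two parts are 99 and 99100, which the "%3llu.%05llu" format prints as 99.99100.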

	


* Re: [RFC PATCH 1/5] tracing/hwlat: Add a cpus file specific for hwlat_detector
  2021-04-14 14:10   ` Steven Rostedt
@ 2021-04-15 13:09     ` Daniel Bristot de Oliveira
  2021-04-15 13:49       ` Steven Rostedt
  0 siblings, 1 reply; 25+ messages in thread
From: Daniel Bristot de Oliveira @ 2021-04-15 13:09 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, kcarcia, Jonathan Corbet, Ingo Molnar,
	Peter Zijlstra, Thomas Gleixner, Alexandre Chartre,
	Clark Willaims, John Kacur, Juri Lelli, linux-doc

On 4/14/21 4:10 PM, Steven Rostedt wrote:
> On Thu,  8 Apr 2021 16:13:19 +0200
> Daniel Bristot de Oliveira <bristot@redhat.com> wrote:
> 
>> Provides a "cpus" interface to the hardware latency detector. By
>> default, it lists all CPUs, allowing hwlatd threads to run on any online
>> CPU of the system.
>>
>> It serves to restrict the execution of hwlatd to the set of CPUs written
>> via this interface. Note that hwlatd also respects the "tracing_cpumask."
>> Hence, hwlatd threads will run only on the set of CPUs allowed here AND
>> on "tracing_cpumask."
>>
>> Why not keep just "tracing_cpumask"? Because the user might be interested
>> in tracing what is running on other CPUs. For instance, one might run
>> hwlatd in one HT CPU while observing what is running on the sibling HT
>> CPU. The cpu list format is also more intuitive.
>>
>> Also in preparation to the per-cpu mode.
> 
> OK, I'm still not convinced that you couldn't use tracing_cpumask here.
> Because we have instances, and tracing_cpumask is defined per instance, you
> could simply do:
> 
>  # cd /sys/kernel/tracing
>  # mkdir instances/hwlat
>  # echo a > instances/hwlat/tracing_cpumask
>  # echo hwlat > instances/hwlat/current_tracer
> 
> Now the tracing_cpumask above only affects the hwlat tracer.
> 
> I'm just reluctant to add more tracing files if the current ones can be
> used without too much trouble. For being intuitive, let's make user space
> tools hide the nastiness of the kernel interface ;-)

[discussing the cpus file in both hwlat and osnoise here...]

I see your point, but having two different instances gives you two
different output "trace" files... and it is not always practical to
merge them when using only the tracefs interface (I like to use it, and
it is very handy when dealing with immutable systems, at customers...).

Thinking aloud, one might say: sort the two trace files by timestamp...

and another might reply: but some lines do not have a timestamp associated,
e.g., the stacktrace.

Anyway, the cpus file on hwlat is not a super essential thing, I agree...
interrupts are disabled, so not much could go wrong (although I really
needed the trace from a sibling cpu in a real case).
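
As a side note on the two formats discussed here, this is a minimal userspace sketch of how a CPU list (as accepted by the cpus file) relates to the hex mask written to tracing_cpumask, and of the AND between the two sets. The helpers cpulist_to_mask() and effective_mask() are hypothetical names, limited to 64 CPUs, and are not the kernel's parsers:

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Convert a CPU list such as "1,3" or "0-2,7" into a bitmask.
 * Illustration only: expects a bare list with no trailing newline and
 * returns 0 on malformed input or on a CPU number above 63.
 */
uint64_t cpulist_to_mask(const char *list)
{
	uint64_t mask = 0;
	const char *p = list;

	while (*p) {
		char *end;
		unsigned long first, last;

		first = strtoul(p, &end, 10);
		if (end == p)
			return 0;
		last = first;
		p = end;

		/* Optional "first-last" range. */
		if (*p == '-') {
			p++;
			last = strtoul(p, &end, 10);
			if (end == p)
				return 0;
			p = end;
		}

		if (first > last || last >= 64)
			return 0;

		for (unsigned long cpu = first; cpu <= last; cpu++)
			mask |= UINT64_C(1) << cpu;

		if (*p == ',')
			p++;
		else if (*p != '\0')
			return 0;
	}

	return mask;
}

/* CPUs where a sampling thread may run: allowed by both sets. */
uint64_t effective_mask(uint64_t cpus, uint64_t tracing_cpumask)
{
	return cpus & tracing_cpumask;
}
```

In this sketch the list "1,3" yields the same mask as the hex value "a" used in the tracing_cpumask example above.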

But for the osnoise tracer the cpus file is really useful. For instance, on a
system with CPU 7 isolated:

----- %< -----
 # echo 7 > osnoise/cpus
 # echo target_cpu == 7 > events/sched/sched_wakeup/filter 
 # echo stacktrace if target_cpu == 7 > events/sched/sched_wakeup/trigger
 # echo 1 > events/sched/sched_wakeup/enable
 # echo osnoise:thread_noise > set_event 
 # echo osnoise > current_tracer
 # cat trace 
    [find...]
     kworker/0:1-7       [000] d..5  1820.717780: <stack trace>
 => trace_event_raw_event_sched_wakeup_template
 => __traceiter_sched_wakeup
 => ttwu_do_wakeup
 => try_to_wake_up
 => __queue_work
 => queue_delayed_work_on
 => vmstat_shepherd
 => process_one_work
 => worker_thread
 => kthread
 => ret_from_fork
     kworker/7:1-410     [007] d..3  1820.717790: thread_noise: kworker/7:1:410 start 1820.717786519 duration 3626 ns
       osnoise/7-1000    [007] ....  1821.582340: 1000000         90  99.99100      15      1      0     12      6      1
----- >% -----

It was possible to easily find that the '1' thread noise was a kworker,
dispatched from CPU 0, and that it was dispatched by "vmstat_shepherd".

Also, the osnoise dir is not added to a new instance... so, it only
costs "one" file...

> -- Steve
> 



* Re: [RFC PATCH 2/5] tracing/hwlat: Implement the mode config option
  2021-04-14 14:30   ` Steven Rostedt
@ 2021-04-15 13:16     ` Daniel Bristot de Oliveira
  2021-04-15 13:50       ` Steven Rostedt
  0 siblings, 1 reply; 25+ messages in thread
From: Daniel Bristot de Oliveira @ 2021-04-15 13:16 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, kcarcia, Jonathan Corbet, Ingo Molnar,
	Peter Zijlstra, Thomas Gleixner, Alexandre Chartre,
	Clark Willaims, John Kacur, Juri Lelli, linux-doc

On 4/14/21 4:30 PM, Steven Rostedt wrote:
> On Thu,  8 Apr 2021 16:13:20 +0200
> Daniel Bristot de Oliveira <bristot@redhat.com> wrote:
> 
>> +/**
>> + * hwlat_mode_write - Write function for "mode" entry
>> + * @filp: The active open file structure
>> + * @ubuf: The user buffer that contains the value to write
>> + * @cnt: The maximum number of bytes to write to "file"
>> + * @ppos: The current position in @file
>> + *
>> + * This function provides a write implementation for the "mode" interface
>> + * to the hardware latency detector. hwlatd has different operation modes.
>> + * The "none" sets the allowed cpumask for a single hwlatd thread at the
>> + * startup and lets the scheduler handle the migration. The default mode is
>> + * the "round-robin" one, in which a single hwlatd thread runs, migrating
>> + * among the allowed CPUs in a round-robin fashion.
>> + */
>> +static ssize_t hwlat_mode_write(struct file *filp, const char __user *ubuf,
>> +				 size_t cnt, loff_t *ppos)
>> +{
>> +	const char *mode;
>> +	char buf[64];
>> +	int ret;
>> +	int i;
>> +
>> +	if (hwlat_busy)
>> +		return -EBUSY;
> 
> So we can't switch modes while running?
> 

As you mentioned in patch 3/5, this limitation was added because
of the running threads. But, yes, stopping and starting the tracer to
re-create the threads should work as well. I will try it for the next round.

> Also, with this implemented, you can remove the disable_migrate variable,
> and just switch the mode to NONE when it's detected that the affinity mask
> of the thread has been changed.
> 

That was my initial intention with the NONE mode, but I feared breaking
something by removing the "migrate_disable" logic. If you do not think it is
a problem, I will remove the migrate disable and just change the mode.

-- Daniel
> -- Steve
> 
> 
>> +
>> +	if (cnt >= sizeof(buf))
>> +		return -EINVAL;
>> +
>> +	if (copy_from_user(buf, ubuf, cnt))
>> +		return -EFAULT;
>> +
>> +	buf[cnt] = 0;
>> +
>> +	mode = strstrip(buf);
>> +
>> +	ret = -EINVAL;
>> +
>> +	for (i = 0; i < MODE_MAX; i++) {
>> +		if (strcmp(mode, thread_mode_str[i]) == 0) {
>> +			hwlat_data.thread_mode = i;
>> +			ret = cnt;
>> +		}
>> +	}
>> +
>> +	*ppos += cnt;
>> +
>> +	return ret;
>> +}
>> +
>> +
> 
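
For illustration, a minimal userspace sketch of the mode lookup done by hwlat_mode_write(), written so that an unknown mode string yields -EINVAL. The function name parse_thread_mode() is hypothetical, the whitespace trimming only approximates the kernel's strstrip(), and the mode table assumes the three modes that exist after patch 3/5:

```c
#include <ctype.h>
#include <errno.h>
#include <string.h>

enum { MODE_NONE = 0, MODE_ROUND_ROBIN, MODE_PER_CPU, MODE_MAX };

static const char *thread_mode_str[MODE_MAX] = {
	"none", "round-robin", "per-cpu"
};

/*
 * Return the index of the mode named by 'input', or -EINVAL when the
 * input matches no known mode or does not fit in the local buffer.
 */
int parse_thread_mode(const char *input)
{
	char buf[64];
	char *mode = buf;
	size_t len = strlen(input);
	int i;

	if (len >= sizeof(buf))
		return -EINVAL;
	memcpy(buf, input, len + 1);

	/* Drop trailing whitespace (e.g. the newline added by echo). */
	while (len > 0 && isspace((unsigned char)buf[len - 1]))
		buf[--len] = '\0';

	/* Skip leading whitespace. */
	while (isspace((unsigned char)*mode))
		mode++;

	for (i = 0; i < MODE_MAX; i++) {
		if (strcmp(mode, thread_mode_str[i]) == 0)
			return i;
	}

	return -EINVAL;
}
```

Returning from inside the loop on the first match keeps the "no match" path as the only way to reach -EINVAL.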



* Re: [RFC PATCH 3/5] tracing/hwlat: Implement the per-cpu mode
  2021-04-14 14:41   ` Steven Rostedt
@ 2021-04-15 13:22     ` Daniel Bristot de Oliveira
  2021-04-15 15:22       ` Steven Rostedt
  0 siblings, 1 reply; 25+ messages in thread
From: Daniel Bristot de Oliveira @ 2021-04-15 13:22 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, kcarcia, Jonathan Corbet, Ingo Molnar,
	Peter Zijlstra, Thomas Gleixner, Alexandre Chartre,
	Clark Willaims, John Kacur, Juri Lelli, linux-doc

On 4/14/21 4:41 PM, Steven Rostedt wrote:
> On Thu,  8 Apr 2021 16:13:21 +0200
> Daniel Bristot de Oliveira <bristot@redhat.com> wrote:
> 
>> Implements the per-cpu mode in which a sampling thread is created for
>> each cpu in the "cpus" (and tracing_mask).
>>
>> The per-cpu mode has the potential to speed up the hwlat detection by
>> running on multiple CPUs at the same time.
> 
> And totally slow down the entire system in the process ;-)

Too :-) But this is not the default config... So it should be an intentional change.

>>
>> Cc: Jonathan Corbet <corbet@lwn.net>
>> Cc: Steven Rostedt <rostedt@goodmis.org>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Alexandre Chartre <alexandre.chartre@oracle.com>
>> Cc: Clark Willaims <williams@redhat.com>
>> Cc: John Kacur <jkacur@redhat.com>
>> Cc: Juri Lelli <juri.lelli@redhat.com>
>> Cc: linux-doc@vger.kernel.org
>> Cc: linux-kernel@vger.kernel.org
>> Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
>>
>> ---
>>  Documentation/trace/hwlat_detector.rst |   6 +-
>>  kernel/trace/trace_hwlat.c             | 171 +++++++++++++++++++------
>>  2 files changed, 137 insertions(+), 40 deletions(-)
>>
>> diff --git a/Documentation/trace/hwlat_detector.rst b/Documentation/trace/hwlat_detector.rst
>> index f63fdd867598..7a6fab105b29 100644
>> --- a/Documentation/trace/hwlat_detector.rst
>> +++ b/Documentation/trace/hwlat_detector.rst
>> @@ -85,10 +85,12 @@ the available options are:
>>  
>>   - none:        do not force migration
>>   - round-robin: migrate across each CPU specified in cpus between each window
>> + - per-cpu:     create a per-cpu thread for each cpu in cpus
>>  
>>  By default, hwlat detector will also obey the tracing_cpumask, so the thread
>>  will be placed only in the set of cpus that is both on the hwlat detector's
>>  cpus and in the global tracing_cpumask file. The user can overwrite the
>>  cpumask by setting it manually. Changing the hwlatd affinity externally,
>> -e.g., via taskset tool, will disable the round-robin migration.
>> -
>> +e.g., via taskset tool, will disable the round-robin migration. In the
>> +per-cpu mode, the per-cpu thread (hwlatd/CPU) will be pinned to its relative
>> +cpu, and its affinity cannot be changed.
>> diff --git a/kernel/trace/trace_hwlat.c b/kernel/trace/trace_hwlat.c
>> index 3818200c9e24..52968ea312df 100644
>> --- a/kernel/trace/trace_hwlat.c
>> +++ b/kernel/trace/trace_hwlat.c
>> @@ -34,7 +34,7 @@
>>   * Copyright (C) 2008-2009 Jon Masters, Red Hat, Inc. <jcm@redhat.com>
>>   * Copyright (C) 2013-2016 Steven Rostedt, Red Hat, Inc. <srostedt@redhat.com>
>>   *
>> - * Includes useful feedback from Clark Williams <clark@redhat.com>
>> + * Includes useful feedback from Clark Williams <williams@redhat.com>
> 
> Interesting update ;-)

Should I make it a separate patch? :-)

>>   *
>>   */
>>  #include <linux/kthread.h>
>> @@ -54,9 +54,6 @@ static struct trace_array	*hwlat_trace;
>>  #define DEFAULT_SAMPLE_WIDTH	500000			/* 0.5s */
>>  #define DEFAULT_LAT_THRESHOLD	10			/* 10us */
>>  
>> -/* sampling thread*/
>> -static struct task_struct *hwlat_kthread;
>> -
>>  static struct dentry *hwlat_sample_width;	/* sample width us */
>>  static struct dentry *hwlat_sample_window;	/* sample window us */
>>  static struct dentry *hwlat_cpumask_dentry;	/* hwlat cpus allowed */
>> @@ -65,19 +62,27 @@ static struct dentry *hwlat_thread_mode;	/* hwlat thread mode */
>>  enum {
>>  	MODE_NONE = 0,
>>  	MODE_ROUND_ROBIN,
>> +	MODE_PER_CPU,
>>  	MODE_MAX
>>  };
>>  
>> -static char *thread_mode_str[] = { "none", "round-robin" };
>> +static char *thread_mode_str[] = { "none", "round-robin", "per-cpu" };
>>  
>>  /* Save the previous tracing_thresh value */
>>  static unsigned long save_tracing_thresh;
>>  
>> -/* NMI timestamp counters */
>> -static u64 nmi_ts_start;
>> -static u64 nmi_total_ts;
>> -static int nmi_count;
>> -static int nmi_cpu;
>> +/* runtime kthread data */
>> +struct hwlat_kthread_data {
>> +	struct task_struct *kthread;
>> +	/* NMI timestamp counters */
>> +	u64 nmi_ts_start;
>> +	u64 nmi_total_ts;
>> +	int nmi_count;
>> +	int nmi_cpu;
>> +};
>> +
>> +struct hwlat_kthread_data hwlat_single_cpu_data;
>> +DEFINE_PER_CPU(struct hwlat_kthread_data, hwlat_per_cpu_data);
>>  
>>  /* Tells NMIs to call back to the hwlat tracer to record timestamps */
>>  bool trace_hwlat_callback_enabled;
>> @@ -114,6 +119,14 @@ static struct hwlat_data {
>>  	.thread_mode		= MODE_ROUND_ROBIN
>>  };
>>  
>> +struct hwlat_kthread_data *get_cpu_data(void)
>> +{
>> +	if (hwlat_data.thread_mode == MODE_PER_CPU)
>> +		return this_cpu_ptr(&hwlat_per_cpu_data);
>> +	else
>> +		return &hwlat_single_cpu_data;
>> +}
>> +
>>  static bool hwlat_busy;
>>  
>>  static void trace_hwlat_sample(struct hwlat_sample *sample)
>> @@ -151,7 +164,9 @@ static void trace_hwlat_sample(struct hwlat_sample *sample)
>>  
>>  void trace_hwlat_callback(bool enter)
>>  {
>> -	if (smp_processor_id() != nmi_cpu)
>> +	struct hwlat_kthread_data *kdata = get_cpu_data();
>> +
>> +	if (!kdata->kthread)
>>  		return;
>>  
>>  	/*
>> @@ -160,13 +175,13 @@ void trace_hwlat_callback(bool enter)
>>  	 */
>>  	if (!IS_ENABLED(CONFIG_GENERIC_SCHED_CLOCK)) {
>>  		if (enter)
>> -			nmi_ts_start = time_get();
>> +			kdata->nmi_ts_start = time_get();
>>  		else
>> -			nmi_total_ts += time_get() - nmi_ts_start;
>> +			kdata->nmi_total_ts += time_get() - kdata->nmi_ts_start;
>>  	}
>>  
>>  	if (enter)
>> -		nmi_count++;
>> +		kdata->nmi_count++;
>>  }
>>  
>>  /**
>> @@ -178,6 +193,7 @@ void trace_hwlat_callback(bool enter)
>>   */
>>  static int get_sample(void)
>>  {
>> +	struct hwlat_kthread_data *kdata = get_cpu_data();
>>  	struct trace_array *tr = hwlat_trace;
>>  	struct hwlat_sample s;
>>  	time_type start, t1, t2, last_t2;
>> @@ -190,9 +206,8 @@ static int get_sample(void)
>>  
>>  	do_div(thresh, NSEC_PER_USEC); /* modifies interval value */
>>  
>> -	nmi_cpu = smp_processor_id();
>> -	nmi_total_ts = 0;
>> -	nmi_count = 0;
>> +	kdata->nmi_total_ts = 0;
>> +	kdata->nmi_count = 0;
>>  	/* Make sure NMIs see this first */
>>  	barrier();
>>  
>> @@ -262,15 +277,15 @@ static int get_sample(void)
>>  		ret = 1;
>>  
>>  		/* We read in microseconds */
>> -		if (nmi_total_ts)
>> -			do_div(nmi_total_ts, NSEC_PER_USEC);
>> +		if (kdata->nmi_total_ts)
>> +			do_div(kdata->nmi_total_ts, NSEC_PER_USEC);
>>  
>>  		hwlat_data.count++;
>>  		s.seqnum = hwlat_data.count;
>>  		s.duration = sample;
>>  		s.outer_duration = outer_sample;
>> -		s.nmi_total_ts = nmi_total_ts;
>> -		s.nmi_count = nmi_count;
>> +		s.nmi_total_ts = kdata->nmi_total_ts;
>> +		s.nmi_count = kdata->nmi_count;
>>  		s.count = count;
>>  		trace_hwlat_sample(&s);
>>  
>> @@ -376,23 +391,43 @@ static int kthread_fn(void *data)
>>  }
>>  
>>  /**
>> - * start_kthread - Kick off the hardware latency sampling/detector kthread
>> + * stop_single_kthread - Inform the hardware latency sampling/detector kthread to stop
>> + *
>> + * This kicks the running hardware latency sampling/detector kernel thread and
>> + * tells it to stop sampling now. Use this on unload and at system shutdown.
>> + */
>> +static void stop_single_kthread(void)
>> +{
>> +	struct hwlat_kthread_data *kdata = get_cpu_data();
>> +	struct task_struct *kthread = kdata->kthread;
>> +
>> +	if (!kthread)
>> +		return;
>> +
>> +	kthread_stop(kthread);
>> +
>> +	kdata->kthread = NULL;
>> +}
>> +
>> +
>> +/**
>> + * start_single_kthread - Kick off the hardware latency sampling/detector kthread
>>   *
>>   * This starts the kernel thread that will sit and sample the CPU timestamp
>>   * counter (TSC or similar) and look for potential hardware latencies.
>>   */
>> -static int start_kthread(struct trace_array *tr)
>> +static int start_single_kthread(struct trace_array *tr)
>>  {
>> +	struct hwlat_kthread_data *kdata = get_cpu_data();
>>  	struct cpumask *current_mask = &save_cpumask;
>>  	struct task_struct *kthread;
>>  	int next_cpu;
>>  
>> -	if (hwlat_kthread)
>> +	if (kdata->kthread)
>>  		return 0;
>>  
>> -
>>  	kthread = kthread_create(kthread_fn, NULL, "hwlatd");
>> -	if (IS_ERR(kthread)) {
>> +	if (IS_ERR(kthread)) {
>>  		pr_err(BANNER "could not start sampling thread\n");
>>  		return -ENOMEM;
>>  	}
>> @@ -419,24 +454,77 @@ static int start_kthread(struct trace_array *tr)
>>  
>>  	sched_setaffinity(kthread->pid, current_mask);
>>  
>> -	hwlat_kthread = kthread;
>> +	kdata->kthread = kthread;
>>  	wake_up_process(kthread);
>>  
>>  	return 0;
>>  }
>>  
>>  /**
>> - * stop_kthread - Inform the hardware latency samping/detector kthread to stop
>> + * stop_per_cpu_kthreads - Inform the hardware latency sampling/detector kthreads to stop
>>   *
>> - * This kicks the running hardware latency sampling/detector kernel thread and
>> + * This kicks the running hardware latency sampling/detector kernel threads and
>>   * tells it to stop sampling now. Use this on unload and at system shutdown.
>>   */
>> -static void stop_kthread(void)
>> +static void stop_per_cpu_kthreads(void)
>>  {
>> -	if (!hwlat_kthread)
>> -		return;
>> -	kthread_stop(hwlat_kthread);
>> -	hwlat_kthread = NULL;
>> +	struct task_struct *kthread;
>> +	int cpu;
>> +
>> +	for_each_online_cpu(cpu) {
>> +		kthread = per_cpu(hwlat_per_cpu_data, cpu).kthread;
>> +		if (kthread)
>> +			kthread_stop(kthread);
> 
> Probably want:
> 
> 		per_cpu(hwlat_per_cpu_data, cpu).kthread = NULL;
> 
> Just to be safe. I don't like to rely on the start doing the job, as things
> can change in the future. Having the clearing here as well makes the code
> more robust.

Ack!

> 
>> +	}
>> +}
>> +
>> +/**
>> + * start_per_cpu_kthreads - Kick off the hardware latency sampling/detector kthreads
>> + *
>> + * This starts the kernel threads that will sit on potentially all cpus and
>> + * sample the CPU timestamp counter (TSC or similar) and look for potential
>> + * hardware latencies.
>> + */
>> +static int start_per_cpu_kthreads(struct trace_array *tr)
>> +{
>> +	struct cpumask *current_mask = &save_cpumask;
>> +	struct cpumask *this_cpumask;
>> +	struct task_struct *kthread;
>> +	char comm[24];
>> +	int cpu;
>> +
>> +	if (!alloc_cpumask_var(&this_cpumask, GFP_KERNEL))
>> +		return -ENOMEM;
>> +
>> +	get_online_cpus();
>> +	/*
>> +	 * Run only on CPUs in which trace and hwlat are allowed to run.
>> +	 */
>> +	cpumask_and(current_mask, tr->tracing_cpumask, &hwlat_cpumask);
>> +	/*
>> +	 * And the CPU is online.
>> +	 */
>> +	cpumask_and(current_mask, cpu_online_mask, current_mask);
>> +	put_online_cpus();
>> +
>> +	for_each_online_cpu(cpu)
>> +		per_cpu(hwlat_per_cpu_data, cpu).kthread = NULL;
>> +
>> +	for_each_cpu(cpu, current_mask) {
>> +		snprintf(comm, 24, "hwlatd/%d", cpu);
>> +
>> +		kthread = kthread_create_on_cpu(kthread_fn, NULL, cpu, comm);
>> +		if (IS_ERR(kthread)) {
>> +			pr_err(BANNER "could not start sampling thread\n");
>> +			stop_per_cpu_kthreads();
>> +			return -ENOMEM;
>> +		}
>> +
>> +		per_cpu(hwlat_per_cpu_data, cpu).kthread = kthread;
>> +		wake_up_process(kthread);
>> +	}
>> +
>> +	return 0;
>>  }
>>  
>>  /*
>> @@ -701,7 +789,8 @@ static int hwlat_mode_open(struct inode *inode, struct file *file)
>>   * The "none" sets the allowed cpumask for a single hwlatd thread at the
>>   * startup and lets the scheduler handle the migration. The default mode is
>>   * the "round-robin" one, in which a single hwlatd thread runs, migrating
>> - * among the allowed CPUs in a round-robin fashion.
>> + * among the allowed CPUs in a round-robin fashion. The "per-cpu" mode
>> + * creates one hwlatd thread per allowed CPU.
>>   */
>>  static ssize_t hwlat_mode_write(struct file *filp, const char __user *ubuf,
>>  				 size_t cnt, loff_t *ppos)
>> @@ -827,14 +916,20 @@ static void hwlat_tracer_start(struct trace_array *tr)
>>  {
>>  	int err;
>>  
>> -	err = start_kthread(tr);
>> +	if (hwlat_data.thread_mode == MODE_PER_CPU)
>> +		err = start_per_cpu_kthreads(tr);
>> +	else
>> +		err = start_single_kthread(tr);
>>  	if (err)
>>  		pr_err(BANNER "Cannot start hwlat kthread\n");
>>  }
>>  
>>  static void hwlat_tracer_stop(struct trace_array *tr)
>>  {
>> -	stop_kthread();
>> +	if (hwlat_data.thread_mode == MODE_PER_CPU)
>> +		stop_per_cpu_kthreads();
>> +	else
>> +		stop_single_kthread();
> 
> This explains why you have the "busy" check in the changing of the modes.
> But really, I don't see why you can't change the mode. Just stop the
> previous mode, and start the new one.

I will try it!

Thanks!
-- Daniel

> -- Steve
> 
> 
>>  }
>>  
>>  static int hwlat_tracer_init(struct trace_array *tr)
>> @@ -864,7 +959,7 @@ static int hwlat_tracer_init(struct trace_array *tr)
>>  
>>  static void hwlat_tracer_reset(struct trace_array *tr)
>>  {
>> -	stop_kthread();
>> +	hwlat_tracer_stop(tr);
>>  
>>  	/* the tracing threshold is static between runs */
>>  	last_tracing_thresh = tracing_thresh;
> 
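
For illustration, a minimal userspace sketch of the NMI accounting done in trace_hwlat_callback(), assuming the intent is to account NMI time only when a sampling thread owns the data. All names are hypothetical stand-ins: has_kthread replaces the kthread pointer, and the 'now' argument replaces time_get() so the example is deterministic:

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for struct hwlat_kthread_data in the patch. */
struct kthread_data_sketch {
	bool has_kthread;	/* stands in for the kthread pointer */
	uint64_t nmi_ts_start;	/* timestamp of the last NMI entry */
	uint64_t nmi_total_ts;	/* accumulated time spent in NMIs */
	int nmi_count;		/* number of NMI entries seen */
};

/*
 * Called on NMI entry (enter == true) and on NMI exit (enter == false).
 * Nothing is accounted when no sampling thread is attached to 'kdata'.
 */
void nmi_callback_sketch(struct kthread_data_sketch *kdata, bool enter,
			 uint64_t now)
{
	if (!kdata->has_kthread)
		return;

	if (enter) {
		kdata->nmi_ts_start = now;
		kdata->nmi_count++;
	} else {
		kdata->nmi_total_ts += now - kdata->nmi_ts_start;
	}
}
```

Two NMIs lasting 50 and 10 time units leave nmi_count at 2 and nmi_total_ts at 60, while a structure without a sampling thread stays untouched.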



* Re: [RFC PATCH 5/5] tracing: Add the osnoise tracer
  2021-04-14 17:14   ` Steven Rostedt
@ 2021-04-15 13:43     ` Daniel Bristot de Oliveira
  0 siblings, 0 replies; 25+ messages in thread
From: Daniel Bristot de Oliveira @ 2021-04-15 13:43 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, kcarcia, Jonathan Corbet, Ingo Molnar,
	Peter Zijlstra, Thomas Gleixner, Alexandre Chartre,
	Clark Willaims, John Kacur, Juri Lelli, linux-doc

On 4/14/21 7:14 PM, Steven Rostedt wrote:
> On Thu,  8 Apr 2021 16:13:23 +0200
> Daniel Bristot de Oliveira <bristot@redhat.com> wrote:
> 
>> In the context of high-performance computing (HPC), the Operating System
>> Noise (osnoise) refers to the interference experienced by an application
>> due to activities inside the operating system. In the context of Linux,
>> NMIs, IRQs, SoftIRQs, and any other system thread can cause noise to the
>> system. Moreover, hardware-related jobs can also cause noise, for example,
>> via SMIs.
>>
>> hwlat_detector is one of the tools used to identify the most complex
>> source of noise: hardware noise.
>>
>> In a nutshell, the hwlat_detector creates a thread that runs
>> periodically for a given period. At the beginning of a period, the thread
>> disables interrupts and starts sampling. While running, the hwlatd
>> thread reads the time in a loop. As interrupts are disabled, threads,
>> IRQs, and SoftIRQs cannot interfere with the hwlatd thread. Hence, the
>> cause of any gap between two different reads of the time is either an
>> NMI or the hardware itself. At the end of the period, hwlatd enables
>> interrupts and reports the max observed gap between the reads. It also
>> prints an NMI occurrence counter. If the output does not report NMI
>> executions, the user can conclude that the hardware is the culprit for
>> the latency. The hwlat detects the NMI execution by observing
>> the entry and exit of an NMI.
>>
>> The osnoise tracer leverages the hwlat_detector by running a
>> similar loop with preemption, SoftIRQs and IRQs enabled, thus allowing
>> all the sources of osnoise during its execution. Using the same approach
>> of hwlat, osnoise takes note of the entry and exit point of any
>> source of interferences, increasing a per-cpu interference counter. The
>> osnoise tracer also saves an interference counter for each source of
>> interference. The interference counter for NMI, IRQs, SoftIRQs, and
>> threads is increased anytime the tool observes these interferences' entry
>> events. When a noise happens without any interference from the operating
>> system level, the hardware noise counter increases, pointing to a
>> hardware-related noise. In this way, osnoise can account for any
>> source of interference. At the end of the period, the osnoise tracer
>> prints the sum of all noise, the max single noise, the percentage of CPU
>> available for the thread, and the counters for the noise sources.
>>
>> Usage
>>
>> Write the ASCII text osnoise into the current_tracer file of the
>> tracing system (generally mounted at /sys/kernel/tracing or
>> /sys/kernel/debug/tracing).
>>
>> For example::
>>
>>         [root@f32 ~]# cd /sys/kernel/tracing/
>>         [root@f32 tracing]# echo osnoise > current_tracer
>>
>> It is possible to follow the trace by reading the trace file::
>>
>>         [root@f32 tracing]# cat trace
>>         # tracer: osnoise
>>         #
>>         #                                _-----=> irqs-off
>>         #                               / _----=> need-resched
>>         #                              | / _---=> hardirq/softirq
>>         #                              || / _--=> preempt-depth                            MAX
>>         #                              || /                                             SINGLE     Interference counters:
>>         #                              ||||               RUNTIME      NOISE   % OF CPU  NOISE    +-----------------------------+
>>         #           TASK-PID      CPU# ||||   TIMESTAMP    IN US       IN US  AVAILABLE  IN US     HW    NMI    IRQ   SIRQ THREAD
>>         #              | |         |   ||||      |           |             |    |            |      |      |      |      |      |
>>                    <...>-859     [000] ....    81.637220: 1000000        190  99.98100       9     18      0   1007     18      1
>>                    <...>-860     [001] ....    81.638154: 1000000        656  99.93440      74     23      0   1006     16      3
>>                    <...>-861     [002] ....    81.638193: 1000000       5675  99.43250     202      6      0   1013     25     21
>>                    <...>-862     [003] ....    81.638242: 1000000        125  99.98750      45      1      0   1011     23      0
>>                    <...>-863     [004] ....    81.638260: 1000000       1721  99.82790     168      7      0   1002     49     41
>>                    <...>-864     [005] ....    81.638286: 1000000        263  99.97370      57      6      0   1006     26      2
>>                    <...>-865     [006] ....    81.638302: 1000000        109  99.98910      21      3      0   1006     18      1
>>                    <...>-866     [007] ....    81.638326: 1000000       7816  99.21840     107      8      0   1016     39     19
>>
>> In addition to the regular trace fields (from TASK-PID to TIMESTAMP), the
>> tracer prints a message at the end of each period for each CPU that is
>> running an osnoise/ thread. The osnoise specific fields report:
>>
>>  - The RUNTIME IN US reports the amount of time in microseconds that
>>    the osnoise thread kept looping reading the time.
>>  - The NOISE IN US reports the sum of noise in microseconds observed
>>    by the osnoise tracer during the associated runtime.
>>  - The % OF CPU AVAILABLE reports the percentage of CPU available for
>>    the osnoise thread during the runtime window.
>>  - The MAX SINGLE NOISE IN US reports the maximum single noise observed
>>    during the runtime window.
>>  - The Interference counters display how many of each of the respective
>>    interferences happened during the runtime window.
>>
>> Note that the example above shows a high number of HW noise samples.
>> The reason is that this sample was taken on a virtual machine,
>> and the host interference is detected as a hardware interference.
>>
>> Tracer options
>>
>> The tracer has a set of options inside the osnoise directory, they are:
>>
>>  - cpus: CPUs on which an osnoise thread will execute.
> 
> Again, I think we can reuse the tracing_cpumask.
> 
>>  - period_us: the period of the osnoise thread.
>>  - runtime_us: how long an osnoise thread will look for noise.
> 
> These seem the same as window and width. At a minimum should probably share
> the same code.

How about creating a generic handler for all the "to long" writes, that receives
a structure containing a pointer to where to save the value, a min and a max
acceptable value?

If so, where to place this function? trace.c?
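
To make the idea concrete, here is a rough userspace sketch of such a handler. Everything in it is hypothetical: the struct and function names are made up for this example, strtol() stands in for the kernel's string-to-number helpers, and the real handler would copy the buffer from user space first:

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical descriptor: where to store the value and its valid range. */
struct long_param_sketch {
	long *val;
	long min;
	long max;
};

/*
 * Parse 'buf' as a decimal long and store it in *p->val when it lies
 * within [p->min, p->max]. Returns 0 on success, -EINVAL otherwise,
 * leaving the stored value untouched on failure.
 */
int write_long_param_sketch(const struct long_param_sketch *p,
			    const char *buf)
{
	char *end;
	long v;

	errno = 0;
	v = strtol(buf, &end, 10);
	if (end == buf || errno == ERANGE)
		return -EINVAL;

	/* Allow one trailing newline, as produced by echo. */
	if (*end == '\n')
		end++;
	if (*end != '\0')
		return -EINVAL;

	if (v < p->min || v > p->max)
		return -EINVAL;

	*p->val = v;
	return 0;
}
```

Each file (period, runtime, and so on) would then only need its own descriptor, with the parsing and range checks shared.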

>>  - stop_tracing_single_us: stop the system tracing if a single noise
>>    higher than the configured value happens. Writing 0 disables this
>>    option.
>>  - stop_tracing_total_us: stop the system tracing if a NOISE IN US
>>    higher than the configured value happens. Writing 0 disables this
>>    option.
>>  - tolerance_ns: the minimum delta between two time() reads to be
>>    considered as noise.
> 
> You can use tracing_threshold for the tolerance. Do you really need it in
> ns?

Yes, I can. I placed it in ns to allow the fine-tuning that one might need. But
I can use the tracing_threshold in us as well.
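
For context, a minimal userspace sketch of how a tolerance applies to the sampling loop: any gap between two consecutive time reads that reaches the tolerance counts as noise. The names are hypothetical, the time reads are passed in as an array so the example is deterministic, and counting the whole gap as noise is an assumption of this sketch, not a statement about the tracer's exact accounting:

```c
#include <stddef.h>
#include <stdint.h>

struct noise_stats_sketch {
	uint64_t total_noise;	/* sum of all gaps counted as noise */
	uint64_t max_single;	/* largest single gap counted as noise */
	unsigned int samples;	/* number of gaps counted as noise */
};

/*
 * Walk consecutive time reads and account every gap that is at least
 * 'tolerance' long. 'reads' must be in non-decreasing order.
 */
void account_noise_sketch(const uint64_t *reads, size_t n,
			  uint64_t tolerance,
			  struct noise_stats_sketch *out)
{
	out->total_noise = 0;
	out->max_single = 0;
	out->samples = 0;

	for (size_t i = 1; i < n; i++) {
		uint64_t delta = reads[i] - reads[i - 1];

		/* Gaps below the tolerance are the loop's own cost. */
		if (delta < tolerance)
			continue;

		out->total_noise += delta;
		out->samples++;
		if (delta > out->max_single)
			out->max_single = delta;
	}
}
```

The unit of the tolerance (ns or us) only changes the resolution at which short gaps can be told apart from the loop's own cost.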

>>
>> Additional Tracing
>>
>> In addition to the tracer, a set of tracepoints were added to
>> facilitate the identification of the osnoise source.
>>
>>  - osnoise:sample_threshold: printed anytime a noise is higher than
>>    the configurable tolerance_ns.
>>  - osnoise:nmi_noise: noise from NMI, including the duration.
>>  - osnoise:irq_noise: noise from an IRQ, including the duration.
>>  - osnoise:softirq_noise: noise from a SoftIRQ, including the
>>    duration.
>>  - osnoise:thread_noise: noise from a thread, including the duration.
>>
>> Note that all the values are net values. This means that a thread
>> duration will not contain the duration of the IRQs that happened during
>> its execution, for example. The same is valid for all duration values.
>>
>> Here is one example of the usage of these tracepoints::
>>
>>        osnoise/8-961     [008] d.h.  5789.857532: irq_noise: local_timer:236 start 5789.857529929 duration 1845 ns
>>        osnoise/8-961     [008] dNh.  5789.858408: irq_noise: local_timer:236 start 5789.858404871 duration 2848 ns
>>      migration/8-54      [008] d...  5789.858413: thread_noise: migration/8:54 start 5789.858409300 duration 3068 ns
>>        osnoise/8-961     [008] ....  5789.858413: sample_threshold: start 5789.858404555 duration 8 us interferences 2
>>
>> In this example, a noise sample of 8 microseconds was reported in the last
>> line, pointing to two interferences. Looking backward in the trace, the
>> two previous entries were about the migration thread running after
>> a timer IRQ execution. The first event is not part of the noise because
>> it took place one millisecond before.
>>
>> It is worth noticing that the sum of the duration reported in the
>> tracepoints is smaller than the eight us reported in the
>> sample_threshold. The reason lies in the tracing overhead and in
>> the overhead of the entry and exit code that happens before and after
>> any interference execution. This justifies the dual approach: measuring
>> thread and tracing.
> 
> I'm not sure the tracing overhead had much to do with it as did the
> overhead of entering the interrupt itself. events are rather fast (usually
> less than 200ns depending on the system). You can always enable the
> benchmark_event to see what trace event overhead is. Then again, cold cache
> can play into it as well.

I agree, I will remove the tracing overhead part.

>>
>> Cc: Jonathan Corbet <corbet@lwn.net>
>> Cc: Steven Rostedt <rostedt@goodmis.org>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Alexandre Chartre <alexandre.chartre@oracle.com>
>> Cc: Clark Willaims <williams@redhat.com>
>> Cc: John Kacur <jkacur@redhat.com>
>> Cc: Juri Lelli <juri.lelli@redhat.com>
>> Cc: linux-doc@vger.kernel.org
>> Cc: linux-kernel@vger.kernel.org
>> Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
>>
>> ---
>>  Documentation/trace/osnoise_tracer.rst |  149 ++
>>  include/linux/ftrace_irq.h             |   16 +
>>  include/trace/events/osnoise.h         |  141 ++
>>  kernel/trace/Kconfig                   |   34 +
>>  kernel/trace/Makefile                  |    1 +
>>  kernel/trace/trace.h                   |    9 +-
>>  kernel/trace/trace_entries.h           |   27 +
>>  kernel/trace/trace_osnoise.c           | 1714 ++++++++++++++++++++++++
>>  kernel/trace/trace_output.c            |   72 +-
>>  9 files changed, 2159 insertions(+), 4 deletions(-)
>>  create mode 100644 Documentation/trace/osnoise_tracer.rst
>>  create mode 100644 include/trace/events/osnoise.h
>>  create mode 100644 kernel/trace/trace_osnoise.c
>>
>> diff --git a/Documentation/trace/osnoise_tracer.rst b/Documentation/trace/osnoise_tracer.rst
>> new file mode 100644
>> index 000000000000..9a97f557317b
>> --- /dev/null
>> +++ b/Documentation/trace/osnoise_tracer.rst
>> @@ -0,0 +1,149 @@
>> +==============
>> +OSNOISE Tracer
>> +==============
>> +
>> +In the context of high-performance computing (HPC), the Operating System
>> +Noise (*osnoise*) refers to the interference experienced by an application
>> +due to activities inside the operating system. In the context of Linux,
>> +NMIs, IRQs, SoftIRQs, and any other system thread can cause noise to the
>> +system. Moreover, hardware-related jobs can also cause noise, for example,
>> +via SMIs.
>> +
>> +``hwlat_detector`` is one of the tools used to identify the most complex
>> +source of noise: *hardware noise*.
>> +
>> +In a nutshell, the ``hwlat_detector`` creates a thread that runs
>> +periodically for a given period. At the beginning of a period, the thread
>> +disables interrupts and starts sampling. While running, the ``hwlatd``
>> +thread reads the time in a loop. As interrupts are disabled, threads,
>> +IRQs, and SoftIRQs cannot interfere with the ``hwlatd`` thread. Hence, the
>> +cause of any gap between two different reads of the time is either an
>> +NMI or the hardware itself. At the end of the period, ``hwlatd`` enables
>> +interrupts and reports the max observed gap between the reads. It also
>> +prints a NMI occurrence counter. If the output does not report NMI
>> +executions, the user can conclude that the hardware is the culprit for
>> +the latency. The ``hwlat`` detects the NMI execution by observing
>> +the entry and exit of a NMI.
>> +
>> +The ``osnoise`` tracer leverages the ``hwlat_detector`` by running a
>> +similar loop with preemption, SoftIRQs and IRQs enabled, thus allowing
>> +all the sources of *osnoise* during its execution. Using the same approach
>> +of ``hwlat``, ``osnoise`` takes note of the entry and exit point of any
>> +source of interferences, increasing a per-cpu interference counter. The
>> +``osnoise`` tracer also saves an interference counter for each source of
>> +interference. The interference counter for NMI, IRQs, SoftIRQs, and
>> +threads is increased anytime the tool observes these interferences' entry
>> +events. When a noise happens without any interference from the operating
>> +system level, the hardware noise counter increases, pointing to a
>> +hardware-related noise. In this way, ``osnoise`` can account for any
>> +source of interference. At the end of the period, the ``osnoise`` tracer
>> +prints the sum of all noise, the max single noise, the percentage of CPU
>> +available for the thread, and the counters for the noise sources.
>> +
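
To make the accounting described above concrete, here is a minimal user-space sketch, not the tracer code itself. It assumes a pre-recorded array of time reads and counts every gap at or above a tolerance as noise, producing the total, the max single noise, and a counter. The names ``noise_summary`` and ``account_noise`` are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Summary of one sampling window (hypothetical type, illustration only). */
struct noise_summary {
	uint64_t total_ns;	/* sum of all gaps counted as noise */
	uint64_t max_ns;	/* largest single gap counted as noise */
	unsigned int count;	/* number of gaps counted as noise */
};

/*
 * Walk consecutive time reads and account every gap that is at least
 * tolerance_ns long as noise. Smaller gaps are treated as the normal
 * cost of the sampling loop and ignored.
 */
static struct noise_summary account_noise(const uint64_t *reads, size_t n,
					  uint64_t tolerance_ns)
{
	struct noise_summary s = { 0, 0, 0 };

	for (size_t i = 1; i < n; i++) {
		/* The clock is assumed to be monotonic. */
		assert(reads[i] >= reads[i - 1]);

		uint64_t gap = reads[i] - reads[i - 1];

		if (gap < tolerance_ns)
			continue;

		s.total_ns += gap;
		s.count++;
		if (gap > s.max_ns)
			s.max_ns = gap;
	}

	return s;
}
```

With reads at 0, 100, 200, 5300, 5400 and 9400 ns and a tolerance of 1000 ns, only the 5100 ns and 4000 ns gaps are counted.
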
>> +Usage
>> +-----
>> +
>> +Write the ASCII text ``osnoise`` into the ``current_tracer`` file of the
>> +tracing system (generally mounted at ``/sys/kernel/tracing`` or
>> +``/sys/kernel/debug/tracing``).
> 
> I wouldn't even mention the /sys/kernel/debug/tracing path, I'm trying to
> deprecate that.

I mention it because that is the path (still) used on Fedora...

>> +
>> +For example::
>> +
>> +        [root@f32 ~]# cd /sys/kernel/tracing/
>> +        [root@f32 tracing]# echo osnoise > current_tracer
>> +
>> +It is possible to follow the trace by reading the ``trace`` file::
>> +
>> +        [root@f32 tracing]# cat trace
>> +        # tracer: osnoise
>> +        #
>> +        #                                _-----=> irqs-off
>> +        #                               / _----=> need-resched
>> +        #                              | / _---=> hardirq/softirq
>> +        #                              || / _--=> preempt-depth                            MAX
>> +        #                              || /                                             SINGLE     Interference counters:
>> +        #                              ||||               RUNTIME      NOISE   % OF CPU  NOISE    +-----------------------------+
>> +        #           TASK-PID      CPU# ||||   TIMESTAMP    IN US       IN US  AVAILABLE  IN US     HW    NMI    IRQ   SIRQ THREAD
>> +        #              | |         |   ||||      |           |             |    |            |      |      |      |      |      |
>> +                   <...>-859     [000] ....    81.637220: 1000000        190  99.98100       9     18      0   1007     18      1
>> +                   <...>-860     [001] ....    81.638154: 1000000        656  99.93440      74     23      0   1006     16      3
>> +                   <...>-861     [002] ....    81.638193: 1000000       5675  99.43250     202      6      0   1013     25     21
>> +                   <...>-862     [003] ....    81.638242: 1000000        125  99.98750      45      1      0   1011     23      0
>> +                   <...>-863     [004] ....    81.638260: 1000000       1721  99.82790     168      7      0   1002     49     41
>> +                   <...>-864     [005] ....    81.638286: 1000000        263  99.97370      57      6      0   1006     26      2
>> +                   <...>-865     [006] ....    81.638302: 1000000        109  99.98910      21      3      0   1006     18      1
>> +                   <...>-866     [007] ....    81.638326: 1000000       7816  99.21840     107      8      0   1016     39     19
>> +
>> +In addition to the regular trace fields (from TASK-PID to TIMESTAMP), the
>> +tracer prints a message at the end of each period for each CPU that is
>> +running an ``osnoise/`` thread. The osnoise specific fields report:
>> +
>> + - The ``RUNTIME IN US`` reports the amount of time in microseconds that
>> +   the ``osnoise`` thread kept looping reading the time.
>> + - The ``NOISE IN US`` reports the sum of noise in microseconds observed
>> +   by the osnoise tracer during the associated runtime.
>> + - The ``% OF CPU AVAILABLE`` reports the percentage of CPU available for
>> +   the ``osnoise`` thread during the ``runtime`` window.
>> + - The ``MAX SINGLE NOISE IN US`` reports the maximum single noise observed
>> +   during the ``runtime`` window.
>> + - The ``Interference counters`` display how many of each of the respective
>> +   interferences happened during the ``runtime`` window.
>> +
>> +Note that the example above shows a high number of ``HW noise`` samples.
>> +The reason is that this sample was taken on a virtual machine,
>> +and the host interference is detected as a hardware interference.
>> +
>> +Tracer options
>> +---------------------
>> +
>> +The tracer has a set of options inside the ``osnoise`` directory, they are:
>> +
>> + - ``cpus``: CPUs on which an ``osnoise`` thread will execute.
>> + - ``period_us``: the period of the ``osnoise`` thread.
>> + - ``runtime_us``: how long an ``osnoise`` thread will look for noise.
>> + - ``stop_tracing_single_us``: stop the system tracing if a single noise
>> +   higher than the configured value happens. Writing ``0`` disables this
>> +   option.
>> + - ``stop_tracing_total_us``: stop the system tracing if a ``NOISE IN US``
>> +   higher than the configured value happens. Writing ``0`` disables this
>> +   option.
>> + - ``tolerance_ns``: the minimum delta between two time() reads to be
>> +   considered as noise.
>> +
>> +Additional Tracing
>> +------------------
>> +
>> +In addition to the tracer, a set of ``tracepoints`` were added to
>> +facilitate the identification of the osnoise source.
>> +
>> + - ``osnoise:sample_threshold``: printed anytime a noise is higher than
>> +   the configurable ``tolerance_ns``.
>> + - ``osnoise:nmi_noise``: noise from NMI, including the duration.
>> + - ``osnoise:irq_noise``: noise from an IRQ, including the duration.
>> + - ``osnoise:softirq_noise``: noise from a SoftIRQ, including the
>> +   duration.
>> + - ``osnoise:thread_noise``: noise from a thread, including the duration.
>> +
>> +Note that a all the values are *net values*. This means that a *thread*
> 
>    "a all"?

Oops, I will fix that.

>> +duration will not contain the duration of the *IRQs* that happened during
>> +its execution, for example. The same is valid for all duration values.
> 
> The above is hard to understand. Do you mean individual instances of noise
> is not recorded, and only the sum is?

I need to rephrase that... I meant that when two or more "noises" are
stacked, e.g., a thread noise is running and an IRQ noise happens on top of
it, the duration of the noise on top is discounted from the one below it, e.g.,
the IRQ duration is subtracted from the thread noise duration. Like...

osnoise_running
        --------> thread noise in
                       run 5 us
                  ----------------> IRQ noise
                                    run 3 us
                  <---------------- print duration 3 us
                       run 1 us
        <--------- print duration 6 us (not 9 us).

Making this computation in the kernel reduces the number of events printed in
the buffer.
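
As a rough illustration of the net-value idea in the diagram above, here is a small user-space sketch. It is an assumption about one way to do it, not the patch code: each nested interference, on exit, pushes the start time of everything below it forward by its own net duration, so the lower levels never account for time spent on top of them. ``noise_stack``, ``noise_enter`` and ``noise_exit`` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_DEPTH 4	/* arbitrary limit for this sketch */

/* Start timestamps of the interferences currently stacked on this CPU. */
struct noise_stack {
	uint64_t start_ns[MAX_DEPTH];
	int depth;
};

/* A new interference (thread, softirq, IRQ, ...) starts at 'now'. */
static void noise_enter(struct noise_stack *s, uint64_t now)
{
	assert(s->depth < MAX_DEPTH);
	s->start_ns[s->depth++] = now;
}

/*
 * The topmost interference ends at 'now'. Return its net duration and
 * move the start of every interference below it forward by that amount,
 * so the time spent here is discounted from all of them.
 */
static uint64_t noise_exit(struct noise_stack *s, uint64_t now)
{
	assert(s->depth > 0);

	int top = --s->depth;

	assert(now >= s->start_ns[top]);

	uint64_t net = now - s->start_ns[top];

	for (int i = 0; i < top; i++)
		s->start_ns[i] += net;

	return net;
}
```

Replaying the diagram: the thread enters at 0, the IRQ enters at 5 us and exits at 8 us (3 us), and the thread exits at 9 us, reporting 6 us instead of 9 us.
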

>> +
>> +Here is one example of the usage of these ``tracepoints``::
>> +
>> +       osnoise/8-961     [008] d.h.  5789.857532: irq_noise: local_timer:236 start 5789.857529929 duration 1845 ns
>> +       osnoise/8-961     [008] dNh.  5789.858408: irq_noise: local_timer:236 start 5789.858404871 duration 2848 ns
>> +     migration/8-54      [008] d...  5789.858413: thread_noise: migration/8:54 start 5789.858409300 duration 3068 ns
>> +       osnoise/8-961     [008] ....  5789.858413: sample_threshold: start 5789.858404555 duration 8 us interferences 2
>> +
>> +In this example, a noise sample of 8 microseconds was reported in the last
>> +fine, pointing to two interferences. Looking backward in the trace, the
> 
>   "fine"?

Oops, "line" :-). Will fix.

>> +two previous entries were about the ``migration`` thread running after
>> +a timer IRQ execution. The first event is not part of the noise because
>> +it took place one millisecond before.
>> +
>> +It is worth noticing that the sum of the duration reported in the
>> +``tracepoints`` is smaller than eight us reported in the
>> +``sample_threshold``. The reason roots in the tracing overhead and in
>> +the overhead of the entry and exit code that happens before and after
>> +any interference execution. This justifies the dual approach: measuring
>> +thread and tracing.
>> diff --git a/include/linux/ftrace_irq.h b/include/linux/ftrace_irq.h
>> index 0abd9a1d2852..fd54045980ce 100644
>> --- a/include/linux/ftrace_irq.h
>> +++ b/include/linux/ftrace_irq.h
>> @@ -7,12 +7,24 @@ extern bool trace_hwlat_callback_enabled;
>>  extern void trace_hwlat_callback(bool enter);
>>  #endif
>>  
>> +/*
>> + * XXX: Make it generic
> 
> Yes, this should be the same for both the hwlat detector and for
> osnoise.

Where should I place it? In hwlat, making osnoise select it? In trace.c?

>> + */
>> +#ifdef CONFIG_OSNOISE_TRACER
>> +extern bool trace_osnoise_callback_enabled;
>> +extern void trace_osnoise_callback(bool enter);
>> +#endif
>> +
>>  static inline void ftrace_nmi_enter(void)
>>  {
>>  #ifdef CONFIG_HWLAT_TRACER
>>  	if (trace_hwlat_callback_enabled)
>>  		trace_hwlat_callback(true);
>>  #endif
>> +#ifdef CONFIG_OSNOISE_TRACER
>> +	if (trace_osnoise_callback_enabled)
>> +		trace_osnoise_callback(true);
>> +#endif
>>  }
>>  
>>  static inline void ftrace_nmi_exit(void)
>> @@ -21,6 +33,10 @@ static inline void ftrace_nmi_exit(void)
>>  	if (trace_hwlat_callback_enabled)
>>  		trace_hwlat_callback(false);
>>  #endif
>> +#ifdef CONFIG_OSNOISE_TRACER
>> +	if (trace_osnoise_callback_enabled)
>> +		trace_osnoise_callback(false);
>> +#endif
>>  }
>>  
>>  #endif /* _LINUX_FTRACE_IRQ_H */
>> diff --git a/include/trace/events/osnoise.h b/include/trace/events/osnoise.h
>> new file mode 100644
>> index 000000000000..81939234814b
>> --- /dev/null
>> +++ b/include/trace/events/osnoise.h
>> @@ -0,0 +1,141 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#undef TRACE_SYSTEM
>> +#define TRACE_SYSTEM osnoise
>> +
>> +#if !defined(_OSNOISE_TRACE_H) || defined(TRACE_HEADER_MULTI_READ)
>> +#define _OSNOISE_TRACE_H
>> +
>> +#include <linux/tracepoint.h>
>> +TRACE_EVENT(thread_noise,
>> +
>> +	TP_PROTO(struct task_struct *t, u64 start, u64 duration),
>> +
>> +	TP_ARGS(t, start, duration),
>> +
>> +	TP_STRUCT__entry(
>> +		__array(	char,		comm,	TASK_COMM_LEN)
>> +		__field(	pid_t,		pid	)
> 
> I would place the start and duration first. As pid is 4 bytes, you have a 4
> byte "hole" in the structure:
> 
> system: osnoise
> name: thread_noise
> ID: 442
> format:
> 	field:unsigned short common_type;	offset:0;	size:2;	signed:0;
> 	field:unsigned char common_flags;	offset:2;	size:1;	signed:0;
> 	field:unsigned char common_preempt_count;	offset:3;	size:1;	signed:0;
> 	field:int common_pid;	offset:4;	size:4;	signed:1;
> 
> 	field:char comm[16];	offset:8;	size:16;	signed:1;
> 	field:pid_t pid;	offset:24;	size:4;	signed:1;
> 
> [ 4 bytes of nothing here ]
> 
> 	field:u64 start;	offset:32;	size:8;	signed:0;
> 	field:u64 duration;	offset:40;	size:8;	signed:0;
> 

Ack, will fix that.
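
For reference, the hole can be reproduced in user space with ``offsetof()``. The sketch below is only a model of the layout listed in the format output above: the struct names are made up, and it assumes a 64-bit ABI where ``uint64_t`` is 8-byte aligned. It only shows field offsets; padding may still appear at the tail of the reordered struct.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Models the four common_* fields from the format output (8 bytes). */
struct common_hdr {
	unsigned short type;
	unsigned char flags;
	unsigned char preempt_count;
	int pid;
};

static_assert(sizeof(struct common_hdr) == 8, "sketch assumes an 8-byte header");

/* Field order as in the RFC: comm, pid, start, duration. */
struct thread_noise_before {
	struct common_hdr hdr;
	char comm[16];
	int32_t pid;
	uint64_t start;
	uint64_t duration;
};

/* Suggested order: the 64-bit fields first, then comm and pid. */
struct thread_noise_after {
	struct common_hdr hdr;
	uint64_t start;
	uint64_t duration;
	char comm[16];
	int32_t pid;
};

/* Padding bytes between pid and start in the original order. */
static size_t hole_before(void)
{
	return offsetof(struct thread_noise_before, start) -
	       (offsetof(struct thread_noise_before, pid) + sizeof(int32_t));
}

/* Padding bytes between comm and pid in the suggested order. */
static size_t hole_after(void)
{
	return offsetof(struct thread_noise_after, pid) -
	       (offsetof(struct thread_noise_after, comm) + 16);
}
```

Under these assumptions ``start`` sits at offset 32 in the original order, with 4 unused bytes before it, and no interior hole remains after the reorder.
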

>> +		__field(	u64,		start	)
>> +		__field(	u64,		duration)
>> +	),
>> +
>> +	TP_fast_assign(
>> +		memcpy(__entry->comm, t->comm, TASK_COMM_LEN);
>> +		__entry->pid = t->pid;
>> +		__entry->start = start;
>> +		__entry->duration = duration;
>> +	),
>> +
>> +	TP_printk("%8s:%d start %llu.%09u duration %llu ns",
>> +		__entry->comm,
>> +		__entry->pid,
>> +		__print_ns_to_secs(__entry->start),
>> +		__print_ns_without_secs(__entry->start),
>> +		__entry->duration)
>> +);
>> +
>> +TRACE_EVENT(softirq_noise,
>> +
>> +	TP_PROTO(int vector, u64 start, u64 duration),
>> +
>> +	TP_ARGS(vector, start, duration),
>> +
>> +	TP_STRUCT__entry(
>> +		__field(	int,		vector	)
>> +		__field(	u64,		start	)
>> +		__field(	u64,		duration)
> 
> Same here.
> 
> name: softirq_noise
> ID: 441
> format:
> 	field:unsigned short common_type;	offset:0;	size:2;	signed:0;
> 	field:unsigned char common_flags;	offset:2;	size:1;	signed:0;
> 	field:unsigned char common_preempt_count;	offset:3;	size:1;	signed:0;
> 	field:int common_pid;	offset:4;	size:4;	signed:1;
> 
> 	field:int vector;	offset:8;	size:4;	signed:1;
> 
> [ 4 bytes of nothing here]
> 
> 	field:u64 start;	offset:16;	size:8;	signed:0;
> 	field:u64 duration;	offset:24;	size:8;	signed:0;

ack!

>> +	),
>> +
>> +	TP_fast_assign(
>> +		__entry->vector = vector;
>> +		__entry->start = start;
>> +		__entry->duration = duration;
>> +	),
>> +
>> +	TP_printk("%8s:%d start %llu.%09u duration %llu ns",
>> +		show_softirq_name(__entry->vector),
>> +		__entry->vector,
>> +		__print_ns_to_secs(__entry->start),
>> +		__print_ns_without_secs(__entry->start),
>> +		__entry->duration)
>> +);
>> +
>> +TRACE_EVENT(irq_noise,
>> +
>> +	TP_PROTO(int vector, const char *desc, u64 start, u64 duration),
>> +
>> +	TP_ARGS(vector, desc, start, duration),
>> +
>> +	TP_STRUCT__entry(
>> +		__string(	desc,		desc    )
>> +		__field(	int,		vector	)
> 
> This doesn't have a hole, but I think it should still switch to be
> consistent.

Ack!

> 
>> +		__field(	u64,		start	)
>> +		__field(	u64,		duration)
>> +	),
>> +
>> +	TP_fast_assign(
>> +		__assign_str(desc, desc);
>> +		__entry->vector = vector;
>> +		__entry->start = start;
>> +		__entry->duration = duration;
>> +	),
>> +
>> +	TP_printk("%s:%d start %llu.%09u duration %llu ns",
>> +		__get_str(desc),
>> +		__entry->vector,
>> +		__print_ns_to_secs(__entry->start),
>> +		__print_ns_without_secs(__entry->start),
>> +		__entry->duration)
>> +);
>> +
>> +TRACE_EVENT(nmi_noise,
>> +
>> +	TP_PROTO(u64 start, u64 duration),
>> +
>> +	TP_ARGS(start, duration),
>> +
>> +	TP_STRUCT__entry(
>> +		__field(	u64,		start	)
>> +		__field(	u64,		duration)
>> +	),
>> +
>> +	TP_fast_assign(
>> +		__entry->start = start;
>> +		__entry->duration = duration;
>> +	),
>> +
>> +	TP_printk("start %llu.%09u duration %llu ns",
>> +		__print_ns_to_secs(__entry->start),
>> +		__print_ns_without_secs(__entry->start),
>> +		__entry->duration)
>> +);
>> +
>> +TRACE_EVENT(sample_threshold,
>> +
>> +	TP_PROTO(u64 start, u64 duration, u64 interference),
>> +
>> +	TP_ARGS(start, duration, interference),
>> +
>> +	TP_STRUCT__entry(
>> +		__field(	u64,		start	)
>> +		__field(	u64,		duration)
>> +		__field(	u64,		interference)
>> +	),
>> +
>> +	TP_fast_assign(
>> +		__entry->start = start;
>> +		__entry->duration = duration;
>> +		__entry->interference = interference;
>> +	),
>> +
>> +	TP_printk("start %llu.%09u duration %llu us interferences %llu",
>> +		__print_ns_to_secs(__entry->start),
>> +		__print_ns_without_secs(__entry->start),
>> +		__entry->duration,
>> +		__entry->interference)
>> +);
>> +
>> +#endif /* _OSNOISE_TRACE_H */
>> +
> 
> 
> [..]
> 
>> +static void osnoise_tracer_start(struct trace_array *tr)
>> +{
>> +	int retval;
>> +
>> +	/* Only allow one instance to enable this */
>> +	if (osnoise_busy)
>> +		return;
> 
> I found that I couldn't start this with:
> 
> 	trace-cmd start -B foo -p osnoise

ok, I will debug that.

>> +
>> +	/*
>> +	 * Trace is already hooked, we are re-enabling from
>> +	 * a stop_tracing_*.
>> +	 */
>> +	if (trace_osnoise_callback_enabled)
>> +		return;
>> +
>> +	osn_var_reset_all();
>> +
>> +	retval = hook_irq_events();
>> +	if (retval)
>> +		goto err;
>> +
>> +	retval = hook_softirq_events();
>> +	if (retval)
>> +		goto out_unhook_irq;
>> +
>> +	retval = hook_thread_events();
>> +
>> +	if (retval)
>> +		goto out_unrook_softirq;
>> +
>> +	/*
>> +	 * Make sure NMIs see reset values.
>> +	 */
>> +	barrier();
>> +	trace_osnoise_callback_enabled = true;
>> +
>> +	retval = start_per_cpu_kthreads(tr);
>> +	/*
>> +	 * all fine!
>> +	 */
>> +	if (!retval)
>> +		return;
>> +
>> +	unhook_thread_events();
>> +out_unrook_softirq:
>> +	unhook_softirq_events();
>> +out_unhook_irq:
>> +	unhook_irq_events();
>> +err:
>> +	pr_err(BANNER "Error starting osnoise tracer\n");
>> +}
>> +
>> +static void osnoise_tracer_stop(struct trace_array *tr)
>> +{
>> +	/* Only allow one instance to enable this */
>> +	if (!osnoise_busy)
>> +		return;
>> +
>> +	trace_osnoise_callback_enabled = false;
>> +	barrier();
>> +
>> +	stop_per_cpu_kthreads();
>> +
>> +	unhook_irq_events();
>> +	unhook_softirq_events();
>> +	unhook_thread_events();
>> +}
>> +
>> +static int osnoise_tracer_init(struct trace_array *tr)
>> +{
>> +	/* Only allow one instance to enable this */
>> +	if (osnoise_busy)
>> +		return -EBUSY;
>> +
>> +	osnoise_trace = tr;
>> +
>> +	tr->max_latency = 0;
>> +
>> +	if (tracer_tracing_is_on(tr))
>> +		osnoise_tracer_start(tr);
> 
> That's because trace-cmd will disable tracing when it enables a tracer. And
> the above osnoise_tracer_start() is not called.
> 
>> +
>> +	osnoise_busy = true;
> 
> Once this is set, when we enable tracing, the start won't start.

Ok, I will try to understand this better.
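
To help reason about it, here is a small user-space model of the flags involved. It is a sketch, not the kernel code: the three booleans stand in for ``tracer_tracing_is_on()``, ``osnoise_busy`` and ``trace_osnoise_callback_enabled``, and it assumes that turning tracing on later ends up calling the tracer's ``.start`` callback. All ``model_*`` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the state used by the tracer callbacks. */
struct osnoise_model {
	bool tracing_on;	/* models tracer_tracing_is_on(tr) */
	bool busy;		/* models osnoise_busy */
	bool hooked;		/* models trace_osnoise_callback_enabled */
};

/* Mirrors the early returns at the top of osnoise_tracer_start(). */
static void model_start(struct osnoise_model *m)
{
	assert(m != NULL);

	if (m->busy)
		return;
	if (m->hooked)
		return;

	m->hooked = true;
}

/* Mirrors osnoise_tracer_init(): start only if tracing is already on. */
static int model_init(struct osnoise_model *m)
{
	assert(m != NULL);

	if (m->busy)
		return -1;	/* -EBUSY in the patch */

	if (m->tracing_on)
		model_start(m);

	m->busy = true;
	return 0;
}

/* Assumption: enabling tracing later invokes the .start callback. */
static void model_tracing_on(struct osnoise_model *m)
{
	assert(m != NULL);

	m->tracing_on = true;
	model_start(m);
}
```

If ``model_init()`` runs while tracing is off, ``busy`` is set without hooking anything, and the later ``model_tracing_on()`` returns early on the ``busy`` check, which matches the behavior reported with trace-cmd.
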

Thanks
-- Daniel

> -- Steve
> 
> 
>> +
>> +
>> +	return 0;
>> +}
>> +
>> +static void osnoise_tracer_reset(struct trace_array *tr)
>> +{
>> +	osnoise_tracer_stop(tr);
>> +
>> +	osnoise_busy = false;
>> +}
>> +
>> +static struct tracer osnoise_tracer __read_mostly = {
>> +	.name		= "osnoise",
>> +	.init		= osnoise_tracer_init,
>> +	.reset		= osnoise_tracer_reset,
>> +	.start		= osnoise_tracer_start,
>> +	.stop		= osnoise_tracer_stop,
>> +	.print_header	= print_osnoise_headers,
>> +	.allow_instances = true,
>> +};
>> +
>> +__init static int init_osnoise_tracer(void)
>> +{
>> +	int ret;
>> +
>> +	mutex_init(&osnoise_data.lock);
>> +
>> +	ret = register_tracer(&osnoise_tracer);
>> +	if (ret)
>> +		return ret;
>> +
>> +	cpumask_copy(&osnoise_cpumask, cpu_all_mask);
>> +
>> +	init_tracefs();
>> +
>> +	return 0;
>> +}
>> +late_initcall(init_osnoise_tracer);
>> diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
>> index 61255bad7e01..edeb127fcdea 100644
>> --- a/kernel/trace/trace_output.c
>> +++ b/kernel/trace/trace_output.c
>> @@ -1189,7 +1189,6 @@ trace_hwlat_print(struct trace_iterator *iter, int flags,
>>  	return trace_handle_return(s);
>>  }
>>  
>> -
>>  static enum print_line_t
>>  trace_hwlat_raw(struct trace_iterator *iter, int flags,
>>  		struct trace_event *event)
>> @@ -1219,6 +1218,76 @@ static struct trace_event trace_hwlat_event = {
>>  	.funcs		= &trace_hwlat_funcs,
>>  };
>>  
>> +/* TRACE_OSNOISE */
>> +static enum print_line_t
>> +trace_osnoise_print(struct trace_iterator *iter, int flags,
>> +		    struct trace_event *event)
>> +{
>> +	struct trace_entry *entry = iter->ent;
>> +	struct trace_seq *s = &iter->seq;
>> +	struct osnoise_entry *field;
>> +	u64 ratio, ratio_dec;
>> +	u64 net_runtime;
>> +
>> +	trace_assign_type(field, entry);
>> +
>> +	/*
>> +	 * compute the available % of cpu time.
>> +	 */
>> +	net_runtime = field->runtime - field->noise;
>> +	ratio = net_runtime * 10000000;
>> +	do_div(ratio, field->runtime);
>> +	ratio_dec = do_div(ratio, 100000);
>> +
>> +	trace_seq_printf(s, "%llu %10llu %3llu.%05llu %7llu",
>> +			 field->runtime,
>> +			 field->noise,
>> +			 ratio, ratio_dec,
>> +			 field->max_sample);
>> +
>> +	trace_seq_printf(s, " %6u", field->hw_count);
>> +	trace_seq_printf(s, " %6u", field->nmi_count);
>> +	trace_seq_printf(s, " %6u", field->irq_count);
>> +	trace_seq_printf(s, " %6u", field->softirq_count);
>> +	trace_seq_printf(s, " %6u", field->thread_count);
>> +
>> +	trace_seq_putc(s, '\n');
>> +
>> +	return trace_handle_return(s);
>> +}
>> +
>> +static enum print_line_t
>> +trace_osnoise_raw(struct trace_iterator *iter, int flags,
>> +		  struct trace_event *event)
>> +{
>> +	struct osnoise_entry *field;
>> +	struct trace_seq *s = &iter->seq;
>> +
>> +	trace_assign_type(field, iter->ent);
>> +
>> +	trace_seq_printf(s, "%lld %llu %llu %u %u %u %u %u\n",
>> +			 field->runtime,
>> +			 field->noise,
>> +			 field->max_sample,
>> +			 field->hw_count,
>> +			 field->nmi_count,
>> +			 field->irq_count,
>> +			 field->softirq_count,
>> +			 field->thread_count);
>> +
>> +	return trace_handle_return(s);
>> +}
>> +
>> +static struct trace_event_functions trace_osnoise_funcs = {
>> +	.trace		= trace_osnoise_print,
>> +	.raw		= trace_osnoise_raw,
>> +};
>> +
>> +static struct trace_event trace_osnoise_event = {
>> +	.type		= TRACE_OSNOISE,
>> +	.funcs		= &trace_osnoise_funcs,
>> +};
>> +
>>  /* TRACE_BPUTS */
>>  static enum print_line_t
>>  trace_bputs_print(struct trace_iterator *iter, int flags,
>> @@ -1384,6 +1453,7 @@ static struct trace_event *events[] __initdata = {
>>  	&trace_bprint_event,
>>  	&trace_print_event,
>>  	&trace_hwlat_event,
>> +	&trace_osnoise_event,
>>  	&trace_raw_data_event,
>>  	NULL
>>  };
> 
> 	
> 
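
As a worked example of the ``% OF CPU AVAILABLE`` arithmetic in trace_osnoise_print() quoted above, the sketch below redoes the same integer math in user space, with plain ``/`` and ``%`` in place of ``do_div()``. The function names are made up, and it assumes runtime and noise are in microseconds and small enough that the multiplication does not overflow 64 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Percentage of CPU available, scaled by 100000 so that five decimal
 * places survive the integer division (9998100 means 99.98100 %).
 */
static uint64_t cpu_available_scaled(uint64_t runtime_us, uint64_t noise_us)
{
	assert(runtime_us != 0);
	assert(noise_us <= runtime_us);

	uint64_t net_runtime = runtime_us - noise_us;

	return net_runtime * 10000000ULL / runtime_us;
}

/* Format the scaled value the same way as the "%3llu.%05llu" in the patch. */
static int format_cpu_available(char *buf, size_t len,
				uint64_t runtime_us, uint64_t noise_us)
{
	uint64_t ratio = cpu_available_scaled(runtime_us, noise_us);

	return snprintf(buf, len, "%3llu.%05llu",
			(unsigned long long)(ratio / 100000),
			(unsigned long long)(ratio % 100000));
}
```

For the first line of the sample trace in the documentation (runtime 1000000 us, noise 190 us) this yields 9998100, i.e. the 99.98100 shown in the ``% OF CPU AVAILABLE`` column.
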



* Re: [RFC PATCH 1/5] tracing/hwlat: Add a cpus file specific for hwlat_detector
  2021-04-15 13:09     ` Daniel Bristot de Oliveira
@ 2021-04-15 13:49       ` Steven Rostedt
  2021-04-15 14:33         ` Daniel Bristot de Oliveira
  0 siblings, 1 reply; 25+ messages in thread
From: Steven Rostedt @ 2021-04-15 13:49 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: linux-kernel, kcarcia, Jonathan Corbet, Ingo Molnar,
	Peter Zijlstra, Thomas Gleixner, Alexandre Chartre,
	Clark Willaims, John Kacur, Juri Lelli, linux-doc

On Thu, 15 Apr 2021 15:09:50 +0200
Daniel Bristot de Oliveira <bristot@redhat.com> wrote:

> But for the osnoise tracer the cpus file is really useful. For instance, on a 
> system with the CPU 7 isolated:
> 
> ----- %< -----
>  # echo 7 > osnoise/cpus
>  # echo target_cpu == 7 > events/sched/sched_wakeup/filter 
>  # echo stacktrace if target_cpu == 7 > events/sched/sched_wakeup/trigger
>  # echo 1 > events/sched/sched_wakeup/enable
>  # echo osnoise:thread_noise > set_event 
>  # echo osnoise > current_tracer
>  # cat trace 
>     [find...]
>      kworker/0:1-7       [000] d..5  1820.717780: <stack trace>
>  => trace_event_raw_event_sched_wakeup_template
>  => __traceiter_sched_wakeup
>  => ttwu_do_wakeup
>  => try_to_wake_up
>  => __queue_work
>  => queue_delayed_work_on
>  => vmstat_shepherd
>  => process_one_work
>  => worker_thread
>  => kthread
>  => ret_from_fork  
>      kworker/7:1-410     [007] d..3  1820.717790: thread_noise: kworker/7:1:410 start 1820.717786519 duration 3626 ns
>        osnoise/7-1000    [007] ....  1821.582340: 1000000         90  99.99100      15      1      0     12      6      1
> ----- >% -----  
> 
> It was possible to easily find that the '1' thread noise was a kworker,
> dispatched from CPU 0, and that it was dispatched by "vmstat_shepherd".
> 
> Also, the osnoise dir is not added to a new instance... so, it only
> costs "one" file...

Every file counts. ;-)

What you did not articulate well, is that you want the other trace points
to be traced on all CPUs (maybe) when the osnoise threads are on a few (or
vice versa).

OK, for osnoise, I can see how it is useful. But as you said above, for
hwlat tracer, it's not as useful.

-- Steve


* Re: [RFC PATCH 2/5] tracing/hwlat: Implement the mode config option
  2021-04-15 13:16     ` Daniel Bristot de Oliveira
@ 2021-04-15 13:50       ` Steven Rostedt
  0 siblings, 0 replies; 25+ messages in thread
From: Steven Rostedt @ 2021-04-15 13:50 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: linux-kernel, kcarcia, Jonathan Corbet, Ingo Molnar,
	Peter Zijlstra, Thomas Gleixner, Alexandre Chartre,
	Clark Willaims, John Kacur, Juri Lelli, linux-doc

On Thu, 15 Apr 2021 15:16:04 +0200
Daniel Bristot de Oliveira <bristot@redhat.com> wrote:

> That was my initial intention with the NONE mode, but I feared breaking
> something by removing the "migrate_disable" logic. If you do not think it is
> a problem, I will remove the migrate disable and just change the mode.

Yes, just change it.

Thanks,

-- Steve


* Re: [RFC PATCH 1/5] tracing/hwlat: Add a cpus file specific for hwlat_detector
  2021-04-15 13:49       ` Steven Rostedt
@ 2021-04-15 14:33         ` Daniel Bristot de Oliveira
  0 siblings, 0 replies; 25+ messages in thread
From: Daniel Bristot de Oliveira @ 2021-04-15 14:33 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, kcarcia, Jonathan Corbet, Ingo Molnar,
	Peter Zijlstra, Thomas Gleixner, Alexandre Chartre,
	Clark Willaims, John Kacur, Juri Lelli, linux-doc

On 4/15/21 3:49 PM, Steven Rostedt wrote:
> OK, for osnoise, I can see how it is useful. But as you said above, for
> hwlat tracer, it's not as useful.

I agree, it is not as useful.

-- Daniel



* Re: [RFC PATCH 3/5] tracing/hwlat: Implement the per-cpu mode
  2021-04-15 13:22     ` Daniel Bristot de Oliveira
@ 2021-04-15 15:22       ` Steven Rostedt
  0 siblings, 0 replies; 25+ messages in thread
From: Steven Rostedt @ 2021-04-15 15:22 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: linux-kernel, kcarcia, Jonathan Corbet, Ingo Molnar,
	Peter Zijlstra, Thomas Gleixner, Alexandre Chartre,
	Clark Willaims, John Kacur, Juri Lelli, linux-doc

On Thu, 15 Apr 2021 15:22:52 +0200
Daniel Bristot de Oliveira <bristot@redhat.com> wrote:

> >> --- a/kernel/trace/trace_hwlat.c
> >> +++ b/kernel/trace/trace_hwlat.c
> >> @@ -34,7 +34,7 @@
> >>   * Copyright (C) 2008-2009 Jon Masters, Red Hat, Inc. <jcm@redhat.com>
> >>   * Copyright (C) 2013-2016 Steven Rostedt, Red Hat, Inc. <srostedt@redhat.com>
> >>   *
> >> - * Includes useful feedback from Clark Williams <clark@redhat.com>
> >> + * Includes useful feedback from Clark Williams <williams@redhat.com>  
> > 
> > Interesting update ;-)  
> 
> Should I make it a separated patch? :-)

Yeah, probably.

-- Steve


end of thread, other threads:[~2021-04-15 15:23 UTC | newest]

Thread overview: 25+ messages
2021-04-08 14:13 [RFC PATCH 0/5] hwlat improvements and osnoise tracer Daniel Bristot de Oliveira
2021-04-08 14:13 ` [RFC PATCH 1/5] tracing/hwlat: Add a cpus file specific for hwlat_detector Daniel Bristot de Oliveira
2021-04-14 14:10   ` Steven Rostedt
2021-04-15 13:09     ` Daniel Bristot de Oliveira
2021-04-15 13:49       ` Steven Rostedt
2021-04-15 14:33         ` Daniel Bristot de Oliveira
2021-04-08 14:13 ` [RFC PATCH 2/5] tracing/hwlat: Implement the mode config option Daniel Bristot de Oliveira
2021-04-08 20:52   ` kernel test robot
2021-04-14 14:30   ` Steven Rostedt
2021-04-15 13:16     ` Daniel Bristot de Oliveira
2021-04-15 13:50       ` Steven Rostedt
2021-04-08 14:13 ` [RFC PATCH 3/5] tracing/hwlat: Implement the per-cpu mode Daniel Bristot de Oliveira
2021-04-08 19:39   ` kernel test robot
2021-04-08 21:39   ` kernel test robot
2021-04-08 23:54   ` kernel test robot
2021-04-14 14:41   ` Steven Rostedt
2021-04-15 13:22     ` Daniel Bristot de Oliveira
2021-04-15 15:22       ` Steven Rostedt
2021-04-08 14:13 ` [RFC PATCH 4/5] tracing: Add __print_ns_to_secs() and __print_ns_without_secs() helpers Daniel Bristot de Oliveira
2021-04-08 14:13 ` [RFC PATCH 5/5] tracing: Add the osnoise tracer Daniel Bristot de Oliveira
2021-04-08 15:58   ` Jonathan Corbet
2021-04-09  7:19     ` Daniel Bristot de Oliveira
2021-04-08 23:57   ` kernel test robot
2021-04-14 17:14   ` Steven Rostedt
2021-04-15 13:43     ` Daniel Bristot de Oliveira
