linux-kernel.vger.kernel.org archive mirror
* [for-next][PATCH 00/11] tracing: Updates for 6.2
@ 2022-11-24 14:50 Steven Rostedt
  2022-11-24 14:50 ` [for-next][PATCH 01/11] ftrace: Clean comments related to FTRACE_OPS_FL_PER_CPU Steven Rostedt
                   ` (9 more replies)
  0 siblings, 10 replies; 24+ messages in thread
From: Steven Rostedt @ 2022-11-24 14:50 UTC (permalink / raw)
  To: linux-kernel; +Cc: Masami Hiramatsu, Andrew Morton

  git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git
trace/for-next

Head SHA1: bd604f3db49c5b21171abea0414a2020dcbf2646


Chuang Wang (1):
      tracing/perf: Use strndup_user instead of kzalloc/strncpy_from_user

Daniel Bristot de Oliveira (3):
      tracing/osnoise: Add osnoise/options file
      tracing/osnoise: Add OSNOISE_WORKLOAD option
      Documentation/osnoise: Add osnoise/options documentation

Song Chen (1):
      ring_buffer: Remove unused "event" parameter

Steven Rostedt (Google) (3):
      tracing: Add __cpumask to denote a trace event field that is a cpumask_t
      tracing: Add trace_trigger kernel command line option
      ftrace: Avoid needless updates of the ftrace function call

Xiu Jianfeng (1):
      tracing: Make tracepoint_print_iter static

Zheng Yejian (2):
      ftrace: Clean comments related to FTRACE_OPS_FL_PER_CPU
      tracing: Optimize event type allocation with IDA

----
 Documentation/admin-guide/kernel-parameters.txt |  19 +++
 Documentation/trace/osnoise-tracer.rst          |  12 ++
 include/linux/ring_buffer.h                     |   3 +-
 include/linux/trace_events.h                    |   1 -
 include/trace/bpf_probe.h                       |   6 +
 include/trace/perf.h                            |   6 +
 include/trace/stages/stage1_struct_define.h     |   6 +
 include/trace/stages/stage2_data_offsets.h      |   6 +
 include/trace/stages/stage3_trace_output.h      |   6 +
 include/trace/stages/stage4_event_fields.h      |   6 +
 include/trace/stages/stage5_get_offsets.h       |   6 +
 include/trace/stages/stage6_event_callback.h    |  20 +++
 include/trace/stages/stage7_class_define.h      |   2 +
 kernel/trace/ftrace.c                           |  27 ++--
 kernel/trace/ring_buffer.c                      |  12 +-
 kernel/trace/ring_buffer_benchmark.c            |   2 +-
 kernel/trace/trace.c                            |   4 +-
 kernel/trace/trace.h                            |   2 -
 kernel/trace/trace_event_perf.c                 |  16 +-
 kernel/trace/trace_events.c                     |  72 ++++++++-
 kernel/trace/trace_osnoise.c                    | 194 +++++++++++++++++++++++-
 kernel/trace/trace_output.c                     |  66 ++------
 samples/trace_events/trace-events-sample.c      |   2 +-
 samples/trace_events/trace-events-sample.h      |  34 ++++-
 24 files changed, 431 insertions(+), 99 deletions(-)


* [for-next][PATCH 01/11] ftrace: Clean comments related to FTRACE_OPS_FL_PER_CPU
  2022-11-24 14:50 [for-next][PATCH 00/11] tracing: Updates for 6.2 Steven Rostedt
@ 2022-11-24 14:50 ` Steven Rostedt
  2022-11-24 14:50 ` [for-next][PATCH 02/11] tracing: Add __cpumask to denote a trace event field that is a cpumask_t Steven Rostedt
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 24+ messages in thread
From: Steven Rostedt @ 2022-11-24 14:50 UTC (permalink / raw)
  To: linux-kernel; +Cc: Masami Hiramatsu, Andrew Morton, Zheng Yejian

From: Zheng Yejian <zhengyejian1@huawei.com>

Commit b3a88803ac5b ("ftrace: Kill FTRACE_OPS_FL_PER_CPU") didn't
completely remove the comments related to FTRACE_OPS_FL_PER_CPU.

Link: https://lkml.kernel.org/r/20221025153923.1995973-1-zhengyejian1@huawei.com

Fixes: b3a88803ac5b ("ftrace: Kill FTRACE_OPS_FL_PER_CPU")
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/trace/ftrace.c | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 33236241f236..65a5d36463e0 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -163,7 +163,7 @@ static void ftrace_sync_ipi(void *data)
 static ftrace_func_t ftrace_ops_get_list_func(struct ftrace_ops *ops)
 {
 	/*
-	 * If this is a dynamic, RCU, or per CPU ops, or we force list func,
+	 * If this is a dynamic or RCU ops, or we force list func,
 	 * then it needs to call the list anyway.
 	 */
 	if (ops->flags & (FTRACE_OPS_FL_DYNAMIC | FTRACE_OPS_FL_RCU) ||
@@ -3071,8 +3071,6 @@ int ftrace_shutdown(struct ftrace_ops *ops, int command)
 	/*
 	 * Dynamic ops may be freed, we must make sure that all
 	 * callers are done before leaving this function.
-	 * The same goes for freeing the per_cpu data of the per_cpu
-	 * ops.
 	 */
 	if (ops->flags & FTRACE_OPS_FL_DYNAMIC) {
 		/*
@@ -7519,8 +7517,6 @@ __ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip,
 		/*
 		 * Check the following for each ops before calling their func:
 		 *  if RCU flag is set, then rcu_is_watching() must be true
-		 *  if PER_CPU is set, then ftrace_function_local_disable()
-		 *                          must be false
 		 *  Otherwise test if the ip matches the ops filter
 		 *
 		 * If any of the above fails then the op->func() is not executed.
@@ -7570,8 +7566,8 @@ NOKPROBE_SYMBOL(arch_ftrace_ops_list_func);
 
 /*
  * If there's only one function registered but it does not support
- * recursion, needs RCU protection and/or requires per cpu handling, then
- * this function will be called by the mcount trampoline.
+ * recursion, needs RCU protection, then this function will be called
+ * by the mcount trampoline.
  */
 static void ftrace_ops_assist_func(unsigned long ip, unsigned long parent_ip,
 				   struct ftrace_ops *op, struct ftrace_regs *fregs)
-- 
2.35.1




* [for-next][PATCH 02/11] tracing: Add __cpumask to denote a trace event field that is a cpumask_t
  2022-11-24 14:50 [for-next][PATCH 00/11] tracing: Updates for 6.2 Steven Rostedt
  2022-11-24 14:50 ` [for-next][PATCH 01/11] ftrace: Clean comments related to FTRACE_OPS_FL_PER_CPU Steven Rostedt
@ 2022-11-24 14:50 ` Steven Rostedt
  2022-12-12 14:53   ` Douglas Raillard
  2022-11-24 14:50 ` [for-next][PATCH 03/11] tracing: Add trace_trigger kernel command line option Steven Rostedt
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 24+ messages in thread
From: Steven Rostedt @ 2022-11-24 14:50 UTC (permalink / raw)
  To: linux-kernel; +Cc: Masami Hiramatsu, Andrew Morton, Valentin Schneider

From: "Steven Rostedt (Google)" <rostedt@goodmis.org>

The trace events have a __bitmask field that can be used for anything
that requires bitmasks. Although currently it is only used for CPU
masks, it could be used in the future for any type of bitmasks.

There is some user space tooling that wants to know if a field is a CPU
mask and not just some random unsigned long bitmask. Introduce
"__cpumask()" helper functions that work the same as the current
__bitmask() helpers but display in the format file:

  field:__data_loc cpumask_t *[] mask;    offset:36;      size:4; signed:0;

Instead of:

  field:__data_loc unsigned long[] mask;  offset:32;      size:4; signed:0;

The main difference is the type. Instead of "unsigned long" it is
"cpumask_t *". Note, this type field needs to be a real type in the
__dynamic_array() logic that both __cpumask and __bitmask use, but the
comparison field requires it to be a scalar type whereas cpumask_t is a
structure (non-scalar). But everything works when making it a pointer.

Valentin added changes to remove the need of passing in "nr_bits" and the
__cpumask will always use nr_cpumask_bits as its size.

Link: https://lkml.kernel.org/r/20221014080456.1d32b989@rorschach.local.home

Requested-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 include/trace/bpf_probe.h                    |  6 ++++
 include/trace/perf.h                         |  6 ++++
 include/trace/stages/stage1_struct_define.h  |  6 ++++
 include/trace/stages/stage2_data_offsets.h   |  6 ++++
 include/trace/stages/stage3_trace_output.h   |  6 ++++
 include/trace/stages/stage4_event_fields.h   |  6 ++++
 include/trace/stages/stage5_get_offsets.h    |  6 ++++
 include/trace/stages/stage6_event_callback.h | 20 ++++++++++++
 include/trace/stages/stage7_class_define.h   |  2 ++
 samples/trace_events/trace-events-sample.c   |  2 +-
 samples/trace_events/trace-events-sample.h   | 34 +++++++++++++++-----
 11 files changed, 91 insertions(+), 9 deletions(-)

diff --git a/include/trace/bpf_probe.h b/include/trace/bpf_probe.h
index 6a13220d2d27..155c495b89ea 100644
--- a/include/trace/bpf_probe.h
+++ b/include/trace/bpf_probe.h
@@ -21,6 +21,9 @@
 #undef __get_bitmask
 #define __get_bitmask(field) (char *)__get_dynamic_array(field)
 
+#undef __get_cpumask
+#define __get_cpumask(field) (char *)__get_dynamic_array(field)
+
 #undef __get_sockaddr
 #define __get_sockaddr(field) ((struct sockaddr *)__get_dynamic_array(field))
 
@@ -40,6 +43,9 @@
 #undef __get_rel_bitmask
 #define __get_rel_bitmask(field) (char *)__get_rel_dynamic_array(field)
 
+#undef __get_rel_cpumask
+#define __get_rel_cpumask(field) (char *)__get_rel_dynamic_array(field)
+
 #undef __get_rel_sockaddr
 #define __get_rel_sockaddr(field) ((struct sockaddr *)__get_rel_dynamic_array(field))
 
diff --git a/include/trace/perf.h b/include/trace/perf.h
index 5800d13146c3..8f3bf1e17707 100644
--- a/include/trace/perf.h
+++ b/include/trace/perf.h
@@ -21,6 +21,9 @@
 #undef __get_bitmask
 #define __get_bitmask(field) (char *)__get_dynamic_array(field)
 
+#undef __get_cpumask
+#define __get_cpumask(field) (char *)__get_dynamic_array(field)
+
 #undef __get_sockaddr
 #define __get_sockaddr(field) ((struct sockaddr *)__get_dynamic_array(field))
 
@@ -41,6 +44,9 @@
 #undef __get_rel_bitmask
 #define __get_rel_bitmask(field) (char *)__get_rel_dynamic_array(field)
 
+#undef __get_rel_cpumask
+#define __get_rel_cpumask(field) (char *)__get_rel_dynamic_array(field)
+
 #undef __get_rel_sockaddr
 #define __get_rel_sockaddr(field) ((struct sockaddr *)__get_rel_dynamic_array(field))
 
diff --git a/include/trace/stages/stage1_struct_define.h b/include/trace/stages/stage1_struct_define.h
index 1b7bab60434c..69e0dae453bf 100644
--- a/include/trace/stages/stage1_struct_define.h
+++ b/include/trace/stages/stage1_struct_define.h
@@ -32,6 +32,9 @@
 #undef __bitmask
 #define __bitmask(item, nr_bits) __dynamic_array(char, item, -1)
 
+#undef __cpumask
+#define __cpumask(item) __dynamic_array(char, item, -1)
+
 #undef __sockaddr
 #define __sockaddr(field, len) __dynamic_array(u8, field, len)
 
@@ -47,6 +50,9 @@
 #undef __rel_bitmask
 #define __rel_bitmask(item, nr_bits) __rel_dynamic_array(char, item, -1)
 
+#undef __rel_cpumask
+#define __rel_cpumask(item) __rel_dynamic_array(char, item, -1)
+
 #undef __rel_sockaddr
 #define __rel_sockaddr(field, len) __rel_dynamic_array(u8, field, len)
 
diff --git a/include/trace/stages/stage2_data_offsets.h b/include/trace/stages/stage2_data_offsets.h
index 1b7a8f764fdd..469b6a64293d 100644
--- a/include/trace/stages/stage2_data_offsets.h
+++ b/include/trace/stages/stage2_data_offsets.h
@@ -38,6 +38,9 @@
 #undef __bitmask
 #define __bitmask(item, nr_bits) __dynamic_array(unsigned long, item, -1)
 
+#undef __cpumask
+#define __cpumask(item) __dynamic_array(unsigned long, item, -1)
+
 #undef __sockaddr
 #define __sockaddr(field, len) __dynamic_array(u8, field, len)
 
@@ -53,5 +56,8 @@
 #undef __rel_bitmask
 #define __rel_bitmask(item, nr_bits) __rel_dynamic_array(unsigned long, item, -1)
 
+#undef __rel_cpumask
+#define __rel_cpumask(item) __rel_dynamic_array(unsigned long, item, -1)
+
 #undef __rel_sockaddr
 #define __rel_sockaddr(field, len) __rel_dynamic_array(u8, field, len)
diff --git a/include/trace/stages/stage3_trace_output.h b/include/trace/stages/stage3_trace_output.h
index e3b183e9d18e..66374df61ed3 100644
--- a/include/trace/stages/stage3_trace_output.h
+++ b/include/trace/stages/stage3_trace_output.h
@@ -42,6 +42,9 @@
 		trace_print_bitmask_seq(p, __bitmask, __bitmask_size);	\
 	})
 
+#undef __get_cpumask
+#define __get_cpumask(field) __get_bitmask(field)
+
 #undef __get_rel_bitmask
 #define __get_rel_bitmask(field)						\
 	({								\
@@ -51,6 +54,9 @@
 		trace_print_bitmask_seq(p, __bitmask, __bitmask_size);	\
 	})
 
+#undef __get_rel_cpumask
+#define __get_rel_cpumask(field) __get_rel_bitmask(field)
+
 #undef __get_sockaddr
 #define __get_sockaddr(field)	((struct sockaddr *)__get_dynamic_array(field))
 
diff --git a/include/trace/stages/stage4_event_fields.h b/include/trace/stages/stage4_event_fields.h
index a8fb25f39a99..f2990d22313c 100644
--- a/include/trace/stages/stage4_event_fields.h
+++ b/include/trace/stages/stage4_event_fields.h
@@ -46,6 +46,9 @@
 #undef __bitmask
 #define __bitmask(item, nr_bits) __dynamic_array(unsigned long, item, -1)
 
+#undef __cpumask
+#define __cpumask(item) __dynamic_array(cpumask_t *, item, -1)
+
 #undef __sockaddr
 #define __sockaddr(field, len) __dynamic_array(u8, field, len)
 
@@ -64,5 +67,8 @@
 #undef __rel_bitmask
 #define __rel_bitmask(item, nr_bits) __rel_dynamic_array(unsigned long, item, -1)
 
+#undef __rel_cpumask
+#define __rel_cpumask(item) __rel_dynamic_array(cpumask_t *, item, -1)
+
 #undef __rel_sockaddr
 #define __rel_sockaddr(field, len) __rel_dynamic_array(u8, field, len)
diff --git a/include/trace/stages/stage5_get_offsets.h b/include/trace/stages/stage5_get_offsets.h
index fba4c24ed9e6..ac5c24d3beeb 100644
--- a/include/trace/stages/stage5_get_offsets.h
+++ b/include/trace/stages/stage5_get_offsets.h
@@ -82,10 +82,16 @@
 #define __bitmask(item, nr_bits) __dynamic_array(unsigned long, item,	\
 					 __bitmask_size_in_longs(nr_bits))
 
+#undef __cpumask
+#define __cpumask(item) __bitmask(item, nr_cpumask_bits)
+
 #undef __rel_bitmask
 #define __rel_bitmask(item, nr_bits) __rel_dynamic_array(unsigned long, item,	\
 					 __bitmask_size_in_longs(nr_bits))
 
+#undef __rel_cpumask
+#define __rel_cpumask(item) __rel_bitmask(item, nr_cpumask_bits)
+
 #undef __sockaddr
 #define __sockaddr(field, len) __dynamic_array(u8, field, len)
 
diff --git a/include/trace/stages/stage6_event_callback.h b/include/trace/stages/stage6_event_callback.h
index 3c554a585320..49c32394b53f 100644
--- a/include/trace/stages/stage6_event_callback.h
+++ b/include/trace/stages/stage6_event_callback.h
@@ -57,6 +57,16 @@
 #define __assign_bitmask(dst, src, nr_bits)					\
 	memcpy(__get_bitmask(dst), (src), __bitmask_size_in_bytes(nr_bits))
 
+#undef __cpumask
+#define __cpumask(item) __dynamic_array(unsigned long, item, -1)
+
+#undef __get_cpumask
+#define __get_cpumask(field) (char *)__get_dynamic_array(field)
+
+#undef __assign_cpumask
+#define __assign_cpumask(dst, src)					\
+	memcpy(__get_cpumask(dst), (src), __bitmask_size_in_bytes(nr_cpumask_bits))
+
 #undef __sockaddr
 #define __sockaddr(field, len) __dynamic_array(u8, field, len)
 
@@ -98,6 +108,16 @@
 #define __assign_rel_bitmask(dst, src, nr_bits)					\
 	memcpy(__get_rel_bitmask(dst), (src), __bitmask_size_in_bytes(nr_bits))
 
+#undef __rel_cpumask
+#define __rel_cpumask(item) __rel_dynamic_array(unsigned long, item, -1)
+
+#undef __get_rel_cpumask
+#define __get_rel_cpumask(field) (char *)__get_rel_dynamic_array(field)
+
+#undef __assign_rel_cpumask
+#define __assign_rel_cpumask(dst, src)					\
+	memcpy(__get_rel_cpumask(dst), (src), __bitmask_size_in_bytes(nr_cpumask_bits))
+
 #undef __rel_sockaddr
 #define __rel_sockaddr(field, len) __rel_dynamic_array(u8, field, len)
 
diff --git a/include/trace/stages/stage7_class_define.h b/include/trace/stages/stage7_class_define.h
index 8a7ec24c246d..8795429f388b 100644
--- a/include/trace/stages/stage7_class_define.h
+++ b/include/trace/stages/stage7_class_define.h
@@ -13,11 +13,13 @@
 #undef __get_dynamic_array_len
 #undef __get_str
 #undef __get_bitmask
+#undef __get_cpumask
 #undef __get_sockaddr
 #undef __get_rel_dynamic_array
 #undef __get_rel_dynamic_array_len
 #undef __get_rel_str
 #undef __get_rel_bitmask
+#undef __get_rel_cpumask
 #undef __get_rel_sockaddr
 #undef __print_array
 #undef __print_hex_dump
diff --git a/samples/trace_events/trace-events-sample.c b/samples/trace_events/trace-events-sample.c
index 608c4ae3b08a..ecc7db237f2e 100644
--- a/samples/trace_events/trace-events-sample.c
+++ b/samples/trace_events/trace-events-sample.c
@@ -50,7 +50,7 @@ static void do_simple_thread_func(int cnt, const char *fmt, ...)
 
 	trace_foo_with_template_print("I have to be different", cnt);
 
-	trace_foo_rel_loc("Hello __rel_loc", cnt, bitmask);
+	trace_foo_rel_loc("Hello __rel_loc", cnt, bitmask, current->cpus_ptr);
 }
 
 static void simple_thread_func(int cnt)
diff --git a/samples/trace_events/trace-events-sample.h b/samples/trace_events/trace-events-sample.h
index 1a92226202fc..fb4548a44153 100644
--- a/samples/trace_events/trace-events-sample.h
+++ b/samples/trace_events/trace-events-sample.h
@@ -200,6 +200,16 @@
  *
  *         __assign_bitmask(target_cpus, cpumask_bits(bar), nr_cpumask_bits);
  *
+ *   __cpumask: This is pretty much the same as __bitmask but is specific for
+ *         CPU masks. The type displayed to the user via the format files will
+ *         be "cpumask_t" such that user space may deal with them differently
+ *         if they choose to do so, and the number of bits is always nr_cpumask_bits.
+ *
+ *         __cpumask(target_cpu)
+ *
+ *         To assign a cpumask, use the __assign_cpumask() helper macro.
+ *
+ *         __assign_cpumask(target_cpus, cpumask_bits(bar));
  *
  * fast_assign: This is a C like function that is used to store the items
  *    into the ring buffer. A special variable called "__entry" will be the
@@ -212,8 +222,8 @@
  *    This is also used to print out the data from the trace files.
  *    Again, the __entry macro is used to access the data from the ring buffer.
  *
- *    Note, __dynamic_array, __string, and __bitmask require special helpers
- *       to access the data.
+ *    Note, __dynamic_array, __string, __bitmask and __cpumask require special
+ *       helpers to access the data.
  *
  *      For __dynamic_array(int, foo, bar) use __get_dynamic_array(foo)
  *            Use __get_dynamic_array_len(foo) to get the length of the array
@@ -226,6 +236,8 @@
  *
  *      For __bitmask(target_cpus, nr_cpumask_bits) use __get_bitmask(target_cpus)
  *
+ *      For __cpumask(target_cpus) use __get_cpumask(target_cpus)
+ *
  *
  * Note, that for both the assign and the printk, __entry is the handler
  * to the data structure in the ring buffer, and is defined by the
@@ -288,6 +300,7 @@ TRACE_EVENT(foo_bar,
 		__dynamic_array(int,	list,   __length_of(lst))
 		__string(	str,	string			)
 		__bitmask(	cpus,	num_possible_cpus()	)
+		__cpumask(	cpum				)
 		__vstring(	vstr,	fmt,	va		)
 	),
 
@@ -299,9 +312,10 @@ TRACE_EVENT(foo_bar,
 		__assign_str(str, string);
 		__assign_vstr(vstr, fmt, va);
 		__assign_bitmask(cpus, cpumask_bits(mask), num_possible_cpus());
+		__assign_cpumask(cpum, cpumask_bits(mask));
 	),
 
-	TP_printk("foo %s %d %s %s %s %s (%s) %s", __entry->foo, __entry->bar,
+	TP_printk("foo %s %d %s %s %s %s (%s) (%s) %s", __entry->foo, __entry->bar,
 
 /*
  * Notice here the use of some helper functions. This includes:
@@ -345,7 +359,8 @@ TRACE_EVENT(foo_bar,
 		  __print_array(__get_dynamic_array(list),
 				__get_dynamic_array_len(list) / sizeof(int),
 				sizeof(int)),
-		  __get_str(str), __get_bitmask(cpus), __get_str(vstr))
+		  __get_str(str), __get_bitmask(cpus), __get_cpumask(cpum),
+		  __get_str(vstr))
 );
 
 /*
@@ -542,15 +557,16 @@ DEFINE_EVENT_PRINT(foo_template, foo_with_template_print,
 
 TRACE_EVENT(foo_rel_loc,
 
-	TP_PROTO(const char *foo, int bar, unsigned long *mask),
+	TP_PROTO(const char *foo, int bar, unsigned long *mask, const cpumask_t *cpus),
 
-	TP_ARGS(foo, bar, mask),
+	TP_ARGS(foo, bar, mask, cpus),
 
 	TP_STRUCT__entry(
 		__rel_string(	foo,	foo	)
 		__field(	int,	bar	)
 		__rel_bitmask(	bitmask,
 			BITS_PER_BYTE * sizeof(unsigned long)	)
+		__rel_cpumask(	cpumask )
 	),
 
 	TP_fast_assign(
@@ -558,10 +574,12 @@ TRACE_EVENT(foo_rel_loc,
 		__entry->bar = bar;
 		__assign_rel_bitmask(bitmask, mask,
 			BITS_PER_BYTE * sizeof(unsigned long));
+		__assign_rel_cpumask(cpumask, cpus);
 	),
 
-	TP_printk("foo_rel_loc %s, %d, %s", __get_rel_str(foo), __entry->bar,
-		  __get_rel_bitmask(bitmask))
+	TP_printk("foo_rel_loc %s, %d, %s, %s", __get_rel_str(foo), __entry->bar,
+		  __get_rel_bitmask(bitmask),
+		  __get_rel_cpumask(cpumask))
 );
 #endif
 
-- 
2.35.1




* [for-next][PATCH 03/11] tracing: Add trace_trigger kernel command line option
  2022-11-24 14:50 [for-next][PATCH 00/11] tracing: Updates for 6.2 Steven Rostedt
  2022-11-24 14:50 ` [for-next][PATCH 01/11] ftrace: Clean comments related to FTRACE_OPS_FL_PER_CPU Steven Rostedt
  2022-11-24 14:50 ` [for-next][PATCH 02/11] tracing: Add __cpumask to denote a trace event field that is a cpumask_t Steven Rostedt
@ 2022-11-24 14:50 ` Steven Rostedt
  2022-11-24 14:50 ` [for-next][PATCH 04/11] ring_buffer: Remove unused "event" parameter Steven Rostedt
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 24+ messages in thread
From: Steven Rostedt @ 2022-11-24 14:50 UTC (permalink / raw)
  To: linux-kernel; +Cc: Masami Hiramatsu, Andrew Morton

From: "Steven Rostedt (Google)" <rostedt@goodmis.org>

Allow triggers to be enabled at kernel boot up. For example:

  trace_trigger="sched_switch.stacktrace if prev_state == 2"

The above will enable the stacktrace trigger on top of the sched_switch
event and only trigger if its prev_state is 2 (TASK_UNINTERRUPTIBLE). Then
at boot up, a stacktrace will trigger and be recorded in the tracing ring
buffer every time the sched_switch happens where the previous state is
TASK_UNINTERRUPTIBLE.

Another useful trigger would be "traceoff" which can stop tracing on an
event if a field of the event matches a certain value defined by the
filter ("if" statement).

Link: https://lore.kernel.org/linux-trace-kernel/20221020210056.0d8d0a5b@gandalf.local.home

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 .../admin-guide/kernel-parameters.txt         | 19 +++++
 kernel/trace/trace_events.c                   | 72 ++++++++++++++++++-
 2 files changed, 89 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index a465d5242774..ccf91a4bf113 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -6257,6 +6257,25 @@
 			See also Documentation/trace/ftrace.rst "trace options"
 			section.
 
+	trace_trigger=[trigger-list]
+			[FTRACE] Add an event trigger on specific events.
+			Set a trigger on top of a specific event, with an optional
+			filter.
+
+			The format is "trace_trigger=<event>.<trigger>[ if <filter>],..."
+			More than one trigger may be specified, comma delimited.
+
+			For example:
+
+			  trace_trigger="sched_switch.stacktrace if prev_state == 2"
+
+			The above will enable the "stacktrace" trigger on the "sched_switch"
+			event but only trigger it if the "prev_state" of the "sched_switch"
+			event is "2" (TASK_UNINTERRUPTIBLE).
+
+			See also "Event triggers" in Documentation/trace/events.rst
+
+
 	traceoff_on_warning
 			[FTRACE] enable this option to disable tracing when a
 			warning is hit. This turns off "tracing_on". Tracing can
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index f71ea6e79b3c..3bfaf560ecc4 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -2796,6 +2796,44 @@ trace_create_new_event(struct trace_event_call *call,
 	return file;
 }
 
+#ifdef CONFIG_HIST_TRIGGERS
+#define MAX_BOOT_TRIGGERS 32
+
+static struct boot_triggers {
+	const char		*event;
+	char			*trigger;
+} bootup_triggers[MAX_BOOT_TRIGGERS];
+
+static char bootup_trigger_buf[COMMAND_LINE_SIZE];
+static int nr_boot_triggers;
+
+static __init int setup_trace_triggers(char *str)
+{
+	char *trigger;
+	char *buf;
+	int i;
+
+	strlcpy(bootup_trigger_buf, str, COMMAND_LINE_SIZE);
+	ring_buffer_expanded = true;
+	disable_tracing_selftest("running event triggers");
+
+	buf = bootup_trigger_buf;
+	for (i = 0; i < MAX_BOOT_TRIGGERS; i++) {
+		trigger = strsep(&buf, ",");
+		if (!trigger)
+			break;
+		bootup_triggers[i].event = strsep(&trigger, ".");
+		bootup_triggers[i].trigger = strsep(&trigger, ".");
+		if (!bootup_triggers[i].trigger)
+			break;
+	}
+
+	nr_boot_triggers = i;
+	return 1;
+}
+__setup("trace_trigger=", setup_trace_triggers);
+#endif
+
 /* Add an event to a trace directory */
 static int
 __trace_add_new_event(struct trace_event_call *call, struct trace_array *tr)
@@ -2812,6 +2850,28 @@ __trace_add_new_event(struct trace_event_call *call, struct trace_array *tr)
 		return event_define_fields(call);
 }
 
+#ifdef CONFIG_HIST_TRIGGERS
+static void trace_early_triggers(struct trace_event_file *file, const char *name)
+{
+	int ret;
+	int i;
+
+	for (i = 0; i < nr_boot_triggers; i++) {
+		if (strcmp(name, bootup_triggers[i].event))
+			continue;
+		mutex_lock(&event_mutex);
+		ret = trigger_process_regex(file, bootup_triggers[i].trigger);
+		mutex_unlock(&event_mutex);
+		if (ret)
+			pr_err("Failed to register trigger '%s' on event %s\n",
+			       bootup_triggers[i].trigger,
+			       bootup_triggers[i].event);
+	}
+}
+#else
+static inline void trace_early_triggers(struct trace_event_file *file, const char *name) { }
+#endif
+
 /*
  * Just create a descriptor for early init. A descriptor is required
  * for enabling events at boot. We want to enable events before
@@ -2822,12 +2882,19 @@ __trace_early_add_new_event(struct trace_event_call *call,
 			    struct trace_array *tr)
 {
 	struct trace_event_file *file;
+	int ret;
 
 	file = trace_create_new_event(call, tr);
 	if (!file)
 		return -ENOMEM;
 
-	return event_define_fields(call);
+	ret = event_define_fields(call);
+	if (ret)
+		return ret;
+
+	trace_early_triggers(file, trace_event_name(call));
+
+	return 0;
 }
 
 struct ftrace_module_file_ops;
@@ -3735,6 +3802,8 @@ static __init int event_trace_enable(void)
 			list_add(&call->list, &ftrace_events);
 	}
 
+	register_trigger_cmds();
+
 	/*
 	 * We need the top trace array to have a working set of trace
 	 * points at early init, before the debug files and directories
@@ -3749,7 +3818,6 @@ static __init int event_trace_enable(void)
 
 	register_event_cmds();
 
-	register_trigger_cmds();
 
 	return 0;
 }
-- 
2.35.1




* [for-next][PATCH 04/11] ring_buffer: Remove unused "event" parameter
  2022-11-24 14:50 [for-next][PATCH 00/11] tracing: Updates for 6.2 Steven Rostedt
                   ` (2 preceding siblings ...)
  2022-11-24 14:50 ` [for-next][PATCH 03/11] tracing: Add trace_trigger kernel command line option Steven Rostedt
@ 2022-11-24 14:50 ` Steven Rostedt
  2022-11-24 14:50 ` [for-next][PATCH 05/11] tracing/osnoise: Add osnoise/options file Steven Rostedt
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 24+ messages in thread
From: Steven Rostedt @ 2022-11-24 14:50 UTC (permalink / raw)
  To: linux-kernel; +Cc: Masami Hiramatsu, Andrew Morton, Song Chen

From: Song Chen <chensong_2000@189.cn>

After commit a389d86f7fd0 ("ring-buffer: Have nested events still record
running time stamp"), the "event" parameter is no longer used in either
ring_buffer_unlock_commit() or rb_commit(). Best to remove it.

Link: https://lkml.kernel.org/r/1666274811-24138-1-git-send-email-chensong_2000@189.cn

Signed-off-by: Song Chen <chensong_2000@189.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 include/linux/ring_buffer.h          |  3 +--
 kernel/trace/ring_buffer.c           | 12 +++++-------
 kernel/trace/ring_buffer_benchmark.c |  2 +-
 kernel/trace/trace.c                 |  2 +-
 4 files changed, 8 insertions(+), 11 deletions(-)

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index 3c7d295746f6..782e14f62201 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -113,8 +113,7 @@ void ring_buffer_change_overwrite(struct trace_buffer *buffer, int val);
 
 struct ring_buffer_event *ring_buffer_lock_reserve(struct trace_buffer *buffer,
 						   unsigned long length);
-int ring_buffer_unlock_commit(struct trace_buffer *buffer,
-			      struct ring_buffer_event *event);
+int ring_buffer_unlock_commit(struct trace_buffer *buffer);
 int ring_buffer_write(struct trace_buffer *buffer,
 		      unsigned long length, void *data);
 
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index b21bf14bae9b..843818ee4814 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -3180,8 +3180,7 @@ static inline void rb_event_discard(struct ring_buffer_event *event)
 		event->time_delta = 1;
 }
 
-static void rb_commit(struct ring_buffer_per_cpu *cpu_buffer,
-		      struct ring_buffer_event *event)
+static void rb_commit(struct ring_buffer_per_cpu *cpu_buffer)
 {
 	local_inc(&cpu_buffer->entries);
 	rb_end_commit(cpu_buffer);
@@ -3383,15 +3382,14 @@ void ring_buffer_nest_end(struct trace_buffer *buffer)
  *
  * Must be paired with ring_buffer_lock_reserve.
  */
-int ring_buffer_unlock_commit(struct trace_buffer *buffer,
-			      struct ring_buffer_event *event)
+int ring_buffer_unlock_commit(struct trace_buffer *buffer)
 {
 	struct ring_buffer_per_cpu *cpu_buffer;
 	int cpu = raw_smp_processor_id();
 
 	cpu_buffer = buffer->buffers[cpu];
 
-	rb_commit(cpu_buffer, event);
+	rb_commit(cpu_buffer);
 
 	rb_wakeups(buffer, cpu_buffer);
 
@@ -3977,7 +3975,7 @@ int ring_buffer_write(struct trace_buffer *buffer,
 
 	memcpy(body, data, length);
 
-	rb_commit(cpu_buffer, event);
+	rb_commit(cpu_buffer);
 
 	rb_wakeups(buffer, cpu_buffer);
 
@@ -5998,7 +5996,7 @@ static __init int rb_write_something(struct rb_test_data *data, bool nested)
 	}
 
  out:
-	ring_buffer_unlock_commit(data->buffer, event);
+	ring_buffer_unlock_commit(data->buffer);
 
 	return 0;
 }
diff --git a/kernel/trace/ring_buffer_benchmark.c b/kernel/trace/ring_buffer_benchmark.c
index 78e576575b79..aef34673d79d 100644
--- a/kernel/trace/ring_buffer_benchmark.c
+++ b/kernel/trace/ring_buffer_benchmark.c
@@ -258,7 +258,7 @@ static void ring_buffer_producer(void)
 				hit++;
 				entry = ring_buffer_event_data(event);
 				*entry = smp_processor_id();
-				ring_buffer_unlock_commit(buffer, event);
+				ring_buffer_unlock_commit(buffer);
 			}
 		}
 		end_time = ktime_get();
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 5cfc95a52bc3..5c97dbef741b 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -999,7 +999,7 @@ __buffer_unlock_commit(struct trace_buffer *buffer, struct ring_buffer_event *ev
 		/* ring_buffer_unlock_commit() enables preemption */
 		preempt_enable_notrace();
 	} else
-		ring_buffer_unlock_commit(buffer, event);
+		ring_buffer_unlock_commit(buffer);
 }
 
 /**
-- 
2.35.1




* [for-next][PATCH 05/11] tracing/osnoise: Add osnoise/options file
  2022-11-24 14:50 [for-next][PATCH 00/11] tracing: Updates for 6.2 Steven Rostedt
                   ` (3 preceding siblings ...)
  2022-11-24 14:50 ` [for-next][PATCH 04/11] ring_buffer: Remove unused "event" parameter Steven Rostedt
@ 2022-11-24 14:50 ` Steven Rostedt
  2022-11-24 17:31   ` Daniel Bristot de Oliveira
  2022-11-24 14:50 ` [for-next][PATCH 06/11] tracing/osnoise: Add OSNOISE_WORKLOAD option Steven Rostedt
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 24+ messages in thread
From: Steven Rostedt @ 2022-11-24 14:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Masami Hiramatsu, Andrew Morton, Daniel Bristot de Oliveira,
	Jonathan Corbet

From: Daniel Bristot de Oliveira <bristot@kernel.org>

Add the tracing/osnoise/options file to control
osnoise/timerlat tracer features. It is a single
file containing multiple features, similar to
the sched/features file.

Reading the file displays a list of options. Writing
OPTION_NAME enables an option, and writing NO_OPTION_NAME
disables it.

DEFAULTS is a special option that resets all options
to their default values.

A bitmask keeps track of the status of the options. When
needed, we can switch to a list of static keys, but for now
the memory increase is not justified.
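The enable/disable/reset semantics can be sketched as a small userspace model (an illustration only, not the kernel code: `OSN_EXAMPLE` is a hypothetical option, and the locking the real write handler takes is omitted):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Userspace model of the options bitmask; OSN_EXAMPLE is hypothetical. */
enum { OSN_DEFAULTS = 0, OSN_EXAMPLE, OSN_MAX };

static const char * const option_str[OSN_MAX] = { "DEFAULTS", "EXAMPLE" };

#define OSN_DEFAULT_OPTIONS	0UL

static unsigned long options = OSN_DEFAULT_OPTIONS;

/* Mimics a write to osnoise/options: 0 on success, -1 on error. */
static int options_write(const char *buf)
{
	bool enable = true;
	int i;

	if (!strncmp(buf, "NO_", 3)) {	/* "NO_" prefix disables */
		enable = false;
		buf += 3;
	}

	for (i = 0; i < OSN_MAX; i++)
		if (!strcmp(buf, option_str[i]))
			break;
	if (i == OSN_MAX)		/* unknown option name */
		return -1;

	if (i == OSN_DEFAULTS) {
		if (!enable)		/* "NO_DEFAULTS" is rejected */
			return -1;
		options = OSN_DEFAULT_OPTIONS;
	} else if (enable) {
		options |= 1UL << i;
	} else {
		options &= ~(1UL << i);
	}
	return 0;
}
```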

Link: https://lkml.kernel.org/r/f8d34aefdb225d2603fcb4c02a120832a0cd3339.1668692096.git.bristot@kernel.org

Cc: Daniel Bristot de Oliveira <bristot@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/trace/trace_osnoise.c | 170 +++++++++++++++++++++++++++++++++++
 1 file changed, 170 insertions(+)

diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
index 4300c5dc4e5d..17b77fe3950b 100644
--- a/kernel/trace/trace_osnoise.c
+++ b/kernel/trace/trace_osnoise.c
@@ -48,6 +48,19 @@
 #define DEFAULT_TIMERLAT_PERIOD	1000			/* 1ms */
 #define DEFAULT_TIMERLAT_PRIO	95			/* FIFO 95 */
 
+/*
+ * osnoise/options entries.
+ */
+enum osnoise_options_index {
+	OSN_DEFAULTS = 0,
+	OSN_MAX
+};
+
+static const char * const osnoise_options_str[OSN_MAX] = { "DEFAULTS" };
+
+#define OSN_DEFAULT_OPTIONS	0
+unsigned long osnoise_options	= OSN_DEFAULT_OPTIONS;
+
 /*
  * trace_array of the enabled osnoise/timerlat instances.
  */
@@ -1860,6 +1873,150 @@ static void osnoise_init_hotplug_support(void)
 }
 #endif /* CONFIG_HOTPLUG_CPU */
 
+/*
+ * seq file functions for the osnoise/options file.
+ */
+static void *s_options_start(struct seq_file *s, loff_t *pos)
+{
+	int option = *pos;
+
+	mutex_lock(&interface_lock);
+
+	if (option >= OSN_MAX)
+		return NULL;
+
+	return pos;
+}
+
+static void *s_options_next(struct seq_file *s, void *v, loff_t *pos)
+{
+	int option = ++(*pos);
+
+	if (option >= OSN_MAX)
+		return NULL;
+
+	return pos;
+}
+
+static int s_options_show(struct seq_file *s, void *v)
+{
+	loff_t *pos = v;
+	int option = *pos;
+
+	if (option == OSN_DEFAULTS) {
+		if (osnoise_options == OSN_DEFAULT_OPTIONS)
+			seq_printf(s, "%s", osnoise_options_str[option]);
+		else
+			seq_printf(s, "NO_%s", osnoise_options_str[option]);
+		goto out;
+	}
+
+	if (test_bit(option, &osnoise_options))
+		seq_printf(s, "%s", osnoise_options_str[option]);
+	else
+		seq_printf(s, "NO_%s", osnoise_options_str[option]);
+
+out:
+	if (option != OSN_MAX)
+		seq_puts(s, " ");
+
+	return 0;
+}
+
+static void s_options_stop(struct seq_file *s, void *v)
+{
+	seq_puts(s, "\n");
+	mutex_unlock(&interface_lock);
+}
+
+static const struct seq_operations osnoise_options_seq_ops = {
+	.start		= s_options_start,
+	.next		= s_options_next,
+	.show		= s_options_show,
+	.stop		= s_options_stop
+};
+
+static int osnoise_options_open(struct inode *inode, struct file *file)
+{
+	return seq_open(file, &osnoise_options_seq_ops);
+};
+
+/**
+ * osnoise_options_write - Write function for "options" entry
+ * @filp: The active open file structure
+ * @ubuf: The user buffer that contains the value to write
+ * @cnt: The maximum number of bytes to write to "file"
+ * @ppos: The current position in @file
+ *
+ * Writing the option name sets the option, writing the "NO_"
+ * prefix in front of the option name disables it.
+ *
+ * Writing "DEFAULTS" resets the option values to the default ones.
+ */
+static ssize_t osnoise_options_write(struct file *filp, const char __user *ubuf,
+				     size_t cnt, loff_t *ppos)
+{
+	int running, option, enable, retval;
+	char buf[256], *option_str;
+
+	if (cnt >= 256)
+		return -EINVAL;
+
+	if (copy_from_user(buf, ubuf, cnt))
+		return -EFAULT;
+
+	buf[cnt] = 0;
+
+	if (strncmp(buf, "NO_", 3)) {
+		option_str = strstrip(buf);
+		enable = true;
+	} else {
+		option_str = strstrip(&buf[3]);
+		enable = false;
+	}
+
+	option = match_string(osnoise_options_str, OSN_MAX, option_str);
+	if (option < 0)
+		return -EINVAL;
+
+	/*
+	 * trace_types_lock is taken to avoid concurrency on start/stop.
+	 */
+	mutex_lock(&trace_types_lock);
+	running = osnoise_has_registered_instances();
+	if (running)
+		stop_per_cpu_kthreads();
+
+	mutex_lock(&interface_lock);
+	/*
+	 * avoid CPU hotplug operations that might read options.
+	 */
+	cpus_read_lock();
+
+	retval = cnt;
+
+	if (enable) {
+		if (option == OSN_DEFAULTS)
+			osnoise_options = OSN_DEFAULT_OPTIONS;
+		else
+			set_bit(option, &osnoise_options);
+	} else {
+		if (option == OSN_DEFAULTS)
+			retval = -EINVAL;
+		else
+			clear_bit(option, &osnoise_options);
+	}
+
+	cpus_read_unlock();
+	mutex_unlock(&interface_lock);
+
+	if (running)
+		start_per_cpu_kthreads();
+	mutex_unlock(&trace_types_lock);
+
+	return retval;
+}
+
 /*
  * osnoise_cpus_read - Read function for reading the "cpus" file
  * @filp: The active open file structure
@@ -2042,6 +2199,14 @@ static const struct file_operations cpus_fops = {
 	.llseek		= generic_file_llseek,
 };
 
+static const struct file_operations osnoise_options_fops = {
+	.open		= osnoise_options_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= seq_release,
+	.write		= osnoise_options_write
+};
+
 #ifdef CONFIG_TIMERLAT_TRACER
 #ifdef CONFIG_STACKTRACE
 static int init_timerlat_stack_tracefs(struct dentry *top_dir)
@@ -2128,6 +2293,11 @@ static int init_tracefs(void)
 	if (!tmp)
 		goto err;
 
+	tmp = trace_create_file("options", TRACE_MODE_WRITE, top_dir, NULL,
+				&osnoise_options_fops);
+	if (!tmp)
+		goto err;
+
 	ret = init_timerlat_tracefs(top_dir);
 	if (ret)
 		goto err;
-- 
2.35.1



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [for-next][PATCH 06/11] tracing/osnoise: Add OSNOISE_WORKLOAD option
  2022-11-24 14:50 [for-next][PATCH 00/11] tracing: Updates for 6.2 Steven Rostedt
                   ` (4 preceding siblings ...)
  2022-11-24 14:50 ` [for-next][PATCH 05/11] tracing/osnoise: Add osnoise/options file Steven Rostedt
@ 2022-11-24 14:50 ` Steven Rostedt
  2022-11-24 14:50 ` [for-next][PATCH 07/11] Documentation/osnoise: Add osnoise/options documentation Steven Rostedt
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 24+ messages in thread
From: Steven Rostedt @ 2022-11-24 14:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Masami Hiramatsu, Andrew Morton, Daniel Bristot de Oliveira,
	Jonathan Corbet

From: Daniel Bristot de Oliveira <bristot@kernel.org>

The osnoise tracer is not only a tracer and a set of tracepoints, but
also a workload dispatcher.

In preparation for having other workloads, e.g., in user-space,
add an option to avoid dispatching the workload.

By not dispatching the workload, the osnoise: tracepoints become
generic events to measure the execution time of *any* task on Linux.

For example:

  # cd /sys/kernel/tracing/
  # cat osnoise/options
  DEFAULTS OSNOISE_WORKLOAD
  # echo NO_OSNOISE_WORKLOAD > osnoise/options
  # cat osnoise/options
  NO_DEFAULTS NO_OSNOISE_WORKLOAD
  # echo osnoise > set_event
  # echo osnoise > current_tracer
  # tail -8 trace
      make-94722   [002] d..3.  1371.794507: thread_noise:     make:94722 start 1371.794302286 duration 200897 ns
        sh-121042  [020] d..3.  1371.794534: thread_noise:       sh:121042 start 1371.781610976 duration 8943683 ns
      make-121097  [005] d..3.  1371.794542: thread_noise:     make:121097 start 1371.794481522 duration 60444 ns
     <...>-40      [005] d..3.  1371.794550: thread_noise: migration/5:40 start 1371.794542256 duration 7154 ns
    <idle>-0       [018] dNh2.  1371.794554: irq_noise: reschedule:253 start 1371.794553547 duration 40 ns
    <idle>-0       [018] dNh2.  1371.794561: irq_noise: local_timer:236 start 1371.794556222 duration 4890 ns
    <idle>-0       [018] .Ns2.  1371.794563: softirq_noise:    SCHED:7 start 1371.794561803 duration 992 ns
    <idle>-0       [018] d..3.  1371.794566: thread_noise: swapper/18:0 start 1371.781368110 duration 13191798 ns

This is in preparation for the rtla exec_time tracer/tool and the
rtla osnoise --user option.

Link: https://lkml.kernel.org/r/f5cfbd37aefd419eefe9243b4d2fc38ed5753fe4.1668692096.git.bristot@kernel.org

Cc: Daniel Bristot de Oliveira <bristot@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/trace/trace_osnoise.c | 28 ++++++++++++++++++++++------
 1 file changed, 22 insertions(+), 6 deletions(-)

diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
index 17b77fe3950b..3f10dd1f2f1c 100644
--- a/kernel/trace/trace_osnoise.c
+++ b/kernel/trace/trace_osnoise.c
@@ -53,12 +53,13 @@
  */
 enum osnoise_options_index {
 	OSN_DEFAULTS = 0,
+	OSN_WORKLOAD,
 	OSN_MAX
 };
 
-static const char * const osnoise_options_str[OSN_MAX] = { "DEFAULTS" };
+static const char * const osnoise_options_str[OSN_MAX] = { "DEFAULTS", "OSNOISE_WORKLOAD" };
 
-#define OSN_DEFAULT_OPTIONS	0
+#define OSN_DEFAULT_OPTIONS	0x2
 unsigned long osnoise_options	= OSN_DEFAULT_OPTIONS;
 
 /*
@@ -1186,11 +1187,12 @@ trace_sched_switch_callback(void *data, bool preempt,
 			    unsigned int prev_state)
 {
 	struct osnoise_variables *osn_var = this_cpu_osn_var();
+	int workload = test_bit(OSN_WORKLOAD, &osnoise_options);
 
-	if (p->pid != osn_var->pid)
+	if ((p->pid != osn_var->pid) || !workload)
 		thread_exit(osn_var, p);
 
-	if (n->pid != osn_var->pid)
+	if ((n->pid != osn_var->pid) || !workload)
 		thread_entry(osn_var, n);
 }
 
@@ -1723,9 +1725,16 @@ static void stop_kthread(unsigned int cpu)
 	struct task_struct *kthread;
 
 	kthread = per_cpu(per_cpu_osnoise_var, cpu).kthread;
-	if (kthread)
+	if (kthread) {
 		kthread_stop(kthread);
-	per_cpu(per_cpu_osnoise_var, cpu).kthread = NULL;
+		per_cpu(per_cpu_osnoise_var, cpu).kthread = NULL;
+	} else {
+		if (!test_bit(OSN_WORKLOAD, &osnoise_options)) {
+			per_cpu(per_cpu_osnoise_var, cpu).sampling = false;
+			barrier();
+			return;
+		}
+	}
 }
 
 /*
@@ -1759,6 +1768,13 @@ static int start_kthread(unsigned int cpu)
 		snprintf(comm, 24, "timerlat/%d", cpu);
 		main = timerlat_main;
 	} else {
+		/* if no workload, just return */
+		if (!test_bit(OSN_WORKLOAD, &osnoise_options)) {
+			per_cpu(per_cpu_osnoise_var, cpu).sampling = true;
+			barrier();
+			return 0;
+		}
+
 		snprintf(comm, 24, "osnoise/%d", cpu);
 	}
 
-- 
2.35.1



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [for-next][PATCH 07/11] Documentation/osnoise: Add osnoise/options documentation
  2022-11-24 14:50 [for-next][PATCH 00/11] tracing: Updates for 6.2 Steven Rostedt
                   ` (5 preceding siblings ...)
  2022-11-24 14:50 ` [for-next][PATCH 06/11] tracing/osnoise: Add OSNOISE_WORKLOAD option Steven Rostedt
@ 2022-11-24 14:50 ` Steven Rostedt
  2022-11-24 14:50 ` [for-next][PATCH 08/11] tracing/perf: Use strndup_user instead of kzalloc/strncpy_from_user Steven Rostedt
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 24+ messages in thread
From: Steven Rostedt @ 2022-11-24 14:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Masami Hiramatsu, Andrew Morton, Daniel Bristot de Oliveira,
	Jonathan Corbet

From: Daniel Bristot de Oliveira <bristot@kernel.org>

Add documentation for the osnoise/options file, along
with an explanation of the OSNOISE_WORKLOAD option.

Link: https://lkml.kernel.org/r/777af8f3d87beedd304805f98eff6c8291d64226.1668692096.git.bristot@kernel.org

Cc: Daniel Bristot de Oliveira <bristot@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 Documentation/trace/osnoise-tracer.rst | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/Documentation/trace/osnoise-tracer.rst b/Documentation/trace/osnoise-tracer.rst
index 963def9f97c6..3c675ed82b27 100644
--- a/Documentation/trace/osnoise-tracer.rst
+++ b/Documentation/trace/osnoise-tracer.rst
@@ -109,6 +109,11 @@ The tracer has a set of options inside the osnoise directory, they are:
  - tracing_threshold: the minimum delta between two time() reads to be
    considered as noise, in us. When set to 0, the default value will
    be used, which is currently 5 us.
+ - osnoise/options: a set of on/off options that can be enabled by
+   writing the option name to the file or disabled by writing the option
+   name preceded with the 'NO_' prefix. For example, writing
+   NO_OSNOISE_WORKLOAD disables the OSNOISE_WORKLOAD option. The
+   special DEFAULTS option resets all options to the default value.
 
 Additional Tracing
 ------------------
@@ -150,3 +155,10 @@ tracepoints is smaller than eight us reported in the sample_threshold.
 The reason roots in the overhead of the entry and exit code that happens
 before and after any interference execution. This justifies the dual
 approach: measuring thread and tracing.
+
+Running osnoise tracer without workload
+---------------------------------------
+
+By enabling the osnoise tracer with the NO_OSNOISE_WORKLOAD option set,
+the osnoise: tracepoints serve to measure the execution time of
+any type of Linux task, free from the interference of other tasks.
-- 
2.35.1



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [for-next][PATCH 08/11] tracing/perf: Use strndup_user instead of kzalloc/strncpy_from_user
  2022-11-24 14:50 [for-next][PATCH 00/11] tracing: Updates for 6.2 Steven Rostedt
                   ` (6 preceding siblings ...)
  2022-11-24 14:50 ` [for-next][PATCH 07/11] Documentation/osnoise: Add osnoise/options documentation Steven Rostedt
@ 2022-11-24 14:50 ` Steven Rostedt
  2022-11-24 14:50 ` [for-next][PATCH 09/11] tracing: Make tracepoint_print_iter static Steven Rostedt
  2022-11-24 14:50 ` [for-next][PATCH 11/11] ftrace: Avoid needless updates of the ftrace function call Steven Rostedt
  9 siblings, 0 replies; 24+ messages in thread
From: Steven Rostedt @ 2022-11-24 14:50 UTC (permalink / raw)
  To: linux-kernel; +Cc: Masami Hiramatsu, Andrew Morton, Chuang Wang

From: Chuang Wang <nashuiliang@gmail.com>

Use strndup_user() instead of kzalloc() + strncpy_from_user(), which
makes the code more concise.
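The contract being relied on can be sketched in userspace (an illustrative model of `strndup_user()`'s semantics, not the kernel code: the real helper copies from user memory and returns `ERR_PTR(-EINVAL)` when the string does not fit, which perf_kprobe_init() remaps to -E2BIG; errno stands in for the ERR_PTR encoding here):

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * Userspace model of strndup_user(): duplicate at most n bytes, failing
 * when no terminating NUL is found within n bytes.
 */
static char *strndup_user_model(const char *src, size_t n)
{
	size_t len = strnlen(src, n);
	char *dup;

	if (len == n) {		/* string (plus NUL) does not fit in n */
		errno = EINVAL;
		return NULL;
	}

	dup = malloc(len + 1);
	if (!dup) {
		errno = ENOMEM;
		return NULL;
	}
	memcpy(dup, src, len);
	dup[len] = '\0';
	return dup;
}
```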

Link: https://lkml.kernel.org/r/20221121080831.707409-1-nashuiliang@gmail.com

Signed-off-by: Chuang Wang <nashuiliang@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/trace/trace_event_perf.c | 16 ++++++----------
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
index 61e3a2620fa3..05e791241812 100644
--- a/kernel/trace/trace_event_perf.c
+++ b/kernel/trace/trace_event_perf.c
@@ -251,16 +251,12 @@ int perf_kprobe_init(struct perf_event *p_event, bool is_retprobe)
 	struct trace_event_call *tp_event;
 
 	if (p_event->attr.kprobe_func) {
-		func = kzalloc(KSYM_NAME_LEN, GFP_KERNEL);
-		if (!func)
-			return -ENOMEM;
-		ret = strncpy_from_user(
-			func, u64_to_user_ptr(p_event->attr.kprobe_func),
-			KSYM_NAME_LEN);
-		if (ret == KSYM_NAME_LEN)
-			ret = -E2BIG;
-		if (ret < 0)
-			goto out;
+		func = strndup_user(u64_to_user_ptr(p_event->attr.kprobe_func),
+				    KSYM_NAME_LEN);
+		if (IS_ERR(func)) {
+			ret = PTR_ERR(func);
+			return (ret == -EINVAL) ? -E2BIG : ret;
+		}
 
 		if (func[0] == '\0') {
 			kfree(func);
-- 
2.35.1



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [for-next][PATCH 09/11] tracing: Make tracepoint_print_iter static
  2022-11-24 14:50 [for-next][PATCH 00/11] tracing: Updates for 6.2 Steven Rostedt
                   ` (7 preceding siblings ...)
  2022-11-24 14:50 ` [for-next][PATCH 08/11] tracing/perf: Use strndup_user instead of kzalloc/strncpy_from_user Steven Rostedt
@ 2022-11-24 14:50 ` Steven Rostedt
  2022-11-24 14:50 ` [for-next][PATCH 11/11] ftrace: Avoid needless updates of the ftrace function call Steven Rostedt
  9 siblings, 0 replies; 24+ messages in thread
From: Steven Rostedt @ 2022-11-24 14:50 UTC (permalink / raw)
  To: linux-kernel; +Cc: Masami Hiramatsu, Andrew Morton, Xiu Jianfeng

From: Xiu Jianfeng <xiujianfeng@huawei.com>

After the change in commit 4239174570da ("tracing: Make tracepoint_printk
a static_key"), this symbol is no longer used outside of the file, so
mark it static.

Link: https://lkml.kernel.org/r/20221122091456.72055-1-xiujianfeng@huawei.com

Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/trace/trace.c | 2 +-
 kernel/trace/trace.h | 2 --
 2 files changed, 1 insertion(+), 3 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 5c97dbef741b..93a75a97118f 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -85,7 +85,7 @@ void __init disable_tracing_selftest(const char *reason)
 #endif
 
 /* Pipe tracepoints to printk */
-struct trace_iterator *tracepoint_print_iter;
+static struct trace_iterator *tracepoint_print_iter;
 int tracepoint_printk;
 static bool tracepoint_printk_stop_on_boot __initdata;
 static DEFINE_STATIC_KEY_FALSE(tracepoint_printk_key);
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index d42e24507152..48643f07bc01 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -1942,8 +1942,6 @@ static inline void tracer_hardirqs_on(unsigned long a0, unsigned long a1) { }
 static inline void tracer_hardirqs_off(unsigned long a0, unsigned long a1) { }
 #endif
 
-extern struct trace_iterator *tracepoint_print_iter;
-
 /*
  * Reset the state of the trace_iterator so that it can read consumed data.
  * Normally, the trace_iterator is used for reading the data when it is not
-- 
2.35.1



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [for-next][PATCH 11/11] ftrace: Avoid needless updates of the ftrace function call
  2022-11-24 14:50 [for-next][PATCH 00/11] tracing: Updates for 6.2 Steven Rostedt
                   ` (8 preceding siblings ...)
  2022-11-24 14:50 ` [for-next][PATCH 09/11] tracing: Make tracepoint_print_iter static Steven Rostedt
@ 2022-11-24 14:50 ` Steven Rostedt
  9 siblings, 0 replies; 24+ messages in thread
From: Steven Rostedt @ 2022-11-24 14:50 UTC (permalink / raw)
  To: linux-kernel; +Cc: Masami Hiramatsu, Andrew Morton, Song Shuai

From: "Steven Rostedt (Google)" <rostedt@goodmis.org>

Song Shuai reported:

    The list func (ftrace_ops_list_func) will be patched first
    before the transition between old and new calls are set,
    which fixed the race described in this commit `59338f75`.

    While ftrace_trace_function changes from the list func to a
    ftrace_ops func, like unregistering the klp_ops to leave the only
    global_ops in ftrace_ops_list, the ftrace_[regs]_call will be
    replaced with the list func although it already exists. So there
    should be a condition to avoid this.

Song suggested using another variable to keep track of what the ftrace
function is set to. But this can be simplified by using a helper
function that does the same with a static variable.
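The idea can be illustrated with a small standalone model (the types and the patching routine below are hypothetical stand-ins: the real ftrace_update_ftrace_func() rewrites kernel text, which is what makes the needless calls expensive):

```c
/* Hypothetical stand-ins for the ftrace types and patching routine. */
typedef void (*ftrace_func_t)(void);

static int patch_count;		/* counts how often code would be patched */

static int ftrace_update_ftrace_func(ftrace_func_t func)
{
	(void)func;
	patch_count++;		/* the real function rewrites kernel text */
	return 0;
}

/*
 * Same shape as the patch's update_ftrace_func(): a static variable
 * remembers the last function installed, so re-installing the same one
 * becomes a no-op instead of a needless text rewrite.
 */
static int update_ftrace_func(ftrace_func_t func)
{
	static ftrace_func_t save_func;

	if (func == save_func)
		return 0;

	save_func = func;
	return ftrace_update_ftrace_func(func);
}

static void list_func(void) { }
static void trace_func(void) { }
```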

Link: https://lore.kernel.org/lkml/20221026132039.2236233-1-suagrfillet@gmail.com/
Link: https://lore.kernel.org/linux-trace-kernel/20221122180905.737b6f52@gandalf.local.home

Reported-by: Song Shuai <suagrfillet@gmail.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/trace/ftrace.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 65a5d36463e0..d04552c0c275 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -2763,6 +2763,19 @@ void __weak ftrace_arch_code_modify_post_process(void)
 {
 }
 
+static int update_ftrace_func(ftrace_func_t func)
+{
+	static ftrace_func_t save_func;
+
+	/* Avoid updating if it hasn't changed */
+	if (func == save_func)
+		return 0;
+
+	save_func = func;
+
+	return ftrace_update_ftrace_func(func);
+}
+
 void ftrace_modify_all_code(int command)
 {
 	int update = command & FTRACE_UPDATE_TRACE_FUNC;
@@ -2783,7 +2796,7 @@ void ftrace_modify_all_code(int command)
 	 * traced.
 	 */
 	if (update) {
-		err = ftrace_update_ftrace_func(ftrace_ops_list_func);
+		err = update_ftrace_func(ftrace_ops_list_func);
 		if (FTRACE_WARN_ON(err))
 			return;
 	}
@@ -2799,7 +2812,7 @@ void ftrace_modify_all_code(int command)
 		/* If irqs are disabled, we are in stop machine */
 		if (!irqs_disabled())
 			smp_call_function(ftrace_sync_ipi, NULL, 1);
-		err = ftrace_update_ftrace_func(ftrace_trace_function);
+		err = update_ftrace_func(ftrace_trace_function);
 		if (FTRACE_WARN_ON(err))
 			return;
 	}
-- 
2.35.1



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [for-next][PATCH 05/11] tracing/osnoise: Add osnoise/options file
  2022-11-24 14:50 ` [for-next][PATCH 05/11] tracing/osnoise: Add osnoise/options file Steven Rostedt
@ 2022-11-24 17:31   ` Daniel Bristot de Oliveira
  2022-11-24 19:28     ` Steven Rostedt
  0 siblings, 1 reply; 24+ messages in thread
From: Daniel Bristot de Oliveira @ 2022-11-24 17:31 UTC (permalink / raw)
  To: Steven Rostedt, linux-kernel
  Cc: Masami Hiramatsu, Andrew Morton, Jonathan Corbet

On 11/24/22 15:50, Steven Rostedt wrote:
> From: Daniel Bristot de Oliveira <bristot@kernel.org>
> 
> Add the tracing/osnoise/options file to control
> osnoise/timerlat tracer features. It is a single
> file to contain multiple features, similar to
> the sched/features file.
> 
> Reading the file displays a list of options. Writing
> the OPTION_NAME enables it, writing NO_OPTION_NAME disables
> it.
> 
> The DEAFULTS is a particular option that resets the options
> to the default ones.
> 
> It uses a bitmask to keep track of the status of the option. When
> needed, we can add a list of static keys, but for now
> it does not justify the memory increase.
> 
> Link: https://lkml.kernel.org/r/f8d34aefdb225d2603fcb4c02a120832a0cd3339.1668692096.git.bristot@kernel.org


Hi Steve,

Yesterday I sent a v2 of this patch series, adding some more options [1].

But as you already queued these, and as there is no real difference
between v1 and v2 in these code patches, I think the best way is for me
to send a v3 with the additional patches, built on top of ftrace/core.

(Is it a bad idea? let me know :-))

[1] https://lore.kernel.org/lkml/cover.1669115208.git.bristot@kernel.org/ 

-- Daniel


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [for-next][PATCH 05/11] tracing/osnoise: Add osnoise/options file
  2022-11-24 17:31   ` Daniel Bristot de Oliveira
@ 2022-11-24 19:28     ` Steven Rostedt
  0 siblings, 0 replies; 24+ messages in thread
From: Steven Rostedt @ 2022-11-24 19:28 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: linux-kernel, Masami Hiramatsu, Andrew Morton, Jonathan Corbet

On Thu, 24 Nov 2022 18:31:22 +0100
Daniel Bristot de Oliveira <bristot@kernel.org> wrote:

> Hi Steve,

Hi Daniel,

> 
> Yesterday I sent a v2 of this patch series, adding some more options [1].

Heh, it came in the window of me triaging my patch queue and running my
tests.

> 
> But as you already queued these, and as there is no real difference from the
> v1 and v2 in these code patches, I think the best way is for me to send a v3
> with the additional patches, build on top of the ftrace/core.
> 
> (Is it a bad idea? let me know :-))

Just send patches on top, as these already went through testing, and I
don't want to rebase the branch.

Thanks!

-- Steve

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [for-next][PATCH 02/11] tracing: Add __cpumask to denote a trace event field that is a cpumask_t
  2022-11-24 14:50 ` [for-next][PATCH 02/11] tracing: Add __cpumask to denote a trace event field that is a cpumask_t Steven Rostedt
@ 2022-12-12 14:53   ` Douglas Raillard
  2022-12-12 16:12     ` Steven Rostedt
  0 siblings, 1 reply; 24+ messages in thread
From: Douglas Raillard @ 2022-12-12 14:53 UTC (permalink / raw)
  To: Steven Rostedt, linux-kernel
  Cc: Masami Hiramatsu, Andrew Morton, Valentin Schneider, douglas.raillard

On 24-11-2022 14:50, Steven Rostedt wrote:
> From: "Steven Rostedt (Google)" <rostedt@goodmis.org>
> 
> The trace events have a __bitmask field that can be used for anything
> that requires bitmasks. Although currently it is only used for CPU
> masks, it could be used in the future for any type of bitmasks.
> 
> There is some user space tooling that wants to know if a field is a CPU
> mask and not just some random unsigned long bitmask. Introduce
> "__cpumask()" helper functions that work the same as the current
> __bitmask() helpers but displays in the format file:
> 
>    field:__data_loc cpumask_t *[] mask;    offset:36;      size:4; signed:0;
> 
> Instead of:
> 
>    field:__data_loc unsigned long[] mask;  offset:32;      size:4; signed:0;
> 
> The main difference is the type. Instead of "unsigned long" it is
> "cpumask_t *". Note, this type field needs to be a real type in the
> __dynamic_array() logic that both __cpumask and__bitmask use, but the
> comparison field requires it to be a scalar type whereas cpumask_t is a
> structure (non-scalar). But everything works when making it a pointer.

How is tooling expected to distinguish a real dynamic array of
pointers from a type that uses dynamic arrays as an "implementation
detail" with a broken type description? Any reasonable interpretation
of that type by the consuming tool will be broken unless it
specifically knows about __data_loc cpumask*[]. However, the set of
types using that trick is unbounded, so forward compatibility is
impossible to ensure. On top of that, an actual dynamic array of
cpumask pointers becomes impossible to represent.

You might object that if the tool does not know about cpumask, it does
not matter "how it breaks" as the display will be useless anyway, but
that is not true. A parsing library might just parse up to its
knowledge limit and return the most elaborate representation it can
for a given field. It is acceptable for that representation to lack
the full semantics expected by the end user, but it should not lie
about its nature. For example, it would be sane for the user to assert
that the size of an array of pointers is a multiple of the pointer
size. cpumask is currently an array of unsigned long, but there is
nothing preventing a similar type from being based on an array of u8.
Such a type would have different endianness handling, and the
resulting buffer would be garbage.


To fix that issue, I propose to expose the following to userspace:
1. The binary representation type (unsigned long[] in cpumask case).
2. An (ordered list of) semantic type that may or may not be the same as 1.

Type (1) can be used to guarantee correct handling of endianness and a reasonable
default display, while (2) allows any sort of fancy interpretation, all that while preserving
forward compatibility. For cpumask, this would give:
1. unsigned long []
2. bitmask, cpumask

A consumer could know about bitmask as they are likely used in multiple places,
but not about cpumask specifically (e.g. assuming cpumask is a type recently introduced).
Displaying as a list of bits set in the mask would already allow proper formatting, and
knowing it's actually a cpumask can allow fancier behaviors.
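That fallback display is straightforward to implement; here is a sketch (my own illustration, limited to a single-word mask, whereas a real cpumask can span several words):

```c
#include <stdio.h>
#include <string.h>

/* Format the set bits of a single-word mask as a cpulist, e.g. "0-2,4". */
static void bitmask_to_list(unsigned long mask, char *buf, size_t len)
{
	const int nbits = 8 * sizeof(mask);
	char *p = buf;
	int bit;

	*p = '\0';
	for (bit = 0; bit < nbits; bit++) {
		int start;

		if (!(mask & (1UL << bit)))
			continue;

		start = bit;
		while (bit + 1 < nbits && (mask & (1UL << (bit + 1))))
			bit++;	/* extend the run of consecutive bits */

		p += snprintf(p, len - (p - buf), "%s%d",
			      p == buf ? "" : ",", start);
		if (bit > start)
			p += snprintf(p, len - (p - buf), "-%d", bit);
	}
}
```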

From an event format perspective, this could preserve reasonable backward compat
by simply adding another property:

   field:__data_loc unsigned long[] mask;    offset:36;      size:4; signed:0; semantic_type:bitmask,cpumask;

By default, "semantic_type" would simply have the same value as the normal type.

This applies to any type, not just dynamic arrays.

Thanks,

Douglas
    


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [for-next][PATCH 02/11] tracing: Add __cpumask to denote a trace event field that is a cpumask_t
  2022-12-12 14:53   ` Douglas Raillard
@ 2022-12-12 16:12     ` Steven Rostedt
  2022-12-12 17:04       ` Steven Rostedt
  2022-12-12 22:19       ` Douglas Raillard
  0 siblings, 2 replies; 24+ messages in thread
From: Steven Rostedt @ 2022-12-12 16:12 UTC (permalink / raw)
  To: Douglas Raillard
  Cc: linux-kernel, Masami Hiramatsu, Andrew Morton, Valentin Schneider

On Mon, 12 Dec 2022 14:53:27 +0000
Douglas Raillard <douglas.raillard@arm.com> wrote:

> On 24-11-2022 14:50, Steven Rostedt wrote:
> > From: "Steven Rostedt (Google)" <rostedt@goodmis.org>
> > 
> > The trace events have a __bitmask field that can be used for anything
> > that requires bitmasks. Although currently it is only used for CPU
> > masks, it could be used in the future for any type of bitmasks.
> > 
> > There is some user space tooling that wants to know if a field is a CPU
> > mask and not just some random unsigned long bitmask. Introduce
> > "__cpumask()" helper functions that work the same as the current
> > __bitmask() helpers but displays in the format file:
> > 
> >    field:__data_loc cpumask_t *[] mask;    offset:36;      size:4; signed:0;

The current parsing tools break the above into:

 "field:" "__data_loc" <some-type> "[]" <var-name> ";" "offset:"
 <offset> ";" "size:" "<size>" ";" "signed:" <signed> ";"

Where the <some-type> really can be anything, and in lots of cases, it is.
Thus it's only a hint for the tooling, and has never been limited to a
fixed set of types.
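That tokenization can be sketched as follows (a hypothetical illustration of the scheme, not any particular tool's parser): everything between "__data_loc " and "[]" is taken as the type hint, verbatim.

```c
#include <stdio.h>
#include <string.h>

struct field_desc {
	char type[64];		/* the free-form type hint */
	char name[64];
	int offset, size, is_signed;
};

/* Parse one __data_loc line of a format file; 0 on success, -1 on error. */
static int parse_field(const char *line, struct field_desc *f)
{
	const char *t = strstr(line, "__data_loc ");
	const char *br;
	size_t tlen;

	if (!t)
		return -1;
	t += strlen("__data_loc ");
	br = strstr(t, "[]");
	if (!br)
		return -1;

	tlen = br - t;
	if (tlen == 0 || tlen >= sizeof(f->type))
		return -1;
	memcpy(f->type, t, tlen);
	f->type[tlen] = '\0';
	while (tlen && f->type[tlen - 1] == ' ')	/* trim before "[]" */
		f->type[--tlen] = '\0';

	if (sscanf(br + 2, " %63[^;]; offset:%d; size:%d; signed:%d;",
		   f->name, &f->offset, &f->size, &f->is_signed) != 4)
		return -1;
	return 0;
}
```

Note that the type hint comes back as the opaque string "cpumask_t *", which is exactly why a tool that does not know the convention can misread it as an array of pointers.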

> > 
> > Instead of:
> > 
> >    field:__data_loc unsigned long[] mask;  offset:32;      size:4; signed:0;
> > 
> > The main difference is the type. Instead of "unsigned long" it is
> > "cpumask_t *". Note, this type field needs to be a real type in the
> > __dynamic_array() logic that both __cpumask and__bitmask use, but the
> > comparison field requires it to be a scalar type whereas cpumask_t is a
> > structure (non-scalar). But everything works when making it a pointer.  

The above is for the kernel to build.

> 
> How is tooling expected to distinguish between a real dynamic array of pointers
> from a type that is using dynamic arrays as an "implementation detail"
> with a broken type description ? Any reasonable
> interpretation of that type by the consuming tool will be broken
> unless it specifically knows about __data_loc cpumask*[].

I'm curious as to what the tool does differently with the above. What
tool are you using? Does it just give up on how to print it?

> However, the set of types using that trick is unbounded so forward
> compatibilty is impossible to ensure. On top of that, an actual
> dynamic array of cpumask pointers becomes impossible to represent.

I never thought about a use case where we print out an array of cpumask
pointers.

> 
> You might object that if the tool does not know about cpumask,
> it does not matter "how it breaks" as the display will be useless anyway,
> but that is not true. A parsing library might just parse up to
> its knowledge limit and return the most elaborate it can for a given field.
> It's acceptable for that representation to not be elaborated with the full
> semantic expected by the end user, but it should not return
> something that is lying on its nature. For example, it would be sane for
> the user to assert the size of an array of pointers to be a multiple
> of a pointer size. cpumask is currently an array of unsigned long but there is
> nothing preventing a similar type to be based on an array of u8.
> Such a type would also have different endianness handling and the resulting buffer
> would be garbage.
> 
> 
> To fix that issue, I propose to expose the following to userspace:
> 1. The binary representation type (unsigned long[] in cpumask case).
> 2. An (ordered list of) semantic type that may or may not be the same as 1.
> 
> Type (1) can be used to guarantee correct handling of endianness and a reasonable
> default display, while (2) allows any sort of fancy interpretation, all that while preserving
> forward compatibility. For cpumask, this would give:
> 1. unsigned long []
> 2. bitmask, cpumask
> 
> A consumer could know about bitmask as they are likely used in multiple places,
> but not about cpumask specifically (e.g. assuming cpumask is a type recently introduced).
> Displaying as a list of bits set in the mask would already allow proper formatting, and
> knowing it's actually a cpumask can allow fancier behaviors.
> 
>  From an event format perspective, this could preserve reasonable backward compat
> by simply adding another property:
> 
>    field:__data_loc unsigned long[] mask;    offset:36;      size:4; signed:0; semantic_type:bitmask,cpumask;
> 
> By default, "semantic_type" would simply have the same value as the normal type.

The problem with the above is that it adds a new field, and I have to check
if that doesn't break existing tooling.

Another possibility is that I can add parsing to the format that is exposed
to user space and simply s/__cpumask *[]/__cpumask[]/

Which will get rid of the pointer array of cpu masks.

> 
> This applies to any type, not just dynamic arrays.
> 

Let me know if the above does break existing user space and I'll revert it.

-- Steve


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [for-next][PATCH 02/11] tracing: Add __cpumask to denote a trace event field that is a cpumask_t
  2022-12-12 16:12     ` Steven Rostedt
@ 2022-12-12 17:04       ` Steven Rostedt
  2022-12-12 22:19       ` Douglas Raillard
  1 sibling, 0 replies; 24+ messages in thread
From: Steven Rostedt @ 2022-12-12 17:04 UTC (permalink / raw)
  To: Douglas Raillard
  Cc: linux-kernel, Masami Hiramatsu, Andrew Morton, Valentin Schneider

On Mon, 12 Dec 2022 11:12:56 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> Another possibility is that I can add parsing to the format that is exposed
> to user space and simply s/__cpumask *[]/__cpumask[]/
> 
> Which will get rid of the pointer array of cpu masks.

Actually, I found a better way to get rid of that "*". Would that work for
you? That is, if the above shows:

   field:__data_loc cpumask_t[] mask;    offset:36;      size:4; signed:0;
   
Instead?

-- Steve


* Re: [for-next][PATCH 02/11] tracing: Add __cpumask to denote a trace event field that is a cpumask_t
  2022-12-12 16:12     ` Steven Rostedt
  2022-12-12 17:04       ` Steven Rostedt
@ 2022-12-12 22:19       ` Douglas Raillard
  2022-12-12 23:53         ` Steven Rostedt
  1 sibling, 1 reply; 24+ messages in thread
From: Douglas Raillard @ 2022-12-12 22:19 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Masami Hiramatsu, Andrew Morton, Valentin Schneider

On 12-12-2022 16:12, Steven Rostedt wrote:
> On Mon, 12 Dec 2022 14:53:27 +0000
> Douglas Raillard <douglas.raillard@arm.com> wrote:
> 
>> On 24-11-2022 14:50, Steven Rostedt wrote:
>>> From: "Steven Rostedt (Google)" <rostedt@goodmis.org>
>>>
>>> The trace events have a __bitmask field that can be used for anything
>>> that requires bitmasks. Although currently it is only used for CPU
>>> masks, it could be used in the future for any type of bitmasks.
>>>
>>> There is some user space tooling that wants to know if a field is a CPU
>>> mask and not just some random unsigned long bitmask. Introduce
>>> "__cpumask()" helper functions that work the same as the current
>>> __bitmask() helpers but displays in the format file:
>>>
>>>     field:__data_loc cpumask_t *[] mask;    offset:36;      size:4; signed:0;
> 
> The current parsing tools break the above into:
> 
>   "field:" "__data_loc" <some-type> "[]" <var-name> ";" "offset:"
>   <offset> ";" "size:" "<size>" ";" "signed:" <signed> ";"
> 
> Where the <some-type> really can be anything, and in lots of cases, it is.
> Thus it's only a hint for the tooling, and has never been limited to what
> they are.

While we are having a look at that, what is the exact format of "<some-type>" ?
The only way it can truly be any type is if it's a declaration specifier followed by
an abstract declarator, e.g. like in a func prototype where parameters can be anonymous, as
shown by the "parameter_declaration" rule in:
http://www.lysator.liu.se/c/ANSI-C-grammar-y.html#declaration-specifiers

At least that is how I implemented the parsing, and it reliably handles everything
I have seen in traces, plus more.
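As a minimal sketch of that lenient splitting (names here are illustrative, not the actual parser code): take the last identifier before the ';' as the field name and keep everything before it as an opaque type string, however exotic:

```rust
// Lenient field-declaration splitter: everything before the last identifier
// is the (possibly exotic) type, the last identifier is the field name.
// A sketch only; a real parser would handle abstract declarators properly.
fn split_field(decl: &str) -> Option<(&str, &str)> {
    let decl = decl.trim().trim_end_matches(';').trim_end();
    // Index just past the last non-identifier character marks the name start.
    let name_start = decl
        .rfind(|c: char| !(c.is_alphanumeric() || c == '_'))
        .map(|i| i + 1)?;
    let (ty, name) = decl.split_at(name_start);
    Some((ty.trim_end(), name))
}
```

This happily accepts "__data_loc cpumask_t *[] mask" without needing to understand what a cpumask is.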

> 
>>>
>>> Instead of:
>>>
>>>     field:__data_loc unsigned long[] mask;  offset:32;      size:4; signed:0;
>>>
>>> The main difference is the type. Instead of "unsigned long" it is
>>> "cpumask_t *". Note, this type field needs to be a real type in the
>>> __dynamic_array() logic that both __cpumask and __bitmask use, but the
>>> comparison field requires it to be a scalar type whereas cpumask_t is a
>>> structure (non-scalar). But everything works when making it a pointer.
> 
> The above is for the kernel to build.

My understanding was that the comparison issue relates to in-kernel filtering?
If that's the case, I completely agree that the type the kernel code sees does not
_have_ to be the same thing that is exposed to userspace, if that simplifies the problem.

>>
>> How is tooling expected to distinguish between a real dynamic array of pointers
>> from a type that is using dynamic arrays as an "implementation detail"
>> with a broken type description ? Any reasonable
>> interpretation of that type by the consuming tool will be broken
>> unless it specifically knows about __data_loc cpumask*[].
> 
> I'm curious as to what the tool does differently with the above. What tool are
> you using? Does it just give up on how to print it?

I'm implementing the Rust libtraceevent/libtracecmd prototype. The use case I'm trying
to address in my first draft is:

* a parser for trace.dat exposing a stable API that does not need to be updated every day.
* A serde backend using that parser
* Any "fancy" type (i.e. that is not a basic scalar type or string) is handled by
   the consumer of the deserializer. This means the backend must be able to fallback
   on providing sensible raw data that can be processed by the user.

In a more general way, I'm trying to structure the code so that the generic trace.dat parsing
lib will be able to handle opaque types by returning a sensibly-typed buffer to the caller. The
caller then decodes it the way it wants. This would apply equally well to a C API.
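For reference, the __data_loc word itself packs the location of the dynamic data into 32 bits (low 16 bits: offset from the start of the event record; high 16 bits: length in bytes), so "return a sensibly-typed buffer" can be sketched as below; `dynamic_field` is a hypothetical name, not an actual API:

```rust
// Decode a 32-bit __data_loc word and return the raw bytes of the dynamic
// field, leaving interpretation (cpumask, bitmask, ...) entirely to the
// caller. Returns None if the record is too short for the claimed location.
fn dynamic_field(record: &[u8], data_loc: u32) -> Option<&[u8]> {
    let offset = (data_loc & 0xffff) as usize; // low half: offset in record
    let len = (data_loc >> 16) as usize;       // high half: length in bytes
    record.get(offset..offset + len)
}
```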

That is especially important for user-defined fancy types, e.g. in a kernel module. We cannot
reasonably expect users to modify the parsing library itself when they could handle the
decoding on their side with an annotation and a custom decoder [1].

Beyond that, any tool that reads the trace and pretty prints it is bound to provide
broken printing for these types: there will be a date T1 where a new fancy type is introduced
in the trace. Until date T2 when the tool gets explicit support for the type, it will
display the thing as an array of pointers. Interpreting a bitmask as an array of pointer makes
no sense for anyone (machine or human). It so happens that unsigned long == ptr size so it
kind of makes sense for cpumask, but an opaque type might internally be u8[], in which case,
printing as void* [] is meaningless. It would likely be better to display a placeholder
or a plain u8[]. The "semantic_type" field property proposal is an improvement of that.


> )
>> However, the set of types using that trick is unbounded so forward
> >> compatibility is impossible to ensure. On top of that, an actual
>> dynamic array of cpumask pointers becomes impossible to represent.
> 
> I never thought about a user case where we print out an array of cpumask
> pointers.

That case (array of pointer) might be a bit far fetched but kernel already contains
weird stuff such as pointers to opaque struct, I suppose as a way of giving a unique
ID to something. The solution we settle on for cpumask should be applicable to any opaque
type present and future so baking such limitations in the foundations of the event system
does not sound like a terribly good idea.

After reflecting on how the caller of the deserializer can process opaque data, I came
to the conclusion that it will have to know the underlying type (unsigned long[]) and
ask for an enum such as:

enum UnsignedLongArray {
	Bit32(Vec<u32>),
	Bit64(Vec<u64>),
}

When asked for UnsignedLongArray, the serde backend will use the ABI info from the trace
to select the appropriate variant.
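A sketch of that selection, with `long_size` and `big_endian` standing in for whatever the trace header actually exposes:

```rust
use std::convert::TryInto;

#[derive(Debug, PartialEq)]
enum UnsignedLongArray {
    Bit32(Vec<u32>),
    Bit64(Vec<u64>),
}

// Decode a raw buffer into the variant matching the trace ABI. `long_size`
// and `big_endian` are assumed to come from the trace.dat header.
fn decode_ulong_array(raw: &[u8], long_size: usize, big_endian: bool) -> UnsignedLongArray {
    match long_size {
        4 => UnsignedLongArray::Bit32(
            raw.chunks_exact(4)
                .map(|c| c.try_into().unwrap())
                .map(|b| if big_endian { u32::from_be_bytes(b) } else { u32::from_le_bytes(b) })
                .collect(),
        ),
        _ => UnsignedLongArray::Bit64(
            raw.chunks_exact(8)
                .map(|c| c.try_into().unwrap())
                .map(|b| if big_endian { u64::from_be_bytes(b) } else { u64::from_le_bytes(b) })
                .collect(),
        ),
    }
}
```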

> 
>>
>> You might object that if the tool does not know about cpumask,
>> it does not matter "how it breaks" as the display will be useless anyway,
>> but that is not true. A parsing library might just parse up to
>> its knowledge limit and return the most elaborate it can for a given field.
>> It's acceptable for that representation to not be elaborated with the full
>> semantic expected by the end user, but it should not return
>> something that is lying on its nature. For example, it would be sane for
>> the user to assert the size of an array of pointers to be a multiple
>> of a pointer size. cpumask is currently an array of unsigned long but there is
>> nothing preventing a similar type to be based on an array of u8.
>> Such a type would also have different endianness handling and the resulting buffer
>> would be garbage.
>>
>>
>> To fix that issue, I propose to expose the following to userspace:
>> 1. The binary representation type (unsigned long[] in cpumask case).
>> 2. An (ordered list of) semantic type that may or may not be the same as 1.
>>
>> Type (1) can be used to guarantee correct handling of endianness and a reasonable
>> default display, while (2) allows any sort of fancy interpretation, all that while preserving
>> forward compatibility. For cpumask, this would give:
>> 1. unsigned long []
>> 2. bitmask, cpumask
>>
>> A consumer could know about bitmask as they are likely used in multiple places,
>> but not about cpumask specifically (e.g. assuming cpumask is a type recently introduced).
>> Displaying as a list of bits set in the mask would already allow proper formatting, and
>> knowing it's actually a cpumask can allow fancier behaviors.
>>
>>   From an event format perspective, this could preserve reasonable backward compat
>> by simply adding another property:
>>
>>     field:__data_loc unsigned long[] mask;    offset:36;      size:4; signed:0; semantic_type:bitmask,cpumask;
>>
>> By default, "semantic_type" would simply have the same value as the normal type.
> 
> The problem with the above is that it adds a new field, and I have to check
> if that doesn't break existing tooling.

Yes, I was hesitating to propose that as well. I think it would be reasonable to allow for this
sort of extension in man trace.dat in the future but maybe that's something for version 8.
In my implementation I parse that to a map anyway, so unknown properties will simply be ignored.
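A sketch of that map-based parsing, where an unknown property such as "semantic_type" simply becomes one more entry rather than a parse error:

```rust
use std::collections::HashMap;

// Parse the ";"-separated "key:value" properties of a format line into a
// map. Properties the library does not know about are carried along rather
// than rejected, which preserves forward compatibility.
fn parse_properties(line: &str) -> HashMap<&str, &str> {
    line.split(';')
        .filter_map(|part| part.trim().split_once(':'))
        .map(|(k, v)| (k.trim(), v.trim()))
        .collect()
}
```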

> Another possibility is that I can add parsing to the format that is exposed
> to user space and simply s/__cpumask *[]/__cpumask[]/
> 
> Which will get rid of the pointer array of cpu masks.

My main concern is that if the item size info becomes available in the future as a field property,
we won't be able to make dynamic arrays of opaque types if the array notation is
already taken for single items.

So really ideally we want "__data_loc cpumask" and not "__data_loc cpumask[]", as the latter is an
array of cpumasks, and the former a single one. Or maybe something like "__data_loc SINGLE_cpumask[]".
That format would still be parseable by current userspace tools as an opaque type since SINGLE_cpumask
is a valid C identifier.

Beyond extensibility, it would make __data_loc inconsistent:
* for basic types like u32, it's completely expected to be an actual array.
* but for opaque types, it's expected to be a single item and not an array at all.

> 
>>
>> This applies to any type, not just dynamic arrays.
>>
> 
> Let me know if the above does break existing user space and I'll revert it.

I could experiment with trace-cmd but I'm not familiar with everything out there decoding that.
The parser I'm working on is 1.5 week old and unreleased so no change would be an issue as long
as it's possible to distinguish new formats from old ones (to process old traces).


> -- Steve

Douglas


[1] Vague example of how a user application would use serde and the trace parser:

// Serde doc: https://serde.rs/
use serde::Deserialize;

// The yet-to-exist trace parser, pulled from crates.io like any other dependency.
// The user is not expected to modify that ftraceparser code, just the same way you
// are not expected to modify libtraceevent to make use of it in your own application.
use ftraceparser::read_trace;


// Top-level enum that contains a variant for each event of interest.
#[derive(Deserialize)]
enum TraceEvent {
	#[serde(rename = "sched_switch")]
	EventSchedSwitch(EventSchedSwitchFields),

	// This includes a custom event, that is e.g. defined in a kernel module, or
	// maybe it's a brand new event that the userspace tooling does not handle yet.
	// No ftraceparser code has knowledge of that type.
	#[serde(rename = "my_custom_event")]
	MyCustomEvent(MyCustomEventFields),
}

#[derive(Deserialize)]
pub struct EventSchedSwitchFields {
     pub common_comm: Comm,
     pub common_pid: PID,
     pub common_cpu: CPU,
     pub prev_comm: Comm,
     pub next_comm: Comm,
     pub prev_state: u32,
     pub prev_pid: PID,
     pub next_pid: PID,
     pub next_prio: Prio,
     pub prev_prio: Prio,
}

// Custom event struct, with one field containing an opaque fancy type
// that the parser does not know about (e.g. the event is added by a kernel
// module, or a kernel dev has a custom junk patch with quick&dirty debugging aids)
#[derive(Deserialize)]
pub struct MyCustomEventFields {
	#[serde(deserialize_with = "decode_fancy")]
	fancy_field: Fancy,
}


// If we use the automatic Deserialize implementation provided by #[derive(Deserialize)],
// it will be under the assumption that the parser is able to split the trace into Fancy's
// fields (or a sequence if it were a simple wrapper around an array type) so they can be mapped
// onto that struct's fields. This is by definition not possible because the parser has no clue about
// the content of Fancy. ftraceparser only knows how to split the fields of MyCustomEventFields because
// of the format in the header. No such format header exists for Fancy.
// We therefore use a custom "decode_fancy()" function where we ask for a byte buffer and decode it manually.
// We could also provide a manual Deserialize implementation to avoid needing "#[serde(deserialize_with=...)]".
struct Fancy {
	...
}

fn decode_fancy<'de, D: serde::Deserializer<'de>>(deserializer: D) -> Result<Fancy, D::Error> {
	// Request a buffer of bytes and decode it into a Fancy value. All the trace parser needs
	// to know is how to get the chunk of data we asked for; we handle the rest here.
	let bytes = <&[u8]>::deserialize(deserializer)?;
	...
}



fn process(event: TraceEvent) {
	...
}

fn main() {
	...

	// Read the trace, asking for an iterator of TraceEvent values.
	let events = read_trace::<TraceEvent>("path/to/trace.dat");
	for event in events {
		process(event);
	}
}




* Re: [for-next][PATCH 02/11] tracing: Add __cpumask to denote a trace event field that is a cpumask_t
  2022-12-12 22:19       ` Douglas Raillard
@ 2022-12-12 23:53         ` Steven Rostedt
  2022-12-13 14:20           ` Douglas Raillard
  0 siblings, 1 reply; 24+ messages in thread
From: Steven Rostedt @ 2022-12-12 23:53 UTC (permalink / raw)
  To: Douglas Raillard
  Cc: linux-kernel, Masami Hiramatsu, Andrew Morton, Valentin Schneider

On Mon, 12 Dec 2022 22:19:17 +0000
Douglas Raillard <douglas.raillard@arm.com> wrote:

> >>>     field:__data_loc cpumask_t *[] mask;    offset:36;      size:4; signed:0;  
> > 
> > The current parsing tools break the above into:
> > 
> >   "field:" "__data_loc" <some-type> "[]" <var-name> ";" "offset:"
> >   <offset> ";" "size:" "<size>" ";" "signed:" <signed> ";"
> > 
> > Where the <some-type> really can be anything, and in lots of cases, it is.
> > Thus it's only a hint for the tooling, and has never been limited to what
> > they are.  
> 
> While we are having a look at that, what is the exact format of "<some-type>" ?

It's literally anything that the kernel accepts and can be in a kernel
structure, as the macro in the kernel defines both the internal structure
and what gets printed in the field. This is why the parsing tooling never
was very strict on <some-type>: a lot of times it's something that it
does not recognize. Any new type added in a trace event will show up here.


> > 
> > The above is for the kernel to build.  
> 
> That was my understanding that the comparison issue is related to in-kernel filtering ?
> If that's the case, I completely agree that the type kernel code sees does not _have_
> to be the same thing that is exposed to userspace if that simplifies the problem.

Yes, and note the patch I sent out to fix this.

> 
> >>
> >> How is tooling expected to distinguish between a real dynamic array of pointers
> >> from a type that is using dynamic arrays as an "implementation detail"
> >> with a broken type description ? Any reasonable
> >> interpretation of that type by the consuming tool will be broken
> >> unless it specifically knows about __data_loc cpumask*[].  
> > 
> > I'm curious as to what the tool does differently with the above. What tool are
> > you using? Does it just give up on how to print it?  
> 
> I'm implementing the Rust libtraceevent/libtracecmd prototype. The use case I'm trying

Duh! Of course. This is what we discussed at the tracing summit. I just now
noticed your name ;-)

> to address in my first draft is:
> 
> * a parser for trace.dat exposing a stable API that does not need to be updated every day.
> * A serde backend using that parser
> * Any "fancy" type (i.e. that is not a basic scalar type or string) is handled by
>    the consumer of the deserializer. This means the backend must be able to fallback
>    on providing sensible raw data that can be processed by the user.

That third part may be difficult with the above issue I mentioned.

Just do:

 git grep '__field(' | cut -d',' -f1 | cut -d'(' -f2 | sed -e 's/[    ]*//'
 | sort -u

to see what's in the kernel.


> 
> In a more general way, I'm trying to structure the code so that the generic trace.dat parsing
> lib will be able to handle opaque types by returning a sensibly-typed buffer to the caller. The
> caller then decodes it the way it wants. This would apply equally well to a C API.
> 
> That is especially important for user-defined fancy types, e.g. in a kernel module. We cannot
> reasonably expect users to modify the parsing library itself when they could handle the
> decoding on their side with an annotation and a custom decoder [1].
> 
> Beyond that, any tool that reads the trace and pretty prints it is bound to provide
> broken printing for these types: there will be a date T1 where a new fancy type is introduced
> in the trace. Until date T2 when the tool gets explicit support for the type, it will
> display the thing as an array of pointers. Interpreting a bitmask as an array of pointer makes
> no sense for anyone (machine or human). It so happens that unsigned long == ptr size so it
> kind of makes sense for cpumask, but an opaque type might internally be u8[], in which case,
> printing as void* [] is meaningless. It would likely be better to display a placeholder
> or a plain u8[]. The "semantic_type" field property proposal is an improvement of that.

Note, I put much more effort into the offset, size and sign than the type.
But is it only for arrays that you have these restrictions, or for any
field type?

> 
> 
> > )  
> >> However, the set of types using that trick is unbounded so forward
> >> compatibility is impossible to ensure. On top of that, an actual
> >> dynamic array of cpumask pointers becomes impossible to represent.  
> > 
> > I never thought about a user case where we print out an array of cpumask
> > pointers.  
> 
> That case (array of pointer) might be a bit far fetched but kernel already contains
> weird stuff such as pointers to opaque struct, I suppose as a way of giving a unique
> ID to something. The solution we settle on for cpumask should be applicable to any opaque
> type present and future so baking such limitations in the foundations of the event system
> does not sound like a terribly good idea.

The above should be moot by now, as my new patch removes the "*" pointer.

> 
> After reflecting on how the caller of the deserializer can process opaque data, I came
> to the conclusion that it will have to know the underlying type (unsigned long[]) and
> ask for an enum such as:
> 
> enum UnsignedLongArray {
> 	Bit32(Vec<u32>),
> 	Bit64(Vec<u64>),
> }
> 
> When asked for UnsignedLongArray, the serde backend will use the ABI info from the trace
> to select the appropriate variant.
> 
> >   
> >>
> >> You might object that if the tool does not know about cpumask,
> >> it does not matter "how it breaks" as the display will be useless anyway,
> >> but that is not true. A parsing library might just parse up to
> >> its knowledge limit and return the most elaborate it can for a given field.
> >> It's acceptable for that representation to not be elaborated with the full
> >> semantic expected by the end user, but it should not return
> >> something that is lying on its nature. For example, it would be sane for
> >> the user to assert the size of an array of pointers to be a multiple
> >> of a pointer size. cpumask is currently an array of unsigned long but there is
> >> nothing preventing a similar type to be based on an array of u8.
> >> Such a type would also have different endianness handling and the resulting buffer
> >> would be garbage.
> >>
> >>
> >> To fix that issue, I propose to expose the following to userspace:
> >> 1. The binary representation type (unsigned long[] in cpumask case).
> >> 2. An (ordered list of) semantic type that may or may not be the same as 1.
> >>
> >> Type (1) can be used to guarantee correct handling of endianness and a reasonable
> >> default display, while (2) allows any sort of fancy interpretation, all that while preserving
> >> forward compatibility. For cpumask, this would give:
> >> 1. unsigned long []
> >> 2. bitmask, cpumask
> >>
> >> A consumer could know about bitmask as they are likely used in multiple places,
> >> but not about cpumask specifically (e.g. assuming cpumask is a type recently introduced).
> >> Displaying as a list of bits set in the mask would already allow proper formatting, and
> >> knowing it's actually a cpumask can allow fancier behaviors.
> >>
> >>   From an event format perspective, this could preserve reasonable backward compat
> >> by simply adding another property:
> >>
> >>     field:__data_loc unsigned long[] mask;    offset:36;      size:4; signed:0; semantic_type:bitmask,cpumask;
> >>
> >> By default, "semantic_type" would simply have the same value as the normal type.  
> > 
> > The problem with the above is that it adds a new field, and I have to check
> > if that doesn't break existing tooling.  
> 
> Yes, I was hesitating to propose that as well. I think it would be reasonable to allow for this
> sort of extension in man trace.dat in the future but maybe that's something for version 8.
> In my implementation I parse that to a map anyway, so unknown properties will simply be ignored.
> 
> > Another possibility is that I can add parsing to the format that is exposed
> > to user space and simply s/__cpumask *[]/__cpumask[]/
> > 
> > Which will get rid of the pointer array of cpu masks.  
> 
> My main concern is if the item size info becomes available in the future as a field property,
> we won't be able to make dynamic arrays of opaque types if the array notation is
> already taken for single items.

I believe we already have opaque type arrays.

And the current libtraceevent allows for plugins that will override the
print-fmt parsing to allow for the plugin to know how to parse it properly.


> 
> So really ideally we want "__data_loc cpumask" and not "__data_loc cpumask[]", as the latter is an
> array of cpumasks, and the former a single one. Or maybe something like "__data_loc SINGLE_cpumask[]".

The [] just means that the size is not defined, and that current tools will
parse it correctly, even if it does not know what cpumask is.

An array of cpumask_t is redundant, as it really becomes one bigger
cpumask_t.


> That format would still be parseable by current userspace tools as an opaque type since SINGLE_cpumask
> is a valid C identifier.
> 
> Beyond extensibility, it would make __data_loc inconsistent:
> * for basic types like u32, it's completely expected to be an actual array.
> * but for opaque types, it's expected to be a single item and not an array at all.
> 
> >   
> >>
> >> This applies to any type, not just dynamic arrays.
> >>  
> > 
> > Let me know if the above does break existing user space and I'll revert it.  
> 
> I could experiment with trace-cmd but I'm not familiar with everything out there decoding that.
> The parser I'm working on is 1.5 week old and unreleased so no change would be an issue as long
> as it's possible to distinguish new formats from old ones (to process old traces).
> 

[ /me goes and does some testing ...]

Actually, the cpumask[] fails to parse regardless, it seems, so I'm free to
change this any way I want and still not cause regressions.

This means I can make it this:

  field:__data_loc cpumask_t mask;    offset:36;      size:4; signed:0;


and it should not cause any regression, as it already fails to parse and is
new anyway.

I get this from the sample module:

    event-sample-861   [000]    99.584942: foo_bar:              [FAILED TO PARSE] foo=hello bar=3 list=ARRAY[01, 00, 00, 00, 02, 00, 00, 00, 03, 00, 00, 00] str=Frodo cpus=ARRAY[ff, 00, 00, 00, 00, 00, 00, 00] cpum=ARRAY[ff, 00, 00, 00, 00, 00, 00, 00] vstr=iter=3

-- Steve


* Re: [for-next][PATCH 02/11] tracing: Add __cpumask to denote a trace event field that is a cpumask_t
  2022-12-12 23:53         ` Steven Rostedt
@ 2022-12-13 14:20           ` Douglas Raillard
  2022-12-13 15:11             ` Steven Rostedt
  2022-12-13 19:50             ` Douglas Raillard
  0 siblings, 2 replies; 24+ messages in thread
From: Douglas Raillard @ 2022-12-13 14:20 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Masami Hiramatsu, Andrew Morton, Valentin Schneider

On 12-12-2022 23:53, Steven Rostedt wrote:
> On Mon, 12 Dec 2022 22:19:17 +0000
> Douglas Raillard <douglas.raillard@arm.com> wrote:
> 
>>>>>      field:__data_loc cpumask_t *[] mask;    offset:36;      size:4; signed:0;
>>>
>>> The current parsing tools break the above into:
>>>
>>>    "field:" "__data_loc" <some-type> "[]" <var-name> ";" "offset:"
>>>    <offset> ";" "size:" "<size>" ";" "signed:" <signed> ";"
>>>
>>> Where the <some-type> really can be anything, and in lots of cases, it is.
>>> Thus it's only a hint for the tooling, and has never been limited to what
>>> they are.
>>
>> While we are having a look at that, what is the exact format of "<some-type>" ?
> 
> It's literally anything that the kernel accepts and can be in a kernel
> structure, as the macro in the kernel defines both the internal structure
> and what gets printed in the field. This is why the parsing tooling never
> was very strict on <some-type>: a lot of times it's something that it
> does not recognize. Any new type added in a trace event will show up here.
> 
> 
>>>
>>> The above is for the kernel to build.
>>
>> That was my understanding that the comparison issue is related to in-kernel filtering ?
>> If that's the case, I completely agree that the type kernel code sees does not _have_
>> to be the same thing that is exposed to userspace if that simplifies the problem.
> 
> Yes, and note the patch I sent out to fix this.

I did, that's some seriously fast TAT :)

>>
>>>>
>>>> How is tooling expected to distinguish between a real dynamic array of pointers
>>>> from a type that is using dynamic arrays as an "implementation detail"
>>>> with a broken type description ? Any reasonable
>>>> interpretation of that type by the consuming tool will be broken
>>>> unless it specifically knows about __data_loc cpumask*[].
>>>
>>> I'm curious as to what the tool does differently with the above. What tool are
>>> you using? Does it just give up on how to print it?
>>
>> I'm implementing the Rust libtraceevent/libtracecmd prototype. The use case I'm trying
> 
> Duh! Of course. This is what we discussed at the tracing summit. I just now
> noticed your name ;-)
> 
>> to address in my first draft is:
>>
>> * a parser for trace.dat exposing a stable API that does not need to be updated every day.
>> * A serde backend using that parser
>> * Any "fancy" type (i.e. that is not a basic scalar type or string) is handled by
>>     the consumer of the deserializer. This means the backend must be able to fallback
>>     on providing sensible raw data that can be processed by the user.
> 
> That third part may be difficult with the above issue I mentioned.
> 
> Just do:
> 
>   git grep '__field(' | cut -d',' -f1 | cut -d'(' -f2 | sed -e 's/[    ]*//'
>   | sort -u
> 
> to see what's in the kernel.

There are lots of types, but as long as the caller knows what to ask for, it shouldn't be an issue.
Pretty printing the trace is obviously an important aspect and ideally requires the parser to know
how to format everything.

But when it comes to other processing in a compiled language, it's not a big burden to let people
declare the events they require and the expected fields + types so they can get the data into their
own struct (e.g. as with serde or any equivalent technology).

> 
>>
>> In a more general way, I'm trying to structure the code so that the generic trace.dat parsing
>> lib will be able to handle opaque types by returning a sensibly-typed buffer to the caller. The
>> caller then decodes it the way it wants. This would apply equally well to a C API.
>>
>> That is especially important for user-defined fancy types, e.g. in a kernel module. We cannot
>> reasonably expect users to modify the parsing library itself when they could handle the
>> decoding on their side with an annotation and a custom decoder [1].
>>
>> Beyond that, any tool that reads the trace and pretty prints it is bound to provide
>> broken printing for these types: there will be a date T1 where a new fancy type is introduced
>> in the trace. Until date T2 when the tool gets explicit support for the type, it will
>> display the thing as an array of pointers. Interpreting a bitmask as an array of pointer makes
>> no sense for anyone (machine or human). It so happens that unsigned long == ptr size so it
>> kind of makes sense for cpumask, but an opaque type might internally be u8[], in which case,
>> printing as void* [] is meaningless. It would likely be better to display a placeholder
>> or a plain u8[]. The "semantic_type" field property proposal is an improvement of that.
> 
> Note, I put much more effort into the offset, size and sign than the type.
> But is this only for the arrays that you have these restrictions, or any
> field type?

In terms of broken pretty printing, it's a general issue not limited to dynamic arrays.
The only ways pretty printing can possibly work for an opaque type the parser has no
specific knowledge of are:
1. The type is not actually opaque, i.e. it comes with some decoding schema (just like the events have
    a schema listing their fields + types)
2. The type is opaque, but also ships with an executable description of how to print it.
    E.g. if there was a WASM/eBPF/whatever bytecode printing routine made available to userspace.

Option (2) is not so appealing as it's both hard to achieve and only allows a fixed set of
behaviors for a type. Option (1) is a lot easier and allows the behaviors to be defined
on the user side.

Wild idea: include the BTF blob in the trace.dat header so no type is opaque anymore. The printing
issue is not entirely solved this way (e.g. cpumask still needs some plugin to be displayed as a list
of CPUs), but we could at least print all structs in "raw" mode and enum symbolically.

That could also allow creating a quick&dirty way of defining a proper event (aka not trace_printk()):

	#define SIMPLE_TRACE_EVENT(type, fields) \
	struct type fields; \
	TRACE_EVENT(type, \
		TP_PROTO(struct type *data), \
		TP_ARGS(data), \
		TP_STRUCT__entry(__field(struct type, data)), \
		TP_fast_assign(__entry->data = *data;), \
		TP_printk("print in raw mode to display the data") \
	);
	#define SIMPLE_TRACE(type, fields) trace_##type(&(struct type)fields)


	SIMPLE_TRACE_EVENT(myevent, {
		char name[11];
		int foobar;
	});
	
	SIMPLE_TRACE(myevent, {.name = "hello", .foobar = 42});

The format string could be either kernel-generated based on BTF or userspace could be expected
to make its own use of BTF.

> 
>>
>>
>>> )
>>>> However, the set of types using that trick is unbounded so forward
>>>> compatibility is impossible to ensure. On top of that, an actual
>>>> dynamic array of cpumask pointers becomes impossible to represent.
>>>
>>> I never thought about a user case where we print out an array of cpumask
>>> pointers.
>>
>> That case (array of pointer) might be a bit far fetched but kernel already contains
>> weird stuff such as pointers to opaque struct, I suppose as a way of giving a unique
>> ID to something. The solution we settle on for cpumask should be applicable to any opaque
>> type present and future so baking such limitations in the foundations of the event system
>> does not sound like a terribly good idea.
> 
> The above should be moot by now, as my new patch removes the "*" pointer.
> 
>>
>> After reflecting on how the caller of the deserializer can process opaque data, I came
>> to the conclusion that it will have to know the underlying type (unsigned long[]) and
>> ask for an enum such as:
>>
>> enum UnsignedLongArray {
>> 	Bit32(Vec<u32>),
>> 	Bit64(Vec<u64>),
>> }
>>
>> When asked for UnsignedLongArray, the serde backend will use the ABI info from the trace
>> to select the appropriate variant.
>>
>>>    
>>>>
>>>> You might object that if the tool does not know about cpumask,
>>>> it does not matter "how it breaks" as the display will be useless anyway,
>>>> but that is not true. A parsing library might just parse up to
>>>> its knowledge limit and return the most elaborate representation it can for a given field.
>>>> It's acceptable for that representation to not be elaborated with the full
>>>> semantic expected by the end user, but it should not return
>>>> something that is lying on its nature. For example, it would be sane for
>>>> the user to assert the size of an array of pointers to be a multiple
>>>> of a pointer size. cpumask is currently an array of unsigned long but there is
>>>> nothing preventing a similar type to be based on an array of u8.
>>>> Such a type would also have different endianness handling and the resulting buffer
>>>> would be garbage.
>>>>
>>>>
>>>> To fix that issue, I propose to expose the following to userspace:
>>>> 1. The binary representation type (unsigned long[] in cpumask case).
>>>> 2. An (ordered list of) semantic type that may or may not be the same as 1.
>>>>
>>>> Type (1) can be used to guarantee correct handling of endianness and a reasonable
>>>> default display, while (2) allows any sort of fancy interpretation, all that while preserving
>>>> forward compatibility. For cpumask, this would give:
>>>> 1. unsigned long []
>>>> 2. bitmask, cpumask
>>>>
>>>> A consumer could know about bitmask as they are likely used in multiple places,
>>>> but not about cpumask specifically (e.g. assuming cpumask is a type recently introduced).
>>>> Displaying as a list of bits set in the mask would already allow proper formatting, and
>>>> knowing it's actually a cpumask can allow fancier behaviors.
>>>>
>>>>    From an event format perspective, this could preserve reasonable backward compat
>>>> by simply adding another property:
>>>>
>>>>      field:__data_loc unsigned long[] mask;    offset:36;      size:4; signed:0; semantic_type:bitmask,cpumask;
>>>>
>>>> By default, "semantic_type" would simply have the same value as the normal type.
>>>
>>> The problem with the above is that it adds a new field, and I have to check
>>> if that doesn't break existing tooling.
>>
>> Yes, I was hesitating to propose that as well. I think it would be reasonable to allow for this
>> sort of extension in man trace.dat in the future but maybe that's something for version 8.
>> In my implementation I parse that to a map anyway, so unknown properties will simply be ignored.
>>
>>> Another possibility is that I can add parsing to the format that is exposed
>>> to user space and simply s/__cpumask *[]/__cpumask[]/
>>>
>>> Which will get rid of the pointer array of cpu masks.
>>
>> My main concern is if the item size info becomes available in the future as a field property,
>> we won't be able to make dynamic arrays of opaque types if the array notation is
>> already taken for single items.
> 
> I believe we already have opaque type arrays.
> 
> And the current libtraceevent allows for plugins that will override the
> print-fmt parsing to allow for the plugin to know how to parse it properly.
> 
> 
>>
>> So really ideally we want "__data_loc cpumask" and not "__data_loc cpumask[]", as the latter is an
>> array of cpumasks, and the former a single one. Or maybe something like "__data_loc SINGLE_cpumask[]".
> 
> The [] just means that the size is not defined, and that current tools will
> parse it correctly, even if it does not know what cpumask is.
> 
> An array of cpumask_t is redundant, as it really becomes one bigger
> cpumask_t.

It's not. A list of 21 CPU affinity masks in a system with 5 CPUs is definitely not the same thing as one mask
of size 105. Yes they are isomorphic but so is a task's comm (15 char * 7 bits = 105 bits). We are trying to
express what the data represent, not merely the amount of info required to encode them.

Now I'm not saying that we can create such array of cpumasks using dynamic arrays in the current
state of the kernel macro infrastructure, but it's an implementation detail. That should be able to evolve
independently of the data model exposed to userspace.

> 
>> That format would still be parseable by current userspace tools as an opaque type since SINGLE_cpumask
>> is a valid C identifier.
>>
>> Beyond extensibility, it would make __data_loc inconsistent:
>> * for basic types like u32, it's completely expected to be an actual array.
>> * but for opaque types, it's expected to be a single item and not an array at all.
>>
>>>    
>>>>
>>>> This applies to any type, not just dynamic arrays.
>>>>   
>>>
>>> Let me know if the above does break existing user space and I'll revert it.
>>
>> I could experiment with trace-cmd but I'm not familiar with everything out there decoding that.
>> The parser I'm working on is 1.5 week old and unreleased so no change would be an issue as long
>> as it's possible to distinguish new formats from old ones (to process old traces).
>>
> 
> [ /me goes and does some testing ...]
> 
> Actually, the cpumask[] fails to parse regardless it seems, so I'm free to
> change this anyway I want and still not cause regressions.
> 
> This means I can make it this:
> 
>    field:__data_loc cpumask_t mask;    offset:36;      size:4; signed:0;
> 
> 
> and it should not cause any regression, as it already fails as this is new
> anyway.
> 
> I get this from the sample module:
> 
>      event-sample-861   [000]    99.584942: foo_bar:              [FAILED TO PARSE] foo=hello bar=3 list=ARRAY[01, 00, 00, 00, 02, 00, 00, 00, 03, 00, 00, 00] str=Frodo cpus=ARRAY[ff, 00, 00, 00, 00, 00, 00, 00] cpum=ARRAY[ff, 00, 00, 00, 00, 00, 00, 00] vstr=iter=3

Niiice. So we can have both real arrays of things like an "__data_loc pid_t[]" and at the same
time have types that happen to need dynamic amount of storage like cpumask_t. So __data_loc can
now be treated almost like a new type qualifier.

The only case I can think of where parsing would not follow regular C abstract declaration syntax is a type like that:

	__data_loc int [3][]

The outer-most array is by definition the dynamic one, so "[]". In normal C, [3] and [] would be swapped as
the outer-most array comes first. That's not too bad though as it is not ambiguous and easy to fixup directly
in the parse tree.

-- Douglas


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [for-next][PATCH 02/11] tracing: Add __cpumask to denote a trace event field that is a cpumask_t
  2022-12-13 14:20           ` Douglas Raillard
@ 2022-12-13 15:11             ` Steven Rostedt
  2022-12-13 17:40               ` Douglas Raillard
  2022-12-13 19:50             ` Douglas Raillard
  1 sibling, 1 reply; 24+ messages in thread
From: Steven Rostedt @ 2022-12-13 15:11 UTC (permalink / raw)
  To: Douglas Raillard
  Cc: linux-kernel, Masami Hiramatsu, Andrew Morton, Valentin Schneider

On Tue, 13 Dec 2022 14:20:12 +0000
Douglas Raillard <douglas.raillard@arm.com> wrote:


> >>>
> >>> The above is for the kernel to build.  
> >>
> >> That was my understanding that the comparison issue is related to in-kernel filtering ?
> >> If that's the case, I completely agree that the type kernel code sees does not _have_
> >> to be the same thing that is exposed to userspace if that simplifies the problem.  
> > 
> > Yes, and note the patch I sent out to fix this.  
> 
> I did, that's some seriously fast TAT :)

I'm going to pull that one in and start testing, so I can push it out in
this merge window.



> > That third part may be difficult with the above issue I mentioned.
> > 
> > Just do:
> > 
> >   git grep '__field(' | cut -d',' -f1 | cut -d'(' -f2 | sed -e 's/[    ]*//'
> >   | sort -u
> > 
> > to see what's in the kernel.  
> 
> There are lots of types, but as long as the caller knows what to ask for, it shouldn't be an issue.
> Pretty printing the trace is obviously an important aspect and ideally requires the parser to know
> how to format everything.
> 
> But when it comes to other processing in a compiled language, it's not a big burden to let people
> declare the events they require and the expected fields + types so they can get the data into their
> own struct (e.g. as with serde or any equivalent technology).

That's what is done today. The size and offset is how the tools can get to
the data and it knows what to do with it.

I'm not sure how rust can handle this type of opaque type scheme.


> > Note, I put much more effort into the offset, size and sign than the type.
> > But is this only for the arrays that you have these restrictions, or any
> > field type?  
> 
> In terms of broken pretty printing, it's a general issue not limited to dynamic arrays.

Yeah, the pretty printing can easily fail, as it can have anything that the
kernel can do. Including calling functions that are not available to user
space. This is why the fallback is always back to size, offset and sign.

> The only ways pretty printing for an opaque type can possibly work for new types the parser has no
> specific knowledge of are:
> 1. The type is not actually opaque, i.e. it comes with some decoding schema (just like the events have
>     a schema listing their fields + types)
> 2. The type is opaque, but also ships with an executable description of how to print it.
>     E.g. if there was a WASM/eBPF/whatever bytecode printing routine made available to userspace.
> 
> Option (2) is not so appealing as it's both hard to achieve and only allows a fixed set of
> behaviors for a type. Option (1) is a lot easier and allows the behaviors to be defined
> on the user side.
> 
> Wild idea: include the BTF blob in the trace.dat header so no type is opaque anymore. The printing
> issue is not entirely solved this way (e.g. cpumask still needs some plugin to be displayed as a list
> of CPUs), but we could at least print all structs in "raw" mode and enum symbolically.

And how big is that blob?

I'm not against the idea, but I would like it to only hold what is needed.

> 
> That could also allow creating a quick&dirty way of defining a proper event (aka not trace_printk()):


I prefer not to have "quick&dirty" ;-)

> 
> 	#define SIMPLE_TRACE_EVENT(type, fields) \
> 	struct type fields;	
> 	TRACE_EVENT(type, \
> 		TP_PROTO(struct type *data), \
> 		TP_ARGS(data), \
> 		TP_STRUCT__entry(__field(struct type, data)), \
> 		TP_fast_assign(__entry->data = *data;), \
> 		TP_printk("print in raw mode to display the data"), \
> 	);
> 	#define SIMPLE_TRACE(type, fields) trace_struct_##type(&(struct type)fields)
> 
> 
> 	SIMPLE_TRACE_EVENT(myevent, {
> 		char name[11];
> 		int foobar;
> 	});
> 	
> 	SIMPLE_TRACE(myevent, {.name = "hello", .foobar = 42});


> 
> The format string could be either kernel-generated based on BTF or userspace could be expected
> to make its own use of BTF.

What's the use case for the above?

> 

> >>
> >> So really ideally we want "__data_loc cpumask" and not "__data_loc cpumask[]", as the latter is an
> >> array of cpumasks, and the former a single one. Or maybe something like "__data_loc SINGLE_cpumask[]".  
> > 
> > The [] just means that the size is not defined, and that current tools will
> > parse it correctly, even if it does not know what cpumask is.
> > 
> > An array of cpumask_t is redundant, as it really becomes one bigger
> > cpumask_t.  
> 
> It's not. A list of 21 CPU affinity masks in a system with 5 CPUs is definitely not the same thing as one mask
> of size 105. Yes they are isomorphic but so is a task's comm (15 char * 7 bits = 105 bits). We are trying to
> express what the data represent, not merely the amount of info required to encode them.
> 
> Now I'm not saying that we can create such array of cpumasks using dynamic arrays in the current
> state of the kernel macro infrastructure, but it's an implementation detail. That should be able to evolve
> independently of the data model exposed to userspace.
> 
> >   
> >> That format would still be parseable by current userspace tools as an opaque type since SINGLE_cpumask
> >> is a valid C identifier.
> >>
> >> Beyond extensibility, it would make __data_loc inconsistent:
> >> * for basic types like u32, it's completely expected to be an actual array.
> >> * but for opaque types, it's expected to be a single item and not an array at all.
> >>  
> >>>      
> >>>>
> >>>> This applies to any type, not just dynamic arrays.
> >>>>     
> >>>
> >>> Let me know if the above does break existing user space and I'll revert it.  
> >>
> >> I could experiment with trace-cmd but I'm not familiar with everything out there decoding that.
> >> The parser I'm working on is 1.5 week old and unreleased so no change would be an issue as long
> >> as it's possible to distinguish new formats from old ones (to process old traces).
> >>  
> > 
> > [ /me goes and does some testing ...]
> > 
> > Actually, the cpumask[] fails to parse regardless it seems, so I'm free to
> > change this anyway I want and still not cause regressions.
> > 
> > This means I can make it this:
> > 
> >    field:__data_loc cpumask_t mask;    offset:36;      size:4; signed:0;
> > 
> > 
> > and it should not cause any regression, as it already fails as this is new
> > anyway.
> > 
> > I get this from the sample module:
> > 
> >      event-sample-861   [000]    99.584942: foo_bar:              [FAILED TO PARSE] foo=hello bar=3 list=ARRAY[01, 00, 00, 00, 02, 00, 00, 00, 03, 00, 00, 00] str=Frodo cpus=ARRAY[ff, 00, 00, 00, 00, 00, 00, 00] cpum=ARRAY[ff, 00, 00, 00, 00, 00, 00, 00] vstr=iter=3  
> 
> Niiice. So we can have both real arrays of things like an "__data_loc pid_t[]" and at the same
> time have types that happen to need dynamic amount of storage like cpumask_t. So __data_loc can
> now be treated almost like a new type qualifier.
> 
> The only case I can think of where parsing would not follow regular C abstract declaration syntax is a type like that:
> 
> 	__data_loc int [3][]
> 
> The outer-most array is by definition the dynamic one, so "[]". In normal C, [3] and [] would be swapped as
> the outer-most array comes first. That's not too bad though as it is not ambiguous and easy to fixup directly
> in the parse tree.
> 

Well, currently we do not have a way to create the above, so I'm not
worried.

-- Steve


* Re: [for-next][PATCH 02/11] tracing: Add __cpumask to denote a trace event field that is a cpumask_t
  2022-12-13 15:11             ` Steven Rostedt
@ 2022-12-13 17:40               ` Douglas Raillard
  2022-12-13 19:44                 ` Steven Rostedt
  0 siblings, 1 reply; 24+ messages in thread
From: Douglas Raillard @ 2022-12-13 17:40 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Masami Hiramatsu, Andrew Morton, Valentin Schneider

On 13-12-2022 15:11, Steven Rostedt wrote:
> On Tue, 13 Dec 2022 14:20:12 +0000
> Douglas Raillard <douglas.raillard@arm.com> wrote:
> 
> 
>>>>>
>>>>> The above is for the kernel to build.
>>>>
>>>> That was my understanding that the comparison issue is related to in-kernel filtering ?
>>>> If that's the case, I completely agree that the type kernel code sees does not _have_
>>>> to be the same thing that is exposed to userspace if that simplifies the problem.
>>>
>>> Yes, and note the patch I sent out to fix this.
>>
>> I did, that's some seriously fast TAT :)
> 
> I'm going to pull that one in and start testing, so I can push it out in
> this merge window.
> 
> 
> 
>>> That third part may be difficult with the above issue I mentioned.
>>>
>>> Just do:
>>>
>>>    git grep '__field(' | cut -d',' -f1 | cut -d'(' -f2 | sed -e 's/[    ]*//'
>>>    | sort -u
>>>
>>> to see what's in the kernel.
>>
>> There are lots of types, but as long as the caller knows what to ask for, it shouldn't be an issue.
>> Pretty printing the trace is obviously an important aspect and ideally requires the parser to know
>> how to format everything.
>>
>> But when it comes to other processing in a compiled language, it's not a big burden to let people
>> declare the events they require and the expected fields + types so they can get the data into their
>> own struct (e.g. as with serde or any equivalent technology).
> 
> That's what is done today. The size and offset is how the tools can get to
> the data and it knows what to do with it.
> 
> I'm not sure how rust can handle this type of opaque type scheme.

In a similar way, the user asks to parse e.g. an UnsignedLongArray and
the lib returns either the 32bit or 64bit variant, with mismatching endianness
fixed up (so the data are in host endianness). The caller has to handle both the 32-bit
and 64-bit cases so everything is type safe.

> 
> 
>>> Note, I put much more effort into the offset, size and sign than the type.
>>> But is this only for the arrays that you have these restrictions, or any
>>> field type?
>>
>> In terms of broken pretty printing, it's a general issue not limited to dynamic arrays.
> 
> Yeah, the pretty printing can easily fail, as it can have anything that the
> kernel can do. Including calling functions that are not available to user
> space. This is why the fallback is always back to size, offset and sign.
> 
>> The only ways pretty printing for an opaque type can possibly work for new types the parser has no
>> specific knowledge of are:
>> 1. The type is not actually opaque, i.e. it comes with some decoding schema (just like the events have
>>      a schema listing their fields + types)
>> 2. The type is opaque, but also ships with an executable description of how to print it.
>>      E.g. if there was a WASM/eBPF/whatever bytecode printing routine made available to userspace.
>>
>> Option (2) is not so appealing as it's both hard to achieve and only allows a fixed set of
>> behaviors for a type. Option (1) is a lot easier and allows the behaviors to be defined
>> on the user side.
>>
>> Wild idea: include the BTF blob in the trace.dat header so no type is opaque anymore. The printing
>> issue is not entirely solved this way (e.g. cpumask still needs some plugin to be displayed as a list
>> of CPUs), but we could at least print all structs in "raw" mode and enum symbolically.
> 
> And how big is that blob?

Less than 3MB AFAIR. For comparison, /proc/kallsyms takes 11MB on my machine.
In one of my test traces, it's 17MB. Maybe it's compressed in trace.dat v7 though, haven't checked.
  

> I'm not against the idea, but I would like it to only hold what is needed.

I suppose it's doable to derive a subset of the info with some efforts.

> 
>>
>> That could also allow creating a quick&dirty way of defining a proper event (aka not trace_printk()):
> 
> 
> I prefer not to have "quick&dirty" ;-)

I'm not saying that I would like to see such quick and dirty events upstream, but the reality around me is
that ftrace events are the only sane way of having an idea of what the scheduler does. This means people need
to create experiments all the time with ad-hoc trace events, on top of the trace events that we attach to
tracepoints via a module. Currently, people use trace_printk() for that, which comes with some significant
amount of work and pain (mostly regex speed).

That said, having just looked at bprint, I could probably support trace_printk() format strings with simple
struct member access (i.e. no __printflags shenanigans etc.) as normal events relatively easily. It's even
possible to use the fmt string pointer as an "event ID". Still a shame that all the event field format infra
basically gets duplicated in a printf format string ...

> 
>>
>> 	#define SIMPLE_TRACE_EVENT(type, fields) \
>> 	struct type fields;	
>> 	TRACE_EVENT(type, \
>> 		TP_PROTO(struct type *data), \
>> 		TP_ARGS(data), \
>> 		TP_STRUCT__entry(__field(struct type, data)), \
>> 		TP_fast_assign(__entry->data = *data;), \
>> 		TP_printk("print in raw mode to display the data"), \
>> 	);
>> 	#define SIMPLE_TRACE(type, fields) trace_struct_##type(&(struct type)fields)
>>
>>
>> 	SIMPLE_TRACE_EVENT(myevent, {
>> 		char name[11];
>> 		int foobar;
>> 	});
>> 	
>> 	SIMPLE_TRACE(myevent, {.name = "hello", .foobar = 42});
> 
> 
>>
>> The format string could be either kernel-generated based on BTF or userspace could be expected
>> to make its own use of BTF.
> 
> What's the use case for the above?

An equivalent to trace_printk() that exposes its fields in the "normal" way rather than having to parse
the format string and a comma-separated list of C expressions. Life is too short to write C interpreters.
Parsing BTF is at least a finite amount of work. But I guess it would be easy to handle only "REC->field"
expressions.

-- Douglas


* Re: [for-next][PATCH 02/11] tracing: Add __cpumask to denote a trace event field that is a cpumask_t
  2022-12-13 17:40               ` Douglas Raillard
@ 2022-12-13 19:44                 ` Steven Rostedt
  2022-12-13 21:14                   ` Douglas Raillard
  0 siblings, 1 reply; 24+ messages in thread
From: Steven Rostedt @ 2022-12-13 19:44 UTC (permalink / raw)
  To: Douglas Raillard
  Cc: linux-kernel, Masami Hiramatsu, Andrew Morton, Valentin Schneider

On Tue, 13 Dec 2022 17:40:06 +0000
Douglas Raillard <douglas.raillard@arm.com> wrote:

> > I prefer not to have "quick&dirty" ;-)  
> 
> I'm not saying that I would like to see such quick and dirty events upstream, but the reality around me is
> that ftrace events are the only sane way of having an idea of what the scheduler does. This means people need
> to create experiments all the time with ad-hoc trace events, on top of the trace events that we attach to
> tracepoints via a module. Currently, people use trace_printk() for that, which comes with some significant
> amount of work and pain (mostly regex speed).

Have you seen custom events?

  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/samples/trace_events/trace_custom_sched.h


> 
> That said having just looked at bprint, I could probably support trace_printk() format strings with simple
> struct member access (i.e. no __printflags shenanigans etc) as normal events relatively easily. It's even
> possible to use the fmt string pointer as an "event ID". Still a shame that all the event field format infra
> basically gets duplicated in a printf format string ...

It's the easiest thing to do in the kernel. The below is probably too much
work for people to use. The fact that it's just a string and does not have
any type information is one of the main reasons I force it not to be in
mainline (hence the nasty banner when it is added).

> 
> >   
> >>
> >> 	#define SIMPLE_TRACE_EVENT(type, fields) \
> >> 	struct type fields;	
> >> 	TRACE_EVENT(type, \
> >> 		TP_PROTO(struct type *data), \
> >> 		TP_ARGS(data), \
> >> 		TP_STRUCT__entry(__field(struct type, data)), \
> >> 		TP_fast_assign(__entry->data = *data;), \
> >> 		TP_printk("print in raw mode to display the data"), \
> >> 	);
> >> 	#define SIMPLE_TRACE(type, fields) trace_struct_##type(&(struct type)fields)
> >>
> >>
> >> 	SIMPLE_TRACE_EVENT(myevent, {
> >> 		char name[11];
> >> 		int foobar;
> >> 	});
> >> 	
> >> 	SIMPLE_TRACE(myevent, {.name = "hello", .foobar = 42});  
> > 
> >   
> >>
> >> The format string could be either kernel-generated based on BTF or userspace could be expected
> >> to make its own use of BTF.  
> > 
> > What's the use case for the above?  
> 
> An equivalent to trace_printk() that exposes its fields in the "normal" way rather than having to parse
> the format string and a comma-separated list of C expressions. Life is too short to write C interpreters.
> Parsing BTF is at least a finite amount of work. But I guess it would be easy to handle only "REC->field"
> expressions.

But the above isn't that much simpler than writing a trace event. When I
use trace_printk(), I seldom use it with tooling. And for the few times I
have written tools to parse printk, the printk formats are very easily
parsed, as I control them. Heck, I'd just do colon delimited string.

-- Steve



* Re: [for-next][PATCH 02/11] tracing: Add __cpumask to denote a trace event field that is a cpumask_t
  2022-12-13 14:20           ` Douglas Raillard
  2022-12-13 15:11             ` Steven Rostedt
@ 2022-12-13 19:50             ` Douglas Raillard
  1 sibling, 0 replies; 24+ messages in thread
From: Douglas Raillard @ 2022-12-13 19:50 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Masami Hiramatsu, Andrew Morton, Valentin Schneider

On 13-12-2022 14:20, Douglas Raillard wrote:
>
> The only case I can think of where parsing would not follow regular C abstract declaration syntax is a type like that:
> 
>      __data_loc int [3][]

After some experimentation, I came to the conclusion that "__data_loc <type> [] <id>" can indeed support any C type, should
more need to be added in the future. I would strongly suggest that any future extension work as follows, to avoid messing up
the grammar in a subtle way that would prevent some types from being expressible, or make it a nightmare to implement.

The recipe to parse that with stock C parser is:
* consume "__data_loc"
* parse "<type> []" as a func prototype parameter declaration
  (declaration using an abstract declarator, i.e. not introducing any identifier)
* parse "<id>" as an identifier.

> The outer-most array is by definition the dynamic one, so "[]". In normal C, [3] and [] would be swapped as
> the outer-most array comes first. That's not too bad though as it is not ambiguous and easy to fixup directly
> in the parse tree.

Simply swapping is wrong in the general case. The correct modification of the "<type> []" parse tree is doing a
"barrel shift" on nested array sizes. If the type is "int [1][2][]", it needs to be turned into "int [][1][2]".
The now-top-level sizeless array is the dynamic array. Note that pointer levels are transparent,
so "int (*[1])[2][]" needs to be turned into "int (*[])[1][2]"

-- Douglas





* Re: [for-next][PATCH 02/11] tracing: Add __cpumask to denote a trace event field that is a cpumask_t
  2022-12-13 19:44                 ` Steven Rostedt
@ 2022-12-13 21:14                   ` Douglas Raillard
  0 siblings, 0 replies; 24+ messages in thread
From: Douglas Raillard @ 2022-12-13 21:14 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Masami Hiramatsu, Andrew Morton, Valentin Schneider

On 13-12-2022 19:44, Steven Rostedt wrote:
> On Tue, 13 Dec 2022 17:40:06 +0000
> Douglas Raillard <douglas.raillard@arm.com> wrote:
> 
>>> I prefer not to have "quick&dirty" ;-)
>>
>> I'm not saying that I would like to see such quick and dirty events upstream, but the reality around me is
>> that ftrace events are the only sane way of having an idea of what the scheduler does. This means people need
>> to create experiments all the time with ad-hoc trace events, on top of the trace events that we attach to
>> tracepoints via a module. Currently, people use trace_printk() for that, which comes with some significant
>> amount of work and pain (mostly regex speed).
> 
> Have you seen custom events?
> 
>    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/samples/trace_events/trace_custom_sched.h
> 

That's interesting, thanks. I currently define a new regular event along with a tp probe that I attach to the tp of interest.
The probe then calls trace_mycustom_event(). However:
* Some events are emitted from multiple tracepoints. TRACE_CUSTOM_EVENT() can only attach to one tp, but maybe it's solvable
   with an event class.
* The tracepoint may or may not exist (we deal with all sorts of kernels including Android and our own dev kernel with extra tps).
   I use __find_tracepoint() and tracepoint_probe_register() to handle that.
* The module we have [1] is organized as a set of features that can be enabled dynamically by name as insmod parameter.
   The probe is registered only if the corresponding event feature is enabled. This allows an event feature to have
   side effects that would not be tolerable unless you want that specific event.

However, TRACE_CUSTOM_EVENT() sounds great for simpler use cases, especially as (some) maintainers are much happier
with new tracepoints (no stable ABI) than with new events (too easy for userspace to consume; they become ABI, and tears ensue when it breaks).

[1] https://github.com/ARM-software/lisa/tree/master/lisa/_assets/kmodules/sched_tp


>>
>> That said having just looked at bprint, I could probably support trace_printk() format strings with simple
>> struct member access (i.e. no __printflags shenanigans etc) as normal events relatively easily. It's even
>> possible to use the fmt string pointer as an "event ID". Still a shame that all the event field format infra
>> basically gets duplicated in a printf format string ...
> 
> It's the easiest thing to do in the kernel. The below is probably too much
> work for people to use. The fact that it's just a string and does not have
> any type information is one of the main reasons I force it not to be in
> mainline (hence the nasty banner when it is added).
> 
>>
>>>    
>>>>
>>>> 	#define SIMPLE_TRACE_EVENT(type, fields) \
>>>> 	struct type fields;	
>>>> 	TRACE_EVENT(type, \
>>>> 		TP_PROTO(struct type *data), \
>>>> 		TP_ARGS(data), \
>>>> 		TP_STRUCT__entry(__field(struct type, data)), \
>>>> 		TP_fast_assign(__entry->data = *data;), \
>>>> 		TP_printk("print in raw mode to display the data"), \
>>>> 	);
>>>> 	#define SIMPLE_TRACE(type, fields) trace_struct_##type(&(struct type)fields)
>>>>
>>>>
>>>> 	SIMPLE_TRACE_EVENT(myevent, {
>>>> 		char name[11];
>>>> 		int foobar;
>>>> 	});
>>>> 	
>>>> 	SIMPLE_TRACE(myevent, {.name = "hello", .foobar = 42});
>>>
>>>    
>>>>
>>>> The format string could be either kernel-generated based on BTF or userspace could be expected
>>>> to make its own use of BTF.
>>>
>>> What's the use case for the above?
>>
>> An equivalent to trace_printk() that exposes its fields in the "normal" way rather than having to parse
>> the format string and a comma-separated list of C expressions. Life is too short to write C interpreters.
>> Parsing BTF is at least a finite amount of work. But I guess it would be easy to handle only "REC->field"
>> expressions.
> 
> But the above isn't that much simpler than writing a trace event.



The user would simply need this:

	SIMPLE_TRACE_EVENT(myevent, {
		char name[11];
		int foobar;
	});

	SIMPLE_TRACE(myevent, {.name = "hello", .foobar = 42});

Instead of currently that:

	TRACE_EVENT(myevent,
		TP_PROTO(const char *name, int foobar),
		TP_ARGS(name, foobar),
		TP_STRUCT__entry(
			__array(char, name, 11)
			__field(int, foobar)
		),
		TP_fast_assign(
			memcpy(__entry->name, name, sizeof(__entry->name));
			__entry->foobar = foobar;
		),
		TP_printk("name=%s foobar=%i", __entry->name, __entry->foobar)
	);

	trace_myevent(...);

We went from 3 lines of declaration to over a dozen. "foobar" is repeated 7 times. While I agree that in the grand
scheme of things it's totally acceptable, people want to make that happen in 1 line, not repeat themselves 7 times
and then faff around with #define CREATE_TRACE_POINTS etc.


> When I use trace_printk(), I seldom use it with tooling. And for the few times I
> have written tools to parse printk, the printk formats are very easily
> parsed, as I control them. Heck, I'd just do colon delimited string.

In the sort of things we do, people will want to drop in a trace_printk() to log a signal, and then get it as an overlay
on an existing per-task plot in a Jupyter notebook. This needs to take less than 15s (kernel compilation + reboot excluded).
So that means:

* We _always_ use tooling. In that case, tooling calls trace-cmd, parses output and loads it in a pandas dataframe, and from
   there either you do computations on it or you simply plot it (matplotlib, plotly, bokeh, whatever).

* As a result, the format must be machine-parsable in a robust way. The parser is written once and for all; no-one wants
   to write a regex just to load a throw-away event. If whatever is provided requires customization to work on such a simple
   field-extraction case, it's considered broken.

* The current format is ok-ish for a human eye, but for all other purposes it's just a bad version of JSON:
	* Untyped: you never know if a value is an integer or a string containing an integer. Any data pipeline
           needs to know that, even in dynamically typed languages like Python (see numpy/pandas).

	* Unquoted/unescaped strings: if I have a "comm" field, I'll end up with "comm=mytask foobar=1",
           but also "comm=mytask with spaces foobar=1" and why not "comm=mytask=1 foobar=1". So far I've never encountered
           task names with "=" in the wild, but one would break parsing with no hope of recovery. Android traces are full of tasks
           with spaces in their names, which makes the regex a lot less obvious and likely quite a bit slower as well.

	* It's custom: that alone means (1) there is no off-the-shelf solution and (2) custom solutions will be slower, especially
           in languages like Python. You can find tons of JSON parsers, including ones using SIMD. Matching that speed with
	  a pure Python implementation is not even remotely feasible, so it would require a C extension module, and having
	  to distribute a bunch of prebuilt binaries etc. is _a lot_ more work than "import json; json.loads(a_line)".
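To make the fragility concrete, here is a small Python sketch (the field names and values are made up for illustration) contrasting the naive "key=value" regex every tool ends up rewriting with loading the same record from JSON:

```python
import json
import re

# Naive "key=value" extractor, the kind each trace consumer rewrites.
def parse_kv(line):
    return dict(re.findall(r"(\w+)=(\S+)", line))

# Fine while the comm has no spaces:
simple = parse_kv("comm=mytask foobar=1")

# A space in the task name silently drops data, and foobar is a string:
broken = parse_kv("comm=mytask with spaces foobar=1")

# The same record in JSON is typed and unambiguous:
rec = json.loads('{"comm": "mytask with spaces", "foobar": 1}')
print(simple, broken, rec)
```

Note how the second parse quietly reduces the comm to "mytask": the rest of the name is lost with no error raised, whereas the JSON record round-trips losslessly and keeps foobar as an integer.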

So all that to say: concisely defining a structured event with a schema known to the tooling is _very_ valuable once you get
past eyeballing a trace. The good news is that trace_printk() does provide a schema via its format string; it's just annoying
that this is a pure duplication of infrastructure w.r.t. other events, which report their field formats in another way. It's also
less powerful, but we can live with that for quick experiments.

Maybe it would be feasible to write a function that takes a printk fmt string as input and creates a matching synthetic event. That
way we would still get a regular event from the userspace PoV, with the ease of defining a format string in the code.

-- Douglas


Thread overview: 24+ messages
2022-11-24 14:50 [for-next][PATCH 00/11] tracing: Updates for 6.2 Steven Rostedt
2022-11-24 14:50 ` [for-next][PATCH 01/11] ftrace: Clean comments related to FTRACE_OPS_FL_PER_CPU Steven Rostedt
2022-11-24 14:50 ` [for-next][PATCH 02/11] tracing: Add __cpumask to denote a trace event field that is a cpumask_t Steven Rostedt
2022-12-12 14:53   ` Douglas Raillard
2022-12-12 16:12     ` Steven Rostedt
2022-12-12 17:04       ` Steven Rostedt
2022-12-12 22:19       ` Douglas Raillard
2022-12-12 23:53         ` Steven Rostedt
2022-12-13 14:20           ` Douglas Raillard
2022-12-13 15:11             ` Steven Rostedt
2022-12-13 17:40               ` Douglas Raillard
2022-12-13 19:44                 ` Steven Rostedt
2022-12-13 21:14                   ` Douglas Raillard
2022-12-13 19:50             ` Douglas Raillard
2022-11-24 14:50 ` [for-next][PATCH 03/11] tracing: Add trace_trigger kernel command line option Steven Rostedt
2022-11-24 14:50 ` [for-next][PATCH 04/11] ring_buffer: Remove unused "event" parameter Steven Rostedt
2022-11-24 14:50 ` [for-next][PATCH 05/11] tracing/osnoise: Add osnoise/options file Steven Rostedt
2022-11-24 17:31   ` Daniel Bristot de Oliveira
2022-11-24 19:28     ` Steven Rostedt
2022-11-24 14:50 ` [for-next][PATCH 06/11] tracing/osnoise: Add OSNOISE_WORKLOAD option Steven Rostedt
2022-11-24 14:50 ` [for-next][PATCH 07/11] Documentation/osnoise: Add osnoise/options documentation Steven Rostedt
2022-11-24 14:50 ` [for-next][PATCH 08/11] tracing/perf: Use strndup_user instead of kzalloc/strncpy_from_user Steven Rostedt
2022-11-24 14:50 ` [for-next][PATCH 09/11] tracing: Make tracepoint_print_iter static Steven Rostedt
2022-11-24 14:50 ` [for-next][PATCH 11/11] ftrace: Avoid needless updates of the ftrace function call Steven Rostedt
