linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH v6 0/5] fix task state report from sched tracepoint
@ 2023-08-03  8:33 Ze Gao
  2023-08-03  8:33 ` [RFC PATCH v6 1/5] perf sched: sync state char array with the kernel Ze Gao
                   ` (4 more replies)
  0 siblings, 5 replies; 35+ messages in thread
From: Ze Gao @ 2023-08-03  8:33 UTC (permalink / raw)
  To: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ian Rogers, Ingo Molnar, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, Namhyung Kim, Peter Zijlstra, Steven Rostedt
  Cc: linux-kernel, linux-perf-users, linux-trace-kernel, Ze Gao

This is the 6th attempt to fix the report task state issue in sched
tracepint, you can check out previous discussions here:

v1: https://lore.kernel.org/linux-trace-kernel/20230725072254.32045-1-zegao@tencent.com
v2: https://lore.kernel.org/linux-trace-kernel/20230726121618.19198-1-zegao@tencent.com
v3: https://lore.kernel.org/linux-trace-kernel/20230801090124.8050-1-zegao@tencent.com
v4: https://lore.kernel.org/linux-trace-kernel/20230802121116.324604-1-zegao@tencent.com 
v5: https://lore.kernel.org/linux-trace-kernel/20230802124840.335638-1-zegao@tencent.com

Against v5, fix some code style and change to report prev_state
in 'short'; also send libtraceevent patches seperately to
linux-trace-devel@vger.kernel.org.

Note PATCH 1-3 are the normal fixes and cleanup whereas PATCH 4-5
introduce new changes.

--

FYI, this series are designed not to break anything now and still do the 
1-1 correspondence int-char mapping for each distinct task state we want to
report, and thus will not lose any details intended for debug purposes. Of
course, this might be compromised because of bugs introduced due to my
stupidity. So your sage comments are very important and appreciated!

--

In the status quo, we should see three different outcomes of the reported
sched-out task state from perf-script, perf-sched-timehist, and Tp_printk
of tracepoint sched_switch.  And it's not hard to figure out that the
former two are built upon the third one, and the reason why we see this
inconsistency is that the former two does not catch up with the internal
change of reported task state definitions as the kernel evolves.

IMHO, exporting internal representations of task state in the tracepoint
sched_switch is not a good practice and not encouraged at all, which can
easily break userspace tools that relies on it. Especially when tracepoints
are massively used in many observability tools nowadays due to its stable
nature, which makes them no longer used for debug only purpose and we
should be careful to decide what ought to be reported to userspace and what
ought not.

Therefore, to fix the issues mentioned above for good, I proposed to add
a new variable to report task state in sched_switch with a symbolic char
along with the old hardcoded value, and save the further processing of
userspace tools and spare them from knowing implementation details in the
kernel.

After this patch seires, we report 'RSDTtXZPI' the same as in procfs, plus
a 'p' which denotes PREEMP_ACTIVE and is used for sched_switch tracepoint
only.

Reviews welcome!

Regards,
Ze

Ze Gao (5):
  perf sched: sync state char array with the kernel
  perf sched: reorganize sched-out task state report code
  sched, tracing: reorganize fields of switch event struct
  sched, tracing: add to report task state in symbolic chars
  perf sched: prefer to use prev_state_char introduced in sched_switch

 include/trace/events/sched.h | 70 +++++++++++++++++-------------
 tools/perf/builtin-sched.c   | 83 +++++++++++++++++-------------------
 2 files changed, 78 insertions(+), 75 deletions(-)

-- 
2.41.0


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH v6 1/5] perf sched: sync state char array with the kernel
  2023-08-03  8:33 [RFC PATCH v6 0/5] fix task state report from sched tracepoint Ze Gao
@ 2023-08-03  8:33 ` Ze Gao
  2023-08-03  9:09   ` Steven Rostedt
  2023-08-03 15:10   ` Steven Rostedt
  2023-08-03  8:33 ` [RFC PATCH v6 2/5] perf sched: reorganize sched-out task state report code Ze Gao
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 35+ messages in thread
From: Ze Gao @ 2023-08-03  8:33 UTC (permalink / raw)
  To: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ian Rogers, Ingo Molnar, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, Namhyung Kim, Peter Zijlstra, Steven Rostedt
  Cc: linux-kernel, linux-perf-users, linux-trace-kernel, Ze Gao

Update state char array and then remove unused and stale
macros, which are kernel internal representations and not
encouraged to use anymore.

Signed-off-by: Ze Gao <zegao@tencent.com>
---
 tools/perf/builtin-sched.c | 13 +------------
 1 file changed, 1 insertion(+), 12 deletions(-)

diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
index 9ab300b6f131..8dc8f071721c 100644
--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -92,23 +92,12 @@ struct sched_atom {
 	struct task_desc	*wakee;
 };
 
-#define TASK_STATE_TO_CHAR_STR "RSDTtZXxKWP"
+#define TASK_STATE_TO_CHAR_STR "RSDTtXZPI"
 
 /* task state bitmask, copied from include/linux/sched.h */
 #define TASK_RUNNING		0
 #define TASK_INTERRUPTIBLE	1
 #define TASK_UNINTERRUPTIBLE	2
-#define __TASK_STOPPED		4
-#define __TASK_TRACED		8
-/* in tsk->exit_state */
-#define EXIT_DEAD		16
-#define EXIT_ZOMBIE		32
-#define EXIT_TRACE		(EXIT_ZOMBIE | EXIT_DEAD)
-/* in tsk->state again */
-#define TASK_DEAD		64
-#define TASK_WAKEKILL		128
-#define TASK_WAKING		256
-#define TASK_PARKED		512
 
 enum thread_state {
 	THREAD_SLEEPING = 0,
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC PATCH v6 2/5] perf sched: reorganize sched-out task state report code
  2023-08-03  8:33 [RFC PATCH v6 0/5] fix task state report from sched tracepoint Ze Gao
  2023-08-03  8:33 ` [RFC PATCH v6 1/5] perf sched: sync state char array with the kernel Ze Gao
@ 2023-08-03  8:33 ` Ze Gao
  2023-08-03  9:10   ` Steven Rostedt
  2023-08-03  8:33 ` [RFC PATCH v6 3/5] sched, tracing: reorganize fields of switch event struct Ze Gao
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 35+ messages in thread
From: Ze Gao @ 2023-08-03  8:33 UTC (permalink / raw)
  To: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ian Rogers, Ingo Molnar, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, Namhyung Kim, Peter Zijlstra, Steven Rostedt
  Cc: linux-kernel, linux-perf-users, linux-trace-kernel, Ze Gao

Mainly does housekeeping work and not introduce any
functional change.

Signed-off-by: Ze Gao <zegao@tencent.com>
---
 tools/perf/builtin-sched.c | 59 +++++++++++++++++---------------------
 1 file changed, 26 insertions(+), 33 deletions(-)

diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
index 8dc8f071721c..5042874ba204 100644
--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -94,11 +94,6 @@ struct sched_atom {
 
 #define TASK_STATE_TO_CHAR_STR "RSDTtXZPI"
 
-/* task state bitmask, copied from include/linux/sched.h */
-#define TASK_RUNNING		0
-#define TASK_INTERRUPTIBLE	1
-#define TASK_UNINTERRUPTIBLE	2
-
 enum thread_state {
 	THREAD_SLEEPING = 0,
 	THREAD_WAIT_CPU,
@@ -255,7 +250,7 @@ struct thread_runtime {
 	u64 total_preempt_time;
 	u64 total_delay_time;
 
-	int last_state;
+	char last_state;
 
 	char shortname[3];
 	bool comm_changed;
@@ -425,7 +420,7 @@ static void add_sched_event_wakeup(struct perf_sched *sched, struct task_desc *t
 }
 
 static void add_sched_event_sleep(struct perf_sched *sched, struct task_desc *task,
-				  u64 timestamp, u64 task_state __maybe_unused)
+				  u64 timestamp, char task_state __maybe_unused)
 {
 	struct sched_atom *event = get_new_event(task, timestamp);
 
@@ -840,6 +835,22 @@ replay_wakeup_event(struct perf_sched *sched,
 	return 0;
 }
 
+static inline char task_state_char(int state)
+{
+	static const char state_to_char[] = "RSDTtXZPI";
+	unsigned int bit = state ? ffs(state) : 0;
+
+	return bit < sizeof(state_to_char) - 1 ? state_to_char[bit] : '?';
+}
+
+static inline char get_task_prev_state(struct evsel *evsel,
+				       struct perf_sample *sample)
+{
+	const int prev_state = evsel__intval(evsel, sample, "prev_state");
+
+	return task_state_char(prev_state);
+}
+
 static int replay_switch_event(struct perf_sched *sched,
 			       struct evsel *evsel,
 			       struct perf_sample *sample,
@@ -849,7 +860,7 @@ static int replay_switch_event(struct perf_sched *sched,
 		   *next_comm  = evsel__strval(evsel, sample, "next_comm");
 	const u32 prev_pid = evsel__intval(evsel, sample, "prev_pid"),
 		  next_pid = evsel__intval(evsel, sample, "next_pid");
-	const u64 prev_state = evsel__intval(evsel, sample, "prev_state");
+	const char prev_state = get_task_prev_state(evsel, sample);
 	struct task_desc *prev, __maybe_unused *next;
 	u64 timestamp0, timestamp = sample->time;
 	int cpu = sample->cpu;
@@ -1039,12 +1050,6 @@ static int thread_atoms_insert(struct perf_sched *sched, struct thread *thread)
 	return 0;
 }
 
-static char sched_out_state(u64 prev_state)
-{
-	const char *str = TASK_STATE_TO_CHAR_STR;
-
-	return str[prev_state];
-}
 
 static int
 add_sched_out_event(struct work_atoms *atoms,
@@ -1121,7 +1126,7 @@ static int latency_switch_event(struct perf_sched *sched,
 {
 	const u32 prev_pid = evsel__intval(evsel, sample, "prev_pid"),
 		  next_pid = evsel__intval(evsel, sample, "next_pid");
-	const u64 prev_state = evsel__intval(evsel, sample, "prev_state");
+	const char prev_state = get_task_prev_state(evsel, sample);
 	struct work_atoms *out_events, *in_events;
 	struct thread *sched_out, *sched_in;
 	u64 timestamp0, timestamp = sample->time;
@@ -1157,7 +1162,7 @@ static int latency_switch_event(struct perf_sched *sched,
 			goto out_put;
 		}
 	}
-	if (add_sched_out_event(out_events, sched_out_state(prev_state), timestamp))
+	if (add_sched_out_event(out_events, prev_state, timestamp))
 		return -1;
 
 	in_events = thread_atoms_search(&sched->atom_root, sched_in, &sched->cmp_pid);
@@ -2022,24 +2027,12 @@ static void timehist_header(struct perf_sched *sched)
 	printf("\n");
 }
 
-static char task_state_char(struct thread *thread, int state)
-{
-	static const char state_to_char[] = TASK_STATE_TO_CHAR_STR;
-	unsigned bit = state ? ffs(state) : 0;
-
-	/* 'I' for idle */
-	if (thread__tid(thread) == 0)
-		return 'I';
-
-	return bit < sizeof(state_to_char) - 1 ? state_to_char[bit] : '?';
-}
-
 static void timehist_print_sample(struct perf_sched *sched,
 				  struct evsel *evsel,
 				  struct perf_sample *sample,
 				  struct addr_location *al,
 				  struct thread *thread,
-				  u64 t, int state)
+				  u64 t, char state)
 {
 	struct thread_runtime *tr = thread__priv(thread);
 	const char *next_comm = evsel__strval(evsel, sample, "next_comm");
@@ -2080,7 +2073,7 @@ static void timehist_print_sample(struct perf_sched *sched,
 	print_sched_time(tr->dt_run, 6);
 
 	if (sched->show_state)
-		printf(" %5c ", task_state_char(thread, state));
+		printf(" %5c ", thread->tid == 0 ? 'I' : state);
 
 	if (sched->show_next) {
 		snprintf(nstr, sizeof(nstr), "next: %s[%d]", next_comm, next_pid);
@@ -2152,9 +2145,9 @@ static void timehist_update_runtime_stats(struct thread_runtime *r,
 		else if (r->last_time) {
 			u64 dt_wait = tprev - r->last_time;
 
-			if (r->last_state == TASK_RUNNING)
+			if (r->last_state == 'R')
 				r->dt_preempt = dt_wait;
-			else if (r->last_state == TASK_UNINTERRUPTIBLE)
+			else if (r->last_state == 'D')
 				r->dt_iowait = dt_wait;
 			else
 				r->dt_sleep = dt_wait;
@@ -2579,7 +2572,7 @@ static int timehist_sched_change_event(struct perf_tool *tool,
 	struct thread_runtime *tr = NULL;
 	u64 tprev, t = sample->time;
 	int rc = 0;
-	int state = evsel__intval(evsel, sample, "prev_state");
+	const char state = get_task_prev_state(evsel, sample);
 
 	addr_location__init(&al);
 	if (machine__resolve(machine, &al, sample) < 0) {
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC PATCH v6 3/5] sched, tracing: reorganize fields of switch event struct
  2023-08-03  8:33 [RFC PATCH v6 0/5] fix task state report from sched tracepoint Ze Gao
  2023-08-03  8:33 ` [RFC PATCH v6 1/5] perf sched: sync state char array with the kernel Ze Gao
  2023-08-03  8:33 ` [RFC PATCH v6 2/5] perf sched: reorganize sched-out task state report code Ze Gao
@ 2023-08-03  8:33 ` Ze Gao
  2023-08-03  8:53   ` Peter Zijlstra
                     ` (2 more replies)
  2023-08-03  8:33 ` [RFC PATCH v6 4/5] sched, tracing: add to report task state in symbolic chars Ze Gao
  2023-08-03  8:33 ` [RFC PATCH v6 5/5] perf sched: prefer to use prev_state_char introduced in sched_switch Ze Gao
  4 siblings, 3 replies; 35+ messages in thread
From: Ze Gao @ 2023-08-03  8:33 UTC (permalink / raw)
  To: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ian Rogers, Ingo Molnar, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, Namhyung Kim, Peter Zijlstra, Steven Rostedt
  Cc: linux-kernel, linux-perf-users, linux-trace-kernel, Ze Gao

Report prioritiy and prev_state in 'short' to save some buffer
space. And also reorder the fields so that we take struct
alignment into consideration to make the record compact.

Suggested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Ze Gao <zegao@tencent.com>
---
 include/trace/events/sched.h | 26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index fbb99a61f714..43492daefa6f 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -187,14 +187,14 @@ DEFINE_EVENT(sched_wakeup_template, sched_wakeup_new,
 	     TP_ARGS(p));
 
 #ifdef CREATE_TRACE_POINTS
-static inline long __trace_sched_switch_state(bool preempt,
+static inline short __trace_sched_switch_state(bool preempt,
 					      unsigned int prev_state,
 					      struct task_struct *p)
 {
 	unsigned int state;
 
 #ifdef CONFIG_SCHED_DEBUG
-	BUG_ON(p != current);
+	WARN_ON_ONCE(p != current);
 #endif /* CONFIG_SCHED_DEBUG */
 
 	/*
@@ -229,23 +229,23 @@ TRACE_EVENT(sched_switch,
 	TP_ARGS(preempt, prev, next, prev_state),
 
 	TP_STRUCT__entry(
-		__array(	char,	prev_comm,	TASK_COMM_LEN	)
 		__field(	pid_t,	prev_pid			)
-		__field(	int,	prev_prio			)
-		__field(	long,	prev_state			)
-		__array(	char,	next_comm,	TASK_COMM_LEN	)
 		__field(	pid_t,	next_pid			)
-		__field(	int,	next_prio			)
+		__field(	short,	prev_prio			)
+		__field(	short,	next_prio			)
+		__array(	char,	prev_comm,	TASK_COMM_LEN	)
+		__array(	char,	next_comm,	TASK_COMM_LEN	)
+		__field(	short,	prev_state			)
 	),
 
 	TP_fast_assign(
-		memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
-		__entry->prev_pid	= prev->pid;
-		__entry->prev_prio	= prev->prio;
-		__entry->prev_state	= __trace_sched_switch_state(preempt, prev_state, prev);
+		__entry->prev_pid		= prev->pid;
+		__entry->next_pid		= next->pid;
+		__entry->prev_prio		= (short) prev->prio;
+		__entry->next_prio		= (short) next->prio;
 		memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
-		__entry->next_pid	= next->pid;
-		__entry->next_prio	= next->prio;
+		memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
+		__entry->prev_state		= __trace_sched_switch_state(preempt, prev_state, prev);
 		/* XXX SCHED_DEADLINE */
 	),
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC PATCH v6 4/5] sched, tracing: add to report task state in symbolic chars
  2023-08-03  8:33 [RFC PATCH v6 0/5] fix task state report from sched tracepoint Ze Gao
                   ` (2 preceding siblings ...)
  2023-08-03  8:33 ` [RFC PATCH v6 3/5] sched, tracing: reorganize fields of switch event struct Ze Gao
@ 2023-08-03  8:33 ` Ze Gao
  2023-08-03  8:59   ` Peter Zijlstra
  2023-08-03  9:29   ` Steven Rostedt
  2023-08-03  8:33 ` [RFC PATCH v6 5/5] perf sched: prefer to use prev_state_char introduced in sched_switch Ze Gao
  4 siblings, 2 replies; 35+ messages in thread
From: Ze Gao @ 2023-08-03  8:33 UTC (permalink / raw)
  To: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ian Rogers, Ingo Molnar, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, Namhyung Kim, Peter Zijlstra, Steven Rostedt
  Cc: linux-kernel, linux-perf-users, linux-trace-kernel, Ze Gao

Internal representations of task state are likely to be changed
or ordered, and reporting them to userspace without exporting
them as part of API is basically wrong, which can easily break
a userspace observability tool as kernel evolves. For example,
perf suffers from this and still reports wrong states as of this
writing.

OTOH, some masqueraded states like TASK_REPORT_IDLE and
TASK_REPORT_MAX are also reported inadvertently, which confuses
things even more and most userspace tools do not even take them
into consideration.

So add a new variable in company with the old raw value to
report task state in symbolic chars, which are self-explaining
and no further translation is needed. Of course this does not
break any userspace tool.

Note for PREEMPT_ACTIVE, we introduce 'p' to report it and use
the old conventions for the rest.

Signed-off-by: Ze Gao <zegao@tencent.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Ian Rogers <irogers@google.com>
---
 include/trace/events/sched.h | 44 ++++++++++++++++++++++--------------
 1 file changed, 27 insertions(+), 17 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 43492daefa6f..ae5b486cc969 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -6,6 +6,7 @@
 #define _TRACE_SCHED_H
 
 #include <linux/kthread.h>
+#include <linux/sched.h>
 #include <linux/sched/numa_balancing.h>
 #include <linux/tracepoint.h>
 #include <linux/binfmts.h>
@@ -214,6 +215,27 @@ static inline short __trace_sched_switch_state(bool preempt,
 
 	return state ? (1 << (state - 1)) : state;
 }
+
+static inline char __trace_sched_switch_state_char(bool preempt,
+						   unsigned int prev_state,
+						   struct task_struct *p)
+{
+	long state;
+
+#ifdef CONFIG_SCHED_DEBUG
+	WARN_ON_ONCE(p != current);
+#endif /* CONFIG_SCHED_DEBUG */
+
+	/*
+	 * For PREEMPT_ACTIVE, we introduce 'p' to report it and use the old
+	 * conventions for the rest.
+	 */
+	if (preempt)
+		return 'p';
+
+	state = __task_state_index(prev_state, p->exit_state);
+	return task_index_to_char(state);
+}
 #endif /* CREATE_TRACE_POINTS */
 
 /*
@@ -236,6 +258,7 @@ TRACE_EVENT(sched_switch,
 		__array(	char,	prev_comm,	TASK_COMM_LEN	)
 		__array(	char,	next_comm,	TASK_COMM_LEN	)
 		__field(	short,	prev_state			)
+		__field(	char,	prev_state_char			)
 	),
 
 	TP_fast_assign(
@@ -246,26 +269,13 @@ TRACE_EVENT(sched_switch,
 		memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
 		memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
 		__entry->prev_state		= __trace_sched_switch_state(preempt, prev_state, prev);
+		__entry->prev_state_char	= __trace_sched_switch_state_char(preempt, prev_state, prev);
 		/* XXX SCHED_DEADLINE */
 	),
 
-	TP_printk("prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s ==> next_comm=%s next_pid=%d next_prio=%d",
-		__entry->prev_comm, __entry->prev_pid, __entry->prev_prio,
-
-		(__entry->prev_state & (TASK_REPORT_MAX - 1)) ?
-		  __print_flags(__entry->prev_state & (TASK_REPORT_MAX - 1), "|",
-				{ TASK_INTERRUPTIBLE, "S" },
-				{ TASK_UNINTERRUPTIBLE, "D" },
-				{ __TASK_STOPPED, "T" },
-				{ __TASK_TRACED, "t" },
-				{ EXIT_DEAD, "X" },
-				{ EXIT_ZOMBIE, "Z" },
-				{ TASK_PARKED, "P" },
-				{ TASK_DEAD, "I" }) :
-		  "R",
-
-		__entry->prev_state & TASK_REPORT_MAX ? "+" : "",
-		__entry->next_comm, __entry->next_pid, __entry->next_prio)
+	TP_printk("prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%c ==> next_comm=%s next_pid=%d next_prio=%d",
+		__entry->prev_comm, __entry->prev_pid, __entry->prev_prio, __entry->prev_state_char, __entry->next_comm,
+		__entry->next_pid, __entry->next_prio)
 );
 
 /*
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC PATCH v6 5/5] perf sched: prefer to use prev_state_char introduced in sched_switch
  2023-08-03  8:33 [RFC PATCH v6 0/5] fix task state report from sched tracepoint Ze Gao
                   ` (3 preceding siblings ...)
  2023-08-03  8:33 ` [RFC PATCH v6 4/5] sched, tracing: add to report task state in symbolic chars Ze Gao
@ 2023-08-03  8:33 ` Ze Gao
  2023-08-03  9:34   ` Steven Rostedt
  4 siblings, 1 reply; 35+ messages in thread
From: Ze Gao @ 2023-08-03  8:33 UTC (permalink / raw)
  To: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ian Rogers, Ingo Molnar, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, Namhyung Kim, Peter Zijlstra, Steven Rostedt
  Cc: linux-kernel, linux-perf-users, linux-trace-kernel, Ze Gao

Since the sched_switch tracepoint introduces a new variable to
report sched-out task state in symbolic char, we prefer to use
it to spare from knowing internal implementations in kernel.

Also we keep the old parsing logic intact but sync the state char
array with the latest kernel.

Signed-off-by: Ze Gao <zegao@tencent.com>
---
 tools/perf/builtin-sched.c | 19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
index 5042874ba204..6f99a36206c9 100644
--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -837,7 +837,7 @@ replay_wakeup_event(struct perf_sched *sched,
 
 static inline char task_state_char(int state)
 {
-	static const char state_to_char[] = "RSDTtXZPI";
+	static const char state_to_char[] = "RSDTtXZPIp";
 	unsigned int bit = state ? ffs(state) : 0;
 
 	return bit < sizeof(state_to_char) - 1 ? state_to_char[bit] : '?';
@@ -846,9 +846,20 @@ static inline char task_state_char(int state)
 static inline char get_task_prev_state(struct evsel *evsel,
 				       struct perf_sample *sample)
 {
-	const int prev_state = evsel__intval(evsel, sample, "prev_state");
+	char prev_state_char;
+	int prev_state;
 
-	return task_state_char(prev_state);
+	/* prefer to use prev_state_char */
+	if (evsel__field(evsel, "prev_state_char"))
+		prev_state_char = (char) evsel__intval(evsel,
+				sample, "prev_state_char");
+	else {
+		prev_state = (int) evsel__intval(evsel,
+				sample, "prev_state");
+		prev_state_char = task_state_char(prev_state);
+	}
+
+	return prev_state_char;
 }
 
 static int replay_switch_event(struct perf_sched *sched,
@@ -2145,7 +2156,7 @@ static void timehist_update_runtime_stats(struct thread_runtime *r,
 		else if (r->last_time) {
 			u64 dt_wait = tprev - r->last_time;
 
-			if (r->last_state == 'R')
+			if (r->last_state == 'R' || r->last_state == 'p')
 				r->dt_preempt = dt_wait;
 			else if (r->last_state == 'D')
 				r->dt_iowait = dt_wait;
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v6 3/5] sched, tracing: reorganize fields of switch event struct
  2023-08-03  8:33 ` [RFC PATCH v6 3/5] sched, tracing: reorganize fields of switch event struct Ze Gao
@ 2023-08-03  8:53   ` Peter Zijlstra
  2023-08-03 11:06     ` Ze Gao
  2023-08-03  9:18   ` Steven Rostedt
  2023-08-23  2:52   ` kernel test robot
  2 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2023-08-03  8:53 UTC (permalink / raw)
  To: Ze Gao
  Cc: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ian Rogers, Ingo Molnar, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, Namhyung Kim, Steven Rostedt, linux-kernel,
	linux-perf-users, linux-trace-kernel, Ze Gao

On Thu, Aug 03, 2023 at 04:33:50AM -0400, Ze Gao wrote:
> Report prioritiy and prev_state in 'short' to save some buffer
> space. And also reorder the fields so that we take struct
> alignment into consideration to make the record compact.
> 
> Suggested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> Signed-off-by: Ze Gao <zegao@tencent.com>

I am really getting fed up with this. This again doesn't list any
reasons on why this is a sane thing to do.

Please review past discussions and collate the various things mentioned
into this Changelog.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v6 4/5] sched, tracing: add to report task state in symbolic chars
  2023-08-03  8:33 ` [RFC PATCH v6 4/5] sched, tracing: add to report task state in symbolic chars Ze Gao
@ 2023-08-03  8:59   ` Peter Zijlstra
  2023-08-03 13:09     ` Ze Gao
  2023-08-03  9:29   ` Steven Rostedt
  1 sibling, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2023-08-03  8:59 UTC (permalink / raw)
  To: Ze Gao
  Cc: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ian Rogers, Ingo Molnar, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, Namhyung Kim, Steven Rostedt, linux-kernel,
	linux-perf-users, linux-trace-kernel, Ze Gao

On Thu, Aug 03, 2023 at 04:33:51AM -0400, Ze Gao wrote:
> Internal representations of task state are likely to be changed
> or ordered, and reporting them to userspace without exporting
> them as part of API is basically wrong, which can easily break
> a userspace observability tool as kernel evolves. For example,
> perf suffers from this and still reports wrong states as of this
> writing.
> 
> OTOH, some masqueraded states like TASK_REPORT_IDLE and
> TASK_REPORT_MAX are also reported inadvertently, which confuses
> things even more and most userspace tools do not even take them
> into consideration.
> 
> So add a new variable in company with the old raw value to
> report task state in symbolic chars, which are self-explaining
> and no further translation is needed. Of course this does not
> break any userspace tool.
> 
> Note for PREEMPT_ACTIVE, we introduce 'p' to report it and use
> the old conventions for the rest.
> 
> Signed-off-by: Ze Gao <zegao@tencent.com>
> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> Acked-by: Ian Rogers <irogers@google.com>

I'm not sure you've actually read any of the things I've written. I hate
this. Not going to happen.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v6 1/5] perf sched: sync state char array with the kernel
  2023-08-03  8:33 ` [RFC PATCH v6 1/5] perf sched: sync state char array with the kernel Ze Gao
@ 2023-08-03  9:09   ` Steven Rostedt
  2023-08-03 10:29     ` Ze Gao
  2023-08-03 12:39     ` Ze Gao
  2023-08-03 15:10   ` Steven Rostedt
  1 sibling, 2 replies; 35+ messages in thread
From: Steven Rostedt @ 2023-08-03  9:09 UTC (permalink / raw)
  To: Ze Gao
  Cc: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ian Rogers, Ingo Molnar, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, Namhyung Kim, Peter Zijlstra, linux-kernel,
	linux-perf-users, linux-trace-kernel, Ze Gao

On Thu,  3 Aug 2023 04:33:48 -0400
Ze Gao <zegao2021@gmail.com> wrote:

Hi Ze,

> Update state char array and then remove unused and stale
> macros, which are kernel internal representations and not
> encouraged to use anymore.
> 

A couple of things.

First, the change logs of every commit need to specify the "why". The
subject can say "what", but the change log really needs to explain why this
patch is important. For example, this patch is really two changes (and thus
should actually be two patches). (I'll also comment on the other patches)

1. The update of the state char array. You should explain why it's being
updated. If it was wrong, it needs to state the commit that changed to make
that happen.

2. For the removing the stale macros, the change log can simply state that
the macros are unused in the code and are being removed.

Finally, I know you're eager to get this patch set in, but please hold off
sending a new version immediately after a comment or two. Some maintainers
prefer submitters to wait a week or so, otherwise you will tend to "spam"
their inboxes. There's more than one maintainer Cc'd on this series, and you
need to be courteous not to send too many emails in a short period of time.

-- Steve


> Signed-off-by: Ze Gao <zegao@tencent.com>
> ---
>  tools/perf/builtin-sched.c | 13 +------------
>  1 file changed, 1 insertion(+), 12 deletions(-)
> 
> diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
> index 9ab300b6f131..8dc8f071721c 100644
> --- a/tools/perf/builtin-sched.c
> +++ b/tools/perf/builtin-sched.c
> @@ -92,23 +92,12 @@ struct sched_atom {
>  	struct task_desc	*wakee;
>  };
>  
> -#define TASK_STATE_TO_CHAR_STR "RSDTtZXxKWP"
> +#define TASK_STATE_TO_CHAR_STR "RSDTtXZPI"
>  
>  /* task state bitmask, copied from include/linux/sched.h */
>  #define TASK_RUNNING		0
>  #define TASK_INTERRUPTIBLE	1
>  #define TASK_UNINTERRUPTIBLE	2
> -#define __TASK_STOPPED		4
> -#define __TASK_TRACED		8
> -/* in tsk->exit_state */
> -#define EXIT_DEAD		16
> -#define EXIT_ZOMBIE		32
> -#define EXIT_TRACE		(EXIT_ZOMBIE | EXIT_DEAD)
> -/* in tsk->state again */
> -#define TASK_DEAD		64
> -#define TASK_WAKEKILL		128
> -#define TASK_WAKING		256
> -#define TASK_PARKED		512
>  
>  enum thread_state {
>  	THREAD_SLEEPING = 0,


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v6 2/5] perf sched: reorganize sched-out task state report code
  2023-08-03  8:33 ` [RFC PATCH v6 2/5] perf sched: reorganize sched-out task state report code Ze Gao
@ 2023-08-03  9:10   ` Steven Rostedt
  2023-08-03 12:37     ` Ze Gao
  0 siblings, 1 reply; 35+ messages in thread
From: Steven Rostedt @ 2023-08-03  9:10 UTC (permalink / raw)
  To: Ze Gao
  Cc: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ian Rogers, Ingo Molnar, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, Namhyung Kim, Peter Zijlstra, linux-kernel,
	linux-perf-users, linux-trace-kernel, Ze Gao

On Thu,  3 Aug 2023 04:33:49 -0400
Ze Gao <zegao2021@gmail.com> wrote:

> Mainly does housekeeping work and not introduce any
> functional change.

This change log doesn't explain at all why this patch is needed, let alone
what it is even doing.

-- Steve


> 
> Signed-off-by: Ze Gao <zegao@tencent.com>
> ---
>  tools/perf/builtin-sched.c | 59 +++++++++++++++++---------------------
>  1 file changed, 26 insertions(+), 33 deletions(-)
> 
> diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
> index 8dc8f071721c..5042874ba204 100644
> --- a/tools/perf/builtin-sched.c
> +++ b/tools/perf/builtin-sched.c
> @@ -94,11 +94,6 @@ struct sched_atom {
>  
>  #define TASK_STATE_TO_CHAR_STR "RSDTtXZPI"
>  
> -/* task state bitmask, copied from include/linux/sched.h */
> -#define TASK_RUNNING		0
> -#define TASK_INTERRUPTIBLE	1
> -#define TASK_UNINTERRUPTIBLE	2
> -
>  enum thread_state {
>  	THREAD_SLEEPING = 0,
>  	THREAD_WAIT_CPU,
> @@ -255,7 +250,7 @@ struct thread_runtime {
>  	u64 total_preempt_time;
>  	u64 total_delay_time;
>  
> -	int last_state;
> +	char last_state;
>  
>  	char shortname[3];
>  	bool comm_changed;
> @@ -425,7 +420,7 @@ static void add_sched_event_wakeup(struct perf_sched *sched, struct task_desc *t
>  }
>  
>  static void add_sched_event_sleep(struct perf_sched *sched, struct task_desc *task,
> -				  u64 timestamp, u64 task_state __maybe_unused)
> +				  u64 timestamp, char task_state __maybe_unused)
>  {
>  	struct sched_atom *event = get_new_event(task, timestamp);
>  
> @@ -840,6 +835,22 @@ replay_wakeup_event(struct perf_sched *sched,
>  	return 0;
>  }
>  
> +static inline char task_state_char(int state)
> +{
> +	static const char state_to_char[] = "RSDTtXZPI";
> +	unsigned int bit = state ? ffs(state) : 0;
> +
> +	return bit < sizeof(state_to_char) - 1 ? state_to_char[bit] : '?';
> +}
> +
> +static inline char get_task_prev_state(struct evsel *evsel,
> +				       struct perf_sample *sample)
> +{
> +	const int prev_state = evsel__intval(evsel, sample, "prev_state");
> +
> +	return task_state_char(prev_state);
> +}
> +
>  static int replay_switch_event(struct perf_sched *sched,
>  			       struct evsel *evsel,
>  			       struct perf_sample *sample,
> @@ -849,7 +860,7 @@ static int replay_switch_event(struct perf_sched *sched,
>  		   *next_comm  = evsel__strval(evsel, sample, "next_comm");
>  	const u32 prev_pid = evsel__intval(evsel, sample, "prev_pid"),
>  		  next_pid = evsel__intval(evsel, sample, "next_pid");
> -	const u64 prev_state = evsel__intval(evsel, sample, "prev_state");
> +	const char prev_state = get_task_prev_state(evsel, sample);
>  	struct task_desc *prev, __maybe_unused *next;
>  	u64 timestamp0, timestamp = sample->time;
>  	int cpu = sample->cpu;
> @@ -1039,12 +1050,6 @@ static int thread_atoms_insert(struct perf_sched *sched, struct thread *thread)
>  	return 0;
>  }
>  
> -static char sched_out_state(u64 prev_state)
> -{
> -	const char *str = TASK_STATE_TO_CHAR_STR;
> -
> -	return str[prev_state];
> -}
>  
>  static int
>  add_sched_out_event(struct work_atoms *atoms,
> @@ -1121,7 +1126,7 @@ static int latency_switch_event(struct perf_sched *sched,
>  {
>  	const u32 prev_pid = evsel__intval(evsel, sample, "prev_pid"),
>  		  next_pid = evsel__intval(evsel, sample, "next_pid");
> -	const u64 prev_state = evsel__intval(evsel, sample, "prev_state");
> +	const char prev_state = get_task_prev_state(evsel, sample);
>  	struct work_atoms *out_events, *in_events;
>  	struct thread *sched_out, *sched_in;
>  	u64 timestamp0, timestamp = sample->time;
> @@ -1157,7 +1162,7 @@ static int latency_switch_event(struct perf_sched *sched,
>  			goto out_put;
>  		}
>  	}
> -	if (add_sched_out_event(out_events, sched_out_state(prev_state), timestamp))
> +	if (add_sched_out_event(out_events, prev_state, timestamp))
>  		return -1;
>  
>  	in_events = thread_atoms_search(&sched->atom_root, sched_in, &sched->cmp_pid);
> @@ -2022,24 +2027,12 @@ static void timehist_header(struct perf_sched *sched)
>  	printf("\n");
>  }
>  
> -static char task_state_char(struct thread *thread, int state)
> -{
> -	static const char state_to_char[] = TASK_STATE_TO_CHAR_STR;
> -	unsigned bit = state ? ffs(state) : 0;
> -
> -	/* 'I' for idle */
> -	if (thread__tid(thread) == 0)
> -		return 'I';
> -
> -	return bit < sizeof(state_to_char) - 1 ? state_to_char[bit] : '?';
> -}
> -
>  static void timehist_print_sample(struct perf_sched *sched,
>  				  struct evsel *evsel,
>  				  struct perf_sample *sample,
>  				  struct addr_location *al,
>  				  struct thread *thread,
> -				  u64 t, int state)
> +				  u64 t, char state)
>  {
>  	struct thread_runtime *tr = thread__priv(thread);
>  	const char *next_comm = evsel__strval(evsel, sample, "next_comm");
> @@ -2080,7 +2073,7 @@ static void timehist_print_sample(struct perf_sched *sched,
>  	print_sched_time(tr->dt_run, 6);
>  
>  	if (sched->show_state)
> -		printf(" %5c ", task_state_char(thread, state));
> +		printf(" %5c ", thread->tid == 0 ? 'I' : state);
>  
>  	if (sched->show_next) {
>  		snprintf(nstr, sizeof(nstr), "next: %s[%d]", next_comm, next_pid);
> @@ -2152,9 +2145,9 @@ static void timehist_update_runtime_stats(struct thread_runtime *r,
>  		else if (r->last_time) {
>  			u64 dt_wait = tprev - r->last_time;
>  
> -			if (r->last_state == TASK_RUNNING)
> +			if (r->last_state == 'R')
>  				r->dt_preempt = dt_wait;
> -			else if (r->last_state == TASK_UNINTERRUPTIBLE)
> +			else if (r->last_state == 'D')
>  				r->dt_iowait = dt_wait;
>  			else
>  				r->dt_sleep = dt_wait;
> @@ -2579,7 +2572,7 @@ static int timehist_sched_change_event(struct perf_tool *tool,
>  	struct thread_runtime *tr = NULL;
>  	u64 tprev, t = sample->time;
>  	int rc = 0;
> -	int state = evsel__intval(evsel, sample, "prev_state");
> +	const char state = get_task_prev_state(evsel, sample);
>  
>  	addr_location__init(&al);
>  	if (machine__resolve(machine, &al, sample) < 0) {


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v6 3/5] sched, tracing: reorganize fields of switch event struct
  2023-08-03  8:33 ` [RFC PATCH v6 3/5] sched, tracing: reorganize fields of switch event struct Ze Gao
  2023-08-03  8:53   ` Peter Zijlstra
@ 2023-08-03  9:18   ` Steven Rostedt
  2023-08-03  9:51     ` Peter Zijlstra
                       ` (2 more replies)
  2023-08-23  2:52   ` kernel test robot
  2 siblings, 3 replies; 35+ messages in thread
From: Steven Rostedt @ 2023-08-03  9:18 UTC (permalink / raw)
  To: Ze Gao
  Cc: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ian Rogers, Ingo Molnar, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, Namhyung Kim, Peter Zijlstra, linux-kernel,
	linux-perf-users, linux-trace-kernel, Ze Gao

On Thu,  3 Aug 2023 04:33:50 -0400
Ze Gao <zegao2021@gmail.com> wrote:

> Report prioritiy and prev_state in 'short' to save some buffer
> space. And also reorder the fields so that we take struct
> alignment into consideration to make the record compact.

If I were to write this, I would have wrote:

  The prev_state field in the sched_switch event is assigned by
  __trace_sched_switch_state(). The largest number that function will return
  is TASK_REPORT_MAX which is just 0x100. There's no reason that the
  prev_state field is a full 32 bits when it is using just 9 bits max. In
  order to save space on the ring buffer, shrink the prev_state to 16 bits
  (short).

  Also, change the positions of the other fields to accommodate the short
  value of prev_state to eliminate any holes that were created in the
  structure.

See the difference?

>  
>  #ifdef CREATE_TRACE_POINTS
> -static inline long __trace_sched_switch_state(bool preempt,
> +static inline short __trace_sched_switch_state(bool preempt,
>  					      unsigned int prev_state,
>  					      struct task_struct *p)
>  {
>  	unsigned int state;
>  
>  #ifdef CONFIG_SCHED_DEBUG
> -	BUG_ON(p != current);
> +	WARN_ON_ONCE(p != current);
>  #endif /* CONFIG_SCHED_DEBUG */

The above needs to be a separate patch.

-- Steve

>  
>  	/*
> @@ -229,23 +229,23 @@ TRACE_EVENT(sched_switch,
>  	TP_ARGS(preempt, prev, next, prev_state),
>  
>  	TP_STRUCT__entry(
> -		__array(	char,	prev_comm,	TASK_COMM_LEN	)
>  		__field(	pid_t,	prev_pid			)
> -		__field(	int,	prev_prio			)
> -		__field(	long,	prev_state			)
> -		__array(	char,	next_comm,	TASK_COMM_LEN	)
>  		__field(	pid_t,	next_pid			)
> -		__field(	int,	next_prio			)
> +		__field(	short,	prev_prio			)
> +		__field(	short,	next_prio			)
> +		__array(	char,	prev_comm,	TASK_COMM_LEN	)
> +		__array(	char,	next_comm,	TASK_COMM_LEN	)
> +		__field(	short,	prev_state			)
>  	),
>  
>  	TP_fast_assign(
> -		memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
> -		__entry->prev_pid	= prev->pid;
> -		__entry->prev_prio	= prev->prio;
> -		__entry->prev_state	= __trace_sched_switch_state(preempt, prev_state, prev);
> +		__entry->prev_pid		= prev->pid;
> +		__entry->next_pid		= next->pid;
> +		__entry->prev_prio		= (short) prev->prio;
> +		__entry->next_prio		= (short) next->prio;
>  		memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
> -		__entry->next_pid	= next->pid;
> -		__entry->next_prio	= next->prio;
> +		memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
> +		__entry->prev_state		= __trace_sched_switch_state(preempt, prev_state, prev);
>  		/* XXX SCHED_DEADLINE */
>  	),
>  


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v6 4/5] sched, tracing: add to report task state in symbolic chars
  2023-08-03  8:33 ` [RFC PATCH v6 4/5] sched, tracing: add to report task state in symbolic chars Ze Gao
  2023-08-03  8:59   ` Peter Zijlstra
@ 2023-08-03  9:29   ` Steven Rostedt
  2023-08-03 10:55     ` Ze Gao
  1 sibling, 1 reply; 35+ messages in thread
From: Steven Rostedt @ 2023-08-03  9:29 UTC (permalink / raw)
  To: Ze Gao
  Cc: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ian Rogers, Ingo Molnar, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, Namhyung Kim, Peter Zijlstra, linux-kernel,
	linux-perf-users, linux-trace-kernel, Ze Gao

On Thu,  3 Aug 2023 04:33:51 -0400
Ze Gao <zegao2021@gmail.com> wrote:

> Internal representations of task state are likely to be changed
> or ordered, and reporting them to userspace without exporting
> them as part of API is basically wrong, which can easily break
> a userspace observability tool as kernel evolves. For example,
> perf suffers from this and still reports wrong states as of this
> writing.
> 
> OTOH, some masqueraded states like TASK_REPORT_IDLE and
> TASK_REPORT_MAX are also reported inadvertently, which confuses
> things even more and most userspace tools do not even take them
> into consideration.
> 
> So add a new variable in company with the old raw value to
> report task state in symbolic chars, which are self-explaining
> and no further translation is needed. Of course this does not
> break any userspace tool.
> 
> Note for PREEMPT_ACTIVE, we introduce 'p' to report it and use
> the old conventions for the rest.

The above is actually good.


> 
> Signed-off-by: Ze Gao <zegao@tencent.com>
> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> Acked-by: Ian Rogers <irogers@google.com>
> ---
>  include/trace/events/sched.h | 44 ++++++++++++++++++++++--------------
>  1 file changed, 27 insertions(+), 17 deletions(-)
> 
> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index 43492daefa6f..ae5b486cc969 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -6,6 +6,7 @@
>  #define _TRACE_SCHED_H
>  
>  #include <linux/kthread.h>
> +#include <linux/sched.h>
>  #include <linux/sched/numa_balancing.h>
>  #include <linux/tracepoint.h>
>  #include <linux/binfmts.h>
> @@ -214,6 +215,27 @@ static inline short __trace_sched_switch_state(bool preempt,
>  
>  	return state ? (1 << (state - 1)) : state;
>  }
> +
> +static inline char __trace_sched_switch_state_char(bool preempt,
> +						   unsigned int prev_state,
> +						   struct task_struct *p)
> +{
> +	long state;
> +
> +#ifdef CONFIG_SCHED_DEBUG
> +	WARN_ON_ONCE(p != current);
> +#endif /* CONFIG_SCHED_DEBUG */
> +
> +	/*
> +	 * For PREEMPT_ACTIVE, we introduce 'p' to report it and use the old
> +	 * conventions for the rest.
> +	 */
> +	if (preempt)
> +		return 'p';
> +
> +	state = __task_state_index(prev_state, p->exit_state);
> +	return task_index_to_char(state);
> +}
>  #endif /* CREATE_TRACE_POINTS */
>  
>  /*
> @@ -236,6 +258,7 @@ TRACE_EVENT(sched_switch,
>  		__array(	char,	prev_comm,	TASK_COMM_LEN	)
>  		__array(	char,	next_comm,	TASK_COMM_LEN	)
>  		__field(	short,	prev_state			)
> +		__field(	char,	prev_state_char			)
>  	),
>  
>  	TP_fast_assign(
> @@ -246,26 +269,13 @@ TRACE_EVENT(sched_switch,
>  		memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
>  		memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
>  		__entry->prev_state		= __trace_sched_switch_state(preempt, prev_state, prev);
> +		__entry->prev_state_char	= __trace_sched_switch_state_char(preempt, prev_state, prev);
>  		/* XXX SCHED_DEADLINE */
>  	),
>  
> -	TP_printk("prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s ==> next_comm=%s next_pid=%d next_prio=%d",
> -		__entry->prev_comm, __entry->prev_pid, __entry->prev_prio,
> -
> -		(__entry->prev_state & (TASK_REPORT_MAX - 1)) ?
> -		  __print_flags(__entry->prev_state & (TASK_REPORT_MAX - 1), "|",
> -				{ TASK_INTERRUPTIBLE, "S" },
> -				{ TASK_UNINTERRUPTIBLE, "D" },
> -				{ __TASK_STOPPED, "T" },
> -				{ __TASK_TRACED, "t" },
> -				{ EXIT_DEAD, "X" },
> -				{ EXIT_ZOMBIE, "Z" },
> -				{ TASK_PARKED, "P" },
> -				{ TASK_DEAD, "I" }) :
> -		  "R",

I just realized, I have user space code that looks at this. As in the format file we have:

print fmt: "prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s ==> next_comm=%s next_pid=%d next_prio=%d", REC->prev_comm, REC->prev_pid, REC->prev_prio, (REC->prev_state & ((((0x00000000 | 0x00000001 | 0x00000002 | 0x00000004 | 0x00000008 | 0x00000010 | 0x00000020 | 0x00000040) + 1) << 1) - 1)) ? __print_flags(REC->prev_state & ((((0x00000000 | 0x00000001 | 0x00000002 | 0x00000004 | 0x00000008 | 0x00000010 | 0x00000020 | 0x00000040) + 1) << 1) - 1), "|", { 0x00000001, "S" }, { 0x00000002, "D" }, { 0x00000004, "T" }, { 0x00000008, "t" }, { 0x00000010, "X" }, { 0x00000020, "Z" }, { 0x00000040, "P" }, { 0x00000080, "I" }) : "R", REC->prev_state & (((0x00000000 | 0x00000001 | 0x00000002 | 0x00000004 | 0x00000008 | 0x00000010 | 0x00000020 | 0x00000040) + 1) << 1) ? "+" : "", REC->next_comm, REC->next_pid, REC->next_prio

And I have used this in applications to find out what values "S" and "D"
are.

So, we need to keep that still. But we can add the prev_state_char to the
output too.

  "prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s[%c] ==> next_comm=%s next_pid=%d next_prio=%d"

> -
> -		__entry->prev_state & TASK_REPORT_MAX ? "+" : "",
> -		__entry->next_comm, __entry->next_pid, __entry->next_prio)
> +	TP_printk("prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%c ==> next_comm=%s next_pid=%d next_prio=%d",
> +		__entry->prev_comm, __entry->prev_pid, __entry->prev_prio, __entry->prev_state_char, __entry->next_comm,
> +		__entry->next_pid, __entry->next_prio)
>  );
>  
>  /*


-- Steve

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v6 5/5] perf sched: prefer to use prev_state_char introduced in sched_switch
  2023-08-03  8:33 ` [RFC PATCH v6 5/5] perf sched: prefer to use prev_state_char introduced in sched_switch Ze Gao
@ 2023-08-03  9:34   ` Steven Rostedt
  2023-08-03 11:01     ` Ze Gao
  0 siblings, 1 reply; 35+ messages in thread
From: Steven Rostedt @ 2023-08-03  9:34 UTC (permalink / raw)
  To: Ze Gao
  Cc: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ian Rogers, Ingo Molnar, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, Namhyung Kim, Peter Zijlstra, linux-kernel,
	linux-perf-users, linux-trace-kernel, Ze Gao

On Thu,  3 Aug 2023 04:33:52 -0400
Ze Gao <zegao2021@gmail.com> wrote:

> Since the sched_switch tracepoint introduces a new variable to
> report sched-out task state in symbolic char, we prefer to use
> it to spare from knowing internal implementations in kernel.

The above needs to be rewritten to be more comprehensive.

Feel free to reply to this thread with new versions of the change log and
such. No need to send a new patch series to fix this before you know that
the new version is acceptable or not.

Replying to a current patch series is fine, but sending out a new one
causes the spam as it's much easier to ignore a thread than to ignore a new
thread with a new series.

Thanks,

-- Steve


> 
> Also we keep the old parsing logic intact but sync the state char
> array with the latest kernel.
> 
> Signed-off-by: Ze Gao <zegao@tencent.com>
> ---
>  tools/perf/builtin-sched.c | 19 +++++++++++++++----
>  1 file changed, 15 insertions(+), 4 deletions(-)
> 
> diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
> index 5042874ba204..6f99a36206c9 100644
> --- a/tools/perf/builtin-sched.c
> +++ b/tools/perf/builtin-sched.c
> @@ -837,7 +837,7 @@ replay_wakeup_event(struct perf_sched *sched,
>  
>  static inline char task_state_char(int state)
>  {
> -	static const char state_to_char[] = "RSDTtXZPI";
> +	static const char state_to_char[] = "RSDTtXZPIp";
>  	unsigned int bit = state ? ffs(state) : 0;
>  
>  	return bit < sizeof(state_to_char) - 1 ? state_to_char[bit] : '?';
> @@ -846,9 +846,20 @@ static inline char task_state_char(int state)
>  static inline char get_task_prev_state(struct evsel *evsel,
>  				       struct perf_sample *sample)
>  {
> -	const int prev_state = evsel__intval(evsel, sample, "prev_state");
> +	char prev_state_char;
> +	int prev_state;
>  
> -	return task_state_char(prev_state);
> +	/* prefer to use prev_state_char */
> +	if (evsel__field(evsel, "prev_state_char"))
> +		prev_state_char = (char) evsel__intval(evsel,
> +				sample, "prev_state_char");
> +	else {
> +		prev_state = (int) evsel__intval(evsel,
> +				sample, "prev_state");
> +		prev_state_char = task_state_char(prev_state);
> +	}
> +
> +	return prev_state_char;
>  }
>  
>  static int replay_switch_event(struct perf_sched *sched,
> @@ -2145,7 +2156,7 @@ static void timehist_update_runtime_stats(struct thread_runtime *r,
>  		else if (r->last_time) {
>  			u64 dt_wait = tprev - r->last_time;
>  
> -			if (r->last_state == 'R')
> +			if (r->last_state == 'R' || r->last_state == 'p')
>  				r->dt_preempt = dt_wait;
>  			else if (r->last_state == 'D')
>  				r->dt_iowait = dt_wait;


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v6 3/5] sched, tracing: reorganize fields of switch event struct
  2023-08-03  9:18   ` Steven Rostedt
@ 2023-08-03  9:51     ` Peter Zijlstra
  2023-08-03 14:45       ` Steven Rostedt
  2023-08-03 12:54     ` Ze Gao
  2023-08-03 12:57     ` Ze Gao
  2 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2023-08-03  9:51 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ze Gao, Adrian Hunter, Alexander Shishkin,
	Arnaldo Carvalho de Melo, Ian Rogers, Ingo Molnar, Jiri Olsa,
	Mark Rutland, Masami Hiramatsu, Namhyung Kim, linux-kernel,
	linux-perf-users, linux-trace-kernel, Ze Gao

On Thu, Aug 03, 2023 at 05:18:26AM -0400, Steven Rostedt wrote:
> On Thu,  3 Aug 2023 04:33:50 -0400
> Ze Gao <zegao2021@gmail.com> wrote:
> 
> > Report prioritiy and prev_state in 'short' to save some buffer
> > space. And also reorder the fields so that we take struct
> > alignment into consideration to make the record compact.
> 
> If I were to write this, I would have wrote:
> 
>   The prev_state field in the sched_switch event is assigned by
>   __trace_sched_switch_state(). The largest number that function will return
>   is TASK_REPORT_MAX which is just 0x100. There's no reason that the
>   prev_state field is a full 32 bits when it is using just 9 bits max. In
>   order to save space on the ring buffer, shrink the prev_state to 16 bits
>   (short).
> 
>   Also, change the positions of the other fields to accommodate the short
>   value of prev_state to eliminate any holes that were created in the
>   structure.
> 
> See the difference?

This also doesn't mention you broke the data format for all trace events
a while back to ensure people are using libtracefs and are thus
confident this won't break things.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v6 1/5] perf sched: sync state char array with the kernel
  2023-08-03  9:09   ` Steven Rostedt
@ 2023-08-03 10:29     ` Ze Gao
  2023-08-03 12:25       ` Ze Gao
  2023-08-03 12:39     ` Ze Gao
  1 sibling, 1 reply; 35+ messages in thread
From: Ze Gao @ 2023-08-03 10:29 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ian Rogers, Ingo Molnar, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, Namhyung Kim, Peter Zijlstra, linux-kernel,
	linux-perf-users, linux-trace-kernel, Ze Gao

On Thu, Aug 3, 2023 at 5:09 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Thu,  3 Aug 2023 04:33:48 -0400
> Ze Gao <zegao2021@gmail.com> wrote:
>
> Hi Ze,
>
> > Update state char array and then remove unused and stale
> > macros, which are kernel internal representations and not
> > encouraged to use anymore.
> >
>
> A couple of things.
>
> First, the change logs of every commit need to specify the "why". The
> subject can say "what", but the change log really needs to explain why this
> patch is important. For example, this patch is really two changes (and thus
> should actually be two patches). (I'll also comment on the other patches)

Thanks for the feedback! Will elaborate the changes in each changelog.

> 1. The update of the state char array. You should explain why it's being
> updated. If it was wrong, it needs to state the commit that changed to make
> that happen.
>
> 2. For the removing the stale macros, the change log can simply state that
> the macros are unused in the code and are being removed.
>
> Finally, I know you're eager to get this patch set in, but please hold off
> sending a new version immediately after a comment or two. Some maintainers
> prefer submitters to wait a week or so, otherwise you will tend to "spam"
> their inboxes. There's more than one maintainer Cc'd on this series, and you
> need to be courteous not to send too many emails in a short period of time.

Noted!  Actually I'm in no rush and just to make sure people see the
latest patches so they do not have to waste time on the old series.

Will hold off to resolve all the comments in this thread.

And thanks for pointing this out.

Regards,
Ze

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v6 4/5] sched, tracing: add to report task state in symbolic chars
  2023-08-03  9:29   ` Steven Rostedt
@ 2023-08-03 10:55     ` Ze Gao
  0 siblings, 0 replies; 35+ messages in thread
From: Ze Gao @ 2023-08-03 10:55 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ian Rogers, Ingo Molnar, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, Namhyung Kim, Peter Zijlstra, linux-kernel,
	linux-perf-users, linux-trace-kernel, Ze Gao

On Thu, Aug 3, 2023 at 5:29 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Thu,  3 Aug 2023 04:33:51 -0400
> Ze Gao <zegao2021@gmail.com> wrote:
>
> > Internal representations of task state are likely to be changed
> > or ordered, and reporting them to userspace without exporting
> > them as part of API is basically wrong, which can easily break
> > a userspace observability tool as kernel evolves. For example,
> > perf suffers from this and still reports wrong states as of this
> > writing.
> >
> > OTOH, some masqueraded states like TASK_REPORT_IDLE and
> > TASK_REPORT_MAX are also reported inadvertently, which confuses
> > things even more and most userspace tools do not even take them
> > into consideration.
> >
> > So add a new variable in company with the old raw value to
> > report task state in symbolic chars, which are self-explaining
> > and no further translation is needed. Of course this does not
> > break any userspace tool.
> >
> > Note for PREEMPT_ACTIVE, we introduce 'p' to report it and use
> > the old conventions for the rest.
>
> The above is actually good.
>
>
> >
> > Signed-off-by: Ze Gao <zegao@tencent.com>
> > Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> > Acked-by: Ian Rogers <irogers@google.com>
> > ---
> >  include/trace/events/sched.h | 44 ++++++++++++++++++++++--------------
> >  1 file changed, 27 insertions(+), 17 deletions(-)
> >
> > diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> > index 43492daefa6f..ae5b486cc969 100644
> > --- a/include/trace/events/sched.h
> > +++ b/include/trace/events/sched.h
> > @@ -6,6 +6,7 @@
> >  #define _TRACE_SCHED_H
> >
> >  #include <linux/kthread.h>
> > +#include <linux/sched.h>
> >  #include <linux/sched/numa_balancing.h>
> >  #include <linux/tracepoint.h>
> >  #include <linux/binfmts.h>
> > @@ -214,6 +215,27 @@ static inline short __trace_sched_switch_state(bool preempt,
> >
> >       return state ? (1 << (state - 1)) : state;
> >  }
> > +
> > +static inline char __trace_sched_switch_state_char(bool preempt,
> > +                                                unsigned int prev_state,
> > +                                                struct task_struct *p)
> > +{
> > +     long state;
> > +
> > +#ifdef CONFIG_SCHED_DEBUG
> > +     WARN_ON_ONCE(p != current);
> > +#endif /* CONFIG_SCHED_DEBUG */
> > +
> > +     /*
> > +      * For PREEMPT_ACTIVE, we introduce 'p' to report it and use the old
> > +      * conventions for the rest.
> > +      */
> > +     if (preempt)
> > +             return 'p';
> > +
> > +     state = __task_state_index(prev_state, p->exit_state);
> > +     return task_index_to_char(state);
> > +}
> >  #endif /* CREATE_TRACE_POINTS */
> >
> >  /*
> > @@ -236,6 +258,7 @@ TRACE_EVENT(sched_switch,
> >               __array(        char,   prev_comm,      TASK_COMM_LEN   )
> >               __array(        char,   next_comm,      TASK_COMM_LEN   )
> >               __field(        short,  prev_state                      )
> > +             __field(        char,   prev_state_char                 )
> >       ),
> >
> >       TP_fast_assign(
> > @@ -246,26 +269,13 @@ TRACE_EVENT(sched_switch,
> >               memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
> >               memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
> >               __entry->prev_state             = __trace_sched_switch_state(preempt, prev_state, prev);
> > +             __entry->prev_state_char        = __trace_sched_switch_state_char(preempt, prev_state, prev);
> >               /* XXX SCHED_DEADLINE */
> >       ),
> >
> > -     TP_printk("prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s ==> next_comm=%s next_pid=%d next_prio=%d",
> > -             __entry->prev_comm, __entry->prev_pid, __entry->prev_prio,
> > -
> > -             (__entry->prev_state & (TASK_REPORT_MAX - 1)) ?
> > -               __print_flags(__entry->prev_state & (TASK_REPORT_MAX - 1), "|",
> > -                             { TASK_INTERRUPTIBLE, "S" },
> > -                             { TASK_UNINTERRUPTIBLE, "D" },
> > -                             { __TASK_STOPPED, "T" },
> > -                             { __TASK_TRACED, "t" },
> > -                             { EXIT_DEAD, "X" },
> > -                             { EXIT_ZOMBIE, "Z" },
> > -                             { TASK_PARKED, "P" },
> > -                             { TASK_DEAD, "I" }) :
> > -               "R",
>
> I just realized, I have user space code that looks at this. As in the format file we have:
>
> print fmt: "prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s ==> next_comm=%s next_pid=%d next_prio=%d", REC->prev_comm, REC->prev_pid, REC->prev_prio, (REC->prev_state & ((((0x00000000 | 0x00000001 | 0x00000002 | 0x00000004 | 0x00000008 | 0x00000010 | 0x00000020 | 0x00000040) + 1) << 1) - 1)) ? __print_flags(REC->prev_state & ((((0x00000000 | 0x00000001 | 0x00000002 | 0x00000004 | 0x00000008 | 0x00000010 | 0x00000020 | 0x00000040) + 1) << 1) - 1), "|", { 0x00000001, "S" }, { 0x00000002, "D" }, { 0x00000004, "T" }, { 0x00000008, "t" }, { 0x00000010, "X" }, { 0x00000020, "Z" }, { 0x00000040, "P" }, { 0x00000080, "I" }) : "R", REC->prev_state & (((0x00000000 | 0x00000001 | 0x00000002 | 0x00000004 | 0x00000008 | 0x00000010 | 0x00000020 | 0x00000040) + 1) << 1) ? "+" : "", REC->next_comm, REC->next_pid, REC->next_prio
>
> And I have used this in applications to find out what values "S" and "D"
> are.
>
> So, we need to keep that still. But we can add the prev_state_char to the
> output too.
>
>   "prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s[%c] ==> next_comm=%s next_pid=%d next_prio=%d"

Good point!

But I see Peter has strong opinions over this change badly. So I'm not
sure if it's worth the effort to push this anymore :/ And apparently We
failed to convince each other over this problem.

How about we only keep all the fixing patches and throw away this
'prev_state_char' thing?

Again, I'm not meant to upset anyone here.  But Tons of thanks for your
sage reviews on this.

Thoughts?

Thanks,
Ze

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v6 5/5] perf sched: prefer to use prev_state_char introduced in sched_switch
  2023-08-03  9:34   ` Steven Rostedt
@ 2023-08-03 11:01     ` Ze Gao
  0 siblings, 0 replies; 35+ messages in thread
From: Ze Gao @ 2023-08-03 11:01 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ian Rogers, Ingo Molnar, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, Namhyung Kim, Peter Zijlstra, linux-kernel,
	linux-perf-users, linux-trace-kernel, Ze Gao

On Thu, Aug 3, 2023 at 5:34 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Thu,  3 Aug 2023 04:33:52 -0400
> Ze Gao <zegao2021@gmail.com> wrote:
>
> > Since the sched_switch tracepoint introduces a new variable to
> > report sched-out task state in symbolic char, we prefer to use
> > it to spare from knowing internal implementations in kernel.
>
> The above needs to be rewritten to be more comprehensive.
> Feel free to reply to this thread with new versions of the change log and
> such. No need to send a new patch series to fix this before you know that
> the new version is acceptable or not.

Copy that! Thank you.

> Replying to a current patch series is fine, but sending out a new one
> causes the spam as it's much easier to ignore a thread than to ignore a new
> thread with a new series.

Thanks Steven!  Lesson learned.

Regards,
Ze

>
>
> >
> > Also we keep the old parsing logic intact but sync the state char
> > array with the latest kernel.
> >
> > Signed-off-by: Ze Gao <zegao@tencent.com>
> > ---
> >  tools/perf/builtin-sched.c | 19 +++++++++++++++----
> >  1 file changed, 15 insertions(+), 4 deletions(-)
> >
> > diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
> > index 5042874ba204..6f99a36206c9 100644
> > --- a/tools/perf/builtin-sched.c
> > +++ b/tools/perf/builtin-sched.c
> > @@ -837,7 +837,7 @@ replay_wakeup_event(struct perf_sched *sched,
> >
> >  static inline char task_state_char(int state)
> >  {
> > -     static const char state_to_char[] = "RSDTtXZPI";
> > +     static const char state_to_char[] = "RSDTtXZPIp";
> >       unsigned int bit = state ? ffs(state) : 0;
> >
> >       return bit < sizeof(state_to_char) - 1 ? state_to_char[bit] : '?';
> > @@ -846,9 +846,20 @@ static inline char task_state_char(int state)
> >  static inline char get_task_prev_state(struct evsel *evsel,
> >                                      struct perf_sample *sample)
> >  {
> > -     const int prev_state = evsel__intval(evsel, sample, "prev_state");
> > +     char prev_state_char;
> > +     int prev_state;
> >
> > -     return task_state_char(prev_state);
> > +     /* prefer to use prev_state_char */
> > +     if (evsel__field(evsel, "prev_state_char"))
> > +             prev_state_char = (char) evsel__intval(evsel,
> > +                             sample, "prev_state_char");
> > +     else {
> > +             prev_state = (int) evsel__intval(evsel,
> > +                             sample, "prev_state");
> > +             prev_state_char = task_state_char(prev_state);
> > +     }
> > +
> > +     return prev_state_char;
> >  }
> >
> >  static int replay_switch_event(struct perf_sched *sched,
> > @@ -2145,7 +2156,7 @@ static void timehist_update_runtime_stats(struct thread_runtime *r,
> >               else if (r->last_time) {
> >                       u64 dt_wait = tprev - r->last_time;
> >
> > -                     if (r->last_state == 'R')
> > +                     if (r->last_state == 'R' || r->last_state == 'p')
> >                               r->dt_preempt = dt_wait;
> >                       else if (r->last_state == 'D')
> >                               r->dt_iowait = dt_wait;
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v6 3/5] sched, tracing: reorganize fields of switch event struct
  2023-08-03  8:53   ` Peter Zijlstra
@ 2023-08-03 11:06     ` Ze Gao
  0 siblings, 0 replies; 35+ messages in thread
From: Ze Gao @ 2023-08-03 11:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ian Rogers, Ingo Molnar, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, Namhyung Kim, Steven Rostedt, linux-kernel,
	linux-perf-users, linux-trace-kernel, Ze Gao

On Thu, Aug 3, 2023 at 4:54 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Aug 03, 2023 at 04:33:50AM -0400, Ze Gao wrote:
> > Report prioritiy and prev_state in 'short' to save some buffer
> > space. And also reorder the fields so that we take struct
> > alignment into consideration to make the record compact.
> >
> > Suggested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> > Signed-off-by: Ze Gao <zegao@tencent.com>
>
> I am really getting fed up with this. This again doesn't list any
> reasons on why this is a sane thing to do.

Lesson learned from Steven's reply,  Will reorganize the changelog.

Thanks for pointing this out.

Regards,
Ze

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v6 1/5] perf sched: sync state char array with the kernel
  2023-08-03 10:29     ` Ze Gao
@ 2023-08-03 12:25       ` Ze Gao
  0 siblings, 0 replies; 35+ messages in thread
From: Ze Gao @ 2023-08-03 12:25 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ian Rogers, Ingo Molnar, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, Namhyung Kim, Peter Zijlstra, linux-kernel,
	linux-perf-users, linux-trace-kernel, Ze Gao

Hi,

THIS IS THE NEW CHANGELOG FOR THIS PATCH:

    perf sched: sync state char array with the kernel

    Since commit e936e8e459e14 ("perf tools: Adapt the
    TASK_STATE_TO_CHAR_STR to new value in kernel space."),
    the state char array that is used to interpret the
    states of tasks being switched out have not synced
    once with kernel definitions. Whereas the task report
    logic is evolving over this time and the definition
    of this state char array has been changed multiple
    times. And this leads to inconsistency.

    As of this writing, perf timehist --state still reports
    the wrong states because TASK_STATE_TO_CHAR_STR is too
    outdated to use.

    So sync TASK_STATE_TO_CHAR_STR to match the latest kernel
    definitions to fix it.

    Signed-off-by: Ze Gao <zegao@tencent.com>

Regards,
Ze

On Thu, Aug 3, 2023 at 6:29 PM Ze Gao <zegao2021@gmail.com> wrote:
>
> On Thu, Aug 3, 2023 at 5:09 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > On Thu,  3 Aug 2023 04:33:48 -0400
> > Ze Gao <zegao2021@gmail.com> wrote:
> >
> > Hi Ze,
> >
> > > Update state char array and then remove unused and stale
> > > macros, which are kernel internal representations and not
> > > encouraged to use anymore.
> > >
> >
> > A couple of things.
> >
> > First, the change logs of every commit need to specify the "why". The
> > subject can say "what", but the change log really needs to explain why this
> > patch is important. For example, this patch is really two changes (and thus
> > should actually be two patches). (I'll also comment on the other patches)
>
> Thanks for the feedback! Will elaborate the changes in each changelog.
>
> > 1. The update of the state char array. You should explain why it's being
> > updated. If it was wrong, it needs to state the commit that changed to make
> > that happen.
> >
> > 2. For the removing the stale macros, the change log can simply state that
> > the macros are unused in the code and are being removed.
> >
> > Finally, I know you're eager to get this patch set in, but please hold off
> > sending a new version immediately after a comment or two. Some maintainers
> > prefer submitters to wait a week or so, otherwise you will tend to "spam"
> > their inboxes. There's more than one maintainer Cc'd on this series, and you
> > need to be courteous not to send too many emails in a short period of time.
>
> Noted!  Actually I'm in no rush and just to make sure people see the
> latest patches so they do not have to waste time on the old series.
>
> Will hold off to resolve all the comments in this thread.
>
> And thanks for pointing this out.
>
> Regards,
> Ze

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v6 2/5] perf sched: reorganize sched-out task state report code
  2023-08-03  9:10   ` Steven Rostedt
@ 2023-08-03 12:37     ` Ze Gao
  0 siblings, 0 replies; 35+ messages in thread
From: Ze Gao @ 2023-08-03 12:37 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ian Rogers, Ingo Molnar, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, Namhyung Kim, Peter Zijlstra, linux-kernel,
	linux-perf-users, linux-trace-kernel, Ze Gao

Hi Steven,

THIS IS THE NEW CHANGELOG FOR THIS PATCH:

    perf sched: reorganize sched-out task state report code

    Passing around the task state reported by tracepoint as an
    integer creates dependencies both on the task state macros
    and TASK_STATE_TO_CHAR_STR.

    Actually we can simplify this by computing the state based
    on TASK_STATE_TO_CHAR_STR and then pass the result as a
    'char', which saves us from using these macros anymore.
    So we can remove them for good.

    Note that sched_out_state() is basically doing the same
    thing as task_state_char(), so combine them into one
    and provide an intended helper get_task_prev_state() for
    extracting task state from perf record.

    IOW, this patch does not introduce any functional changes.
    mainly for housekeeping.

    Signed-off-by: Ze Gao <zegao@tencent.com>

Suggestions?

Regards,
Ze

On Thu, Aug 3, 2023 at 5:10 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Thu,  3 Aug 2023 04:33:49 -0400
> Ze Gao <zegao2021@gmail.com> wrote:
>
> > Mainly does housekeeping work and not introduce any
> > functional change.
>
> This change log doesn't explain at all why this patch is needed, let alone
> what it is even doing.
>
> -- Steve
>
>
> >
> > Signed-off-by: Ze Gao <zegao@tencent.com>
> > ---
> >  tools/perf/builtin-sched.c | 59 +++++++++++++++++---------------------
> >  1 file changed, 26 insertions(+), 33 deletions(-)
> >
> > diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
> > index 8dc8f071721c..5042874ba204 100644
> > --- a/tools/perf/builtin-sched.c
> > +++ b/tools/perf/builtin-sched.c
> > @@ -94,11 +94,6 @@ struct sched_atom {
> >
> >  #define TASK_STATE_TO_CHAR_STR "RSDTtXZPI"
> >
> > -/* task state bitmask, copied from include/linux/sched.h */
> > -#define TASK_RUNNING         0
> > -#define TASK_INTERRUPTIBLE   1
> > -#define TASK_UNINTERRUPTIBLE 2
> > -
> >  enum thread_state {
> >       THREAD_SLEEPING = 0,
> >       THREAD_WAIT_CPU,
> > @@ -255,7 +250,7 @@ struct thread_runtime {
> >       u64 total_preempt_time;
> >       u64 total_delay_time;
> >
> > -     int last_state;
> > +     char last_state;
> >
> >       char shortname[3];
> >       bool comm_changed;
> > @@ -425,7 +420,7 @@ static void add_sched_event_wakeup(struct perf_sched *sched, struct task_desc *t
> >  }
> >
> >  static void add_sched_event_sleep(struct perf_sched *sched, struct task_desc *task,
> > -                               u64 timestamp, u64 task_state __maybe_unused)
> > +                               u64 timestamp, char task_state __maybe_unused)
> >  {
> >       struct sched_atom *event = get_new_event(task, timestamp);
> >
> > @@ -840,6 +835,22 @@ replay_wakeup_event(struct perf_sched *sched,
> >       return 0;
> >  }
> >
> > +static inline char task_state_char(int state)
> > +{
> > +     static const char state_to_char[] = "RSDTtXZPI";
> > +     unsigned int bit = state ? ffs(state) : 0;
> > +
> > +     return bit < sizeof(state_to_char) - 1 ? state_to_char[bit] : '?';
> > +}
> > +
> > +static inline char get_task_prev_state(struct evsel *evsel,
> > +                                    struct perf_sample *sample)
> > +{
> > +     const int prev_state = evsel__intval(evsel, sample, "prev_state");
> > +
> > +     return task_state_char(prev_state);
> > +}
> > +
> >  static int replay_switch_event(struct perf_sched *sched,
> >                              struct evsel *evsel,
> >                              struct perf_sample *sample,
> > @@ -849,7 +860,7 @@ static int replay_switch_event(struct perf_sched *sched,
> >                  *next_comm  = evsel__strval(evsel, sample, "next_comm");
> >       const u32 prev_pid = evsel__intval(evsel, sample, "prev_pid"),
> >                 next_pid = evsel__intval(evsel, sample, "next_pid");
> > -     const u64 prev_state = evsel__intval(evsel, sample, "prev_state");
> > +     const char prev_state = get_task_prev_state(evsel, sample);
> >       struct task_desc *prev, __maybe_unused *next;
> >       u64 timestamp0, timestamp = sample->time;
> >       int cpu = sample->cpu;
> > @@ -1039,12 +1050,6 @@ static int thread_atoms_insert(struct perf_sched *sched, struct thread *thread)
> >       return 0;
> >  }
> >
> > -static char sched_out_state(u64 prev_state)
> > -{
> > -     const char *str = TASK_STATE_TO_CHAR_STR;
> > -
> > -     return str[prev_state];
> > -}
> >
> >  static int
> >  add_sched_out_event(struct work_atoms *atoms,
> > @@ -1121,7 +1126,7 @@ static int latency_switch_event(struct perf_sched *sched,
> >  {
> >       const u32 prev_pid = evsel__intval(evsel, sample, "prev_pid"),
> >                 next_pid = evsel__intval(evsel, sample, "next_pid");
> > -     const u64 prev_state = evsel__intval(evsel, sample, "prev_state");
> > +     const char prev_state = get_task_prev_state(evsel, sample);
> >       struct work_atoms *out_events, *in_events;
> >       struct thread *sched_out, *sched_in;
> >       u64 timestamp0, timestamp = sample->time;
> > @@ -1157,7 +1162,7 @@ static int latency_switch_event(struct perf_sched *sched,
> >                       goto out_put;
> >               }
> >       }
> > -     if (add_sched_out_event(out_events, sched_out_state(prev_state), timestamp))
> > +     if (add_sched_out_event(out_events, prev_state, timestamp))
> >               return -1;
> >
> >       in_events = thread_atoms_search(&sched->atom_root, sched_in, &sched->cmp_pid);
> > @@ -2022,24 +2027,12 @@ static void timehist_header(struct perf_sched *sched)
> >       printf("\n");
> >  }
> >
> > -static char task_state_char(struct thread *thread, int state)
> > -{
> > -     static const char state_to_char[] = TASK_STATE_TO_CHAR_STR;
> > -     unsigned bit = state ? ffs(state) : 0;
> > -
> > -     /* 'I' for idle */
> > -     if (thread__tid(thread) == 0)
> > -             return 'I';
> > -
> > -     return bit < sizeof(state_to_char) - 1 ? state_to_char[bit] : '?';
> > -}
> > -
> >  static void timehist_print_sample(struct perf_sched *sched,
> >                                 struct evsel *evsel,
> >                                 struct perf_sample *sample,
> >                                 struct addr_location *al,
> >                                 struct thread *thread,
> > -                               u64 t, int state)
> > +                               u64 t, char state)
> >  {
> >       struct thread_runtime *tr = thread__priv(thread);
> >       const char *next_comm = evsel__strval(evsel, sample, "next_comm");
> > @@ -2080,7 +2073,7 @@ static void timehist_print_sample(struct perf_sched *sched,
> >       print_sched_time(tr->dt_run, 6);
> >
> >       if (sched->show_state)
> > -             printf(" %5c ", task_state_char(thread, state));
> > +             printf(" %5c ", thread->tid == 0 ? 'I' : state);
> >
> >       if (sched->show_next) {
> >               snprintf(nstr, sizeof(nstr), "next: %s[%d]", next_comm, next_pid);
> > @@ -2152,9 +2145,9 @@ static void timehist_update_runtime_stats(struct thread_runtime *r,
> >               else if (r->last_time) {
> >                       u64 dt_wait = tprev - r->last_time;
> >
> > -                     if (r->last_state == TASK_RUNNING)
> > +                     if (r->last_state == 'R')
> >                               r->dt_preempt = dt_wait;
> > -                     else if (r->last_state == TASK_UNINTERRUPTIBLE)
> > +                     else if (r->last_state == 'D')
> >                               r->dt_iowait = dt_wait;
> >                       else
> >                               r->dt_sleep = dt_wait;
> > @@ -2579,7 +2572,7 @@ static int timehist_sched_change_event(struct perf_tool *tool,
> >       struct thread_runtime *tr = NULL;
> >       u64 tprev, t = sample->time;
> >       int rc = 0;
> > -     int state = evsel__intval(evsel, sample, "prev_state");
> > +     const char state = get_task_prev_state(evsel, sample);
> >
> >       addr_location__init(&al);
> >       if (machine__resolve(machine, &al, sample) < 0) {
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v6 1/5] perf sched: sync state char array with the kernel
  2023-08-03  9:09   ` Steven Rostedt
  2023-08-03 10:29     ` Ze Gao
@ 2023-08-03 12:39     ` Ze Gao
  1 sibling, 0 replies; 35+ messages in thread
From: Ze Gao @ 2023-08-03 12:39 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ian Rogers, Ingo Molnar, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, Namhyung Kim, Peter Zijlstra, linux-kernel,
	linux-perf-users, linux-trace-kernel, Ze Gao

> 2. For the removing the stale macros, the change log can simply state that
> the macros are unused in the code and are being removed.

I've split this part into a separate patch, and here is the changelog:

    perf sched: cleanup to remove unused macros

    The macros copied from kernel headers are unused and
    stale in the code and are being removed to avoid confusions.

    Signed-off-by: Ze Gao <zegao@tencent.com>

Regards,
Ze

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v6 3/5] sched, tracing: reorganize fields of switch event struct
  2023-08-03  9:18   ` Steven Rostedt
  2023-08-03  9:51     ` Peter Zijlstra
@ 2023-08-03 12:54     ` Ze Gao
  2023-08-03 12:57     ` Ze Gao
  2 siblings, 0 replies; 35+ messages in thread
From: Ze Gao @ 2023-08-03 12:54 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ian Rogers, Ingo Molnar, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, Namhyung Kim, Peter Zijlstra, linux-kernel,
	linux-perf-users, linux-trace-kernel, Ze Gao

On Thu, Aug 3, 2023 at 5:18 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Thu,  3 Aug 2023 04:33:50 -0400
> Ze Gao <zegao2021@gmail.com> wrote:
>
> > Report prioritiy and prev_state in 'short' to save some buffer
> > space. And also reorder the fields so that we take struct
> > alignment into consideration to make the record compact.
>
> If I were to write this, I would have wrote:
>
>   The prev_state field in the sched_switch event is assigned by
>   __trace_sched_switch_state(). The largest number that function will return
>   is TASK_REPORT_MAX which is just 0x100. There's no reason that the
>   prev_state field is a full 32 bits when it is using just 9 bits max. In
>   order to save space on the ring buffer, shrink the prev_state to 16 bits
>   (short).
>
>   Also, change the positions of the other fields to accommodate the short
>   value of prev_state to eliminate any holes that were created in the
>   structure.
>
> See the difference?
>
> >
> >  #ifdef CREATE_TRACE_POINTS
> > -static inline long __trace_sched_switch_state(bool preempt,
> > +static inline short __trace_sched_switch_state(bool preempt,
> >                                             unsigned int prev_state,
> >                                             struct task_struct *p)
> >  {
> >       unsigned int state;
> >
> >  #ifdef CONFIG_SCHED_DEBUG
> > -     BUG_ON(p != current);
> > +     WARN_ON_ONCE(p != current);
> >  #endif /* CONFIG_SCHED_DEBUG */
>
> The above needs to be a separate patch.

I've moved this to a new patch, and this is the changelog:

    sched, tracing: change BUG_ON to WARN_ON_ONCE in __trace_sched_switch_state

    BUG_ON() was introduced in 2014 and old, and we
    switch it to WARN_ON_ONCE() to not to crash the
    kernel when the sched-out task is unexpected than
    the current, as suggested by Steven.

    Signed-off-by: Ze Gao <zegao@tencent.com>

Regards,
Ze

> >
> >       /*
> > @@ -229,23 +229,23 @@ TRACE_EVENT(sched_switch,
> >       TP_ARGS(preempt, prev, next, prev_state),
> >
> >       TP_STRUCT__entry(
> > -             __array(        char,   prev_comm,      TASK_COMM_LEN   )
> >               __field(        pid_t,  prev_pid                        )
> > -             __field(        int,    prev_prio                       )
> > -             __field(        long,   prev_state                      )
> > -             __array(        char,   next_comm,      TASK_COMM_LEN   )
> >               __field(        pid_t,  next_pid                        )
> > -             __field(        int,    next_prio                       )
> > +             __field(        short,  prev_prio                       )
> > +             __field(        short,  next_prio                       )
> > +             __array(        char,   prev_comm,      TASK_COMM_LEN   )
> > +             __array(        char,   next_comm,      TASK_COMM_LEN   )
> > +             __field(        short,  prev_state                      )
> >       ),
> >
> >       TP_fast_assign(
> > -             memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
> > -             __entry->prev_pid       = prev->pid;
> > -             __entry->prev_prio      = prev->prio;
> > -             __entry->prev_state     = __trace_sched_switch_state(preempt, prev_state, prev);
> > +             __entry->prev_pid               = prev->pid;
> > +             __entry->next_pid               = next->pid;
> > +             __entry->prev_prio              = (short) prev->prio;
> > +             __entry->next_prio              = (short) next->prio;
> >               memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
> > -             __entry->next_pid       = next->pid;
> > -             __entry->next_prio      = next->prio;
> > +             memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
> > +             __entry->prev_state             = __trace_sched_switch_state(preempt, prev_state, prev);
> >               /* XXX SCHED_DEADLINE */
> >       ),
> >
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v6 3/5] sched, tracing: reorganize fields of switch event struct
  2023-08-03  9:18   ` Steven Rostedt
  2023-08-03  9:51     ` Peter Zijlstra
  2023-08-03 12:54     ` Ze Gao
@ 2023-08-03 12:57     ` Ze Gao
  2 siblings, 0 replies; 35+ messages in thread
From: Ze Gao @ 2023-08-03 12:57 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ian Rogers, Ingo Molnar, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, Namhyung Kim, Peter Zijlstra, linux-kernel,
	linux-perf-users, linux-trace-kernel, Ze Gao

On Thu, Aug 3, 2023 at 5:18 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Thu,  3 Aug 2023 04:33:50 -0400
> Ze Gao <zegao2021@gmail.com> wrote:
>
> > Report prioritiy and prev_state in 'short' to save some buffer
> > space. And also reorder the fields so that we take struct
> > alignment into consideration to make the record compact.
>
> If I were to write this, I would have wrote:
>
>   The prev_state field in the sched_switch event is assigned by
>   __trace_sched_switch_state(). The largest number that function will return
>   is TASK_REPORT_MAX which is just 0x100. There's no reason that the
>   prev_state field is a full 32 bits when it is using just 9 bits max. In
>   order to save space on the ring buffer, shrink the prev_state to 16 bits
>   (short).
>
>   Also, change the positions of the other fields to accommodate the short
>   value of prev_state to eliminate any holes that were created in the
>   structure.

Thanks for this elaboration! But I see Peter have comments on this, I'll wait
to see these resolved and then send my new changelog.

Thanks,
Ze

> >
> >  #ifdef CREATE_TRACE_POINTS
> > -static inline long __trace_sched_switch_state(bool preempt,
> > +static inline short __trace_sched_switch_state(bool preempt,
> >                                             unsigned int prev_state,
> >                                             struct task_struct *p)
> >  {
> >       unsigned int state;
> >
> >  #ifdef CONFIG_SCHED_DEBUG
> > -     BUG_ON(p != current);
> > +     WARN_ON_ONCE(p != current);
> >  #endif /* CONFIG_SCHED_DEBUG */
>
> The above needs to be a separate patch.
>
> -- Steve
>
> >
> >       /*
> > @@ -229,23 +229,23 @@ TRACE_EVENT(sched_switch,
> >       TP_ARGS(preempt, prev, next, prev_state),
> >
> >       TP_STRUCT__entry(
> > -             __array(        char,   prev_comm,      TASK_COMM_LEN   )
> >               __field(        pid_t,  prev_pid                        )
> > -             __field(        int,    prev_prio                       )
> > -             __field(        long,   prev_state                      )
> > -             __array(        char,   next_comm,      TASK_COMM_LEN   )
> >               __field(        pid_t,  next_pid                        )
> > -             __field(        int,    next_prio                       )
> > +             __field(        short,  prev_prio                       )
> > +             __field(        short,  next_prio                       )
> > +             __array(        char,   prev_comm,      TASK_COMM_LEN   )
> > +             __array(        char,   next_comm,      TASK_COMM_LEN   )
> > +             __field(        short,  prev_state                      )
> >       ),
> >
> >       TP_fast_assign(
> > -             memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
> > -             __entry->prev_pid       = prev->pid;
> > -             __entry->prev_prio      = prev->prio;
> > -             __entry->prev_state     = __trace_sched_switch_state(preempt, prev_state, prev);
> > +             __entry->prev_pid               = prev->pid;
> > +             __entry->next_pid               = next->pid;
> > +             __entry->prev_prio              = (short) prev->prio;
> > +             __entry->next_prio              = (short) next->prio;
> >               memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
> > -             __entry->next_pid       = next->pid;
> > -             __entry->next_prio      = next->prio;
> > +             memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
> > +             __entry->prev_state             = __trace_sched_switch_state(preempt, prev_state, prev);
> >               /* XXX SCHED_DEADLINE */
> >       ),
> >
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v6 4/5] sched, tracing: add to report task state in symbolic chars
  2023-08-03  8:59   ` Peter Zijlstra
@ 2023-08-03 13:09     ` Ze Gao
  0 siblings, 0 replies; 35+ messages in thread
From: Ze Gao @ 2023-08-03 13:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ian Rogers, Ingo Molnar, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, Namhyung Kim, Steven Rostedt, linux-kernel,
	linux-perf-users, linux-trace-kernel, Ze Gao

On Thu, Aug 3, 2023 at 4:59 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Aug 03, 2023 at 04:33:51AM -0400, Ze Gao wrote:
> > Internal representations of task state are likely to be changed
> > or ordered, and reporting them to userspace without exporting
> > them as part of API is basically wrong, which can easily break
> > a userspace observability tool as kernel evolves. For example,
> > perf suffers from this and still reports wrong states as of this
> > writing.
> >
> > OTOH, some masqueraded states like TASK_REPORT_IDLE and
> > TASK_REPORT_MAX are also reported inadvertently, which confuses
> > things even more and most userspace tools do not even take them
> > into consideration.
> >
> > So add a new variable in company with the old raw value to
> > report task state in symbolic chars, which are self-explaining
> > and no further translation is needed. Of course this does not
> > break any userspace tool.
> >
> > Note for PREEMPT_ACTIVE, we introduce 'p' to report it and use
> > the old conventions for the rest.
> >
> > Signed-off-by: Ze Gao <zegao@tencent.com>
> > Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> > Acked-by: Ian Rogers <irogers@google.com>
>
> I'm not sure you've actually read any of the things I've written. I hate
> this. Not going to happen.

With all due respect, I'm afraid I've read so much carefully every word
you replied, I am just not fully convinced and still believe this is something
needs a fix but apparently you strongly refuse.

So I'm considering giving up this patch and only fixing the surface problem
to make everyone happy.

Thanks for taking your time on this topic.

Regards,
Ze

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v6 3/5] sched, tracing: reorganize fields of switch event struct
  2023-08-03  9:51     ` Peter Zijlstra
@ 2023-08-03 14:45       ` Steven Rostedt
  0 siblings, 0 replies; 35+ messages in thread
From: Steven Rostedt @ 2023-08-03 14:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ze Gao, Adrian Hunter, Alexander Shishkin,
	Arnaldo Carvalho de Melo, Ian Rogers, Ingo Molnar, Jiri Olsa,
	Mark Rutland, Masami Hiramatsu, Namhyung Kim, linux-kernel,
	linux-perf-users, linux-trace-kernel, Ze Gao

On Thu, 3 Aug 2023 11:51:32 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> > 
> > See the difference?  
> 
> This also doesn't mention you broke the data format for all trace events
> a while back to ensure people are using libtracefs and are thus
> confident this won't break things.


It was the meta data that happened a while ago. As other events change all
the time with this restriction (keeping field names around), I didn't
realize that it was required for this change log. Doesn't hurt adding it.

More details on that:

 commit b000c8065 ("tracing: Remove the extra 4 bytes of padding in events")

This was to get rid of the padding that powertop relied on.

Nit, it's libtraceevent not libtracefs, as libtracefs gets you to the
format files, but it's libtraceveent that parses them.


-- Steve

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v6 1/5] perf sched: sync state char array with the kernel
  2023-08-03  8:33 ` [RFC PATCH v6 1/5] perf sched: sync state char array with the kernel Ze Gao
  2023-08-03  9:09   ` Steven Rostedt
@ 2023-08-03 15:10   ` Steven Rostedt
  2023-08-04  2:21     ` Ze Gao
  1 sibling, 1 reply; 35+ messages in thread
From: Steven Rostedt @ 2023-08-03 15:10 UTC (permalink / raw)
  To: Ze Gao
  Cc: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ian Rogers, Ingo Molnar, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, Namhyung Kim, Peter Zijlstra, linux-kernel,
	linux-perf-users, linux-trace-kernel, Ze Gao

On Thu,  3 Aug 2023 04:33:48 -0400
Ze Gao <zegao2021@gmail.com> wrote:

> Update state char array and then remove unused and stale
> macros, which are kernel internal representations and not
> encouraged to use anymore.
> 
> Signed-off-by: Ze Gao <zegao@tencent.com>
> ---
>  tools/perf/builtin-sched.c | 13 +------------
>  1 file changed, 1 insertion(+), 12 deletions(-)
> 
> diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
> index 9ab300b6f131..8dc8f071721c 100644
> --- a/tools/perf/builtin-sched.c
> +++ b/tools/perf/builtin-sched.c
> @@ -92,23 +92,12 @@ struct sched_atom {
>  	struct task_desc	*wakee;
>  };
>  
> -#define TASK_STATE_TO_CHAR_STR "RSDTtZXxKWP"
> +#define TASK_STATE_TO_CHAR_STR "RSDTtXZPI"

Thinking about this more, this will always be wrong. Changing it just works
for the kernel you made the change for, but if it is run on another kernel,
it's broken again.

I actually wrote code once that basically just did a:

	struct trace_seq s;

	trace_seq_init(&s);
	tep_print_event(tep, &s, record, "%s", TEP_PRINT_INFO);

then searched s.buffer for "prev_state=%s ", to find the state character.

That's because the kernel should always be up to date (and why I said I
needed that string in the print_fmt).

As perf has a tep handle, this could be a helper function to extract the
state if needed, and get rind of relying on the above character array.

-- Steve


>  
>  /* task state bitmask, copied from include/linux/sched.h */
>  #define TASK_RUNNING		0
>  #define TASK_INTERRUPTIBLE	1
>  #define TASK_UNINTERRUPTIBLE	2
> -#define __TASK_STOPPED		4
> -#define __TASK_TRACED		8
> -/* in tsk->exit_state */
> -#define EXIT_DEAD		16
> -#define EXIT_ZOMBIE		32
> -#define EXIT_TRACE		(EXIT_ZOMBIE | EXIT_DEAD)
> -/* in tsk->state again */
> -#define TASK_DEAD		64
> -#define TASK_WAKEKILL		128
> -#define TASK_WAKING		256
> -#define TASK_PARKED		512
>  
>  enum thread_state {
>  	THREAD_SLEEPING = 0,


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v6 1/5] perf sched: sync state char array with the kernel
  2023-08-03 15:10   ` Steven Rostedt
@ 2023-08-04  2:21     ` Ze Gao
  2023-08-04  2:38       ` Ze Gao
  0 siblings, 1 reply; 35+ messages in thread
From: Ze Gao @ 2023-08-04  2:21 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ian Rogers, Ingo Molnar, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, Namhyung Kim, Peter Zijlstra, linux-kernel,
	linux-perf-users, linux-trace-kernel, Ze Gao

On Thu, Aug 3, 2023 at 11:10 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Thu,  3 Aug 2023 04:33:48 -0400
> Ze Gao <zegao2021@gmail.com> wrote:
>
> > Update state char array and then remove unused and stale
> > macros, which are kernel internal representations and not
> > encouraged to use anymore.
> >
> > Signed-off-by: Ze Gao <zegao@tencent.com>
> > ---
> >  tools/perf/builtin-sched.c | 13 +------------
> >  1 file changed, 1 insertion(+), 12 deletions(-)
> >
> > diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
> > index 9ab300b6f131..8dc8f071721c 100644
> > --- a/tools/perf/builtin-sched.c
> > +++ b/tools/perf/builtin-sched.c
> > @@ -92,23 +92,12 @@ struct sched_atom {
> >       struct task_desc        *wakee;
> >  };
> >
> > -#define TASK_STATE_TO_CHAR_STR "RSDTtZXxKWP"
> > +#define TASK_STATE_TO_CHAR_STR "RSDTtXZPI"
>
> Thinking about this more, this will always be wrong. Changing it just works
> for the kernel you made the change for, but if it is run on another kernel,
> it's broken again.

Indeed. There is no easy way to maintain backward compatibility unless
we stop using this bizarre 'prev_state' field. Basically all its users suffer
from this. That's why I believe this needs a fix to alert people does not
use 'prev_state' anymore.

> I actually wrote code once that basically just did a:
>
>         struct trace_seq s;
>
>         trace_seq_init(&s);
>         tep_print_event(tep, &s, record, "%s", TEP_PRINT_INFO);
>
> then searched s.buffer for "prev_state=%s ", to find the state character.
>
> That's because the kernel should always be up to date (and why I said I
> needed that string in the print_fmt).

Turing to building the state char array from print fmt string dynamically
is a great idea. :)

> As perf has a tep handle, this could be a helper function to extract the
> state if needed, and get rind of relying on the above character array.

I'll figure out how to make it happen.

BTW,  my last concern is that is there any better way to notice userspace to
avoid interpreting task state out of 'prev_state'. Because the awkward thing
happens again.

Thanks,
Ze

> -- Steve
>
>
> >
> >  /* task state bitmask, copied from include/linux/sched.h */
> >  #define TASK_RUNNING         0
> >  #define TASK_INTERRUPTIBLE   1
> >  #define TASK_UNINTERRUPTIBLE 2
> > -#define __TASK_STOPPED               4
> > -#define __TASK_TRACED                8
> > -/* in tsk->exit_state */
> > -#define EXIT_DEAD            16
> > -#define EXIT_ZOMBIE          32
> > -#define EXIT_TRACE           (EXIT_ZOMBIE | EXIT_DEAD)
> > -/* in tsk->state again */
> > -#define TASK_DEAD            64
> > -#define TASK_WAKEKILL                128
> > -#define TASK_WAKING          256
> > -#define TASK_PARKED          512
> >
> >  enum thread_state {
> >       THREAD_SLEEPING = 0,
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v6 1/5] perf sched: sync state char array with the kernel
  2023-08-04  2:21     ` Ze Gao
@ 2023-08-04  2:38       ` Ze Gao
  2023-08-04  3:19         ` Ze Gao
  0 siblings, 1 reply; 35+ messages in thread
From: Ze Gao @ 2023-08-04  2:38 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ian Rogers, Ingo Molnar, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, Namhyung Kim, Peter Zijlstra, linux-kernel,
	linux-perf-users, linux-trace-kernel, Ze Gao

On Fri, Aug 4, 2023 at 10:21 AM Ze Gao <zegao2021@gmail.com> wrote:
>
> On Thu, Aug 3, 2023 at 11:10 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > On Thu,  3 Aug 2023 04:33:48 -0400
> > Ze Gao <zegao2021@gmail.com> wrote:
> >
> > > Update state char array and then remove unused and stale
> > > macros, which are kernel internal representations and not
> > > encouraged to use anymore.
> > >
> > > Signed-off-by: Ze Gao <zegao@tencent.com>
> > > ---
> > >  tools/perf/builtin-sched.c | 13 +------------
> > >  1 file changed, 1 insertion(+), 12 deletions(-)
> > >
> > > diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
> > > index 9ab300b6f131..8dc8f071721c 100644
> > > --- a/tools/perf/builtin-sched.c
> > > +++ b/tools/perf/builtin-sched.c
> > > @@ -92,23 +92,12 @@ struct sched_atom {
> > >       struct task_desc        *wakee;
> > >  };
> > >
> > > -#define TASK_STATE_TO_CHAR_STR "RSDTtZXxKWP"
> > > +#define TASK_STATE_TO_CHAR_STR "RSDTtXZPI"
> >
> > Thinking about this more, this will always be wrong. Changing it just works
> > for the kernel you made the change for, but if it is run on another kernel,
> > it's broken again.
>
> Indeed. There is no easy way to maintain backward compatibility unless
> we stop using this bizarre 'prev_state' field. Basically all its users suffer
> from this. That's why I believe this needs a fix to alert people does not
> use 'prev_state' anymore.
>
> > I actually wrote code once that basically just did a:
> >
> >         struct trace_seq s;
> >
> >         trace_seq_init(&s);
> >         tep_print_event(tep, &s, record, "%s", TEP_PRINT_INFO);
> >
> > then searched s.buffer for "prev_state=%s ", to find the state character.
> >
> > That's because the kernel should always be up to date (and why I said I
> > needed that string in the print_fmt).
>
> Turing to building the state char array from print fmt string dynamically
> is a great idea. :)
>
> > As perf has a tep handle, this could be a helper function to extract the
> > state if needed, and get rind of relying on the above character array.
>
> I'll figure out how to make it happen.
>
> BTW,  my last concern is that is there any better way to notice userspace to
> avoid interpreting task state out of 'prev_state'. Because the awkward thing
> happens again.

By userspace, I mean all tools consume 'prev_state' but don't have print fmt
available, taking bpf tracepoint for example.

Regards,
Ze

> Thanks,
> Ze
>
> > -- Steve
> >
> >
> > >
> > >  /* task state bitmask, copied from include/linux/sched.h */
> > >  #define TASK_RUNNING         0
> > >  #define TASK_INTERRUPTIBLE   1
> > >  #define TASK_UNINTERRUPTIBLE 2
> > > -#define __TASK_STOPPED               4
> > > -#define __TASK_TRACED                8
> > > -/* in tsk->exit_state */
> > > -#define EXIT_DEAD            16
> > > -#define EXIT_ZOMBIE          32
> > > -#define EXIT_TRACE           (EXIT_ZOMBIE | EXIT_DEAD)
> > > -/* in tsk->state again */
> > > -#define TASK_DEAD            64
> > > -#define TASK_WAKEKILL                128
> > > -#define TASK_WAKING          256
> > > -#define TASK_PARKED          512
> > >
> > >  enum thread_state {
> > >       THREAD_SLEEPING = 0,
> >

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v6 1/5] perf sched: sync state char array with the kernel
  2023-08-04  2:38       ` Ze Gao
@ 2023-08-04  3:19         ` Ze Gao
  2023-08-04  3:41           ` Steven Rostedt
  0 siblings, 1 reply; 35+ messages in thread
From: Ze Gao @ 2023-08-04  3:19 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ian Rogers, Ingo Molnar, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, Namhyung Kim, Peter Zijlstra, linux-kernel,
	linux-perf-users, linux-trace-kernel, Ze Gao

On Fri, Aug 4, 2023 at 10:38 AM Ze Gao <zegao2021@gmail.com> wrote:
>
> On Fri, Aug 4, 2023 at 10:21 AM Ze Gao <zegao2021@gmail.com> wrote:
> >
> > On Thu, Aug 3, 2023 at 11:10 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> > >
> > > On Thu,  3 Aug 2023 04:33:48 -0400
> > > Ze Gao <zegao2021@gmail.com> wrote:
> > >
> > > > Update state char array and then remove unused and stale
> > > > macros, which are kernel internal representations and not
> > > > encouraged to use anymore.
> > > >
> > > > Signed-off-by: Ze Gao <zegao@tencent.com>
> > > > ---
> > > >  tools/perf/builtin-sched.c | 13 +------------
> > > >  1 file changed, 1 insertion(+), 12 deletions(-)
> > > >
> > > > diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
> > > > index 9ab300b6f131..8dc8f071721c 100644
> > > > --- a/tools/perf/builtin-sched.c
> > > > +++ b/tools/perf/builtin-sched.c
> > > > @@ -92,23 +92,12 @@ struct sched_atom {
> > > >       struct task_desc        *wakee;
> > > >  };
> > > >
> > > > -#define TASK_STATE_TO_CHAR_STR "RSDTtZXxKWP"
> > > > +#define TASK_STATE_TO_CHAR_STR "RSDTtXZPI"
> > >
> > > Thinking about this more, this will always be wrong. Changing it just works
> > > for the kernel you made the change for, but if it is run on another kernel,
> > > it's broken again.
> >
> > Indeed. There is no easy way to maintain backward compatibility unless
> > we stop using this bizarre 'prev_state' field. Basically all its users suffer
> > from this. That's why I believe this needs a fix to alert people does not
> > use 'prev_state' anymore.
> >
> > > I actually wrote code once that basically just did a:
> > >
> > >         struct trace_seq s;
> > >
> > >         trace_seq_init(&s);
> > >         tep_print_event(tep, &s, record, "%s", TEP_PRINT_INFO);
> > >
> > > then searched s.buffer for "prev_state=%s ", to find the state character.
> > >
> > > That's because the kernel should always be up to date (and why I said I
> > > needed that string in the print_fmt).
> >
> > Turing to building the state char array from print fmt string dynamically
> > is a great idea. :)

I realize this is not perfect as well after second thoughts, since this does not
take offline use of perf into consideration.  People might run perf on different
machines than where the perf.data gets recorded, in which way what we get
from  /sys/kernel/debug/tracing/events/sched/sched_switch/format is likely
different from the perf.data.

So let's parse it from TEP_PRINT_INFO of each record instead of building
the state char array and rely on 'prev_state' again. At least this fix all tools
that have TEP_PRINT_INFO available.

Thanks,
Ze



> > > As perf has a tep handle, this could be a helper function to extract the
> > > state if needed, and get rind of relying on the above character array.
> >
> > I'll figure out how to make it happen.
> >
> > BTW,  my last concern is that is there any better way to notice userspace to
> > avoid interpreting task state out of 'prev_state'. Because the awkward thing
> > happens again.
>
> By userspace, I mean all tools consume 'prev_state' but don't have print fmt
> available, taking bpf tracepoint for example.
>
> Regards,
> Ze
>
> > Thanks,
> > Ze
> >
> > > -- Steve
> > >
> > >
> > > >
> > > >  /* task state bitmask, copied from include/linux/sched.h */
> > > >  #define TASK_RUNNING         0
> > > >  #define TASK_INTERRUPTIBLE   1
> > > >  #define TASK_UNINTERRUPTIBLE 2
> > > > -#define __TASK_STOPPED               4
> > > > -#define __TASK_TRACED                8
> > > > -/* in tsk->exit_state */
> > > > -#define EXIT_DEAD            16
> > > > -#define EXIT_ZOMBIE          32
> > > > -#define EXIT_TRACE           (EXIT_ZOMBIE | EXIT_DEAD)
> > > > -/* in tsk->state again */
> > > > -#define TASK_DEAD            64
> > > > -#define TASK_WAKEKILL                128
> > > > -#define TASK_WAKING          256
> > > > -#define TASK_PARKED          512
> > > >
> > > >  enum thread_state {
> > > >       THREAD_SLEEPING = 0,
> > >

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v6 1/5] perf sched: sync state char array with the kernel
  2023-08-04  3:19         ` Ze Gao
@ 2023-08-04  3:41           ` Steven Rostedt
  2023-08-10  5:50             ` Ze Gao
  0 siblings, 1 reply; 35+ messages in thread
From: Steven Rostedt @ 2023-08-04  3:41 UTC (permalink / raw)
  To: Ze Gao
  Cc: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ian Rogers, Ingo Molnar, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, Namhyung Kim, Peter Zijlstra, linux-kernel,
	linux-perf-users, linux-trace-kernel, Ze Gao

On Fri, 4 Aug 2023 11:19:06 +0800
Ze Gao <zegao2021@gmail.com> wrote:

> I realize this is not perfect as well after second thoughts, since this does not
> take offline use of perf into consideration.  People might run perf on different
> machines than where the perf.data gets recorded, in which way what we get
> from  /sys/kernel/debug/tracing/events/sched/sched_switch/format is likely
> different from the perf.data.

If perf data files does what trace.dat files do, it should save the
file formats in the data files. It should not be reading the kernel
when reading the data file.

With trace-cmd, you can do: trace-cmd dump --events

And it will show you all the formats of the events that it saved in the
file.

-- Steve

^ permalink raw reply	[flat|nested] 35+ messages in thread

* (no subject)
  2023-08-04  3:41           ` Steven Rostedt
@ 2023-08-10  5:50             ` Ze Gao
  2023-08-10  6:07               ` [PATCH] perf sched: parse task state from tracepoint print format Ze Gao
  2023-08-11 17:28               ` Steven Rostedt
  0 siblings, 2 replies; 35+ messages in thread
From: Ze Gao @ 2023-08-10  5:50 UTC (permalink / raw)
  To: rostedt
  Cc: acme, adrian.hunter, alexander.shishkin, irogers, jolsa,
	linux-kernel, linux-perf-users, linux-trace-kernel, mark.rutland,
	mhiramat, mingo, namhyung, peterz, zegao2021, zegao

Hi Steven,

I managed to build task state char map dynamically by parsing
the tracepoint print format from data recorded by perf. And
likewise for libtraceevent.

FYI, I tried TEP_PRINT_INFO but no shot. It turns out TEP_PRINT_INFO
stills relies on libtraceevent (i.e., sched_switch_handler() in 
plugin_sched_switch.c) and we need to parse the print format on our own.

Anyway, it works now and I've tested on some perf.data in old formats
but not cover all the kernel releases.

Thoughts?

Regards,
Ze


From 6b2035494952efb2963e6459ae4dbfce496c3b97 Mon Sep 17 00:00:00 2001
From: Ze Gao <zegao@tencent.com>
Date: Wed, 2 Aug 2023 08:19:54 -0400
Subject: [PATCH] perf sched: parse task state from tracepoint print format

As of this writing, we use prev_state to report task state,
which relies on both the task state macros and
TASK_STATE_TO_CHAR_STR in kernel to interpret its actual
meaning. In this way, perf gets broken literally each time
TASK_STATE_TO_CHAR_STR changes as kernel evolves. Counting
on TASK_STATE_TO_CHAR_STR gurantees no backward compatibilty.

To fix this, we build the state char map from the print
format parsed from perf.data on the fly and removes
dependencies on these internal kernel definitions.

Note that we provide an intended helper get_task_prev_state()
for extracting task state from perf record and passing task
state in char elsewhere, which helps to eliminate the need to
know task state macros further.

Signed-off-by: Ze Gao <zegao@tencent.com>
---
 tools/perf/builtin-sched.c | 126 +++++++++++++++++++++++--------------
 1 file changed, 80 insertions(+), 46 deletions(-)

diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
index 9ab300b6f131..9366bc0a991d 100644
--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -52,6 +52,10 @@
 #define SYM_LEN			129
 #define MAX_PID			1024000
 
+#define TASK_STATE_MAX 16
+static char state_to_char[TASK_STATE_MAX];
+static unsigned int num_sleep_states = 0;
+
 static const char *cpu_list;
 static DECLARE_BITMAP(cpu_bitmap, MAX_NR_CPUS);
 
@@ -92,24 +96,6 @@ struct sched_atom {
 	struct task_desc	*wakee;
 };
 
-#define TASK_STATE_TO_CHAR_STR "RSDTtZXxKWP"
-
-/* task state bitmask, copied from include/linux/sched.h */
-#define TASK_RUNNING		0
-#define TASK_INTERRUPTIBLE	1
-#define TASK_UNINTERRUPTIBLE	2
-#define __TASK_STOPPED		4
-#define __TASK_TRACED		8
-/* in tsk->exit_state */
-#define EXIT_DEAD		16
-#define EXIT_ZOMBIE		32
-#define EXIT_TRACE		(EXIT_ZOMBIE | EXIT_DEAD)
-/* in tsk->state again */
-#define TASK_DEAD		64
-#define TASK_WAKEKILL		128
-#define TASK_WAKING		256
-#define TASK_PARKED		512
-
 enum thread_state {
 	THREAD_SLEEPING = 0,
 	THREAD_WAIT_CPU,
@@ -266,7 +252,7 @@ struct thread_runtime {
 	u64 total_preempt_time;
 	u64 total_delay_time;
 
-	int last_state;
+	char last_state;
 
 	char shortname[3];
 	bool comm_changed;
@@ -436,7 +422,7 @@ static void add_sched_event_wakeup(struct perf_sched *sched, struct task_desc *t
 }
 
 static void add_sched_event_sleep(struct perf_sched *sched, struct task_desc *task,
-				  u64 timestamp, u64 task_state __maybe_unused)
+				  u64 timestamp, char task_state __maybe_unused)
 {
 	struct sched_atom *event = get_new_event(task, timestamp);
 
@@ -851,6 +837,72 @@ replay_wakeup_event(struct perf_sched *sched,
 	return 0;
 }
 
+static struct tep_print_arg* task_state_print_flag(struct tep_event *event)
+{
+       struct tep_print_arg* args = event->print_fmt.args;
+
+       while(args)
+       {
+               if (args->type == TEP_PRINT_FLAGS)
+                       return args;
+               if (args->type == TEP_PRINT_OP) {
+                       args = args->op.right;
+                       args = args->op.left;
+                       continue;
+               }
+               args = args->next;
+       }
+       return NULL;
+}
+
+static void __parse_print_flag(struct tep_print_flag_sym *field)
+{
+
+       long val = strtol(field->value, NULL, 0);
+       unsigned int bit = val ? ffs(val) : 0;
+
+       state_to_char[bit] = field->str[0];
+       num_sleep_states++;
+       if(num_sleep_states > TASK_STATE_MAX - 1) {
+               pr_warning("too many states parsed, possibly bad format\n");
+               return;
+       }
+       if (field->next) {
+               __parse_print_flag(field->next);
+       }
+}
+
+static inline void parse_print_flag(struct tep_print_arg* args)
+{
+       __parse_print_flag(args->flags.flags);
+}
+
+static void build_task_state_arr(struct tep_event *event)
+{
+       struct tep_print_arg* args;
+
+       args = task_state_print_flag(event);
+       if (!args)
+               pr_warning("print flag not found, possibly bad format\n");
+       else
+               parse_print_flag(args);
+}
+
+static inline char get_prev_task_state(struct evsel *evsel,
+		struct perf_sample *sample)
+{
+	int prev_state = evsel__intval(evsel, sample, "prev_state");
+	unsigned int bit = prev_state ? ffs(prev_state) : 0;
+	char state;
+
+	if(!num_sleep_states)
+		build_task_state_arr(evsel->tp_format);
+
+	state = (!bit || bit > num_sleep_states) ? 'R' : state_to_char[bit];
+
+	return state;
+}
+
 static int replay_switch_event(struct perf_sched *sched,
 			       struct evsel *evsel,
 			       struct perf_sample *sample,
@@ -860,7 +912,7 @@ static int replay_switch_event(struct perf_sched *sched,
 		   *next_comm  = evsel__strval(evsel, sample, "next_comm");
 	const u32 prev_pid = evsel__intval(evsel, sample, "prev_pid"),
 		  next_pid = evsel__intval(evsel, sample, "next_pid");
-	const u64 prev_state = evsel__intval(evsel, sample, "prev_state");
+	const char prev_state = get_prev_task_state(evsel, sample);
 	struct task_desc *prev, __maybe_unused *next;
 	u64 timestamp0, timestamp = sample->time;
 	int cpu = sample->cpu;
@@ -1050,12 +1102,6 @@ static int thread_atoms_insert(struct perf_sched *sched, struct thread *thread)
 	return 0;
 }
 
-static char sched_out_state(u64 prev_state)
-{
-	const char *str = TASK_STATE_TO_CHAR_STR;
-
-	return str[prev_state];
-}
 
 static int
 add_sched_out_event(struct work_atoms *atoms,
@@ -1132,7 +1178,7 @@ static int latency_switch_event(struct perf_sched *sched,
 {
 	const u32 prev_pid = evsel__intval(evsel, sample, "prev_pid"),
 		  next_pid = evsel__intval(evsel, sample, "next_pid");
-	const u64 prev_state = evsel__intval(evsel, sample, "prev_state");
+	const char prev_state = get_prev_task_state(evsel, sample);
 	struct work_atoms *out_events, *in_events;
 	struct thread *sched_out, *sched_in;
 	u64 timestamp0, timestamp = sample->time;
@@ -1168,7 +1214,7 @@ static int latency_switch_event(struct perf_sched *sched,
 			goto out_put;
 		}
 	}
-	if (add_sched_out_event(out_events, sched_out_state(prev_state), timestamp))
+	if (add_sched_out_event(out_events, prev_state, timestamp))
 		return -1;
 
 	in_events = thread_atoms_search(&sched->atom_root, sched_in, &sched->cmp_pid);
@@ -2033,24 +2079,12 @@ static void timehist_header(struct perf_sched *sched)
 	printf("\n");
 }
 
-static char task_state_char(struct thread *thread, int state)
-{
-	static const char state_to_char[] = TASK_STATE_TO_CHAR_STR;
-	unsigned bit = state ? ffs(state) : 0;
-
-	/* 'I' for idle */
-	if (thread__tid(thread) == 0)
-		return 'I';
-
-	return bit < sizeof(state_to_char) - 1 ? state_to_char[bit] : '?';
-}
-
 static void timehist_print_sample(struct perf_sched *sched,
 				  struct evsel *evsel,
 				  struct perf_sample *sample,
 				  struct addr_location *al,
 				  struct thread *thread,
-				  u64 t, int state)
+				  u64 t, char state)
 {
 	struct thread_runtime *tr = thread__priv(thread);
 	const char *next_comm = evsel__strval(evsel, sample, "next_comm");
@@ -2091,7 +2125,7 @@ static void timehist_print_sample(struct perf_sched *sched,
 	print_sched_time(tr->dt_run, 6);
 
 	if (sched->show_state)
-		printf(" %5c ", task_state_char(thread, state));
+		printf(" %5c ", thread->tid == 0 ? 'I' : state);
 
 	if (sched->show_next) {
 		snprintf(nstr, sizeof(nstr), "next: %s[%d]", next_comm, next_pid);
@@ -2163,9 +2197,9 @@ static void timehist_update_runtime_stats(struct thread_runtime *r,
 		else if (r->last_time) {
 			u64 dt_wait = tprev - r->last_time;
 
-			if (r->last_state == TASK_RUNNING)
+			if (r->last_state == 'R')
 				r->dt_preempt = dt_wait;
-			else if (r->last_state == TASK_UNINTERRUPTIBLE)
+			else if (r->last_state == 'D')
 				r->dt_iowait = dt_wait;
 			else
 				r->dt_sleep = dt_wait;
@@ -2590,7 +2624,7 @@ static int timehist_sched_change_event(struct perf_tool *tool,
 	struct thread_runtime *tr = NULL;
 	u64 tprev, t = sample->time;
 	int rc = 0;
-	int state = evsel__intval(evsel, sample, "prev_state");
+	const char state = get_prev_task_state(evsel, sample);
 
 	addr_location__init(&al);
 	if (machine__resolve(machine, &al, sample) < 0) {
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH] perf sched: parse task state from tracepoint print format
  2023-08-10  5:50             ` Ze Gao
@ 2023-08-10  6:07               ` Ze Gao
  2023-08-11 17:28               ` Steven Rostedt
  1 sibling, 0 replies; 35+ messages in thread
From: Ze Gao @ 2023-08-10  6:07 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Arnaldo Carvalho de Melo, Adrian Hunter, Alexander Shishkin,
	Ian Rogers, Jiri Olsa, linux-kernel, linux-perf-users,
	linux-trace-kernel, Mark Rutland, Masami Hiramatsu, Ingo Molnar,
	Namhyung Kim, Peter Zijlstra, Ze Gao

[REPLIED WITH A SUBJECT GIT-SEND-EMAIL FAILED TO PARSE]

Regards,
Ze

On Thu, Aug 10, 2023 at 1:52 PM Ze Gao <zegao2021@gmail.com> wrote:
>
> Hi Steven,
>
> I managed to build task state char map dynamically by parsing
> the tracepoint print format from data recorded by perf. And
> likewise for libtraceevent.
>
> FYI, I tried TEP_PRINT_INFO but no shot. It turns out TEP_PRINT_INFO
> stills relies on libtraceevent (i.e., sched_switch_handler() in
> plugin_sched_switch.c) and we need to parse the print format on our own.
>
> Anyway, it works now and I've tested on some perf.data in old formats
> but not cover all the kernel releases.
>
> Thoughts?
>
> Regards,
> Ze
>
>
> From 6b2035494952efb2963e6459ae4dbfce496c3b97 Mon Sep 17 00:00:00 2001
> From: Ze Gao <zegao@tencent.com>
> Date: Wed, 2 Aug 2023 08:19:54 -0400
> Subject: [PATCH] perf sched: parse task state from tracepoint print format
>
> As of this writing, we use prev_state to report task state,
> which relies on both the task state macros and
> TASK_STATE_TO_CHAR_STR in kernel to interpret its actual
> meaning. In this way, perf gets broken literally each time
> TASK_STATE_TO_CHAR_STR changes as kernel evolves. Counting
> on TASK_STATE_TO_CHAR_STR gurantees no backward compatibilty.
>
> To fix this, we build the state char map from the print
> format parsed from perf.data on the fly and removes
> dependencies on these internal kernel definitions.
>
> Note that we provide an intended helper get_task_prev_state()
> for extracting task state from perf record and passing task
> state in char elsewhere, which helps to eliminate the need to
> know task state macros further.
>
> Signed-off-by: Ze Gao <zegao@tencent.com>
> ---
>  tools/perf/builtin-sched.c | 126 +++++++++++++++++++++++--------------
>  1 file changed, 80 insertions(+), 46 deletions(-)
>
> diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
> index 9ab300b6f131..9366bc0a991d 100644
> --- a/tools/perf/builtin-sched.c
> +++ b/tools/perf/builtin-sched.c
> @@ -52,6 +52,10 @@
>  #define SYM_LEN                        129
>  #define MAX_PID                        1024000
>
> +#define TASK_STATE_MAX 16
> +static char state_to_char[TASK_STATE_MAX];
> +static unsigned int num_sleep_states = 0;
> +
>  static const char *cpu_list;
>  static DECLARE_BITMAP(cpu_bitmap, MAX_NR_CPUS);
>
> @@ -92,24 +96,6 @@ struct sched_atom {
>         struct task_desc        *wakee;
>  };
>
> -#define TASK_STATE_TO_CHAR_STR "RSDTtZXxKWP"
> -
> -/* task state bitmask, copied from include/linux/sched.h */
> -#define TASK_RUNNING           0
> -#define TASK_INTERRUPTIBLE     1
> -#define TASK_UNINTERRUPTIBLE   2
> -#define __TASK_STOPPED         4
> -#define __TASK_TRACED          8
> -/* in tsk->exit_state */
> -#define EXIT_DEAD              16
> -#define EXIT_ZOMBIE            32
> -#define EXIT_TRACE             (EXIT_ZOMBIE | EXIT_DEAD)
> -/* in tsk->state again */
> -#define TASK_DEAD              64
> -#define TASK_WAKEKILL          128
> -#define TASK_WAKING            256
> -#define TASK_PARKED            512
> -
>  enum thread_state {
>         THREAD_SLEEPING = 0,
>         THREAD_WAIT_CPU,
> @@ -266,7 +252,7 @@ struct thread_runtime {
>         u64 total_preempt_time;
>         u64 total_delay_time;
>
> -       int last_state;
> +       char last_state;
>
>         char shortname[3];
>         bool comm_changed;
> @@ -436,7 +422,7 @@ static void add_sched_event_wakeup(struct perf_sched *sched, struct task_desc *t
>  }
>
>  static void add_sched_event_sleep(struct perf_sched *sched, struct task_desc *task,
> -                                 u64 timestamp, u64 task_state __maybe_unused)
> +                                 u64 timestamp, char task_state __maybe_unused)
>  {
>         struct sched_atom *event = get_new_event(task, timestamp);
>
> @@ -851,6 +837,72 @@ replay_wakeup_event(struct perf_sched *sched,
>         return 0;
>  }
>
> +static struct tep_print_arg* task_state_print_flag(struct tep_event *event)
> +{
> +       struct tep_print_arg* args = event->print_fmt.args;
> +
> +       while(args)
> +       {
> +               if (args->type == TEP_PRINT_FLAGS)
> +                       return args;
> +               if (args->type == TEP_PRINT_OP) {
> +                       args = args->op.right;
> +                       args = args->op.left;
> +                       continue;
> +               }
> +               args = args->next;
> +       }
> +       return NULL;
> +}
> +
> +static void __parse_print_flag(struct tep_print_flag_sym *field)
> +{
> +
> +       long val = strtol(field->value, NULL, 0);
> +       unsigned int bit = val ? ffs(val) : 0;
> +
> +       state_to_char[bit] = field->str[0];
> +       num_sleep_states++;
> +       if(num_sleep_states > TASK_STATE_MAX - 1) {
> +               pr_warning("too many states parsed, possibly bad format\n");
> +               return;
> +       }
> +       if (field->next) {
> +               __parse_print_flag(field->next);
> +       }
> +}
> +
> +static inline void parse_print_flag(struct tep_print_arg* args)
> +{
> +       __parse_print_flag(args->flags.flags);
> +}
> +
> +static void build_task_state_arr(struct tep_event *event)
> +{
> +       struct tep_print_arg* args;
> +
> +       args = task_state_print_flag(event);
> +       if (!args)
> +               pr_warning("print flag not found, possibly bad format\n");
> +       else
> +               parse_print_flag(args);
> +}
> +
> +static inline char get_prev_task_state(struct evsel *evsel,
> +               struct perf_sample *sample)
> +{
> +       int prev_state = evsel__intval(evsel, sample, "prev_state");
> +       unsigned int bit = prev_state ? ffs(prev_state) : 0;
> +       char state;
> +
> +       if(!num_sleep_states)
> +               build_task_state_arr(evsel->tp_format);
> +
> +       state = (!bit || bit > num_sleep_states) ? 'R' : state_to_char[bit];
> +
> +       return state;
> +}
> +
>  static int replay_switch_event(struct perf_sched *sched,
>                                struct evsel *evsel,
>                                struct perf_sample *sample,
> @@ -860,7 +912,7 @@ static int replay_switch_event(struct perf_sched *sched,
>                    *next_comm  = evsel__strval(evsel, sample, "next_comm");
>         const u32 prev_pid = evsel__intval(evsel, sample, "prev_pid"),
>                   next_pid = evsel__intval(evsel, sample, "next_pid");
> -       const u64 prev_state = evsel__intval(evsel, sample, "prev_state");
> +       const char prev_state = get_prev_task_state(evsel, sample);
>         struct task_desc *prev, __maybe_unused *next;
>         u64 timestamp0, timestamp = sample->time;
>         int cpu = sample->cpu;
> @@ -1050,12 +1102,6 @@ static int thread_atoms_insert(struct perf_sched *sched, struct thread *thread)
>         return 0;
>  }
>
> -static char sched_out_state(u64 prev_state)
> -{
> -       const char *str = TASK_STATE_TO_CHAR_STR;
> -
> -       return str[prev_state];
> -}
>
>  static int
>  add_sched_out_event(struct work_atoms *atoms,
> @@ -1132,7 +1178,7 @@ static int latency_switch_event(struct perf_sched *sched,
>  {
>         const u32 prev_pid = evsel__intval(evsel, sample, "prev_pid"),
>                   next_pid = evsel__intval(evsel, sample, "next_pid");
> -       const u64 prev_state = evsel__intval(evsel, sample, "prev_state");
> +       const char prev_state = get_prev_task_state(evsel, sample);
>         struct work_atoms *out_events, *in_events;
>         struct thread *sched_out, *sched_in;
>         u64 timestamp0, timestamp = sample->time;
> @@ -1168,7 +1214,7 @@ static int latency_switch_event(struct perf_sched *sched,
>                         goto out_put;
>                 }
>         }
> -       if (add_sched_out_event(out_events, sched_out_state(prev_state), timestamp))
> +       if (add_sched_out_event(out_events, prev_state, timestamp))
>                 return -1;
>
>         in_events = thread_atoms_search(&sched->atom_root, sched_in, &sched->cmp_pid);
> @@ -2033,24 +2079,12 @@ static void timehist_header(struct perf_sched *sched)
>         printf("\n");
>  }
>
> -static char task_state_char(struct thread *thread, int state)
> -{
> -       static const char state_to_char[] = TASK_STATE_TO_CHAR_STR;
> -       unsigned bit = state ? ffs(state) : 0;
> -
> -       /* 'I' for idle */
> -       if (thread__tid(thread) == 0)
> -               return 'I';
> -
> -       return bit < sizeof(state_to_char) - 1 ? state_to_char[bit] : '?';
> -}
> -
>  static void timehist_print_sample(struct perf_sched *sched,
>                                   struct evsel *evsel,
>                                   struct perf_sample *sample,
>                                   struct addr_location *al,
>                                   struct thread *thread,
> -                                 u64 t, int state)
> +                                 u64 t, char state)
>  {
>         struct thread_runtime *tr = thread__priv(thread);
>         const char *next_comm = evsel__strval(evsel, sample, "next_comm");
> @@ -2091,7 +2125,7 @@ static void timehist_print_sample(struct perf_sched *sched,
>         print_sched_time(tr->dt_run, 6);
>
>         if (sched->show_state)
> -               printf(" %5c ", task_state_char(thread, state));
> +               printf(" %5c ", thread->tid == 0 ? 'I' : state);
>
>         if (sched->show_next) {
>                 snprintf(nstr, sizeof(nstr), "next: %s[%d]", next_comm, next_pid);
> @@ -2163,9 +2197,9 @@ static void timehist_update_runtime_stats(struct thread_runtime *r,
>                 else if (r->last_time) {
>                         u64 dt_wait = tprev - r->last_time;
>
> -                       if (r->last_state == TASK_RUNNING)
> +                       if (r->last_state == 'R')
>                                 r->dt_preempt = dt_wait;
> -                       else if (r->last_state == TASK_UNINTERRUPTIBLE)
> +                       else if (r->last_state == 'D')
>                                 r->dt_iowait = dt_wait;
>                         else
>                                 r->dt_sleep = dt_wait;
> @@ -2590,7 +2624,7 @@ static int timehist_sched_change_event(struct perf_tool *tool,
>         struct thread_runtime *tr = NULL;
>         u64 tprev, t = sample->time;
>         int rc = 0;
> -       int state = evsel__intval(evsel, sample, "prev_state");
> +       const char state = get_prev_task_state(evsel, sample);
>
>         addr_location__init(&al);
>         if (machine__resolve(machine, &al, sample) < 0) {
> --
> 2.41.0
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] perf sched: parse task state from tracepoint print format
  2023-08-10  5:50             ` Ze Gao
  2023-08-10  6:07               ` [PATCH] perf sched: parse task state from tracepoint print format Ze Gao
@ 2023-08-11 17:28               ` Steven Rostedt
  2023-08-14  2:28                 ` Ze Gao
  1 sibling, 1 reply; 35+ messages in thread
From: Steven Rostedt @ 2023-08-11 17:28 UTC (permalink / raw)
  To: Ze Gao
  Cc: acme, adrian.hunter, alexander.shishkin, irogers, jolsa,
	linux-kernel, linux-perf-users, linux-trace-kernel, mark.rutland,
	mhiramat, mingo, namhyung, peterz, zegao

On Thu, 10 Aug 2023 01:50:24 -0400
Ze Gao <zegao2021@gmail.com> wrote:

> Hi Steven,
> 
> I managed to build task state char map dynamically by parsing
> the tracepoint print format from data recorded by perf. And
> likewise for libtraceevent.
> 
> FYI, I tried TEP_PRINT_INFO but no shot. It turns out TEP_PRINT_INFO
> stills relies on libtraceevent (i.e., sched_switch_handler() in 
> plugin_sched_switch.c) and we need to parse the print format on our own.

There is a way to unload plugins:

	tep_unload_plugins(t->plugin_list, tep);

Hmm, I should add:

	tep_unload_plugin(tep, t->plugin_list, "plugin-name");

To unload a single plugin.

I it can also just override what the plugin does by calling:

static int sched_switch_handler(struct trace_seq *s,
				struct tep_record *record,
				struct tep_event *event, void *context)
{
	// do whatever you want.
}

	tep_register_event_handler(tep, -1, "sched", "sched_switch",
				   sched_switch_handler, NULL);

> 
> Anyway, it works now and I've tested on some perf.data in old formats
> but not cover all the kernel releases.
> 
> Thoughts?

I don't maintain the perf code. You'll have to talk with the perf
maintainers.

-- Steve

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] perf sched: parse task state from tracepoint print format
  2023-08-11 17:28               ` Steven Rostedt
@ 2023-08-14  2:28                 ` Ze Gao
  0 siblings, 0 replies; 35+ messages in thread
From: Ze Gao @ 2023-08-14  2:28 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: acme, adrian.hunter, alexander.shishkin, irogers, jolsa,
	linux-kernel, linux-perf-users, linux-trace-kernel, mark.rutland,
	mhiramat, mingo, namhyung, peterz, zegao

On Sat, Aug 12, 2023 at 1:28 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Thu, 10 Aug 2023 01:50:24 -0400
> Ze Gao <zegao2021@gmail.com> wrote:
>
> > Hi Steven,
> >
> > I managed to build task state char map dynamically by parsing
> > the tracepoint print format from data recorded by perf. And
> > likewise for libtraceevent.
> >
> > FYI, I tried TEP_PRINT_INFO but no shot. It turns out TEP_PRINT_INFO
> > stills relies on libtraceevent (i.e., sched_switch_handler() in
> > plugin_sched_switch.c) and we need to parse the print format on our own.
>
> There is a way to unload plugins:
>
>         tep_unload_plugins(t->plugin_list, tep);
>
> Hmm, I should add:
>
>         tep_unload_plugin(tep, t->plugin_list, "plugin-name");
>
> To unload a single plugin.
>
> I it can also just override what the plugin does by calling:
>
> static int sched_switch_handler(struct trace_seq *s,
>                                 struct tep_record *record,
>                                 struct tep_event *event, void *context)
> {
>         // do whatever you want.
> }
>
>         tep_register_event_handler(tep, -1, "sched", "sched_switch",
>                                    sched_switch_handler, NULL);
>
Yes,  I chose to fix libtraceevent in a similar way, to not break
users of this plugin,
like perf script, I decided to build state char mapping dynamically
for both instead
of overriding sched_switch_handler in libtraceevent. Here is the patch:

From e4fae23d9538e60e75a9776fa7938102e7c26bbb Mon Sep 17 00:00:00 2001
From: Ze Gao <zegao@tencent.com>
Date: Wed, 2 Aug 2023 07:20:46 -0400
Subject: [PATCH] libtraceevent: parse task state from tracepoint print format

As of this writing, we use prev_state to report task state, which
relies on both the task state macros and TASK_STATE_TO_CHAR_STR
copied from the kernel to interpret its actual meaning. In this way,
libtraceevent gets broken literally each time TASK_STATE_TO_CHAR_STR
changes as the kernel evolves. Counting on hardcoded
TASK_STATE_TO_CHAR_STR gurantees no backward compatibilty.

To fix this, we build the state char map from the print format
parsed from date recorded on the fly and removes dependencies on
these internal kernel definitions.

Signed-off-by: Ze Gao <zegao@tencent.com>
---
 plugins/plugin_sched_switch.c | 103 ++++++++++++++++++++++++++++------
 1 file changed, 87 insertions(+), 16 deletions(-)

diff --git a/plugins/plugin_sched_switch.c b/plugins/plugin_sched_switch.c
index 8752cae..a0df138 100644
--- a/plugins/plugin_sched_switch.c
+++ b/plugins/plugin_sched_switch.c
@@ -6,28 +6,25 @@
 #include <stdlib.h>
 #include <string.h>

+#include "event-utils.h"
 #include "event-parse.h"
 #include "trace-seq.h"

-static void write_state(struct trace_seq *s, int val)
-{
-       const char states[] = "SDTtZXxW";
-       int found = 0;
-       int i;
+#define TASK_STATE_MAX 16

-       for (i = 0; i < (sizeof(states) - 1); i++) {
-               if (!(val & (1 << i)))
-                       continue;
+static char state_to_char[TASK_STATE_MAX];
+static unsigned int num_sleeping_states = 0;

-               if (found)
-                       trace_seq_putc(s, '|');

-               found = 1;
-               trace_seq_putc(s, states[i]);
-       }
+static void write_state(struct trace_seq *s, int val)
+{
+       unsigned int bit = val ? ffs(val) : 0;
+       char state;

-       if (!found)
-               trace_seq_putc(s, 'R');
+       state = (!bit || bit > num_sleeping_states) ? 'R' : state_to_char[bit];
+       trace_seq_putc(s, state);
+       if(bit > num_sleeping_states)
+               trace_seq_putc(s, '+');
 }

 static void write_and_save_comm(struct tep_format_field *field,
@@ -79,6 +76,76 @@ static int sched_wakeup_handler(struct trace_seq *s,
        return 0;
 }

+static const struct tep_print_arg* task_state_print_flag(const struct
tep_event *event) {
+
+       struct tep_print_arg* args;
+
+       if (!event)
+               goto out;
+
+       args = event->print_fmt.args;
+       while(args)
+       {
+               if (args->type == TEP_PRINT_FLAGS)
+                       return args;
+               if (args->type == TEP_PRINT_OP) {
+                       args = args->op.right;
+                       args = args->op.left;
+                       continue;
+               }
+               args = args->next;
+       }
+out:
+       return NULL;
+}
+
+static int parse_task_state_arr(const char *value, const char *str)
+{
+       long val;
+       unsigned int bit;
+
+       if (!value || !str)
+               return -1;
+
+       val = strtol(value, NULL, 0);
+       bit = val ? ffs(val) : 0;
+       state_to_char[bit] = str[0];
+       num_sleeping_states++;
+       if (num_sleeping_states > TASK_STATE_MAX - 1) {
+               tep_warning("too many states parsed, possibly bad format\n");
+               return -1;
+       }
+       return 0;
+}
+
+static void parse_print_flag(const struct tep_print_flag_sym* field,
+                            int (*parser)(const char *value, const char *str))
+{
+       int err;
+
+       if (!field || !parser)
+               return;
+       err = parser(field->value, field->str);
+       if (err){
+               tep_warning("parsing print flag failed, possibly bad format\n");
+               return;
+       }
+       if (field->next)
+               parse_print_flag(field->next, parser);
+
+}
+
+static void build_task_state_arr(const struct tep_event *event)
+{
+       const struct tep_print_arg* args;
+
+       args = task_state_print_flag(event);
+       if (!args)
+               tep_warning("print flag not found, possibly bad format\n");
+       else
+               parse_print_flag(args->flags.flags, parse_task_state_arr);
+}
+
 static int sched_switch_handler(struct trace_seq *s,
                                struct tep_record *record,
                                struct tep_event *event, void *context)
@@ -99,8 +166,12 @@ static int sched_switch_handler(struct trace_seq *s,
        if (tep_get_field_val(s, event, "prev_prio", record, &val, 1) == 0)
                trace_seq_printf(s, "[%d] ", (int) val);

-       if (tep_get_field_val(s,  event, "prev_state", record, &val, 1) == 0)
+
+       if (tep_get_field_val(s,  event, "prev_state", record, &val, 1) == 0) {
+               if (!num_sleeping_states)
+                       build_task_state_arr(event);
                write_state(s, val);
+       }

        trace_seq_puts(s, " ==> ");

--
2.41.0



> > Anyway, it works now and I've tested on some perf.data in old formats
> > but not cover all the kernel releases.
> >
> > Thoughts?
>
> I don't maintain the perf code. You'll have to talk with the perf
> maintainers.

Roger that! Thank you very much!

Regards,
Ze

> -- Steve

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v6 3/5] sched, tracing: reorganize fields of switch event struct
  2023-08-03  8:33 ` [RFC PATCH v6 3/5] sched, tracing: reorganize fields of switch event struct Ze Gao
  2023-08-03  8:53   ` Peter Zijlstra
  2023-08-03  9:18   ` Steven Rostedt
@ 2023-08-23  2:52   ` kernel test robot
  2 siblings, 0 replies; 35+ messages in thread
From: kernel test robot @ 2023-08-23  2:52 UTC (permalink / raw)
  To: Ze Gao
  Cc: oe-lkp, lkp, Steven Rostedt, Ze Gao, linux-kernel,
	linux-trace-kernel, Adrian Hunter, Alexander Shishkin,
	Arnaldo Carvalho de Melo, Ian Rogers, Ingo Molnar, Jiri Olsa,
	Mark Rutland, Masami Hiramatsu, Namhyung Kim, Peter Zijlstra,
	linux-perf-users, oliver.sang



Hello,

kernel test robot noticed "perf-sanity-tests.Parse_sched_tracepoints_fields.fail" on:

commit: 029aadfe946d99c4b11f1dd52f9ff76a09b21f69 ("[RFC PATCH v6 3/5] sched, tracing: reorganize fields of switch event struct")
url: https://github.com/intel-lab-lkp/linux/commits/Ze-Gao/perf-sched-sync-state-char-array-with-the-kernel/20230803-163946
base: https://git.kernel.org/cgit/linux/kernel/git/acme/linux.git perf/core
patch link: https://lore.kernel.org/all/20230803083352.1585-4-zegao@tencent.com/
patch subject: [RFC PATCH v6 3/5] sched, tracing: reorganize fields of switch event struct

in testcase: perf-sanity-tests
version: perf-x86_64-00c7b5f4ddc5-1_20230402
with following parameters:

	perf_compiler: gcc



compiler: gcc-12
test machine: 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480+ (Sapphire Rapids) with 256G memory

(please refer to attached dmesg/kmsg for entire log/backtrace)



If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202308231018.8ddf15b7-oliver.sang@intel.com



 14: Roundtrip evsel->name                                           : Ok
 15: Parse sched tracepoints fields                                  : FAILED!
 16: syscalls:sys_enter_openat event fields                          : Ok



The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20230823/202308231018.8ddf15b7-oliver.sang@intel.com



-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2023-08-23  2:52 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-08-03  8:33 [RFC PATCH v6 0/5] fix task state report from sched tracepoint Ze Gao
2023-08-03  8:33 ` [RFC PATCH v6 1/5] perf sched: sync state char array with the kernel Ze Gao
2023-08-03  9:09   ` Steven Rostedt
2023-08-03 10:29     ` Ze Gao
2023-08-03 12:25       ` Ze Gao
2023-08-03 12:39     ` Ze Gao
2023-08-03 15:10   ` Steven Rostedt
2023-08-04  2:21     ` Ze Gao
2023-08-04  2:38       ` Ze Gao
2023-08-04  3:19         ` Ze Gao
2023-08-04  3:41           ` Steven Rostedt
2023-08-10  5:50             ` Ze Gao
2023-08-10  6:07               ` [PATCH] perf sched: parse task state from tracepoint print format Ze Gao
2023-08-11 17:28               ` Steven Rostedt
2023-08-14  2:28                 ` Ze Gao
2023-08-03  8:33 ` [RFC PATCH v6 2/5] perf sched: reorganize sched-out task state report code Ze Gao
2023-08-03  9:10   ` Steven Rostedt
2023-08-03 12:37     ` Ze Gao
2023-08-03  8:33 ` [RFC PATCH v6 3/5] sched, tracing: reorganize fields of switch event struct Ze Gao
2023-08-03  8:53   ` Peter Zijlstra
2023-08-03 11:06     ` Ze Gao
2023-08-03  9:18   ` Steven Rostedt
2023-08-03  9:51     ` Peter Zijlstra
2023-08-03 14:45       ` Steven Rostedt
2023-08-03 12:54     ` Ze Gao
2023-08-03 12:57     ` Ze Gao
2023-08-23  2:52   ` kernel test robot
2023-08-03  8:33 ` [RFC PATCH v6 4/5] sched, tracing: add to report task state in symbolic chars Ze Gao
2023-08-03  8:59   ` Peter Zijlstra
2023-08-03 13:09     ` Ze Gao
2023-08-03  9:29   ` Steven Rostedt
2023-08-03 10:55     ` Ze Gao
2023-08-03  8:33 ` [RFC PATCH v6 5/5] perf sched: prefer to use prev_state_char introduced in sched_switch Ze Gao
2023-08-03  9:34   ` Steven Rostedt
2023-08-03 11:01     ` Ze Gao

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).