* [PATCH v3 0/2] sched/tracing: sched_switch prev_state reported as TASK_RUNNING when it's not
@ 2022-01-20 16:25 Valentin Schneider
  2022-01-20 16:25 ` [PATCH v3 1/2] sched/tracing: Don't re-read p->state when emitting sched_switch event Valentin Schneider
                   ` (3 more replies)
  0 siblings, 4 replies; 49+ messages in thread
From: Valentin Schneider @ 2022-01-20 16:25 UTC (permalink / raw)
  To: linux-kernel
  Cc: Abhijeet Dharmapurikar, Uwe Kleine-König, Dietmar Eggemann,
	Steven Rostedt, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Thomas Gleixner, Sebastian Andrzej Siewior, Juri Lelli,
	Daniel Bristot de Oliveira, Kees Cook, Andrew Morton,
	Eric W. Biederman, Alexey Gladkov, Kenta.Tada, Randy Dunlap,
	Ed Tsai

Hi folks,

Problem
=======

Abhijeet pointed out that the following sequence of trace events can be
observed:

  cat-1676  [001] d..3 103.010411: sched_waking: comm=grep pid=1677 prio=120 target_cpu=004
  grep-1677 [004] d..2 103.010440: sched_switch: prev_comm=grep prev_pid=1677 prev_prio=120 prev_state=R 0x0 ==> next_comm=swapper/4 next_pid=0 next_prio=120
  <idle>-0  [004] dNh3 103.0100459: sched_wakeup: comm=grep pid=1677 prio=120 target_cpu=004

IOW, a not-yet-woken task gets presented as runnable and switched out in
favor of the idle task... Dietmar and I had a look; it turns out this sequence
can happen:

		      p->__state = TASK_INTERRUPTIBLE;
		      __schedule()
			deactivate_task(p);
  ttwu()
    READ !p->on_rq
    p->__state=TASK_WAKING
			trace_sched_switch()
			  __trace_sched_switch_state()
			    task_state_index()
			      return 0;

TASK_WAKING isn't in TASK_REPORT, which is why task_state_index(p) returns 0.
This can happen as of commit c6e7bd7afaeb ("sched/core: Optimize ttwu()
spinning on p->on_cpu") which punted the TASK_WAKING write to the other
side of

  smp_cond_load_acquire(&p->on_cpu, !VAL);

in ttwu().
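
To make the masking concrete, here's a minimal userspace sketch of what
task_state_index() + task_index_to_char() boil down to. The TASK_* values
are the mainline ones at the time of writing and the exit_state/TASK_IDLE
handling is left out, so treat it as an illustration rather than the
kernel code itself:

  #include <stdio.h>

  #define TASK_INTERRUPTIBLE	0x0001
  #define TASK_WAKING		0x0100
  #define TASK_REPORT		0x007f	/* R, S, D, T, t, X, Z, P */

  /* Mirror of task_state_index(), minus exit_state/TASK_IDLE handling */
  static unsigned int state_index(unsigned int tsk_state)
  {
  	unsigned int state = tsk_state & TASK_REPORT;

  	/* Stand-in for the kernel's fls(); returns 0 when no bit is set */
  	return state ? 32 - __builtin_clz(state) : 0;
  }

  int main(void)
  {
  	static const char state_char[] = "RSDTtXZPI";

  	/* What the race makes the event see: ttwu() wrote TASK_WAKING */
  	printf("%c\n", state_char[state_index(TASK_WAKING)]);        /* 'R' */
  	/* What __schedule() actually sampled */
  	printf("%c\n", state_char[state_index(TASK_INTERRUPTIBLE)]); /* 'S' */
  	return 0;
  }

The first printf() is the bogus prev_state=R in the trace above; the
second is what patch 1 makes the event report.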

Uwe reported on #linux-rt what I think is a similar issue with
TASK_RTLOCK_WAIT on PREEMPT_RT; again that state isn't in TASK_REPORT so
you get prev_state=0 despite the task blocking on a lock.

Both of those are very confusing for tooling or even human observers.

Patches
=======

For the TASK_WAKING case, pushing the prev_state read in __schedule() down
to __trace_sched_switch_state() seems sufficient. That's patch 1.
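
Concretely (lifted from the patch 1 diff below), the tracepoint gains a
prev_state argument which __schedule() fills with the value it already
read, and the event helper uses that instead of re-reading p->__state:

  /* kernel/sched/core.c */
  trace_sched_switch(sched_mode & SM_MASK_PREEMPT, prev_state, prev, next);

  /* include/trace/events/sched.h */
  state = __task_state_index(prev_state, p->exit_state);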

For TASK_RTLOCK_WAIT, it's a bit trickier. One could use ->saved_state as
prev_state, but
a) that is also susceptible to racing (see ttwu() / ttwu_state_match())
b) some lock substitutions (e.g. rt_spin_lock()) leave ->saved_state as
   TASK_RUNNING.

Patch 2 instead reports TASK_RTLOCK_WAIT tasks as TASK_UNINTERRUPTIBLE.
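
Expressed against the sketch above (TASK_RTLOCK_WAIT's mainline value of
0x0800 is again an assumption of the sketch, and TASK_REPORT/state_char
are reused from it), the special case amounts to:

  #define TASK_UNINTERRUPTIBLE	0x0002
  #define TASK_RTLOCK_WAIT	0x0800

  static unsigned int state_index_patched(unsigned int tsk_state)
  {
  	unsigned int state = tsk_state & TASK_REPORT;

  	/* Patch 2: lie and report rtlock waiters as a plain 'D' */
  	if (tsk_state == TASK_RTLOCK_WAIT)
  		state = TASK_UNINTERRUPTIBLE;

  	return state ? 32 - __builtin_clz(state) : 0;
  }

so state_char[state_index_patched(TASK_RTLOCK_WAIT)] now yields 'D' where
it previously yielded 'R'.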

Testing
=======

Briefly tested on an Arm Juno via ftrace+hackbench against
o tip/sched/core@82762d2af31a
o v5.16-rt15-rebase w/ CONFIG_PREEMPT_RT


I also had a look at the __schedule() disassembly as suggested by Peter; on x86
prev_state gets pushed to the stack and popped prior to the trace event static
call, the rest seems mostly unchanged (GCC 9.3).

Revisions
=========

v2 -> v3
++++++++

o Dropped TASK_RTLOCK_WAIT from TASK_REPORT, made it appear as
  TASK_UNINTERRUPTIBLE instead (Eric)

RFC v1 -> v2
++++++++++++

o Collected tags (Steven, Sebastian)
o Patched missed trace_switch event clients (Steven)

Cheers,
Valentin

Valentin Schneider (2):
  sched/tracing: Don't re-read p->state when emitting sched_switch event
  sched/tracing: Report TASK_RTLOCK_WAIT tasks as TASK_UNINTERRUPTIBLE

 include/linux/sched.h             | 19 ++++++++++++++++---
 include/trace/events/sched.h      | 11 +++++++----
 kernel/sched/core.c               |  4 ++--
 kernel/trace/fgraph.c             |  4 +++-
 kernel/trace/ftrace.c             |  4 +++-
 kernel/trace/trace_events.c       |  8 ++++++--
 kernel/trace/trace_osnoise.c      |  4 +++-
 kernel/trace/trace_sched_switch.c |  1 +
 kernel/trace/trace_sched_wakeup.c |  1 +
 9 files changed, 42 insertions(+), 14 deletions(-)

--
2.25.1



* [PATCH v3 1/2] sched/tracing: Don't re-read p->state when emitting sched_switch event
  2022-01-20 16:25 [PATCH v3 0/2] sched/tracing: sched_switch prev_state reported as TASK_RUNNING when it's not Valentin Schneider
@ 2022-01-20 16:25 ` Valentin Schneider
  2022-03-01 15:24   ` [tip: sched/core] " tip-bot2 for Valentin Schneider
  2022-01-20 16:25 ` [PATCH v3 2/2] sched/tracing: Report TASK_RTLOCK_WAIT tasks as TASK_UNINTERRUPTIBLE Valentin Schneider
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 49+ messages in thread
From: Valentin Schneider @ 2022-01-20 16:25 UTC (permalink / raw)
  To: linux-kernel
  Cc: Abhijeet Dharmapurikar, Uwe Kleine-König, Dietmar Eggemann,
	Steven Rostedt, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Thomas Gleixner, Sebastian Andrzej Siewior, Juri Lelli,
	Daniel Bristot de Oliveira, Kees Cook, Andrew Morton,
	Eric W. Biederman, Alexey Gladkov, Kenta.Tada, Randy Dunlap,
	Ed Tsai

As of commit

  c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")

the following sequence becomes possible:

		      p->__state = TASK_INTERRUPTIBLE;
		      __schedule()
			deactivate_task(p);
  ttwu()
    READ !p->on_rq
    p->__state=TASK_WAKING
			trace_sched_switch()
			  __trace_sched_switch_state()
			    task_state_index()
			      return 0;

TASK_WAKING isn't in TASK_REPORT, so the task appears as TASK_RUNNING in
the trace event.

Prevent this by pushing the value read from __schedule() down the trace
event.

Reported-by: Abhijeet Dharmapurikar <adharmap@quicinc.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
 include/linux/sched.h             | 11 ++++++++---
 include/trace/events/sched.h      | 11 +++++++----
 kernel/sched/core.c               |  4 ++--
 kernel/trace/fgraph.c             |  4 +++-
 kernel/trace/ftrace.c             |  4 +++-
 kernel/trace/trace_events.c       |  8 ++++++--
 kernel/trace/trace_osnoise.c      |  4 +++-
 kernel/trace/trace_sched_switch.c |  1 +
 kernel/trace/trace_sched_wakeup.c |  1 +
 9 files changed, 34 insertions(+), 14 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d2e261adb8ea..d00837d12b9d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1616,10 +1616,10 @@ static inline pid_t task_pgrp_nr(struct task_struct *tsk)
 #define TASK_REPORT_IDLE	(TASK_REPORT + 1)
 #define TASK_REPORT_MAX		(TASK_REPORT_IDLE << 1)
 
-static inline unsigned int task_state_index(struct task_struct *tsk)
+static inline unsigned int __task_state_index(unsigned int tsk_state,
+					      unsigned int tsk_exit_state)
 {
-	unsigned int tsk_state = READ_ONCE(tsk->__state);
-	unsigned int state = (tsk_state | tsk->exit_state) & TASK_REPORT;
+	unsigned int state = (tsk_state | tsk_exit_state) & TASK_REPORT;
 
 	BUILD_BUG_ON_NOT_POWER_OF_2(TASK_REPORT_MAX);
 
@@ -1629,6 +1629,11 @@ static inline unsigned int task_state_index(struct task_struct *tsk)
 	return fls(state);
 }
 
+static inline unsigned int task_state_index(struct task_struct *tsk)
+{
+	return __task_state_index(READ_ONCE(tsk->__state), tsk->exit_state);
+}
+
 static inline char task_index_to_char(unsigned int state)
 {
 	static const char state_char[] = "RSDTtXZPI";
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 94640482cfe7..65e786756321 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -187,7 +187,9 @@ DEFINE_EVENT(sched_wakeup_template, sched_wakeup_new,
 	     TP_ARGS(p));
 
 #ifdef CREATE_TRACE_POINTS
-static inline long __trace_sched_switch_state(bool preempt, struct task_struct *p)
+static inline long __trace_sched_switch_state(bool preempt,
+					      unsigned int prev_state,
+					      struct task_struct *p)
 {
 	unsigned int state;
 
@@ -208,7 +210,7 @@ static inline long __trace_sched_switch_state(bool preempt, struct task_struct *
 	 * it for left shift operation to get the correct task->state
 	 * mapping.
 	 */
-	state = task_state_index(p);
+	state = __task_state_index(prev_state, p->exit_state);
 
 	return state ? (1 << (state - 1)) : state;
 }
@@ -220,10 +222,11 @@ static inline long __trace_sched_switch_state(bool preempt, struct task_struct *
 TRACE_EVENT(sched_switch,
 
 	TP_PROTO(bool preempt,
+		 unsigned int prev_state,
 		 struct task_struct *prev,
 		 struct task_struct *next),
 
-	TP_ARGS(preempt, prev, next),
+	TP_ARGS(preempt, prev_state, prev, next),
 
 	TP_STRUCT__entry(
 		__array(	char,	prev_comm,	TASK_COMM_LEN	)
@@ -239,7 +242,7 @@ TRACE_EVENT(sched_switch,
 		memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
 		__entry->prev_pid	= prev->pid;
 		__entry->prev_prio	= prev->prio;
-		__entry->prev_state	= __trace_sched_switch_state(preempt, prev);
+		__entry->prev_state	= __trace_sched_switch_state(preempt, prev_state, prev);
 		memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
 		__entry->next_pid	= next->pid;
 		__entry->next_prio	= next->prio;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fe53e510e711..a8799a2d8546 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4822,7 +4822,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 {
 	struct rq *rq = this_rq();
 	struct mm_struct *mm = rq->prev_mm;
-	long prev_state;
+	unsigned int prev_state;
 
 	/*
 	 * The previous task will have left us with a preempt_count of 2
@@ -6287,7 +6287,7 @@ static void __sched notrace __schedule(unsigned int sched_mode)
 		migrate_disable_switch(rq, prev);
 		psi_sched_switch(prev, next, !task_on_rq_queued(prev));
 
-		trace_sched_switch(sched_mode & SM_MASK_PREEMPT, prev, next);
+		trace_sched_switch(sched_mode & SM_MASK_PREEMPT, prev_state, prev, next);
 
 		/* Also unlocks the rq: */
 		rq = context_switch(rq, prev, next, &rf);
diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
index 22061d38fc00..19028e072cdb 100644
--- a/kernel/trace/fgraph.c
+++ b/kernel/trace/fgraph.c
@@ -415,7 +415,9 @@ static int alloc_retstack_tasklist(struct ftrace_ret_stack **ret_stack_list)
 
 static void
 ftrace_graph_probe_sched_switch(void *ignore, bool preempt,
-			struct task_struct *prev, struct task_struct *next)
+				unsigned int prev_state,
+				struct task_struct *prev,
+				struct task_struct *next)
 {
 	unsigned long long timestamp;
 	int index;
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 30bc880c3849..e296ddeec99f 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -7311,7 +7311,9 @@ ftrace_func_t ftrace_ops_get_func(struct ftrace_ops *ops)
 
 static void
 ftrace_filter_pid_sched_switch_probe(void *data, bool preempt,
-		    struct task_struct *prev, struct task_struct *next)
+				     unsigned int prev_state,
+				     struct task_struct *prev,
+				     struct task_struct *next)
 {
 	struct trace_array *tr = data;
 	struct trace_pid_list *pid_list;
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index 4021b9a79f93..6ddc6cc0d5d5 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -759,7 +759,9 @@ void trace_event_follow_fork(struct trace_array *tr, bool enable)
 
 static void
 event_filter_pid_sched_switch_probe_pre(void *data, bool preempt,
-		    struct task_struct *prev, struct task_struct *next)
+					unsigned int prev_state,
+					struct task_struct *prev,
+					struct task_struct *next)
 {
 	struct trace_array *tr = data;
 	struct trace_pid_list *no_pid_list;
@@ -783,7 +785,9 @@ event_filter_pid_sched_switch_probe_pre(void *data, bool preempt,
 
 static void
 event_filter_pid_sched_switch_probe_post(void *data, bool preempt,
-		    struct task_struct *prev, struct task_struct *next)
+					 unsigned int prev_state,
+					 struct task_struct *prev,
+					 struct task_struct *next)
 {
 	struct trace_array *tr = data;
 	struct trace_pid_list *no_pid_list;
diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
index 7520d43aed55..a8a2d17f858c 100644
--- a/kernel/trace/trace_osnoise.c
+++ b/kernel/trace/trace_osnoise.c
@@ -1168,7 +1168,9 @@ thread_exit(struct osnoise_variables *osn_var, struct task_struct *t)
  * used to record the beginning and to report the end of a thread noise window.
  */
 static void
-trace_sched_switch_callback(void *data, bool preempt, struct task_struct *p,
+trace_sched_switch_callback(void *data, bool preempt,
+			    unsigned int prev_state,
+			    struct task_struct *p,
 			    struct task_struct *n)
 {
 	struct osnoise_variables *osn_var = this_cpu_osn_var();
diff --git a/kernel/trace/trace_sched_switch.c b/kernel/trace/trace_sched_switch.c
index e304196d7c28..993b0ed10d8c 100644
--- a/kernel/trace/trace_sched_switch.c
+++ b/kernel/trace/trace_sched_switch.c
@@ -22,6 +22,7 @@ static DEFINE_MUTEX(sched_register_mutex);
 
 static void
 probe_sched_switch(void *ignore, bool preempt,
+		   unsigned int prev_state,
 		   struct task_struct *prev, struct task_struct *next)
 {
 	int flags;
diff --git a/kernel/trace/trace_sched_wakeup.c b/kernel/trace/trace_sched_wakeup.c
index 2402de520eca..46429f9a96fa 100644
--- a/kernel/trace/trace_sched_wakeup.c
+++ b/kernel/trace/trace_sched_wakeup.c
@@ -426,6 +426,7 @@ tracing_sched_wakeup_trace(struct trace_array *tr,
 
 static void notrace
 probe_wakeup_sched_switch(void *ignore, bool preempt,
+			  unsigned int prev_state,
 			  struct task_struct *prev, struct task_struct *next)
 {
 	struct trace_array_cpu *data;
-- 
2.25.1



* [PATCH v3 2/2] sched/tracing: Report TASK_RTLOCK_WAIT tasks as TASK_UNINTERRUPTIBLE
  2022-01-20 16:25 [PATCH v3 0/2] sched/tracing: sched_switch prev_state reported as TASK_RUNNING when it's not Valentin Schneider
  2022-01-20 16:25 ` [PATCH v3 1/2] sched/tracing: Don't re-read p->state when emitting sched_switch event Valentin Schneider
@ 2022-01-20 16:25 ` Valentin Schneider
  2022-03-01 15:24   ` [tip: sched/core] " tip-bot2 for Valentin Schneider
  2022-04-09 23:42   ` [PATCH v3 2/2] " Qais Yousef
  2022-01-21 17:15 ` [PATCH v3 0/2] sched/tracing: sched_switch prev_state reported as TASK_RUNNING when it's not Steven Rostedt
  2022-04-21 22:12 ` [PATCH] sched/tracing: append prev_state to tp args instead Delyan Kratunov
  3 siblings, 2 replies; 49+ messages in thread
From: Valentin Schneider @ 2022-01-20 16:25 UTC (permalink / raw)
  To: linux-kernel
  Cc: Uwe Kleine-König, Abhijeet Dharmapurikar, Dietmar Eggemann,
	Steven Rostedt, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Thomas Gleixner, Sebastian Andrzej Siewior, Juri Lelli,
	Daniel Bristot de Oliveira, Kees Cook, Andrew Morton,
	Eric W. Biederman, Alexey Gladkov, Kenta.Tada, Randy Dunlap,
	Ed Tsai

TASK_RTLOCK_WAIT currently isn't part of TASK_REPORT, thus a task blocking
on an rtlock will appear as having a task state == 0, IOW TASK_RUNNING.

The actual state is saved in p->saved_state, but reading it after reading
p->__state has a few issues:
o that could still be TASK_RUNNING in the case of e.g. rt_spin_lock
o ttwu_state_match() might have changed that to TASK_RUNNING

As pointed out by Eric, adding TASK_RTLOCK_WAIT to TASK_REPORT implies
exposing a new state to userspace tools which may not know what to do with
it. The only information that needs to be conveyed here is that a task is
waiting on an rt_mutex, which matches TASK_UNINTERRUPTIBLE - there's no
need for a new state.

Reported-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
 include/linux/sched.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d00837d12b9d..068270ff04b8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1626,6 +1626,14 @@ static inline unsigned int __task_state_index(unsigned int tsk_state,
 	if (tsk_state == TASK_IDLE)
 		state = TASK_REPORT_IDLE;
 
+	/*
+	 * We're lying here, but rather than expose a completely new task state
+	 * to userspace, we can make this appear as if the task has gone through
+	 * a regular rt_mutex_lock() call.
+	 */
+	if (tsk_state == TASK_RTLOCK_WAIT)
+		state = TASK_UNINTERRUPTIBLE;
+
 	return fls(state);
 }
 
-- 
2.25.1



* Re: [PATCH v3 0/2] sched/tracing: sched_switch prev_state reported as TASK_RUNNING when it's not
  2022-01-20 16:25 [PATCH v3 0/2] sched/tracing: sched_switch prev_state reported as TASK_RUNNING when it's not Valentin Schneider
  2022-01-20 16:25 ` [PATCH v3 1/2] sched/tracing: Don't re-read p->state when emitting sched_switch event Valentin Schneider
  2022-01-20 16:25 ` [PATCH v3 2/2] sched/tracing: Report TASK_RTLOCK_WAIT tasks as TASK_UNINTERRUPTIBLE Valentin Schneider
@ 2022-01-21 17:15 ` Steven Rostedt
  2022-02-27 15:33   ` Peter Zijlstra
  2022-04-21 22:12 ` [PATCH] sched/tracing: append prev_state to tp args instead Delyan Kratunov
  3 siblings, 1 reply; 49+ messages in thread
From: Steven Rostedt @ 2022-01-21 17:15 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, Abhijeet Dharmapurikar, Uwe Kleine-König,
	Dietmar Eggemann, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Thomas Gleixner, Sebastian Andrzej Siewior, Juri Lelli,
	Daniel Bristot de Oliveira, Kees Cook, Andrew Morton,
	Eric W. Biederman, Alexey Gladkov, Kenta.Tada, Randy Dunlap,
	Ed Tsai

On Thu, 20 Jan 2022 16:25:18 +0000
Valentin Schneider <valentin.schneider@arm.com> wrote:

> Hi folks,
> 
> Problem
> =======
> 
> Abhijeet pointed out that the following sequence of trace events can be
> observed:
> 
>   cat-1676  [001] d..3 103.010411: sched_waking: comm=grep pid=1677 prio=120 target_cpu=004
>   grep-1677 [004] d..2 103.010440: sched_switch: prev_comm=grep prev_pid=1677 prev_prio=120 prev_state=R 0x0 ==> next_comm=swapper/4 next_pid=0 next_prio=120
>   <idle>-0  [004] dNh3 103.0100459: sched_wakeup: comm=grep pid=1677 prio=120 target_cpu=004
> 
> IOW, a not-yet-woken task gets presented as runnable and switched out in
> favor of the idle task... Dietmar and I had a look; it turns out this sequence
> can happen:
> 
> 		      p->__state = TASK_INTERRUPTIBLE;
> 		      __schedule()
> 			deactivate_task(p);
>   ttwu()
>     READ !p->on_rq
>     p->__state=TASK_WAKING
> 			trace_sched_switch()
> 			  __trace_sched_switch_state()
> 			    task_state_index()
> 			      return 0;
> 
> TASK_WAKING isn't in TASK_REPORT, which is why task_state_index(p) returns 0.
> This can happen as of commit c6e7bd7afaeb ("sched/core: Optimize ttwu()
> spinning on p->on_cpu") which punted the TASK_WAKING write to the other
> side of
> 
>   smp_cond_load_acquire(&p->on_cpu, !VAL);
> 
> in ttwu().
> 
> Uwe reported on #linux-rt what I think is a similar issue with
> TASK_RTLOCK_WAIT on PREEMPT_RT; again that state isn't in TASK_REPORT so
> you get prev_state=0 despite the task blocking on a lock.
> 
> Both of those are very confusing for tooling or even human observers.



This all looks fine to me:

Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>

Peter, want to take this through your tree?

-- Steve


* Re: [PATCH v3 0/2] sched/tracing: sched_switch prev_state reported as TASK_RUNNING when it's not
  2022-01-21 17:15 ` [PATCH v3 0/2] sched/tracing: sched_switch prev_state reported as TASK_RUNNING when it's not Steven Rostedt
@ 2022-02-27 15:33   ` Peter Zijlstra
  0 siblings, 0 replies; 49+ messages in thread
From: Peter Zijlstra @ 2022-02-27 15:33 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Valentin Schneider, linux-kernel, Abhijeet Dharmapurikar,
	Uwe Kleine-König, Dietmar Eggemann, Ingo Molnar,
	Vincent Guittot, Thomas Gleixner, Sebastian Andrzej Siewior,
	Juri Lelli, Daniel Bristot de Oliveira, Kees Cook, Andrew Morton,
	Eric W. Biederman, Alexey Gladkov, Kenta.Tada, Randy Dunlap,
	Ed Tsai

On Fri, Jan 21, 2022 at 12:15:58PM -0500, Steven Rostedt wrote:
> 
> This all looks fine to me:
> 
> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> 
> Peter, want to take this through your tree?

Queued up now. Thanks!


* [tip: sched/core] sched/tracing: Report TASK_RTLOCK_WAIT tasks as TASK_UNINTERRUPTIBLE
  2022-01-20 16:25 ` [PATCH v3 2/2] sched/tracing: Report TASK_RTLOCK_WAIT tasks as TASK_UNINTERRUPTIBLE Valentin Schneider
@ 2022-03-01 15:24   ` tip-bot2 for Valentin Schneider
  2022-04-09 23:42   ` [PATCH v3 2/2] " Qais Yousef
  1 sibling, 0 replies; 49+ messages in thread
From: tip-bot2 for Valentin Schneider @ 2022-03-01 15:24 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: u.kleine-koenig, Valentin Schneider, Peter Zijlstra (Intel),
	Steven Rostedt (Google),
	x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     25795ef6299f07ce3838f3253a9cb34f64efcfae
Gitweb:        https://git.kernel.org/tip/25795ef6299f07ce3838f3253a9cb34f64efcfae
Author:        Valentin Schneider <valentin.schneider@arm.com>
AuthorDate:    Thu, 20 Jan 2022 16:25:20 
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 01 Mar 2022 16:18:39 +01:00

sched/tracing: Report TASK_RTLOCK_WAIT tasks as TASK_UNINTERRUPTIBLE

TASK_RTLOCK_WAIT currently isn't part of TASK_REPORT, thus a task blocking
on an rtlock will appear as having a task state == 0, IOW TASK_RUNNING.

The actual state is saved in p->saved_state, but reading it after reading
p->__state has a few issues:
o that could still be TASK_RUNNING in the case of e.g. rt_spin_lock
o ttwu_state_match() might have changed that to TASK_RUNNING

As pointed out by Eric, adding TASK_RTLOCK_WAIT to TASK_REPORT implies
exposing a new state to userspace tools which may not know what to do with
it. The only information that needs to be conveyed here is that a task is
waiting on an rt_mutex, which matches TASK_UNINTERRUPTIBLE - there's no
need for a new state.

Reported-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20220120162520.570782-3-valentin.schneider@arm.com
---
 include/linux/sched.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 457c8a0..78b606c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1630,6 +1630,14 @@ static inline unsigned int __task_state_index(unsigned int tsk_state,
 	if (tsk_state == TASK_IDLE)
 		state = TASK_REPORT_IDLE;
 
+	/*
+	 * We're lying here, but rather than expose a completely new task state
+	 * to userspace, we can make this appear as if the task has gone through
+	 * a regular rt_mutex_lock() call.
+	 */
+	if (tsk_state == TASK_RTLOCK_WAIT)
+		state = TASK_UNINTERRUPTIBLE;
+
 	return fls(state);
 }
 


* [tip: sched/core] sched/tracing: Don't re-read p->state when emitting sched_switch event
  2022-01-20 16:25 ` [PATCH v3 1/2] sched/tracing: Don't re-read p->state when emitting sched_switch event Valentin Schneider
@ 2022-03-01 15:24   ` tip-bot2 for Valentin Schneider
  2022-03-04 16:13     ` Valentin Schneider
  2022-03-08 18:02     ` Qais Yousef
  0 siblings, 2 replies; 49+ messages in thread
From: tip-bot2 for Valentin Schneider @ 2022-03-01 15:24 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Abhijeet Dharmapurikar, Valentin Schneider,
	Peter Zijlstra (Intel), Steven Rostedt (Google),
	x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     fa2c3254d7cfff5f7a916ab928a562d1165f17bb
Gitweb:        https://git.kernel.org/tip/fa2c3254d7cfff5f7a916ab928a562d1165f17bb
Author:        Valentin Schneider <valentin.schneider@arm.com>
AuthorDate:    Thu, 20 Jan 2022 16:25:19 
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 01 Mar 2022 16:18:39 +01:00

sched/tracing: Don't re-read p->state when emitting sched_switch event

As of commit

  c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")

the following sequence becomes possible:

		      p->__state = TASK_INTERRUPTIBLE;
		      __schedule()
			deactivate_task(p);
  ttwu()
    READ !p->on_rq
    p->__state=TASK_WAKING
			trace_sched_switch()
			  __trace_sched_switch_state()
			    task_state_index()
			      return 0;

TASK_WAKING isn't in TASK_REPORT, so the task appears as TASK_RUNNING in
the trace event.

Prevent this by pushing the value read from __schedule() down the trace
event.

Reported-by: Abhijeet Dharmapurikar <adharmap@quicinc.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20220120162520.570782-2-valentin.schneider@arm.com
---
 include/linux/sched.h             | 11 ++++++++---
 include/trace/events/sched.h      | 11 +++++++----
 kernel/sched/core.c               |  4 ++--
 kernel/trace/fgraph.c             |  4 +++-
 kernel/trace/ftrace.c             |  4 +++-
 kernel/trace/trace_events.c       |  8 ++++++--
 kernel/trace/trace_osnoise.c      |  4 +++-
 kernel/trace/trace_sched_switch.c |  1 +
 kernel/trace/trace_sched_wakeup.c |  1 +
 9 files changed, 34 insertions(+), 14 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f00132e..457c8a0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1620,10 +1620,10 @@ static inline pid_t task_pgrp_nr(struct task_struct *tsk)
 #define TASK_REPORT_IDLE	(TASK_REPORT + 1)
 #define TASK_REPORT_MAX		(TASK_REPORT_IDLE << 1)
 
-static inline unsigned int task_state_index(struct task_struct *tsk)
+static inline unsigned int __task_state_index(unsigned int tsk_state,
+					      unsigned int tsk_exit_state)
 {
-	unsigned int tsk_state = READ_ONCE(tsk->__state);
-	unsigned int state = (tsk_state | tsk->exit_state) & TASK_REPORT;
+	unsigned int state = (tsk_state | tsk_exit_state) & TASK_REPORT;
 
 	BUILD_BUG_ON_NOT_POWER_OF_2(TASK_REPORT_MAX);
 
@@ -1633,6 +1633,11 @@ static inline unsigned int task_state_index(struct task_struct *tsk)
 	return fls(state);
 }
 
+static inline unsigned int task_state_index(struct task_struct *tsk)
+{
+	return __task_state_index(READ_ONCE(tsk->__state), tsk->exit_state);
+}
+
 static inline char task_index_to_char(unsigned int state)
 {
 	static const char state_char[] = "RSDTtXZPI";
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 9464048..65e7867 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -187,7 +187,9 @@ DEFINE_EVENT(sched_wakeup_template, sched_wakeup_new,
 	     TP_ARGS(p));
 
 #ifdef CREATE_TRACE_POINTS
-static inline long __trace_sched_switch_state(bool preempt, struct task_struct *p)
+static inline long __trace_sched_switch_state(bool preempt,
+					      unsigned int prev_state,
+					      struct task_struct *p)
 {
 	unsigned int state;
 
@@ -208,7 +210,7 @@ static inline long __trace_sched_switch_state(bool preempt, struct task_struct *
 	 * it for left shift operation to get the correct task->state
 	 * mapping.
 	 */
-	state = task_state_index(p);
+	state = __task_state_index(prev_state, p->exit_state);
 
 	return state ? (1 << (state - 1)) : state;
 }
@@ -220,10 +222,11 @@ static inline long __trace_sched_switch_state(bool preempt, struct task_struct *
 TRACE_EVENT(sched_switch,
 
 	TP_PROTO(bool preempt,
+		 unsigned int prev_state,
 		 struct task_struct *prev,
 		 struct task_struct *next),
 
-	TP_ARGS(preempt, prev, next),
+	TP_ARGS(preempt, prev_state, prev, next),
 
 	TP_STRUCT__entry(
 		__array(	char,	prev_comm,	TASK_COMM_LEN	)
@@ -239,7 +242,7 @@ TRACE_EVENT(sched_switch,
 		memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
 		__entry->prev_pid	= prev->pid;
 		__entry->prev_prio	= prev->prio;
-		__entry->prev_state	= __trace_sched_switch_state(preempt, prev);
+		__entry->prev_state	= __trace_sched_switch_state(preempt, prev_state, prev);
 		memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
 		__entry->next_pid	= next->pid;
 		__entry->next_prio	= next->prio;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ef94612..3aafc15 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4836,7 +4836,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 {
 	struct rq *rq = this_rq();
 	struct mm_struct *mm = rq->prev_mm;
-	long prev_state;
+	unsigned int prev_state;
 
 	/*
 	 * The previous task will have left us with a preempt_count of 2
@@ -6300,7 +6300,7 @@ static void __sched notrace __schedule(unsigned int sched_mode)
 		migrate_disable_switch(rq, prev);
 		psi_sched_switch(prev, next, !task_on_rq_queued(prev));
 
-		trace_sched_switch(sched_mode & SM_MASK_PREEMPT, prev, next);
+		trace_sched_switch(sched_mode & SM_MASK_PREEMPT, prev_state, prev, next);
 
 		/* Also unlocks the rq: */
 		rq = context_switch(rq, prev, next, &rf);
diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
index 22061d3..19028e0 100644
--- a/kernel/trace/fgraph.c
+++ b/kernel/trace/fgraph.c
@@ -415,7 +415,9 @@ free:
 
 static void
 ftrace_graph_probe_sched_switch(void *ignore, bool preempt,
-			struct task_struct *prev, struct task_struct *next)
+				unsigned int prev_state,
+				struct task_struct *prev,
+				struct task_struct *next)
 {
 	unsigned long long timestamp;
 	int index;
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index f9feb19..6762ae0 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -7347,7 +7347,9 @@ ftrace_func_t ftrace_ops_get_func(struct ftrace_ops *ops)
 
 static void
 ftrace_filter_pid_sched_switch_probe(void *data, bool preempt,
-		    struct task_struct *prev, struct task_struct *next)
+				     unsigned int prev_state,
+				     struct task_struct *prev,
+				     struct task_struct *next)
 {
 	struct trace_array *tr = data;
 	struct trace_pid_list *pid_list;
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index 3147614..2a19ea7 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -759,7 +759,9 @@ void trace_event_follow_fork(struct trace_array *tr, bool enable)
 
 static void
 event_filter_pid_sched_switch_probe_pre(void *data, bool preempt,
-		    struct task_struct *prev, struct task_struct *next)
+					unsigned int prev_state,
+					struct task_struct *prev,
+					struct task_struct *next)
 {
 	struct trace_array *tr = data;
 	struct trace_pid_list *no_pid_list;
@@ -783,7 +785,9 @@ event_filter_pid_sched_switch_probe_pre(void *data, bool preempt,
 
 static void
 event_filter_pid_sched_switch_probe_post(void *data, bool preempt,
-		    struct task_struct *prev, struct task_struct *next)
+					 unsigned int prev_state,
+					 struct task_struct *prev,
+					 struct task_struct *next)
 {
 	struct trace_array *tr = data;
 	struct trace_pid_list *no_pid_list;
diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
index 870a08d..1829b4c 100644
--- a/kernel/trace/trace_osnoise.c
+++ b/kernel/trace/trace_osnoise.c
@@ -1167,7 +1167,9 @@ thread_exit(struct osnoise_variables *osn_var, struct task_struct *t)
  * used to record the beginning and to report the end of a thread noise window.
  */
 static void
-trace_sched_switch_callback(void *data, bool preempt, struct task_struct *p,
+trace_sched_switch_callback(void *data, bool preempt,
+			    unsigned int prev_state,
+			    struct task_struct *p,
 			    struct task_struct *n)
 {
 	struct osnoise_variables *osn_var = this_cpu_osn_var();
diff --git a/kernel/trace/trace_sched_switch.c b/kernel/trace/trace_sched_switch.c
index e304196..993b0ed 100644
--- a/kernel/trace/trace_sched_switch.c
+++ b/kernel/trace/trace_sched_switch.c
@@ -22,6 +22,7 @@ static DEFINE_MUTEX(sched_register_mutex);
 
 static void
 probe_sched_switch(void *ignore, bool preempt,
+		   unsigned int prev_state,
 		   struct task_struct *prev, struct task_struct *next)
 {
 	int flags;
diff --git a/kernel/trace/trace_sched_wakeup.c b/kernel/trace/trace_sched_wakeup.c
index 2402de5..46429f9 100644
--- a/kernel/trace/trace_sched_wakeup.c
+++ b/kernel/trace/trace_sched_wakeup.c
@@ -426,6 +426,7 @@ tracing_sched_wakeup_trace(struct trace_array *tr,
 
 static void notrace
 probe_wakeup_sched_switch(void *ignore, bool preempt,
+			  unsigned int prev_state,
 			  struct task_struct *prev, struct task_struct *next)
 {
 	struct trace_array_cpu *data;


* Re: [tip: sched/core] sched/tracing: Don't re-read p->state when emitting sched_switch event
  2022-03-01 15:24   ` [tip: sched/core] " tip-bot2 for Valentin Schneider
@ 2022-03-04 16:13     ` Valentin Schneider
  2022-03-08 18:02     ` Qais Yousef
  1 sibling, 0 replies; 49+ messages in thread
From: Valentin Schneider @ 2022-03-04 16:13 UTC (permalink / raw)
  To: tip-bot2 for Valentin Schneider, linux-tip-commits
  Cc: Abhijeet Dharmapurikar, Peter Zijlstra (Intel),
	Steven Rostedt (Google),
	x86, linux-kernel

On 01/03/22 15:24, tip-bot2 for Valentin Schneider wrote:
> The following commit has been merged into the sched/core branch of tip:
>
> Commit-ID:     fa2c3254d7cfff5f7a916ab928a562d1165f17bb
> Gitweb:        https://git.kernel.org/tip/fa2c3254d7cfff5f7a916ab928a562d1165f17bb
> Author:        Valentin Schneider <valentin.schneider@arm.com>
> AuthorDate:    Thu, 20 Jan 2022 16:25:19
> Committer:     Peter Zijlstra <peterz@infradead.org>
> CommitterDate: Tue, 01 Mar 2022 16:18:39 +01:00
>
> sched/tracing: Don't re-read p->state when emitting sched_switch event
>
> As of commit
>
>   c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")
>
> the following sequence becomes possible:
>
>                     p->__state = TASK_INTERRUPTIBLE;
>                     __schedule()
>                       deactivate_task(p);
>   ttwu()
>     READ !p->on_rq
>     p->__state=TASK_WAKING
>                       trace_sched_switch()
>                         __trace_sched_switch_state()
>                           task_state_index()
>                             return 0;
>
> TASK_WAKING isn't in TASK_REPORT, so the task appears as TASK_RUNNING in
> the trace event.
>
> Prevent this by pushing the value read from __schedule() down the trace
> event.
>
> Reported-by: Abhijeet Dharmapurikar <adharmap@quicinc.com>
> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> Link: https://lore.kernel.org/r/20220120162520.570782-2-valentin.schneider@arm.com

So I wasn't sure if that would be considered a "bug", but folks are asking
for this to be backported; do we want to slap

Fixes: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")

to it?


* Re: [tip: sched/core] sched/tracing: Don't re-read p->state when emitting sched_switch event
  2022-03-01 15:24   ` [tip: sched/core] " tip-bot2 for Valentin Schneider
  2022-03-04 16:13     ` Valentin Schneider
@ 2022-03-08 18:02     ` Qais Yousef
  2022-03-08 18:10       ` Greg KH
  1 sibling, 1 reply; 49+ messages in thread
From: Qais Yousef @ 2022-03-08 18:02 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-tip-commits, Abhijeet Dharmapurikar, Valentin Schneider,
	Peter Zijlstra (Intel), Steven Rostedt (Google),
	x86, stable

+CC stable

On 03/01/22 15:24, tip-bot2 for Valentin Schneider wrote:
> The following commit has been merged into the sched/core branch of tip:
> 
> Commit-ID:     fa2c3254d7cfff5f7a916ab928a562d1165f17bb
> Gitweb:        https://git.kernel.org/tip/fa2c3254d7cfff5f7a916ab928a562d1165f17bb
> Author:        Valentin Schneider <valentin.schneider@arm.com>
> AuthorDate:    Thu, 20 Jan 2022 16:25:19 
> Committer:     Peter Zijlstra <peterz@infradead.org>
> CommitterDate: Tue, 01 Mar 2022 16:18:39 +01:00
> 
> sched/tracing: Don't re-read p->state when emitting sched_switch event
> 
> As of commit
> 
>   c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")
> 
> the following sequence becomes possible:
> 
> 		      p->__state = TASK_INTERRUPTIBLE;
> 		      __schedule()
> 			deactivate_task(p);
>   ttwu()
>     READ !p->on_rq
>     p->__state=TASK_WAKING
> 			trace_sched_switch()
> 			  __trace_sched_switch_state()
> 			    task_state_index()
> 			      return 0;
> 
> TASK_WAKING isn't in TASK_REPORT, so the task appears as TASK_RUNNING in
> the trace event.
> 
> Prevent this by pushing the value read from __schedule() down the trace
> event.
> 
> Reported-by: Abhijeet Dharmapurikar <adharmap@quicinc.com>
> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> Link: https://lore.kernel.org/r/20220120162520.570782-2-valentin.schneider@arm.com

Any objection to picking this for stable? I'm interested in this one for some
Android users but prefer if it can be taken by stable rather than backport it
individually.

I think it makes sense to pick the next one in the series too.

Thanks!

--
Qais Yousef


* Re: [tip: sched/core] sched/tracing: Don't re-read p->state when emitting sched_switch event
  2022-03-08 18:02     ` Qais Yousef
@ 2022-03-08 18:10       ` Greg KH
  2022-03-08 18:51         ` Qais Yousef
  0 siblings, 1 reply; 49+ messages in thread
From: Greg KH @ 2022-03-08 18:10 UTC (permalink / raw)
  To: Qais Yousef
  Cc: linux-kernel, linux-tip-commits, Abhijeet Dharmapurikar,
	Valentin Schneider, Peter Zijlstra (Intel),
	Steven Rostedt (Google),
	x86, stable

On Tue, Mar 08, 2022 at 06:02:40PM +0000, Qais Yousef wrote:
> +CC stable
> 
> On 03/01/22 15:24, tip-bot2 for Valentin Schneider wrote:
> > The following commit has been merged into the sched/core branch of tip:
> > 
> > Commit-ID:     fa2c3254d7cfff5f7a916ab928a562d1165f17bb
> > Gitweb:        https://git.kernel.org/tip/fa2c3254d7cfff5f7a916ab928a562d1165f17bb
> > Author:        Valentin Schneider <valentin.schneider@arm.com>
> > AuthorDate:    Thu, 20 Jan 2022 16:25:19 
> > Committer:     Peter Zijlstra <peterz@infradead.org>
> > CommitterDate: Tue, 01 Mar 2022 16:18:39 +01:00
> > 
> > sched/tracing: Don't re-read p->state when emitting sched_switch event
> > 
> > As of commit
> > 
> >   c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")
> > 
> > the following sequence becomes possible:
> > 
> > 		      p->__state = TASK_INTERRUPTIBLE;
> > 		      __schedule()
> > 			deactivate_task(p);
> >   ttwu()
> >     READ !p->on_rq
> >     p->__state=TASK_WAKING
> > 			trace_sched_switch()
> > 			  __trace_sched_switch_state()
> > 			    task_state_index()
> > 			      return 0;
> > 
> > TASK_WAKING isn't in TASK_REPORT, so the task appears as TASK_RUNNING in
> > the trace event.
> > 
> > Prevent this by pushing the value read from __schedule() down the trace
> > event.
> > 
> > Reported-by: Abhijeet Dharmapurikar <adharmap@quicinc.com>
> > Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> > Link: https://lore.kernel.org/r/20220120162520.570782-2-valentin.schneider@arm.com
> 
> Any objection to picking this for stable? I'm interested in this one for some
> Android users but prefer if it can be taken by stable rather than backport it
> individually.
> 
> I think it makes sense to pick the next one in the series too.

What commit does this fix in Linus's tree?

Once it hits Linus's tree, let stable@vger.kernel.org know what the git
commit id is in Linus's tree.

thanks,

greg k-h


* Re: [tip: sched/core] sched/tracing: Don't re-read p->state when emitting sched_switch event
  2022-03-08 18:10       ` Greg KH
@ 2022-03-08 18:51         ` Qais Yousef
  2022-04-09 23:38           ` Qais Yousef
  0 siblings, 1 reply; 49+ messages in thread
From: Qais Yousef @ 2022-03-08 18:51 UTC (permalink / raw)
  To: Greg KH
  Cc: linux-kernel, linux-tip-commits, Abhijeet Dharmapurikar,
	Valentin Schneider, Peter Zijlstra (Intel),
	Steven Rostedt (Google),
	x86, stable

On 03/08/22 19:10, Greg KH wrote:
> On Tue, Mar 08, 2022 at 06:02:40PM +0000, Qais Yousef wrote:
> > +CC stable
> > 
> > On 03/01/22 15:24, tip-bot2 for Valentin Schneider wrote:
> > > The following commit has been merged into the sched/core branch of tip:
> > > 
> > > Commit-ID:     fa2c3254d7cfff5f7a916ab928a562d1165f17bb
> > > Gitweb:        https://git.kernel.org/tip/fa2c3254d7cfff5f7a916ab928a562d1165f17bb
> > > Author:        Valentin Schneider <valentin.schneider@arm.com>
> > > AuthorDate:    Thu, 20 Jan 2022 16:25:19 
> > > Committer:     Peter Zijlstra <peterz@infradead.org>
> > > CommitterDate: Tue, 01 Mar 2022 16:18:39 +01:00
> > > 
> > > sched/tracing: Don't re-read p->state when emitting sched_switch event
> > > 
> > > As of commit
> > > 
> > >   c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")
> > > 
> > > the following sequence becomes possible:
> > > 
> > > 		      p->__state = TASK_INTERRUPTIBLE;
> > > 		      __schedule()
> > > 			deactivate_task(p);
> > >   ttwu()
> > >     READ !p->on_rq
> > >     p->__state=TASK_WAKING
> > > 			trace_sched_switch()
> > > 			  __trace_sched_switch_state()
> > > 			    task_state_index()
> > > 			      return 0;
> > > 
> > > TASK_WAKING isn't in TASK_REPORT, so the task appears as TASK_RUNNING in
> > > the trace event.
> > > 
> > > Prevent this by pushing the value read from __schedule() down the trace
> > > event.
> > > 
> > > Reported-by: Abhijeet Dharmapurikar <adharmap@quicinc.com>
> > > Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
> > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > > Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> > > Link: https://lore.kernel.org/r/20220120162520.570782-2-valentin.schneider@arm.com
> > 
> > Any objection to picking this for stable? I'm interested in this one for some
> > Android users but prefer if it can be taken by stable rather than backport it
> > individually.
> > 
> > I think it makes sense to pick the next one in the series too.
> 
> What commit does this fix in Linus's tree?

It should be this one: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")

Not sure about the 2nd patch, but I can try to figure it out.

> Once it hits Linus's tree, let stable@vger.kernel.org know what the git
> commit id is in Linus's tree.

Sure. My bad if I rushed the request. I just wanted to know whether it's okay
for this to go into stable. If no one shouts, I'll ping once it lands in
Linus'.


Thanks!

--
Qais Yousef


* Re: [tip: sched/core] sched/tracing: Don't re-read p->state when emitting sched_switch event
  2022-03-08 18:51         ` Qais Yousef
@ 2022-04-09 23:38           ` Qais Yousef
  2022-04-10 22:06             ` Qais Yousef
  2022-04-11 13:22             ` Greg KH
  0 siblings, 2 replies; 49+ messages in thread
From: Qais Yousef @ 2022-04-09 23:38 UTC (permalink / raw)
  To: Greg KH
  Cc: linux-kernel, linux-tip-commits, Abhijeet Dharmapurikar,
	Valentin Schneider, Peter Zijlstra (Intel),
	Steven Rostedt (Google),
	x86, stable

On 03/08/22 18:51, Qais Yousef wrote:
> On 03/08/22 19:10, Greg KH wrote:
> > On Tue, Mar 08, 2022 at 06:02:40PM +0000, Qais Yousef wrote:
> > > +CC stable
> > > 
> > > On 03/01/22 15:24, tip-bot2 for Valentin Schneider wrote:
> > > > The following commit has been merged into the sched/core branch of tip:
> > > > 
> > > > Commit-ID:     fa2c3254d7cfff5f7a916ab928a562d1165f17bb
> > > > Gitweb:        https://git.kernel.org/tip/fa2c3254d7cfff5f7a916ab928a562d1165f17bb
> > > > Author:        Valentin Schneider <valentin.schneider@arm.com>
> > > > AuthorDate:    Thu, 20 Jan 2022 16:25:19 
> > > > Committer:     Peter Zijlstra <peterz@infradead.org>
> > > > CommitterDate: Tue, 01 Mar 2022 16:18:39 +01:00
> > > > 
> > > > sched/tracing: Don't re-read p->state when emitting sched_switch event
> > > > 
> > > > As of commit
> > > > 
> > > >   c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")
> > > > 
> > > > the following sequence becomes possible:
> > > > 
> > > > 		      p->__state = TASK_INTERRUPTIBLE;
> > > > 		      __schedule()
> > > > 			deactivate_task(p);
> > > >   ttwu()
> > > >     READ !p->on_rq
> > > >     p->__state=TASK_WAKING
> > > > 			trace_sched_switch()
> > > > 			  __trace_sched_switch_state()
> > > > 			    task_state_index()
> > > > 			      return 0;
> > > > 
> > > > TASK_WAKING isn't in TASK_REPORT, so the task appears as TASK_RUNNING in
> > > > the trace event.
> > > > 
> > > > Prevent this by pushing the value read from __schedule() down the trace
> > > > event.
> > > > 
> > > > Reported-by: Abhijeet Dharmapurikar <adharmap@quicinc.com>
> > > > Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
> > > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > > > Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> > > > Link: https://lore.kernel.org/r/20220120162520.570782-2-valentin.schneider@arm.com
> > > 
> > > Any objection to picking this for stable? I'm interested in this one for some
> > > Android users but prefer if it can be taken by stable rather than backport it
> > > individually.
> > > 
> > > I think it makes sense to pick the next one in the series too.
> > 
> > What commit does this fix in Linus's tree?
> 
> It should be this one: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")

Should this be okay to be picked up by stable now? I can see AUTOSEL has picked
it up for v5.15+, but it impacts v5.10 too.

Thanks!

--
Qais Yousef


* Re: [PATCH v3 2/2] sched/tracing: Report TASK_RTLOCK_WAIT tasks as TASK_UNINTERRUPTIBLE
  2022-01-20 16:25 ` [PATCH v3 2/2] sched/tracing: Report TASK_RTLOCK_WAIT tasks as TASK_UNINTERRUPTIBLE Valentin Schneider
  2022-03-01 15:24   ` [tip: sched/core] " tip-bot2 for Valentin Schneider
@ 2022-04-09 23:42   ` Qais Yousef
  2022-04-10  6:14     ` Greg KH
  1 sibling, 1 reply; 49+ messages in thread
From: Qais Yousef @ 2022-04-09 23:42 UTC (permalink / raw)
  To: stable
  Cc: linux-kernel, Uwe Kleine-König, Abhijeet Dharmapurikar,
	Dietmar Eggemann, Steven Rostedt, Peter Zijlstra, Ingo Molnar,
	Vincent Guittot, Thomas Gleixner, Sebastian Andrzej Siewior,
	Juri Lelli, Daniel Bristot de Oliveira, Kees Cook, Andrew Morton,
	Eric W. Biederman, Alexey Gladkov, Kenta.Tada, Randy Dunlap,
	Ed Tsai, Valentin Schneider

+CC stable

On 01/20/22 16:25, Valentin Schneider wrote:
> TASK_RTLOCK_WAIT currently isn't part of TASK_REPORT, thus a task blocking
> on an rtlock will appear as having a task state == 0, IOW TASK_RUNNING.
> 
> The actual state is saved in p->saved_state, but reading it after reading
> p->__state has a few issues:
> o that could still be TASK_RUNNING in the case of e.g. rt_spin_lock
> o ttwu_state_match() might have changed that to TASK_RUNNING
> 
> As pointed out by Eric, adding TASK_RTLOCK_WAIT to TASK_REPORT implies
> exposing a new state to userspace tools which may not know what to do with
> it. The only information that needs to be conveyed here is that a task is
> waiting on an rt_mutex, which matches TASK_UNINTERRUPTIBLE - there's no
> need for a new state.
> 
> Reported-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>

Any objection for this to be picked up by stable? We care about Patch 1 only in
this series for stable, but it seems sensible to pick this one too, no strong
feeling if it is omitted though.

AFAICT the problem dates back to commit:

	1593baab910d ("sched/debug: Implement consistent task-state printing")

or even before. I think v4.14+ is good enough.

Thoughts?

Thanks!

--
Qais Yousef


* Re: [PATCH v3 2/2] sched/tracing: Report TASK_RTLOCK_WAIT tasks as TASK_UNINTERRUPTIBLE
  2022-04-09 23:42   ` [PATCH v3 2/2] " Qais Yousef
@ 2022-04-10  6:14     ` Greg KH
  2022-04-10 22:13       ` Qais Yousef
  0 siblings, 1 reply; 49+ messages in thread
From: Greg KH @ 2022-04-10  6:14 UTC (permalink / raw)
  To: Qais Yousef
  Cc: stable, linux-kernel, Uwe Kleine-König,
	Abhijeet Dharmapurikar, Dietmar Eggemann, Steven Rostedt,
	Peter Zijlstra, Ingo Molnar, Vincent Guittot, Thomas Gleixner,
	Sebastian Andrzej Siewior, Juri Lelli,
	Daniel Bristot de Oliveira, Kees Cook, Andrew Morton,
	Eric W. Biederman, Alexey Gladkov, Kenta.Tada, Randy Dunlap,
	Ed Tsai, Valentin Schneider

On Sun, Apr 10, 2022 at 12:42:24AM +0100, Qais Yousef wrote:
> +CC stable
> 
> On 01/20/22 16:25, Valentin Schneider wrote:
> > TASK_RTLOCK_WAIT currently isn't part of TASK_REPORT, thus a task blocking
> > on an rtlock will appear as having a task state == 0, IOW TASK_RUNNING.
> > 
> > The actual state is saved in p->saved_state, but reading it after reading
> > p->__state has a few issues:
> > o that could still be TASK_RUNNING in the case of e.g. rt_spin_lock
> > o ttwu_state_match() might have changed that to TASK_RUNNING
> > 
> > As pointed out by Eric, adding TASK_RTLOCK_WAIT to TASK_REPORT implies
> > exposing a new state to userspace tools which may not know what to do with
> > it. The only information that needs to be conveyed here is that a task is
> > waiting on an rt_mutex, which matches TASK_UNINTERRUPTIBLE - there's no
> > need for a new state.
> > 
> > Reported-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
> > Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
> 
> Any objection for this to be picked up by stable? We care about Patch 1 only in
> this series for stable, but it seems sensible to pick this one too, no strong
> feeling if it is omitted though.
> 
> AFAICT the problem dates back to commit:
> 
> 	1593baab910d ("sched/debug: Implement consistent task-state printing")
> 
> or even before. I think v4.14+ is good enough.


<formletter>

This is not the correct way to submit patches for inclusion in the
stable kernel tree.  Please read:
    https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
for how to do this properly.

</formletter>


* Re: [tip: sched/core] sched/tracing: Don't re-read p->state when emitting sched_switch event
  2022-04-09 23:38           ` Qais Yousef
@ 2022-04-10 22:06             ` Qais Yousef
  2022-04-10 23:22               ` Holger Hoffstätte
  2022-04-11 13:22             ` Greg KH
  1 sibling, 1 reply; 49+ messages in thread
From: Qais Yousef @ 2022-04-10 22:06 UTC (permalink / raw)
  To: Greg KH
  Cc: linux-kernel, linux-tip-commits, Abhijeet Dharmapurikar,
	Valentin Schneider, Peter Zijlstra (Intel),
	Steven Rostedt (Google),
	x86, stable

On 04/10/22 00:38, Qais Yousef wrote:
> On 03/08/22 18:51, Qais Yousef wrote:
> > On 03/08/22 19:10, Greg KH wrote:
> > > On Tue, Mar 08, 2022 at 06:02:40PM +0000, Qais Yousef wrote:
> > > > +CC stable
> > > > 
> > > > On 03/01/22 15:24, tip-bot2 for Valentin Schneider wrote:
> > > > > The following commit has been merged into the sched/core branch of tip:
> > > > > 
> > > > > Commit-ID:     fa2c3254d7cfff5f7a916ab928a562d1165f17bb
> > > > > Gitweb:        https://git.kernel.org/tip/fa2c3254d7cfff5f7a916ab928a562d1165f17bb
> > > > > Author:        Valentin Schneider <valentin.schneider@arm.com>
> > > > > AuthorDate:    Thu, 20 Jan 2022 16:25:19 
> > > > > Committer:     Peter Zijlstra <peterz@infradead.org>
> > > > > CommitterDate: Tue, 01 Mar 2022 16:18:39 +01:00
> > > > > 
> > > > > sched/tracing: Don't re-read p->state when emitting sched_switch event
> > > > > 
> > > > > As of commit
> > > > > 
> > > > >   c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")
> > > > > 
> > > > > the following sequence becomes possible:
> > > > > 
> > > > > 		      p->__state = TASK_INTERRUPTIBLE;
> > > > > 		      __schedule()
> > > > > 			deactivate_task(p);
> > > > >   ttwu()
> > > > >     READ !p->on_rq
> > > > >     p->__state=TASK_WAKING
> > > > > 			trace_sched_switch()
> > > > > 			  __trace_sched_switch_state()
> > > > > 			    task_state_index()
> > > > > 			      return 0;
> > > > > 
> > > > > TASK_WAKING isn't in TASK_REPORT, so the task appears as TASK_RUNNING in
> > > > > the trace event.
> > > > > 
> > > > > Prevent this by pushing the value read from __schedule() down the trace
> > > > > event.
> > > > > 
> > > > > Reported-by: Abhijeet Dharmapurikar <adharmap@quicinc.com>
> > > > > Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
> > > > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > > > > Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> > > > > Link: https://lore.kernel.org/r/20220120162520.570782-2-valentin.schneider@arm.com
> > > > 
> > > > Any objection to picking this for stable? I'm interested in this one for some
> > > > Android users but prefer if it can be taken by stable rather than backport it
> > > > individually.
> > > > 
> > > > I think it makes sense to pick the next one in the series too.
> > > 
> > > What commit does this fix in Linus's tree?
> > 
> > It should be this one: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")
> 
> Should this be okay to be picked up by stable now? I can see AUTOSEL has picked
> it up for v5.15+, but it impacts v5.10 too.

commit: fa2c3254d7cfff5f7a916ab928a562d1165f17bb
subject: sched/tracing: Don't re-read p->state when emitting sched_switch event

This patch has an impact on Android 5.10 users who experience tooling breakage.
Is it possible to include it in 5.10 LTS, please?

It was already picked up for 5.15+ by AUTOSEL and only 5.10 is missing.

Thanks

--
Qais Yousef


* Re: [PATCH v3 2/2] sched/tracing: Report TASK_RTLOCK_WAIT tasks as TASK_UNINTERRUPTIBLE
  2022-04-10  6:14     ` Greg KH
@ 2022-04-10 22:13       ` Qais Yousef
  2022-04-11 13:20         ` Greg KH
  0 siblings, 1 reply; 49+ messages in thread
From: Qais Yousef @ 2022-04-10 22:13 UTC (permalink / raw)
  To: Greg KH
  Cc: stable, linux-kernel, Uwe Kleine-König,
	Abhijeet Dharmapurikar, Dietmar Eggemann, Steven Rostedt,
	Peter Zijlstra, Ingo Molnar, Vincent Guittot, Thomas Gleixner,
	Sebastian Andrzej Siewior, Juri Lelli,
	Daniel Bristot de Oliveira, Kees Cook, Andrew Morton,
	Eric W. Biederman, Alexey Gladkov, Kenta.Tada, Randy Dunlap,
	Ed Tsai, Valentin Schneider

On 04/10/22 08:14, Greg KH wrote:
> On Sun, Apr 10, 2022 at 12:42:24AM +0100, Qais Yousef wrote:
> > +CC stable
> > 
> > On 01/20/22 16:25, Valentin Schneider wrote:
> > > TASK_RTLOCK_WAIT currently isn't part of TASK_REPORT, thus a task blocking
> > > on an rtlock will appear as having a task state == 0, IOW TASK_RUNNING.
> > > 
> > > The actual state is saved in p->saved_state, but reading it after reading
> > > p->__state has a few issues:
> > > o that could still be TASK_RUNNING in the case of e.g. rt_spin_lock
> > > o ttwu_state_match() might have changed that to TASK_RUNNING
> > > 
> > > As pointed out by Eric, adding TASK_RTLOCK_WAIT to TASK_REPORT implies
> > > exposing a new state to userspace tools which may not know what to do with
> > > it. The only information that needs to be conveyed here is that a task is
> > > waiting on an rt_mutex, which matches TASK_UNINTERRUPTIBLE - there's no
> > > need for a new state.
> > > 
> > > Reported-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
> > > Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
> > 
> > Any objection for this to be picked up by stable? We care about Patch 1 only in
> > this series for stable, but it seems sensible to pick this one too, no strong
> > feeling if it is omitted though.
> > 
> > AFAICT the problem dates back to commit:
> > 
> > 	1593baab910d ("sched/debug: Implement consistent task-state printing")
> > 
> > or even before. I think v4.14+ is good enough.
> 
> 
> <formletter>
> 
> This is not the correct way to submit patches for inclusion in the
> stable kernel tree.  Please read:
>     https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
> for how to do this properly.
> 
> </formletter>

Apologies.

commit: 25795ef6299f07ce3838f3253a9cb34f64efcfae
Subject: sched/tracing: Report TASK_RTLOCK_WAIT tasks as TASK_UNINTERRUPTIBLE

I am interested in Patch 1 in this series as I know it impacts some Android
5.10 users. But this patch seems a good candidate for stable too since it was
observed by a user (Uwe) and AFAICT the problem dates back to v4.14+ kernels.
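
For context, the change itself is small; a minimal sketch of the relevant
hunk in include/linux/sched.h, reconstructed here from the commit message
quoted above rather than copied from the patch (mainline names assumed):

  static inline unsigned int __task_state_index(unsigned int tsk_state,
                                                unsigned int tsk_exit_state)
  {
          unsigned int state = (tsk_state | tsk_exit_state) & TASK_REPORT;

          BUILD_BUG_ON_NOT_POWER_OF_2(TASK_REPORT_MAX);

          if (tsk_state == TASK_IDLE)
                  state |= TASK_REPORT_IDLE;

          /*
           * The change at issue: report a task blocked on an rtlock as
           * TASK_UNINTERRUPTIBLE instead of exposing a new state to
           * userspace.
           */
          if (tsk_state == TASK_RTLOCK_WAIT)
                  state = TASK_UNINTERRUPTIBLE;

          return fls(state);
  }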

Suggested kernels: v4.14+. This has already been picked up by AUTOSEL for
v5.15+ stable trees.

Hope I got the process right this time.

Thanks!

--
Qais Yousef

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [tip: sched/core] sched/tracing: Don't re-read p->state when emitting sched_switch event
  2022-04-10 22:06             ` Qais Yousef
@ 2022-04-10 23:22               ` Holger Hoffstätte
  2022-04-11  7:18                 ` Holger Hoffstätte
  0 siblings, 1 reply; 49+ messages in thread
From: Holger Hoffstätte @ 2022-04-10 23:22 UTC (permalink / raw)
  To: Qais Yousef, Greg KH
  Cc: linux-kernel, linux-tip-commits, Abhijeet Dharmapurikar,
	Valentin Schneider, Peter Zijlstra (Intel),
	Steven Rostedt (Google),
	x86, stable

On 2022-04-11 00:06, Qais Yousef wrote:
> On 04/10/22 00:38, Qais Yousef wrote:
>> On 03/08/22 18:51, Qais Yousef wrote:
>>> On 03/08/22 19:10, Greg KH wrote:
>>>> On Tue, Mar 08, 2022 at 06:02:40PM +0000, Qais Yousef wrote:
>>>>> +CC stable
>>>>>
>>>>> On 03/01/22 15:24, tip-bot2 for Valentin Schneider wrote:
>>>>>> The following commit has been merged into the sched/core branch of tip:
>>>>>>
>>>>>> Commit-ID:     fa2c3254d7cfff5f7a916ab928a562d1165f17bb
>>>>>> Gitweb:        https://git.kernel.org/tip/fa2c3254d7cfff5f7a916ab928a562d1165f17bb
>>>>>> Author:        Valentin Schneider <valentin.schneider@arm.com>
>>>>>> AuthorDate:    Thu, 20 Jan 2022 16:25:19
>>>>>> Committer:     Peter Zijlstra <peterz@infradead.org>
>>>>>> CommitterDate: Tue, 01 Mar 2022 16:18:39 +01:00
>>>>>>
>>>>>> sched/tracing: Don't re-read p->state when emitting sched_switch event
>>>>>>
>>>>>> As of commit
>>>>>>
>>>>>>    c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")
>>>>>>
>>>>>> the following sequence becomes possible:
>>>>>>
>>>>>> 		      p->__state = TASK_INTERRUPTIBLE;
>>>>>> 		      __schedule()
>>>>>> 			deactivate_task(p);
>>>>>>    ttwu()
>>>>>>      READ !p->on_rq
>>>>>>      p->__state=TASK_WAKING
>>>>>> 			trace_sched_switch()
>>>>>> 			  __trace_sched_switch_state()
>>>>>> 			    task_state_index()
>>>>>> 			      return 0;
>>>>>>
>>>>>> TASK_WAKING isn't in TASK_REPORT, so the task appears as TASK_RUNNING in
>>>>>> the trace event.
>>>>>>
>>>>>> Prevent this by pushing the value read from __schedule() down the trace
>>>>>> event.
>>>>>>
>>>>>> Reported-by: Abhijeet Dharmapurikar <adharmap@quicinc.com>
>>>>>> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
>>>>>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>>>>>> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
>>>>>> Link: https://lore.kernel.org/r/20220120162520.570782-2-valentin.schneider@arm.com
>>>>>
>>>>> Any objection to picking this for stable? I'm interested in this one for some
>>>>> Android users but prefer if it can be taken by stable rather than backport it
>>>>> individually.
>>>>>
>>>>> I think it makes sense to pick the next one in the series too.
>>>>
>>>> What commit does this fix in Linus's tree?
>>>
>>> It should be this one: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")
>>
>> Should this be okay to be picked up by stable now? I can see AUTOSEL has picked
>> it up for v5.15+, but it impacts v5.10 too.
> 
> commit: fa2c3254d7cfff5f7a916ab928a562d1165f17bb
> subject: sched/tracing: Don't re-read p->state when emitting sched_switch event
> 
> This patch has an impact on Android 5.10 users who experience tooling breakage.
> Is it possible to include in 5.10 LTS please?
> 
> It was already picked up for 5.15+ by AUTOSEL and only 5.10 is missing.
> 

https://lore.kernel.org/stable/Yk2PQzynOVOzJdPo@kroah.com/

However, since then further investigation (still in progress) has shown that this
may have been the fault of the tool in question, so if you can verify that tracing
sched still works for you with this patch in 5.15.x then by all means
let's merge it.
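
If it helps, a quick smoke test of the sched_switch event via tracefs could
look like this (a minimal sketch; it assumes tracefs is mounted at
/sys/kernel/tracing and uses hackbench, if available, purely as a load
generator):

  # cd /sys/kernel/tracing
  # echo 1 > events/sched/sched_switch/enable
  # hackbench -l 1000 >/dev/null &
  # head -n 20 trace_pipe    # prev_state should vary (S/D/R...), not be stuck at R
  # echo 0 > events/sched/sched_switch/enable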

-h

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [tip: sched/core] sched/tracing: Don't re-read p->state when emitting sched_switch event
  2022-04-10 23:22               ` Holger Hoffstätte
@ 2022-04-11  7:18                 ` Holger Hoffstätte
  2022-04-11  7:28                   ` Greg KH
  0 siblings, 1 reply; 49+ messages in thread
From: Holger Hoffstätte @ 2022-04-11  7:18 UTC (permalink / raw)
  To: Qais Yousef, Greg KH
  Cc: linux-kernel, linux-tip-commits, Abhijeet Dharmapurikar,
	Valentin Schneider, Peter Zijlstra (Intel),
	Steven Rostedt (Google),
	x86, stable

On 2022-04-11 01:22, Holger Hoffstätte wrote:
> On 2022-04-11 00:06, Qais Yousef wrote:
>> On 04/10/22 00:38, Qais Yousef wrote:
>>> On 03/08/22 18:51, Qais Yousef wrote:
>>>> On 03/08/22 19:10, Greg KH wrote:
>>>>> On Tue, Mar 08, 2022 at 06:02:40PM +0000, Qais Yousef wrote:
>>>>>> +CC stable
>>>>>>
>>>>>> On 03/01/22 15:24, tip-bot2 for Valentin Schneider wrote:
>>>>>>> The following commit has been merged into the sched/core branch of tip:
>>>>>>>
>>>>>>> Commit-ID:     fa2c3254d7cfff5f7a916ab928a562d1165f17bb
>>>>>>> Gitweb:        https://git.kernel.org/tip/fa2c3254d7cfff5f7a916ab928a562d1165f17bb
>>>>>>> Author:        Valentin Schneider <valentin.schneider@arm.com>
>>>>>>> AuthorDate:    Thu, 20 Jan 2022 16:25:19
>>>>>>> Committer:     Peter Zijlstra <peterz@infradead.org>
>>>>>>> CommitterDate: Tue, 01 Mar 2022 16:18:39 +01:00
>>>>>>>
>>>>>>> sched/tracing: Don't re-read p->state when emitting sched_switch event
>>>>>>>
>>>>>>> As of commit
>>>>>>>
>>>>>>>    c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")
>>>>>>>
>>>>>>> the following sequence becomes possible:
>>>>>>>
>>>>>>>               p->__state = TASK_INTERRUPTIBLE;
>>>>>>>               __schedule()
>>>>>>>             deactivate_task(p);
>>>>>>>    ttwu()
>>>>>>>      READ !p->on_rq
>>>>>>>      p->__state=TASK_WAKING
>>>>>>>             trace_sched_switch()
>>>>>>>               __trace_sched_switch_state()
>>>>>>>                 task_state_index()
>>>>>>>                   return 0;
>>>>>>>
>>>>>>> TASK_WAKING isn't in TASK_REPORT, so the task appears as TASK_RUNNING in
>>>>>>> the trace event.
>>>>>>>
>>>>>>> Prevent this by pushing the value read from __schedule() down the trace
>>>>>>> event.
>>>>>>>
>>>>>>> Reported-by: Abhijeet Dharmapurikar <adharmap@quicinc.com>
>>>>>>> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
>>>>>>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>>>>>>> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
>>>>>>> Link: https://lore.kernel.org/r/20220120162520.570782-2-valentin.schneider@arm.com
>>>>>>
>>>>>> Any objection to picking this for stable? I'm interested in this one for some
>>>>>> Android users but prefer if it can be taken by stable rather than backport it
>>>>>> individually.
>>>>>>
>>>>>> I think it makes sense to pick the next one in the series too.
>>>>>
>>>>> What commit does this fix in Linus's tree?
>>>>
>>>> It should be this one: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")
>>>
>>> Should this be okay to be picked up by stable now? I can see AUTOSEL has picked
>>> it up for v5.15+, but it impacts v5.10 too.
>>
>> commit: fa2c3254d7cfff5f7a916ab928a562d1165f17bb
>> subject: sched/tracing: Don't re-read p->state when emitting sched_switch event
>>
>> This patch has an impact on Android 5.10 users who experience tooling breakage.
>> Is it possible to include in 5.10 LTS please?
>>
>> It was already picked up for 5.15+ by AUTOSEL and only 5.10 is missing.
>>
> 
> https://lore.kernel.org/stable/Yk2PQzynOVOzJdPo@kroah.com/
> 
> However, since then further investigation (still in progress) has shown that this
> may have been the fault of the tool in question, so if you can verify that tracing
> sched still works for you with this patch in 5.15.x then by all means
> let's merge it.

So it turns out the lockup is indeed the fault of the tool, which contains multiple
kernel-version dependent tracepoint definitions and now fails with this
patch.

Greg, please re-enqueue this patch where necessary (5.10, 5.15+)
  
-h

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [tip: sched/core] sched/tracing: Don't re-read p->state when emitting sched_switch event
  2022-04-11  7:18                 ` Holger Hoffstätte
@ 2022-04-11  7:28                   ` Greg KH
  2022-04-11  8:05                     ` Holger Hoffstätte
  0 siblings, 1 reply; 49+ messages in thread
From: Greg KH @ 2022-04-11  7:28 UTC (permalink / raw)
  To: Holger Hoffstätte
  Cc: Qais Yousef, linux-kernel, linux-tip-commits,
	Abhijeet Dharmapurikar, Valentin Schneider,
	Peter Zijlstra (Intel), Steven Rostedt (Google),
	x86, stable

On Mon, Apr 11, 2022 at 09:18:19AM +0200, Holger Hoffstätte wrote:
> On 2022-04-11 01:22, Holger Hoffstätte wrote:
> > On 2022-04-11 00:06, Qais Yousef wrote:
> > > On 04/10/22 00:38, Qais Yousef wrote:
> > > > On 03/08/22 18:51, Qais Yousef wrote:
> > > > > On 03/08/22 19:10, Greg KH wrote:
> > > > > > On Tue, Mar 08, 2022 at 06:02:40PM +0000, Qais Yousef wrote:
> > > > > > > +CC stable
> > > > > > > 
> > > > > > > On 03/01/22 15:24, tip-bot2 for Valentin Schneider wrote:
> > > > > > > > The following commit has been merged into the sched/core branch of tip:
> > > > > > > > 
> > > > > > > > Commit-ID:     fa2c3254d7cfff5f7a916ab928a562d1165f17bb
> > > > > > > > Gitweb:        https://git.kernel.org/tip/fa2c3254d7cfff5f7a916ab928a562d1165f17bb
> > > > > > > > Author:        Valentin Schneider <valentin.schneider@arm.com>
> > > > > > > > AuthorDate:    Thu, 20 Jan 2022 16:25:19
> > > > > > > > Committer:     Peter Zijlstra <peterz@infradead.org>
> > > > > > > > CommitterDate: Tue, 01 Mar 2022 16:18:39 +01:00
> > > > > > > > 
> > > > > > > > sched/tracing: Don't re-read p->state when emitting sched_switch event
> > > > > > > > 
> > > > > > > > As of commit
> > > > > > > > 
> > > > > > > >    c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")
> > > > > > > > 
> > > > > > > > the following sequence becomes possible:
> > > > > > > > 
> > > > > > > >               p->__state = TASK_INTERRUPTIBLE;
> > > > > > > >               __schedule()
> > > > > > > >             deactivate_task(p);
> > > > > > > >    ttwu()
> > > > > > > >      READ !p->on_rq
> > > > > > > >      p->__state=TASK_WAKING
> > > > > > > >             trace_sched_switch()
> > > > > > > >               __trace_sched_switch_state()
> > > > > > > >                 task_state_index()
> > > > > > > >                   return 0;
> > > > > > > > 
> > > > > > > > TASK_WAKING isn't in TASK_REPORT, so the task appears as TASK_RUNNING in
> > > > > > > > the trace event.
> > > > > > > > 
> > > > > > > > Prevent this by pushing the value read from __schedule() down the trace
> > > > > > > > event.
> > > > > > > > 
> > > > > > > > Reported-by: Abhijeet Dharmapurikar <adharmap@quicinc.com>
> > > > > > > > Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
> > > > > > > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > > > > > > > Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> > > > > > > > Link: https://lore.kernel.org/r/20220120162520.570782-2-valentin.schneider@arm.com
> > > > > > > 
> > > > > > > Any objection to picking this for stable? I'm interested in this one for some
> > > > > > > Android users but prefer if it can be taken by stable rather than backport it
> > > > > > > individually.
> > > > > > > 
> > > > > > > I think it makes sense to pick the next one in the series too.
> > > > > > 
> > > > > > What commit does this fix in Linus's tree?
> > > > > 
> > > > > It should be this one: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")
> > > > 
> > > > Should this be okay to be picked up by stable now? I can see AUTOSEL has picked
> > > > it up for v5.15+, but it impacts v5.10 too.
> > > 
> > > commit: fa2c3254d7cfff5f7a916ab928a562d1165f17bb
> > > subject: sched/tracing: Don't re-read p->state when emitting sched_switch event
> > > 
> > > This patch has an impact on Android 5.10 users who experience tooling breakage.
> > > Is it possible to include in 5.10 LTS please?
> > > 
> > > It was already picked up for 5.15+ by AUTOSEL and only 5.10 is missing.
> > > 
> > 
> > https://lore.kernel.org/stable/Yk2PQzynOVOzJdPo@kroah.com/
> > 
> > However, since then further investigation (still in progress) has shown that this
> > may have been the fault of the tool in question, so if you can verify that tracing
> > sched still works for you with this patch in 5.15.x then by all means
> > let's merge it.
> 
> So it turns out the lockup is indeed the fault of the tool, which contains multiple
> kernel-version dependent tracepoint definitions and now fails with this
> patch.

What tool is this?

> Greg, please re-enqueue this patch where necessary (5.10, 5.15+)

If I queue it up again, will the tools keep breaking?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [tip: sched/core] sched/tracing: Don't re-read p->state when emitting sched_switch event
  2022-04-11  7:28                   ` Greg KH
@ 2022-04-11  8:05                     ` Holger Hoffstätte
  2022-04-11 13:23                       ` Greg KH
  0 siblings, 1 reply; 49+ messages in thread
From: Holger Hoffstätte @ 2022-04-11  8:05 UTC (permalink / raw)
  To: Greg KH
  Cc: Qais Yousef, linux-kernel, linux-tip-commits,
	Abhijeet Dharmapurikar, Valentin Schneider,
	Peter Zijlstra (Intel), Steven Rostedt (Google),
	x86, stable

On 2022-04-11 09:28, Greg KH wrote:
> On Mon, Apr 11, 2022 at 09:18:19AM +0200, Holger Hoffstätte wrote:
>> On 2022-04-11 01:22, Holger Hoffstätte wrote:
>>> On 2022-04-11 00:06, Qais Yousef wrote:
>>>> On 04/10/22 00:38, Qais Yousef wrote:
>>>>> On 03/08/22 18:51, Qais Yousef wrote:
>>>>>> On 03/08/22 19:10, Greg KH wrote:
>>>>>>> On Tue, Mar 08, 2022 at 06:02:40PM +0000, Qais Yousef wrote:
>>>>>>>> +CC stable
>>>>>>>>
>>>>>>>> On 03/01/22 15:24, tip-bot2 for Valentin Schneider wrote:
>>>>>>>>> The following commit has been merged into the sched/core branch of tip:
>>>>>>>>>
>>>>>>>>> Commit-ID:     fa2c3254d7cfff5f7a916ab928a562d1165f17bb
>>>>>>>>> Gitweb:        https://git.kernel.org/tip/fa2c3254d7cfff5f7a916ab928a562d1165f17bb
>>>>>>>>> Author:        Valentin Schneider <valentin.schneider@arm.com>
>>>>>>>>> AuthorDate:    Thu, 20 Jan 2022 16:25:19
>>>>>>>>> Committer:     Peter Zijlstra <peterz@infradead.org>
>>>>>>>>> CommitterDate: Tue, 01 Mar 2022 16:18:39 +01:00
>>>>>>>>>
>>>>>>>>> sched/tracing: Don't re-read p->state when emitting sched_switch event
>>>>>>>>>
>>>>>>>>> As of commit
>>>>>>>>>
>>>>>>>>>     c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")
>>>>>>>>>
>>>>>>>>> the following sequence becomes possible:
>>>>>>>>>
>>>>>>>>>                p->__state = TASK_INTERRUPTIBLE;
>>>>>>>>>                __schedule()
>>>>>>>>>              deactivate_task(p);
>>>>>>>>>     ttwu()
>>>>>>>>>       READ !p->on_rq
>>>>>>>>>       p->__state=TASK_WAKING
>>>>>>>>>              trace_sched_switch()
>>>>>>>>>                __trace_sched_switch_state()
>>>>>>>>>                  task_state_index()
>>>>>>>>>                    return 0;
>>>>>>>>>
>>>>>>>>> TASK_WAKING isn't in TASK_REPORT, so the task appears as TASK_RUNNING in
>>>>>>>>> the trace event.
>>>>>>>>>
>>>>>>>>> Prevent this by pushing the value read from __schedule() down the trace
>>>>>>>>> event.
>>>>>>>>>
>>>>>>>>> Reported-by: Abhijeet Dharmapurikar <adharmap@quicinc.com>
>>>>>>>>> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
>>>>>>>>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>>>>>>>>> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
>>>>>>>>> Link: https://lore.kernel.org/r/20220120162520.570782-2-valentin.schneider@arm.com
>>>>>>>>
>>>>>>>> Any objection to picking this for stable? I'm interested in this one for some
>>>>>>>> Android users but prefer if it can be taken by stable rather than backport it
>>>>>>>> individually.
>>>>>>>>
>>>>>>>> I think it makes sense to pick the next one in the series too.
>>>>>>>
>>>>>>> What commit does this fix in Linus's tree?
>>>>>>
>>>>>> It should be this one: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")
>>>>>
>>>>> Should this be okay to be picked up by stable now? I can see AUTOSEL has picked
>>>>> it up for v5.15+, but it impacts v5.10 too.
>>>>
>>>> commit: fa2c3254d7cfff5f7a916ab928a562d1165f17bb
>>>> subject: sched/tracing: Don't re-read p->state when emitting sched_switch event
>>>>
>>>> This patch has an impact on Android 5.10 users who experience tooling breakage.
>>>> Is it possible to include in 5.10 LTS please?
>>>>
>>>> It was already picked up for 5.15+ by AUTOSEL and only 5.10 is missing.
>>>>
>>>
>>> https://lore.kernel.org/stable/Yk2PQzynOVOzJdPo@kroah.com/
>>>
>>> However, since then further investigation (still in progress) has shown that this
>>> may have been the fault of the tool in question, so if you can verify that tracing
>>> sched still works for you with this patch in 5.15.x then by all means
>>> let's merge it.
>>
>> So it turns out the lockup is indeed the fault of the tool, which contains multiple
>> kernel-version dependent tracepoint definitions and now fails with this
>> patch.
> 
> What tool is this?

sysdig - which uses a helper kernel module that accesses tracepoints, but of course
(as I just found) with copy-pasted TP definitions, which broke with this patch due to
the additional parameter in the function signature. It's been prone to breakage forever
because there is no stable kernel ABI.

Took me a while to find and figure out, but IMHO better safe than sorry. We've had
autoselected scheduler patches before that looked fine but really were not. The
signature change is sketched below.
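
To make the breakage concrete, here is the shape of the probe prototype an
out-of-tree module would have copied, before and after fa2c3254d7cf
(reconstructed from the patches in this thread, not from sysdig's sources):

  /* What a module built against older kernels expects: */
  static void probe_sched_switch(void *data, bool preempt,
                                 struct task_struct *prev,
                                 struct task_struct *next);

  /*
   * fa2c3254d7cf inserted prev_state right after preempt, so a probe
   * built against the old prototype now receives prev_state where it
   * expects a task_struct pointer:
   */
  static void probe_sched_switch(void *data, bool preempt,
                                 unsigned int prev_state,
                                 struct task_struct *prev,
                                 struct task_struct *next);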

> 
>> Greg, please re-enqueue this patch where necessary (5.10, 5.15+)
> 
> If I queue it up again, will the tools keep breaking?

Yes, but that's their problem with an out-of-tree module; a few more #ifdefs
are not going to make a big difference.
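
For what it's worth, the usual shape of such an #ifdef in an out-of-tree
module is a version-gated probe; a sketch (the cutoff below is a placeholder,
since the exact version depends on which trees picked the change, and the
argument order shown is the one fa2c3254d7cf introduced):

  #include <linux/version.h>

  #if LINUX_VERSION_CODE >= KERNEL_VERSION(5, 18, 0) /* placeholder cutoff */
  static void probe_sched_switch(void *data, bool preempt,
                                 unsigned int prev_state,
                                 struct task_struct *prev,
                                 struct task_struct *next)
  #else
  static void probe_sched_switch(void *data, bool preempt,
                                 struct task_struct *prev,
                                 struct task_struct *next)
  #endif
  {
          /* common probe body */
  }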

thanks
Holger

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 2/2] sched/tracing: Report TASK_RTLOCK_WAIT tasks as TASK_UNINTERRUPTIBLE
  2022-04-10 22:13       ` Qais Yousef
@ 2022-04-11 13:20         ` Greg KH
  2022-04-11 20:18           ` Qais Yousef
  0 siblings, 1 reply; 49+ messages in thread
From: Greg KH @ 2022-04-11 13:20 UTC (permalink / raw)
  To: Qais Yousef
  Cc: stable, linux-kernel, Uwe Kleine-König,
	Abhijeet Dharmapurikar, Dietmar Eggemann, Steven Rostedt,
	Peter Zijlstra, Ingo Molnar, Vincent Guittot, Thomas Gleixner,
	Sebastian Andrzej Siewior, Juri Lelli,
	Daniel Bristot de Oliveira, Kees Cook, Andrew Morton,
	Eric W. Biederman, Alexey Gladkov, Kenta.Tada, Randy Dunlap,
	Ed Tsai, Valentin Schneider

On Sun, Apr 10, 2022 at 11:13:25PM +0100, Qais Yousef wrote:
> On 04/10/22 08:14, Greg KH wrote:
> > On Sun, Apr 10, 2022 at 12:42:24AM +0100, Qais Yousef wrote:
> > > +CC stable
> > > 
> > > On 01/20/22 16:25, Valentin Schneider wrote:
> > > > TASK_RTLOCK_WAIT currently isn't part of TASK_REPORT, thus a task blocking
> > > > on an rtlock will appear as having a task state == 0, IOW TASK_RUNNING.
> > > > 
> > > > The actual state is saved in p->saved_state, but reading it after reading
> > > > p->__state has a few issues:
> > > > o that could still be TASK_RUNNING in the case of e.g. rt_spin_lock
> > > > o ttwu_state_match() might have changed that to TASK_RUNNING
> > > > 
> > > > As pointed out by Eric, adding TASK_RTLOCK_WAIT to TASK_REPORT implies
> > > > exposing a new state to userspace tools which may not know what to do with
> > > > it. The only information that needs to be conveyed here is that a task is
> > > > waiting on an rt_mutex, which matches TASK_UNINTERRUPTIBLE - there's no
> > > > need for a new state.
> > > > 
> > > > Reported-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
> > > > Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
> > > 
> > > Any objection for this to be picked up by stable? We care about Patch 1 only in
> > > this series for stable, but it seems sensible to pick this one too, no strong
> > > feeling if it is omitted though.
> > > 
> > > AFAICT it seems the problem dates back since commit:
> > > 
> > > 	1593baab910d ("sched/debug: Implement consistent task-state printing")
> > > 
> > > or even before. I think v4.14+ is good enough.
> > 
> > 
> > <formletter>
> > 
> > This is not the correct way to submit patches for inclusion in the
> > stable kernel tree.  Please read:
> >     https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
> > for how to do this properly.
> > 
> > </formletter>
> 
> Apologies.
> 
> commit: 25795ef6299f07ce3838f3253a9cb34f64efcfae
> Subject: sched/tracing: Report TASK_RTLOCK_WAIT tasks as TASK_UNINTERRUPTIBLE
> 
> I am interested in Patch 1 in this series as I know it impacts some Android
> 5.10 users. But this patch seems a good candidate for stable too since it was
> observed by a user (Uwe) and AFAICT the problem dates back to v4.14+ kernels.
> 
> Suggested kernels: v4.14+. This has already been picked up by AUTOSEL for
> v5.15+ stable trees.

I do not think you have tested this in any of those kernels, as it
breaks the build :(

Please send a set of patches, properly backported and tested, that you
wish to see applied to stable kernels and we will be glad to review them
and apply them.  But to suggest patches to be merged that you have not
even tested is not good.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [tip: sched/core] sched/tracing: Don't re-read p->state when emitting sched_switch event
  2022-04-09 23:38           ` Qais Yousef
  2022-04-10 22:06             ` Qais Yousef
@ 2022-04-11 13:22             ` Greg KH
  2022-04-11 21:06               ` Qais Yousef
  1 sibling, 1 reply; 49+ messages in thread
From: Greg KH @ 2022-04-11 13:22 UTC (permalink / raw)
  To: Qais Yousef
  Cc: linux-kernel, linux-tip-commits, Abhijeet Dharmapurikar,
	Valentin Schneider, Peter Zijlstra (Intel),
	Steven Rostedt (Google),
	x86, stable

On Sun, Apr 10, 2022 at 12:38:29AM +0100, Qais Yousef wrote:
> On 03/08/22 18:51, Qais Yousef wrote:
> > On 03/08/22 19:10, Greg KH wrote:
> > > On Tue, Mar 08, 2022 at 06:02:40PM +0000, Qais Yousef wrote:
> > > > +CC stable
> > > > 
> > > > On 03/01/22 15:24, tip-bot2 for Valentin Schneider wrote:
> > > > > The following commit has been merged into the sched/core branch of tip:
> > > > > 
> > > > > Commit-ID:     fa2c3254d7cfff5f7a916ab928a562d1165f17bb
> > > > > Gitweb:        https://git.kernel.org/tip/fa2c3254d7cfff5f7a916ab928a562d1165f17bb
> > > > > Author:        Valentin Schneider <valentin.schneider@arm.com>
> > > > > AuthorDate:    Thu, 20 Jan 2022 16:25:19 
> > > > > Committer:     Peter Zijlstra <peterz@infradead.org>
> > > > > CommitterDate: Tue, 01 Mar 2022 16:18:39 +01:00
> > > > > 
> > > > > sched/tracing: Don't re-read p->state when emitting sched_switch event
> > > > > 
> > > > > As of commit
> > > > > 
> > > > >   c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")
> > > > > 
> > > > > the following sequence becomes possible:
> > > > > 
> > > > > 		      p->__state = TASK_INTERRUPTIBLE;
> > > > > 		      __schedule()
> > > > > 			deactivate_task(p);
> > > > >   ttwu()
> > > > >     READ !p->on_rq
> > > > >     p->__state=TASK_WAKING
> > > > > 			trace_sched_switch()
> > > > > 			  __trace_sched_switch_state()
> > > > > 			    task_state_index()
> > > > > 			      return 0;
> > > > > 
> > > > > TASK_WAKING isn't in TASK_REPORT, so the task appears as TASK_RUNNING in
> > > > > the trace event.
> > > > > 
> > > > > Prevent this by pushing the value read from __schedule() down the trace
> > > > > event.
> > > > > 
> > > > > Reported-by: Abhijeet Dharmapurikar <adharmap@quicinc.com>
> > > > > Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
> > > > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > > > > Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> > > > > Link: https://lore.kernel.org/r/20220120162520.570782-2-valentin.schneider@arm.com
> > > > 
> > > > Any objection to picking this for stable? I'm interested in this one for some
> > > > Android users but prefer if it can be taken by stable rather than backport it
> > > > individually.
> > > > 
> > > > I think it makes sense to pick the next one in the series too.
> > > 
> > > What commit does this fix in Linus's tree?
> > 
> > It should be this one: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")
> 
> Should this be okay to be picked up by stable now? I can see AUTOSEL has picked
> it up for v5.15+, but it impacts v5.10 too.

It does not apply to 5.10 at all, how did you test this?

{sigh}

Again, if you want this applied to any stable trees, please test that it
works and send the properly backported patches.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [tip: sched/core] sched/tracing: Don't re-read p->state when emitting sched_switch event
  2022-04-11  8:05                     ` Holger Hoffstätte
@ 2022-04-11 13:23                       ` Greg KH
  0 siblings, 0 replies; 49+ messages in thread
From: Greg KH @ 2022-04-11 13:23 UTC (permalink / raw)
  To: Holger Hoffstätte
  Cc: Qais Yousef, linux-kernel, linux-tip-commits,
	Abhijeet Dharmapurikar, Valentin Schneider,
	Peter Zijlstra (Intel), Steven Rostedt (Google),
	x86, stable

On Mon, Apr 11, 2022 at 10:05:19AM +0200, Holger Hoffstätte wrote:
> On 2022-04-11 09:28, Greg KH wrote:
> > On Mon, Apr 11, 2022 at 09:18:19AM +0200, Holger Hoffstätte wrote:
> > > On 2022-04-11 01:22, Holger Hoffstätte wrote:
> > > > On 2022-04-11 00:06, Qais Yousef wrote:
> > > > > On 04/10/22 00:38, Qais Yousef wrote:
> > > > > > On 03/08/22 18:51, Qais Yousef wrote:
> > > > > > > On 03/08/22 19:10, Greg KH wrote:
> > > > > > > > On Tue, Mar 08, 2022 at 06:02:40PM +0000, Qais Yousef wrote:
> > > > > > > > > +CC stable
> > > > > > > > > 
> > > > > > > > > On 03/01/22 15:24, tip-bot2 for Valentin Schneider wrote:
> > > > > > > > > > The following commit has been merged into the sched/core branch of tip:
> > > > > > > > > > 
> > > > > > > > > > Commit-ID:     fa2c3254d7cfff5f7a916ab928a562d1165f17bb
> > > > > > > > > > Gitweb:        https://git.kernel.org/tip/fa2c3254d7cfff5f7a916ab928a562d1165f17bb
> > > > > > > > > > Author:        Valentin Schneider <valentin.schneider@arm.com>
> > > > > > > > > > AuthorDate:    Thu, 20 Jan 2022 16:25:19
> > > > > > > > > > Committer:     Peter Zijlstra <peterz@infradead.org>
> > > > > > > > > > CommitterDate: Tue, 01 Mar 2022 16:18:39 +01:00
> > > > > > > > > > 
> > > > > > > > > > sched/tracing: Don't re-read p->state when emitting sched_switch event
> > > > > > > > > > 
> > > > > > > > > > As of commit
> > > > > > > > > > 
> > > > > > > > > >     c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")
> > > > > > > > > > 
> > > > > > > > > > the following sequence becomes possible:
> > > > > > > > > > 
> > > > > > > > > >                p->__state = TASK_INTERRUPTIBLE;
> > > > > > > > > >                __schedule()
> > > > > > > > > >              deactivate_task(p);
> > > > > > > > > >     ttwu()
> > > > > > > > > >       READ !p->on_rq
> > > > > > > > > >       p->__state=TASK_WAKING
> > > > > > > > > >              trace_sched_switch()
> > > > > > > > > >                __trace_sched_switch_state()
> > > > > > > > > >                  task_state_index()
> > > > > > > > > >                    return 0;
> > > > > > > > > > 
> > > > > > > > > > TASK_WAKING isn't in TASK_REPORT, so the task appears as TASK_RUNNING in
> > > > > > > > > > the trace event.
> > > > > > > > > > 
> > > > > > > > > > Prevent this by pushing the value read from __schedule() down the trace
> > > > > > > > > > event.
> > > > > > > > > > 
> > > > > > > > > > Reported-by: Abhijeet Dharmapurikar <adharmap@quicinc.com>
> > > > > > > > > > Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
> > > > > > > > > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > > > > > > > > > Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> > > > > > > > > > Link: https://lore.kernel.org/r/20220120162520.570782-2-valentin.schneider@arm.com
> > > > > > > > > 
> > > > > > > > > Any objection to picking this for stable? I'm interested in this one for some
> > > > > > > > > Android users but prefer if it can be taken by stable rather than backport it
> > > > > > > > > individually.
> > > > > > > > > 
> > > > > > > > > I think it makes sense to pick the next one in the series too.
> > > > > > > > 
> > > > > > > > What commit does this fix in Linus's tree?
> > > > > > > 
> > > > > > > It should be this one: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")
> > > > > > 
> > > > > > Should this be okay to be picked up by stable now? I can see AUTOSEL has picked
> > > > > > it up for v5.15+, but it impacts v5.10 too.
> > > > > 
> > > > > commit: fa2c3254d7cfff5f7a916ab928a562d1165f17bb
> > > > > subject: sched/tracing: Don't re-read p->state when emitting sched_switch event
> > > > > 
> > > > > This patch has an impact on Android 5.10 users who experience tooling breakage.
> > > > > Is it possible to include in 5.10 LTS please?
> > > > > 
> > > > > It was already picked up for 5.15+ by AUTOSEL and only 5.10 is missing.
> > > > > 
> > > > 
> > > > https://lore.kernel.org/stable/Yk2PQzynOVOzJdPo@kroah.com/
> > > > 
> > > > However, since then further investigation (still in progress) has shown that this
> > > > may have been the fault of the tool in question, so if you can verify that tracing
> > > > sched still works for you with this patch in 5.15.x then by all means
> > > > let's merge it.
> > > 
> > > So it turns out the lockup is indeed the fault of the tool, which contains multiple
> > > kernel-version dependent tracepoint definitions and now fails with this
> > > patch.
> > 
> > What tools is this?
> 
> sysdig - which uses a helper kernel module which accesses tracepoints, but of course
> (as I just found) with copypasta'd TP definitions, which broke with this patch due to
> the additional parameter in the function signature. It's been prone to breakage forever
> because of a lack of a stable kernel ABI.
> 
> Took me a while to find/figure out, but IMHO better safe than sorry. We've had
> autoselected scheduler patches before that looked fine but really were not.

Thanks for the info; sysdig does do some pretty intrusive things, but I
thought it had moved over to being almost all bpf code.

Anyway, I'll wait for Qais to submit a properly backported and tested
set of patches before picking this up again due to all of the problems
I've had with this patch.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 2/2] sched/tracing: Report TASK_RTLOCK_WAIT tasks as TASK_UNINTERRUPTIBLE
  2022-04-11 13:20         ` Greg KH
@ 2022-04-11 20:18           ` Qais Yousef
  0 siblings, 0 replies; 49+ messages in thread
From: Qais Yousef @ 2022-04-11 20:18 UTC (permalink / raw)
  To: Greg KH
  Cc: stable, linux-kernel, Uwe Kleine-König,
	Abhijeet Dharmapurikar, Dietmar Eggemann, Steven Rostedt,
	Peter Zijlstra, Ingo Molnar, Vincent Guittot, Thomas Gleixner,
	Sebastian Andrzej Siewior, Juri Lelli,
	Daniel Bristot de Oliveira, Kees Cook, Andrew Morton,
	Eric W. Biederman, Alexey Gladkov, Kenta.Tada, Randy Dunlap,
	Ed Tsai, Valentin Schneider

On 04/11/22 15:20, Greg KH wrote:
> On Sun, Apr 10, 2022 at 11:13:25PM +0100, Qais Yousef wrote:
> > On 04/10/22 08:14, Greg KH wrote:
> > > On Sun, Apr 10, 2022 at 12:42:24AM +0100, Qais Yousef wrote:
> > > > +CC stable
> > > > 
> > > > On 01/20/22 16:25, Valentin Schneider wrote:
> > > > > TASK_RTLOCK_WAIT currently isn't part of TASK_REPORT, thus a task blocking
> > > > > on an rtlock will appear as having a task state == 0, IOW TASK_RUNNING.
> > > > > 
> > > > > The actual state is saved in p->saved_state, but reading it after reading
> > > > > p->__state has a few issues:
> > > > > o that could still be TASK_RUNNING in the case of e.g. rt_spin_lock
> > > > > o ttwu_state_match() might have changed that to TASK_RUNNING
> > > > > 
> > > > > As pointed out by Eric, adding TASK_RTLOCK_WAIT to TASK_REPORT implies
> > > > > exposing a new state to userspace tools which may not know what to do with
> > > > > it. The only information that needs to be conveyed here is that a task is
> > > > > waiting on an rt_mutex, which matches TASK_UNINTERRUPTIBLE - there's no
> > > > > need for a new state.
> > > > > 
> > > > > Reported-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
> > > > > Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
> > > > 
> > > > Any objection for this to be picked up by stable? We care about Patch 1 only in
> > > > this series for stable, but it seems sensible to pick this one too, no strong
> > > > feeling if it is omitted though.
> > > > 
> > > > AFAICT it seems the problem dates back since commit:
> > > > 
> > > > 	1593baab910d ("sched/debug: Implement consistent task-state printing")
> > > > 
> > > > or even before. I think v4.14+ is good enough.
> > > 
> > > 
> > > <formletter>
> > > 
> > > This is not the correct way to submit patches for inclusion in the
> > > stable kernel tree.  Please read:
> > >     https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
> > > for how to do this properly.
> > > 
> > > </formletter>
> > 
> > Apologies.
> > 
> > commit: 25795ef6299f07ce3838f3253a9cb34f64efcfae
> > Subject: sched/tracing: Report TASK_RTLOCK_WAIT tasks as TASK_UNINTERRUPTIBLE
> > 
> > I am interested in Patch 1 in this series as I know it impacts some Android
> > 5.10 users. But this patch seems a good candidate for stable too since it was
> > observed by a user (Uwe) and AFAICT the problem dates back to v4.14+ kernels.
> > 
> > Suggested kernels: v4.14+. This has already been picked up by AUTOSEL for
> > v5.15+ stable trees.
> 
> I do not think you have tested this in any of those kernels, as it
> breaks the build :(
> 
> Please send a set of patches, properly backported and tested, that you
> wish to see applied to stable kernels and we will be glad to review them
> and apply them.  But to suggest patches to be merged that you have not
> even tested is not good.

Okay. Maybe I was trying to bite off too much at once. I care about patch 1
only. I'll make sure it builds and works against 5.10 and post it separately.
Please ignore this one.

Sorry for the noise.

Thanks

--
Qais Yousef

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [tip: sched/core] sched/tracing: Don't re-read p->state when emitting sched_switch event
  2022-04-11 13:22             ` Greg KH
@ 2022-04-11 21:06               ` Qais Yousef
  0 siblings, 0 replies; 49+ messages in thread
From: Qais Yousef @ 2022-04-11 21:06 UTC (permalink / raw)
  To: Greg KH
  Cc: linux-kernel, linux-tip-commits, Abhijeet Dharmapurikar,
	Valentin Schneider, Peter Zijlstra (Intel),
	Steven Rostedt (Google),
	x86, stable

On 04/11/22 15:22, Greg KH wrote:
> On Sun, Apr 10, 2022 at 12:38:29AM +0100, Qais Yousef wrote:
> > On 03/08/22 18:51, Qais Yousef wrote:
> > > On 03/08/22 19:10, Greg KH wrote:
> > > > On Tue, Mar 08, 2022 at 06:02:40PM +0000, Qais Yousef wrote:
> > > > > +CC stable
> > > > > 
> > > > > On 03/01/22 15:24, tip-bot2 for Valentin Schneider wrote:
> > > > > > The following commit has been merged into the sched/core branch of tip:
> > > > > > 
> > > > > > Commit-ID:     fa2c3254d7cfff5f7a916ab928a562d1165f17bb
> > > > > > Gitweb:        https://git.kernel.org/tip/fa2c3254d7cfff5f7a916ab928a562d1165f17bb
> > > > > > Author:        Valentin Schneider <valentin.schneider@arm.com>
> > > > > > AuthorDate:    Thu, 20 Jan 2022 16:25:19 
> > > > > > Committer:     Peter Zijlstra <peterz@infradead.org>
> > > > > > CommitterDate: Tue, 01 Mar 2022 16:18:39 +01:00
> > > > > > 
> > > > > > sched/tracing: Don't re-read p->state when emitting sched_switch event
> > > > > > 
> > > > > > As of commit
> > > > > > 
> > > > > >   c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")
> > > > > > 
> > > > > > the following sequence becomes possible:
> > > > > > 
> > > > > > 		      p->__state = TASK_INTERRUPTIBLE;
> > > > > > 		      __schedule()
> > > > > > 			deactivate_task(p);
> > > > > >   ttwu()
> > > > > >     READ !p->on_rq
> > > > > >     p->__state=TASK_WAKING
> > > > > > 			trace_sched_switch()
> > > > > > 			  __trace_sched_switch_state()
> > > > > > 			    task_state_index()
> > > > > > 			      return 0;
> > > > > > 
> > > > > > TASK_WAKING isn't in TASK_REPORT, so the task appears as TASK_RUNNING in
> > > > > > the trace event.
> > > > > > 
> > > > > > Prevent this by pushing the value read from __schedule() down the trace
> > > > > > event.
> > > > > > 
> > > > > > Reported-by: Abhijeet Dharmapurikar <adharmap@quicinc.com>
> > > > > > Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
> > > > > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > > > > > Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> > > > > > Link: https://lore.kernel.org/r/20220120162520.570782-2-valentin.schneider@arm.com
> > > > > 
> > > > > Any objection to picking this for stable? I'm interested in this one for some
> > > > > Android users but prefer if it can be taken by stable rather than backport it
> > > > > individually.
> > > > > 
> > > > > I think it makes sense to pick the next one in the series too.
> > > > 
> > > > What commit does this fix in Linus's tree?
> > > 
> > > It should be this one: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")
> > 
> > Should this be okay to be picked up by stable now? I can see AUTOSEL has picked
> > it up for v5.15+, but it impacts v5.10 too.
> 
> It does not apply to 5.10 at all, how did you test this?
> 
> {sigh}
> 
> Again, if you want this applied to any stable trees, please test that it
> works and send the properly backported patches.

Sorry. I'm picking up the pieces after a colleague left, and I was under the
impression that 5.10 Android users already use this as-is but, due to GKI,
just need it formally in stable. A few things got lost in translation, I am
very sorry!

Let me sort all of this out properly including the testing part for 5.15 too.

Thanks!

--
Qais Yousef

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH] sched/tracing: append prev_state to tp args instead
  2022-01-20 16:25 [PATCH v3 0/2] sched/tracing: sched_switch prev_state reported as TASK_RUNNING when it's not Valentin Schneider
                   ` (2 preceding siblings ...)
  2022-01-21 17:15 ` [PATCH v3 0/2] sched/tracing: sched_switch prev_state reported as TASK_RUNNING when it's not Steven Rostedt
@ 2022-04-21 22:12 ` Delyan Kratunov
  2022-04-22 10:13   ` Valentin Schneider
  2022-04-22 11:09   ` Peter Zijlstra
  3 siblings, 2 replies; 49+ messages in thread
From: Delyan Kratunov @ 2022-04-21 22:12 UTC (permalink / raw)
  To: valentin.schneider
  Cc: bigeasy, dietmar.eggemann, keescook, andrii, vincent.guittot,
	akpm, mingo, linux-kernel, rdunlap, rostedt, Kenta.Tada,
	adharmap, tglx, bristot, ebiederm, peterz, ast, legion, ed.tsai,
	u.kleine-koenig, juri.lelli, x86

Hi folks,

While working on bpf tooling, we noticed that the sched_switch tracepoint
signature recently changed in an incompatible manner. This affects the
runqslower tools in the kernel tree, as well as multiple libbpf tools in iovisor/bcc.

It would be a fair amount of churn to fix all these tools, not to mention any
non-public tools people may be using. If you are open to it, here's a
description and a patch that moves the new argument to the end,
so existing tools can continue working without change (the new argument
just won't be extracted in existing programs):

Commit fa2c3254d7cf (sched/tracing: Don't re-read p->state when emitting
sched_switch event, 2022-01-20) added a new prev_state argument to the
sched_switch tracepoint, before the prev task_struct pointer.

This reordering of arguments broke BPF programs that use the raw
tracepoint (e.g. tp_btf programs). The type of the second argument has
changed and existing programs that assume a task_struct* argument
(e.g. for bpf_task_storage or member access) will now fail to verify.

If we instead append the new argument to the end, all existing programs
will continue to work and can conditionally extract the prev_state
argument on supported kernel versions.
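
For illustration, a tp_btf program written against the reordered prototype;
existing three-argument programs keep verifying because prev/next keep their
slots (a sketch using libbpf's BPF_PROG macro; the state bit is hard-coded
because the TASK_* macros are not part of BTF):

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  char LICENSE[] SEC("license") = "GPL";

  SEC("tp_btf/sched_switch")
  int BPF_PROG(handle_switch, bool preempt, struct task_struct *prev,
               struct task_struct *next, unsigned int prev_state)
  {
          /* 0x2 == TASK_UNINTERRUPTIBLE */
          if (prev_state & 0x2)
                  bpf_printk("%s switched out in D state", prev->comm);
          return 0;
  }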

Fixes: fa2c3254d7cf (sched/tracing: Don't re-read p->state when emitting sched_switch event, 2022-01-20)
Signed-off-by: Delyan Kratunov <delyank@fb.com>
---
 include/trace/events/sched.h      | 6 +++---
 kernel/sched/core.c               | 2 +-
 kernel/trace/fgraph.c             | 4 ++--
 kernel/trace/ftrace.c             | 4 ++--
 kernel/trace/trace_events.c       | 8 ++++----
 kernel/trace/trace_osnoise.c      | 4 ++--
 kernel/trace/trace_sched_switch.c | 4 ++--
 kernel/trace/trace_sched_wakeup.c | 4 ++--
 8 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 65e786756321..fbb99a61f714 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -222,11 +222,11 @@ static inline long __trace_sched_switch_state(bool preempt,
 TRACE_EVENT(sched_switch,

 	TP_PROTO(bool preempt,
-		 unsigned int prev_state,
 		 struct task_struct *prev,
-		 struct task_struct *next),
+		 struct task_struct *next,
+		 unsigned int prev_state),

-	TP_ARGS(preempt, prev_state, prev, next),
+	TP_ARGS(preempt, prev, next, prev_state),

 	TP_STRUCT__entry(
 		__array(	char,	prev_comm,	TASK_COMM_LEN	)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 51efaabac3e4..d58c0389eb23 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6382,7 +6382,7 @@ static void __sched notrace __schedule(unsigned int sched_mode)
 		migrate_disable_switch(rq, prev);
 		psi_sched_switch(prev, next, !task_on_rq_queued(prev));

-		trace_sched_switch(sched_mode & SM_MASK_PREEMPT, prev_state, prev, next);
+		trace_sched_switch(sched_mode & SM_MASK_PREEMPT, prev, next, prev_state);

 		/* Also unlocks the rq: */
 		rq = context_switch(rq, prev, next, &rf);
diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
index 8f4fb328133a..a7e84c8543cb 100644
--- a/kernel/trace/fgraph.c
+++ b/kernel/trace/fgraph.c
@@ -404,9 +404,9 @@ static int alloc_retstack_tasklist(struct ftrace_ret_stack **ret_stack_list)

 static void
 ftrace_graph_probe_sched_switch(void *ignore, bool preempt,
-				unsigned int prev_state,
 				struct task_struct *prev,
-				struct task_struct *next)
+				struct task_struct *next,
+				unsigned int prev_state)
 {
 	unsigned long long timestamp;
 	int index;
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 4f1d2f5e7263..af899b058c8d 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -7420,9 +7420,9 @@ ftrace_func_t ftrace_ops_get_func(struct ftrace_ops *ops)

 static void
 ftrace_filter_pid_sched_switch_probe(void *data, bool preempt,
-				     unsigned int prev_state,
 				     struct task_struct *prev,
-				     struct task_struct *next)
+				     struct task_struct *next,
+				     unsigned int prev_state)
 {
 	struct trace_array *tr = data;
 	struct trace_pid_list *pid_list;
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index e11e167b7809..f97de82d1342 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -773,9 +773,9 @@ void trace_event_follow_fork(struct trace_array *tr, bool enable)

 static void
 event_filter_pid_sched_switch_probe_pre(void *data, bool preempt,
-					unsigned int prev_state,
 					struct task_struct *prev,
-					struct task_struct *next)
+					struct task_struct *next,
+					unsigned int prev_state)
 {
 	struct trace_array *tr = data;
 	struct trace_pid_list *no_pid_list;
@@ -799,9 +799,9 @@ event_filter_pid_sched_switch_probe_pre(void *data, bool preempt,

 static void
 event_filter_pid_sched_switch_probe_post(void *data, bool preempt,
-					 unsigned int prev_state,
 					 struct task_struct *prev,
-					 struct task_struct *next)
+					 struct task_struct *next,
+					 unsigned int prev_state)
 {
 	struct trace_array *tr = data;
 	struct trace_pid_list *no_pid_list;
diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
index e9ae1f33a7f0..afb92e2f0aea 100644
--- a/kernel/trace/trace_osnoise.c
+++ b/kernel/trace/trace_osnoise.c
@@ -1168,9 +1168,9 @@ thread_exit(struct osnoise_variables *osn_var, struct task_struct *t)
  */
 static void
 trace_sched_switch_callback(void *data, bool preempt,
-			    unsigned int prev_state,
 			    struct task_struct *p,
-			    struct task_struct *n)
+			    struct task_struct *n,
+			    unsigned int prev_state)
 {
 	struct osnoise_variables *osn_var = this_cpu_osn_var();

diff --git a/kernel/trace/trace_sched_switch.c b/kernel/trace/trace_sched_switch.c
index 45796d8bd4b2..c9ffdcfe622e 100644
--- a/kernel/trace/trace_sched_switch.c
+++ b/kernel/trace/trace_sched_switch.c
@@ -22,8 +22,8 @@ static DEFINE_MUTEX(sched_register_mutex);

 static void
 probe_sched_switch(void *ignore, bool preempt,
-		   unsigned int prev_state,
-		   struct task_struct *prev, struct task_struct *next)
+		   struct task_struct *prev, struct task_struct *next,
+		   unsigned int prev_state)
 {
 	int flags;

diff --git a/kernel/trace/trace_sched_wakeup.c b/kernel/trace/trace_sched_wakeup.c
index 46429f9a96fa..330aee1c1a49 100644
--- a/kernel/trace/trace_sched_wakeup.c
+++ b/kernel/trace/trace_sched_wakeup.c
@@ -426,8 +426,8 @@ tracing_sched_wakeup_trace(struct trace_array *tr,

 static void notrace
 probe_wakeup_sched_switch(void *ignore, bool preempt,
-			  unsigned int prev_state,
-			  struct task_struct *prev, struct task_struct *next)
+			  struct task_struct *prev, struct task_struct *next,
+			  unsigned int prev_state)
 {
 	struct trace_array_cpu *data;
 	u64 T0, T1, delta;
--
2.35.1

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH] sched/tracing: append prev_state to tp args instead
  2022-04-21 22:12 ` [PATCH] sched/tracing: append prev_state to tp args instead Delyan Kratunov
@ 2022-04-22 10:13   ` Valentin Schneider
  2022-04-22 11:09   ` Peter Zijlstra
  1 sibling, 0 replies; 49+ messages in thread
From: Valentin Schneider @ 2022-04-22 10:13 UTC (permalink / raw)
  To: Delyan Kratunov, valentin.schneider
  Cc: bigeasy, dietmar.eggemann, keescook, andrii, vincent.guittot,
	akpm, mingo, linux-kernel, rdunlap, rostedt, Kenta.Tada,
	adharmap, tglx, bristot, ebiederm, peterz, ast, legion, ed.tsai,
	u.kleine-koenig, juri.lelli, x86

On 21/04/22 22:12, Delyan Kratunov wrote:
> Hi folks,
>
> While working on bpf tooling, we noticed that the sched_switch tracepoint
> signature recently changed in an incompatible manner. This affects the
> runqslower tools in the kernel tree, as well as multiple libbpf tools in iovisor/bcc.
>

Hmph, unfortunate. What should I have run to catch this in the first place?
This doesn't trigger a single warning for me:

  $ cd tools/bpf && make runqslower

I'm guessing this just packages the tool, and the breakage only happens
when the actual bpf() syscall happens?

> It would be a fair amount of churn to fix all these tools, not to mention any
> non-public tools people may be using. If you are open to it, here's a
> description and a patch that moves the new argument to the end,
> so existing tools can continue working without change (the new argument
> just won't be extracted in existing programs):
>
> Commit fa2c3254d7cf (sched/tracing: Don't re-read p->state when emitting
> sched_switch event, 2022-01-20) added a new prev_state argument to the
> sched_switch tracepoint, before the prev task_struct pointer.
>
> This reordering of arguments broke BPF programs that use the raw
> tracepoint (e.g. tp_btf programs). The type of the second argument has
> changed and existing programs that assume a task_struct* argument
> (e.g. for bpf_task_storage or member access) will now fail to verify.
>
> If we instead append the new argument to the end, all existing programs
> will continue to work and can conditionally extract the prev_state
> argument on supported kernel versions.
>

Provided this didn't miss any new user of the sched_switch TP (I didn't
find any with rg '\bregister_[A-z,0-9,-,_]+sched_switch'), I'm okay with it
(well, I think this falls into the breaking-change category, so I don't have
much choice, do I :-))

Reviewed-by: Valentin Schneider <vschneid@redhat.com>

> Fixes: fa2c3254d7cf (sched/tracing: Don't re-read p->state when emitting sched_switch event, 2022-01-20)
> Signed-off-by: Delyan Kratunov <delyank@fb.com>
> ---
>  include/trace/events/sched.h      | 6 +++---
>  kernel/sched/core.c               | 2 +-
>  kernel/trace/fgraph.c             | 4 ++--
>  kernel/trace/ftrace.c             | 4 ++--
>  kernel/trace/trace_events.c       | 8 ++++----
>  kernel/trace/trace_osnoise.c      | 4 ++--
>  kernel/trace/trace_sched_switch.c | 4 ++--
>  kernel/trace/trace_sched_wakeup.c | 4 ++--
>  8 files changed, 18 insertions(+), 18 deletions(-)
>
> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index 65e786756321..fbb99a61f714 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -222,11 +222,11 @@ static inline long __trace_sched_switch_state(bool preempt,
>  TRACE_EVENT(sched_switch,
>
>       TP_PROTO(bool preempt,
> -		 unsigned int prev_state,
>                struct task_struct *prev,
> -		 struct task_struct *next),
> +		 struct task_struct *next,
> +		 unsigned int prev_state),
>
> -	TP_ARGS(preempt, prev_state, prev, next),
> +	TP_ARGS(preempt, prev, next, prev_state),
>
>       TP_STRUCT__entry(
>               __array(	char,	prev_comm,	TASK_COMM_LEN	)
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 51efaabac3e4..d58c0389eb23 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6382,7 +6382,7 @@ static void __sched notrace __schedule(unsigned int sched_mode)
>               migrate_disable_switch(rq, prev);
>               psi_sched_switch(prev, next, !task_on_rq_queued(prev));
>
> -		trace_sched_switch(sched_mode & SM_MASK_PREEMPT, prev_state, prev, next);
> +		trace_sched_switch(sched_mode & SM_MASK_PREEMPT, prev, next, prev_state);
>
>               /* Also unlocks the rq: */
>               rq = context_switch(rq, prev, next, &rf);
> diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
> index 8f4fb328133a..a7e84c8543cb 100644
> --- a/kernel/trace/fgraph.c
> +++ b/kernel/trace/fgraph.c
> @@ -404,9 +404,9 @@ static int alloc_retstack_tasklist(struct ftrace_ret_stack **ret_stack_list)
>
>  static void
>  ftrace_graph_probe_sched_switch(void *ignore, bool preempt,
> -				unsigned int prev_state,
>                               struct task_struct *prev,
> -				struct task_struct *next)
> +				struct task_struct *next,
> +				unsigned int prev_state)
>  {
>       unsigned long long timestamp;
>       int index;
> diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> index 4f1d2f5e7263..af899b058c8d 100644
> --- a/kernel/trace/ftrace.c
> +++ b/kernel/trace/ftrace.c
> @@ -7420,9 +7420,9 @@ ftrace_func_t ftrace_ops_get_func(struct ftrace_ops *ops)
>
>  static void
>  ftrace_filter_pid_sched_switch_probe(void *data, bool preempt,
> -				     unsigned int prev_state,
>                                    struct task_struct *prev,
> -				     struct task_struct *next)
> +				     struct task_struct *next,
> +				     unsigned int prev_state)
>  {
>       struct trace_array *tr = data;
>       struct trace_pid_list *pid_list;
> diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
> index e11e167b7809..f97de82d1342 100644
> --- a/kernel/trace/trace_events.c
> +++ b/kernel/trace/trace_events.c
> @@ -773,9 +773,9 @@ void trace_event_follow_fork(struct trace_array *tr, bool enable)
>
>  static void
>  event_filter_pid_sched_switch_probe_pre(void *data, bool preempt,
> -					unsigned int prev_state,
>                                       struct task_struct *prev,
> -					struct task_struct *next)
> +					struct task_struct *next,
> +					unsigned int prev_state)
>  {
>       struct trace_array *tr = data;
>       struct trace_pid_list *no_pid_list;
> @@ -799,9 +799,9 @@ event_filter_pid_sched_switch_probe_pre(void *data, bool preempt,
>
>  static void
>  event_filter_pid_sched_switch_probe_post(void *data, bool preempt,
> -					 unsigned int prev_state,
>                                        struct task_struct *prev,
> -					 struct task_struct *next)
> +					 struct task_struct *next,
> +					 unsigned int prev_state)
>  {
>       struct trace_array *tr = data;
>       struct trace_pid_list *no_pid_list;
> diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
> index e9ae1f33a7f0..afb92e2f0aea 100644
> --- a/kernel/trace/trace_osnoise.c
> +++ b/kernel/trace/trace_osnoise.c
> @@ -1168,9 +1168,9 @@ thread_exit(struct osnoise_variables *osn_var, struct task_struct *t)
>   */
>  static void
>  trace_sched_switch_callback(void *data, bool preempt,
> -			    unsigned int prev_state,
>                           struct task_struct *p,
> -			    struct task_struct *n)
> +			    struct task_struct *n,
> +			    unsigned int prev_state)
>  {
>       struct osnoise_variables *osn_var = this_cpu_osn_var();
>
> diff --git a/kernel/trace/trace_sched_switch.c b/kernel/trace/trace_sched_switch.c
> index 45796d8bd4b2..c9ffdcfe622e 100644
> --- a/kernel/trace/trace_sched_switch.c
> +++ b/kernel/trace/trace_sched_switch.c
> @@ -22,8 +22,8 @@ static DEFINE_MUTEX(sched_register_mutex);
>
>  static void
>  probe_sched_switch(void *ignore, bool preempt,
> -		   unsigned int prev_state,
> -		   struct task_struct *prev, struct task_struct *next)
> +		   struct task_struct *prev, struct task_struct *next,
> +		   unsigned int prev_state)
>  {
>       int flags;
>
> diff --git a/kernel/trace/trace_sched_wakeup.c b/kernel/trace/trace_sched_wakeup.c
> index 46429f9a96fa..330aee1c1a49 100644
> --- a/kernel/trace/trace_sched_wakeup.c
> +++ b/kernel/trace/trace_sched_wakeup.c
> @@ -426,8 +426,8 @@ tracing_sched_wakeup_trace(struct trace_array *tr,
>
>  static void notrace
>  probe_wakeup_sched_switch(void *ignore, bool preempt,
> -			  unsigned int prev_state,
> -			  struct task_struct *prev, struct task_struct *next)
> +			  struct task_struct *prev, struct task_struct *next,
> +			  unsigned int prev_state)
>  {
>       struct trace_array_cpu *data;
>       u64 T0, T1, delta;
> --
> 2.35.1


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] sched/tracing: append prev_state to tp args instead
  2022-04-21 22:12 ` [PATCH] sched/tracing: append prev_state to tp args instead Delyan Kratunov
  2022-04-22 10:13   ` Valentin Schneider
@ 2022-04-22 11:09   ` Peter Zijlstra
  2022-04-22 15:55     ` Steven Rostedt
                       ` (2 more replies)
  1 sibling, 3 replies; 49+ messages in thread
From: Peter Zijlstra @ 2022-04-22 11:09 UTC (permalink / raw)
  To: Delyan Kratunov
  Cc: valentin.schneider, bigeasy, dietmar.eggemann, keescook, andrii,
	vincent.guittot, akpm, mingo, linux-kernel, rdunlap, rostedt,
	Kenta.Tada, adharmap, tglx, bristot, ebiederm, ast, legion,
	ed.tsai, u.kleine-koenig, juri.lelli, x86

On Thu, Apr 21, 2022 at 10:12:25PM +0000, Delyan Kratunov wrote:
> Hi folks,
> 
> While working on bpf tooling, we noticed that the sched_switch tracepoint
> signature recently changed in an incompatible manner. This affects the
> runqslower tools in the kernel tree, as well as multiple libbpf tools in iovisor/bcc.
> 
> It would be a fair amount of churn to fix all these tools, not to mention any
> non-public tools people may be using.

So on the one hand, too bad, deal with it. None of this is ABI.

> If you are open to it, here's a
> description and a patch that moves the new argument to the end,
> so existing tools can continue working without change (the new argument
> just won't be extracted in existing programs):

And on the other hand; those users need to be fixed anyway, right?
Accessing prev->__state is equally broken.

Yes, the proposed hack is cute, but I have very little sympathy for any
of this. These are in-kernel interfaces, they change, there must not be
any impediment to change.

If bpf wants to ride on them, it needs to suffer the pain of doing so.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] sched/tracing: append prev_state to tp args instead
  2022-04-22 11:09   ` Peter Zijlstra
@ 2022-04-22 15:55     ` Steven Rostedt
  2022-04-22 16:54       ` Andrii Nakryiko
  2022-04-22 16:37     ` Andrii Nakryiko
  2022-04-22 17:22     ` Delyan Kratunov
  2 siblings, 1 reply; 49+ messages in thread
From: Steven Rostedt @ 2022-04-22 15:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Delyan Kratunov, valentin.schneider, bigeasy, dietmar.eggemann,
	keescook, andrii, vincent.guittot, akpm, mingo, linux-kernel,
	rdunlap, Kenta.Tada, adharmap, tglx, bristot, ebiederm, ast,
	legion, ed.tsai, u.kleine-koenig, juri.lelli, x86

On Fri, 22 Apr 2022 13:09:03 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> If bpf wants to ride on them, it needs to suffer the pain of doing so.

And I constantly hear that BPF is not an ABI, and is not guaranteed to work
from one kernel version to the next.

-- Steve


* Re: [PATCH] sched/tracing: append prev_state to tp args instead
  2022-04-22 11:09   ` Peter Zijlstra
  2022-04-22 15:55     ` Steven Rostedt
@ 2022-04-22 16:37     ` Andrii Nakryiko
  2022-04-22 17:22     ` Delyan Kratunov
  2 siblings, 0 replies; 49+ messages in thread
From: Andrii Nakryiko @ 2022-04-22 16:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Delyan Kratunov, valentin.schneider, bigeasy, dietmar.eggemann,
	keescook, andrii, vincent.guittot, akpm, mingo, linux-kernel,
	rdunlap, rostedt, Kenta.Tada, adharmap, tglx, bristot, ebiederm,
	ast, legion, ed.tsai, u.kleine-koenig, juri.lelli, x86

On Fri, Apr 22, 2022 at 4:09 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Apr 21, 2022 at 10:12:25PM +0000, Delyan Kratunov wrote:
> > Hi folks,
> >
> > While working on bpf tooling, we noticed that the sched_switch tracepoint
> > signature recently changed in an incompatible manner. This affects the
> > runqslower tools in the kernel tree, as well as multiple libbpf tools in iovisor/bcc.
> >
> > It would be a fair amount of churn to fix all these tools, not to mention any
> > non-public tools people may be using.
>
> So on the one hand, too bad, deal with it. None of this is ABI.
>
> > If you are open to it, here's a
> > description and a patch that moves the new argument to the end,
> > so existing tools can continue working without change (the new argument
> > just won't be extracted in existing programs):
>
> And on the other hand; those users need to be fixed anyway, right?
> Accessing prev->__state is equally broken.
>
> Yes, the proposed hack is cute, but I have very little sympathy for any
> of this. These are in-kernel interfaces, they change, there must not be
> any impediment to change.
>
> If bpf wants to ride on them, it needs to suffer the pain of doing so.

Right, none of this is ABI, and the BPF user community is prepared to
deal with that; no one is claiming otherwise. And we did deal with the
__state field rename already, see [0] for one example.

This function argument reordering is much harder to deal with in a
"contained" way like we did for the __state rename, unfortunately. So
given that the fix is just a simple reordering of arguments, appending
the new argument at the end of the function prototype, and spares
multiple people a lot of workaround effort, is there a big downside to
doing it?

  [0] https://github.com/iovisor/bcc/commit/8b350fd51b004d4eddd7caa410e274e2807904f9
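
Concretely, that kind of fix is the usual CO-RE struct-flavor trick.
A minimal sketch, assuming libbpf's bpf_core_read.h helpers (the
task_struct___old flavor and the helper name are illustrative):

	/* flavor describing the pre-rename field */
	struct task_struct___old {
		long state;
	} __attribute__((preserve_access_index));

	static __always_inline __s64 get_task_state(void *t)
	{
		struct task_struct___old *t_old = t;
		struct task_struct *t_new = t;

		if (bpf_core_field_exists(t_new->__state))
			return BPF_CORE_READ(t_new, __state); /* new kernels */

		return BPF_CORE_READ(t_old, state); /* old kernels */
	}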


* Re: [PATCH] sched/tracing: append prev_state to tp args instead
  2022-04-22 15:55     ` Steven Rostedt
@ 2022-04-22 16:54       ` Andrii Nakryiko
  0 siblings, 0 replies; 49+ messages in thread
From: Andrii Nakryiko @ 2022-04-22 16:54 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Delyan Kratunov, valentin.schneider, bigeasy,
	dietmar.eggemann, keescook, andrii, vincent.guittot, akpm, mingo,
	linux-kernel, rdunlap, Kenta.Tada, adharmap, tglx, bristot,
	ebiederm, ast, legion, ed.tsai, u.kleine-koenig, juri.lelli, x86

On Fri, Apr 22, 2022 at 8:56 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Fri, 22 Apr 2022 13:09:03 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
>
> > If bpf wants to ride on them, it needs to suffer the pain of doing so.
>
> And I constantly hear that BPF is not an ABI, and is not guaranteed to work
> from one kernel version to the next.

Right, and that's true in terms of expectations of BPF users. But it's
also true that in practice people's tools have to keep working across
multiple kernel versions and we've developed multiple technologies
(e.g., BPF CO-RE) and techniques to allow people to adapt to kernel
changes. See [0] and [1] for some of the tricks people do in real
production tools to accommodate kernel changes. Some kernel changes
are easier to accommodate, some are harder. This particular one,
though, is pretty complicated as no function or symbol got renamed, so
it's much more involved to detect changes like this. But ultimately
people will do that anyway.

But the ask here is: given that it's not too late and it's trivial to
avoid this breakage in the first place by reordering the function
arguments, we (BPF users) kindly ask you to consider doing that.
Hopefully this trivial change doesn't hamper kernel development in any
way.

  [0] https://github.com/iovisor/bcc/pull/3917
  [1] https://github.com/iovisor/bcc/pull/3747

>
> -- Steve


* Re: [PATCH] sched/tracing: append prev_state to tp args instead
  2022-04-22 11:09   ` Peter Zijlstra
  2022-04-22 15:55     ` Steven Rostedt
  2022-04-22 16:37     ` Andrii Nakryiko
@ 2022-04-22 17:22     ` Delyan Kratunov
  2022-04-22 18:30       ` Alexei Starovoitov
  2 siblings, 1 reply; 49+ messages in thread
From: Delyan Kratunov @ 2022-04-22 17:22 UTC (permalink / raw)
  To: peterz
  Cc: bigeasy, dietmar.eggemann, keescook, x86, andrii,
	u.kleine-koenig, vincent.guittot, akpm, mingo, linux-kernel,
	rdunlap, rostedt, Kenta.Tada, tglx, bristot, ebiederm, ast,
	legion, adharmap, valentin.schneider, ed.tsai, juri.lelli

On Fri, 2022-04-22 at 13:09 +0200, Peter Zijlstra wrote:
> And on the other hand; those users need to be fixed anyway, right?
> Accessing prev->__state is equally broken.

The users that access prev->__state would most likely have to be fixed, for sure.

However, not all users access prev->__state. `offcputime` for example just takes a
stack trace and associates it with the switched out task. This kind of user
would continue working with the proposed patch.

> If bpf wants to ride on them, it needs to suffer the pain of doing so.

Sure, I'm just advocating for a fairly trivial patch to avoid some of the suffering,
hopefully without being a burden to development. If that's not the case, then it's a
clear no-go.



* Re: [PATCH] sched/tracing: append prev_state to tp args instead
  2022-04-22 17:22     ` Delyan Kratunov
@ 2022-04-22 18:30       ` Alexei Starovoitov
  2022-04-26 12:28         ` Peter Zijlstra
  0 siblings, 1 reply; 49+ messages in thread
From: Alexei Starovoitov @ 2022-04-22 18:30 UTC (permalink / raw)
  To: Delyan Kratunov, Namhyung Kim, Arnaldo Carvalho de Melo
  Cc: peterz, bigeasy, dietmar.eggemann, keescook, x86, andrii,
	u.kleine-koenig, vincent.guittot, akpm, mingo, linux-kernel,
	rdunlap, rostedt, Kenta.Tada, tglx, bristot, ebiederm, ast,
	legion, adharmap, valentin.schneider, ed.tsai, juri.lelli

On Fri, Apr 22, 2022 at 10:22 AM Delyan Kratunov <delyank@fb.com> wrote:
>
> On Fri, 2022-04-22 at 13:09 +0200, Peter Zijlstra wrote:
> > And on the other hand; those users need to be fixed anyway, right?
> > Accessing prev->__state is equally broken.
>
> The users that access prev->__state would most likely have to be fixed, for sure.
>
> However, not all users access prev->__state. `offcputime` for example just takes a
> stack trace and associates it with the switched out task. This kind of user
> would continue working with the proposed patch.
>
> > If bpf wants to ride on them, it needs to suffer the pain of doing so.
>
> Sure, I'm just advocating for a fairly trivial patch to avoid some of the suffering,
> hopefully without being a burden to development. If that's not the case, then it's a
> clear no-go.


Namhyung just sent this patch set:
https://patchwork.kernel.org/project/netdevbpf/patch/20220422053401.208207-3-namhyung@kernel.org/

to add off-cpu profiling to perf.
It also hooks into sched_switch tracepoint.
Notice it deals with state->__state rename just fine.
But it will have a hard time without this patch
until we add all the extra CO-RE features to detect
and automatically adjust bpf progs when tracepoint
arguments order changed.
We will do it eventually, of course.
There will be additional work in llvm, libbpf, kernel, etc.
But for now I think it would be good to land Delyan's patch
to avoid unnecessary pain to all the users.

Peter, do you mind?


* Re: [PATCH] sched/tracing: append prev_state to tp args instead
  2022-04-22 18:30       ` Alexei Starovoitov
@ 2022-04-26 12:28         ` Peter Zijlstra
  2022-04-26 14:09           ` Qais Yousef
  2022-04-26 15:51           ` [PATCH] " Andrii Nakryiko
  0 siblings, 2 replies; 49+ messages in thread
From: Peter Zijlstra @ 2022-04-26 12:28 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Delyan Kratunov, Namhyung Kim, Arnaldo Carvalho de Melo, bigeasy,
	dietmar.eggemann, keescook, x86, andrii, u.kleine-koenig,
	vincent.guittot, akpm, mingo, linux-kernel, rdunlap, rostedt,
	Kenta.Tada, tglx, bristot, ebiederm, ast, legion, adharmap,
	valentin.schneider, ed.tsai, juri.lelli

On Fri, Apr 22, 2022 at 11:30:12AM -0700, Alexei Starovoitov wrote:
> On Fri, Apr 22, 2022 at 10:22 AM Delyan Kratunov <delyank@fb.com> wrote:
> >
> > On Fri, 2022-04-22 at 13:09 +0200, Peter Zijlstra wrote:
> > > And on the other hand; those users need to be fixed anyway, right?
> > > Accessing prev->__state is equally broken.
> >
> > The users that access prev->__state would most likely have to be fixed, for sure.
> >
> > However, not all users access prev->__state. `offcputime` for example just takes a
> > stack trace and associates it with the switched out task. This kind of user
> > would continue working with the proposed patch.
> >
> > > If bpf wants to ride on them, it needs to suffer the pain of doing so.
> >
> > Sure, I'm just advocating for a fairly trivial patch to avoid some of the suffering,
> > hopefully without being a burden to development. If that's not the case, then it's a
> > clear no-go.
> 
> 
> Namhyung just sent this patch set:
> https://patchwork.kernel.org/project/netdevbpf/patch/20220422053401.208207-3-namhyung@kernel.org/

That has:

+ * recently task_struct->state renamed to __state so it made an incompatible
+ * change.

git tells me:

  2f064a59a11f ("sched: Change task_struct::state")

is almost a year old by now. That doesn't qualify as recent in my book.
That says that 'old kernels used to call this...'.

> to add off-cpu profiling to perf.
> It also hooks into sched_switch tracepoint.
> Notice it deals with state->__state rename just fine.

So I don't speak BPF much; it always takes me more time to make bpf work
than to just hack up the kernel, which makes it hard to get motivated.

However, it was not just a rename, state changed type too, which is why I
did the rename, to make sure all users would get a compile fail and
could adjust.

If you're silently making it work by frobbing the name, you lose that.

Specifically, task_struct::state used to be 'volatile long', while
task_struct::__state is 'unsigned int'. As such, any user must now be
very careful to use READ_ONCE(). I don't see that happening with just
frobbing the name.

Additionally, by shrinking the field, I suppose BE systems get to keep
the pieces?
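
To make the type change concrete, a rough before/after sketch based on
that commit:

	/* before 2f064a59a11f, in struct task_struct: */
	volatile long state;
	/* the volatile qualifier forced a fresh load on every read */
	long s = p->state;

	/* after: */
	unsigned int __state;
	/* readers must now be explicit about the concurrent writes */
	unsigned int s = READ_ONCE(p->__state);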

> But it will have a hard time without this patch
> until we add all the extra CO-RE features to detect
> and automatically adjust bpf progs when tracepoint
> arguments order changed.

Could be me, but silently making it work sounds like fail :/ There's a
reason code changes, users need to adapt, not silently pretend stuff is
as before.

How will you know you need to fix your tool?

> We will do it eventually, of course.
> There will be additional work in llvm, libbpf, kernel, etc.
> But for now I think it would be good to land Delyan's patch
> to avoid unnecessary pain to all the users.
> 
> Peter, do you mind?

I suppose I can help out this time, but I really don't want to set a
precedent for these things. Broken is broken.

The down-side for me is that the argument order no longer makes any
sense.


* Re: [PATCH] sched/tracing: append prev_state to tp args instead
  2022-04-26 12:28         ` Peter Zijlstra
@ 2022-04-26 14:09           ` Qais Yousef
  2022-04-26 15:54             ` Andrii Nakryiko
  2022-04-26 15:51           ` [PATCH] " Andrii Nakryiko
  1 sibling, 1 reply; 49+ messages in thread
From: Qais Yousef @ 2022-04-26 14:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexei Starovoitov, Delyan Kratunov, Namhyung Kim,
	Arnaldo Carvalho de Melo, bigeasy, dietmar.eggemann, keescook,
	x86, andrii, u.kleine-koenig, vincent.guittot, akpm, mingo,
	linux-kernel, rdunlap, rostedt, Kenta.Tada, tglx, bristot,
	ebiederm, ast, legion, adharmap, valentin.schneider, ed.tsai,
	juri.lelli

On 04/26/22 14:28, Peter Zijlstra wrote:
> On Fri, Apr 22, 2022 at 11:30:12AM -0700, Alexei Starovoitov wrote:
> > On Fri, Apr 22, 2022 at 10:22 AM Delyan Kratunov <delyank@fb.com> wrote:
> > >
> > > On Fri, 2022-04-22 at 13:09 +0200, Peter Zijlstra wrote:
> > > > And on the other hand; those users need to be fixed anyway, right?
> > > > Accessing prev->__state is equally broken.
> > >
> > > The users that access prev->__state would most likely have to be fixed, for sure.
> > >
> > > However, not all users access prev->__state. `offcputime` for example just takes a
> > > stack trace and associates it with the switched out task. This kind of user
> > > would continue working with the proposed patch.
> > >
> > > > If bpf wants to ride on them, it needs to suffer the pain of doing so.
> > >
> > > Sure, I'm just advocating for a fairly trivial patch to avoid some of the suffering,
> > > hopefully without being a burden to development. If that's not the case, then it's a
> > > clear no-go.
> > 
> > 
> > Namhyung just sent this patch set:
> > https://patchwork.kernel.org/project/netdevbpf/patch/20220422053401.208207-3-namhyung@kernel.org/
> 
> That has:
> 
> + * recently task_struct->state renamed to __state so it made an incompatible
> + * change.
> 
> git tells me:
> 
>   2f064a59a11f ("sched: Change task_struct::state")
> 
> is almost a year old by now. That doesn't qualify as recent in my book.
> That says that 'old kernels used to call this...'.
> 
> > to add off-cpu profiling to perf.
> > It also hooks into sched_switch tracepoint.
> > Notice it deals with state->__state rename just fine.
> 
> So I don't speak BPF much; it always takes me more time to make bpf work
> than to just hack up the kernel, which makes it hard to get motivated.
> 
> However, it was not just a rename, state changed type too, which is why I
> did the rename, to make sure all users would get a compile fail and
> could adjust.
> 
> If you're silently making it work by frobbing the name, you lose that.
> 
> Specifically, task_struct::state used to be 'volatile long', while
> task_struct::__state is 'unsigned int'. As such, any user must now be
> very careful to use READ_ONCE(). I don't see that happening with just
> frobbing the name.
> 
> Additionally, by shrinking the field, I suppose BE systems get to keep
> the pieces?
> 
> > But it will have a hard time without this patch
> > until we add all the extra CO-RE features to detect
> > and automatically adjust bpf progs when tracepoint
> > arguments order changed.
> 
> Could be me, but silently making it work sounds like fail :/ There's a
> reason code changes, users need to adapt, not silently pretend stuff is
> as before.
> 
> How will you know you need to fix your tool?

If libbpf doesn't fail, then yeah, it's a big problem. I wonder how kprobe
users, who I suppose are more prone to this kind of problem, have been coping.

> 
> > We will do it eventually, of course.
> > There will be additional work in llvm, libbpf, kernel, etc.
> > But for now I think it would be good to land Delyan's patch
> > to avoid unnecessary pain to all the users.
> > 
> > Peter, do you mind?
> 
> I suppose I can help out this time, but I really don't want to set a
> precedent for these things. Broken is broken.
> 
> The down-side for me is that the argument order no longer makes any
> sense.

I'm intending to backport fa2c3254d7cf to 5.10 and 5.15 but waiting for
a Tested-by. If you take this one, then it'll need to be backported too.

Cheers

--
Qais Yousef


* Re: [PATCH] sched/tracing: append prev_state to tp args instead
  2022-04-26 12:28         ` Peter Zijlstra
  2022-04-26 14:09           ` Qais Yousef
@ 2022-04-26 15:51           ` Andrii Nakryiko
  1 sibling, 0 replies; 49+ messages in thread
From: Andrii Nakryiko @ 2022-04-26 15:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexei Starovoitov, Delyan Kratunov, Namhyung Kim,
	Arnaldo Carvalho de Melo, bigeasy, dietmar.eggemann, keescook,
	x86, andrii, u.kleine-koenig, vincent.guittot, akpm, mingo,
	linux-kernel, rdunlap, rostedt, Kenta.Tada, tglx, bristot,
	ebiederm, ast, legion, adharmap, valentin.schneider, ed.tsai,
	juri.lelli

On Tue, Apr 26, 2022 at 5:28 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Apr 22, 2022 at 11:30:12AM -0700, Alexei Starovoitov wrote:
> > On Fri, Apr 22, 2022 at 10:22 AM Delyan Kratunov <delyank@fb.com> wrote:
> > >
> > > On Fri, 2022-04-22 at 13:09 +0200, Peter Zijlstra wrote:
> > > > And on the other hand; those users need to be fixed anyway, right?
> > > > Accessing prev->__state is equally broken.
> > >
> > > The users that access prev->__state would most likely have to be fixed, for sure.
> > >
> > > However, not all users access prev->__state. `offcputime` for example just takes a
> > > stack trace and associates it with the switched out task. This kind of user
> > > would continue working with the proposed patch.
> > >
> > > > If bpf wants to ride on them, it needs to suffer the pain of doing so.
> > >
> > > Sure, I'm just advocating for a fairly trivial patch to avoid some of the suffering,
> > > hopefully without being a burden to development. If that's not the case, then it's a
> > > clear no-go.
> >
> >
> > Namhyung just sent this patch set:
> > https://patchwork.kernel.org/project/netdevbpf/patch/20220422053401.208207-3-namhyung@kernel.org/
>
> That has:
>
> + * recently task_struct->state renamed to __state so it made an incompatible
> + * change.
>
> git tells me:
>
>   2f064a59a11f ("sched: Change task_struct::state")
>
> is almost a year old by now. That doesn't qualify as recent in my book.
> That says that 'old kernels used to call this...'.
>
> > to add off-cpu profiling to perf.
> > It also hooks into sched_switch tracepoint.
> > Notice it deals with state->__state rename just fine.
>
> So I don't speak BPF much; it always takes me more time to make bpf work
> than to just hack up the kernel, which makes it hard to get motivated.
>
> However, it was not just a rename, state changed type too, which is why I
> did the rename, to make sure all users would get a compile fail and
> could adjust.
>
> If you're silently making it work by frobbing the name, you lose that.

In general, libbpf is trying to accommodate it as best as it can. It
will adjust load size automatically based on BTF information for
direct memory reads. For cases when users have to use
bpf_probe_read_kernel(), it's impossible to do this automatically, but
libbpf and BPF CO-RE provide primitives to accommodate changes to
field types. We have bpf_core_field_size() and
BPF_CORE_READ_BITFIELD()/BPF_CORE_READ_BITFIELD_PROBED() macros that
can read any bitfield or integer type properly and store it into a
u64.
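
For instance, a short sketch assuming those bpf_core_read.h macros
(task being a struct task_struct pointer):

	/* reads task->__state into a u64 regardless of how wide the
	 * field is on the running kernel */
	__u64 state = BPF_CORE_READ_BITFIELD_PROBED(task, __state);

	/* the field's byte size on this kernel, for explicit branching */
	int sz = bpf_core_field_size(task->__state);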

So yes, users will still have to test and validate their BPF programs
on new kernels, but they do have instruments to handle this in a
contained way in BPF code, without awkward extra user-space feature
detection.

With the __state rename, it's been done and was easy to accommodate.
That ease of dealing with the change is one of the reasons we never
asked for anything to be done about it. It was a rather mild
inconvenience for a bunch of tools and applications.

This argument reordering, as I mentioned before, is a much bigger deal
and much harder to deal with. As Alexei mentioned, we are discussing
extending BPF CO-RE to accommodate even such changes in pure BPF-side
code, but we don't have that mechanism yet, so if it's not too hard, we
still kindly ask you to consider appending the new argument at the end.


>
> Specifically, task_struct::state used to be 'volatile long', while
> task_struct::__state is 'unsigned int'. As such, any user must now be
> very careful to use READ_ONCE(). I don't see that happening with just
> frobbing the name.
>
> Additionally, by shrinking the field, I suppose BE systems get to keep
> the pieces?
>
> > But it will have a hard time without this patch
> > until we add all the extra CO-RE features to detect
> > and automatically adjust bpf progs when tracepoint
> > arguments order changed.
>
> Could be me, but silently making it work sounds like fail :/ There's a
> reason code changes, users need to adapt, not silently pretend stuff is
> as before.
>
> How will you know you need to fix your tool?

There is no denying that tracing BPF code successfully compiling
doesn't mean it's going to work correctly (like with any other code,
really). So testing is still mandatory. This is more about being able
to deal with such changes without having to maintain two separately
compiled BPF programs or doing extensive and expensive feature
detection in user-space code.

>
> > We will do it eventually, of course.
> > There will be additional work in llvm, libbpf, kernel, etc.
> > But for now I think it would be good to land Delyan's patch
> > to avoid unnecessary pain to all the users.
> >
> > Peter, do you mind?
>
> I suppose I can help out this time, but I really don't want to set a
> precedent for these things. Broken is broken.

We'd really appreciate it if you do help out this time. The intent is
not to set any kind of precedent. This ask is because sched_switch is a
very popular target *and* this change can't be easily detected (once
detected it can be easily accommodated, but detection is painful), so
the overall impact is disproportionate.

>
> The down-side for me is that the argument order no longer makes any
> sense.

The order changes only for tracing-related macros/functions, so
hopefully that doesn't make the rest of the scheduler code illogical or
inconvenient.


* Re: [PATCH] sched/tracing: append prev_state to tp args instead
  2022-04-26 14:09           ` Qais Yousef
@ 2022-04-26 15:54             ` Andrii Nakryiko
  2022-04-27 10:34               ` Qais Yousef
  0 siblings, 1 reply; 49+ messages in thread
From: Andrii Nakryiko @ 2022-04-26 15:54 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Peter Zijlstra, Alexei Starovoitov, Delyan Kratunov,
	Namhyung Kim, Arnaldo Carvalho de Melo, bigeasy,
	dietmar.eggemann, keescook, x86, andrii, u.kleine-koenig,
	vincent.guittot, akpm, mingo, linux-kernel, rdunlap, rostedt,
	Kenta.Tada, tglx, bristot, ebiederm, ast, legion, adharmap,
	valentin.schneider, ed.tsai, juri.lelli

On Tue, Apr 26, 2022 at 7:10 AM Qais Yousef <qais.yousef@arm.com> wrote:
>
> On 04/26/22 14:28, Peter Zijlstra wrote:
> > On Fri, Apr 22, 2022 at 11:30:12AM -0700, Alexei Starovoitov wrote:
> > > On Fri, Apr 22, 2022 at 10:22 AM Delyan Kratunov <delyank@fb.com> wrote:
> > > >
> > > > On Fri, 2022-04-22 at 13:09 +0200, Peter Zijlstra wrote:
> > > > > And on the other hand; those users need to be fixed anyway, right?
> > > > > Accessing prev->__state is equally broken.
> > > >
> > > > The users that access prev->__state would most likely have to be fixed, for sure.
> > > >
> > > > However, not all users access prev->__state. `offcputime` for example just takes a
> > > > stack trace and associates it with the switched out task. This kind of user
> > > > would continue working with the proposed patch.
> > > >
> > > > > If bpf wants to ride on them, it needs to suffer the pain of doing so.
> > > >
> > > > Sure, I'm just advocating for a fairly trivial patch to avoid some of the suffering,
> > > > hopefully without being a burden to development. If that's not the case, then it's a
> > > > clear no-go.
> > >
> > >
> > > Namhyung just sent this patch set:
> > > https://patchwork.kernel.org/project/netdevbpf/patch/20220422053401.208207-3-namhyung@kernel.org/
> >
> > That has:
> >
> > + * recently task_struct->state renamed to __state so it made an incompatible
> > + * change.
> >
> > git tells me:
> >
> >   2f064a59a11f ("sched: Change task_struct::state")
> >
> > is almost a year old by now. That doesn't qualify as recent in my book.
> > That says that 'old kernels used to call this...'.
> >
> > > to add off-cpu profiling to perf.
> > > It also hooks into sched_switch tracepoint.
> > > Notice it deals with state->__state rename just fine.
> >
> > So I don't speak BPF much; it always takes me more time to make bpf work
> > than to just hack up the kernel, which makes it hard to get motivated.
> >
> > However, it was not just a rename, state changed type too, which is why I
> > did the rename, to make sure all users would get a compile fail and
> > could adjust.
> >
> > If you're silently making it work by frobbing the name, you lose that.
> >
> > Specifically, task_struct::state used to be 'volatile long', while
> > task_struct::__state is 'unsigned int'. As such, any user must now be
> > very careful to use READ_ONCE(). I don't see that happening with just
> > frobbing the name.
> >
> > Additionally, by shrinking the field, I suppose BE systems get to keep
> > the pieces?
> >
> > > But it will have a hard time without this patch
> > > until we add all the extra CO-RE features to detect
> > > and automatically adjust bpf progs when tracepoint
> > > arguments order changed.
> >
> > Could be me, but silently making it work sounds like fail :/ There's a
> > reason code changes, users need to adapt, not silently pretend stuff is
> > as before.
> >
> > How will you know you need to fix your tool?
>
> If libbpf doesn't fail, then yeah, it's a big problem. I wonder how kprobe
> users, who I suppose are more prone to this kind of problem, have been coping.

See my reply to Peter. libbpf can't know the user's intent, so it can't
fail this automatically, in general. In some cases, when it can, it does
accommodate this automatically. In other cases it provides instruments
for the user to handle this (bpf_core_field_size(),
BPF_CORE_READ_BITFIELD(), etc).

But in the end no one eliminated the need for testing your application
for correctness. Tracing programs do break on kernel changes and BPF
users do adapt to them. Sometimes adapting is easy (like state ->
__state transition), sometimes it's much more involved (like this
argument order change).

>
> >
> > > We will do it eventually, of course.
> > > There will be additional work in llvm, libbpf, kernel, etc.
> > > But for now I think it would be good to land Delyan's patch
> > > to avoid unnecessary pain to all the users.
> > >
> > > Peter, do you mind?
> >
> > I suppose I can help out this time, but I really don't want to set a
> > precedent for these things. Broken is broken.
> >
> > The down-side for me is that the argument order no longer makes any
> > sense.
>
> I'm intending to backport fa2c3254d7cf to 5.10 and 5.15 but waiting for
> a Tested-by. If you take this one, then it'll need to be backported too.
>
> Cheers
>
> --
> Qais Yousef


* Re: [PATCH] sched/tracing: append prev_state to tp args instead
  2022-04-26 15:54             ` Andrii Nakryiko
@ 2022-04-27 10:34               ` Qais Yousef
  2022-04-27 18:17                 ` Andrii Nakryiko
  0 siblings, 1 reply; 49+ messages in thread
From: Qais Yousef @ 2022-04-27 10:34 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Peter Zijlstra, Alexei Starovoitov, Delyan Kratunov,
	Namhyung Kim, Arnaldo Carvalho de Melo, bigeasy,
	dietmar.eggemann, keescook, x86, andrii, u.kleine-koenig,
	vincent.guittot, akpm, mingo, linux-kernel, rdunlap, rostedt,
	Kenta.Tada, tglx, bristot, ebiederm, ast, legion, adharmap,
	valentin.schneider, ed.tsai, juri.lelli

On 04/26/22 08:54, Andrii Nakryiko wrote:
> On Tue, Apr 26, 2022 at 7:10 AM Qais Yousef <qais.yousef@arm.com> wrote:
> >
> > On 04/26/22 14:28, Peter Zijlstra wrote:
> > > On Fri, Apr 22, 2022 at 11:30:12AM -0700, Alexei Starovoitov wrote:
> > > > On Fri, Apr 22, 2022 at 10:22 AM Delyan Kratunov <delyank@fb.com> wrote:
> > > > >
> > > > > On Fri, 2022-04-22 at 13:09 +0200, Peter Zijlstra wrote:
> > > > > > And on the other hand; those users need to be fixed anyway, right?
> > > > > > Accessing prev->__state is equally broken.
> > > > >
> > > > > The users that access prev->__state would most likely have to be fixed, for sure.
> > > > >
> > > > > However, not all users access prev->__state. `offcputime` for example just takes a
> > > > > stack trace and associates it with the switched out task. This kind of user
> > > > > would continue working with the proposed patch.
> > > > >
> > > > > > If bpf wants to ride on them, it needs to suffer the pain of doing so.
> > > > >
> > > > > Sure, I'm just advocating for a fairly trivial patch to avoid some of the suffering,
> > > > > hopefully without being a burden to development. If that's not the case, then it's a
> > > > > clear no-go.
> > > >
> > > >
> > > > Namhyung just sent this patch set:
> > > > https://patchwork.kernel.org/project/netdevbpf/patch/20220422053401.208207-3-namhyung@kernel.org/
> > >
> > > That has:
> > >
> > > + * recently task_struct->state renamed to __state so it made an incompatible
> > > + * change.
> > >
> > > git tells me:
> > >
> > >   2f064a59a11f ("sched: Change task_struct::state")
> > >
> > > is almost a year old by now. That doesn't qualify as recent in my book.
> > > That says that 'old kernels used to call this...'.
> > >
> > > > to add off-cpu profiling to perf.
> > > > It also hooks into sched_switch tracepoint.
> > > > Notice it deals with state->__state rename just fine.
> > >
> > > So I don't speak BPF much; it always takes me more time to make bpf work
> > > than to just hack up the kernel, which makes it hard to get motivated.
> > >
> > > However, it was not just a rename, state changed type too, which is why I
> > > did the rename, to make sure all users would get a compile fail and
> > > could adjust.
> > >
> > > If you're silently making it work by frobbing the name, you lose that.
> > >
> > > Specifically, task_struct::state used to be 'volatile long', while
> > > task_struct::__state is 'unsigned int'. As such, any user must now be
> > > very careful to use READ_ONCE(). I don't see that happening with just
> > > frobbing the name.
> > >
> > > Additionally, by shrinking the field, I suppose BE systems get to keep
> > > the pieces?
> > >
> > > > But it will have a hard time without this patch
> > > > until we add all the extra CO-RE features to detect
> > > > and automatically adjust bpf progs when tracepoint
> > > > arguments order changed.
> > >
> > > Could be me, but silently making it work sounds like fail :/ There's a
> > > reason code changes, users need to adapt, not silently pretend stuff is
> > > as before.
> > >
> > > How will you know you need to fix your tool?
> >
> > If libbpf doesn't fail, then yeah, it's a big problem. I wonder how kprobe
> > users, who I suppose are more prone to this kind of problem, have been coping.
> 
> See my reply to Peter. libbpf can't know the user's intent, so it can't
> fail this automatically, in general. In some cases, when it can, it does
> accommodate this automatically. In other cases it provides instruments
> for the user to handle this (bpf_core_field_size(),
> BPF_CORE_READ_BITFIELD(), etc).

My naive thinking is that the function signature has changed (there's 1 extra
arg, not just a subtle swap of args of the same type) - so I thought that can
be detected. But maybe it's easier said than done.

I am trying to remember as I've used this before; I think you get the arg list
as part of ctx when you attach to a function?

I wonder if it'd be hard to provide a macro for the user to provide the
signature of the function they expect; this macro can try then to verify/assert
the number, type and order is the same. Not bullet proof and requires opt-in,
but could be useful?


// dummy pseudo-code

	BPF_CORE_ASSERT_SIG(sched_switch, NR_ARGS, ARG0, ARG1, ...)
		if (ctx->nr_args != NR_ARGS)
			assert()
		if (type_of(ctx->args[0]) != type_of(ARG0))
			assert()
		...

I'm not sure if you have any info about the type though..

> But in the end no one eliminated the need for testing your application
> for correctness. Tracing programs do break on kernel changes and BPF
> users do adapt to them. Sometimes adapting is easy (like state ->
> __state transition), sometimes it's much more involved (like this
> argument order change).

It's not just an arg re-order, it's a new argument inserted in the middle. But
fair enough :-)

Cheers

--
Qais Yousef


* Re: [PATCH] sched/tracing: append prev_state to tp args instead
  2022-04-27 10:34               ` Qais Yousef
@ 2022-04-27 18:17                 ` Andrii Nakryiko
  2022-04-27 20:32                   ` Alexei Starovoitov
  2022-04-28 10:02                   ` Qais Yousef
  0 siblings, 2 replies; 49+ messages in thread
From: Andrii Nakryiko @ 2022-04-27 18:17 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Peter Zijlstra, Alexei Starovoitov, Delyan Kratunov,
	Namhyung Kim, Arnaldo Carvalho de Melo, bigeasy,
	dietmar.eggemann, keescook, x86, andrii, u.kleine-koenig,
	vincent.guittot, akpm, mingo, linux-kernel, rdunlap, rostedt,
	Kenta.Tada, tglx, bristot, ebiederm, ast, legion, adharmap,
	valentin.schneider, ed.tsai, juri.lelli

On Wed, Apr 27, 2022 at 3:35 AM Qais Yousef <qais.yousef@arm.com> wrote:
>
> On 04/26/22 08:54, Andrii Nakryiko wrote:
> > On Tue, Apr 26, 2022 at 7:10 AM Qais Yousef <qais.yousef@arm.com> wrote:
> > >
> > > On 04/26/22 14:28, Peter Zijlstra wrote:
> > > > On Fri, Apr 22, 2022 at 11:30:12AM -0700, Alexei Starovoitov wrote:
> > > > > On Fri, Apr 22, 2022 at 10:22 AM Delyan Kratunov <delyank@fb.com> wrote:
> > > > > >
> > > > > > On Fri, 2022-04-22 at 13:09 +0200, Peter Zijlstra wrote:
> > > > > > > And on the other hand; those users need to be fixed anyway, right?
> > > > > > > Accessing prev->__state is equally broken.
> > > > > >
> > > > > > The users that access prev->__state would most likely have to be fixed, for sure.
> > > > > >
> > > > > > However, not all users access prev->__state. `offcputime` for example just takes a
> > > > > > stack trace and associates it with the switched out task. This kind of user
> > > > > > would continue working with the proposed patch.
> > > > > >
> > > > > > > If bpf wants to ride on them, it needs to suffer the pain of doing so.
> > > > > >
> > > > > > Sure, I'm just advocating for a fairly trivial patch to avoid some of the suffering,
> > > > > > hopefully without being a burden to development. If that's not the case, then it's a
> > > > > > clear no-go.
> > > > >
> > > > >
> > > > > Namhyung just sent this patch set:
> > > > > https://patchwork.kernel.org/project/netdevbpf/patch/20220422053401.208207-3-namhyung@kernel.org/
> > > >
> > > > That has:
> > > >
> > > > + * recently task_struct->state renamed to __state so it made an incompatible
> > > > + * change.
> > > >
> > > > git tells me:
> > > >
> > > >   2f064a59a11f ("sched: Change task_struct::state")
> > > >
> > > > is almost a year old by now. That doesn't qualify as recent in my book.
> > > > That says that 'old kernels used to call this...'.
> > > >
> > > > > to add off-cpu profiling to perf.
> > > > > It also hooks into sched_switch tracepoint.
> > > > > Notice it deals with state->__state rename just fine.
> > > >
> > > > So I don't speak BPF much; it always takes me more time to make bpf work
> > > > than to just hack up the kernel, which makes it hard to get motivated.
> > > >
> > > > However, it was not just a rename, state changed type too, which is why I
> > > > did the rename, to make sure all users would get a compile fail and
> > > > could adjust.
> > > >
> > > > If you're silently making it work by frobbing the name, you lose that.
> > > >
> > > > Specifically, task_struct::state used to be 'volatile long', while
> > > > task_struct::__state is 'unsigned int'. As such, any user must now be
> > > > very careful to use READ_ONCE(). I don't see that happening with just
> > > > frobbing the name.
> > > >
> > > > Additionally, by shrinking the field, I suppose BE systems get to keep
> > > > the pieces?
> > > >
> > > > > But it will have a hard time without this patch
> > > > > until we add all the extra CO-RE features to detect
> > > > > and automatically adjust bpf progs when tracepoint
> > > > > arguments order changed.
> > > >
> > > > Could be me, but silently making it work sounds like fail :/ There's a
> > > > reason code changes, users need to adapt, not silently pretend stuff is
> > > > as before.
> > > >
> > > > How will you know you need to fix your tool?
> > >
> > > If libbpf doesn't fail, then yeah, it's a big problem. I wonder how kprobe
> > > users, who I suppose are more prone to this kind of problem, have been coping.
> >
> > See my reply to Peter. libbpf can't know the user's intent, so it can't
> > fail this automatically, in general. In some cases, when it can, it does
> > accommodate this automatically. In other cases it provides instruments
> > for the user to handle this (bpf_core_field_size(),
> > BPF_CORE_READ_BITFIELD(), etc).
>
> > My naive thinking is that the function signature has changed (there's 1 extra
> > arg, not just a subtle swap of args of the same type) - so I thought that can
> > be detected. But maybe it's easier said than done.

It is. We don't have the number of arguments either:

struct bpf_raw_tracepoint_args {
        __u64 args[0];
};

What the BPF program gets is just an array of u64s.
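
A handler therefore hard-codes the positions at compile time. A minimal
sketch, assuming the argument order proposed in Delyan's patch (prev,
next, then prev_state) and libbpf's CO-RE read helpers:

	SEC("raw_tp/sched_switch")
	int handle_switch(struct bpf_raw_tracepoint_args *ctx)
	{
		/* args[0] is preempt; an argument inserted in the middle
		 * silently shifts everything after it */
		struct task_struct *prev = (void *)ctx->args[1];

		/* raw_tp pointers are untyped, so reads go through
		 * probe-reading helpers */
		pid_t pid = BPF_CORE_READ(prev, pid);

		bpf_printk("switching out pid=%d", pid);
		return 0;
	}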

>
> I am trying to remember as I've used this before; I think you get the arg list
> as part of ctx when you attach to a function?
>
> I wonder if it'd be hard to provide a macro for the user to provide the
> signature of the function they expect; this macro can then try to verify/assert
> that the number, type, and order are the same. Not bulletproof and requires opt-in,
> but could be useful?
>
>
> // dummy pseudo-code
>
>         BPF_CORE_ASSERT_SIG(sched_switch, NR_ARGS, ARG0, ARG1, ...)
>                 if (ctx->nr_args != NR_ARGS)
>                         assert()
>                 if (type_of(ctx->args[0]) != type_of(ARG0))
>                         assert()
>                 ...
>
> I'm not sure if you have any info about the type though..

What we have under discussion now is a more generic way for users to
check the signature of a function prototype, struct/union, etc. But all
that will take some time to implement and finalize. So this patch is a
way to stop the bleeding until we have that available to users.

>
> > But in the end no one eliminated the need for testing your application
> > for correctness. Tracing programs do break on kernel changes and BPF
> > users do adapt to them. Sometimes adapting is easy (like state ->
> > __state transition), sometimes it's much more involved (like this
> > argument order change).
>
> It's not just an arg re-order, it's a new argument inserted in the middle. But
> fair enough :-)
>
> Cheers
>
> --
> Qais Yousef


* Re: [PATCH] sched/tracing: append prev_state to tp args instead
  2022-04-27 18:17                 ` Andrii Nakryiko
@ 2022-04-27 20:32                   ` Alexei Starovoitov
  2022-04-28 10:02                   ` Qais Yousef
  1 sibling, 0 replies; 49+ messages in thread
From: Alexei Starovoitov @ 2022-04-27 20:32 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Qais Yousef, Peter Zijlstra, Delyan Kratunov, Namhyung Kim,
	Arnaldo Carvalho de Melo, bigeasy, dietmar.eggemann, keescook,
	x86, andrii, u.kleine-koenig, vincent.guittot, akpm, mingo,
	linux-kernel, rdunlap, rostedt, Kenta.Tada, tglx, bristot,
	ebiederm, ast, legion, adharmap, valentin.schneider, ed.tsai,
	juri.lelli

On Wed, Apr 27, 2022 at 11:17 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> > >
> > > See my reply to Peter. libbpf can't know user's intent to fail this
> > > automatically, in general. In some cases when it can it does
> > > accommodate this automatically. In other cases it provides instruments
> > > for user to handle this (bpf_core_field_size(),
> > > BPF_CORE_READ_BITFIELD(), etc).
> >
> > My naive thinking is that the function signature has changed (there's 1 extra
> > arg, not just a subtle swap of args of the same type) - so I thought that can
> > be detected. But maybe it's easier said than done.
>
> It is. We don't have the number of arguments either:
>
> struct bpf_raw_tracepoint_args {
>         __u64 args[0];
> };
>
> What the BPF program gets is just an array of u64s.

Well, that's a true and false statement at the same time, which might
confuse folks reading this thread.
To clarify:
raw_tp and tp_btf programs receive an array of u64-s.
raw_tp checks the number of arguments only.
A prog that accesses a non-existent arg will be rejected.
A tp_btf prog, in addition to the number of args, knows
the exact types of the arguments.
So in this case, accessing arg0 in a bpf prog
as 'struct task_struct *' while it's 'unsigned int prev_state'
in the kernel will cause the verifier to reject it.
If a prog does bpf_probe_read_kernel() then all bets are off, of course.
There is no type checking mechanism for bpf_probe_read_kernel().
Only BTF-powered pointer dereferences are type checked.
The 'tp_btf' prog type is the recommended way to access tracepoints.
The 'raw_tp' type was implemented before BTF was introduced.
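
For completeness, a sketch of the tp_btf form, assuming the argument
order from Delyan's patch (prev_state appended; the handler name is
illustrative):

	SEC("tp_btf/sched_switch")
	int BPF_PROG(on_switch, bool preempt, struct task_struct *prev,
		     struct task_struct *next, unsigned int prev_state)
	{
		/* prev is BTF-typed: this direct dereference is type
		 * checked by the verifier and needs no probe-read */
		bpf_printk("prev pid=%d state=%u", prev->pid, prev_state);
		return 0;
	}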


* Re: [PATCH] sched/tracing: append prev_state to tp args instead
  2022-04-27 18:17                 ` Andrii Nakryiko
  2022-04-27 20:32                   ` Alexei Starovoitov
@ 2022-04-28 10:02                   ` Qais Yousef
  2022-05-09 19:32                     ` Andrii Nakryiko
  1 sibling, 1 reply; 49+ messages in thread
From: Qais Yousef @ 2022-04-28 10:02 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Peter Zijlstra, Alexei Starovoitov, Delyan Kratunov,
	Namhyung Kim, Arnaldo Carvalho de Melo, bigeasy,
	dietmar.eggemann, keescook, x86, andrii, u.kleine-koenig,
	vincent.guittot, akpm, mingo, linux-kernel, rdunlap, rostedt,
	Kenta.Tada, tglx, bristot, ebiederm, ast, legion, adharmap,
	valentin.schneider, ed.tsai, juri.lelli

On 04/27/22 11:17, Andrii Nakryiko wrote:
> On Wed, Apr 27, 2022 at 3:35 AM Qais Yousef <qais.yousef@arm.com> wrote:
> >
> > On 04/26/22 08:54, Andrii Nakryiko wrote:
> > > On Tue, Apr 26, 2022 at 7:10 AM Qais Yousef <qais.yousef@arm.com> wrote:
> > > >
> > > > On 04/26/22 14:28, Peter Zijlstra wrote:
> > > > > On Fri, Apr 22, 2022 at 11:30:12AM -0700, Alexei Starovoitov wrote:
> > > > > > On Fri, Apr 22, 2022 at 10:22 AM Delyan Kratunov <delyank@fb.com> wrote:
> > > > > > >
> > > > > > > On Fri, 2022-04-22 at 13:09 +0200, Peter Zijlstra wrote:
> > > > > > > > And on the other hand; those users need to be fixed anyway, right?
> > > > > > > > Accessing prev->__state is equally broken.
> > > > > > >
> > > > > > > The users that access prev->__state would most likely have to be fixed, for sure.
> > > > > > >
> > > > > > > However, not all users access prev->__state. `offcputime` for example just takes a
> > > > > > > stack trace and associates it with the switched out task. This kind of user
> > > > > > > would continue working with the proposed patch.
> > > > > > >
> > > > > > > > If bpf wants to ride on them, it needs to suffer the pain of doing so.
> > > > > > >
> > > > > > > Sure, I'm just advocating for a fairly trivial patch to avoid some of the suffering,
> > > > > > > hopefully without being a burden to development. If that's not the case, then it's a
> > > > > > > clear no-go.
> > > > > >
> > > > > >
> > > > > > Namhyung just sent this patch set:
> > > > > > https://patchwork.kernel.org/project/netdevbpf/patch/20220422053401.208207-3-namhyung@kernel.org/
> > > > >
> > > > > That has:
> > > > >
> > > > > + * recently task_struct->state renamed to __state so it made an incompatible
> > > > > + * change.
> > > > >
> > > > > git tells me:
> > > > >
> > > > >   2f064a59a11f ("sched: Change task_struct::state")
> > > > >
> > > > > is almost a year old by now. That doesn't qualify as recent in my book.
> > > > > That says that 'old kernels used to call this...'.
> > > > >
> > > > > > to add off-cpu profiling to perf.
> > > > > > It also hooks into sched_switch tracepoint.
> > > > > > Notice it deals with state->__state rename just fine.
> > > > >
> > > > > So I don't speak BPF much; it always takes me more time to make bpf work
> > > > > than to just hack up the kernel, which makes it hard to get motivated.
> > > > >
> > > > > However, it was not just a rename, state changed type too, which is why I
> > > > > did the rename, to make sure all users would get a compile fail and
> > > > > could adjust.
> > > > >
> > > > > If you're silently making it work by frobbing the name, you lose that.
> > > > >
> > > > > Specifically, task_struct::state used to be 'volatile long', while
> > > > > task_struct::__state is 'unsigned int'. As such, any user must now be
> > > > > very careful to use READ_ONCE(). I don't see that happening with just
> > > > > frobbing the name.
> > > > >
> > > > > Additionally, by shrinking the field, I suppose BE systems get to keep
> > > > > the pieces?
> > > > >
> > > > > > But it will have a hard time without this patch
> > > > > > until we add all the extra CO-RE features to detect
> > > > > > and automatically adjust bpf progs when tracepoint
> > > > > > arguments order changed.
> > > > >
> > > > > Could be me, but silently making it work sounds like fail :/ There's a
> > > > > reason code changes, users need to adapt, not silently pretend stuff is
> > > > > as before.
> > > > >
> > > > > How will you know you need to fix your tool?
> > > >
> > > > If libbpf doesn't fail, then yeah, it's a big problem. I wonder how kprobe
> > > > users, who I suppose are more prone to this kind of problem, have been coping.
> > >
> > > See my reply to Peter. libbpf can't know the user's intent, so it can't
> > > fail this automatically, in general. In some cases, when it can, it does
> > > accommodate this automatically. In other cases it provides instruments
> > > for the user to handle this (bpf_core_field_size(),
> > > BPF_CORE_READ_BITFIELD(), etc).
> >
> > My naive thinking is that the function signature has changed (there's 1 extra
> > arg, not just a subtle swap of args of the same type) - so I thought that can
> > be detected. But maybe it's easier said than done.
> 
> It is. We don't have the number of arguments either:
> 
> struct bpf_raw_tracepoint_args {
>         __u64 args[0];
> };
> 
> What the BPF program gets is just an array of u64s.
> 
> >
> > I am trying to remember as I've used this before; I think you get the arg list
> > as part of ctx when you attach to a function?
> >
> > I wonder if it'd be hard to provide a macro for the user to provide the
> > signature of the function they expect; this macro can then try to verify/assert
> > that the number, type, and order are the same. Not bulletproof and requires opt-in,
> > but could be useful?
> >
> >
> > // dummy pseudo-code
> >
> >         BPF_CORE_ASSERT_SIG(sched_switch, NR_ARGS, ARG0, ARG1, ...)
> >                 if (ctx->nr_args != NR_ARGS)
> >                         assert()
> >                 if (type_of(ctx->args[0]) != type_of(ARG0))
> >                         assert()
> >                 ...
> >
> > I'm not sure if you have any info about the type though..
> 
> What we have under discussion now is a more generic way for users to
> check the signature of a function prototype, struct/union, etc. But all
> that will take some time to implement and finalize. So this patch is a
> way to stop the bleeding until we have that available to users.

Okay good to know. Alexei mentioned a plan, but I didn't get that it included
signature verification.

Cheers

--
Qais Yousef


* Re: [PATCH] sched/tracing: append prev_state to tp args instead
  2022-04-28 10:02                   ` Qais Yousef
@ 2022-05-09 19:32                     ` Andrii Nakryiko
  2022-05-10  7:01                       ` Peter Zijlstra
  0 siblings, 1 reply; 49+ messages in thread
From: Andrii Nakryiko @ 2022-05-09 19:32 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Peter Zijlstra, Alexei Starovoitov, Delyan Kratunov,
	Namhyung Kim, Arnaldo Carvalho de Melo, bigeasy,
	dietmar.eggemann, keescook, x86, andrii, u.kleine-koenig,
	vincent.guittot, akpm, mingo, linux-kernel, rdunlap, rostedt,
	Kenta.Tada, tglx, bristot, ebiederm, ast, legion, adharmap,
	valentin.schneider, ed.tsai, juri.lelli

On Thu, Apr 28, 2022 at 3:02 AM Qais Yousef <qais.yousef@arm.com> wrote:
>
> On 04/27/22 11:17, Andrii Nakryiko wrote:
> > On Wed, Apr 27, 2022 at 3:35 AM Qais Yousef <qais.yousef@arm.com> wrote:
> > >
> > > On 04/26/22 08:54, Andrii Nakryiko wrote:
> > > > On Tue, Apr 26, 2022 at 7:10 AM Qais Yousef <qais.yousef@arm.com> wrote:
> > > > >
> > > > > On 04/26/22 14:28, Peter Zijlstra wrote:
> > > > > > On Fri, Apr 22, 2022 at 11:30:12AM -0700, Alexei Starovoitov wrote:
> > > > > > > On Fri, Apr 22, 2022 at 10:22 AM Delyan Kratunov <delyank@fb.com> wrote:
> > > > > > > >
> > > > > > > > On Fri, 2022-04-22 at 13:09 +0200, Peter Zijlstra wrote:
> > > > > > > > > And on the other hand; those users need to be fixed anyway, right?
> > > > > > > > > Accessing prev->__state is equally broken.
> > > > > > > >
> > > > > > > > The users that access prev->__state would most likely have to be fixed, for sure.
> > > > > > > >
> > > > > > > > However, not all users access prev->__state. `offcputime` for example just takes a
> > > > > > > > stack trace and associates it with the switched out task. This kind of user
> > > > > > > > would continue working with the proposed patch.
> > > > > > > >
> > > > > > > > > If bpf wants to ride on them, it needs to suffer the pain of doing so.
> > > > > > > >
> > > > > > > > Sure, I'm just advocating for a fairly trivial patch to avoid some of the suffering,
> > > > > > > > hopefully without being a burden to development. If that's not the case, then it's a
> > > > > > > > clear no-go.
> > > > > > >
> > > > > > >
> > > > > > > Namhyung just sent this patch set:
> > > > > > > https://patchwork.kernel.org/project/netdevbpf/patch/20220422053401.208207-3-namhyung@kernel.org/
> > > > > >
> > > > > > That has:
> > > > > >
> > > > > > + * recently task_struct->state renamed to __state so it made an incompatible
> > > > > > + * change.
> > > > > >
> > > > > > git tells me:
> > > > > >
> > > > > >   2f064a59a11f ("sched: Change task_struct::state")
> > > > > >
> > > > > > is almost a year old by now. That doesn't qualify as recent in my book.
> > > > > > That says that 'old kernels used to call this...'.
> > > > > >
> > > > > > > to add off-cpu profiling to perf.
> > > > > > > It also hooks into sched_switch tracepoint.
> > > > > > > Notice it deals with state->__state rename just fine.
> > > > > >
> > > > > > So I don't speak BPF much; it always takes me more time to make bpf work
> > > > > > than to just hack up the kernel, which makes it hard to get motivated.
> > > > > >
> > > > > > However, it was not just a rename, state changed type too, which is why I
> > > > > > did the rename, to make sure all users would get a compile fail and
> > > > > > could adjust.
> > > > > >
> > > > > > If you're silently making it work by frobbing the name, you lose that.
> > > > > >
> > > > > > Specifically, task_struct::state used to be 'volatile long', while
> > > > > > task_struct::__state is 'unsigned int'. As such, any user must now be
> > > > > > very careful to use READ_ONCE(). I don't see that happening with just
> > > > > > frobbing the name.
> > > > > >
> > > > > > Additionally, by shrinking the field, I suppose BE systems get to keep
> > > > > > the pieces?
> > > > > >
> > > > > > > But it will have a hard time without this patch
> > > > > > > until we add all the extra CO-RE features to detect
> > > > > > > and automatically adjust bpf progs when tracepoint
> > > > > > > arguments order changed.
> > > > > >
> > > > > > Could be me, but silently making it work sounds like fail :/ There's a
> > > > > > reason code changes, users need to adapt, not silently pretend stuff is
> > > > > > as before.
> > > > > >
> > > > > > How will you know you need to fix your tool?
> > > > >
> > > > > If libbpf doesn't fail, then yeah, it's a big problem. I wonder how kprobe
> > > > > users, who I suppose are more prone to this kind of problem, have been coping.
> > > >
> > > > See my reply to Peter. libbpf can't know the user's intent, so it can't
> > > > fail this automatically, in general. In some cases, when it can, it does
> > > > accommodate this automatically. In other cases it provides instruments
> > > > for the user to handle this (bpf_core_field_size(),
> > > > BPF_CORE_READ_BITFIELD(), etc).
> > >
> > > My naive thinking is that the function signature has changed (there's 1 extra
> > > arg, not just a subtle swap of args of the same type) - so I thought that can
> > > be detected. But maybe it's easier said than done.
> >
> > It is. We don't have the number of arguments either:
> >
> > struct bpf_raw_tracepoint_args {
> >         __u64 args[0];
> > };
> >
> > What the BPF program gets is just an array of u64s.
> >
> > >
> > > I am trying to remember as I've used this before; I think you get the arg list
> > > as part of ctx when you attach to a function?
> > >
> > > I wonder if it'd be hard to provide a macro for the user to provide the
> > > signature of the function they expect; this macro can then try to verify/assert
> > > that the number, type, and order are the same. Not bulletproof and requires opt-in,
> > > but could be useful?
> > >
> > >
> > > // dummy pseudo-code
> > >
> > >         BPF_CORE_ASSERT_SIG(sched_switch, NR_ARGS, ARG0, ARG1, ...)
> > >                 if (ctx->nr_args != NR_ARGS)
> > >                         assert()
> > >                 if (type_of(ctx->args[0]) != type_of(ARG0))
> > >                         assert()
> > >                 ...
> > >
> > > I'm not sure if you have any info about the type though..
> >
> > What we have under discussion now is a more generic way for users to
> > check the signature of a function prototype, struct/union, etc. But all
> > that will take some time to implement and finalize. So this patch is a
> > way to stop the bleeding until we have that available to users.
>
> Okay good to know. Alexei mentioned a plan, but I didn't get that it included
> signature verification.
>

Given that the discussion has subsided, I'm still a bit unclear about
the outcome. Hopefully it made it a bit clearer why this one-time change
is so helpful to the broader BPF community and why it is an exception
rather than an expectation (and we have work in progress to be able to
handle even such changes in the future).

So can this patch be applied, please, or is it a hard no?

> Cheers
>
> --
> Qais Yousef


* Re: [PATCH] sched/tracing: append prev_state to tp args instead
  2022-05-09 19:32                     ` Andrii Nakryiko
@ 2022-05-10  7:01                       ` Peter Zijlstra
  2022-05-10  8:29                         ` Peter Zijlstra
  0 siblings, 1 reply; 49+ messages in thread
From: Peter Zijlstra @ 2022-05-10  7:01 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Qais Yousef, Alexei Starovoitov, Delyan Kratunov, Namhyung Kim,
	Arnaldo Carvalho de Melo, bigeasy, dietmar.eggemann, keescook,
	x86, andrii, u.kleine-koenig, vincent.guittot, akpm, mingo,
	linux-kernel, rdunlap, rostedt, Kenta.Tada, tglx, bristot,
	ebiederm, ast, legion, adharmap, valentin.schneider, ed.tsai,
	juri.lelli

On Mon, May 09, 2022 at 12:32:31PM -0700, Andrii Nakryiko wrote:
> So can this patch be applied, please, or is it a hard no?

Sorry; I got distracted with other work. I'll grudgingly apply the patch.

Thanks!

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] sched/tracing: append prev_state to tp args instead
  2022-05-10  7:01                       ` Peter Zijlstra
@ 2022-05-10  8:29                         ` Peter Zijlstra
  2022-05-10 14:31                           ` Steven Rostedt
  2022-05-11 18:28                           ` [PATCH v2] " Delyan Kratunov
  0 siblings, 2 replies; 49+ messages in thread
From: Peter Zijlstra @ 2022-05-10  8:29 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Qais Yousef, Alexei Starovoitov, Delyan Kratunov, Namhyung Kim,
	Arnaldo Carvalho de Melo, bigeasy, dietmar.eggemann, keescook,
	x86, andrii, u.kleine-koenig, vincent.guittot, akpm, mingo,
	linux-kernel, rdunlap, rostedt, Kenta.Tada, tglx, bristot,
	ebiederm, ast, legion, adharmap, valentin.schneider, ed.tsai,
	juri.lelli

On Tue, May 10, 2022 at 09:01:22AM +0200, Peter Zijlstra wrote:
> On Mon, May 09, 2022 at 12:32:31PM -0700, Andrii Nakryiko wrote:
> > So can this patch be applied, please, or is it a hard no?
> 
> Sorry; I got distracted with other work. I'll grudgingly apply the patch.

  gcc-11-i386-allmodconfig [15:35]  FAILED
  | In file included from ../include/trace/define_custom_trace.h:55,
  |                  from ../samples/trace_events/trace_custom_sched.h:96,
  |                  from ../samples/trace_events/trace_custom_sched.c:24:
  | ../samples/trace_events/./trace_custom_sched.h: In function ‘ftrace_test_custom_probe_sched_switch’:
  | ../include/trace/trace_custom_events.h:178:42: error: passing argument 1 of ‘check_trace_callback_type_sched_switch’ from incompatible pointer type [-Werror=incompatible-pointer-types]
  |   178 |         check_trace_callback_type_##call(trace_custom_event_raw_event_##template); \
  |       |                                          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  |       |                                          |
  |       |                                          void (*)(void *, bool,  unsigned int,  struct task_struct *, struct task_struct *) {aka void (*)(void *, _Bool,  unsigned int,  struct task_struct *, struct task_struct *)}
  | ../include/trace/trace_custom_events.h:34:9: note: in expansion of macro ‘DEFINE_CUSTOM_EVENT’
  |    34 |         DEFINE_CUSTOM_EVENT(name, name, PARAMS(proto), PARAMS(args));
  |       |         ^~~~~~~~~~~~~~~~~~~
  | ../samples/trace_events/./trace_custom_sched.h:21:1: note: in expansion of macro ‘TRACE_CUSTOM_EVENT’
  |    21 | TRACE_CUSTOM_EVENT(sched_switch,
  |       | ^~~~~~~~~~~~~~~~~~
  | In file included from ../include/linux/trace_events.h:11,
  |                  from ../samples/trace_events/trace_custom_sched.c:10:
  | ../include/linux/tracepoint.h:279:49: note: expected ‘void (*)(void *, bool,  struct task_struct *, struct task_struct *, unsigned int)’ {aka ‘void (*)(void *, _Bool,  struct task_struct *, struct task_struct *, unsigned int)’} but argument is of type ‘void (*)(void *, bool,  unsigned int,  struct task_struct *, struct task_struct *)’ {aka ‘void (*)(void *, _Bool,  unsigned int,  struct task_struct *, struct task_struct *)’}
  |   279 |         check_trace_callback_type_##name(void (*cb)(data_proto))        \
  |       |                                          ~~~~~~~^~~~~~~~~~~~~~~
  | ../include/linux/tracepoint.h:419:9: note: in expansion of macro ‘__DECLARE_TRACE’
  |   419 |         __DECLARE_TRACE(name, PARAMS(proto), PARAMS(args),              \
  |       |         ^~~~~~~~~~~~~~~
  | ../include/linux/tracepoint.h:553:9: note: in expansion of macro ‘DECLARE_TRACE’
  |   553 |         DECLARE_TRACE(name, PARAMS(proto), PARAMS(args))
  |       |         ^~~~~~~~~~~~~
  | ../include/trace/events/sched.h:222:1: note: in expansion of macro ‘TRACE_EVENT’
  |   222 | TRACE_EVENT(sched_switch,
  |       | ^~~~~~~~~~~
  | cc1: some warnings being treated as errors
  | make[3]: *** [../scripts/Makefile.build:292: samples/trace_events/trace_custom_sched.o] Error 1
  | make[3]: *** Waiting for unfinished jobs....
  | make[2]: *** [../scripts/Makefile.build:555: samples/trace_events] Error 2
  | make[1]: *** [/opt/buildbot/linux-2.6/Makefile:1834: samples] Error 2
  | make[1]: *** Waiting for unfinished jobs....
  | make: *** [Makefile:219: __sub-make] Error 2
  `----


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] sched/tracing: append prev_state to tp args instead
  2022-05-10  8:29                         ` Peter Zijlstra
@ 2022-05-10 14:31                           ` Steven Rostedt
  2022-05-11 18:28                           ` [PATCH v2] " Delyan Kratunov
  1 sibling, 0 replies; 49+ messages in thread
From: Steven Rostedt @ 2022-05-10 14:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrii Nakryiko, Qais Yousef, Alexei Starovoitov,
	Delyan Kratunov, Namhyung Kim, Arnaldo Carvalho de Melo, bigeasy,
	dietmar.eggemann, keescook, x86, andrii, u.kleine-koenig,
	vincent.guittot, akpm, mingo, linux-kernel, rdunlap, Kenta.Tada,
	tglx, bristot, ebiederm, ast, legion, adharmap,
	valentin.schneider, ed.tsai, juri.lelli

On Tue, 10 May 2022 10:29:40 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Tue, May 10, 2022 at 09:01:22AM +0200, Peter Zijlstra wrote:
> > On Mon, May 09, 2022 at 12:32:31PM -0700, Andrii Nakryiko wrote:  
> > > So can this patch be applied, please, or is it a hard no?
> > 
> > Sorry; I got distracted with other work. I'll grudgingly apply the patch.
> 
>   gcc-11-i386-allmodconfig [15:35]  FAILED
>   | In file included from ../include/trace/define_custom_trace.h:55,
>   |                  from ../samples/trace_events/trace_custom_sched.h:96,
>   |                  from ../samples/trace_events/trace_custom_sched.c:24:
>   | ../samples/trace_events/./trace_custom_sched.h: In function ‘ftrace_test_custom_probe_sched_switch’:
>   | ../include/trace/trace_custom_events.h:178:42: error: passing argument 1 of ‘check_trace_callback_type_sched_switch’ from incompatible pointer type [-Werror=incompatible-pointer-types]
>   |   178 |         check_trace_callback_type_##call(trace_custom_event_raw_event_##template); \
>   |       |                                          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>   |       |                                          |
>   |       |                                          void (*)(void *, bool,  unsigned int,  struct task_struct *, struct task_struct *) {aka void (*)(void *, _Bool,  unsigned int,  struct task_struct *, struct task_struct *)}
>   | ../include/trace/trace_custom_events.h:34:9: note: in expansion of macro ‘DEFINE_CUSTOM_EVENT’
>   |    34 |         DEFINE_CUSTOM_EVENT(name, name, PARAMS(proto), PARAMS(args));


Yes, the custom tracepoint sample code uses the sched_switch tracepoint as
an example. That sample code needs to be updated:

In samples/trace_events/trace_custom_sched.h:

TRACE_CUSTOM_EVENT(sched_switch,

        /*
         * The TP_PROTO() and TP_ARGS must match the trace event
         * that the custom event is using.
         */
        TP_PROTO(bool preempt,
                 unsigned int prev_state,
                 struct task_struct *prev,
                 struct task_struct *next),

        TP_ARGS(preempt, prev_state, prev, next),



The arguments need to be updated.

-- Steve

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH v2] sched/tracing: append prev_state to tp args instead
  2022-05-10  8:29                         ` Peter Zijlstra
  2022-05-10 14:31                           ` Steven Rostedt
@ 2022-05-11 18:28                           ` Delyan Kratunov
  2022-05-11 19:10                             ` Steven Rostedt
                                               ` (2 more replies)
  1 sibling, 3 replies; 49+ messages in thread
From: Delyan Kratunov @ 2022-05-11 18:28 UTC (permalink / raw)
  To: peterz
  Cc: dietmar.eggemann, bigeasy, keescook, qais.yousef,
	Delyan Kratunov, andrii, x86, vincent.guittot, akpm, adharmap,
	alexei.starovoitov, andrii.nakryiko, ast, Kenta.Tada, rdunlap,
	linux-kernel, tglx, bristot, ebiederm, mingo, rostedt, ed.tsai,
	legion, acme, u.kleine-koenig, valentin.schneider, juri.lelli,
	namhyung

Commit fa2c3254d7cf (sched/tracing: Don't re-read p->state when emitting
sched_switch event, 2022-01-20) added a new prev_state argument to the
sched_switch tracepoint, before the prev task_struct pointer.

This reordering of arguments broke BPF programs that use the raw
tracepoint (e.g. tp_btf programs). The type of the second argument has
changed and existing programs that assume a task_struct* argument
(e.g. for bpf_task_storage access) will now fail to verify.

If we instead append the new argument to the end, all existing programs
would continue to work and can conditionally extract the prev_state
argument on supported kernel versions.

Fixes: fa2c3254d7cf (sched/tracing: Don't re-read p->state when emitting sched_switch event, 2022-01-20)
Signed-off-by: Delyan Kratunov <delyank@fb.com>
---
v1 -> v2:
Convert trace_events sample as well.

Sorry about that, Peter, and thanks for taking it in; it's much appreciated!

 include/trace/events/sched.h              | 6 +++---
 kernel/sched/core.c                       | 2 +-
 kernel/trace/fgraph.c                     | 4 ++--
 kernel/trace/ftrace.c                     | 4 ++--
 kernel/trace/trace_events.c               | 8 ++++----
 kernel/trace/trace_osnoise.c              | 4 ++--
 kernel/trace/trace_sched_switch.c         | 4 ++--
 kernel/trace/trace_sched_wakeup.c         | 4 ++--
 samples/trace_events/trace_custom_sched.h | 6 +++---
 9 files changed, 21 insertions(+), 21 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 65e786756321..fbb99a61f714 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -222,11 +222,11 @@ static inline long __trace_sched_switch_state(bool preempt,
 TRACE_EVENT(sched_switch,
 
 	TP_PROTO(bool preempt,
-		 unsigned int prev_state,
 		 struct task_struct *prev,
-		 struct task_struct *next),
+		 struct task_struct *next,
+		 unsigned int prev_state),
 
-	TP_ARGS(preempt, prev_state, prev, next),
+	TP_ARGS(preempt, prev, next, prev_state),
 
 	TP_STRUCT__entry(
 		__array(	char,	prev_comm,	TASK_COMM_LEN	)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 51efaabac3e4..d58c0389eb23 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6382,7 +6382,7 @@ static void __sched notrace __schedule(unsigned int sched_mode)
 		migrate_disable_switch(rq, prev);
 		psi_sched_switch(prev, next, !task_on_rq_queued(prev));
 
-		trace_sched_switch(sched_mode & SM_MASK_PREEMPT, prev_state, prev, next);
+		trace_sched_switch(sched_mode & SM_MASK_PREEMPT, prev, next, prev_state);
 
 		/* Also unlocks the rq: */
 		rq = context_switch(rq, prev, next, &rf);
diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
index 8f4fb328133a..a7e84c8543cb 100644
--- a/kernel/trace/fgraph.c
+++ b/kernel/trace/fgraph.c
@@ -404,9 +404,9 @@ static int alloc_retstack_tasklist(struct ftrace_ret_stack **ret_stack_list)
 
 static void
 ftrace_graph_probe_sched_switch(void *ignore, bool preempt,
-				unsigned int prev_state,
 				struct task_struct *prev,
-				struct task_struct *next)
+				struct task_struct *next,
+				unsigned int prev_state)
 {
 	unsigned long long timestamp;
 	int index;
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 4f1d2f5e7263..af899b058c8d 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -7420,9 +7420,9 @@ ftrace_func_t ftrace_ops_get_func(struct ftrace_ops *ops)
 
 static void
 ftrace_filter_pid_sched_switch_probe(void *data, bool preempt,
-				     unsigned int prev_state,
 				     struct task_struct *prev,
-				     struct task_struct *next)
+				     struct task_struct *next,
+				     unsigned int prev_state)
 {
 	struct trace_array *tr = data;
 	struct trace_pid_list *pid_list;
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index e11e167b7809..f97de82d1342 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -773,9 +773,9 @@ void trace_event_follow_fork(struct trace_array *tr, bool enable)
 
 static void
 event_filter_pid_sched_switch_probe_pre(void *data, bool preempt,
-					unsigned int prev_state,
 					struct task_struct *prev,
-					struct task_struct *next)
+					struct task_struct *next,
+					unsigned int prev_state)
 {
 	struct trace_array *tr = data;
 	struct trace_pid_list *no_pid_list;
@@ -799,9 +799,9 @@ event_filter_pid_sched_switch_probe_pre(void *data, bool preempt,
 
 static void
 event_filter_pid_sched_switch_probe_post(void *data, bool preempt,
-					 unsigned int prev_state,
 					 struct task_struct *prev,
-					 struct task_struct *next)
+					 struct task_struct *next,
+					 unsigned int prev_state)
 {
 	struct trace_array *tr = data;
 	struct trace_pid_list *no_pid_list;
diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
index e9ae1f33a7f0..afb92e2f0aea 100644
--- a/kernel/trace/trace_osnoise.c
+++ b/kernel/trace/trace_osnoise.c
@@ -1168,9 +1168,9 @@ thread_exit(struct osnoise_variables *osn_var, struct task_struct *t)
  */
 static void
 trace_sched_switch_callback(void *data, bool preempt,
-			    unsigned int prev_state,
 			    struct task_struct *p,
-			    struct task_struct *n)
+			    struct task_struct *n,
+			    unsigned int prev_state)
 {
 	struct osnoise_variables *osn_var = this_cpu_osn_var();
 
diff --git a/kernel/trace/trace_sched_switch.c b/kernel/trace/trace_sched_switch.c
index 45796d8bd4b2..c9ffdcfe622e 100644
--- a/kernel/trace/trace_sched_switch.c
+++ b/kernel/trace/trace_sched_switch.c
@@ -22,8 +22,8 @@ static DEFINE_MUTEX(sched_register_mutex);
 
 static void
 probe_sched_switch(void *ignore, bool preempt,
-		   unsigned int prev_state,
-		   struct task_struct *prev, struct task_struct *next)
+		   struct task_struct *prev, struct task_struct *next,
+		   unsigned int prev_state)
 {
 	int flags;
 
diff --git a/kernel/trace/trace_sched_wakeup.c b/kernel/trace/trace_sched_wakeup.c
index 46429f9a96fa..330aee1c1a49 100644
--- a/kernel/trace/trace_sched_wakeup.c
+++ b/kernel/trace/trace_sched_wakeup.c
@@ -426,8 +426,8 @@ tracing_sched_wakeup_trace(struct trace_array *tr,
 
 static void notrace
 probe_wakeup_sched_switch(void *ignore, bool preempt,
-			  unsigned int prev_state,
-			  struct task_struct *prev, struct task_struct *next)
+			  struct task_struct *prev, struct task_struct *next,
+			  unsigned int prev_state)
 {
 	struct trace_array_cpu *data;
 	u64 T0, T1, delta;
diff --git a/samples/trace_events/trace_custom_sched.h b/samples/trace_events/trace_custom_sched.h
index 9fdd8e7c2a45..951388334a3f 100644
--- a/samples/trace_events/trace_custom_sched.h
+++ b/samples/trace_events/trace_custom_sched.h
@@ -25,11 +25,11 @@ TRACE_CUSTOM_EVENT(sched_switch,
 	 * that the custom event is using.
 	 */
 	TP_PROTO(bool preempt,
-		 unsigned int prev_state,
 		 struct task_struct *prev,
-		 struct task_struct *next),
+		 struct task_struct *next,
+		 unsigned int prev_state),
 
-	TP_ARGS(preempt, prev_state, prev, next),
+	TP_ARGS(preempt, prev, next, prev_state),
 
 	/*
 	 * The next fields are where the customization happens.
-- 
2.35.3
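
What the changelog's "conditionally extract the prev_state argument on
supported kernel versions" can look like in practice, as a sketch under
stated assumptions: a tp_btf program keeps the original three-argument
prototype and reads the appended slot only where it exists. The
LINUX_KERNEL_VERSION __kconfig extern is standard libbpf and resolves to
a constant at load time, letting the verifier prune the ctx[3] access on
kernels that lack it; the v5.18 cutoff assumes this fix ships there
(kernels carrying only the v5.17 reordering stay incompatible either
way):

    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>    /* BPF_PROG() */

    extern int LINUX_KERNEL_VERSION __kconfig;

    #define KERNEL_VERSION(a, b, c) (((a) << 16) + ((b) << 8) + (c))

    SEC("tp_btf/sched_switch")
    int BPF_PROG(handle_switch, bool preempt, struct task_struct *prev,
                 struct task_struct *next)
    {
            unsigned int prev_state = 0;

            /* With this patch the new argument is appended, so it sits
             * in slot 3 of the raw context (which BPF_PROG names ctx). */
            if (LINUX_KERNEL_VERSION >= KERNEL_VERSION(5, 18, 0))
                    prev_state = (unsigned int)ctx[3];

            bpf_printk("prev_state=%u", prev_state);
            return 0;
    }

    char LICENSE[] SEC("license") = "GPL";
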

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH v2] sched/tracing: append prev_state to tp args instead
  2022-05-11 18:28                           ` [PATCH v2] " Delyan Kratunov
@ 2022-05-11 19:10                             ` Steven Rostedt
  2022-05-11 22:45                             ` [tip: sched/urgent] sched/tracing: Append " tip-bot2 for Delyan Kratunov
  2022-05-11 23:40                             ` [PATCH v2] sched/tracing: append " Thomas Gleixner
  2 siblings, 0 replies; 49+ messages in thread
From: Steven Rostedt @ 2022-05-11 19:10 UTC (permalink / raw)
  To: Delyan Kratunov
  Cc: peterz, dietmar.eggemann, bigeasy, keescook, qais.yousef, andrii,
	x86, vincent.guittot, akpm, adharmap, alexei.starovoitov,
	andrii.nakryiko, ast, Kenta.Tada, rdunlap, linux-kernel, tglx,
	bristot, ebiederm, mingo, ed.tsai, legion, acme, u.kleine-koenig,
	valentin.schneider, juri.lelli, namhyung

On Wed, 11 May 2022 18:28:36 +0000
Delyan Kratunov <delyank@fb.com> wrote:

> Sorry about that, Peter, and thanks for taking it in; it's much appreciated!
> 
>  include/trace/events/sched.h              | 6 +++---
>  kernel/sched/core.c                       | 2 +-
>  kernel/trace/fgraph.c                     | 4 ++--
>  kernel/trace/ftrace.c                     | 4 ++--
>  kernel/trace/trace_events.c               | 8 ++++----
>  kernel/trace/trace_osnoise.c              | 4 ++--
>  kernel/trace/trace_sched_switch.c         | 4 ++--
>  kernel/trace/trace_sched_wakeup.c         | 4 ++--
>  samples/trace_events/trace_custom_sched.h | 6 +++---
>  9 files changed, 21 insertions(+), 21 deletions(-)

Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>

-- Steve

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [tip: sched/urgent] sched/tracing: Append prev_state to tp args instead
  2022-05-11 18:28                           ` [PATCH v2] " Delyan Kratunov
  2022-05-11 19:10                             ` Steven Rostedt
@ 2022-05-11 22:45                             ` tip-bot2 for Delyan Kratunov
  2022-05-11 23:40                             ` [PATCH v2] sched/tracing: append " Thomas Gleixner
  2 siblings, 0 replies; 49+ messages in thread
From: tip-bot2 for Delyan Kratunov @ 2022-05-11 22:45 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Delyan Kratunov, Peter Zijlstra (Intel), Steven Rostedt (Google),
	x86, linux-kernel

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     9c2136be0878c88c53dea26943ce40bb03ad8d8d
Gitweb:        https://git.kernel.org/tip/9c2136be0878c88c53dea26943ce40bb03ad8d8d
Author:        Delyan Kratunov <delyank@fb.com>
AuthorDate:    Wed, 11 May 2022 18:28:36 
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 12 May 2022 00:37:11 +02:00

sched/tracing: Append prev_state to tp args instead

Commit fa2c3254d7cf (sched/tracing: Don't re-read p->state when emitting
sched_switch event, 2022-01-20) added a new prev_state argument to the
sched_switch tracepoint, before the prev task_struct pointer.

This reordering of arguments broke BPF programs that use the raw
tracepoint (e.g. tp_btf programs). The type of the second argument has
changed and existing programs that assume a task_struct* argument
(e.g. for bpf_task_storage access) will now fail to verify.

If we instead append the new argument to the end, all existing programs
would continue to work and can conditionally extract the prev_state
argument on supported kernel versions.

Fixes: fa2c3254d7cf (sched/tracing: Don't re-read p->state when emitting sched_switch event, 2022-01-20)
Signed-off-by: Delyan Kratunov <delyank@fb.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lkml.kernel.org/r/c8a6930dfdd58a4a5755fc01732675472979732b.camel@fb.com
---
 include/trace/events/sched.h              | 6 +++---
 kernel/sched/core.c                       | 2 +-
 kernel/trace/fgraph.c                     | 4 ++--
 kernel/trace/ftrace.c                     | 4 ++--
 kernel/trace/trace_events.c               | 8 ++++----
 kernel/trace/trace_osnoise.c              | 4 ++--
 kernel/trace/trace_sched_switch.c         | 4 ++--
 kernel/trace/trace_sched_wakeup.c         | 4 ++--
 samples/trace_events/trace_custom_sched.h | 6 +++---
 9 files changed, 21 insertions(+), 21 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 65e7867..fbb99a6 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -222,11 +222,11 @@ static inline long __trace_sched_switch_state(bool preempt,
 TRACE_EVENT(sched_switch,
 
 	TP_PROTO(bool preempt,
-		 unsigned int prev_state,
 		 struct task_struct *prev,
-		 struct task_struct *next),
+		 struct task_struct *next,
+		 unsigned int prev_state),
 
-	TP_ARGS(preempt, prev_state, prev, next),
+	TP_ARGS(preempt, prev, next, prev_state),
 
 	TP_STRUCT__entry(
 		__array(	char,	prev_comm,	TASK_COMM_LEN	)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 51efaab..d58c038 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6382,7 +6382,7 @@ static void __sched notrace __schedule(unsigned int sched_mode)
 		migrate_disable_switch(rq, prev);
 		psi_sched_switch(prev, next, !task_on_rq_queued(prev));
 
-		trace_sched_switch(sched_mode & SM_MASK_PREEMPT, prev_state, prev, next);
+		trace_sched_switch(sched_mode & SM_MASK_PREEMPT, prev, next, prev_state);
 
 		/* Also unlocks the rq: */
 		rq = context_switch(rq, prev, next, &rf);
diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
index 8f4fb32..a7e84c8 100644
--- a/kernel/trace/fgraph.c
+++ b/kernel/trace/fgraph.c
@@ -404,9 +404,9 @@ free:
 
 static void
 ftrace_graph_probe_sched_switch(void *ignore, bool preempt,
-				unsigned int prev_state,
 				struct task_struct *prev,
-				struct task_struct *next)
+				struct task_struct *next,
+				unsigned int prev_state)
 {
 	unsigned long long timestamp;
 	int index;
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 4f1d2f5..af899b0 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -7420,9 +7420,9 @@ ftrace_func_t ftrace_ops_get_func(struct ftrace_ops *ops)
 
 static void
 ftrace_filter_pid_sched_switch_probe(void *data, bool preempt,
-				     unsigned int prev_state,
 				     struct task_struct *prev,
-				     struct task_struct *next)
+				     struct task_struct *next,
+				     unsigned int prev_state)
 {
 	struct trace_array *tr = data;
 	struct trace_pid_list *pid_list;
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index e11e167..f97de82 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -773,9 +773,9 @@ void trace_event_follow_fork(struct trace_array *tr, bool enable)
 
 static void
 event_filter_pid_sched_switch_probe_pre(void *data, bool preempt,
-					unsigned int prev_state,
 					struct task_struct *prev,
-					struct task_struct *next)
+					struct task_struct *next,
+					unsigned int prev_state)
 {
 	struct trace_array *tr = data;
 	struct trace_pid_list *no_pid_list;
@@ -799,9 +799,9 @@ event_filter_pid_sched_switch_probe_pre(void *data, bool preempt,
 
 static void
 event_filter_pid_sched_switch_probe_post(void *data, bool preempt,
-					 unsigned int prev_state,
 					 struct task_struct *prev,
-					 struct task_struct *next)
+					 struct task_struct *next,
+					 unsigned int prev_state)
 {
 	struct trace_array *tr = data;
 	struct trace_pid_list *no_pid_list;
diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
index e9ae1f3..afb92e2 100644
--- a/kernel/trace/trace_osnoise.c
+++ b/kernel/trace/trace_osnoise.c
@@ -1168,9 +1168,9 @@ thread_exit(struct osnoise_variables *osn_var, struct task_struct *t)
  */
 static void
 trace_sched_switch_callback(void *data, bool preempt,
-			    unsigned int prev_state,
 			    struct task_struct *p,
-			    struct task_struct *n)
+			    struct task_struct *n,
+			    unsigned int prev_state)
 {
 	struct osnoise_variables *osn_var = this_cpu_osn_var();
 
diff --git a/kernel/trace/trace_sched_switch.c b/kernel/trace/trace_sched_switch.c
index 45796d8..c9ffdcf 100644
--- a/kernel/trace/trace_sched_switch.c
+++ b/kernel/trace/trace_sched_switch.c
@@ -22,8 +22,8 @@ static DEFINE_MUTEX(sched_register_mutex);
 
 static void
 probe_sched_switch(void *ignore, bool preempt,
-		   unsigned int prev_state,
-		   struct task_struct *prev, struct task_struct *next)
+		   struct task_struct *prev, struct task_struct *next,
+		   unsigned int prev_state)
 {
 	int flags;
 
diff --git a/kernel/trace/trace_sched_wakeup.c b/kernel/trace/trace_sched_wakeup.c
index 46429f9..330aee1 100644
--- a/kernel/trace/trace_sched_wakeup.c
+++ b/kernel/trace/trace_sched_wakeup.c
@@ -426,8 +426,8 @@ tracing_sched_wakeup_trace(struct trace_array *tr,
 
 static void notrace
 probe_wakeup_sched_switch(void *ignore, bool preempt,
-			  unsigned int prev_state,
-			  struct task_struct *prev, struct task_struct *next)
+			  struct task_struct *prev, struct task_struct *next,
+			  unsigned int prev_state)
 {
 	struct trace_array_cpu *data;
 	u64 T0, T1, delta;
diff --git a/samples/trace_events/trace_custom_sched.h b/samples/trace_events/trace_custom_sched.h
index 9fdd8e7..9513883 100644
--- a/samples/trace_events/trace_custom_sched.h
+++ b/samples/trace_events/trace_custom_sched.h
@@ -25,11 +25,11 @@ TRACE_CUSTOM_EVENT(sched_switch,
 	 * that the custom event is using.
 	 */
 	TP_PROTO(bool preempt,
-		 unsigned int prev_state,
 		 struct task_struct *prev,
-		 struct task_struct *next),
+		 struct task_struct *next,
+		 unsigned int prev_state),
 
-	TP_ARGS(preempt, prev_state, prev, next),
+	TP_ARGS(preempt, prev, next, prev_state),
 
 	/*
 	 * The next fields are where the customization happens.

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH v2] sched/tracing: append prev_state to tp args instead
  2022-05-11 18:28                           ` [PATCH v2] " Delyan Kratunov
  2022-05-11 19:10                             ` Steven Rostedt
  2022-05-11 22:45                             ` [tip: sched/urgent] sched/tracing: Append " tip-bot2 for Delyan Kratunov
@ 2022-05-11 23:40                             ` Thomas Gleixner
  2 siblings, 0 replies; 49+ messages in thread
From: Thomas Gleixner @ 2022-05-11 23:40 UTC (permalink / raw)
  To: Delyan Kratunov, peterz
  Cc: dietmar.eggemann, bigeasy, keescook, qais.yousef,
	Delyan Kratunov, andrii, x86, vincent.guittot, akpm, adharmap,
	alexei.starovoitov, andrii.nakryiko, ast, Kenta.Tada, rdunlap,
	linux-kernel, bristot, ebiederm, mingo, rostedt, ed.tsai, legion,
	acme, u.kleine-koenig, valentin.schneider, juri.lelli, namhyung

On Wed, May 11 2022 at 18:28, Delyan Kratunov wrote:
> Commit fa2c3254d7cf (sched/tracing: Don't re-read p->state when emitting
> sched_switch event, 2022-01-20) added a new prev_state argument to the
> sched_switch tracepoint, before the prev task_struct pointer.
>
> This reordering of arguments broke BPF programs that use the raw
> tracepoint (e.g. tp_btf programs). The type of the second argument has
> changed and existing programs that assume a task_struct* argument
> (e.g. for bpf_task_storage access) will now fail to verify.
>
> If we instead append the new argument to the end, all existing programs
> would continue to work and can conditionally extract the prev_state
> argument on supported kernel versions.

If we instead? ... would continue to work?

  https://www.kernel.org/doc/html/latest/process/maintainer-tip.html#changelog
  https://www.kernel.org/doc/html/latest/process/submitting-patches.html#submittingpatches

It's not too hard to actually follow documented rules. Something like
this:

  "Undo the reordering and move the new argument last, which ensures
   that all existing programs continue to work."

Hmm?

What's worse is that the changelog is missing a clear statement that
this is a one-time change which cannot be abused to turn tracepoints
into unmodifiable ABI as you stated in:

 https://lore.kernel.org/lkml/CAEf4BzaEo+dxSRJZHQiXYrj-a3_B-eODZUxGh3HrnPjquMYFXQ@mail.gmail.com

Thanks,

        tglx




^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread, other threads:[~2022-05-11 23:41 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-20 16:25 [PATCH v3 0/2] sched/tracing: sched_switch prev_state reported as TASK_RUNNING when it's not Valentin Schneider
2022-01-20 16:25 ` [PATCH v3 1/2] sched/tracing: Don't re-read p->state when emitting sched_switch event Valentin Schneider
2022-03-01 15:24   ` [tip: sched/core] " tip-bot2 for Valentin Schneider
2022-03-04 16:13     ` Valentin Schneider
2022-03-08 18:02     ` Qais Yousef
2022-03-08 18:10       ` Greg KH
2022-03-08 18:51         ` Qais Yousef
2022-04-09 23:38           ` Qais Yousef
2022-04-10 22:06             ` Qais Yousef
2022-04-10 23:22               ` Holger Hoffstätte
2022-04-11  7:18                 ` Holger Hoffstätte
2022-04-11  7:28                   ` Greg KH
2022-04-11  8:05                     ` Holger Hoffstätte
2022-04-11 13:23                       ` Greg KH
2022-04-11 13:22             ` Greg KH
2022-04-11 21:06               ` Qais Yousef
2022-01-20 16:25 ` [PATCH v3 2/2] sched/tracing: Report TASK_RTLOCK_WAIT tasks as TASK_UNINTERRUPTIBLE Valentin Schneider
2022-03-01 15:24   ` [tip: sched/core] " tip-bot2 for Valentin Schneider
2022-04-09 23:42   ` [PATCH v3 2/2] " Qais Yousef
2022-04-10  6:14     ` Greg KH
2022-04-10 22:13       ` Qais Yousef
2022-04-11 13:20         ` Greg KH
2022-04-11 20:18           ` Qais Yousef
2022-01-21 17:15 ` [PATCH v3 0/2] sched/tracing: sched_switch prev_state reported as TASK_RUNNING when it's not Steven Rostedt
2022-02-27 15:33   ` Peter Zijlstra
2022-04-21 22:12 ` [PATCH] sched/tracing: append prev_state to tp args instead Delyan Kratunov
2022-04-22 10:13   ` Valentin Schneider
2022-04-22 11:09   ` Peter Zijlstra
2022-04-22 15:55     ` Steven Rostedt
2022-04-22 16:54       ` Andrii Nakryiko
2022-04-22 16:37     ` Andrii Nakryiko
2022-04-22 17:22     ` Delyan Kratunov
2022-04-22 18:30       ` Alexei Starovoitov
2022-04-26 12:28         ` Peter Zijlstra
2022-04-26 14:09           ` Qais Yousef
2022-04-26 15:54             ` Andrii Nakryiko
2022-04-27 10:34               ` Qais Yousef
2022-04-27 18:17                 ` Andrii Nakryiko
2022-04-27 20:32                   ` Alexei Starovoitov
2022-04-28 10:02                   ` Qais Yousef
2022-05-09 19:32                     ` Andrii Nakryiko
2022-05-10  7:01                       ` Peter Zijlstra
2022-05-10  8:29                         ` Peter Zijlstra
2022-05-10 14:31                           ` Steven Rostedt
2022-05-11 18:28                           ` [PATCH v2] " Delyan Kratunov
2022-05-11 19:10                             ` Steven Rostedt
2022-05-11 22:45                             ` [tip: sched/urgent] sched/tracing: Append " tip-bot2 for Delyan Kratunov
2022-05-11 23:40                             ` [PATCH v2] sched/tracing: append " Thomas Gleixner
2022-04-26 15:51           ` [PATCH] " Andrii Nakryiko
