All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v3 0/2] sched/tracing: sched_switch prev_state reported as TASK_RUNNING when it's not
@ 2022-01-20 16:25 Valentin Schneider
  2022-01-20 16:25 ` [PATCH v3 1/2] sched/tracing: Don't re-read p->state when emitting sched_switch event Valentin Schneider
                   ` (3 more replies)
  0 siblings, 4 replies; 49+ messages in thread
From: Valentin Schneider @ 2022-01-20 16:25 UTC (permalink / raw)
  To: linux-kernel
  Cc: Abhijeet Dharmapurikar, Uwe Kleine-König, Dietmar Eggemann,
	Steven Rostedt, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Thomas Gleixner, Sebastian Andrzej Siewior, Juri Lelli,
	Daniel Bristot de Oliveira, Kees Cook, Andrew Morton,
	Eric W. Biederman, Alexey Gladkov, Kenta.Tada, Randy Dunlap,
	Ed Tsai

Hi folks,

Problem
=======

Abhijeet pointed out that the following sequence of trace events can be
observed:

  cat-1676  [001] d..3 103.010411: sched_waking: comm=grep pid=1677 prio=120 target_cpu=004
  grep-1677 [004] d..2 103.010440: sched_switch: prev_comm=grep prev_pid=1677 prev_prio=120 prev_state=R 0x0 ==> next_comm=swapper/4 next_pid=0 next_prio=120
  <idle>-0  [004] dNh3 103.0100459: sched_wakeup: comm=grep pid=1677 prio=120 target_cpu=004

IOW, not-yet-woken task gets presented as runnable and switched out in
favor of the idle task... Dietmar and I had a look, turns out this sequence
can happen: 

		      p->__state = TASK_INTERRUPTIBLE;
		      __schedule()
			deactivate_task(p);
  ttwu()
    READ !p->on_rq
    p->__state=TASK_WAKING
			trace_sched_switch()
			  __trace_sched_switch_state()
			    task_state_index()
			      return 0;

TASK_WAKING isn't in TASK_REPORT, hence why task_state_index(p) returns 0.
This can happen as of commit c6e7bd7afaeb ("sched/core: Optimize ttwu()
spinning on p->on_cpu") which punted the TASK_WAKING write to the other
side of

  smp_cond_load_acquire(&p->on_cpu, !VAL);

in ttwu().

Uwe reported on #linux-rt what I think is a similar issue with
TASK_RTLOCK_WAIT on PREEMPT_RT; again that state isn't in TASK_REPORT so
you get prev_state=0 despite the task blocking on a lock.

Both of those are very confusing for tooling or even human observers.

Patches
=======

For the TASK_WAKING case, pushing the prev_state read in __schedule() down
to __trace_sched_switch_state() seems sufficient. That's patch 1.

For TASK_RTLOCK_WAIT, it's a bit trickier. One could use ->saved_state as
prev_state, but
a) that is also susceptible to racing (see ttwu() / ttwu_state_match())
b) some lock substitutions (e.g. rt_spin_lock()) leave ->saved_state as
   TASK_RUNNING.

Patch 2 adds TASK_RTLOCK_WAIT to TASK_REPORT instead.

Testing
=======

Briefly tested on an Arm Juno via ftrace+hackbench against
o tip/sched/core@82762d2af31a
o v5.16-rt15-rebase w/ CONFIG_PREEMPT_RT


I also had a look at the __schedule() disassembly as suggested by Peter; on x86
prev_state gets pushed to the stack and popped prior to the trace event static
call, the rest seems mostly unchanged (GCC 9.3).

Revisions
=========

v2 -> v3
++++++++

o Dropped TASK_RTLOCK_WAIT from TASK_REPORT, made it appear as
  TASK_UNINTERRUPTIBLE instead (Eric)

RFC v1 -> v2
++++++++++++

o Collected tags (Steven, Sebastian)
o Patched missed trace_switch event clients (Steven)

Cheers,
Valentin

Valentin Schneider (2):
  sched/tracing: Don't re-read p->state when emitting sched_switch event
  sched/tracing: Report TASK_RTLOCK_WAIT tasks as TASK_UNINTERRUPTIBLE

 include/linux/sched.h             | 19 ++++++++++++++++---
 include/trace/events/sched.h      | 11 +++++++----
 kernel/sched/core.c               |  4 ++--
 kernel/trace/fgraph.c             |  4 +++-
 kernel/trace/ftrace.c             |  4 +++-
 kernel/trace/trace_events.c       |  8 ++++++--
 kernel/trace/trace_osnoise.c      |  4 +++-
 kernel/trace/trace_sched_switch.c |  1 +
 kernel/trace/trace_sched_wakeup.c |  1 +
 9 files changed, 42 insertions(+), 14 deletions(-)

--
2.25.1


^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread, other threads:[~2022-05-11 23:41 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-20 16:25 [PATCH v3 0/2] sched/tracing: sched_switch prev_state reported as TASK_RUNNING when it's not Valentin Schneider
2022-01-20 16:25 ` [PATCH v3 1/2] sched/tracing: Don't re-read p->state when emitting sched_switch event Valentin Schneider
2022-03-01 15:24   ` [tip: sched/core] " tip-bot2 for Valentin Schneider
2022-03-04 16:13     ` Valentin Schneider
2022-03-08 18:02     ` Qais Yousef
2022-03-08 18:10       ` Greg KH
2022-03-08 18:51         ` Qais Yousef
2022-04-09 23:38           ` Qais Yousef
2022-04-10 22:06             ` Qais Yousef
2022-04-10 23:22               ` Holger Hoffstätte
2022-04-11  7:18                 ` Holger Hoffstätte
2022-04-11  7:28                   ` Greg KH
2022-04-11  8:05                     ` Holger Hoffstätte
2022-04-11 13:23                       ` Greg KH
2022-04-11 13:22             ` Greg KH
2022-04-11 21:06               ` Qais Yousef
2022-01-20 16:25 ` [PATCH v3 2/2] sched/tracing: Report TASK_RTLOCK_WAIT tasks as TASK_UNINTERRUPTIBLE Valentin Schneider
2022-03-01 15:24   ` [tip: sched/core] " tip-bot2 for Valentin Schneider
2022-04-09 23:42   ` [PATCH v3 2/2] " Qais Yousef
2022-04-10  6:14     ` Greg KH
2022-04-10 22:13       ` Qais Yousef
2022-04-11 13:20         ` Greg KH
2022-04-11 20:18           ` Qais Yousef
2022-01-21 17:15 ` [PATCH v3 0/2] sched/tracing: sched_switch prev_state reported as TASK_RUNNING when it's not Steven Rostedt
2022-02-27 15:33   ` Peter Zijlstra
2022-04-21 22:12 ` [PATCH] sched/tracing: append prev_state to tp args instead Delyan Kratunov
2022-04-22 10:13   ` Valentin Schneider
2022-04-22 11:09   ` Peter Zijlstra
2022-04-22 15:55     ` Steven Rostedt
2022-04-22 16:54       ` Andrii Nakryiko
2022-04-22 16:37     ` Andrii Nakryiko
2022-04-22 17:22     ` Delyan Kratunov
2022-04-22 18:30       ` Alexei Starovoitov
2022-04-26 12:28         ` Peter Zijlstra
2022-04-26 14:09           ` Qais Yousef
2022-04-26 15:54             ` Andrii Nakryiko
2022-04-27 10:34               ` Qais Yousef
2022-04-27 18:17                 ` Andrii Nakryiko
2022-04-27 20:32                   ` Alexei Starovoitov
2022-04-28 10:02                   ` Qais Yousef
2022-05-09 19:32                     ` Andrii Nakryiko
2022-05-10  7:01                       ` Peter Zijlstra
2022-05-10  8:29                         ` Peter Zijlstra
2022-05-10 14:31                           ` Steven Rostedt
2022-05-11 18:28                           ` [PATCH v2] " Delyan Kratunov
2022-05-11 19:10                             ` Steven Rostedt
2022-05-11 22:45                             ` [tip: sched/urgent] sched/tracing: Append " tip-bot2 for Delyan Kratunov
2022-05-11 23:40                             ` [PATCH v2] sched/tracing: append " Thomas Gleixner
2022-04-26 15:51           ` [PATCH] " Andrii Nakryiko

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.