All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH 0/2] sched/tracing: sched_switch prev_state reported as TASK_RUNNING when it's not
@ 2021-11-29 12:35 Valentin Schneider
  2021-11-29 12:36 ` [RFC PATCH 1/2] sched/tracing: Don't re-read p->state when emitting sched_switch event Valentin Schneider
  2021-11-29 12:36 ` [RFC PATCH 2/2] sched/tracing: Add TASK_RTLOCK_WAIT to TASK_REPORT Valentin Schneider
  0 siblings, 2 replies; 11+ messages in thread
From: Valentin Schneider @ 2021-11-29 12:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Abhijeet Dharmapurikar, Uwe Kleine-König, Dietmar Eggemann,
	Steven Rostedt, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Thomas Gleixner, Sebastian Andrzej Siewior, Juri Lelli,
	Daniel Bristot de Oliveira, Kees Cook, Andrew Morton

Hi folks,

Problem
=======

Abhijeet pointed out that the following sequence of trace events can be
observed:

  cat-1676  [001] d..3 103.010411: sched_waking: comm=grep pid=1677 prio=120 target_cpu=004
  grep-1677 [004] d..2 103.010440: sched_switch: prev_comm=grep prev_pid=1677 prev_prio=120 prev_state=R 0x0 ==> next_comm=swapper/4 next_pid=0 next_prio=120
  <idle>-0  [004] dNh3 103.0100459: sched_wakeup: comm=grep pid=1677 prio=120 target_cpu=004

IOW, not-yet-woken task gets presented as runnable and switched out in
favor of the idle task... Dietmar and I had a look, turns out this sequence
can happen: 

		      p->__state = TASK_INTERRUPTIBLE;
		      __schedule()
			deactivate_task(p);
  ttwu()
    READ !p->on_rq
    p->__state=TASK_WAKING
			trace_sched_switch()
			  __trace_sched_switch_state()
			    task_state_index()
			      return 0;

TASK_WAKING isn't in TASK_REPORT, hence why task_state_index(p) returns 0.
This can happen as of commit c6e7bd7afaeb ("sched/core: Optimize ttwu()
spinning on p->on_cpu") which punted the TASK_WAKING write to the other
side of

  smp_cond_load_acquire(&p->on_cpu, !VAL);

in ttwu().

Uwe reported on #linux-rt what I think is a similar issue with
TASK_RTLOCK_WAIT on PREEMPT_RT; again that state isn't in TASK_REPORT so
you get prev_state=0 despite the task blocking on a lock.

Both of those are very confusing for tooling or even human observers.

Patches
=======

For the TASK_WAKING case, pushing the prev_state read in __schedule() down
to __trace_sched_switch_state() seems sufficient. That's patch 1.

For TASK_RTLOCK_WAIT, it's a bit trickier. One could use ->saved_state as
prev_state, but
a) that is also susceptible to racing (see ttwu() / ttwu_state_match())
b) some lock substitutions (e.g. rt_spin_lock()) leave ->saved_state as
   TASK_RUNNING.

Patch 2 adds TASK_RTLOCK_WAIT to TASK_REPORT instead.

Testing
=======

Briefly tested on an Arm Juno via ftrace+hackbench against
o tip/sched/core@8c92606ab810
o v5.16-rc2-rt4 w/ CONFIG_PREEMPT_RT

Cheers,
Valentin

Valentin Schneider (2):
  sched/tracing: Don't re-read p->state when emitting sched_switch event
  sched/tracing: Add TASK_RTLOCK_WAIT to TASK_REPORT

 fs/proc/array.c                   |  3 ++-
 include/linux/sched.h             | 28 +++++++++++++++++-----------
 include/trace/events/sched.h      | 12 ++++++++----
 kernel/sched/core.c               |  4 ++--
 kernel/trace/fgraph.c             |  4 +++-
 kernel/trace/ftrace.c             |  4 +++-
 kernel/trace/trace_events.c       |  8 ++++++--
 kernel/trace/trace_sched_switch.c |  1 +
 8 files changed, 42 insertions(+), 22 deletions(-)

--
2.25.1


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2022-01-14 10:11 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-29 12:35 [RFC PATCH 0/2] sched/tracing: sched_switch prev_state reported as TASK_RUNNING when it's not Valentin Schneider
2021-11-29 12:36 ` [RFC PATCH 1/2] sched/tracing: Don't re-read p->state when emitting sched_switch event Valentin Schneider
2021-11-29 16:39   ` kernel test robot
2021-11-29 16:39     ` kernel test robot
2021-11-29 16:59   ` kernel test robot
2021-11-29 16:59     ` kernel test robot
2021-12-08 20:12   ` Steven Rostedt
2021-12-09 13:02     ` Valentin Schneider
2021-11-29 12:36 ` [RFC PATCH 2/2] sched/tracing: Add TASK_RTLOCK_WAIT to TASK_REPORT Valentin Schneider
2021-12-08 20:34   ` Steven Rostedt
2022-01-14 10:11   ` Sebastian Andrzej Siewior

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.