From: Valentin Schneider <valentin.schneider@arm.com>
To: linux-kernel@vger.kernel.org
Cc: "Abhijeet Dharmapurikar" <adharmap@quicinc.com>,
"Uwe Kleine-König" <u.kleine-koenig@pengutronix.de>,
"Dietmar Eggemann" <dietmar.eggemann@arm.com>,
"Steven Rostedt" <rostedt@goodmis.org>,
"Peter Zijlstra" <peterz@infradead.org>,
"Ingo Molnar" <mingo@kernel.org>,
"Vincent Guittot" <vincent.guittot@linaro.org>,
"Thomas Gleixner" <tglx@linutronix.de>,
"Sebastian Andrzej Siewior" <bigeasy@linutronix.de>,
"Juri Lelli" <juri.lelli@redhat.com>,
"Daniel Bristot de Oliveira" <bristot@redhat.com>,
"Kees Cook" <keescook@chromium.org>,
"Andrew Morton" <akpm@linux-foundation.org>,
"Eric W. Biederman" <ebiederm@xmission.com>,
"Alexey Gladkov" <legion@kernel.org>,
"Kenta.Tada@sony.com" <Kenta.Tada@sony.com>,
"Randy Dunlap" <rdunlap@infradead.org>,
"Ed Tsai" <ed.tsai@mediatek.com>
Subject: [PATCH v3 0/2] sched/tracing: sched_switch prev_state reported as TASK_RUNNING when it's not
Date: Thu, 20 Jan 2022 16:25:18 +0000
Message-ID: <20220120162520.570782-1-valentin.schneider@arm.com>
Hi folks,
Problem
=======
Abhijeet pointed out that the following sequence of trace events can be
observed:
cat-1676 [001] d..3 103.010411: sched_waking: comm=grep pid=1677 prio=120 target_cpu=004
grep-1677 [004] d..2 103.010440: sched_switch: prev_comm=grep prev_pid=1677 prev_prio=120 prev_state=R 0x0 ==> next_comm=swapper/4 next_pid=0 next_prio=120
<idle>-0 [004] dNh3 103.0100459: sched_wakeup: comm=grep pid=1677 prio=120 target_cpu=004
IOW, a not-yet-woken task gets presented as runnable and switched out in
favor of the idle task... Dietmar and I had a look, turns out this sequence
can happen:
                      p->__state = TASK_INTERRUPTIBLE;
                      __schedule()
                        deactivate_task(p);
  ttwu()
    READ !p->on_rq
    p->__state=TASK_WAKING
                        trace_sched_switch()
                          __trace_sched_switch_state()
                            task_state_index()
                              return 0;
TASK_WAKING isn't in TASK_REPORT, hence why task_state_index(p) returns 0.
This can happen as of commit c6e7bd7afaeb ("sched/core: Optimize ttwu()
spinning on p->on_cpu") which punted the TASK_WAKING write to the other
side of
smp_cond_load_acquire(&p->on_cpu, !VAL);
in ttwu().
Uwe reported on #linux-rt what I think is a similar issue with
TASK_RTLOCK_WAIT on PREEMPT_RT; again that state isn't in TASK_REPORT so
you get prev_state=0 despite the task blocking on a lock.
Both of those are very confusing for tooling or even human observers.
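To make the "returns 0" part concrete, here is a small userspace sketch of how the reported index falls out of the state bits. The bit values and the "RSDTtXZPI" letter table are my reading of include/linux/sched.h around v5.16 and should be treated as assumptions; the *_sketch names are made up for illustration, and the exit_state / TASK_IDLE handling of the real helper is left out.

```c
#include <assert.h>

/* Assumed values, from my reading of include/linux/sched.h (~v5.16). */
#define TASK_RUNNING          0x0000
#define TASK_INTERRUPTIBLE    0x0001
#define TASK_UNINTERRUPTIBLE  0x0002
#define __TASK_STOPPED        0x0004
#define __TASK_TRACED         0x0008
#define EXIT_DEAD             0x0010
#define EXIT_ZOMBIE           0x0020
#define TASK_PARKED           0x0040
#define TASK_WAKING           0x0200
#define TASK_RTLOCK_WAIT      0x1000

#define TASK_REPORT (TASK_RUNNING | TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE | \
                     __TASK_STOPPED | __TASK_TRACED | EXIT_DEAD | EXIT_ZOMBIE | \
                     TASK_PARKED)

/* Portable stand-in for fls(): 1-based index of the highest set bit, 0 for 0. */
static int fls_sketch(unsigned int x)
{
    int pos = 0;

    while (x) {
        pos++;
        x >>= 1;
    }
    return pos;
}

/* Simplified: ignores exit_state and the TASK_IDLE special case. */
static int task_state_index_sketch(unsigned int state)
{
    /* Bits outside TASK_REPORT are masked away before the index is taken. */
    return fls_sketch(state & TASK_REPORT);
}

/* Letter shown in prev_state= for a given index (table is an assumption). */
static char task_index_to_char_sketch(int idx)
{
    static const char state_char[] = "RSDTtXZPI";

    return state_char[idx];
}
```

With these definitions both TASK_WAKING and TASK_RTLOCK_WAIT mask down to 0, which is the same index TASK_RUNNING gets, hence the 'R' in the trace.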
Patches
=======
For the TASK_WAKING case, pushing the prev_state read in __schedule() down
to __trace_sched_switch_state() seems sufficient. That's patch 1.
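The idea behind patch 1 can be illustrated with a single-threaded userspace model. This is only a sketch: struct task_sketch and the three functions are hypothetical names, the "remote" ttwu() is simulated by a plain function call at the point where the race window would be, and the bit values are assumed from include/linux/sched.h.

```c
#include <assert.h>

/* Assumed values, from my reading of include/linux/sched.h (~v5.16). */
#define TASK_INTERRUPTIBLE 0x0001
#define TASK_WAKING        0x0200

struct task_sketch {
    unsigned int state;     /* stands in for p->__state */
};

/*
 * Stand-in for the remote ttwu(): once the task is off the runqueue, the
 * waker is free to write TASK_WAKING.
 */
static void remote_ttwu_sketch(struct task_sketch *p)
{
    p->state = TASK_WAKING;
}

/* Old behaviour: the trace path reads the state again, after the window. */
static unsigned int traced_state_reread(struct task_sketch *p)
{
    remote_ttwu_sketch(p);      /* simulated race window */
    return p->state;            /* sees the waker's write */
}

/* New behaviour: the value sampled earlier is handed down to the trace path. */
static unsigned int traced_state_snapshot(struct task_sketch *p)
{
    unsigned int prev_state = p->state;     /* sampled before the window */

    remote_ttwu_sketch(p);      /* same simulated race window */
    return prev_state;          /* unaffected by the waker's write */
}
```

Starting from TASK_INTERRUPTIBLE, the re-reading variant ends up reporting TASK_WAKING while the snapshot variant still reports TASK_INTERRUPTIBLE.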
For TASK_RTLOCK_WAIT, it's a bit trickier. One could use ->saved_state as
prev_state, but
a) that is also susceptible to racing (see ttwu() / ttwu_state_match())
b) some lock substitutions (e.g. rt_spin_lock()) leave ->saved_state as
TASK_RUNNING.
Patch 2 reports TASK_RTLOCK_WAIT tasks as TASK_UNINTERRUPTIBLE instead.
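Going by the patch 2 subject line and the v3 revision note, the reporting side boils down to special-casing TASK_RTLOCK_WAIT before the index is computed. A minimal userspace sketch of that mapping follows; report_index_sketch() and fls_sketch() are hypothetical names, and the bit values (including TASK_REPORT == 0x7f) are assumed from include/linux/sched.h around v5.16.

```c
#include <assert.h>

/* Assumed values, from my reading of include/linux/sched.h (~v5.16). */
#define TASK_INTERRUPTIBLE    0x0001
#define TASK_UNINTERRUPTIBLE  0x0002
#define TASK_RTLOCK_WAIT      0x1000
#define TASK_REPORT           0x007f

/* Portable stand-in for fls(): 1-based index of the highest set bit, 0 for 0. */
static int fls_sketch(unsigned int x)
{
    int pos = 0;

    while (x) {
        pos++;
        x >>= 1;
    }
    return pos;
}

/* Simplified reporting helper: ignores exit_state and TASK_IDLE. */
static int report_index_sketch(unsigned int state)
{
    unsigned int reported = state & TASK_REPORT;

    /*
     * Rather than exposing a new state to userspace, present a task
     * blocked on an rtlock as if it were in TASK_UNINTERRUPTIBLE.
     */
    if (state == TASK_RTLOCK_WAIT)
        reported = TASK_UNINTERRUPTIBLE;

    return fls_sketch(reported);
}
```

The point of doing it this way is that existing tooling keeps seeing only states it already knows about.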
Testing
=======
Briefly tested on an Arm Juno via ftrace+hackbench against
o tip/sched/core@82762d2af31a
o v5.16-rt15-rebase w/ CONFIG_PREEMPT_RT
I also had a look at the __schedule() disassembly as suggested by Peter; on x86
prev_state gets pushed to the stack and popped prior to the trace event static
call, the rest seems mostly unchanged (GCC 9.3).
Revisions
=========
v2 -> v3
++++++++
o Dropped TASK_RTLOCK_WAIT from TASK_REPORT, made it appear as
TASK_UNINTERRUPTIBLE instead (Eric)
RFC v1 -> v2
++++++++++++
o Collected tags (Steven, Sebastian)
o Patched missed trace_switch event clients (Steven)
Cheers,
Valentin
Valentin Schneider (2):
sched/tracing: Don't re-read p->state when emitting sched_switch event
sched/tracing: Report TASK_RTLOCK_WAIT tasks as TASK_UNINTERRUPTIBLE
include/linux/sched.h | 19 ++++++++++++++++---
include/trace/events/sched.h | 11 +++++++----
kernel/sched/core.c | 4 ++--
kernel/trace/fgraph.c | 4 +++-
kernel/trace/ftrace.c | 4 +++-
kernel/trace/trace_events.c | 8 ++++++--
kernel/trace/trace_osnoise.c | 4 +++-
kernel/trace/trace_sched_switch.c | 1 +
kernel/trace/trace_sched_wakeup.c | 1 +
9 files changed, 42 insertions(+), 14 deletions(-)
--
2.25.1