All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/8] perf sched: Add trace event for sched wait.
@ 2014-04-15 12:32 Dongsheng Yang
  2014-04-15 12:32 ` [PATCH 1/8] sched & trace: Add a trace event for wait Dongsheng Yang
                   ` (8 more replies)
  0 siblings, 9 replies; 42+ messages in thread
From: Dongsheng Yang @ 2014-04-15 12:32 UTC (permalink / raw)
  To: rostedt, fweisbec, mingo, peterz, acme; +Cc: linux-kernel, Dongsheng Yang

Hi all:
	This is to solve the bug message shown in perf sched latency.

	# perf sched latency|tail
	  ksoftirqd/0:3         |      0.597 ms |       57 | avg:    0.004 ms | max:    0.054 ms | max at: 19681.546204 s
	  ksoftirqd/1:14        |      0.637 ms |       58 | avg:    0.004 ms | max:    0.066 ms | max at: 19674.687734 s
	  irqbalance:349        |      0.429 ms |        1 | avg:    0.004 ms | max:    0.004 ms | max at: 19675.791528 s
	  ksoftirqd/3:24        |      0.527 ms |       67 | avg:    0.003 ms | max:    0.011 ms | max at: 19673.285019 s
	  migration/3:23        |      0.000 ms |        1 | avg:    0.002 ms | max:    0.002 ms | max at: 19672.055354 s
	 -----------------------------------------------------------------------------------------
	  TOTAL:                |   4384.616 ms |    36879 |
	 ---------------------------------------------------
	  INFO: 0.030% state machine bugs (11 out of 36684)

	After some investigation, there are two reasons cause this problem.

	(1). Sometimes, scheduler will wake up a running task, it is not necessary,
	     then I skip the wakeup if task->state is TASK_RUNNING. [4/8]

	(2). No tracing for sched wait.
	This is a simple graph for task state changing.

	----------------		1		----------------
	| TASK_RUNNING | ------------------------------>| TASK_RUNNING |
	|   (running)  |<------------------------------ |  (wait cpu)  |
	----------------		2		----------------
		^						|
		|4	-------------------------	3	|
		|-------|TASK_{UN}INTERRUPTABLE	|<--------------|
			|	in wait_rq	|
			-------------------------

	As the graph shown above, there are four event in scheduling, and
we currently are tracing 3 of them.

	1 & 2: sched:sched_switch
	    4: sched:sched_wakeup|sched:sched_wakeup_new

But about 3, we have no trace event for it.

This patchset add a new trace event for sched wait, and add a trace point
before any task added into wait queue. [1/8] & [3/8]

	BTW, other patchs in this thread are about some little fix and enhancement
in the development, please help to review at the same time. 

Thanx

Dongsheng (8):
  sched & trace: Add a trace event for wait.
  sched/wait: Add trace point before add task into wait queue.
  sched/wait: Use __add_wait_queue{_tail}_exclusive() as possible.
  sched/core: Skip wakeup when task is already running.
  perf tools: record and process sched:sched_wait event.
  perf tools: add missing event for perf sched record.
  perf tools: Adapt the TASK_STATE_TO_CHAR_STR to new value in kernel
    space.
  perf tools: Clarify the output of perf sched map.

 include/trace/events/sched.h |  20 +++++++++
 kernel/sched/core.c          |   3 ++
 kernel/sched/wait.c          |  13 ++++--
 tools/perf/builtin-sched.c   | 100 +++++++++++++++++++++++++++++++++++--------
 4 files changed, 115 insertions(+), 21 deletions(-)

-- 
1.8.2.1


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 1/8] sched & trace: Add a trace event for wait.
  2014-04-15 12:32 [PATCH 0/8] perf sched: Add trace event for sched wait Dongsheng Yang
@ 2014-04-15 12:32 ` Dongsheng Yang
  2014-04-15 13:49   ` Peter Zijlstra
  2014-04-15 12:32 ` [PATCH 2/8] sched/wait: Add trace point before add task into wait queue Dongsheng Yang
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 42+ messages in thread
From: Dongsheng Yang @ 2014-04-15 12:32 UTC (permalink / raw)
  To: rostedt, fweisbec, mingo, peterz, acme; +Cc: linux-kernel, Dongsheng

From: Dongsheng <yangds.fnst@cn.fujitsu.com>

Signed-off-by: Dongsheng <yangds.fnst@cn.fujitsu.com>
---
 include/trace/events/sched.h | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 67e1bbf..7aa7d43a 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -51,6 +51,26 @@ TRACE_EVENT(sched_kthread_stop_ret,
 );
 
 /*
+ * Tracepoint for wait:
+ */
+TRACE_EVENT(sched_wait,
+
+	TP_PROTO(struct task_struct *p),
+
+	TP_ARGS(p),
+
+	TP_STRUCT__entry(
+		__field(	pid_t,	pid	)
+	),
+
+	TP_fast_assign(
+		__entry->pid	= p->pid;
+	),
+
+	TP_printk("pid=%d", __entry->pid)
+);
+
+/*
  * Tracepoint for waking up a task:
  */
 DECLARE_EVENT_CLASS(sched_wakeup_template,
-- 
1.8.2.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 2/8] sched/wait: Add trace point before add task into wait queue.
  2014-04-15 12:32 [PATCH 0/8] perf sched: Add trace event for sched wait Dongsheng Yang
  2014-04-15 12:32 ` [PATCH 1/8] sched & trace: Add a trace event for wait Dongsheng Yang
@ 2014-04-15 12:32 ` Dongsheng Yang
  2014-04-15 12:32 ` [PATCH 3/8] sched/wait: Use __add_wait_queue{_tail}_exclusive() as possible Dongsheng Yang
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 42+ messages in thread
From: Dongsheng Yang @ 2014-04-15 12:32 UTC (permalink / raw)
  To: rostedt, fweisbec, mingo, peterz, acme; +Cc: linux-kernel, Dongsheng

From: Dongsheng <yangds.fnst@cn.fujitsu.com>

As there are lots of functions in sched/wait.c to call __add_wait_queue{_tail}(),
we need to add trace point before any time we add task into wait queue, then this
patch add trace point in any possible route to adding task into wait queue.

Signed-off-by: Dongsheng <yangds.fnst@cn.fujitsu.com>
---
 kernel/sched/wait.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 7d50f79..283750e 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -10,6 +10,8 @@
 #include <linux/wait.h>
 #include <linux/hash.h>
 
+#include <trace/events/sched.h>
+
 void __init_waitqueue_head(wait_queue_head_t *q, const char *name, struct lock_class_key *key)
 {
 	spin_lock_init(&q->lock);
@@ -23,6 +25,7 @@ void add_wait_queue(wait_queue_head_t *q, wait_queue_t *wait)
 {
 	unsigned long flags;
 
+	trace_sched_wait(current);
 	wait->flags &= ~WQ_FLAG_EXCLUSIVE;
 	spin_lock_irqsave(&q->lock, flags);
 	__add_wait_queue(q, wait);
@@ -34,6 +37,7 @@ void add_wait_queue_exclusive(wait_queue_head_t *q, wait_queue_t *wait)
 {
 	unsigned long flags;
 
+	trace_sched_wait(current);
 	wait->flags |= WQ_FLAG_EXCLUSIVE;
 	spin_lock_irqsave(&q->lock, flags);
 	__add_wait_queue_tail(q, wait);
@@ -172,6 +176,7 @@ prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state)
 {
 	unsigned long flags;
 
+	trace_sched_wait(current);
 	wait->flags &= ~WQ_FLAG_EXCLUSIVE;
 	spin_lock_irqsave(&q->lock, flags);
 	if (list_empty(&wait->task_list))
@@ -186,6 +191,7 @@ prepare_to_wait_exclusive(wait_queue_head_t *q, wait_queue_t *wait, int state)
 {
 	unsigned long flags;
 
+	trace_sched_wait(current);
 	wait->flags |= WQ_FLAG_EXCLUSIVE;
 	spin_lock_irqsave(&q->lock, flags);
 	if (list_empty(&wait->task_list))
@@ -205,6 +211,7 @@ long prepare_to_wait_event(wait_queue_head_t *q, wait_queue_t *wait, int state)
 	wait->private = current;
 	wait->func = autoremove_wake_function;
 
+	trace_sched_wait(current);
 	spin_lock_irqsave(&q->lock, flags);
 	if (list_empty(&wait->task_list)) {
 		if (wait->flags & WQ_FLAG_EXCLUSIVE)
-- 
1.8.2.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 3/8] sched/wait: Use __add_wait_queue{_tail}_exclusive() as possible.
  2014-04-15 12:32 [PATCH 0/8] perf sched: Add trace event for sched wait Dongsheng Yang
  2014-04-15 12:32 ` [PATCH 1/8] sched & trace: Add a trace event for wait Dongsheng Yang
  2014-04-15 12:32 ` [PATCH 2/8] sched/wait: Add trace point before add task into wait queue Dongsheng Yang
@ 2014-04-15 12:32 ` Dongsheng Yang
  2014-04-15 13:49   ` Peter Zijlstra
  2014-04-15 12:32 ` [PATCH 4/8] sched/core: Skip wakeup when task is already running Dongsheng Yang
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 42+ messages in thread
From: Dongsheng Yang @ 2014-04-15 12:32 UTC (permalink / raw)
  To: rostedt, fweisbec, mingo, peterz, acme; +Cc: linux-kernel, Dongsheng

From: Dongsheng <yangds.fnst@cn.fujitsu.com>

There is already a function in include/linux/wait.h to cover the
'exclusive' usage. So we can use it in sched/wait.c to replace
the opened implementation of it.

Signed-off-by: Dongsheng <yangds.fnst@cn.fujitsu.com>
---
 kernel/sched/wait.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 283750e..b04827e 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -38,9 +38,8 @@ void add_wait_queue_exclusive(wait_queue_head_t *q, wait_queue_t *wait)
 	unsigned long flags;
 
 	trace_sched_wait(current);
-	wait->flags |= WQ_FLAG_EXCLUSIVE;
 	spin_lock_irqsave(&q->lock, flags);
-	__add_wait_queue_tail(q, wait);
+	__add_wait_queue_tail_exclusive(q, wait);
 	spin_unlock_irqrestore(&q->lock, flags);
 }
 EXPORT_SYMBOL(add_wait_queue_exclusive);
@@ -192,10 +191,9 @@ prepare_to_wait_exclusive(wait_queue_head_t *q, wait_queue_t *wait, int state)
 	unsigned long flags;
 
 	trace_sched_wait(current);
-	wait->flags |= WQ_FLAG_EXCLUSIVE;
 	spin_lock_irqsave(&q->lock, flags);
 	if (list_empty(&wait->task_list))
-		__add_wait_queue_tail(q, wait);
+		__add_wait_queue_tail_exclusive(q, wait);
 	set_current_state(state);
 	spin_unlock_irqrestore(&q->lock, flags);
 }
-- 
1.8.2.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 4/8] sched/core: Skip wakeup when task is already running.
  2014-04-15 12:32 [PATCH 0/8] perf sched: Add trace event for sched wait Dongsheng Yang
                   ` (2 preceding siblings ...)
  2014-04-15 12:32 ` [PATCH 3/8] sched/wait: Use __add_wait_queue{_tail}_exclusive() as possible Dongsheng Yang
@ 2014-04-15 12:32 ` Dongsheng Yang
  2014-04-15 13:53   ` Peter Zijlstra
  2014-04-15 12:32 ` [PATCH 5/8] perf tools: record and process sched:sched_wait event Dongsheng Yang
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 42+ messages in thread
From: Dongsheng Yang @ 2014-04-15 12:32 UTC (permalink / raw)
  To: rostedt, fweisbec, mingo, peterz, acme; +Cc: linux-kernel, Dongsheng

From: Dongsheng <yangds.fnst@cn.fujitsu.com>

It is pointless to wake up a running task. Currently, we can
see it in perf sched latency.

	# perf sched record sleep 10
	[ perf record: Woken up 2 times to write data ]
	[ perf record: Captured and wrote 21.170 MB perf.data (~924950 samples) ]
	# perf sched latency |tail
	  ksoftirqd/13:57       |      0.011 ms |        1 | avg:    0.003 ms | max:    0.003 ms | max at: 254123.512011 s
	  khelper:6423          |      0.059 ms |        3 | avg:    0.003 ms | max:    0.005 ms | max at: 254124.947110 s
	  ksoftirqd/3:17        |      0.042 ms |        2 | avg:    0.003 ms | max:    0.003 ms | max at: 254126.478175 s
	  automount:6465        |      0.480 ms |        1 | avg:    0.000 ms | max:    0.000 ms | max at:  0.000000 s
	  automount:6618        |      0.465 ms |        1 | avg:    0.000 ms | max:    0.000 ms | max at:  0.000000 s
	 -----------------------------------------------------------------------------------------
	  TOTAL:                |   3535.020 ms |    28477 |
	 ---------------------------------------------------
	  INFO: 0.922% state machine bugs (259 out of 28092)

Signed-off-by: Dongsheng <yangds.fnst@cn.fujitsu.com>
---
 kernel/sched/core.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 268a45e..fc6b644 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1416,6 +1416,9 @@ static void ttwu_activate(struct rq *rq, struct task_struct *p, int en_flags)
 static void
 ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
 {
+	if (p->state == TASK_RUNNING)
+		return;
+
 	check_preempt_curr(rq, p, wake_flags);
 	trace_sched_wakeup(p, true);
 
-- 
1.8.2.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 5/8] perf tools: record and process sched:sched_wait event.
  2014-04-15 12:32 [PATCH 0/8] perf sched: Add trace event for sched wait Dongsheng Yang
                   ` (3 preceding siblings ...)
  2014-04-15 12:32 ` [PATCH 4/8] sched/core: Skip wakeup when task is already running Dongsheng Yang
@ 2014-04-15 12:32 ` Dongsheng Yang
  2014-04-15 12:32 ` [PATCH 6/8] perf tools: add missing event for perf sched record Dongsheng Yang
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 42+ messages in thread
From: Dongsheng Yang @ 2014-04-15 12:32 UTC (permalink / raw)
  To: rostedt, fweisbec, mingo, peterz, acme; +Cc: linux-kernel, Dongsheng

From: Dongsheng <yangds.fnst@cn.fujitsu.com>

Currently, perf sched tool does not cover any trace event when
a task from TASK_RINNING to TASK_{UN}INTERRUPTIBLE. Then if
a thread changed to TASK_{UN}INTERRUPTIBLE, but we did not
capture a event for it, so the state of atom is still TASK_RUNNING,
at this time, when we process a sched_wakeup event, we will see
the thread state is not TASK_SLEEP, and record a state_bug.

This patch make it record and process sched_wait event, to solve
this problem.

Signed-off-by: Dongsheng <yangds.fnst@cn.fujitsu.com>
---
 tools/perf/builtin-sched.c | 62 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 62 insertions(+)

diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
index 9ac0a49..a32af4e 100644
--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -117,6 +117,10 @@ struct trace_sched_handler {
 				  struct perf_evsel *evsel,
 				  struct perf_sample *sample,
 				  struct machine *machine);
+
+	int (*wait_event)(struct perf_sched *sched, struct perf_evsel *evsel,
+			  struct perf_sample *sample, struct machine *machine);
+
 };
 
 struct perf_sched {
@@ -863,6 +867,22 @@ add_sched_out_event(struct work_atoms *atoms,
 	return 0;
 }
 
+static int
+add_sched_wait_event(struct work_atoms *atoms,
+		     u64 timestamp)
+{
+	struct work_atom *atom = zalloc(sizeof(*atom));
+	if (!atom) {
+		pr_err("Non memory at %s", __func__);
+		return -1;
+	}
+
+	atom->sched_out_time = timestamp;
+
+	list_add_tail(&atom->list, &atoms->work_list);
+	return 0;
+}
+
 static void
 add_runtime_event(struct work_atoms *atoms, u64 delta,
 		  u64 timestamp __maybe_unused)
@@ -1100,6 +1120,32 @@ static int latency_migrate_task_event(struct perf_sched *sched,
 	return 0;
 }
 
+static int latency_wait_event(struct perf_sched *sched,
+			      struct perf_evsel *evsel,
+			      struct perf_sample *sample,
+			      struct machine *machine)
+{
+	const u32 pid = perf_evsel__intval(evsel, sample, "pid");
+	struct work_atoms *atoms;
+	struct thread *wakee;
+	u64 timestamp = sample->time;
+
+	wakee = machine__findnew_thread(machine, 0, pid);
+	atoms = thread_atoms_search(&sched->atom_root, wakee, &sched->cmp_pid);
+	if (!atoms) {
+		if (thread_atoms_insert(sched, wakee))
+			return -1;
+		atoms = thread_atoms_search(&sched->atom_root, wakee, &sched->cmp_pid);
+		if (!atoms) {
+			pr_err("wakeup-event: Internal tree error");
+			return -1;
+		}
+	}
+
+	add_sched_wait_event(atoms, timestamp);
+	return 0;
+}
+
 static void output_lat_thread(struct perf_sched *sched, struct work_atoms *work_list)
 {
 	int i;
@@ -1250,6 +1296,19 @@ static void perf_sched__sort_lat(struct perf_sched *sched)
 	}
 }
 
+static int process_sched_wait(struct perf_tool *tool,
+			      struct perf_evsel *evsel,
+			      struct perf_sample *sample,
+			      struct machine *machine)
+{
+	struct perf_sched *sched = container_of(tool, struct perf_sched, tool);
+
+	if (sched->tp_handler->wait_event)
+		return sched->tp_handler->wait_event(sched, evsel, sample, machine);
+
+	return 0;
+}
+
 static int process_sched_wakeup_event(struct perf_tool *tool,
 				      struct perf_evsel *evsel,
 				      struct perf_sample *sample,
@@ -1444,6 +1503,7 @@ static int perf_sched__read_events(struct perf_sched *sched,
 		{ "sched:sched_wakeup",	      process_sched_wakeup_event, },
 		{ "sched:sched_wakeup_new",   process_sched_wakeup_event, },
 		{ "sched:sched_migrate_task", process_sched_migrate_task_event, },
+		{ "sched:sched_wait",	      process_sched_wait, },
 	};
 	struct perf_session *session;
 	struct perf_data_file file = {
@@ -1636,6 +1696,7 @@ static int __cmd_record(int argc, const char **argv)
 		"-e", "sched:sched_process_fork",
 		"-e", "sched:sched_wakeup",
 		"-e", "sched:sched_migrate_task",
+		"-e", "sched:sched_wait",
 	};
 
 	rec_argc = ARRAY_SIZE(record_args) + argc - 1;
@@ -1722,6 +1783,7 @@ int cmd_sched(int argc, const char **argv, const char *prefix __maybe_unused)
 		.switch_event	    = latency_switch_event,
 		.runtime_event	    = latency_runtime_event,
 		.migrate_task_event = latency_migrate_task_event,
+		.wait_event	    = latency_wait_event,
 	};
 	struct trace_sched_handler map_ops  = {
 		.switch_event	    = map_switch_event,
-- 
1.8.2.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 6/8] perf tools: add missing event for perf sched record.
  2014-04-15 12:32 [PATCH 0/8] perf sched: Add trace event for sched wait Dongsheng Yang
                   ` (4 preceding siblings ...)
  2014-04-15 12:32 ` [PATCH 5/8] perf tools: record and process sched:sched_wait event Dongsheng Yang
@ 2014-04-15 12:32 ` Dongsheng Yang
  2014-04-15 12:32 ` [PATCH 7/8] perf tools: Adapt the TASK_STATE_TO_CHAR_STR to new value in kernel space Dongsheng Yang
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 42+ messages in thread
From: Dongsheng Yang @ 2014-04-15 12:32 UTC (permalink / raw)
  To: rostedt, fweisbec, mingo, peterz, acme; +Cc: linux-kernel, Dongsheng

From: Dongsheng <yangds.fnst@cn.fujitsu.com>

We should record and process sched:sched_wakeup_new event in
perf sched tool, but currently, there is the process function
for it, without recording it in record subcommand.

This patch add -e sched:sched_wakeup_new to perf sched record.

Signed-off-by: Dongsheng <yangds.fnst@cn.fujitsu.com>
---
 tools/perf/builtin-sched.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
index a32af4e..72ec223 100644
--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -1695,6 +1695,7 @@ static int __cmd_record(int argc, const char **argv)
 		"-e", "sched:sched_stat_runtime",
 		"-e", "sched:sched_process_fork",
 		"-e", "sched:sched_wakeup",
+		"-e", "sched:sched_wakeup_new",
 		"-e", "sched:sched_migrate_task",
 		"-e", "sched:sched_wait",
 	};
-- 
1.8.2.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 7/8] perf tools: Adapt the TASK_STATE_TO_CHAR_STR to new value in kernel space.
  2014-04-15 12:32 [PATCH 0/8] perf sched: Add trace event for sched wait Dongsheng Yang
                   ` (5 preceding siblings ...)
  2014-04-15 12:32 ` [PATCH 6/8] perf tools: add missing event for perf sched record Dongsheng Yang
@ 2014-04-15 12:32 ` Dongsheng Yang
  2014-04-15 12:32 ` [PATCH 8/8] perf tools: Clarify the output of perf sched map Dongsheng Yang
  2014-04-15 13:54 ` [PATCH 0/8] perf sched: Add trace event for sched wait Peter Zijlstra
  8 siblings, 0 replies; 42+ messages in thread
From: Dongsheng Yang @ 2014-04-15 12:32 UTC (permalink / raw)
  To: rostedt, fweisbec, mingo, peterz, acme; +Cc: linux-kernel, Dongsheng

From: Dongsheng <yangds.fnst@cn.fujitsu.com>

Currently, TASK_STATE_TO_CHAR_STR in kernel space is already expanded to RSDTtZXxKWP,
but it is still RSDTtZX in perf sched tool.

This patch update TASK_STATE_TO_CHAR_STR to the new value in kernel space.

Signed-off-by: Dongsheng <yangds.fnst@cn.fujitsu.com>
---
 tools/perf/builtin-sched.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
index 72ec223..030b7d0 100644
--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -66,7 +66,7 @@ struct sched_atom {
 	struct task_desc	*wakee;
 };
 
-#define TASK_STATE_TO_CHAR_STR "RSDTtZX"
+#define TASK_STATE_TO_CHAR_STR "RSDTtZXxKWP"
 
 enum thread_state {
 	THREAD_SLEEPING = 0,
-- 
1.8.2.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 8/8] perf tools: Clarify the output of perf sched map.
  2014-04-15 12:32 [PATCH 0/8] perf sched: Add trace event for sched wait Dongsheng Yang
                   ` (6 preceding siblings ...)
  2014-04-15 12:32 ` [PATCH 7/8] perf tools: Adapt the TASK_STATE_TO_CHAR_STR to new value in kernel space Dongsheng Yang
@ 2014-04-15 12:32 ` Dongsheng Yang
  2014-04-15 13:54 ` [PATCH 0/8] perf sched: Add trace event for sched wait Peter Zijlstra
  8 siblings, 0 replies; 42+ messages in thread
From: Dongsheng Yang @ 2014-04-15 12:32 UTC (permalink / raw)
  To: rostedt, fweisbec, mingo, peterz, acme; +Cc: linux-kernel, Dongsheng

From: Dongsheng <yangds.fnst@cn.fujitsu.com>

In output of perf sched map, any shortname of thread will be explained
at the first time when it appear.

Example:
	      *A0       228836.978985 secs A0 => perf:23032
          *.   A0       228836.979016 secs B0 => swapper:0
           .  *C0       228836.979099 secs C0 => migration/3:22
  *A0      .   C0       228836.979115 secs
   A0      .  *.        228836.979115 secs

But B0, which is explained as swapper:0 did not appear in the
left part of output. Instead, we use '.' as the shortname of
swapper:0. So the comment of "B0 => swapper:0" is not easy to
understand.

This patch clarify the output of perf sched map with not allocating
one letter-number shortname for swapper:0 and print ". => swapper:0"
as the explaination for swapper:0.

Example:
              *A0       228836.978985 secs A0 => perf:23032
          * .  A0       228836.979016 secs . => swapper:0
            . *B0       228836.979099 secs B0 => migration/3:22
  *A0       .  B0       228836.979115 secs
   A0       . * .       228836.979115 secs
   A0     *C0   .       228836.979225 secs C0 => ksoftirqd/2:18
   A0     *D0   .       228836.979236 secs D0 => rcu_sched:7

Signed-off-by: Dongsheng <yangds.fnst@cn.fujitsu.com>
---
 tools/perf/builtin-sched.c | 35 +++++++++++++++++++----------------
 1 file changed, 19 insertions(+), 16 deletions(-)

diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
index 030b7d0..727b8a0 100644
--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -1359,17 +1359,23 @@ static int map_switch_event(struct perf_sched *sched, struct perf_evsel *evsel,
 
 	new_shortname = 0;
 	if (!sched_in->shortname[0]) {
-		sched_in->shortname[0] = sched->next_shortname1;
-		sched_in->shortname[1] = sched->next_shortname2;
-
-		if (sched->next_shortname1 < 'Z') {
-			sched->next_shortname1++;
-		} else {
-			sched->next_shortname1='A';
-			if (sched->next_shortname2 < '9') {
-				sched->next_shortname2++;
+		if (!strcmp(thread__comm_str(sched_in), "swapper")) {
+			sched_in->shortname[0] = '.';
+			sched_in->shortname[1] = '\0';
+		}
+		else {
+			sched_in->shortname[0] = sched->next_shortname1;
+			sched_in->shortname[1] = sched->next_shortname2;
+
+			if (sched->next_shortname1 < 'Z') {
+				sched->next_shortname1++;
 			} else {
-				sched->next_shortname2='0';
+				sched->next_shortname1='A';
+				if (sched->next_shortname2 < '9') {
+					sched->next_shortname2++;
+				} else {
+					sched->next_shortname2='0';
+				}
 			}
 		}
 		new_shortname = 1;
@@ -1381,12 +1387,9 @@ static int map_switch_event(struct perf_sched *sched, struct perf_evsel *evsel,
 		else
 			printf("*");
 
-		if (sched->curr_thread[cpu]) {
-			if (sched->curr_thread[cpu]->tid)
-				printf("%2s ", sched->curr_thread[cpu]->shortname);
-			else
-				printf(".  ");
-		} else
+		if (sched->curr_thread[cpu])
+			printf("%2s ", sched->curr_thread[cpu]->shortname);
+		else
 			printf("   ");
 	}
 
-- 
1.8.2.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH 1/8] sched & trace: Add a trace event for wait.
  2014-04-15 12:32 ` [PATCH 1/8] sched & trace: Add a trace event for wait Dongsheng Yang
@ 2014-04-15 13:49   ` Peter Zijlstra
  2014-04-16 14:23     ` Steven Rostedt
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2014-04-15 13:49 UTC (permalink / raw)
  To: Dongsheng Yang; +Cc: rostedt, fweisbec, mingo, acme, linux-kernel

On Tue, Apr 15, 2014 at 09:32:50PM +0900, Dongsheng Yang wrote:
> From: Dongsheng <yangds.fnst@cn.fujitsu.com>
> 

No changelog, and the tracepoint seems useless to me.

> Signed-off-by: Dongsheng <yangds.fnst@cn.fujitsu.com>
> ---
>  include/trace/events/sched.h | 20 ++++++++++++++++++++
>  1 file changed, 20 insertions(+)
> 
> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index 67e1bbf..7aa7d43a 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -51,6 +51,26 @@ TRACE_EVENT(sched_kthread_stop_ret,
>  );
>  
>  /*
> + * Tracepoint for wait:
> + */
> +TRACE_EVENT(sched_wait,
> +
> +	TP_PROTO(struct task_struct *p),
> +
> +	TP_ARGS(p),
> +
> +	TP_STRUCT__entry(
> +		__field(	pid_t,	pid	)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->pid	= p->pid;
> +	),
> +
> +	TP_printk("pid=%d", __entry->pid)
> +);
> +
> +/*
>   * Tracepoint for waking up a task:
>   */
>  DECLARE_EVENT_CLASS(sched_wakeup_template,
> -- 
> 1.8.2.1
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 3/8] sched/wait: Use __add_wait_queue{_tail}_exclusive() as possible.
  2014-04-15 12:32 ` [PATCH 3/8] sched/wait: Use __add_wait_queue{_tail}_exclusive() as possible Dongsheng Yang
@ 2014-04-15 13:49   ` Peter Zijlstra
  2014-04-16  9:51     ` Dongsheng Yang
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2014-04-15 13:49 UTC (permalink / raw)
  To: Dongsheng Yang; +Cc: rostedt, fweisbec, mingo, acme, linux-kernel

On Tue, Apr 15, 2014 at 09:32:52PM +0900, Dongsheng Yang wrote:
> From: Dongsheng <yangds.fnst@cn.fujitsu.com>
> 
> There is already a function in include/linux/wait.h to cover the
> 'exclusive' usage. So we can use it in sched/wait.c to replace
> the opened implementation of it.
> 
> Signed-off-by: Dongsheng <yangds.fnst@cn.fujitsu.com>
> ---
>  kernel/sched/wait.c | 6 ++----
>  1 file changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
> index 283750e..b04827e 100644
> --- a/kernel/sched/wait.c
> +++ b/kernel/sched/wait.c
> @@ -38,9 +38,8 @@ void add_wait_queue_exclusive(wait_queue_head_t *q, wait_queue_t *wait)
>  	unsigned long flags;
>  
>  	trace_sched_wait(current);
> -	wait->flags |= WQ_FLAG_EXCLUSIVE;
>  	spin_lock_irqsave(&q->lock, flags);
> -	__add_wait_queue_tail(q, wait);
> +	__add_wait_queue_tail_exclusive(q, wait);
>  	spin_unlock_irqrestore(&q->lock, flags);
>  }
>  EXPORT_SYMBOL(add_wait_queue_exclusive);
> @@ -192,10 +191,9 @@ prepare_to_wait_exclusive(wait_queue_head_t *q, wait_queue_t *wait, int state)
>  	unsigned long flags;
>  
>  	trace_sched_wait(current);
> -	wait->flags |= WQ_FLAG_EXCLUSIVE;
>  	spin_lock_irqsave(&q->lock, flags);
>  	if (list_empty(&wait->task_list))
> -		__add_wait_queue_tail(q, wait);
> +		__add_wait_queue_tail_exclusive(q, wait);
>  	set_current_state(state);
>  	spin_unlock_irqrestore(&q->lock, flags);

That is not a no-op, if !list_empty() we loose the WQ_flag

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 4/8] sched/core: Skip wakeup when task is already running.
  2014-04-15 12:32 ` [PATCH 4/8] sched/core: Skip wakeup when task is already running Dongsheng Yang
@ 2014-04-15 13:53   ` Peter Zijlstra
  2014-04-16 10:22     ` Dongsheng Yang
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2014-04-15 13:53 UTC (permalink / raw)
  To: Dongsheng Yang; +Cc: rostedt, fweisbec, mingo, acme, linux-kernel

On Tue, Apr 15, 2014 at 09:32:53PM +0900, Dongsheng Yang wrote:
> From: Dongsheng <yangds.fnst@cn.fujitsu.com>
> 
> It is pointless to wake up a running task. Currently, we can
> see it in perf sched latency.
> 
> Signed-off-by: Dongsheng <yangds.fnst@cn.fujitsu.com>
> ---
>  kernel/sched/core.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 268a45e..fc6b644 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1416,6 +1416,9 @@ static void ttwu_activate(struct rq *rq, struct task_struct *p, int en_flags)
>  static void
>  ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
>  {
> +	if (p->state == TASK_RUNNING)
> +		return;
> +
>  	check_preempt_curr(rq, p, wake_flags);
>  	trace_sched_wakeup(p, true);
>  


How can you get there with ->state == RUNNING? try_to_wake_up*() bail
when !(->state & state).

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/8] perf sched: Add trace event for sched wait.
  2014-04-15 12:32 [PATCH 0/8] perf sched: Add trace event for sched wait Dongsheng Yang
                   ` (7 preceding siblings ...)
  2014-04-15 12:32 ` [PATCH 8/8] perf tools: Clarify the output of perf sched map Dongsheng Yang
@ 2014-04-15 13:54 ` Peter Zijlstra
  2014-04-16 10:28   ` Dongsheng Yang
  8 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2014-04-15 13:54 UTC (permalink / raw)
  To: Dongsheng Yang; +Cc: rostedt, fweisbec, mingo, acme, linux-kernel

On Tue, Apr 15, 2014 at 09:32:49PM +0900, Dongsheng Yang wrote:
> Hi all:
> 	This is to solve the bug message shown in perf sched latency.
> 
> 	# perf sched latency|tail
> 	  ksoftirqd/0:3         |      0.597 ms |       57 | avg:    0.004 ms | max:    0.054 ms | max at: 19681.546204 s
> 	  ksoftirqd/1:14        |      0.637 ms |       58 | avg:    0.004 ms | max:    0.066 ms | max at: 19674.687734 s
> 	  irqbalance:349        |      0.429 ms |        1 | avg:    0.004 ms | max:    0.004 ms | max at: 19675.791528 s
> 	  ksoftirqd/3:24        |      0.527 ms |       67 | avg:    0.003 ms | max:    0.011 ms | max at: 19673.285019 s
> 	  migration/3:23        |      0.000 ms |        1 | avg:    0.002 ms | max:    0.002 ms | max at: 19672.055354 s
> 	 -----------------------------------------------------------------------------------------
> 	  TOTAL:                |   4384.616 ms |    36879 |
> 	 ---------------------------------------------------
> 	  INFO: 0.030% state machine bugs (11 out of 36684)
> 
> 	After some investigation, there are two reasons cause this problem.
> 
> 	(1). Sometimes, scheduler will wake up a running task, it is not necessary,
> 	     then I skip the wakeup if task->state is TASK_RUNNING. [4/8]
> 
> 	(2). No tracing for sched wait.
> 	This is a simple graph for task state changing.
> 
> 	----------------		1		----------------
> 	| TASK_RUNNING | ------------------------------>| TASK_RUNNING |
> 	|   (running)  |<------------------------------ |  (wait cpu)  |
> 	----------------		2		----------------
> 		^						|
> 		|4	-------------------------	3	|
> 		|-------|TASK_{UN}INTERRUPTABLE	|<--------------|
> 			|	in wait_rq	|
> 			-------------------------
> 
> 	As the graph shown above, there are four event in scheduling, and
> we currently are tracing 3 of them.
> 
> 	1 & 2: sched:sched_switch
> 	    4: sched:sched_wakeup|sched:sched_wakeup_new
> 
> But about 3, we have no trace event for it.

We do, sched_switch() includes the previous task state. Also there's
tons more than a waitqueue you can get stuck in.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 3/8] sched/wait: Use __add_wait_queue{_tail}_exclusive() as possible.
  2014-04-15 13:49   ` Peter Zijlstra
@ 2014-04-16  9:51     ` Dongsheng Yang
  0 siblings, 0 replies; 42+ messages in thread
From: Dongsheng Yang @ 2014-04-16  9:51 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: rostedt, fweisbec, mingo, acme, linux-kernel

On 04/15/2014 10:49 PM, Peter Zijlstra wrote:
> On Tue, Apr 15, 2014 at 09:32:52PM +0900, Dongsheng Yang wrote:
>> From: Dongsheng <yangds.fnst@cn.fujitsu.com>
>>
>> There is already a function in include/linux/wait.h to cover the
>> 'exclusive' usage. So we can use it in sched/wait.c to replace
>> the opened implementation of it.
>>
>> Signed-off-by: Dongsheng <yangds.fnst@cn.fujitsu.com>
>> ---
>>   kernel/sched/wait.c | 6 ++----
>>   1 file changed, 2 insertions(+), 4 deletions(-)
>>
>> diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
>> index 283750e..b04827e 100644
>> --- a/kernel/sched/wait.c
>> +++ b/kernel/sched/wait.c
>> @@ -38,9 +38,8 @@ void add_wait_queue_exclusive(wait_queue_head_t *q, wait_queue_t *wait)
>>   	unsigned long flags;
>>   
>>   	trace_sched_wait(current);
>> -	wait->flags |= WQ_FLAG_EXCLUSIVE;
>>   	spin_lock_irqsave(&q->lock, flags);
>> -	__add_wait_queue_tail(q, wait);
>> +	__add_wait_queue_tail_exclusive(q, wait);
>>   	spin_unlock_irqrestore(&q->lock, flags);
>>   }
>>   EXPORT_SYMBOL(add_wait_queue_exclusive);
>> @@ -192,10 +191,9 @@ prepare_to_wait_exclusive(wait_queue_head_t *q, wait_queue_t *wait, int state)
>>   	unsigned long flags;
>>   
>>   	trace_sched_wait(current);
>> -	wait->flags |= WQ_FLAG_EXCLUSIVE;
>>   	spin_lock_irqsave(&q->lock, flags);
>>   	if (list_empty(&wait->task_list))
>> -		__add_wait_queue_tail(q, wait);
>> +		__add_wait_queue_tail_exclusive(q, wait);
>>   	set_current_state(state);
>>   	spin_unlock_irqrestore(&q->lock, flags);
> That is not a no-op, if !list_empty() we loose the WQ_flag

Oh, yes, my mistake. Thanx
> .
>


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 4/8] sched/core: Skip wakeup when task is already running.
  2014-04-15 13:53   ` Peter Zijlstra
@ 2014-04-16 10:22     ` Dongsheng Yang
  2014-04-22 11:56       ` Dongsheng Yang
  0 siblings, 1 reply; 42+ messages in thread
From: Dongsheng Yang @ 2014-04-16 10:22 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: rostedt, fweisbec, mingo, acme, linux-kernel

On 04/15/2014 10:53 PM, Peter Zijlstra wrote:
> On Tue, Apr 15, 2014 at 09:32:53PM +0900, Dongsheng Yang wrote:
>
> How can you get there with ->state == RUNNING? try_to_wake_up*() bail
> when !(->state & state).
Yes, try_to_wake_up() did this check. But other callers would miss it.

With the following code ,I can get the actual message of waking up
a running task

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9f63275..1369cae 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1418,8 +1418,10 @@ static void ttwu_activate(struct rq *rq, struct 
task_stru
  static void
  ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
  {
-       if (p->state == TASK_RUNNING)
+       if (p->state == TASK_RUNNING) {
+               printk("Wakeup a running task.");
                 return;
+       }

         check_preempt_curr(rq, p, wake_flags);
         trace_sched_wakeup(p, true);


# grep "Wakeup" /var/log/messages
Apr 15 20:16:21 localhost kernel: [    5.436505] Wakeup a running task.
Apr 15 20:16:21 localhost kernel: [    7.776042] Wakeup a running task.
Apr 15 20:16:21 localhost kernel: [    9.324274] Wakeup a running task.

So, I think there are some caller of ttwu_do_wakeup() is attempt to wake
up a running task.
> .
>


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/8] perf sched: Add trace event for sched wait.
  2014-04-15 13:54 ` [PATCH 0/8] perf sched: Add trace event for sched wait Peter Zijlstra
@ 2014-04-16 10:28   ` Dongsheng Yang
  0 siblings, 0 replies; 42+ messages in thread
From: Dongsheng Yang @ 2014-04-16 10:28 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: rostedt, fweisbec, mingo, acme, linux-kernel

On 04/15/2014 10:54 PM, Peter Zijlstra wrote:
> On Tue, Apr 15, 2014 at 09:32:49PM +0900, Dongsheng Yang wrote:
>> But about 3, we have no trace event for it.
> We do, sched_switch() includes the previous task state. Also there's
> tons more than a waitqueue you can get stuck in.
Okey, prev_state is designed to trace No. 3 situation, but it seems
not working well now from my test result. I will look into it and dig
the reason out. Thanx for your help.
> .
>


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 1/8] sched & trace: Add a trace event for wait.
  2014-04-15 13:49   ` Peter Zijlstra
@ 2014-04-16 14:23     ` Steven Rostedt
  0 siblings, 0 replies; 42+ messages in thread
From: Steven Rostedt @ 2014-04-16 14:23 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Dongsheng Yang, fweisbec, mingo, acme, linux-kernel

On Tue, 15 Apr 2014 15:49:16 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Tue, Apr 15, 2014 at 09:32:50PM +0900, Dongsheng Yang wrote:
> > From: Dongsheng <yangds.fnst@cn.fujitsu.com>
> > 
> 
> No changelog, and the tracepoint seems useless to me.
> 

At least patch 1 and 2 need to be combined.

-- Steve

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 4/8] sched/core: Skip wakeup when task is already running.
  2014-04-16 10:22     ` Dongsheng Yang
@ 2014-04-22 11:56       ` Dongsheng Yang
  2014-04-22 13:23         ` Peter Zijlstra
  2014-04-22 17:10         ` bsegall
  0 siblings, 2 replies; 42+ messages in thread
From: Dongsheng Yang @ 2014-04-22 11:56 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: rostedt, fweisbec, mingo, acme, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 4338 bytes --]

On 04/16/2014 07:22 PM, Dongsheng Yang wrote:
> On 04/15/2014 10:53 PM, Peter Zijlstra wrote:
>> On Tue, Apr 15, 2014 at 09:32:53PM +0900, Dongsheng Yang wrote:
>>
>> How can you get there with ->state == RUNNING? try_to_wake_up*() bail
>> when !(->state & state).
> Yes, try_to_wake_up() did this check. But other callers would miss it.
>
> With the following code ,I can get the actual message of waking up
> a running task
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 9f63275..1369cae 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1418,8 +1418,10 @@ static void ttwu_activate(struct rq *rq, struct 
> task_stru
>  static void
>  ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
>  {
> -       if (p->state == TASK_RUNNING)
> +       if (p->state == TASK_RUNNING) {
> +               printk("Wakeup a running task.");
>                 return;
> +       }
>
>         check_preempt_curr(rq, p, wake_flags);
>         trace_sched_wakeup(p, true);
>
>
> # grep "Wakeup" /var/log/messages
> Apr 15 20:16:21 localhost kernel: [    5.436505] Wakeup a running task.
> Apr 15 20:16:21 localhost kernel: [    7.776042] Wakeup a running task.
> Apr 15 20:16:21 localhost kernel: [    9.324274] Wakeup a running task.

Hi Peter, after some more investigation, I think I got the problem, 
which is that
some other task set p->state to TASK_RUNNING without holding p->pi_lock.

Scenario as attached graph shown, if some other task set p->state to
TASK_RUNNING after the check  if (! (p->state & state)), then we are
wasting time to wake up a running task in try_to_wake_up().

If the analyse is right, I think there are two methods to solve this 
problem:
     * Skip in ttwu_do_wakeup() when p->state is running, as what my patch
did.
     * Add a locking when we set p->state, lots of work to do and I am 
afraid
it will hurt the performance of kernel.


The following message is the backtrace info I got when it happened:

(gdb) bt
#0  try_to_wake_up (p=0xffff88027e651930, state=1, wake_flags=0)
     at kernel/sched/core.c:1605
#1  0xffffffff81099532 in default_wake_function (curr=<value optimized 
out>,
     mode=<value optimized out>, wake_flags=<value optimized out>,
     key=<value optimized out>) at kernel/sched/core.c:2853
#2  0xffffffff810aa489 in __wake_up_common (q=0xffff88027f03f210, mode=1,
     nr_exclusive=1, wake_flags=0, key=0x4) at kernel/sched/wait.c:75
#3  0xffffffff810aa838 in __wake_up (q=0xffff88027f03f210, mode=1,
     nr_exclusive=1, key=0x4) at kernel/sched/wait.c:97
#4  0xffffffff813cd0a4 in n_tty_check_unthrottle (tty=0xffff88027f03ec00,
     file=0xffff880278ab1b00,
     buf=0x7fff0fcf9720 "\r\nyum install -y 
./a/alsa-plugins-pulseaudio-1.0.27-1.fc19.x86_64.rpml -y 
./a/alsa-lib-1.0.27.1-2.fc19.x86_64.rpm.noarch.rpm\r\nyum install -y 
./a/albatross-xfwm4-theme-1.2-5.fc19.noarch.rpm\r\nyum instal"...,
     nr=16315) at drivers/tty/n_tty.c:280
#5  n_tty_read (tty=0xffff88027f03ec00, file=0xffff880278ab1b00,
     buf=0x7fff0fcf9720 "\r\nyum install -y 
./a/alsa-plugins-pulseaudio-1.0.27-1.fc19.x86_64.rpml -y 
./a/alsa-lib-1.0.27.1-2.fc19.x86_64.rpm.noarch.rpm\r\nyum install -y 
./a/albatross-xfwm4-theme-1.2-5.fc19.noarch.rpm\r\nyum instal"...,
     nr=16315) at drivers/tty/n_tty.c:2259
#6  0xffffffff813c5667 in tty_read (file=0xffff880278ab1b00,
     buf=0x7fff0fcf9720 "\r\nyum install -y 
./a/alsa-plugins-pulseaudio-1.0.27-1.fc19.x86_64.rpml -y 
./a/alsa-lib-1.0.27.1-2.fc19.x86_64.rpm.noarch.rpm\r\nyum in---Type 
<return> to continue, or q <return> to quit---q
Quit
(gdb) p p->state   ------> Currently, p->state is TASK_RUNNING.
$1 = 0
(gdb) l
1600
1601        success = 1; /* we're going to change ->state */
1602        cpu = task_cpu(p);
1603
1604        if (p->state == TASK_RUNNING) {
1605            printk("Wake up a running task.");
1606        }
1607        if (p->on_rq && ttwu_remote(p, wake_flags))
1608            goto stat;
1609
(gdb)
>
> So, I think there are some caller of ttwu_do_wakeup() is attempt to wake
> up a running task.
>> .
>>
>
> -- 
> To unsubscribe from this list: send the line "unsubscribe 
> linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> .
>


[-- Attachment #2: graph --]
[-- Type: text/plain, Size: 279 bytes --]

			CPU1							CPU2
	raw_spin_lock_irqsave(&p->pi_lock, flags);		set_task_state(tsk, TASK_INTERRUPTIBLE);
	if (!(p->state & state)) ---> TASK_INTERRUPTIBLE			 |
                goto out;							 |
	cpu = task_cpu(p);					set_task_state(tsk, TASK_RUNNING);
			... ---> TASK_RUNNING

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 4/8] sched/core: Skip wakeup when task is already running.
  2014-04-22 11:56       ` Dongsheng Yang
@ 2014-04-22 13:23         ` Peter Zijlstra
  2014-04-22 17:10         ` bsegall
  1 sibling, 0 replies; 42+ messages in thread
From: Peter Zijlstra @ 2014-04-22 13:23 UTC (permalink / raw)
  To: Dongsheng Yang; +Cc: rostedt, fweisbec, mingo, acme, linux-kernel

On Tue, Apr 22, 2014 at 08:56:11PM +0900, Dongsheng Yang wrote:

> >On 04/15/2014 10:53 PM, Peter Zijlstra wrote:
> >>On Tue, Apr 15, 2014 at 09:32:53PM +0900, Dongsheng Yang wrote:
> >>
> >>How can you get there with ->state == RUNNING? try_to_wake_up*() bail
> >>when !(->state & state).

> Hi Peter, after some more investigation, I think I got the problem, which is
> that
> some other task set p->state to TASK_RUNNING without holding p->pi_lock.

That would be a plain bug, you're only ever allowed setting your own
->state, never someone else's. That is:

  for (;;) {
    set_current_state(TASK_INTERRUPTIBLE);
    if (cond)
      break;
    schedule();
  }
  __set_current_state(TASK_RUNNING);

Is the 'normal' way of blocking, and that is very much only touching its
own state. And in the above you cannot trigger the case you're seeing,
because the ->state = TASK_RUNNING, at the end, is _after_ its already
running, so we've completed the wakeup.

So I think you've spotted something rotten and we should catch/fix that.

Your backtrace only shows where we find ->state == TASK_RUNNING, not
where it became such.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 4/8] sched/core: Skip wakeup when task is already running.
  2014-04-22 11:56       ` Dongsheng Yang
  2014-04-22 13:23         ` Peter Zijlstra
@ 2014-04-22 17:10         ` bsegall
  2014-04-22 17:53           ` Steven Rostedt
  2014-04-22 18:18           ` Peter Zijlstra
  1 sibling, 2 replies; 42+ messages in thread
From: bsegall @ 2014-04-22 17:10 UTC (permalink / raw)
  To: Dongsheng Yang
  Cc: Peter Zijlstra, rostedt, fweisbec, mingo, acme, linux-kernel

Dongsheng Yang <yangds.fnst@cn.fujitsu.com> writes:

> On 04/16/2014 07:22 PM, Dongsheng Yang wrote:
>> On 04/15/2014 10:53 PM, Peter Zijlstra wrote:
>>> On Tue, Apr 15, 2014 at 09:32:53PM +0900, Dongsheng Yang wrote:
>>>
>>> How can you get there with ->state == RUNNING? try_to_wake_up*() bail
>>> when !(->state & state).
>> Yes, try_to_wake_up() did this check. But other callers would miss it.
>>
>> With the following code ,I can get the actual message of waking up
>> a running task
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 9f63275..1369cae 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -1418,8 +1418,10 @@ static void ttwu_activate(struct rq *rq, struct
>> task_stru
>>  static void
>>  ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
>>  {
>> -       if (p->state == TASK_RUNNING)
>> +       if (p->state == TASK_RUNNING) {
>> +               printk("Wakeup a running task.");
>>                 return;
>> +       }
>>
>>         check_preempt_curr(rq, p, wake_flags);
>>         trace_sched_wakeup(p, true);
>>
>>
>> # grep "Wakeup" /var/log/messages
>> Apr 15 20:16:21 localhost kernel: [    5.436505] Wakeup a running task.
>> Apr 15 20:16:21 localhost kernel: [    7.776042] Wakeup a running task.
>> Apr 15 20:16:21 localhost kernel: [    9.324274] Wakeup a running task.
>
> Hi Peter, after some more investigation, I think I got the problem, which is
> that
> some other task set p->state to TASK_RUNNING without holding p->pi_lock.
>
> Scenario as attached graph shown, if some other task set p->state to
> TASK_RUNNING after the check  if (! (p->state & state)), then we are
> wasting time to wake up a running task in try_to_wake_up().
>
> If the analyse is right, I think there are two methods to solve this problem:
>     * Skip in ttwu_do_wakeup() when p->state is running, as what my patch
> did.
>     * Add a locking when we set p->state, lots of work to do and I am afraid
> it will hurt the performance of kernel.

This is all expected behavior, and the somewhat less than useful trace
events are expected. A task setting p->state to TASK_RUNNING without
locks is fine if and only p == current. The standard deschedule loop is
basically:

while (1) {
  set_current_state(TASK_(UN)INTERRUPTIBLE);
  if (should_still_sleep)
    schedule();
}
set_current_state(TASK_RUNNING);

Which can produce this in a race.

The only problem this causes is a wasted check_preempt_curr call in the
racing case, and a somewhat inaccurate sched:sched_wakeup trace event.
Note that even if you did recheck in ttwu_do_wakeup you could still race
and get an "inaccurate" trace event. Heck, even if the ttwu is
_necessary_ because p is currently trying to take rq->lock to
deschedule, you won't get a matching sched_switch event, because the
ttwu is running before schedule is.

You could sorta fix this I guess by tracking every write to p->state
with trace events, but that would be a somewhat different change, and
might be considered too expensive for all I know (and the trace events
could /still/ be resolved in a different order across cpus compared to
p->state's memory).

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 4/8] sched/core: Skip wakeup when task is already running.
  2014-04-22 17:10         ` bsegall
@ 2014-04-22 17:53           ` Steven Rostedt
  2014-04-22 18:18           ` Peter Zijlstra
  1 sibling, 0 replies; 42+ messages in thread
From: Steven Rostedt @ 2014-04-22 17:53 UTC (permalink / raw)
  To: bsegall
  Cc: Dongsheng Yang, Peter Zijlstra, fweisbec, mingo, acme, linux-kernel

On Tue, 22 Apr 2014 10:10:52 -0700
bsegall@google.com wrote:


> This is all expected behavior, and the somewhat less than useful trace
> events are expected. A task setting p->state to TASK_RUNNING without
> locks is fine if and only p == current. The standard deschedule loop is

Sure, and if you are not current, then all you need is the rq lock of
the rq that p's CPU is for.

> basically:
> 
> while (1) {
>   set_current_state(TASK_(UN)INTERRUPTIBLE);

Yep, and set_current_state() implies a memory barrier.

>   if (should_still_sleep)
>     schedule();
> }
> set_current_state(TASK_RUNNING);

The above can use __set_current_state() as there's no races to deal
with when setting current's state to RUNNING.

> 
> Which can produce this in a race.
> 
> The only problem this causes is a wasted check_preempt_curr call in the
> racing case, and a somewhat inaccurate sched:sched_wakeup trace event.
> Note that even if you did recheck in ttwu_do_wakeup you could still race
> and get an "inaccurate" trace event. Heck, even if the ttwu is
> _necessary_ because p is currently trying to take rq->lock to
> deschedule, you won't get a matching sched_switch event, because the
> ttwu is running before schedule is.
> 
> You could sorta fix this I guess by tracking every write to p->state
> with trace events, but that would be a somewhat different change, and
> might be considered too expensive for all I know (and the trace events
> could /still/ be resolved in a different order across cpus compared to
> p->state's memory).

Yeah, let's not do that.

-- Steve


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 4/8] sched/core: Skip wakeup when task is already running.
  2014-04-22 17:10         ` bsegall
  2014-04-22 17:53           ` Steven Rostedt
@ 2014-04-22 18:18           ` Peter Zijlstra
  2014-05-05  6:32             ` Dongsheng Yang
  1 sibling, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2014-04-22 18:18 UTC (permalink / raw)
  To: bsegall; +Cc: Dongsheng Yang, rostedt, fweisbec, mingo, acme, linux-kernel

On Tue, Apr 22, 2014 at 10:10:52AM -0700, bsegall@google.com wrote:
> This is all expected behavior, and the somewhat less than useful trace
> events are expected. A task setting p->state to TASK_RUNNING without
> locks is fine if and only p == current. The standard deschedule loop is
> basically:
> 
> while (1) {
>   set_current_state(TASK_(UN)INTERRUPTIBLE);
>   if (should_still_sleep)
>     schedule();
> }
> set_current_state(TASK_RUNNING);
> 
> Which can produce this in a race.
> 
> The only problem this causes is a wasted check_preempt_curr call in the
> racing case, and a somewhat inaccurate sched:sched_wakeup trace event.
> Note that even if you did recheck in ttwu_do_wakeup you could still race
> and get an "inaccurate" trace event. Heck, even if the ttwu is
> _necessary_ because p is currently trying to take rq->lock to
> deschedule, you won't get a matching sched_switch event, because the
> ttwu is running before schedule is.
> 
> You could sorta fix this I guess by tracking every write to p->state
> with trace events, but that would be a somewhat different change, and
> might be considered too expensive for all I know (and the trace events
> could /still/ be resolved in a different order across cpus compared to
> p->state's memory).

Ah, you're saying that a second task could try a spurious wakeup between
set_current_state() and schedule(). Yes, that'll trigger this indeed.



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 4/8] sched/core: Skip wakeup when task is already running.
  2014-04-22 18:18           ` Peter Zijlstra
@ 2014-05-05  6:32             ` Dongsheng Yang
  2014-05-05  6:34               ` [PATCH] sched: Move the wakeup tracepoint from ttwu_do_wakeup() to ttwu_activate() Dongsheng Yang
  0 siblings, 1 reply; 42+ messages in thread
From: Dongsheng Yang @ 2014-05-05  6:32 UTC (permalink / raw)
  To: Peter Zijlstra, bsegall, rostedt; +Cc: fweisbec, mingo, acme, linux-kernel

Hi all,
Thanx for your reply, and sorry for the late.

On 04/23/2014 03:18 AM, Peter Zijlstra wrote:
> On Tue, Apr 22, 2014 at 10:10:52AM -0700, bsegall@google.com wrote:
>> This is all expected behavior, and the somewhat less than useful trace
>> events are expected. A task setting p->state to TASK_RUNNING without
>> locks is fine if and only p == current. The standard deschedule loop is
>> basically:
>>
>> while (1) {
>>    set_current_state(TASK_(UN)INTERRUPTIBLE);
>>    if (should_still_sleep)
>>      schedule();
>> }
>> set_current_state(TASK_RUNNING);
>>
>> Which can produce this in a race.
>>
>> The only problem this causes is a wasted check_preempt_curr call in the
>> racing case, and a somewhat inaccurate sched:sched_wakeup trace event.
>> Note that even if you did recheck in ttwu_do_wakeup you could still race
>> and get an "inaccurate" trace event.

Yes, even I recheck it in ttwu_do_wakeup(), I could still race.

Now I can not catch up a good idea to avoid this race.
But I think it is not too expensive.

About the event of sched:sched_wakeup we can get the bug information
as I mentioned at the first mail of this thread:

[root@yds-PC linux]# perf sched latency|tail
bash:25186 | 0.410 ms | 1 | avg: 0.000 ms | max: 0.000 ms | max at: 
0.000000 s
bash:25174 | 0.407 ms | 1 | avg: 0.000 ms | max: 0.000 ms | max at: 
0.000000 s
bash:25268 | 0.367 ms | 1 | avg: 0.000 ms | max: 0.000 ms | max at: 
0.000000 s
bash:25279 | 0.358 ms | 1 | avg: 0.000 ms | max: 0.000 ms | max at: 
0.000000 s
bash:25238 | 0.286 ms | 1 | avg: 0.000 ms | max: 0.000 ms | max at: 
0.000000 s
-----------------------------------------------------------------------------------------
TOTAL: | 1344.539 ms | 9604 |
---------------------------------------------------
*INFO: 0.557% state machine bugs (51 out of 9161)*

It is caused by the issue we discussed in this thread, that ttwu_do_wakeup()
could be called to wakeup a task which is on run queue.

There are two solutions in my mind:
* Add a new trace event, such as sched:sched_enqueue. Then we can trace
the event of it and get the timestamp when a task enqueue and start waiting
cpu. In this way, we can calculate the latency time with
(sched_in_time - sched_enqueue_time) in `perf sched latency`.

* Move the current trace point from ttwu_do_wakeup() to
ttwu_activate(). Currently the sched:sched_wakeup can tell user very little.
When we get a sched:sched_wakeup:
a) We can not say a task is inserted into run queue, it is also used for 
task
which is on_rq and only change the task->state to TASK_RUNNING.
b) We can not say the task->state is changed from {UN}INTERRUPTABLE to
RUNNING, sometimes task->state is already changed to RUNNING by other cpu.

I prefer the second one, anyway, current sched_wakeup tells user none.

>> Heck, even if the ttwu is
>> _necessary_ because p is currently trying to take rq->lock to
>> deschedule, you won't get a matching sched_switch event, because the
>> ttwu is running before schedule is.
>>
>> You could sorta fix this I guess by tracking every write to p->state
>> with trace events, but that would be a somewhat different change, and
>> might be considered too expensive for all I know (and the trace events
>> could /still/ be resolved in a different order across cpus compared to
>> p->state's memory).
> Ah, you're saying that a second task could try a spurious wakeup between
> set_current_state() and schedule(). Yes, that'll trigger this indeed.
>
>
> .
>


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH] sched: Move the wakeup tracepoint from ttwu_do_wakeup() to ttwu_activate().
  2014-05-05  6:32             ` Dongsheng Yang
@ 2014-05-05  6:34               ` Dongsheng Yang
  2014-05-05 14:00                 ` Steven Rostedt
  0 siblings, 1 reply; 42+ messages in thread
From: Dongsheng Yang @ 2014-05-05  6:34 UTC (permalink / raw)
  To: peterz, bsegall, rostedt; +Cc: mingo, linux-kernel, Dongsheng Yang

The original design of sched:sched_wakeup is to trace the event inserting
a task into run queue. It means we can know that the state of this task is
changed from "sleeping" to "waiting_cpu". Then we can calculate the delay time
from this task on_rq = 1 to running on cpu in `perf sched latency`.

But currently, sched:sched_wakeup event is tracing ttwu_do_wakeup() and this
function only set the p->state to TASK_RUNNING. This trace point can tell user
very little. When get a event of sched:sched_wakeup, user know that kernel call
ttwu_do_wakeup() once, but we can not say that task start to wait cpu from now
on. Maybe it did not dequeue at all, in this case we will get a wrong latency
time calculated by "sched_in_time - wakeup_time".

Anyway, current sched:sched_wakeup event can tell user very little.
When we get a sched:sched_wakeup:
* We can not say a task is inserted into run queue, it is also used for task
which is on_rq and only change the task->state to TASK_RUNNING.
* We can not say the task->state is changed from {UN}INTERRUPTABLE to RUNNING,
sometimes task->state is already changed to RUNNING by other cpu.

As explained above, this patch move the trace point of sched:sched_wakeup from
ttwu_do_wakeup() to ttwu_activate(), then when we get an event of sched_wakeup,
we can say that a task enqueued and started waiting cpu to run.

Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
---
 kernel/sched/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9074c6d..0cae994 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1420,6 +1420,7 @@ static void ttwu_activate(struct rq *rq, struct task_struct *p, int en_flags)
 {
 	activate_task(rq, p, en_flags);
 	p->on_rq = 1;
+	trace_sched_wakeup(p, true);
 
 	/* if a worker is waking up, notify workqueue */
 	if (p->flags & PF_WQ_WORKER)
@@ -1433,7 +1434,6 @@ static void
 ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
 {
 	check_preempt_curr(rq, p, wake_flags);
-	trace_sched_wakeup(p, true);
 
 	p->state = TASK_RUNNING;
 #ifdef CONFIG_SMP
-- 
1.8.2.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH] sched: Move the wakeup tracepoint from ttwu_do_wakeup() to ttwu_activate().
  2014-05-05  6:34               ` [PATCH] sched: Move the wakeup tracepoint from ttwu_do_wakeup() to ttwu_activate() Dongsheng Yang
@ 2014-05-05 14:00                 ` Steven Rostedt
  2014-05-06  0:19                   ` Dongsheng Yang
  0 siblings, 1 reply; 42+ messages in thread
From: Steven Rostedt @ 2014-05-05 14:00 UTC (permalink / raw)
  To: Dongsheng Yang; +Cc: peterz, bsegall, mingo, linux-kernel

On Mon, 5 May 2014 15:34:07 +0900
Dongsheng Yang <yangds.fnst@cn.fujitsu.com> wrote:

> The original design of sched:sched_wakeup is to trace the event inserting
> a task into run queue. It means we can know that the state of this task is
> changed from "sleeping" to "waiting_cpu". Then we can calculate the delay time
> from this task on_rq = 1 to running on cpu in `perf sched latency`.
> 
> But currently, sched:sched_wakeup event is tracing ttwu_do_wakeup() and this
> function only set the p->state to TASK_RUNNING. This trace point can tell user
> very little. When get a event of sched:sched_wakeup, user know that kernel call
> ttwu_do_wakeup() once, but we can not say that task start to wait cpu from now
> on. Maybe it did not dequeue at all, in this case we will get a wrong latency
> time calculated by "sched_in_time - wakeup_time".
> 
> Anyway, current sched:sched_wakeup event can tell user very little.
> When we get a sched:sched_wakeup:
> * We can not say a task is inserted into run queue, it is also used for task
> which is on_rq and only change the task->state to TASK_RUNNING.
> * We can not say the task->state is changed from {UN}INTERRUPTABLE to RUNNING,
> sometimes task->state is already changed to RUNNING by other cpu.
> 
> As explained above, this patch move the trace point of sched:sched_wakeup from
> ttwu_do_wakeup() to ttwu_activate(), then when we get an event of sched_wakeup,
> we can say that a task enqueued and started waiting cpu to run.
> 

tldr;

ttwu_do_wakeup() is called when the task's state is switched back to
TASK_RUNNING, whether or not the task actually scheduled out. Tracing
the wakeup event when the task never scheduled out is quite confusing.
It should only trace the task wake up if the task actually did go to
sleep. Move the tracepoint from ttwu_do_wakeup() to ttwu_activate()
where it is called only if the task is really woken up and not just
have its state changed.

-- Steve


> Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
> ---
>  kernel/sched/core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 9074c6d..0cae994 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1420,6 +1420,7 @@ static void ttwu_activate(struct rq *rq, struct task_struct *p, int en_flags)
>  {
>  	activate_task(rq, p, en_flags);
>  	p->on_rq = 1;
> +	trace_sched_wakeup(p, true);
>  
>  	/* if a worker is waking up, notify workqueue */
>  	if (p->flags & PF_WQ_WORKER)
> @@ -1433,7 +1434,6 @@ static void
>  ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
>  {
>  	check_preempt_curr(rq, p, wake_flags);
> -	trace_sched_wakeup(p, true);
>  
>  	p->state = TASK_RUNNING;
>  #ifdef CONFIG_SMP


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] sched: Move the wakeup tracepoint from ttwu_do_wakeup() to ttwu_activate().
  2014-05-05 14:00                 ` Steven Rostedt
@ 2014-05-06  0:19                   ` Dongsheng Yang
  2014-05-06  0:26                     ` Dongsheng Yang
  2014-05-06  2:06                     ` Steven Rostedt
  0 siblings, 2 replies; 42+ messages in thread
From: Dongsheng Yang @ 2014-05-06  0:19 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: peterz, bsegall, mingo, linux-kernel

On 05/05/2014 11:00 PM, Steven Rostedt wrote:
> On Mon, 5 May 2014 15:34:07 +0900
> Dongsheng Yang <yangds.fnst@cn.fujitsu.com> wrote:
>
>> The original design of sched:sched_wakeup is to trace the event inserting
>> a task into run queue. It means we can know that the state of this task is
>> changed from "sleeping" to "waiting_cpu". Then we can calculate the delay time
>> from this task on_rq = 1 to running on cpu in `perf sched latency`.
>>
>> But currently, sched:sched_wakeup event is tracing ttwu_do_wakeup() and this
>> function only set the p->state to TASK_RUNNING. This trace point can tell user
>> very little. When get a event of sched:sched_wakeup, user know that kernel call
>> ttwu_do_wakeup() once, but we can not say that task start to wait cpu from now
>> on. Maybe it did not dequeue at all, in this case we will get a wrong latency
>> time calculated by "sched_in_time - wakeup_time".
>>
>> Anyway, current sched:sched_wakeup event can tell user very little.
>> When we get a sched:sched_wakeup:
>> * We can not say a task is inserted into run queue, it is also used for task
>> which is on_rq and only change the task->state to TASK_RUNNING.
>> * We can not say the task->state is changed from {UN}INTERRUPTABLE to RUNNING,
>> sometimes task->state is already changed to RUNNING by other cpu.
>>
>> As explained above, this patch move the trace point of sched:sched_wakeup from
>> ttwu_do_wakeup() to ttwu_activate(), then when we get an event of sched_wakeup,
>> we can say that a task enqueued and started waiting cpu to run.
>>
> tldr;

Yes, it is really loooong, sorry for my terrible expression in english. :(
>
> ttwu_do_wakeup() is called when the task's state is switched back to
> TASK_RUNNING, whether or not the task actually scheduled out. Tracing
> the wakeup event when the task never scheduled out is quite confusing.
> It should only trace the task wake up if the task actually did go to
> sleep. Move the tracepoint from ttwu_do_wakeup() to ttwu_activate()
> where it is called only if the task is really woken up and not just
> have its state changed.

Thanx for your kind comment here, it looks shorter and more clear to me.
I will update my commit message with your suggestion.

Thank you very much! :)
>
> -- Steve
>
>
>> Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
>> ---
>>   kernel/sched/core.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 9074c6d..0cae994 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -1420,6 +1420,7 @@ static void ttwu_activate(struct rq *rq, struct task_struct *p, int en_flags)
>>   {
>>   	activate_task(rq, p, en_flags);
>>   	p->on_rq = 1;
>> +	trace_sched_wakeup(p, true);
>>   
>>   	/* if a worker is waking up, notify workqueue */
>>   	if (p->flags & PF_WQ_WORKER)
>> @@ -1433,7 +1434,6 @@ static void
>>   ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
>>   {
>>   	check_preempt_curr(rq, p, wake_flags);
>> -	trace_sched_wakeup(p, true);
>>   
>>   	p->state = TASK_RUNNING;
>>   #ifdef CONFIG_SMP
> .
>


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH] sched: Move the wakeup tracepoint from ttwu_do_wakeup() to ttwu_activate().
  2014-05-06  0:19                   ` Dongsheng Yang
@ 2014-05-06  0:26                     ` Dongsheng Yang
  2014-05-06  2:06                     ` Steven Rostedt
  1 sibling, 0 replies; 42+ messages in thread
From: Dongsheng Yang @ 2014-05-06  0:26 UTC (permalink / raw)
  To: rostedt; +Cc: peterz, bsegall, mingo, linux-kernel, Dongsheng Yang

ttwu_do_wakeup() is called when the task's state is switched back to
TASK_RUNNING, whether or not the task actually scheduled out. Tracing
the wakeup event when the task never scheduled out is quite confusing.
It should only trace the task wake up if the task actually did go to
sleep. Move the tracepoint from ttwu_do_wakeup() to ttwu_activate()
where it is called only if the task is really woken up and not just
have its state changed.

Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
---
 kernel/sched/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9074c6d..0cae994 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1420,6 +1420,7 @@ static void ttwu_activate(struct rq *rq, struct task_struct *p, int en_flags)
 {
 	activate_task(rq, p, en_flags);
 	p->on_rq = 1;
+	trace_sched_wakeup(p, true);
 
 	/* if a worker is waking up, notify workqueue */
 	if (p->flags & PF_WQ_WORKER)
@@ -1433,7 +1434,6 @@ static void
 ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
 {
 	check_preempt_curr(rq, p, wake_flags);
-	trace_sched_wakeup(p, true);
 
 	p->state = TASK_RUNNING;
 #ifdef CONFIG_SMP
-- 
1.8.2.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH] sched: Move the wakeup tracepoint from ttwu_do_wakeup() to ttwu_activate().
  2014-05-06  2:06                     ` Steven Rostedt
@ 2014-05-06  1:29                       ` Dongsheng Yang
  2014-05-06  1:52                         ` [PATCH] sched: Distinguish sched_wakeup event when wake up a task which did schedule out or not Dongsheng Yang
  0 siblings, 1 reply; 42+ messages in thread
From: Dongsheng Yang @ 2014-05-06  1:29 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: peterz, bsegall, mingo, linux-kernel

On 05/06/2014 11:06 AM, Steven Rostedt wrote:
> On Tue, 6 May 2014 09:19:51 +0900
> Dongsheng Yang <yangds.fnst@cn.fujitsu.com> wrote:
>
> I wonder if we should have the event, or way to distinguish the
> difference. Hmm, there's that "success" parameter in the tracepoint.
> Could we possible be able to trace events where the success is true
> only if it was actually waking the event, and false otherwise?
>
> Having the sched_wakeup trace event show you when something woke the
> task up may still be useful information. For example, you add yourself
> to a wait queue and want to see the "wakeup". If we only show it for
> tasks that really woke up then we wont see it for those that added
> itself to a waitqueue but was "woken" before it could schedule out.
>
> The original sched_wakeup did this, but with the ttwu rewrite, it was
> lost.

Ha, got what you mean. Yes, we can take the use of success in sched_wakeup
event.
>
> Something like below?
>
> -- Steve
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 268a45e..e583989 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1404,6 +1404,7 @@ static void ttwu_activate(struct rq *rq, struct task_struct *p, int en_flags)
>   {
>   	activate_task(rq, p, en_flags);
>   	p->on_rq = 1;
> +	trace_sched_wakeup(p, true);
>   
>   	/* if a worker is waking up, notify workqueue */
>   	if (p->flags & PF_WQ_WORKER)
> @@ -1417,7 +1418,6 @@ static void
>   ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
>   {
>   	check_preempt_curr(rq, p, wake_flags);
> -	trace_sched_wakeup(p, true);
>   
>   	p->state = TASK_RUNNING;
>   #ifdef CONFIG_SMP
> @@ -1662,6 +1662,8 @@ static void try_to_wake_up_local(struct task_struct *p)
>   
>   	if (!p->on_rq)
>   		ttwu_activate(rq, p, ENQUEUE_WAKEUP);
> +	else
> +		trace_sched_wakeup(p, false);

Also we should add a false trace point in ttwu_remote().
>   
>   	ttwu_do_wakeup(rq, p, 0);
>   	ttwu_stat(p, smp_processor_id(), 0);

Thanx steven, I will send a new patch as you suggested here.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> .
>


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH] sched: Distinguish sched_wakeup event when wake up a task which did schedule out or not.
  2014-05-06  1:29                       ` Dongsheng Yang
@ 2014-05-06  1:52                         ` Dongsheng Yang
  2014-05-09  0:16                           ` Dongsheng Yang
  2014-05-10 15:29                           ` Peter Zijlstra
  0 siblings, 2 replies; 42+ messages in thread
From: Dongsheng Yang @ 2014-05-06  1:52 UTC (permalink / raw)
  To: rostedt; +Cc: peterz, bsegall, mingo, linux-kernel, Dongsheng Yang

ttwu_do_wakeup() is called when the task's state is switched back to
TASK_RUNNING, whether or not the task actually scheduled out. Tracing
the wakeup event when the task never scheduled out is quite confusing.

This patch take the use of 'success' parameter in sched_wakeup tracepoint,
to indicate that the task we are waking up did schedule out or just
change its state to TASK_RUNNING.

success is true:
	task was out of run queue and sleeping, we are really wake it up.
success is false:
	taks was on run queue all the time and we just change its state
	to TASK_RUNNING.

Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
---
 kernel/sched/core.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9074c6d..14b9fe4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1420,6 +1420,7 @@ static void ttwu_activate(struct rq *rq, struct task_struct *p, int en_flags)
 {
 	activate_task(rq, p, en_flags);
 	p->on_rq = 1;
+	trace_sched_wakeup(p, true);
 
 	/* if a worker is waking up, notify workqueue */
 	if (p->flags & PF_WQ_WORKER)
@@ -1433,7 +1434,6 @@ static void
 ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
 {
 	check_preempt_curr(rq, p, wake_flags);
-	trace_sched_wakeup(p, true);
 
 	p->state = TASK_RUNNING;
 #ifdef CONFIG_SMP
@@ -1481,6 +1481,7 @@ static int ttwu_remote(struct task_struct *p, int wake_flags)
 	if (p->on_rq) {
 		/* check_preempt_curr() may use rq clock */
 		update_rq_clock(rq);
+		trace_sched_wakeup(p, false);
 		ttwu_do_wakeup(rq, p, wake_flags);
 		ret = 1;
 	}
@@ -1675,6 +1676,8 @@ static void try_to_wake_up_local(struct task_struct *p)
 
 	if (!p->on_rq)
 		ttwu_activate(rq, p, ENQUEUE_WAKEUP);
+	else
+		trace_sched_wakeup(p, false);
 
 	ttwu_do_wakeup(rq, p, 0);
 	ttwu_stat(p, smp_processor_id(), 0);
-- 
1.8.2.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH] sched: Move the wakeup tracepoint from ttwu_do_wakeup() to ttwu_activate().
  2014-05-06  0:19                   ` Dongsheng Yang
  2014-05-06  0:26                     ` Dongsheng Yang
@ 2014-05-06  2:06                     ` Steven Rostedt
  2014-05-06  1:29                       ` Dongsheng Yang
  1 sibling, 1 reply; 42+ messages in thread
From: Steven Rostedt @ 2014-05-06  2:06 UTC (permalink / raw)
  To: Dongsheng Yang; +Cc: peterz, bsegall, mingo, linux-kernel

On Tue, 6 May 2014 09:19:51 +0900
Dongsheng Yang <yangds.fnst@cn.fujitsu.com> wrote:

> > tldr;
> 
> Yes, it is really loooong, sorry for my terrible expression in english. :(

No problem. I can imagine the trouble of having to write change logs in
a language that is not your mother tongue.

> >
> > ttwu_do_wakeup() is called when the task's state is switched back to
> > TASK_RUNNING, whether or not the task actually scheduled out. Tracing
> > the wakeup event when the task never scheduled out is quite confusing.
> > It should only trace the task wake up if the task actually did go to
> > sleep. Move the tracepoint from ttwu_do_wakeup() to ttwu_activate()
> > where it is called only if the task is really woken up and not just
> > have its state changed.
> 
> Thanx for your kind comment here, it looks shorter and more clear to me.
> I will update my commit message with your suggestion.
> 
> Thank you very much! :)

I wonder if we should have the event, or way to distinguish the
difference. Hmm, there's that "success" parameter in the tracepoint.
Could we possible be able to trace events where the success is true
only if it was actually waking the event, and false otherwise?

Having the sched_wakeup trace event show you when something woke the
task up may still be useful information. For example, you add yourself
to a wait queue and want to see the "wakeup". If we only show it for
tasks that really woke up then we wont see it for those that added
itself to a waitqueue but was "woken" before it could schedule out.

The original sched_wakeup did this, but with the ttwu rewrite, it was
lost.

Something like below?

-- Steve

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 268a45e..e583989 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1404,6 +1404,7 @@ static void ttwu_activate(struct rq *rq, struct task_struct *p, int en_flags)
 {
 	activate_task(rq, p, en_flags);
 	p->on_rq = 1;
+	trace_sched_wakeup(p, true);
 
 	/* if a worker is waking up, notify workqueue */
 	if (p->flags & PF_WQ_WORKER)
@@ -1417,7 +1418,6 @@ static void
 ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
 {
 	check_preempt_curr(rq, p, wake_flags);
-	trace_sched_wakeup(p, true);
 
 	p->state = TASK_RUNNING;
 #ifdef CONFIG_SMP
@@ -1662,6 +1662,8 @@ static void try_to_wake_up_local(struct task_struct *p)
 
 	if (!p->on_rq)
 		ttwu_activate(rq, p, ENQUEUE_WAKEUP);
+	else
+		trace_sched_wakeup(p, false);
 
 	ttwu_do_wakeup(rq, p, 0);
 	ttwu_stat(p, smp_processor_id(), 0);

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH] sched: Distinguish sched_wakeup event when wake up a task which did schedule out or not.
  2014-05-06  1:52                         ` [PATCH] sched: Distinguish sched_wakeup event when wake up a task which did schedule out or not Dongsheng Yang
@ 2014-05-09  0:16                           ` Dongsheng Yang
  2014-05-09  1:27                             ` Steven Rostedt
  2014-05-10 15:29                           ` Peter Zijlstra
  1 sibling, 1 reply; 42+ messages in thread
From: Dongsheng Yang @ 2014-05-09  0:16 UTC (permalink / raw)
  To: Dongsheng Yang; +Cc: rostedt, peterz, bsegall, mingo, linux-kernel

Hi steven, does this version look good to you?

Thanx :)

On 05/06/2014 10:52 AM, Dongsheng Yang wrote:
> ttwu_do_wakeup() is called when the task's state is switched back to
> TASK_RUNNING, whether or not the task actually scheduled out. Tracing
> the wakeup event when the task never scheduled out is quite confusing.
>
> This patch take the use of 'success' parameter in sched_wakeup tracepoint,
> to indicate that the task we are waking up did schedule out or just
> change its state to TASK_RUNNING.
>
> success is true:
> 	task was out of run queue and sleeping, we are really wake it up.
> success is false:
> 	taks was on run queue all the time and we just change its state
> 	to TASK_RUNNING.
>
> Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
> ---
>   kernel/sched/core.c | 5 ++++-
>   1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 9074c6d..14b9fe4 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1420,6 +1420,7 @@ static void ttwu_activate(struct rq *rq, struct task_struct *p, int en_flags)
>   {
>   	activate_task(rq, p, en_flags);
>   	p->on_rq = 1;
> +	trace_sched_wakeup(p, true);
>   
>   	/* if a worker is waking up, notify workqueue */
>   	if (p->flags & PF_WQ_WORKER)
> @@ -1433,7 +1434,6 @@ static void
>   ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
>   {
>   	check_preempt_curr(rq, p, wake_flags);
> -	trace_sched_wakeup(p, true);
>   
>   	p->state = TASK_RUNNING;
>   #ifdef CONFIG_SMP
> @@ -1481,6 +1481,7 @@ static int ttwu_remote(struct task_struct *p, int wake_flags)
>   	if (p->on_rq) {
>   		/* check_preempt_curr() may use rq clock */
>   		update_rq_clock(rq);
> +		trace_sched_wakeup(p, false);
>   		ttwu_do_wakeup(rq, p, wake_flags);
>   		ret = 1;
>   	}
> @@ -1675,6 +1676,8 @@ static void try_to_wake_up_local(struct task_struct *p)
>   
>   	if (!p->on_rq)
>   		ttwu_activate(rq, p, ENQUEUE_WAKEUP);
> +	else
> +		trace_sched_wakeup(p, false);
>   
>   	ttwu_do_wakeup(rq, p, 0);
>   	ttwu_stat(p, smp_processor_id(), 0);


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] sched: Distinguish sched_wakeup event when wake up a task which did schedule out or not.
  2014-05-09  0:16                           ` Dongsheng Yang
@ 2014-05-09  1:27                             ` Steven Rostedt
  0 siblings, 0 replies; 42+ messages in thread
From: Steven Rostedt @ 2014-05-09  1:27 UTC (permalink / raw)
  To: Dongsheng Yang; +Cc: peterz, bsegall, mingo, linux-kernel

On Fri, 9 May 2014 09:16:16 +0900
Dongsheng Yang <yangds.fnst@cn.fujitsu.com> wrote:

> Hi steven, does this version look good to you?

The patch looks fine to me. Peter, are you OK with it?

The change log needs a little cleaning up though. But I'm sure we can
do that for you.

Acked-by: Steven Rostedt <rostedt@goodmis.org>

-- Steve

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] sched: Distinguish sched_wakeup event when wake up a task which did schedule out or not.
  2014-05-06  1:52                         ` [PATCH] sched: Distinguish sched_wakeup event when wake up a task which did schedule out or not Dongsheng Yang
  2014-05-09  0:16                           ` Dongsheng Yang
@ 2014-05-10 15:29                           ` Peter Zijlstra
       [not found]                             ` <536F90BE.2080806@gmail.com>
  1 sibling, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2014-05-10 15:29 UTC (permalink / raw)
  To: Dongsheng Yang; +Cc: rostedt, bsegall, mingo, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2259 bytes --]

On Tue, May 06, 2014 at 10:52:34AM +0900, Dongsheng Yang wrote:
> ttwu_do_wakeup() is called when the task's state is switched back to
> TASK_RUNNING, whether or not the task actually scheduled out. Tracing
> the wakeup event when the task never scheduled out is quite confusing.
> 
> This patch take the use of 'success' parameter in sched_wakeup tracepoint,
> to indicate that the task we are waking up did schedule out or just
> change its state to TASK_RUNNING.
> 
> success is true:
> 	task was out of run queue and sleeping, we are really wake it up.
> success is false:
> 	taks was on run queue all the time and we just change its state
> 	to TASK_RUNNING.

No, I think this patch is crap and doesn't actually change anything
much, at best it simply reduces the size of the race window.

I also think you're trying to solve something that cannot be solved.

So the fundamental wait loop is:

  for (;;) {
	set_current_state(TASK_UNINTERRUPTIBLE);
	if (cond)
		break;
	schedule();
  }
  __set_task_state(TASK_RUNNING);

And the fundamental wakeup is:

  cond = true;
  wake_up_process(TASK_NORMAL);

And this is very much on purpose a lock-free but strictly ordered
scenario. It is a variation of:

  X = Y = 0

  (wait)	(wake)
  [w] X = 1	[w] Y = 1
  MB		MB
  [r] Y		[r] X

[ where: X := state, Y := cond ]

And we all 'know' that the only provided guarantee is that:
  X==0 && Y==0
is impossible -- but only that, all 3 other states are observable.

This guarantee means that its impossible to both miss the condition and
the wakeup; iow. it guarantees fwd progress.

OTOH its fundamentally racy, nothing guarantees we will not 'observe' both
the condition and the wakeup.

The setting of .success=false when ->on_rq is actively wrong, suppose
the waiter has already observed cond==false but has not yet gotten to
schedule(), at that point the wakeup happens and sees ->on_rq==1. The
wakeup is still very much a real wakeup.

If it were not for that wakeup the waiter would have gone to sleep and
missed the update of condition.

So no.. not going to happen. And certainly not with such a crappy
Changelog.

And a frozen seafood of choice lobbed at Steven for not seeing this :-)


[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Fwd: [PATCH] sched: Distinguish sched_wakeup event when wake up a task which did schedule out or not.
       [not found]                             ` <536F90BE.2080806@gmail.com>
@ 2014-05-11 15:24                               ` Dongsheng Yang
  2014-05-11 16:35                                 ` Peter Zijlstra
  0 siblings, 1 reply; 42+ messages in thread
From: Dongsheng Yang @ 2014-05-11 15:24 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: yangds.fnst, linux-kernel, Steven Rostedt, mingo, bsegall

On 05/10/2014 11:29 PM, Peter Zijlstra wrote:
>
> On Tue, May 06, 2014 at 10:52:34AM +0900, Dongsheng Yang wrote:
>>
>> ttwu_do_wakeup() is called when the task's state is switched back to
>> TASK_RUNNING, whether or not the task actually scheduled out. Tracing
>> the wakeup event when the task never scheduled out is quite confusing.
>>
>> This patch take the use of 'success' parameter in sched_wakeup tracepoint,
>> to indicate that the task we are waking up did schedule out or just
>> change its state to TASK_RUNNING.
>>
>> success is true:
>>         task was out of run queue and sleeping, we are really wake it up.
>> success is false:
>>         taks was on run queue all the time and we just change its state
>>         to TASK_RUNNING.
>
> No, I think this patch is crap and doesn't actually change anything
> much, at best it simply reduces the size of the race window.
>
> I also think you're trying to solve something that cannot be solved.


Actually, this patch does not attempt to solve the race condition.
It only want to avoid sched:sched_wakeup with success==true in
a fake wakeup, as explained below.

> So the fundamental wait loop is:
>
>    for (;;) {
>         set_current_state(TASK_UNINTERRUPTIBLE);
>         if (cond)
>                 break;
>         schedule();
>    }
>    __set_task_state(TASK_RUNNING);
>
> And the fundamental wakeup is:
>
>    cond = true;
>    wake_up_process(TASK_NORMAL);
>
> And this is very much on purpose a lock-free but strictly ordered
> scenario. It is a variation of:
>
>    X = Y = 0
>
>    (wait)       (wake)
>    [w] X = 1    [w] Y = 1
>    MB           MB
>    [r] Y                [r] X
>
> [ where: X := state, Y := cond ]
>
> And we all 'know' that the only provided guarantee is that:
>    X==0 && Y==0
> is impossible -- but only that, all 3 other states are observable.
>
> This guarantee means that its impossible to both miss the condition and
> the wakeup; iow. it guarantees fwd progress.
>
> OTOH its fundamentally racy, nothing guarantees we will not 'observe' both
> the condition and the wakeup.
>
> The setting of .success=false when ->on_rq is actively wrong, suppose
> the waiter has already observed cond==false but has not yet gotten to
> schedule(), at that point the wakeup happens and sees ->on_rq==1. The
> wakeup is still very much a real wakeup.


Yes, if a wakeup happens before schedule(), wakeup
sees ->on_rq==1. Then we can get an event with .success==false.
But I think it is not a real wakeup. :(

Yes, at this moment, maybe the task is already out of run queue.
But *this* wakeup did not move it back to run queue, it only
change the state of it to TASK_RUNNING. I believe the next
wakeup for this task will do the real wake up moving it back
to run queue.

And if scheduler really wake it up, we can get an event with success==true.

Anyway, what I want with this patch is to make scheduler raise accurate
events when waking up a task.

If a wakeup only change the state of task, raise a event with success==false.
If a wakeup move a task back to runqueue, .success==true.

It means, we do not need to care about the task is on_rq or not currently,
the value of .success is decided by the behavior we did in the function
of try_to_wake_up().

Wish I explain myself clearly.

Thanx :)

>
> If it were not for that wakeup the waiter would have gone to sleep and
> missed the update of condition.
>
> So no.. not going to happen. And certainly not with such a crappy
> Changelog.
>
> And a frozen seafood of choice lobbed at Steven for not seeing this :-)
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Fwd: [PATCH] sched: Distinguish sched_wakeup event when wake up a task which did schedule out or not.
  2014-05-11 15:24                               ` Fwd: " Dongsheng Yang
@ 2014-05-11 16:35                                 ` Peter Zijlstra
  2014-05-11 18:52                                   ` Steven Rostedt
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2014-05-11 16:35 UTC (permalink / raw)
  To: Dongsheng Yang; +Cc: yangds.fnst, linux-kernel, Steven Rostedt, mingo, bsegall

[-- Attachment #1: Type: text/plain, Size: 2948 bytes --]

On Sun, May 11, 2014 at 11:24:22PM +0800, Dongsheng Yang wrote:
> Actually, this patch does not attempt to solve the race condition.
> It only want to avoid sched:sched_wakeup with success==true in
> a fake wakeup, as explained below.
> 
> > So the fundamental wait loop is:
> >
> >    for (;;) {
> >         set_current_state(TASK_UNINTERRUPTIBLE);
> >         if (cond)
> >                 break;
> >         schedule();
> >    }
> >    __set_task_state(TASK_RUNNING);
> >
> > And the fundamental wakeup is:
> >
> >    cond = true;
> >    wake_up_process(TASK_NORMAL);
> >
> > And this is very much on purpose a lock-free but strictly ordered
> > scenario. It is a variation of:
> >
> >    X = Y = 0
> >
> >    (wait)       (wake)
> >    [w] X = 1    [w] Y = 1
> >    MB           MB
> >    [r] Y                [r] X
> >
> > [ where: X := state, Y := cond ]
> >
> > And we all 'know' that the only provided guarantee is that:
> >    X==0 && Y==0
> > is impossible -- but only that, all 3 other states are observable.
> >
> > This guarantee means that its impossible to both miss the condition and
> > the wakeup; iow. it guarantees fwd progress.
> >
> > OTOH its fundamentally racy, nothing guarantees we will not 'observe' both
> > the condition and the wakeup.
> >
> > The setting of .success=false when ->on_rq is actively wrong, suppose
> > the waiter has already observed cond==false but has not yet gotten to
> > schedule(), at that point the wakeup happens and sees ->on_rq==1. The
> > wakeup is still very much a real wakeup.
> 
> 
> Yes, if a wakeup happens before schedule(), wakeup
> sees ->on_rq==1. Then we can get an event with .success==false.
> But I think it is not a real wakeup. :(
> 
> Yes, at this moment, maybe the task is already out of run queue.
> But *this* wakeup did not move it back to run queue, it only
> change the state of it to TASK_RUNNING. I believe the next
> wakeup for this task will do the real wake up moving it back
> to run queue.
> 
> And if scheduler really wake it up, we can get an event with success==true.
> 
> Anyway, what I want with this patch is to make scheduler raise accurate
> events when waking up a task.
> 
> If a wakeup only change the state of task, raise a event with success==false.
> If a wakeup move a task back to runqueue, .success==true.
> 
> It means, we do not need to care about the task is on_rq or not currently,
> the value of .success is decided by the behavior we did in the function
> of try_to_wake_up().
> 
> Wish I explain myself clearly.

So if the wait side has already observed cond==false, then without the
wakeup, which still potentially has ->on_rq == true, it would block.
Therefore the wakeup is a _real_ wakeup.

We fundamentally cannot know, on the wake side, if the wait side has or
has not observed cond, and therefore the distinction you're trying to
make is a false one.

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] sched: Distinguish sched_wakeup event when wake up a task which did schedule out or not.
  2014-05-11 16:35                                 ` Peter Zijlstra
@ 2014-05-11 18:52                                   ` Steven Rostedt
  2014-05-12  6:47                                     ` Peter Zijlstra
  0 siblings, 1 reply; 42+ messages in thread
From: Steven Rostedt @ 2014-05-11 18:52 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Dongsheng Yang, yangds.fnst, linux-kernel, mingo, bsegall

On Sun, 11 May 2014 18:35:31 +0200
Peter Zijlstra <peterz@infradead.org> wrote:


> So if the wait side has already observed cond==false, then without the
> wakeup, which still potentially has ->on_rq == true, it would block.
> Therefore the wakeup is a _real_ wakeup.
> 
> We fundamentally cannot know, on the wake side, if the wait side has or
> has not observed cond, and therefore the distinction you're trying to
> make is a false one.

I believe you may be misunderstanding Dongsheng. It has nothing to do
with the wake condition. But the "success" is basically saying, "did I
move the task on to the run queue?". That's a relevant piece of
information that the wake up event isn't currently showing.

Let me ask you this; with Donsheng's patch, will there ever be a
sched_switch event when the wakeup event sees 'false' and the
sched_switch event see the task with a state other than "R"? And if so,
how did the task doing the wakeup event, wake up that task?

-- Steve

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] sched: Distinguish sched_wakeup event when wake up a task which did schedule out or not.
  2014-05-11 18:52                                   ` Steven Rostedt
@ 2014-05-12  6:47                                     ` Peter Zijlstra
  2014-05-12  8:58                                       ` Dongsheng Yang
  2014-05-12 14:09                                       ` Steven Rostedt
  0 siblings, 2 replies; 42+ messages in thread
From: Peter Zijlstra @ 2014-05-12  6:47 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: Dongsheng Yang, yangds.fnst, linux-kernel, mingo, bsegall

[-- Attachment #1: Type: text/plain, Size: 1242 bytes --]

On Sun, May 11, 2014 at 02:52:24PM -0400, Steven Rostedt wrote:
> On Sun, 11 May 2014 18:35:31 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> 
> > So if the wait side has already observed cond==false, then without the
> > wakeup, which still potentially has ->on_rq == true, it would block.
> > Therefore the wakeup is a _real_ wakeup.
> > 
> > We fundamentally cannot know, on the wake side, if the wait side has or
> > has not observed cond, and therefore the distinction you're trying to
> > make is a false one.
> 
> I believe you may be misunderstanding Dongsheng. It has nothing to do
> with the wake condition. But the "success" is basically saying, "did I
> move the task on to the run queue?". That's a relevant piece of
> information that the wake up event isn't currently showing.
> 
> Let me ask you this; with Donsheng's patch, will there ever be a
> sched_switch event when the wakeup event sees 'false' and the
> sched_switch event see the task with a state other than "R"? And if so,
> how did the task doing the wakeup event, wake up that task?

But that has nothing what so fucking ever to do with 'success'. Reusing
that trace argument for something entirely different is just retarded.

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] sched: Distinguish sched_wakeup event when wake up a task which did schedule out or not.
  2014-05-12  6:47                                     ` Peter Zijlstra
@ 2014-05-12  8:58                                       ` Dongsheng Yang
  2014-05-12 14:09                                       ` Steven Rostedt
  1 sibling, 0 replies; 42+ messages in thread
From: Dongsheng Yang @ 2014-05-12  8:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, Dongsheng Yang, linux-kernel, mingo, bsegall

On 05/12/2014 03:47 PM, Peter Zijlstra wrote:
> On Sun, May 11, 2014 at 02:52:24PM -0400, Steven Rostedt wrote:
>> On Sun, 11 May 2014 18:35:31 +0200
>> Peter Zijlstra <peterz@infradead.org> wrote:
>> I believe you may be misunderstanding Dongsheng. It has nothing to do
>> with the wake condition. But the "success" is basically saying, "did I
>> move the task on to the run queue?". That's a relevant piece of
>> information that the wake up event isn't currently showing.
>>
>> Let me ask you this; with Donsheng's patch, will there ever be a
>> sched_switch event when the wakeup event sees 'false' and the
>> sched_switch event see the task with a state other than "R"? And if so,
>> how did the task doing the wakeup event, wake up that task?
> But that has nothing what so fucking ever to do with 'success'. Reusing
> that trace argument for something entirely different is just retarded.

Hi Peter,

we need to distinguish the true_wakeup and false_wakeup because
we want to know the timestamp when a task moved into runqueue.
Then we can calculate the latency time in perf sched latency command.

Without this patch, we will always get wakeup events with .success==true,
and we can not get the accurate time of task on run queue waiting for
cpu.

In original design of sched:sched_wakeup, actually was what I described
in commit message, .success==true means task go into run queue. And
perf-sched.c does only care events with this type:

         /* Note for later, it may be interesting to observe the failing 
cases */
         if (!success)
                 return 0;

And if scheduler raise an wakeup event with .success=true in 
ttwu_do_wakeup(),
perf sched latency will find the last state of this task is not out of 
run queue. Then
it will be confused and print a bug message for it.

# perf sched latency|tail
   bash:3498             |      0.785 ms |        1 | avg:    0.000 ms | 
max:    0.000 ms | max at:  0.000000 s
   bash:3600             |      0.522 ms |        1 | avg:    0.000 ms | 
max:    0.000 ms | max at:  0.000000 s
   bash:3570             |      0.295 ms |        1 | avg:    0.000 ms | 
max:    0.000 ms | max at:  0.000000 s
   bash:3581             |      0.288 ms |        1 | avg:    0.000 ms | 
max:    0.000 ms | max at:  0.000000 s
   bash:3594             |      0.286 ms |        1 | avg:    0.000 ms | 
max:    0.000 ms | max at:  0.000000 s
  -----------------------------------------------------------------------------------------
   TOTAL:                |   2768.420 ms |    17331 |
  ---------------------------------------------------
   INFO: 0.482% state machine bugs (82 out of 17008)




^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] sched: Distinguish sched_wakeup event when wake up a task which did schedule out or not.
  2014-05-12  6:47                                     ` Peter Zijlstra
  2014-05-12  8:58                                       ` Dongsheng Yang
@ 2014-05-12 14:09                                       ` Steven Rostedt
  2014-05-12 15:09                                         ` Peter Zijlstra
  1 sibling, 1 reply; 42+ messages in thread
From: Steven Rostedt @ 2014-05-12 14:09 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Dongsheng Yang, yangds.fnst, linux-kernel, mingo, bsegall

On Mon, 12 May 2014 08:47:30 +0200
Peter Zijlstra <peterz@infradead.org> wrote:


> But that has nothing what so fucking ever to do with 'success'. Reusing
> that trace argument for something entirely different is just retarded.

And having a "success" field that his hard coded to "true" is also
retarded. At least this change is fucking useful.

How about this. Add a new field called 'rq_added' or something. We can
ever shrink the size of the "success" field and hard code it to true in
the trace. That will never change. Then the caller of the tracepoint
will send in a "true" or "false" to "rq_added" and we can add that
tracepoint as well.

What I mean by shrink the size of the success field is that it is
currently 4 bytes in size. We can make it two (or one) and then add
another field for the 'rq_added' and have that be two bytes as well.
This shouldn't break any tools that use this.

Point being, this is useful information, why not pass it to userspace?

-- Steve


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] sched: Distinguish sched_wakeup event when wake up a task which did schedule out or not.
  2014-05-12 14:09                                       ` Steven Rostedt
@ 2014-05-12 15:09                                         ` Peter Zijlstra
  2014-05-12 15:17                                           ` Steven Rostedt
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2014-05-12 15:09 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: Dongsheng Yang, yangds.fnst, linux-kernel, mingo, bsegall

[-- Attachment #1: Type: text/plain, Size: 1345 bytes --]

On Mon, May 12, 2014 at 10:09:08AM -0400, Steven Rostedt wrote:
> On Mon, 12 May 2014 08:47:30 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> 
> > But that has nothing what so fucking ever to do with 'success'. Reusing
> > that trace argument for something entirely different is just retarded.
> 
> And having a "success" field that his hard coded to "true" is also
> retarded. At least this change is fucking useful.

You know why that's there :/ And yes, I started hating tracepoints a lot
more ever since that happened.

> How about this. Add a new field called 'rq_added' or something. We can
> ever shrink the size of the "success" field and hard code it to true in
> the trace. That will never change. Then the caller of the tracepoint
> will send in a "true" or "false" to "rq_added" and we can add that
> tracepoint as well.
> 
> What I mean by shrink the size of the success field is that it is
> currently 4 bytes in size. We can make it two (or one) and then add
> another field for the 'rq_added' and have that be two bytes as well.
> This shouldn't break any tools that use this.
> 
> Point being, this is useful information, why not pass it to userspace?

Because it adds code to one of the hottest code paths in the kernel and
so far nobody had a sane explanation of WTF they were doing.


[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] sched: Distinguish sched_wakeup event when wake up a task which did schedule out or not.
  2014-05-12 15:09                                         ` Peter Zijlstra
@ 2014-05-12 15:17                                           ` Steven Rostedt
  2014-05-12 15:28                                             ` Peter Zijlstra
  0 siblings, 1 reply; 42+ messages in thread
From: Steven Rostedt @ 2014-05-12 15:17 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Dongsheng Yang, yangds.fnst, linux-kernel, mingo, bsegall

On Mon, 12 May 2014 17:09:50 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> > Point being, this is useful information, why not pass it to userspace?
> 
> Because it adds code to one of the hottest code paths in the kernel and
> so far nobody had a sane explanation of WTF they were doing.
> 

Is there a measurable difference with these extra tracepoints?

-- Steve


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] sched: Distinguish sched_wakeup event when wake up a task which did schedule out or not.
  2014-05-12 15:17                                           ` Steven Rostedt
@ 2014-05-12 15:28                                             ` Peter Zijlstra
  0 siblings, 0 replies; 42+ messages in thread
From: Peter Zijlstra @ 2014-05-12 15:28 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: Dongsheng Yang, yangds.fnst, linux-kernel, mingo, bsegall

[-- Attachment #1: Type: text/plain, Size: 559 bytes --]

On Mon, May 12, 2014 at 11:17:48AM -0400, Steven Rostedt wrote:
> On Mon, 12 May 2014 17:09:50 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > > Point being, this is useful information, why not pass it to userspace?
> > 
> > Because it adds code to one of the hottest code paths in the kernel and
> > so far nobody had a sane explanation of WTF they were doing.
> > 
> 
> Is there a measurable difference with these extra tracepoints?

Dunno, nobody tried and reported. I suppose its sitting there along with
the coherent changelog.

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2014-05-12 15:28 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-04-15 12:32 [PATCH 0/8] perf sched: Add trace event for sched wait Dongsheng Yang
2014-04-15 12:32 ` [PATCH 1/8] sched & trace: Add a trace event for wait Dongsheng Yang
2014-04-15 13:49   ` Peter Zijlstra
2014-04-16 14:23     ` Steven Rostedt
2014-04-15 12:32 ` [PATCH 2/8] sched/wait: Add trace point before add task into wait queue Dongsheng Yang
2014-04-15 12:32 ` [PATCH 3/8] sched/wait: Use __add_wait_queue{_tail}_exclusive() as possible Dongsheng Yang
2014-04-15 13:49   ` Peter Zijlstra
2014-04-16  9:51     ` Dongsheng Yang
2014-04-15 12:32 ` [PATCH 4/8] sched/core: Skip wakeup when task is already running Dongsheng Yang
2014-04-15 13:53   ` Peter Zijlstra
2014-04-16 10:22     ` Dongsheng Yang
2014-04-22 11:56       ` Dongsheng Yang
2014-04-22 13:23         ` Peter Zijlstra
2014-04-22 17:10         ` bsegall
2014-04-22 17:53           ` Steven Rostedt
2014-04-22 18:18           ` Peter Zijlstra
2014-05-05  6:32             ` Dongsheng Yang
2014-05-05  6:34               ` [PATCH] sched: Move the wakeup tracepoint from ttwu_do_wakeup() to ttwu_activate() Dongsheng Yang
2014-05-05 14:00                 ` Steven Rostedt
2014-05-06  0:19                   ` Dongsheng Yang
2014-05-06  0:26                     ` Dongsheng Yang
2014-05-06  2:06                     ` Steven Rostedt
2014-05-06  1:29                       ` Dongsheng Yang
2014-05-06  1:52                         ` [PATCH] sched: Distinguish sched_wakeup event when wake up a task which did schedule out or not Dongsheng Yang
2014-05-09  0:16                           ` Dongsheng Yang
2014-05-09  1:27                             ` Steven Rostedt
2014-05-10 15:29                           ` Peter Zijlstra
     [not found]                             ` <536F90BE.2080806@gmail.com>
2014-05-11 15:24                               ` Fwd: " Dongsheng Yang
2014-05-11 16:35                                 ` Peter Zijlstra
2014-05-11 18:52                                   ` Steven Rostedt
2014-05-12  6:47                                     ` Peter Zijlstra
2014-05-12  8:58                                       ` Dongsheng Yang
2014-05-12 14:09                                       ` Steven Rostedt
2014-05-12 15:09                                         ` Peter Zijlstra
2014-05-12 15:17                                           ` Steven Rostedt
2014-05-12 15:28                                             ` Peter Zijlstra
2014-04-15 12:32 ` [PATCH 5/8] perf tools: record and process sched:sched_wait event Dongsheng Yang
2014-04-15 12:32 ` [PATCH 6/8] perf tools: add missing event for perf sched record Dongsheng Yang
2014-04-15 12:32 ` [PATCH 7/8] perf tools: Adapt the TASK_STATE_TO_CHAR_STR to new value in kernel space Dongsheng Yang
2014-04-15 12:32 ` [PATCH 8/8] perf tools: Clarify the output of perf sched map Dongsheng Yang
2014-04-15 13:54 ` [PATCH 0/8] perf sched: Add trace event for sched wait Peter Zijlstra
2014-04-16 10:28   ` Dongsheng Yang

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.