* [PATCH 1/3] PERF: Do not export power_frequency, but power_start event
       [not found] <1287488171-25303-1-git-send-email-trenn@suse.de>
@ 2010-10-19 11:36 ` Thomas Renninger
  2010-10-19 11:36 ` Thomas Renninger
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 135+ messages in thread
From: Thomas Renninger @ 2010-10-19 11:36 UTC (permalink / raw)
  To: trenn
  Cc: Arjan van de Ven, linux-trace-users, Frederic Weisbecker,
	Pierre Tardy, Jean Pihet, Steven Rostedt, Peter Zijlstra,
	Frank Eigler, Mathieu Desnoyers, Ingo Molnar, linux-pm,
	Masami Hiramatsu, Tejun Heo, Andrew Morton, linux-omap,
	Linus Torvalds, Thomas Gleixner

power_frequency moved to drivers/cpufreq/cpufreq.c, which cannot be
built as a module, so there is no need to export it.

intel_idle can be a module, though, so power_start has to be exported
instead.

Signed-off-by: Thomas Renninger <trenn@suse.de>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
CC: Frank Eigler <fche@redhat.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Kevin Hilman <khilman@deeprootsystems.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: linux-omap@vger.kernel.org
CC: rjw@sisk.pl
CC: linux-pm@lists.linux-foundation.org
CC: linux-trace-users@vger.kernel.org
CC: Jean Pihet <jean.pihet@newoldbits.com>
CC: Pierre Tardy <tardyp@gmail.com>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Tejun Heo <tj@kernel.org>
CC: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Arjan van de Ven <arjan@linux.intel.com>
CC: Ingo Molnar <mingo@elte.hu>
---
 drivers/idle/intel_idle.c   |    2 --
 kernel/trace/power-traces.c |    2 +-
 2 files changed, 1 insertions(+), 3 deletions(-)

diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index c37ef64..21ac077 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -201,9 +201,7 @@ static int intel_idle(struct cpuidle_device *dev, struct cpuidle_state *state)
 	kt_before = ktime_get_real();
 
 	stop_critical_timings();
-#ifndef MODULE
 	trace_power_start(POWER_CSTATE, (eax >> 4) + 1, cpu);
-#endif
 	if (!need_resched()) {
 
 		__monitor((void *)&current_thread_info()->flags, 0, 0);
diff --git a/kernel/trace/power-traces.c b/kernel/trace/power-traces.c
index a22582a..0e0497d 100644
--- a/kernel/trace/power-traces.c
+++ b/kernel/trace/power-traces.c
@@ -13,5 +13,5 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/power.h>
 
-EXPORT_TRACEPOINT_SYMBOL_GPL(power_frequency);
+EXPORT_TRACEPOINT_SYMBOL_GPL(power_start);
 
-- 
1.6.0.2

* [PATCH 2/3] PERF(kernel): Cleanup power events
       [not found] <1287488171-25303-1-git-send-email-trenn@suse.de>
                   ` (2 preceding siblings ...)
  2010-10-19 11:36 ` [PATCH 2/3] PERF(kernel): Cleanup power events Thomas Renninger
@ 2010-10-19 11:36 ` Thomas Renninger
  2010-10-19 11:36 ` [PATCH 3/3] PERF(userspace): Adjust perf timechart to the new " Thomas Renninger
  2010-10-19 11:36 ` Thomas Renninger
  5 siblings, 0 replies; 135+ messages in thread
From: Thomas Renninger @ 2010-10-19 11:36 UTC (permalink / raw)
  To: trenn
  Cc: Arjan van de Ven, linux-trace-users, Frederic Weisbecker,
	Pierre Tardy, Jean Pihet, Steven Rostedt, Peter Zijlstra,
	Frank Eigler, Mathieu Desnoyers, Ingo Molnar, linux-pm,
	Masami Hiramatsu, Tejun Heo, Andrew Morton, linux-omap,
	Linus Torvalds, Thomas Gleixner

New power trace events:
  power:processor_idle
  power:processor_frequency
  power:machine_suspend

The C-state/idle accounting events:
  power:power_start
  power:power_end
are replaced with:
  power:processor_idle

and
  power:power_frequency
is replaced with:
  power:processor_frequency

power:machine_suspend is newly introduced. A first implementation
comes from the ARM side, but it is easy to add this event on x86
as well if needed.

The type= field was removed from both replacement events: it was
never used, and the type is already distinguished by the event itself.

The old events remain available behind
CONFIG_EVENT_POWER_TRACING_DEPRECATED for userspace compatibility.

The perf timechart userspace tool is adjusted in a separate patch.

Signed-off-by: Thomas Renninger <trenn@suse.de>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
CC: Frank Eigler <fche@redhat.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Kevin Hilman <khilman@deeprootsystems.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: linux-omap@vger.kernel.org
CC: rjw@sisk.pl
CC: linux-pm@lists.linux-foundation.org
CC: linux-trace-users@vger.kernel.org
CC: Jean Pihet <jean.pihet@newoldbits.com>
CC: Pierre Tardy <tardyp@gmail.com>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Tejun Heo <tj@kernel.org>
CC: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Arjan van de Ven <arjan@linux.intel.com>
CC: Ingo Molnar <mingo@elte.hu>
---
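Not part of the patch: a small userspace sketch of the event mapping
described in the changelog above, in case it helps when adapting tools.
The struct and function names (legacy_sample, processor_sample,
legacy_to_new, is_idle_exit) are made up for illustration and exist
neither in the kernel nor in perf; only the event names and the
state/cpu_id fields are taken from this patch.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical view of an old event payload. power_end only carries
 * cpu_id; type and state are ignored for it. */
struct legacy_sample {
	uint64_t type;
	uint64_t state;		/* C-state number or frequency */
	uint64_t cpu_id;
};

/* Hypothetical view of the new "processor" event class payload. */
struct processor_sample {
	uint64_t state;		/* C-state number or frequency */
	uint64_t cpu_id;
};

/*
 * Translate an old event (name + payload) into the new form.
 * Returns the new event name, or NULL if the name is not one of the
 * three old power events. The type= field is simply dropped.
 */
static const char *legacy_to_new(const char *name,
				 const struct legacy_sample *old,
				 struct processor_sample *out)
{
	out->cpu_id = old->cpu_id;

	if (strcmp(name, "power:power_start") == 0) {
		out->state = old->state;
		return "power:processor_idle";
	}
	if (strcmp(name, "power:power_end") == 0) {
		out->state = 0;	/* state 0 marks the end of idle */
		return "power:processor_idle";
	}
	if (strcmp(name, "power:power_frequency") == 0) {
		out->state = old->state;
		return "power:processor_frequency";
	}
	return NULL;
}

/* With processor_idle, leaving idle is signalled by state == 0. */
static int is_idle_exit(const struct processor_sample *s)
{
	return s->state == 0;
}
```

A consumer could run legacy_to_new() over old-style samples and then
handle only the new event names; is_idle_exit() takes the place of the
separate power_end check. Note that the patch drops the tracepoints from
poll_idle(), which used state 0 with power_start, so state 0 is free to
mean "idle exit".
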
 arch/x86/kernel/process.c    |    5 ++-
 arch/x86/kernel/process_64.c |    1 +
 drivers/cpufreq/cpufreq.c    |    1 +
 drivers/cpuidle/cpuidle.c    |    1 +
 drivers/idle/intel_idle.c    |    1 +
 include/trace/events/power.h |   80 +++++++++++++++++++++++++++++++++++++++++-
 kernel/trace/Kconfig         |   14 +++++++
 kernel/trace/power-traces.c  |    3 ++
 8 files changed, 103 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 57d1868..b6b1578 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -374,6 +374,7 @@ void default_idle(void)
 {
 	if (hlt_use_halt()) {
 		trace_power_start(POWER_CSTATE, 1, smp_processor_id());
+		trace_processor_idle(1, smp_processor_id());
 		current_thread_info()->status &= ~TS_POLLING;
 		/*
 		 * TS_POLLING-cleared state must be visible before we
@@ -444,6 +445,7 @@ EXPORT_SYMBOL_GPL(cpu_idle_wait);
 void mwait_idle_with_hints(unsigned long ax, unsigned long cx)
 {
 	trace_power_start(POWER_CSTATE, (ax>>4)+1, smp_processor_id());
+	trace_processor_idle((ax>>4)+1, smp_processor_id());
 	if (!need_resched()) {
 		if (cpu_has(&current_cpu_data, X86_FEATURE_CLFLUSH_MONITOR))
 			clflush((void *)&current_thread_info()->flags);
@@ -460,6 +462,7 @@ static void mwait_idle(void)
 {
 	if (!need_resched()) {
 		trace_power_start(POWER_CSTATE, 1, smp_processor_id());
+		trace_processor_idle(1, smp_processor_id());
 		if (cpu_has(&current_cpu_data, X86_FEATURE_CLFLUSH_MONITOR))
 			clflush((void *)&current_thread_info()->flags);
 
@@ -480,11 +483,9 @@ static void mwait_idle(void)
  */
 static void poll_idle(void)
 {
-	trace_power_start(POWER_CSTATE, 0, smp_processor_id());
 	local_irq_enable();
 	while (!need_resched())
 		cpu_relax();
-	trace_power_end(0);
 }
 
 /*
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 3d9ea53..2c3254c 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -142,6 +142,7 @@ void cpu_idle(void)
 			start_critical_timings();
 
 			trace_power_end(smp_processor_id());
+			trace_processor_idle(0, smp_processor_id());
 
 			/* In many cases the interrupt that ended idle
 			   has already called exit_idle. But some idle
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 199dcb9..33bdc41 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -355,6 +355,7 @@ void cpufreq_notify_transition(struct cpufreq_freqs *freqs, unsigned int state)
 		dprintk("FREQ: %lu - CPU: %lu", (unsigned long)freqs->new,
 			(unsigned long)freqs->cpu);
 		trace_power_frequency(POWER_PSTATE, freqs->new, freqs->cpu);
+		trace_processor_frequency(freqs->new, freqs->cpu);
 		srcu_notifier_call_chain(&cpufreq_transition_notifier_list,
 				CPUFREQ_POSTCHANGE, freqs);
 		if (likely(policy) && likely(policy->cpu == freqs->cpu))
diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index a507108..f79de04 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -107,6 +107,7 @@ static void cpuidle_idle_call(void)
 	if (cpuidle_curr_governor->reflect)
 		cpuidle_curr_governor->reflect(dev);
 	trace_power_end(smp_processor_id());
+	trace_processor_idle(0, smp_processor_id());
 }
 
 /**
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index 21ac077..c78e496 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -202,6 +202,7 @@ static int intel_idle(struct cpuidle_device *dev, struct cpuidle_state *state)
 
 	stop_critical_timings();
 	trace_power_start(POWER_CSTATE, (eax >> 4) + 1, cpu);
+	trace_processor_idle((eax >> 4) + 1, smp_processor_id());
 	if (!need_resched()) {
 
 		__monitor((void *)&current_thread_info()->flags, 0, 0);
diff --git a/include/trace/events/power.h b/include/trace/events/power.h
index 35a2a6e..d5cecd9 100644
--- a/include/trace/events/power.h
+++ b/include/trace/events/power.h
@@ -7,6 +7,60 @@
 #include <linux/ktime.h>
 #include <linux/tracepoint.h>
 
+DECLARE_EVENT_CLASS(processor,
+
+	TP_PROTO(unsigned int state, unsigned int cpu_id),
+
+	TP_ARGS(state, cpu_id),
+
+	TP_STRUCT__entry(
+		__field(	u64,		state		)
+		__field(	u64,		cpu_id		)
+	),
+
+	TP_fast_assign(
+		__entry->state = state;
+		__entry->cpu_id = cpu_id;
+	),
+
+	TP_printk("state=%lu cpu_id=%lu", (unsigned long)__entry->state,
+		  (unsigned long)__entry->cpu_id)
+);
+
+DEFINE_EVENT(processor, processor_idle,
+
+	TP_PROTO(unsigned int state, unsigned int cpu_id),
+
+	     TP_ARGS(state, cpu_id)
+);
+
+DEFINE_EVENT(processor, processor_frequency,
+
+	TP_PROTO(unsigned int frequency, unsigned int cpu_id),
+
+	TP_ARGS(frequency, cpu_id)
+);
+
+TRACE_EVENT(machine_suspend,
+
+	TP_PROTO(unsigned int state),
+
+	TP_ARGS(state),
+
+	TP_STRUCT__entry(
+		__field(	u64,		state		)
+	),
+
+	TP_fast_assign(
+		__entry->state = state;
+	),
+
+	TP_printk("state=%lu", (unsigned long)__entry->state)
+
+);
+
+#ifdef CONFIG_EVENT_POWER_TRACING_DEPRECATED
+
 #ifndef _TRACE_POWER_ENUM_
 #define _TRACE_POWER_ENUM_
 enum {
@@ -69,8 +123,32 @@ TRACE_EVENT(power_end,
 	TP_printk("cpu_id=%lu", (unsigned long)__entry->cpu_id)
 
 );
-
+#endif /* CONFIG_EVENT_POWER_TRACING_DEPRECATED */
 #endif /* _TRACE_POWER_H */
 
+/* Deprecated dummy functions must be protected against multi-declaration */
+#ifndef EVENT_POWER_TRACING_DEPRECATED_PART_H
+#define EVENT_POWER_TRACING_DEPRECATED_PART_H
+
+#ifndef CONFIG_EVENT_POWER_TRACING_DEPRECATED
+
+#ifndef _TRACE_POWER_ENUM_
+#define _TRACE_POWER_ENUM_
+enum {
+	POWER_NONE = 0,
+	POWER_CSTATE = 1,
+	POWER_PSTATE = 2,
+};
+#endif
+
+static inline void trace_power_start(u64 type, u64 state, u64 cpuid) {}
+static inline void trace_power_end(u64 cpuid) {}
+static inline void trace_power_frequency(u64 type, u64 state, u64 cpuid) {}
+#endif /* CONFIG_EVENT_POWER_TRACING_DEPRECATED */
+
+#endif /* EVENT_POWER_TRACING_DEPRECATED_PART_H */
+
+
+
 /* This part must be outside protection */
 #include <trace/define_trace.h>
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 538501c..0b5c841 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -64,6 +64,20 @@ config EVENT_TRACING
 	select CONTEXT_SWITCH_TRACER
 	bool
 
+config EVENT_POWER_TRACING_DEPRECATED
+	depends on EVENT_TRACING
+	bool
+	help
+	  Provides old power event types:
+	  C-state/idle accounting events:
+	  power:power_start
+	  power:power_end
+	  and old cpufreq accounting event:
+	  power:power_frequency
+	  This is for userspace compatibility
+	  and will vanish after 5 kernel iterations,
+	  namely 2.6.41.
+
 config CONTEXT_SWITCH_TRACER
 	bool
 
diff --git a/kernel/trace/power-traces.c b/kernel/trace/power-traces.c
index 0e0497d..6b6da42 100644
--- a/kernel/trace/power-traces.c
+++ b/kernel/trace/power-traces.c
@@ -13,5 +13,8 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/power.h>
 
+#ifdef CONFIG_EVENT_POWER_TRACING_DEPRECATED
 EXPORT_TRACEPOINT_SYMBOL_GPL(power_start);
+#endif
+EXPORT_TRACEPOINT_SYMBOL_GPL(processor_idle);
 
-- 
1.6.0.2

* [PATCH 3/3] PERF(userspace): Adjust perf timechart to the new power events
       [not found] <1287488171-25303-1-git-send-email-trenn@suse.de>
                   ` (3 preceding siblings ...)
  2010-10-19 11:36 ` [PATCH 2/3] PERF(kernel): Cleanup power events Thomas Renninger
@ 2010-10-19 11:36 ` Thomas Renninger
  2010-10-19 11:36 ` Thomas Renninger
  5 siblings, 0 replies; 135+ messages in thread
From: Thomas Renninger @ 2010-10-19 11:36 UTC (permalink / raw)
  To: trenn
  Cc: Arjan van de Ven, linux-trace-users, Frederic Weisbecker,
	Pierre Tardy, Jean Pihet, Steven Rostedt, Peter Zijlstra,
	Frank Eigler, Mathieu Desnoyers, Ingo Molnar, linux-pm,
	Masami Hiramatsu, Tejun Heo, Andrew Morton, linux-omap,
	Linus Torvalds, Thomas Gleixner

The transition was rather smooth; the only part I had to fiddle with
for some time was the check whether a tracepoint/event is supported
by the running kernel.

builtin-timechart must only pass -e power:xy events that are
supported by the running kernel. For this I added a tiny helper
function to parse-events.[hc]:
  int is_valid_tracepoint(const char *event_string)
It could be made more generic as an interface and support
hardware/software/... events, not only tracepoints, but someone
else can extend it if needed.

Signed-off-by: Thomas Renninger <trenn@suse.de>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
CC: Frank Eigler <fche@redhat.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Kevin Hilman <khilman@deeprootsystems.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: linux-omap@vger.kernel.org
CC: rjw@sisk.pl
CC: linux-pm@lists.linux-foundation.org
CC: linux-trace-users@vger.kernel.org
CC: Jean Pihet <jean.pihet@newoldbits.com>
CC: Pierre Tardy <tardyp@gmail.com>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Tejun Heo <tj@kernel.org>
CC: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Arjan van de Ven <arjan@linux.intel.com>
CC: Ingo Molnar <mingo@elte.hu>
---
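Not part of the patch: a minimal sketch of the "is this tracepoint
supported by the running kernel" check described above. It does not
reproduce is_valid_tracepoint() from the diff below, which walks the
events directory; instead it assumes that an event "subsys:event" is
available exactly when the directory <events_dir>/<subsys>/<event>
exists, and tests that with stat(). The names tracepoint_dir_exists,
pick_idle_event and the events_dir parameter are hypothetical.

```c
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/*
 * Return 1 if "subsys:event" has a directory
 * <events_dir>/<subsys>/<event>, 0 otherwise.
 * events_dir is assumed to be the tracing/events directory below the
 * debugfs mount point; it is passed in so the check can be tested.
 */
static int tracepoint_dir_exists(const char *events_dir,
				 const char *event_string)
{
	char path[4096];
	struct stat st;
	const char *colon = strchr(event_string, ':');
	int n;

	/* Require a non-empty subsystem and a non-empty event name. */
	if (!colon || colon == event_string || colon[1] == '\0')
		return 0;

	n = snprintf(path, sizeof(path), "%s/%.*s/%s", events_dir,
		     (int)(colon - event_string), event_string, colon + 1);
	if (n < 0 || (size_t)n >= sizeof(path))
		return 0;

	return stat(path, &st) == 0 && S_ISDIR(st.st_mode);
}

/*
 * Same preference as __cmd_record() in this patch: use the old idle
 * event if the kernel still provides it, else the new one.
 */
static const char *pick_idle_event(const char *events_dir)
{
	if (tracepoint_dir_exists(events_dir, "power:power_start"))
		return "power:power_start";
	return "power:processor_idle";
}
```

The stat() variant avoids iterating over every subsystem, at the cost
of trusting the caller's event string as a path component. The
directory walk in the patch reuses the existing for_each_subsystem /
for_each_event helpers instead.
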
 tools/perf/builtin-timechart.c |   87 ++++++++++++++++++++++++++++++++-------
 tools/perf/util/parse-events.c |   43 +++++++++++++++++++-
 tools/perf/util/parse-events.h |    1 +
 3 files changed, 114 insertions(+), 17 deletions(-)

diff --git a/tools/perf/builtin-timechart.c b/tools/perf/builtin-timechart.c
index 9bcc38f..1304c27 100644
--- a/tools/perf/builtin-timechart.c
+++ b/tools/perf/builtin-timechart.c
@@ -32,6 +32,8 @@
 #include "util/session.h"
 #include "util/svghelper.h"
 
+#define SUPPORT_OLD_POWER_EVENTS 1
+
 static char		const *input_name = "perf.data";
 static char		const *output_name = "output.svg";
 
@@ -298,12 +300,25 @@ struct trace_entry {
 	int			lock_depth;
 };
 
-struct power_entry {
+#if defined(SUPPORT_OLD_POWER_EVENTS)
+struct power_entry_old {
 	struct trace_entry te;
 	u64	type;
 	u64	value;
 	u64	cpu_id;
 };
+#endif
+
+struct power_processor_entry {
+	struct trace_entry te;
+	u64	state;
+	u64	cpu_id;
+};
+
+struct power_suspend_entry {
+	struct trace_entry te;
+	u64	state;
+};
 
 #define TASK_COMM_LEN 16
 struct wakeup_entry {
@@ -489,29 +504,46 @@ static int process_sample_event(event_t *event, struct perf_session *session)
 	te = (void *)data.raw_data;
 	if (session->sample_type & PERF_SAMPLE_RAW && data.raw_size > 0) {
 		char *event_str;
-		struct power_entry *pe;
-
-		pe = (void *)te;
+#if defined(SUPPORT_OLD_POWER_EVENTS)
+		struct power_entry_old *peo;
+		peo = (void *)te;
+#endif
 
 		event_str = perf_header__find_event(te->type);
 
 		if (!event_str)
 			return 0;
 
-		if (strcmp(event_str, "power:power_start") == 0)
-			c_state_start(pe->cpu_id, data.time, pe->value);
-
-		if (strcmp(event_str, "power:power_end") == 0)
-			c_state_end(pe->cpu_id, data.time);
+		if (strcmp(event_str, "power:processor_idle") == 0) {
+			struct power_processor_entry *ppe = (void *)te;
+			if (ppe->state == 0)
+				c_state_end(ppe->cpu_id, data.time);
+			else
+				c_state_start(ppe->cpu_id, data.time,
+					      ppe->state);
+		}
 
-		if (strcmp(event_str, "power:power_frequency") == 0)
-			p_state_change(pe->cpu_id, data.time, pe->value);
+		else if (strcmp(event_str, "power:processor_frequency") == 0) {
+			struct power_processor_entry *ppe = (void *)te;
+			p_state_change(ppe->cpu_id, data.time, ppe->state);
+		}
 
-		if (strcmp(event_str, "sched:sched_wakeup") == 0)
+		else if (strcmp(event_str, "sched:sched_wakeup") == 0)
 			sched_wakeup(data.cpu, data.time, data.pid, te);
 
-		if (strcmp(event_str, "sched:sched_switch") == 0)
+		else if (strcmp(event_str, "sched:sched_switch") == 0)
 			sched_switch(data.cpu, data.time, te);
+
+#if defined(SUPPORT_OLD_POWER_EVENTS)
+		else if (strcmp(event_str, "power:power_start") == 0)
+			c_state_start(peo->cpu_id, data.time, peo->value);
+
+		else if (strcmp(event_str, "power:power_end") == 0)
+			c_state_end(peo->cpu_id, data.time);
+
+		else if (strcmp(event_str, "power:power_frequency") == 0)
+			p_state_change(peo->cpu_id, data.time, peo->value);
+#endif
 	}
 	return 0;
 }
@@ -968,7 +1000,8 @@ static const char * const timechart_usage[] = {
 	NULL
 };
 
-static const char *record_args[] = {
+#if defined(SUPPORT_OLD_POWER_EVENTS)
+static const char *record_old_args[] = {
 	"record",
 	"-a",
 	"-R",
@@ -980,16 +1013,38 @@ static const char *record_args[] = {
 	"-e", "sched:sched_wakeup",
 	"-e", "sched:sched_switch",
 };
+#endif
+
+static const char *record_new_args[] = {
+	"record",
+	"-a",
+	"-R",
+	"-f",
+	"-c", "1",
+	"-e", "power:processor_frequency",
+	"-e", "power:processor_idle",
+	"-e", "sched:sched_wakeup",
+	"-e", "sched:sched_switch",
+};
 
 static int __cmd_record(int argc, const char **argv)
 {
 	unsigned int rec_argc, i, j;
 	const char **rec_argv;
+	const char **record_args = record_new_args;
+	unsigned int record_elems = ARRAY_SIZE(record_new_args);
 
-	rec_argc = ARRAY_SIZE(record_args) + argc - 1;
+#if defined(SUPPORT_OLD_POWER_EVENTS)
+	if (is_valid_tracepoint("power:power_start")) {
+		record_args = record_old_args;
+		record_elems = ARRAY_SIZE(record_old_args);
+	}
+#endif
+
+	rec_argc = record_elems + argc - 1;
 	rec_argv = calloc(rec_argc + 1, sizeof(char *));
 
-	for (i = 0; i < ARRAY_SIZE(record_args); i++)
+	for (i = 0; i < record_elems; i++)
 		rec_argv[i] = strdup(record_args[i]);
 
 	for (j = 1; j < (unsigned int)argc; j++, i++)
diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index 4af5bd5..d706dcb 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -824,7 +824,7 @@ int parse_events(const struct option *opt __used, const char *str, int unset __u
 		if (ret != EVT_HANDLED_ALL) {
 			attrs[nr_counters] = attr;
 			nr_counters++;
-		}
+	}
 
 		if (*str == 0)
 			break;
@@ -906,6 +906,47 @@ static void print_tracepoint_events(void)
 }
 
 /*
+ * Check whether event is in <debugfs_mount_point>/tracing/events
+ */
+
+int is_valid_tracepoint(const char *event_string)
+{
+	DIR *sys_dir, *evt_dir;
+	struct dirent *sys_next, *evt_next, sys_dirent, evt_dirent;
+	char evt_path[MAXPATHLEN];
+	char dir_path[MAXPATHLEN];
+
+	if (debugfs_valid_mountpoint(debugfs_path))
+		return 0;
+
+	sys_dir = opendir(debugfs_path);
+	if (!sys_dir)
+		return 0;
+
+	for_each_subsystem(sys_dir, sys_dirent, sys_next) {
+
+		snprintf(dir_path, MAXPATHLEN, "%s/%s", debugfs_path,
+			 sys_dirent.d_name);
+		evt_dir = opendir(dir_path);
+		if (!evt_dir)
+			continue;
+
+		for_each_event(sys_dirent, evt_dir, evt_dirent, evt_next) {
+			snprintf(evt_path, MAXPATHLEN, "%s:%s",
+				 sys_dirent.d_name, evt_dirent.d_name);
+			if (!strcmp(evt_path, event_string)) {
+				closedir(evt_dir);
+				closedir(sys_dir);
+				return 1;
+			}
+		}
+		closedir(evt_dir);
+	}
+	closedir(sys_dir);
+	return 0;
+}
+
+/*
  * Print the help text for the event symbols:
  */
 void print_events(void)
diff --git a/tools/perf/util/parse-events.h b/tools/perf/util/parse-events.h
index fc4ab3f..7ab4685 100644
--- a/tools/perf/util/parse-events.h
+++ b/tools/perf/util/parse-events.h
@@ -29,6 +29,7 @@ extern int parse_filter(const struct option *opt, const char *str, int unset);
 #define EVENTS_HELP_MAX (128*1024)
 
 extern void print_events(void);
+extern int is_valid_tracepoint(const char *event_string);
 
 extern char debugfs_path[];
 extern int valid_debugfs_mount(const char *debugfs);
-- 
1.6.0.2

^ permalink raw reply related	[flat|nested] 135+ messages in thread

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-19 11:36 ` [PATCH 2/3] PERF(kernel): Cleanup power events Thomas Renninger
@ 2010-10-25  6:54   ` Arjan van de Ven
  2010-10-25  6:54   ` Arjan van de Ven
                     ` (6 subsequent siblings)
  7 siblings, 0 replies; 135+ messages in thread
From: Arjan van de Ven @ 2010-10-25  6:54 UTC (permalink / raw)
  To: Thomas Renninger
  Cc: linux-trace-users, Frederic Weisbecker, Pierre Tardy, Jean Pihet,
	Steven Rostedt, Peter Zijlstra, Frank Eigler, Mathieu Desnoyers,
	Ingo Molnar, linux-pm, Masami Hiramatsu, Tejun Heo,
	Andrew Morton, linux-omap, Linus Torvalds, Thomas Gleixner

On 10/19/2010 4:36 AM, Thomas Renninger wrote:
>   static void poll_idle(void)
>   {
> -	trace_power_start(POWER_CSTATE, 0, smp_processor_id());
>   	local_irq_enable();
>   	while (!need_resched())
>   		cpu_relax();
> -	trace_power_end(0);
>   }

why did you remove the idle tracepoints from this one ???

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-19 11:36 ` [PATCH 2/3] PERF(kernel): Cleanup power events Thomas Renninger
  2010-10-25  6:54   ` Arjan van de Ven
  2010-10-25  6:54   ` Arjan van de Ven
@ 2010-10-25  6:58   ` Arjan van de Ven
  2010-10-25  6:58   ` Arjan van de Ven
                     ` (4 subsequent siblings)
  7 siblings, 0 replies; 135+ messages in thread
From: Arjan van de Ven @ 2010-10-25  6:58 UTC (permalink / raw)
  To: Thomas Renninger
  Cc: linux-trace-users, Frederic Weisbecker, Pierre Tardy, Jean Pihet,
	Steven Rostedt, Peter Zijlstra, Frank Eigler, Mathieu Desnoyers,
	Ingo Molnar, linux-pm, Masami Hiramatsu, Tejun Heo,
	Andrew Morton, linux-omap, Linus Torvalds, Thomas Gleixner

On 10/19/2010 4:36 AM, Thomas Renninger wrote:
> New power trace events:
> power:processor_idle
> power:processor_frequency
> power:machine_suspend
>
>
> C-state/idle accounting events:
>    power:power_start
>    power:power_end
> are replaced with:
>    power:processor_idle
>
I think you need two trace points for this:
one to enter idle
one to exit

because using magic encoding games to encode "exit" is a mistake, as can
be seen in this patch.
You're currently trying to use "0" to signal "end of idle", but "0" is
also a valid idle state (namely that of polling).
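
To make the two-tracepoint idea concrete, here is a minimal sketch of a
consumer that receives separate enter and exit events. Because the exit event
carries no state value, state 0 (polling) can be accounted like any other
idle state. All names (idle_account, idle_enter, idle_exit, MAX_IDLE_STATES)
are hypothetical and exist only for this illustration; they are not taken
from the kernel or from perf.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_IDLE_STATES 8	/* hypothetical bound for this sketch */

struct idle_account {
	int      in_idle;			/* set between enter and exit */
	unsigned state;				/* state recorded at enter */
	uint64_t enter_ts;			/* timestamp of the enter event */
	uint64_t residency[MAX_IDLE_STATES];	/* accumulated time per state */
};

/* Hypothetical handler for an "enter idle" event; state 0 is valid here. */
static void idle_enter(struct idle_account *acc, unsigned state, uint64_t ts)
{
	assert(state < MAX_IDLE_STATES);
	acc->in_idle = 1;
	acc->state = state;
	acc->enter_ts = ts;
}

/* Hypothetical handler for an "exit idle" event; it carries no state. */
static void idle_exit(struct idle_account *acc, uint64_t ts)
{
	if (!acc->in_idle)
		return;		/* exit without a matching enter: ignore */
	acc->residency[acc->state] += ts - acc->enter_ts;
	acc->in_idle = 0;
}
```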

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-25  6:54   ` Arjan van de Ven
  2010-10-25  9:41     ` Thomas Renninger
@ 2010-10-25  9:41     ` Thomas Renninger
  1 sibling, 0 replies; 135+ messages in thread
From: Thomas Renninger @ 2010-10-25  9:41 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: linux-trace-users, Frederic Weisbecker, Pierre Tardy, Jean Pihet,
	Steven Rostedt, Peter Zijlstra, Frank Eigler, Mathieu Desnoyers,
	Ingo Molnar, linux-pm, Masami Hiramatsu, Tejun Heo,
	Andrew Morton, linux-omap, Linus Torvalds, Thomas Gleixner

On Monday 25 October 2010 08:54:34 Arjan van de Ven wrote:
> On 10/19/2010 4:36 AM, Thomas Renninger wrote:
> >   static void poll_idle(void)
> >   {
> > -	trace_power_start(POWER_CSTATE, 0, smp_processor_id());
> >   	local_irq_enable();
> >   	while (!need_resched())
> >   		cpu_relax();
> > -	trace_power_end(0);
> >   }
> 
> why did you remove the idle tracepoints from this one ???
Because no idle/sleep state is entered here.
State 0 does not exist as a sleep state; rather, it means the machine is
not idle. The new event uses idle state 0, in conformance with the spec,
to mean "exit sleep state".

If this should still be trackable, some kind of dummy sleep state:
#define IDLE_BUSY_LOOP 0xFE
(or similar) must get defined and passed like this:
trace_processor_idle(IDLE_BUSY_LOOP, smp_processor_id());
    cpu_relax();
trace_processor_idle(0, smp_processor_id());

I could imagine this being somewhat worthwhile for comparing idle results
against the case where no idle state at all is used.
But nobody should ever use idle=poll; comparing deep sleep states
with C1 (idle=halt) should be sufficient?

    Thomas
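
For comparison, the single-event model with a dummy polling state could be
consumed roughly as sketched below. The IDLE_BUSY_LOOP value 0xFE is the one
suggested in this mail; everything else (cpu_idle_track, on_processor_idle,
IDLE_EXIT, NR_STATES) is a hypothetical name chosen for this illustration and
not an existing kernel or perf interface.

```c
#include <assert.h>
#include <stdint.h>

#define IDLE_EXIT      0x00	/* state 0: CPU is no longer idle */
#define IDLE_BUSY_LOOP 0xFE	/* dummy state for poll_idle(), as proposed */
#define NR_STATES      0x100	/* hypothetical table size for this sketch */

struct cpu_idle_track {
	unsigned cur_state;		/* IDLE_EXIT while not idle */
	uint64_t since;			/* timestamp of the last transition */
	uint64_t residency[NR_STATES];	/* accumulated time per idle state */
};

/* Hypothetical consumer of a single processor_idle-style event. */
static void on_processor_idle(struct cpu_idle_track *t, unsigned state,
			      uint64_t ts)
{
	assert(state < NR_STATES);

	/* Close the interval of whatever idle state was active. */
	if (t->cur_state != IDLE_EXIT)
		t->residency[t->cur_state] += ts - t->since;

	t->cur_state = state;
	t->since = ts;
}
```

With this encoding the consumer needs only one handler, at the cost of
reserving a sentinel value for the polling loop.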

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-19 11:36 ` [PATCH 2/3] PERF(kernel): Cleanup power events Thomas Renninger
                     ` (3 preceding siblings ...)
  2010-10-25  6:58   ` Arjan van de Ven
@ 2010-10-25 10:04   ` Ingo Molnar
  2010-10-25 10:04   ` Ingo Molnar
                     ` (2 subsequent siblings)
  7 siblings, 0 replies; 135+ messages in thread
From: Ingo Molnar @ 2010-10-25 10:04 UTC (permalink / raw)
  To: Thomas Renninger
  Cc: Arjan van de Ven, linux-trace-users, Frederic Weisbecker,
	Pierre Tardy, Jean Pihet, Steven Rostedt, Peter Zijlstra,
	Frank Eigler, Mathieu Desnoyers, linux-pm, Masami Hiramatsu,
	Tejun Heo, Andrew Morton, linux-omap, Linus Torvalds,
	Thomas Gleixner


* Thomas Renninger <trenn@suse.de> wrote:

> New power trace events:
> power:processor_idle
> power:processor_frequency
> power:machine_suspend
> 
> 
> C-state/idle accounting events:
>   power:power_start
>   power:power_end
> are replaced with:
>   power:processor_idle

Well, most power saving hw models (and the code implementing them) have this kind of 
model:

 enter power saving mode X
 exit power saving mode

Where X is some sort of 'power saving deepness' attribute, right?

 	Ingo

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-25 10:04   ` Ingo Molnar
@ 2010-10-25 11:03     ` Thomas Renninger
  2010-10-25 11:03     ` Thomas Renninger
  1 sibling, 0 replies; 135+ messages in thread
From: Thomas Renninger @ 2010-10-25 11:03 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arjan van de Ven, linux-trace-users, Frederic Weisbecker,
	Pierre Tardy, Jean Pihet, Steven Rostedt, Peter Zijlstra,
	Frank Eigler, Mathieu Desnoyers, linux-pm, Masami Hiramatsu,
	Tejun Heo, Andrew Morton, linux-omap, Linus Torvalds,
	Thomas Gleixner

On Monday 25 October 2010 12:04:28 Ingo Molnar wrote:
> 
> * Thomas Renninger <trenn@suse.de> wrote:
> 
> > New power trace events:
> > power:processor_idle
> > power:processor_frequency
> > power:machine_suspend
> > 
> > 
> > C-state/idle accounting events:
> >   power:power_start
> >   power:power_end
> > are replaced with:
> >   power:processor_idle
> 
> Well, most power saving hw models (and the code implementing them) have this kind of 
> model:
> 
>  enter power saving mode X
>  exit power saving mode
> 
> Where X is some sort of 'power saving deepness' attribute, right?
Sure.
But ACPI (and afaik this model got picked up for PCI and other (sub-)archs
as well) defines state 0 as the non-power-saving mode.
The same is done here with the machine suspend state (S0 is back from
suspend), and this model should get picked up when device sleep states get
tracked at some time.
It's consistent and follows some well-known specifications.

Also, tracking processor_idle_{start,end} as separate events
makes no sense, and there is no need to introduce:
processor_idle_start/processor_idle_end
machine_suspend_start/machine_suspend_end
device_power_mode_start/device_power_mode_end
events.
Using state 0 as "exit/end" is much nicer for kernel/
userspace implementations/code and the user.

     Thomas

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-25 11:03     ` Thomas Renninger
  2010-10-25 11:55       ` Ingo Molnar
@ 2010-10-25 11:55       ` Ingo Molnar
  2010-10-25 13:58       ` Arjan van de Ven
  2010-10-25 13:58       ` Arjan van de Ven
  3 siblings, 0 replies; 135+ messages in thread
From: Ingo Molnar @ 2010-10-25 11:55 UTC (permalink / raw)
  To: Thomas Renninger
  Cc: Arjan van de Ven, linux-trace-users, Frederic Weisbecker,
	Pierre Tardy, Jean Pihet, Steven Rostedt, Peter Zijlstra,
	Frank Eigler, Mathieu Desnoyers, linux-pm, Masami Hiramatsu,
	Tejun Heo, Andrew Morton, linux-omap, Linus Torvalds,
	Thomas Gleixner


* Thomas Renninger <trenn@suse.de> wrote:

> On Monday 25 October 2010 12:04:28 Ingo Molnar wrote:
> > 
> > * Thomas Renninger <trenn@suse.de> wrote:
> > 
> > > New power trace events:
> > > power:processor_idle
> > > power:processor_frequency
> > > power:machine_suspend
> > > 
> > > 
> > > C-state/idle accounting events:
> > >   power:power_start
> > >   power:power_end
> > > are replaced with:
> > >   power:processor_idle
> > 
> > Well, most power saving hw models (and the code implementing them) have this kind of 
> > model:
> > 
> >  enter power saving mode X
> >  exit power saving mode
> > 
> > Where X is some sort of 'power saving deepness' attribute, right?
>
> Sure.

Which is the 'saner' model?

> But ACPI and afaik this model got picked up for PCI and other (sub-)archs as well, 
> defines state 0 as the non-power saving mode.

But the actual code does not actually deal with any 'state 0', does it? It enters an 
idle function and then exits it, right?

'power state' might be what is used for devices - but even there, we have:

  - enter power state X
  - exit power state

right?

> Same as done here with machine suspend state (S0 is back from suspend) and
> this model should get picked up when device sleep states get tracked at
> some time.
>
> It's consistent and applies to some well known specifications.

What we want is for it to be the nicest, most understandable, most logical
model - not one matching random hardware specifications.

( Hardware specifications only matter in so far that it should be possible to 
  express all the known hardware state transitions via these events efficiently. )

> Also tracking processor_idle_{start,end} as a separate event makes no sense and 
> there is no need to introduce: processor_idle_start/processor_idle_end 
> machine_suspend_start/machine_suspend_end 
> device_power_mode_start/device_power_mode_end events.

What do you mean by "makes no sense"?

Are they superfluous? Inefficient? Illogical?

> Using state 0 as "exit/end", is much nicer for kernel/ userspace 
> implementations/code and the user.

By that argument we should not have separate fork() and exit() syscalls either, but 
a set_process_state(1) and set_process_state(0) interface?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-25 11:55       ` Ingo Molnar
  2010-10-25 12:55         ` Thomas Renninger
@ 2010-10-25 12:55         ` Thomas Renninger
  2010-10-25 12:58         ` Mathieu Desnoyers
  2010-10-25 12:58         ` Mathieu Desnoyers
  3 siblings, 0 replies; 135+ messages in thread
From: Thomas Renninger @ 2010-10-25 12:55 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arjan van de Ven, linux-trace-users, Frederic Weisbecker,
	Pierre Tardy, Jean Pihet, Steven Rostedt, Peter Zijlstra,
	Frank Eigler, Mathieu Desnoyers, linux-pm, Masami Hiramatsu,
	Tejun Heo, Andrew Morton, linux-omap, Linus Torvalds,
	Thomas Gleixner

On Monday 25 October 2010 13:55:25 Ingo Molnar wrote:
> 
> * Thomas Renninger <trenn@suse.de> wrote:
> 
> > On Monday 25 October 2010 12:04:28 Ingo Molnar wrote:
> > > 
> > > * Thomas Renninger <trenn@suse.de> wrote:
> > > 
> > > > New power trace events:
> > > > power:processor_idle
> > > > power:processor_frequency
> > > > power:machine_suspend
> > > > 
> > > > 
> > > > C-state/idle accounting events:
> > > >   power:power_start
> > > >   power:power_end
> > > > are replaced with:
> > > >   power:processor_idle
> > > 
> > > Well, most power saving hw models (and the code implementing them) have this kind of 
> > > model:
> > > 
> > >  enter power saving mode X
> > >  exit power saving mode
> > > 
> > > Where X is some sort of 'power saving deepness' attribute, right?
> >
> > Sure.
> 
> Which is the 'saner' model?
> 
> > But ACPI and afaik this model got picked up for PCI and other (sub-)archs as well, 
> > defines state 0 as the non-power saving mode.
> 
> But the actual code does not actually deal with any 'state 0', does it?
It does. Not being idle is tracked by the cpuidle driver as state 0
(arch independent):
/sys/devices/system/cpu/cpu0/cpuidle/state0/
halt/C1 on x86 is:
/sys/devices/system/cpu/cpu0/cpuidle/state1/
...

> It enters an idle function and then exits it, right?
> 'power state' might be what is used for devices - but even there, we have:
> 
>   - enter power state X
>   - exit power state
> 
> right?
That is not true for PCI, and probably not for others either.
There you have D0 (the fully powered state) up to D3.
Same for the PCI bus power states (B0, B1, B2, and B3).

Look at drivers/pci/pci.c:pci_raw_set_power_state()
To "exit" a power state you call:
pci_raw_set_power_state(dev, PCI_D0);

Same for suspend. "Exit" suspend is:
#define PM_SUSPEND_ON           ((__force suspend_state_t) 0)
so on resume we enter suspend_state_t 0.

> > Same as done here with machine suspend state (S0 is back from suspend) and
> > this model should get picked up when device sleep states get tracked at
> > some time.
> >
> > It's consistent and applies to some well known specifications.
> 
> What we want it to be is for it to be the nicest, most understandable, most logical 
> model - not one matching random hardware specifications.
> 
> ( Hardware specifications only matter in so far that it should be possible to 
>   express all the known hardware state transitions via these events efficiently. )
> 
> > Also tracking processor_idle_{start,end} as a separate event makes no sense and 
> > there is no need to introduce: processor_idle_start/processor_idle_end 
> > machine_suspend_start/machine_suspend_end 
> > device_power_mode_start/device_power_mode_end events.
> 
> What do you mean by "makes no sense"?
> 
> Are they superfluous?
Yes, you do not need two different events to track one thing.

> Illogical?
Yes. A user who wants to enable processor idle tracking wants
to enable it via:
echo power:processor_idle >>/sys/kernel/debug/tracing/set_event
What do you intend to track with a:
power:power_start
event?

    Thomas

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-25 11:55       ` Ingo Molnar
  2010-10-25 12:55         ` Thomas Renninger
  2010-10-25 12:55         ` Thomas Renninger
@ 2010-10-25 12:58         ` Mathieu Desnoyers
  2010-10-25 12:58         ` Mathieu Desnoyers
  3 siblings, 0 replies; 135+ messages in thread
From: Mathieu Desnoyers @ 2010-10-25 12:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arjan van de Ven, linux-trace-users, Frederic Weisbecker,
	Pierre Tardy, Jean Pihet, Steven Rostedt, Peter Zijlstra,
	Frank Eigler, Thomas Gleixner, linux-pm, Masami Hiramatsu,
	Tejun Heo, Andrew Morton, linux-omap, Linus Torvalds

* Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Thomas Renninger <trenn@suse.de> wrote:
> 
> > On Monday 25 October 2010 12:04:28 Ingo Molnar wrote:
> > > 
> > > * Thomas Renninger <trenn@suse.de> wrote:
> > > 
> > > > New power trace events:
> > > > power:processor_idle
> > > > power:processor_frequency
> > > > power:machine_suspend
> > > > 
> > > > 
> > > > C-state/idle accounting events:
> > > >   power:power_start
> > > >   power:power_end
> > > > are replaced with:
> > > >   power:processor_idle
> > > 
> > > Well, most power saving hw models (and the code implementing them) have this kind of 
> > > model:
> > > 
> > >  enter power saving mode X
> > >  exit power saving mode
> > > 
> > > Where X is some sort of 'power saving deepness' attribute, right?
> >
> > Sure.
> 
> Which is the 'saner' model?
> 
> > But ACPI and afaik this model got picked up for PCI and other (sub-)archs as well, 
> > defines state 0 as the non-power saving mode.
> 
> But the actual code does not actually deal with any 'state 0', does it? It enters an 
> idle function and then exits it, right?
> 
> 'power state' might be what is used for devices - but even there, we have:
> 
>   - enter power state X
>   - exit power state
> 
> right?
> 
> > Same as done here with machine suspend state (S0 is back from suspend) and
> > this model should get picked up when device sleep states get tracked at
> > some time.
> >
> > It's consistent and applies to some well known specifications.
> 
> What we want it to be is for it to be the nicest, most understandable, most logical 
> model - not one matching random hardware specifications.
> 
> ( Hardware specifications only matter in so far that it should be possible to 
>   express all the known hardware state transitions via these events efficiently. )
> 
> > Also tracking processor_idle_{start,end} as a separate event makes no sense and 
> > there is no need to introduce: processor_idle_start/processor_idle_end 
> > machine_suspend_start/machine_suspend_end 
> > device_power_mode_start/device_power_mode_end events.
> 
> What do you mean by "makes no sense"?
> 
> Are they superfluous? Inefficient? Illogical?

I think it would require a deep understanding of the specific power modes of each
architecture to split into this topology. On the bright side, it would bring a
clear understanding of which HW resource is being put to sleep, which would make
automated analysis much easier to do. But maybe it's too much pain compared to
the benefit. The related question is also: where is it best to put this logic?
In the kernel code? In per-arch TRACE_EVENT() handlers or in external trace
analysis plugins?

> 
> > Using state 0 as "exit/end", is much nicer for kernel/ userspace 
> > implementations/code and the user.
> 
> By that argument we should not have separate fork() and exit() syscalls either, but 
> a set_process_state(1) and set_process_state(0) interface?

I'm by no means an expert on power saving hardware specs, but if it is possible for
hardware to switch between two power saving states without passing through power
state 0, then using a "set state" rather than an enter/exit would be more
appropriate, even if we go for a scheme introducing

processor_idle_start/processor_idle_end,
machine_suspend_start/machine_suspend_end,
device_power_mode_start/device_power_mode_end.

I must defer to you guys to figure out if some hardware actually does that for
any of CPU idle, suspend or device power modes.
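
A minimal sketch of the "set state" case described above, assuming a
userspace consumer that accounts time per state. The names idle_acct,
idle_set_state and MAX_STATES are hypothetical and not part of any patch
in this thread. It shows that a direct switch from one power saving state
to another needs no intermediate exit event:

```c
#include <stdint.h>

#define MAX_STATES 8 /* assumption: a small, fixed number of states */

/* Per-CPU residency bookkeeping driven by "set state" events. */
struct idle_acct {
	unsigned int cur_state;             /* current state; 0 = not idle */
	uint64_t last_ts;                   /* timestamp of the last transition */
	uint64_t residency[MAX_STATES];     /* accumulated time per state */
};

/* Handle one "set state" event: the CPU switched to new_state at time ts. */
static void idle_set_state(struct idle_acct *a, unsigned int new_state,
			   uint64_t ts)
{
	if (new_state >= MAX_STATES)
		return; /* ignore states this sketch cannot hold */

	/* Charge the elapsed time to the state that is being left. */
	a->residency[a->cur_state] += ts - a->last_ts;
	a->cur_state = new_state;
	a->last_ts = ts;
}
```

With enter/exit events the same consumer would have to synthesize an exit
whenever it sees two enter events in a row.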

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-25  9:41     ` Thomas Renninger
@ 2010-10-25 13:55       ` Arjan van de Ven
  2010-10-25 13:55       ` Arjan van de Ven
  1 sibling, 0 replies; 135+ messages in thread
From: Arjan van de Ven @ 2010-10-25 13:55 UTC (permalink / raw)
  To: Thomas Renninger
  Cc: linux-trace-users, Frederic Weisbecker, Pierre Tardy, Jean Pihet,
	Steven Rostedt, Peter Zijlstra, Frank Eigler, Mathieu Desnoyers,
	Ingo Molnar, linux-pm, Masami Hiramatsu, Tejun Heo,
	Andrew Morton, linux-omap, Linus Torvalds, Thomas Gleixner

On 10/25/2010 2:41 AM, Thomas Renninger wrote:
> On Monday 25 October 2010 08:54:34 Arjan van de Ven wrote:
>> On 10/19/2010 4:36 AM, Thomas Renninger wrote:
>>>    static void poll_idle(void)
>>>    {
>>> -	trace_power_start(POWER_CSTATE, 0, smp_processor_id());
>>>    	local_irq_enable();
>>>    	while (!need_resched())
>>>    		cpu_relax();
>>> -	trace_power_end(0);
>>>    }
>> why did you remove the idle tracepoints from this one ???
> Because no idle/sleep state is entered here.
> State 0 does not exist or say, it means the machine is not idle.
> The new event uses idle state 0 spec conform as "exit sleep state".
>
> If this should still be trackable some kind of dummy sleep state:
> #define IDLE_BUSY_LOOP 0xFE
> (or similar) must get defined and passed like this:
> trace_processor_idle(IDLE_BUSY_LOOP, smp_processor_id());
>      cpu_relax()
> trace_processor_idle(0, smp_processor_id());
>
> I could imagine this is somewhat worth it to compare idle results
> to "no idle state at all" is used.
> But nobody should ever use idle=poll, comparing deep sleep states
> with C1 with (idle=halt) should be sufficient?

this is not only used with idle=poll on the command line.
it also gets used in normal operation, in two cases:
1) during real time operations, for some short periods of time
     (think Wall Street trading)
2) by the menu governor when the next event is less than a few
microseconds away, so short that even C1 is too much

I know that your new API tries to use "0" as exit, but 0 is already
taken for this (at least in all power terminology on x86).

why isn't your "exit" a special define?


also, if you look at many other similar perf events, they have separate
entry/exit points:

process/do_process.cpp:         perf_events->add_event("irq:irq_handler_entry");
process/do_process.cpp:         perf_events->add_event("irq:irq_handler_exit");
process/do_process.cpp:         perf_events->add_event("irq:softirq_entry");
process/do_process.cpp:         perf_events->add_event("irq:softirq_exit");
process/do_process.cpp:         perf_events->add_event("timer:timer_expire_entry");
process/do_process.cpp:         perf_events->add_event("timer:timer_expire_exit");
process/do_process.cpp:         perf_events->add_event("timer:hrtimer_expire_entry");
process/do_process.cpp:         perf_events->add_event("timer:hrtimer_expire_exit");
process/do_process.cpp:         perf_events->add_event("power:power_start");
process/do_process.cpp:         perf_events->add_event("power:power_end");
process/do_process.cpp:         perf_events->add_event("workqueue:workqueue_execute_start");
process/do_process.cpp:         perf_events->add_event("workqueue:workqueue_execute_end");

so there is already an API consistency precedent
(and frankly, trying to multiplex in "exit" via a magic value is asking 
for trouble API wise)
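
The "special define" suggested above could look roughly like the
following sketch. IDLE_EXIT, cpu_idle_view and on_processor_idle are
hypothetical names chosen for illustration and the sentinel value is an
assumption. The point is only that a reserved value keeps state 0
(polling) usable as an ordinary idle state:

```c
#include <stdbool.h>

/* Hypothetical sentinel: a value that no real idle state uses. */
#define IDLE_EXIT 0xFFFFFFFFu

struct cpu_idle_view {
	bool idle;          /* true between an entry event and the exit event */
	unsigned int state; /* last idle state entered; valid while idle */
};

/* Consume one single-event-style notification for a CPU. */
static void on_processor_idle(struct cpu_idle_view *v, unsigned int state)
{
	if (state == IDLE_EXIT) {
		v->idle = false;
		return;
	}
	/* Any other value, including 0 for polling, is an idle entry. */
	v->idle = true;
	v->state = state;
}
```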

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-25 11:03     ` Thomas Renninger
  2010-10-25 11:55       ` Ingo Molnar
  2010-10-25 11:55       ` Ingo Molnar
@ 2010-10-25 13:58       ` Arjan van de Ven
  2010-10-25 13:58       ` Arjan van de Ven
  3 siblings, 0 replies; 135+ messages in thread
From: Arjan van de Ven @ 2010-10-25 13:58 UTC (permalink / raw)
  To: Thomas Renninger
  Cc: linux-trace-users, Frederic Weisbecker, Pierre Tardy, Jean Pihet,
	Steven Rostedt, Peter Zijlstra, Frank Eigler, Thomas Gleixner,
	linux-pm, Masami Hiramatsu, Tejun Heo, Ingo Molnar, linux-omap,
	Linus Torvalds, Andrew Morton, Mathieu Desnoyers

On 10/25/2010 4:03 AM, Thomas Renninger wrote:
> On Monday 25 October 2010 12:04:28 Ingo Molnar wrote:
>> * Thomas Renninger<trenn@suse.de>  wrote:
>>
>>> New power trace events:
>>> power:processor_idle
>>> power:processor_frequency
>>> power:machine_suspend
>>>
>>>
>>> C-state/idle accounting events:
>>>    power:power_start
>>>    power:power_end
>>> are replaced with:
>>>    power:processor_idle
>> Well, most power saving hw models (and the code implementing them) have this kind of
>> model:
>>
>>   enter power saving mode X
>>   exit power saving mode
>>
>> Where X is some sort of 'power saving deepness' attribute, right?
> Sure.
> But ACPI and afaik this model got picked up for PCI and other (sub-)archs
> as well, defines state 0 as the non-power saving mode.

correct... "C0" is not power efficient... but it's still a valid OS
idle state!

same for "S0"... S0 as standby state is still valid... sure it doesn't
save you much power... but that does not mean it's not valid.
(as indication, the Intel Moorestown platform, which is currently in
production and available to OEMs, has such a S0 standby state)


> Also tracking processor_idle_{start,end} as a separate event
> makes no sense and there is no need to introduce:
> processor_idle_start/processor_idle_end
> machine_suspend_start/machine_suspend_end
> device_power_mode_start/device_power_mode_end
> events.
> Using state 0 as "exit/end", is much nicer for kernel/
> userspace implementations/code and the user.
actually no; having written a few of these in userspace so far, having a
separate end event is easier to deal with;
the actions you take on entry and exit are completely separate code paths.
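
As an illustration of the separate code paths mentioned above, here is a
minimal sketch of a userspace consumer with one handler per event. The
struct and function names are hypothetical and this is not powertop's
actual code:

```c
#include <stdint.h>

/* Per-CPU statistics kept by a hypothetical trace consumer. */
struct cstate_stats {
	unsigned int state;    /* state named by the pending start event */
	uint64_t entered_at;   /* timestamp of the pending start event */
	uint64_t total_idle;   /* accumulated idle time */
	unsigned long wakeups; /* number of completed idle periods */
};

/* Entry path: only remember what was entered and when. */
static void on_power_start(struct cstate_stats *s, unsigned int state,
			   uint64_t ts)
{
	s->state = state;
	s->entered_at = ts;
}

/* Exit path: account the finished idle period and count the wakeup. */
static void on_power_end(struct cstate_stats *s, uint64_t ts)
{
	s->total_idle += ts - s->entered_at;
	s->wakeups++;
}
```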

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-25 12:55         ` Thomas Renninger
@ 2010-10-25 14:11           ` Arjan van de Ven
  2010-10-25 14:11           ` Arjan van de Ven
  1 sibling, 0 replies; 135+ messages in thread
From: Arjan van de Ven @ 2010-10-25 14:11 UTC (permalink / raw)
  To: Thomas Renninger
  Cc: linux-trace-users, Frederic Weisbecker, Pierre Tardy, Jean Pihet,
	Steven Rostedt, Peter Zijlstra, Frank Eigler, Thomas Gleixner,
	linux-pm, Masami Hiramatsu, Tejun Heo, Ingo Molnar, linux-omap,
	Linus Torvalds, Andrew Morton, Mathieu Desnoyers

On 10/25/2010 5:55 AM, Thomas Renninger wrote:


>> But the actual code does not actually deal with any 'state 0', does it?
> It does. Not being idle is tracked by cpuidle driver as state 0
> (arch independent):
> /sys/devices/system/cpu/cpu0/cpuidle/state0/
> halt/C1 on X86 is:
> /sys/devices/system/cpu/cpu0/cpuidle/state1/
> ...
state0 is still OS idle!


the API is just weird for this, from a userspace perspective

if the kernel picks this state 0 for the idle handler, the userspace app 
gets
two events

one for going to state 0 to enter the idle state
one for going to state 0 to exit idle

but they're the exact same event in your API.

rather unpleasant from a userspace program perspective....
now I need to start tracking even more state on top in powertop to be 
able to make a guess at which of the two meanings a state 0 entry has.
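
The ambiguity described above can be sketched as follows, assuming a
consumer that only sees a single event carrying a state number. The
names classify, guess_ctx and the enum are hypothetical. A state 0 event
can only be interpreted by remembering what came before it:

```c
#include <stdbool.h>

enum meaning { ENTER_IDLE, EXIT_IDLE };

/* Extra state the consumer has to carry just to interpret state 0. */
struct guess_ctx {
	bool idle; /* best guess: is the CPU currently in an idle state? */
};

/* Guess what an event means when state 0 is both "poll idle" and "exit". */
static enum meaning classify(struct guess_ctx *c, unsigned int state)
{
	if (state != 0) {
		c->idle = true;
		return ENTER_IDLE;
	}
	/*
	 * State 0: either entry into polling idle or exit from idle.
	 * Only the remembered state can tell, so toggle the guess.
	 */
	c->idle = !c->idle;
	return c->idle ? ENTER_IDLE : EXIT_IDLE;
}
```

If a single event is lost or tracing starts while the CPU is idle, every
later state 0 event is guessed the wrong way round.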

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-25 13:55       ` Arjan van de Ven
  2010-10-25 14:36         ` Thomas Renninger
@ 2010-10-25 14:36         ` Thomas Renninger
  1 sibling, 0 replies; 135+ messages in thread
From: Thomas Renninger @ 2010-10-25 14:36 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: linux-trace-users, Frederic Weisbecker, Pierre Tardy, Jean Pihet,
	Steven Rostedt, Peter Zijlstra, Frank Eigler, Mathieu Desnoyers,
	Ingo Molnar, linux-pm, Masami Hiramatsu, Tejun Heo,
	Andrew Morton, linux-omap, Linus Torvalds, Thomas Gleixner

On Monday 25 October 2010 15:55:08 Arjan van de Ven wrote:
> On 10/25/2010 2:41 AM, Thomas Renninger wrote:
> > On Monday 25 October 2010 08:54:34 Arjan van de Ven wrote:
> >> On 10/19/2010 4:36 AM, Thomas Renninger wrote:
> >>>    static void poll_idle(void)
> >>>    {
> >>> -	trace_power_start(POWER_CSTATE, 0, smp_processor_id());
> >>>    	local_irq_enable();
> >>>    	while (!need_resched())
> >>>    		cpu_relax();
> >>> -	trace_power_end(0);
> >>>    }
> >> why did you remove the idle tracepoints from this one ???
> > Because no idle/sleep state is entered here.
> > State 0 does not exist or say, it means the machine is not idle.
> > The new event uses idle state 0 spec conform as "exit sleep state".
> >
> > If this should still be trackable some kind of dummy sleep state:
> > #define IDLE_BUSY_LOOP 0xFE
> > (or similar) must get defined and passed like this:
> > trace_processor_idle(IDLE_BUSY_LOOP, smp_processor_id());
> >      cpu_relax()
> > trace_processor_idle(0, smp_processor_id());
> >
> > I could imagine this is somewhat worth it to compare idle results
> > to "no idle state at all" is used.
> > But nobody should ever use idle=poll, comparing deep sleep states
> > with C1 with (idle=halt) should be sufficient?
> 
> this is not idle=poll on the command line only.
> this also gets used normally, in two cases
> 1) during real time operations, for some short periods of time
>      (think wallstreet trading)
> 2) by the menu governor when the next event is less than a few 
> microseconds away, so short that even C1 is too much
> 
> I know that your new API tries to use "0" as exit, but 0 is already 
> taken (in all power terminology at least on x86 it is) for this.
cpuidle indeed misuses C0 as "poll idle" state.
That's really bad/misleading, but nothing that can be changed easily.

I agree shifting C0 (cpuidle) <-> POLL_IDLE event
and              "not idle"   <-> real C0 (executing instructions)
or however this gets mapped makes things even worse.

Damn, it could have been that easy and straightforward, but I agree that
this kills the approach of triggering a state 0 event when C0 is entered
(C0 being defined as the operational mode, executing instructions).

     Thomas
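
A minimal sketch of the dummy-state idea above, written as a userspace
model rather than kernel code. trace_processor_idle() is the tracepoint
name proposed in this thread; here it is only a stub that records events.
IDLE_BUSY_LOOP (0xFE) is the placeholder value suggested above, and the
need_resched() stub and the event log are assumptions made for
illustration.

```c
#include <stddef.h>

/* Placeholder marker for the polling loop, taken from the proposal above. */
#define IDLE_BUSY_LOOP 0xFE
/* In the proposed API, state 0 means "no longer idle". */
#define IDLE_EXIT 0

struct idle_event {
	unsigned int state;
	unsigned int cpu;
};

#define MAX_EVENTS 16
static struct idle_event event_log[MAX_EVENTS];
static size_t event_count;

/* Userspace stand-in for the proposed tracepoint: just record the event. */
static void trace_processor_idle(unsigned int state, unsigned int cpu)
{
	if (event_count < MAX_EVENTS) {
		event_log[event_count].state = state;
		event_log[event_count].cpu = cpu;
		event_count++;
	}
}

/* Stand-in for need_resched(): report pending work after three spins. */
static int spins_left = 3;
static int need_resched(void)
{
	return spins_left-- <= 0;
}

/* Polling idle loop bracketed by a dummy "busy loop" state and the exit. */
void poll_idle(unsigned int cpu)
{
	trace_processor_idle(IDLE_BUSY_LOOP, cpu);
	while (!need_resched())
		;	/* the kernel would call cpu_relax() here */
	trace_processor_idle(IDLE_EXIT, cpu);
}
```

Calling poll_idle(1) leaves two records in event_log: IDLE_BUSY_LOOP on
entry and 0 on exit, both for CPU 1.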

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-25 14:36         ` Thomas Renninger
@ 2010-10-25 14:45           ` Arjan van de Ven
  2010-10-25 14:45           ` Arjan van de Ven
  1 sibling, 0 replies; 135+ messages in thread
From: Arjan van de Ven @ 2010-10-25 14:45 UTC (permalink / raw)
  To: Thomas Renninger
  Cc: linux-trace-users, Frederic Weisbecker, Pierre Tardy, Jean Pihet,
	Steven Rostedt, Peter Zijlstra, Frank Eigler, Mathieu Desnoyers,
	Ingo Molnar, linux-pm, Masami Hiramatsu, Tejun Heo,
	Andrew Morton, linux-omap, Linus Torvalds, Thomas Gleixner

On 10/25/2010 7:36 AM, Thomas Renninger wrote:
>> I know that your new API tries to use "0" as exit, but 0 is already
>> taken (in all power terminology at least on x86 it is) for this.
> cpuidle indeed misuses C0 as "poll idle" state.
> That's really bad/misleading, but nothing that can be changed easily.
>
> I agree shifting C0 (cpuidle)<->  POLL_IDLE event
> and              "not idle"<->  real C0 (executing instructions)
> or however this gets mapped makes things even worse.
>
> Damn, it could have been that easy and straightforward, but I agree that
> this kills the approach of triggering a state 0 event when C0 is entered
> (C0 being defined as the operational mode, executing instructions).

ok so we have

"C0 idle"
and
"C0 no longer idle"

I'd propose using the number 0 for the first one (it makes the most 
logical sense, it's the least deep idle state etc etc)

we could use "-1" or "INT_MAX" for the latter

but as a user of the API I'd rather have a separate "we're no longer idle"
event... but if not, as long as things aren't ambiguous I'll find a way
to code around it.
basically with a separate event, I demultiplex based on event number 
between entry and exit.... with a special exit value I would just need a 
double demultiplex,
one on "idle" and then a second one on the state number to split between 
entry/exit.
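
The two demultiplexing schemes can be sketched roughly as follows. The
event IDs, the record layout and the choice of -1 as the exit value are
assumptions for illustration, not an existing ABI.

```c
/* What a consumer wants to know about each trace record. */
enum transition { ENTER_IDLE, EXIT_IDLE, UNRELATED };

/* Scheme A: separate entry and exit events (IDs are made up). */
#define EV_A_IDLE_ENTER 1
#define EV_A_IDLE_EXIT  2

/* Scheme B: one idle event plus a reserved "exit" state value. */
#define EV_B_IDLE       1
#define IDLE_STATE_EXIT (-1)	/* assumed; INT_MAX was the other candidate */

struct trace_record {
	int event_id;
	int state;	/* idle state number; 0 is the polling state */
};

/* Scheme A: a single demultiplex on the event ID. */
enum transition classify_a(const struct trace_record *r)
{
	switch (r->event_id) {
	case EV_A_IDLE_ENTER:
		return ENTER_IDLE;
	case EV_A_IDLE_EXIT:
		return EXIT_IDLE;
	default:
		return UNRELATED;
	}
}

/* Scheme B: demultiplex on the event ID first, then on the state number. */
enum transition classify_b(const struct trace_record *r)
{
	if (r->event_id != EV_B_IDLE)
		return UNRELATED;
	return r->state == IDLE_STATE_EXIT ? EXIT_IDLE : ENTER_IDLE;
}
```

In both schemes a record carrying state 0 still decodes as an entry into
the polling state; only the way the exit is recognised differs.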

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-25 14:11           ` Arjan van de Ven
  2010-10-25 14:51             ` Thomas Renninger
@ 2010-10-25 14:51             ` Thomas Renninger
  1 sibling, 0 replies; 135+ messages in thread
From: Thomas Renninger @ 2010-10-25 14:51 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: linux-trace-users, Frederic Weisbecker, Pierre Tardy, Jean Pihet,
	Steven Rostedt, Peter Zijlstra, Frank Eigler, Thomas Gleixner,
	linux-pm, Masami Hiramatsu, Tejun Heo, Ingo Molnar, linux-omap,
	Linus Torvalds, Andrew Morton, Mathieu Desnoyers

On Monday 25 October 2010 16:11:10 Arjan van de Ven wrote:
> On 10/25/2010 5:55 AM, Thomas Renninger wrote:
> 
> 
> >> But the actual code does not actually deal with any 'state 0', does it?
> > It does. Not being idle is tracked by cpuidle driver as state 0
> > (arch independent):
> > /sys/devices/system/cpu/cpu0/cpuidle/state0/
> > halt/C1 on X86 is:
> > /sys/devices/system/cpu/cpu0/cpuidle/state1/
> > ...
> state0 is still OS idle!
Yes, I just realized that.
Which is very unfortunate.
The whole cpuidle stuff is based on ACPI C-states and
cat /sys/devices/system/cpu/cpu0/cpuidle/state0/name
C0
is plain wrong if it's used as "poll idle" time.
C0 is defined as (in the ACPI spec):
----------
2.5 Processor Power State Definitions
C0 Processor Power State
While the processor is in this state, it executes instructions.
----------

> the API is just weird for this, from a userspace perspective
> 
> if the kernel picks this state 0 for the idle handler, the userspace app 
> gets
> two events
> 
> one for going to state 0 to enter the idle state
> one for going to state 0 to exit idle
> 
> but they're the exact same event in your API.
> 
> rather unpleasant from a userspace program perspective....
Yeah. But the re-definition of C0 being "Linux poll idle"
will confuse users as well. Not sure whether this should get
touched, though.

Thanks for the clarification, I wasn't aware of that...

    Thomas
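
For reference, a small sketch of how a userspace tool might read the name
that cpuidle reports for a state. Only the sysfs path quoted above comes
from this thread; the helper name read_state_name() is hypothetical.

```c
#include <stdio.h>
#include <string.h>

/*
 * Read the first line of a cpuidle "name" attribute into buf and strip
 * the trailing newline. Returns 0 on success, -1 on any failure.
 */
int read_state_name(const char *path, char *buf, size_t len)
{
	FILE *f;

	if (len == 0)
		return -1;

	f = fopen(path, "r");
	if (!f)
		return -1;

	if (!fgets(buf, (int)len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	/* Cut the string at the first newline, if there is one. */
	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}
```

A caller would pass
"/sys/devices/system/cpu/cpu0/cpuidle/state0/name" as the path and compare
the result with the state number it sees in the trace.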

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-25 14:45           ` Arjan van de Ven
@ 2010-10-25 14:56             ` Ingo Molnar
  2010-10-25 14:56             ` Ingo Molnar
  1 sibling, 0 replies; 135+ messages in thread
From: Ingo Molnar @ 2010-10-25 14:56 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: linux-trace-users, Frederic Weisbecker, Pierre Tardy, Jean Pihet,
	Steven Rostedt, Peter Zijlstra, Frank Eigler, Thomas Gleixner,
	linux-pm, Masami Hiramatsu, Tejun Heo, Andrew Morton, linux-omap,
	Linus Torvalds, Mathieu Desnoyers


* Arjan van de Ven <arjan@linux.intel.com> wrote:

> On 10/25/2010 7:36 AM, Thomas Renninger wrote:
> >>I know that your new API tries to use "0" as exit, but 0 is already
> >>taken (in all power terminology at least on x86 it is) for this.
> >cpuidle indeed misuses C0 as "poll idle" state.
> >That's really bad/misleading, but nothing that can be changed easily.
> >
> >I agree shifting C0 (cpuidle)<->  POLL_IDLE event
> >and              "not idle"<->  real C0 (executing instructions)
> >or however this gets mapped makes things even worse.
> >
> >Damn, it could have been that easy and straightforward, but I agree that
> >this kills the approach of triggering a state 0 event when C0 is entered
> >(C0 being defined as the operational mode, executing instructions).
> 
> ok so we have
> 
> "C0 idle"
> and
> "C0 no longer idle"
> 
> I'd propose using the number 0 for the first one (it makes the most
> logical sense, it's the least deep idle state etc etc)
> 
> we could use "-1" or "INT_MAX" for the latter
> 
> but as a user of the API I'd rather have a separate "we're no longer idle" event...
> but if not, as long as things aren't ambiguous I'll find a way to code around it.
>
> basically with a separate event, I demultiplex based on event number between entry 
> and exit.... with a special exit value I would just need a double demultiplex,

Hm, does not sound particularly smart.

> one on "idle" and then a second one on the state number to split between 
> entry/exit.

The thing is, in terms of CPU idle state, if the old tracepoints give us all the
information that the new tracepoints do, why don't we simply add the tracepoints to ARM
and be done with it? No app needs to be changed in that case, etc.

Plus, let's express the suspend/resume tracepoints as suspend_enter(X)/suspend_exit()
events as well, to keep it symmetric and consistent with the other enter/exit
events.

The rename alone isn't a strong enough reason really. 'entering idle state X' and
'exiting idle' is pretty much synonymous to 'enter idle state X'.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-25 14:56             ` Ingo Molnar
  2010-10-25 15:48               ` Thomas Renninger
@ 2010-10-25 15:48               ` Thomas Renninger
  1 sibling, 0 replies; 135+ messages in thread
From: Thomas Renninger @ 2010-10-25 15:48 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Frederic Weisbecker, linux-trace-users, Arjan van de Ven,
	Pierre Tardy, Jean Pihet, Steven Rostedt, Peter Zijlstra,
	Frank Eigler, Mathieu Desnoyers, linux-pm, Masami Hiramatsu,
	Tejun Heo, Andrew Morton, linux-omap, Linus Torvalds,
	Thomas Gleixner

On Monday 25 October 2010 16:56:04 Ingo Molnar wrote:
> 
> * Arjan van de Ven <arjan@linux.intel.com> wrote:
> > On 10/25/2010 7:36 AM, Thomas Renninger wrote:
> > ok so we have
> > 
> > "C0 idle"
Ideally this should not be called C0, but expressed
as (#define) POLL_IDLE wherever possible.

In all documentation/specs/white papers about other OSes
C0 is referred to as not being idle.
Linux misuses it as a self-defined idle state, which
is really confusing.

> > and
> > "C0 no longer idle"
> > 
> > I'd propose using the number 0 for the first one (it makes the most
> > logical sense, it's the least deep idle state etc etc)
I would use a special number for the "Linux only" state.

> > we could use "-1" or "INT_MAX" for the latter
> > but as a user of the API I'd rather have a separate "we're no longer idle" event...
> > but if not, as long as things aren't ambiguous I'll find a way to code around it.
> >
> > basically with a separate event, I demultiplex based on event number between entry 
> > and exit.... with a special exit value I would just need a double demultiplex,
> 
> Hm, does not sound particularly smart.
> 
> > one on "idle" and then a second one on the state number to split between 
> > entry/exit.
> 
> The thing is, in terms of CPU idle state, if the old tracepoints give us all the
> information that the new tracepoints do, why don't we simply add the tracepoints to ARM
> and be done with it? No app needs to be changed in that case, etc.
> 
> Plus, let's express the suspend/resume tracepoints as suspend_enter(X)/suspend_exit()
> events as well, to keep it symmetric and consistent with the other enter/exit
> events.
> 
> The rename alone isn't a strong enough reason really. 'entering idle state X' and
> 'exiting idle' is pretty much synonymous to 'enter idle state X'.
It's not only that, my patch also:
  - eliminates the never used type= field
  - uses a better name, currently it's power:power_{start,end}
    How would you name another power event...

Altogether, it should justify the proposed cleanup(s).
But with this C0 clash, I am not sure whether:
  1) no cleanup at all, as Ingo said,
  2) a minimal cleanup:
       - rename power:power_{start,end} to power:processor_idle_{start,end}
       - get rid of type= field
  3) or a maximum cleanup:
       - plus not use start/end events, but use one state transition
         event.
should be done.
I think it is best if Jean goes with the current definitions.
Option 2 is far less intrusive, and if you would like to have it, I can
still send another patch.

     Thomas

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-25 15:48               ` Thomas Renninger
  2010-10-25 16:00                 ` Arjan van de Ven
@ 2010-10-25 16:00                 ` Arjan van de Ven
  1 sibling, 0 replies; 135+ messages in thread
From: Arjan van de Ven @ 2010-10-25 16:00 UTC (permalink / raw)
  To: Thomas Renninger
  Cc: linux-trace-users, Frederic Weisbecker, Pierre Tardy, Jean Pihet,
	Steven Rostedt, Peter Zijlstra, Frank Eigler, Thomas Gleixner,
	linux-pm, Masami Hiramatsu, Tejun Heo, Ingo Molnar, linux-omap,
	Linus Torvalds, Andrew Morton, Mathieu Desnoyers

On 10/25/2010 8:48 AM, Thomas Renninger wrote:
> On Monday 25 October 2010 16:56:04 Ingo Molnar wrote:
>> * Arjan van de Ven<arjan@linux.intel.com>  wrote:
>>> On 10/25/2010 7:36 AM, Thomas Renninger wrote:
>>> ok so we have
>>>
>>> "C0 idle"
> Ideally this should not be called C0, but expressed
> as (#define) POLL_IDLE wherever possible.
>
> In all documentation/specs/white papers about other OSes
> C0 is referred to as not being idle.
> Linux misuses it as a self-defined idle state, which
> is really confusing.

sure naming is one thing
>>> and
>>> "C0 no longer idle"
>>>
>>> I'd propose using the number 0 for the first one (it makes the most
>>> logical sense, it's the least deep idle state etc etc)
> I would use a special number for the "Linux only" state.

that special number is 0 though..
it makes sense in ordering, 0 < 1, 1 < 2 etc



0 makes for a really bad special number for the exit marker; not just here,
but also for your suspend hook, that one definitely needs to change
(since currently available commercial SoCs already use 0 for
standby-level states)

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-25 12:58         ` Mathieu Desnoyers
  2010-10-25 20:29           ` Rafael J. Wysocki
@ 2010-10-25 20:29           ` Rafael J. Wysocki
  1 sibling, 0 replies; 135+ messages in thread
From: Rafael J. Wysocki @ 2010-10-25 20:29 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andrew Morton, linux-trace-users, Frederic Weisbecker,
	Pierre Tardy, Jean Pihet, Steven Rostedt, Peter Zijlstra,
	Frank Eigler, Arjan van de Ven, linux-pm, Masami Hiramatsu,
	Tejun Heo, Thomas Gleixner, linux-omap, Linus Torvalds,
	Ingo Molnar

On Monday, October 25, 2010, Mathieu Desnoyers wrote:
> * Ingo Molnar (mingo@elte.hu) wrote:
> > 
> > * Thomas Renninger <trenn@suse.de> wrote:
> > 
> > > On Monday 25 October 2010 12:04:28 Ingo Molnar wrote:
> > > > 
> > > > * Thomas Renninger <trenn@suse.de> wrote:
> > > > 
> > > > > New power trace events:
> > > > > power:processor_idle
> > > > > power:processor_frequency
> > > > > power:machine_suspend
> > > > > 
> > > > > 
> > > > > C-state/idle accounting events:
> > > > >   power:power_start
> > > > >   power:power_end
> > > > > are replaced with:
> > > > >   power:processor_idle
> > > > 
> > > > Well, most power saving hw models (and the code implementing them) have this kind of 
> > > > model:
> > > > 
> > > >  enter power saving mode X
> > > >  exit power saving mode
> > > > 
> > > > Where X is some sort of 'power saving deepness' attribute, right?
> > >
> > > Sure.
> > 
> > Which is the 'saner' model?
> > 
> > > But ACPI and afaik this model got picked up for PCI and other (sub-)archs as well, 
> > > defines state 0 as the non-power saving mode.
> > 
> > But the actual code does not actually deal with any 'state 0', does it? It enters an 
> > idle function and then exits it, right?
> > 
> > 'power state' might be what is used for devices - but even there, we have:
> > 
> >   - enter power state X
> >   - exit power state
> > 
> > right?
> > 
> > > Same as done here with machine suspend state (S0 is back from suspend) and
> > > this model should get picked up when device sleep states get tracked at
> > > some time.
> > >
> > > It's consistent and applies to some well known specifications.
> > 
> > What we want it to be is for it to be the nicest, most understandable, most logical 
> > model - not one matching random hardware specifications.
> > 
> > ( Hardware specifications only matter in so far that it should be possible to 
> >   express all the known hardware state transitions via these events efficiently. )
> > 
> > > Also tracking processor_idle_{start,end} as a separate event makes no sense and 
> > > there is no need to introduce: processor_idle_start/processor_idle_end 
> > > machine_suspend_start/machine_suspend_end 
> > > device_power_mode_start/device_power_mode_end events.
> > 
> > What do you mean by "makes no sense"?
> > 
> > Are they superfluous? Inefficient? Illogical?
> 
> I think it would require deep understanding of specific power modes of each
> architecture to split into this topology. On the bright side, it would bring
> clear understanding of which HW resource is being put to sleep, which would make
> automated analysis much easier to do. But maybe it's too much pain compared to
> the benefit. The related question is also: where is it best to put this logic ?
> In the kernel code ? In per-arch TRACE_EVENT() handlers or in external trace
> analysis plugins ?
> 
> > 
> > > Using state 0 as "exit/end", is much nicer for kernel/ userspace 
> > > implementations/code and the user.
> > 
> > By that argument we should not have separate fork() and exit() syscalls either, but 
> > a set_process_state(1) and set_process_state(0) interface?
> 
> I'm by no means an expert on power saving hardware specs, but if it is possible for
> hardware to switch between two power saving states without passing through power
> state 0, then using a "set state" rather than an enter/exit would be more
> appropriate; even if we go for a scheme introducing
> 
> processor_idle_start/processor_idle_end,
> machine_suspend_start/machine_suspend_end,
> device_power_mode_start/device_power_mode_end.
> 
> I must defer to you guys to figure out if some hardware actually does that for
> either of CPU idle, suspend or device power modes.

Yes, you can go directly from PCI_D1 to PCI_D2, for one example.

Apart from this, attempting to put system suspend in the same bag as cpuidle
is not going to work in the long run.  They are _fundamentally_ different things
even though the power state we get into as a result of suspend is approximately
the same as the one we can get into via cpuidle (even in that case the energy
savings will generally differ because of wakeup events).

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 135+ messages in thread
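
The point about direct transitions (for example PCI_D1 to PCI_D2 without
passing through state 0) can be illustrated with a small userspace sketch.
It is not taken from any tool in this thread: struct residency,
residency_init(), account_set_state() and STATE_ACTIVE are hypothetical
names, and timestamps are plain integers. With a single "set state" event,
a consumer closes the previous interval and opens the next one in the same
code path, so a direct switch between two low-power states needs no
synthetic exit event.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_STATES   8
#define STATE_ACTIVE 0   /* hypothetical: state 0 means "not power saving" */

/* Per-device accounting for a "set state" style event stream. */
struct residency {
	uint32_t cur_state;          /* state entered by the last event */
	uint64_t since;              /* timestamp of the last event */
	uint64_t total[MAX_STATES];  /* accumulated time per state */
};

static void residency_init(struct residency *r, uint64_t now)
{
	for (int i = 0; i < MAX_STATES; i++)
		r->total[i] = 0;
	r->cur_state = STATE_ACTIVE;
	r->since = now;
}

/*
 * Handle one "set state" event: charge the elapsed time to the state we
 * are leaving, then switch. A direct D1 -> D2 transition is simply two
 * consecutive events with non-zero states.
 */
static void account_set_state(struct residency *r, uint32_t new_state,
			      uint64_t now)
{
	assert(new_state < MAX_STATES);
	assert(now >= r->since);

	r->total[r->cur_state] += now - r->since;
	r->cur_state = new_state;
	r->since = now;
}
```

Starting at t=100 and feeding events (state 1, t=110), (state 2, t=140) and
(state 0, t=200) charges 10 units to state 0, 30 to state 1 and 60 to
state 2. An enter/exit model would need an extra exit event at t=140 to
produce the same totals.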

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-25 13:58       ` Arjan van de Ven
@ 2010-10-25 20:33         ` Rafael J. Wysocki
  2010-10-25 20:33         ` Rafael J. Wysocki
  1 sibling, 0 replies; 135+ messages in thread
From: Rafael J. Wysocki @ 2010-10-25 20:33 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Andrew Morton, linux-trace-users, Frederic Weisbecker,
	Pierre Tardy, Jean Pihet, Steven Rostedt, Peter Zijlstra,
	Frank Eigler, Mathieu Desnoyers, linux-pm, Masami Hiramatsu,
	Tejun Heo, Thomas Gleixner, linux-omap, Linus Torvalds,
	Ingo Molnar

On Monday, October 25, 2010, Arjan van de Ven wrote:
> On 10/25/2010 4:03 AM, Thomas Renninger wrote:
> > On Monday 25 October 2010 12:04:28 Ingo Molnar wrote:
> >> * Thomas Renninger<trenn@suse.de>  wrote:
> >>
> >>> New power trace events:
> >>> power:processor_idle
> >>> power:processor_frequency
> >>> power:machine_suspend
> >>>
> >>>
> >>> C-state/idle accounting events:
> >>>    power:power_start
> >>>    power:power_end
> >>> are replaced with:
> >>>    power:processor_idle
> >> Well, most power saving hw models (and the code implementing them) have this kind of
> >> model:
> >>
> >>   enter power saving mode X
> >>   exit power saving mode
> >>
> >> Where X is some sort of 'power saving deepness' attribute, right?
> > Sure.
> > But ACPI and afaik this model got picked up for PCI and other (sub-)archs
> > as well, defines state 0 as the non-power saving mode.
> 
> correct ,... "C0" is not power efficient... but it's still a valid OS 
> idle state!
> 
> same for "S0"... S0 as standby state is still valid... sure it doesn't 
> save you much power... but that does not mean it's not valid.

If you mean ACPI S0, it is not a standby state.  It actually is the full-power
state.

> (as indication, the Intel Moorestown platform, which is currently in 
> production and available to OEMs, has such a S0 standby state)

Another naming confusion.  How smart.

> > Also tracking processor_idle_{start,end} as a separate event
> > makes no sense and there is no need to introduce:
> > processor_idle_start/processor_idle_end
> > machine_suspend_start/machine_suspend_end
> > device_power_mode_start/device_power_mode_end
> > events.
> > Using state 0 as "exit/end", is much nicer for kernel/
> > userspace implementations/code and the user.
> actually no; having written a few of these in userspace so far, having a 
> separate end event is easier to deal with;
> the actions you take on entry and exit are completely separate code paths.

That's correct, unless you go directly from one low-power state to another
(which is possible for example for PCI).  We don't do that at the moment,
but it's possible in principle and we may want to start doing that at one
point.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-25 16:00                 ` Arjan van de Ven
  2010-10-25 23:32                   ` Thomas Renninger
@ 2010-10-25 23:32                   ` Thomas Renninger
  1 sibling, 0 replies; 135+ messages in thread
From: Thomas Renninger @ 2010-10-25 23:32 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: linux-trace-users, Frederic Weisbecker, Pierre Tardy, Jean Pihet,
	Steven Rostedt, Peter Zijlstra, Frank Eigler, Thomas Gleixner,
	linux-pm, Masami Hiramatsu, Tejun Heo, Ingo Molnar, linux-omap,
	Linus Torvalds, Andrew Morton, Mathieu Desnoyers

@Ingo: Can you queue up 1/3, it's an independent fix.

On Monday 25 October 2010 06:00:17 pm Arjan van de Ven wrote:
> On 10/25/2010 8:48 AM, Thomas Renninger wrote:
> 
> sure naming is one thing
Yes it should get renamed to not show:
cat /sys/devices/system/cpu/cpu0/cpuidle/state0/name
C0
This is wrong and confusing

> >>> and
> >>> "C0 no longer idle"
> >>>
> >>> I'd propose using the number 0 for the first one (it makes the most
> >>> logical sense, it's the least deep idle state etc etc)
> > I would use a special number for the "Linux only" state.
> 
> that special number is 0 though..
> it makes sense in ordering, 0 < 1, 1 < 2 etc
As long as it stays a kernel and perf processor_idle internal number
it does not hurt.
But userspace tools catching the perf idle event of state 0 should never
refer to it as processor idle state 0 (or even worse C0).
Instead they should try to get the name/description of:
/sys/../state0/name
or directly refer to it as "poll idle" state.

Processor idle state C0 is not only defined as "not being idle" in the
specs; turbostat and cpufreq-aperf also use it correctly and refer to C0 when
they show accounted "not idle" time.

Encouraged by your suggestions, I am sending another version.
It's not a big deal to send 0xFFFFFFFF instead of 0 as the "non power saving"
state. If you can handle compatibility with it in powertop, it doesn't make
things as complicated in the kernel and perf timechart as I first thought it
would.

      Thomas

^ permalink raw reply	[flat|nested] 135+ messages in thread
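
The suggestion that userspace tools should take the label from
/sys/../state0/name, or call state 0 "poll idle", instead of printing "C0"
could look roughly like the sketch below. It assumes the sysfs layout quoted
in the message; cpuidle_name_path() and cpuidle_state_label() are
hypothetical helper names, and the fallback strings are an illustrative
choice, not something any existing tool is known to print.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build the sysfs path that holds the name of a cpuidle state. */
static int cpuidle_name_path(char *buf, size_t len, unsigned int cpu,
			     unsigned int state)
{
	return snprintf(buf, len,
			"/sys/devices/system/cpu/cpu%u/cpuidle/state%u/name",
			cpu, state);
}

/*
 * Copy a label for (cpu, state) into out. The kernel-provided name is
 * preferred; if it cannot be read, fall back to a neutral label that
 * never calls state 0 "C0".
 */
static void cpuidle_state_label(char *out, size_t len, unsigned int cpu,
				unsigned int state)
{
	char path[128];
	FILE *f;

	assert(len > 0);
	cpuidle_name_path(path, sizeof(path), cpu, state);

	f = fopen(path, "r");
	if (f) {
		char *ok = fgets(out, (int)len, f);

		fclose(f);
		if (ok) {
			out[strcspn(out, "\n")] = '\0';  /* strip newline */
			return;
		}
	}

	if (state == 0)
		snprintf(out, len, "poll idle");
	else
		snprintf(out, len, "idle state %u", state);
}
```

The state number from the trace event is then only used as an index; what
gets shown to the user comes from the kernel, or from the fallback when the
file is missing.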

* [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-19 11:36 ` [PATCH 2/3] PERF(kernel): Cleanup power events Thomas Renninger
                     ` (6 preceding siblings ...)
  2010-10-25 23:33   ` [PATCH] PERF(kernel): Cleanup power events V2 Thomas Renninger
@ 2010-10-25 23:33   ` Thomas Renninger
  7 siblings, 0 replies; 135+ messages in thread
From: Thomas Renninger @ 2010-10-25 23:33 UTC (permalink / raw)
  Cc: Arjan van de Ven, linux-trace-users, Frederic Weisbecker,
	Pierre Tardy, Jean Pihet, Steven Rostedt, Peter Zijlstra,
	Frank Eigler, Thomas Gleixner, Ingo Molnar, linux-pm,
	Masami Hiramatsu, Tejun Heo, Andrew Morton, linux-omap,
	Linus Torvalds, Mathieu Desnoyers

Changes in V2:
  - Introduce PWR_EVENT_EXIT instead of 0 to mark the non-power-saving state
  - Use u32 instead of u64 for cpuid and state, which is more than enough

New power trace events:
power:processor_idle
power:processor_frequency
power:machine_suspend


C-state/idle accounting events:
  power:power_start
  power:power_end
are replaced with:
  power:processor_idle

and
  power:power_frequency
is replaced with:
  power:processor_frequency

power:machine_suspend
is newly introduced. A first implementation
comes from the ARM side, but it's easy to add these events
on x86 as well if needed.

The type= field got removed from both; it was never
used, and the type is already distinguished by the event itself.

perf timechart
userspace tool gets adjusted in a separate patch.

Signed-off-by: Thomas Renninger <trenn@suse.de>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
CC: Frank Eigler <fche@redhat.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Kevin Hilman <khilman@deeprootsystems.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: linux-omap@vger.kernel.org
CC: rjw@sisk.pl
CC: linux-pm@lists.linux-foundation.org
CC: linux-trace-users@vger.kernel.org
CC: Jean Pihet <jean.pihet@newoldbits.com>
CC: Pierre Tardy <tardyp@gmail.com>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Tejun Heo <tj@kernel.org>
CC: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Arjan van de Ven <arjan@linux.intel.com>
CC: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/kernel/process.c    |    7 +++-
 arch/x86/kernel/process_64.c |    2 +
 drivers/cpufreq/cpufreq.c    |    1 +
 drivers/cpuidle/cpuidle.c    |    1 +
 drivers/idle/intel_idle.c    |    1 +
 include/trace/events/power.h |   81 +++++++++++++++++++++++++++++++++++++++++-
 kernel/trace/Kconfig         |   14 +++++++
 kernel/trace/power-traces.c  |    3 ++
 8 files changed, 108 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 57d1868..6a98da3 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -374,6 +374,7 @@ void default_idle(void)
 {
 	if (hlt_use_halt()) {
 		trace_power_start(POWER_CSTATE, 1, smp_processor_id());
+		trace_processor_idle(1, smp_processor_id());
 		current_thread_info()->status &= ~TS_POLLING;
 		/*
 		 * TS_POLLING-cleared state must be visible before we
@@ -444,6 +445,7 @@ EXPORT_SYMBOL_GPL(cpu_idle_wait);
 void mwait_idle_with_hints(unsigned long ax, unsigned long cx)
 {
 	trace_power_start(POWER_CSTATE, (ax>>4)+1, smp_processor_id());
+	trace_processor_idle((ax>>4)+1, smp_processor_id());
 	if (!need_resched()) {
 		if (cpu_has(&current_cpu_data, X86_FEATURE_CLFLUSH_MONITOR))
 			clflush((void *)&current_thread_info()->flags);
@@ -460,6 +462,7 @@ static void mwait_idle(void)
 {
 	if (!need_resched()) {
 		trace_power_start(POWER_CSTATE, 1, smp_processor_id());
+		trace_processor_idle(1, smp_processor_id());
 		if (cpu_has(&current_cpu_data, X86_FEATURE_CLFLUSH_MONITOR))
 			clflush((void *)&current_thread_info()->flags);
 
@@ -481,10 +484,12 @@ static void mwait_idle(void)
 static void poll_idle(void)
 {
 	trace_power_start(POWER_CSTATE, 0, smp_processor_id());
+	trace_processor_idle(0, smp_processor_id());
 	local_irq_enable();
 	while (!need_resched())
 		cpu_relax();
-	trace_power_end(0);
+	trace_power_end(smp_processor_id());
+	trace_processor_idle(PWR_EVENT_EXIT, smp_processor_id());
 }
 
 /*
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 3d9ea53..5f2bb98 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -142,6 +142,8 @@ void cpu_idle(void)
 			start_critical_timings();
 
 			trace_power_end(smp_processor_id());
+			trace_processor_idle(PWR_EVENT_EXIT,
+					     smp_processor_id());
 
 			/* In many cases the interrupt that ended idle
 			   has already called exit_idle. But some idle
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 199dcb9..33bdc41 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -355,6 +355,7 @@ void cpufreq_notify_transition(struct cpufreq_freqs *freqs, unsigned int state)
 		dprintk("FREQ: %lu - CPU: %lu", (unsigned long)freqs->new,
 			(unsigned long)freqs->cpu);
 		trace_power_frequency(POWER_PSTATE, freqs->new, freqs->cpu);
+		trace_processor_frequency(freqs->new, freqs->cpu);
 		srcu_notifier_call_chain(&cpufreq_transition_notifier_list,
 				CPUFREQ_POSTCHANGE, freqs);
 		if (likely(policy) && likely(policy->cpu == freqs->cpu))
diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index a507108..ec703e6 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -107,6 +107,7 @@ static void cpuidle_idle_call(void)
 	if (cpuidle_curr_governor->reflect)
 		cpuidle_curr_governor->reflect(dev);
 	trace_power_end(smp_processor_id());
+	trace_processor_idle(PWR_EVENT_EXIT, smp_processor_id());
 }
 
 /**
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index 21ac077..c78e496 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -202,6 +202,7 @@ static int intel_idle(struct cpuidle_device *dev, struct cpuidle_state *state)
 
 	stop_critical_timings();
 	trace_power_start(POWER_CSTATE, (eax >> 4) + 1, cpu);
+	trace_processor_idle((eax >> 4) + 1, smp_processor_id());
 	if (!need_resched()) {
 
 		__monitor((void *)&current_thread_info()->flags, 0, 0);
diff --git a/include/trace/events/power.h b/include/trace/events/power.h
index 35a2a6e..4b13414 100644
--- a/include/trace/events/power.h
+++ b/include/trace/events/power.h
@@ -7,6 +7,61 @@
 #include <linux/ktime.h>
 #include <linux/tracepoint.h>
 
+DECLARE_EVENT_CLASS(processor,
+
+	TP_PROTO(unsigned int state, unsigned int cpu_id),
+
+	TP_ARGS(state, cpu_id),
+
+	TP_STRUCT__entry(
+		__field(	u32,		state		)
+		__field(	u32,		cpu_id		)
+	),
+
+	TP_fast_assign(
+		__entry->state = state;
+		__entry->cpu_id = cpu_id;
+	),
+
+	TP_printk("state=%lu cpu_id=%lu", (unsigned long)__entry->state,
+		  (unsigned long)__entry->cpu_id)
+);
+
+DEFINE_EVENT(processor, processor_idle,
+
+	TP_PROTO(unsigned int state, unsigned int cpu_id),
+
+	     TP_ARGS(state, cpu_id)
+);
+
+#define PWR_EVENT_EXIT 0xFFFFFFFF
+
+DEFINE_EVENT(processor, processor_frequency,
+
+	TP_PROTO(unsigned int frequency, unsigned int cpu_id),
+
+	TP_ARGS(frequency, cpu_id)
+);
+
+TRACE_EVENT(machine_suspend,
+
+	TP_PROTO(unsigned int state),
+
+	TP_ARGS(state),
+
+	TP_STRUCT__entry(
+		__field(	u32,		state		)
+	),
+
+	TP_fast_assign(
+		__entry->state = state;
+	),
+
+	TP_printk("state=%lu", (unsigned long)__entry->state)
+);
+
+#ifdef CONFIG_EVENT_POWER_TRACING_DEPRECATED
+
 #ifndef _TRACE_POWER_ENUM_
 #define _TRACE_POWER_ENUM_
 enum {
@@ -69,8 +124,32 @@ TRACE_EVENT(power_end,
 	TP_printk("cpu_id=%lu", (unsigned long)__entry->cpu_id)
 
 );
-
+#endif /* CONFIG_EVENT_POWER_TRACING_DEPRECATED */
 #endif /* _TRACE_POWER_H */
 
+/* Deprecated dummy functions must be protected against multi-declaration */
+#ifndef EVENT_POWER_TRACING_DEPRECATED_PART_H
+#define EVENT_POWER_TRACING_DEPRECATED_PART_H
+
+#ifndef CONFIG_EVENT_POWER_TRACING_DEPRECATED
+
+#ifndef _TRACE_POWER_ENUM_
+#define _TRACE_POWER_ENUM_
+enum {
+	POWER_NONE = 0,
+	POWER_CSTATE = 1,
+	POWER_PSTATE = 2,
+};
+#endif
+
+static inline void trace_power_start(u64 type, u64 state, u64 cpuid) {};
+static inline void trace_power_end(u64 cpuid) {};
+static inline void trace_power_frequency(u64 type, u64 state, u64 cpuid) {};
+#endif /* CONFIG_EVENT_POWER_TRACING_DEPRECATED */
+
+#endif /* EVENT_POWER_TRACING_DEPRECATED_PART_H */
+
+
+
 /* This part must be outside protection */
 #include <trace/define_trace.h>
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 538501c..0b5c841 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -64,6 +64,20 @@ config EVENT_TRACING
 	select CONTEXT_SWITCH_TRACER
 	bool
 
+config EVENT_POWER_TRACING_DEPRECATED
+	depends on EVENT_TRACING
+	bool
+	help
+	  Provides old power event types:
+	  C-state/idle accounting events:
+	  power:power_start
+	  power:power_end
+	  and old cpufreq accounting event:
+	  power:power_frequency
+	  This is for userspace compatibility
+	  and will vanish after 5 kernel iterations,
+	  namely 2.6.41.
+
 config CONTEXT_SWITCH_TRACER
 	bool
 
diff --git a/kernel/trace/power-traces.c b/kernel/trace/power-traces.c
index 0e0497d..6b6da42 100644
--- a/kernel/trace/power-traces.c
+++ b/kernel/trace/power-traces.c
@@ -13,5 +13,8 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/power.h>
 
+#ifdef CONFIG_EVENT_POWER_TRACING_DEPRECATED
 EXPORT_TRACEPOINT_SYMBOL_GPL(power_start);
+#endif
+EXPORT_TRACEPOINT_SYMBOL_GPL(processor_idle);
 
-- 
1.6.3

^ permalink raw reply related	[flat|nested] 135+ messages in thread
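
For a consumer of the new processor_idle event, the TP_printk() format in
the patch ("state=%lu cpu_id=%lu") and the PWR_EVENT_EXIT value are all that
is needed to tell entry from exit. The following is a minimal sketch under
the assumption that the caller has already isolated the payload after the
event name in a text trace line; struct idle_event and
parse_processor_idle() are hypothetical names, not part of perf or powertop.

```c
#include <assert.h>
#include <stdio.h>

#define PWR_EVENT_EXIT 0xFFFFFFFFu   /* value taken from the patch */

struct idle_event {
	unsigned int state;   /* idle state entered; not meaningful on exit */
	unsigned int cpu_id;
	int is_exit;          /* non-zero when state == PWR_EVENT_EXIT */
};

/*
 * Parse the payload printed by the "processor" event class,
 * e.g. "state=1 cpu_id=0". Returns 0 on success, -1 on malformed input.
 */
static int parse_processor_idle(const char *payload, struct idle_event *ev)
{
	unsigned long state, cpu_id;

	assert(payload && ev);
	if (sscanf(payload, "state=%lu cpu_id=%lu", &state, &cpu_id) != 2)
		return -1;
	if (state > PWR_EVENT_EXIT || cpu_id > 0xFFFFFFFFul)
		return -1;   /* both fields are u32 in the trace record */

	ev->state = (unsigned int)state;
	ev->cpu_id = (unsigned int)cpu_id;
	ev->is_exit = (state == PWR_EVENT_EXIT);
	return 0;
}
```

Because the state is cast to unsigned long before printing, an exit event
shows up as state=4294967295 in the text output, which the parser maps back
to is_exit rather than to a numbered idle state.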

* [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-19 11:36 ` [PATCH 2/3] PERF(kernel): Cleanup power events Thomas Renninger
                     ` (5 preceding siblings ...)
  2010-10-25 10:04   ` Ingo Molnar
@ 2010-10-25 23:33   ` Thomas Renninger
  2010-10-26  1:09     ` Arjan van de Ven
                       ` (7 more replies)
  2010-10-25 23:33   ` Thomas Renninger
  7 siblings, 8 replies; 135+ messages in thread
From: Thomas Renninger @ 2010-10-25 23:33 UTC (permalink / raw)
  Cc: Thomas Renninger, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Masami Hiramatsu, Frank Eigler, Steven Rostedt, Kevin Hilman,
	Peter Zijlstra, linux-omap, rjw, linux-pm, linux-trace-users,
	Jean Pihet, Pierre Tardy, Frederic Weisbecker, Tejun Heo,
	Mathieu Desnoyers, Arjan van de Ven, Ingo Molnar

Changes in V2:
  - Introduce PWR_EVENT_EXIT instead of 0 to mark the non-power-saving state
  - Use u32 instead of u64 for cpuid and state, which is more than enough

New power trace events:
power:processor_idle
power:processor_frequency
power:machine_suspend


C-state/idle accounting events:
  power:power_start
  power:power_end
are replaced with:
  power:processor_idle

and
  power:power_frequency
is replaced with:
  power:processor_frequency

power:machine_suspend
is newly introduced. A first implementation
comes from the ARM side, but it's easy to add these events
on x86 as well if needed.

The type= field got removed from both; it was never
used, and the type is already distinguished by the event itself.

perf timechart
userspace tool gets adjusted in a separate patch.

Signed-off-by: Thomas Renninger <trenn@suse.de>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
CC: Frank Eigler <fche@redhat.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Kevin Hilman <khilman@deeprootsystems.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: linux-omap@vger.kernel.org
CC: rjw@sisk.pl
CC: linux-pm@lists.linux-foundation.org
CC: linux-trace-users@vger.kernel.org
CC: Jean Pihet <jean.pihet@newoldbits.com>
CC: Pierre Tardy <tardyp@gmail.com>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Tejun Heo <tj@kernel.org>
CC: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Arjan van de Ven <arjan@linux.intel.com>
CC: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/kernel/process.c    |    7 +++-
 arch/x86/kernel/process_64.c |    2 +
 drivers/cpufreq/cpufreq.c    |    1 +
 drivers/cpuidle/cpuidle.c    |    1 +
 drivers/idle/intel_idle.c    |    1 +
 include/trace/events/power.h |   81 +++++++++++++++++++++++++++++++++++++++++-
 kernel/trace/Kconfig         |   14 +++++++
 kernel/trace/power-traces.c  |    3 ++
 8 files changed, 108 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 57d1868..6a98da3 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -374,6 +374,7 @@ void default_idle(void)
 {
 	if (hlt_use_halt()) {
 		trace_power_start(POWER_CSTATE, 1, smp_processor_id());
+		trace_processor_idle(1, smp_processor_id());
 		current_thread_info()->status &= ~TS_POLLING;
 		/*
 		 * TS_POLLING-cleared state must be visible before we
@@ -444,6 +445,7 @@ EXPORT_SYMBOL_GPL(cpu_idle_wait);
 void mwait_idle_with_hints(unsigned long ax, unsigned long cx)
 {
 	trace_power_start(POWER_CSTATE, (ax>>4)+1, smp_processor_id());
+	trace_processor_idle((ax>>4)+1, smp_processor_id());
 	if (!need_resched()) {
 		if (cpu_has(&current_cpu_data, X86_FEATURE_CLFLUSH_MONITOR))
 			clflush((void *)&current_thread_info()->flags);
@@ -460,6 +462,7 @@ static void mwait_idle(void)
 {
 	if (!need_resched()) {
 		trace_power_start(POWER_CSTATE, 1, smp_processor_id());
+		trace_processor_idle(1, smp_processor_id());
 		if (cpu_has(&current_cpu_data, X86_FEATURE_CLFLUSH_MONITOR))
 			clflush((void *)&current_thread_info()->flags);
 
@@ -481,10 +484,12 @@ static void mwait_idle(void)
 static void poll_idle(void)
 {
 	trace_power_start(POWER_CSTATE, 0, smp_processor_id());
+	trace_processor_idle(0, smp_processor_id());
 	local_irq_enable();
 	while (!need_resched())
 		cpu_relax();
-	trace_power_end(0);
+	trace_power_end(smp_processor_id());
+	trace_processor_idle(PWR_EVENT_EXIT, smp_processor_id());
 }
 
 /*
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 3d9ea53..5f2bb98 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -142,6 +142,8 @@ void cpu_idle(void)
 			start_critical_timings();
 
 			trace_power_end(smp_processor_id());
+			trace_processor_idle(PWR_EVENT_EXIT,
+					     smp_processor_id());
 
 			/* In many cases the interrupt that ended idle
 			   has already called exit_idle. But some idle
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 199dcb9..33bdc41 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -355,6 +355,7 @@ void cpufreq_notify_transition(struct cpufreq_freqs *freqs, unsigned int state)
 		dprintk("FREQ: %lu - CPU: %lu", (unsigned long)freqs->new,
 			(unsigned long)freqs->cpu);
 		trace_power_frequency(POWER_PSTATE, freqs->new, freqs->cpu);
+		trace_processor_frequency(freqs->new, freqs->cpu);
 		srcu_notifier_call_chain(&cpufreq_transition_notifier_list,
 				CPUFREQ_POSTCHANGE, freqs);
 		if (likely(policy) && likely(policy->cpu == freqs->cpu))
diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index a507108..ec703e6 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -107,6 +107,7 @@ static void cpuidle_idle_call(void)
 	if (cpuidle_curr_governor->reflect)
 		cpuidle_curr_governor->reflect(dev);
 	trace_power_end(smp_processor_id());
+	trace_processor_idle(PWR_EVENT_EXIT, smp_processor_id());
 }
 
 /**
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index 21ac077..c78e496 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -202,6 +202,7 @@ static int intel_idle(struct cpuidle_device *dev, struct cpuidle_state *state)
 
 	stop_critical_timings();
 	trace_power_start(POWER_CSTATE, (eax >> 4) + 1, cpu);
+	trace_processor_idle((eax >> 4) + 1, smp_processor_id());
 	if (!need_resched()) {
 
 		__monitor((void *)&current_thread_info()->flags, 0, 0);
diff --git a/include/trace/events/power.h b/include/trace/events/power.h
index 35a2a6e..4b13414 100644
--- a/include/trace/events/power.h
+++ b/include/trace/events/power.h
@@ -7,6 +7,61 @@
 #include <linux/ktime.h>
 #include <linux/tracepoint.h>
 
+DECLARE_EVENT_CLASS(processor,
+
+	TP_PROTO(unsigned int state, unsigned int cpu_id),
+
+	TP_ARGS(state, cpu_id),
+
+	TP_STRUCT__entry(
+		__field(	u32,		state		)
+		__field(	u32,		cpu_id		)
+	),
+
+	TP_fast_assign(
+		__entry->state = state;
+		__entry->cpu_id = cpu_id;
+	),
+
+	TP_printk("state=%lu cpu_id=%lu", (unsigned long)__entry->state,
+		  (unsigned long)__entry->cpu_id)
+);
+
+DEFINE_EVENT(processor, processor_idle,
+
+	TP_PROTO(unsigned int state, unsigned int cpu_id),
+
+	     TP_ARGS(state, cpu_id)
+);
+
+#define PWR_EVENT_EXIT 0xFFFFFFFF
+
+DEFINE_EVENT(processor, processor_frequency,
+
+	TP_PROTO(unsigned int frequency, unsigned int cpu_id),
+
+	TP_ARGS(frequency, cpu_id)
+);
+
+TRACE_EVENT(machine_suspend,
+
+	TP_PROTO(unsigned int state),
+
+	TP_ARGS(state),
+
+	TP_STRUCT__entry(
+		__field(	u32,		state		)
+	),
+
+	TP_fast_assign(
+		__entry->state = state;
+	),
+
+	TP_printk("state=%lu", (unsigned long)__entry->state)
+);
+
+#ifdef CONFIG_EVENT_POWER_TRACING_DEPRECATED
+
 #ifndef _TRACE_POWER_ENUM_
 #define _TRACE_POWER_ENUM_
 enum {
@@ -69,8 +124,32 @@ TRACE_EVENT(power_end,
 	TP_printk("cpu_id=%lu", (unsigned long)__entry->cpu_id)
 
 );
-
+#endif /* CONFIG_EVENT_POWER_TRACING_DEPRECATED */
 #endif /* _TRACE_POWER_H */
 
+/* Deprecated dummy functions must be protected against multi-declaration */
+#ifndef EVENT_POWER_TRACING_DEPRECATED_PART_H
+#define EVENT_POWER_TRACING_DEPRECATED_PART_H
+
+#ifndef CONFIG_EVENT_POWER_TRACING_DEPRECATED
+
+#ifndef _TRACE_POWER_ENUM_
+#define _TRACE_POWER_ENUM_
+enum {
+	POWER_NONE = 0,
+	POWER_CSTATE = 1,
+	POWER_PSTATE = 2,
+};
+#endif
+
+static inline void trace_power_start(u64 type, u64 state, u64 cpuid) {};
+static inline void trace_power_end(u64 cpuid) {};
+static inline void trace_power_frequency(u64 type, u64 state, u64 cpuid) {};
+#endif /* CONFIG_EVENT_POWER_TRACING_DEPRECATED */
+
+#endif /* EVENT_POWER_TRACING_DEPRECATED_PART_H */
+
+
+
 /* This part must be outside protection */
 #include <trace/define_trace.h>
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 538501c..0b5c841 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -64,6 +64,20 @@ config EVENT_TRACING
 	select CONTEXT_SWITCH_TRACER
 	bool
 
+config EVENT_POWER_TRACING_DEPRECATED
+	depends on EVENT_TRACING
+	bool
+	help
+	  Provides old power event types:
+	  C-state/idle accounting events:
+	  power:power_start
+	  power:power_end
+	  and old cpufreq accounting event:
+	  power:power_frequency
+	  This is for userspace compatibility
+	  and will vanish after 5 kernel iterations,
+	  namely 2.6.41.
+
 config CONTEXT_SWITCH_TRACER
 	bool
 
diff --git a/kernel/trace/power-traces.c b/kernel/trace/power-traces.c
index 0e0497d..6b6da42 100644
--- a/kernel/trace/power-traces.c
+++ b/kernel/trace/power-traces.c
@@ -13,5 +13,8 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/power.h>
 
+#ifdef CONFIG_EVENT_POWER_TRACING_DEPRECATED
 EXPORT_TRACEPOINT_SYMBOL_GPL(power_start);
+#endif
+EXPORT_TRACEPOINT_SYMBOL_GPL(processor_idle);
 
-- 
1.6.3


^ permalink raw reply related	[flat|nested] 135+ messages in thread

* [PATCH] PERF(userspace): Adjust perf timechart to the new power events V2
  2010-10-19 11:36 ` Thomas Renninger
  2010-10-26  0:18   ` [PATCH] PERF(userspace): Adjust perf timechart to the new power events V2 Thomas Renninger
@ 2010-10-26  0:18   ` Thomas Renninger
  1 sibling, 0 replies; 135+ messages in thread
From: Thomas Renninger @ 2010-10-26  0:18 UTC (permalink / raw)
  Cc: Arjan van de Ven, linux-trace-users, Frederic Weisbecker,
	Pierre Tardy, Jean Pihet, Steven Rostedt, Peter Zijlstra,
	Frank Eigler, Thomas Gleixner, Ingo Molnar, linux-pm,
	Masami Hiramatsu, Tejun Heo, Andrew Morton, linux-omap,
	Linus Torvalds, Mathieu Desnoyers

Changes in V2:
  - Handle PWR_EVENT_EXIT instead of 0 to recognize the non-power state

The transition was rather smooth. The only part that took some
fiddling was the check whether a tracepoint/event is
supported by the running kernel.

builtin-timechart must only pass -e power:xy events which
are supported by the running kernel.
For this I added a tiny helper function:
int is_valid_tracepoint(const char *event_string)
to parse-events.[hc].
It could be made more generic as an interface and support
hardware/software/... events, not only tracepoints, but someone
else could extend that if needed...
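
The check described above can also be sketched as a single path probe.
This is an illustration only, not the helper added by the patch (which
walks the events directory with opendir). The function names and the
events_root parameter are assumptions; on a typical system events_root
would be the debugfs mount point followed by /tracing/events.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper, not the patch's is_valid_tracepoint():
 * turn "subsystem:event" into "<events_root>/subsystem/event".
 * Returns 0 on success, -1 on malformed input or truncation.
 */
int tracepoint_dir(const char *events_root, const char *event_string,
		   char *buf, size_t len)
{
	const char *colon;
	int n;

	assert(events_root && event_string && buf);

	colon = strchr(event_string, ':');
	/* Reject "event", ":event" and "subsystem:" */
	if (!colon || colon == event_string || colon[1] == '\0')
		return -1;

	n = snprintf(buf, len, "%s/%.*s/%s", events_root,
		     (int)(colon - event_string), event_string, colon + 1);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Returns 1 if the event directory exists, 0 otherwise. */
int tracepoint_exists(const char *events_root, const char *event_string)
{
	char path[4096];

	if (tracepoint_dir(events_root, event_string, path, sizeof(path)))
		return 0;
	return access(path, F_OK) == 0;
}
```

A caller would pick the old or new -e power:xy arguments depending on
whether tracepoint_exists(root, "power:power_start") returns 1.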

Signed-off-by: Thomas Renninger <trenn@suse.de>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
CC: Frank Eigler <fche@redhat.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Kevin Hilman <khilman@deeprootsystems.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: linux-omap@vger.kernel.org
CC: rjw@sisk.pl
CC: linux-pm@lists.linux-foundation.org
CC: linux-trace-users@vger.kernel.org
CC: Jean Pihet <jean.pihet@newoldbits.com>
CC: Pierre Tardy <tardyp@gmail.com>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Tejun Heo <tj@kernel.org>
CC: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Arjan van de Ven <arjan@linux.intel.com>
CC: Ingo Molnar <mingo@elte.hu>
---
 tools/perf/builtin-timechart.c |   89 ++++++++++++++++++++++++++++++++-------
 tools/perf/util/parse-events.c |   43 +++++++++++++++++++-
 tools/perf/util/parse-events.h |    1 +
 3 files changed, 116 insertions(+), 17 deletions(-)

diff --git a/tools/perf/builtin-timechart.c b/tools/perf/builtin-timechart.c
index 9bcc38f..7eaa5b5 100644
--- a/tools/perf/builtin-timechart.c
+++ b/tools/perf/builtin-timechart.c
@@ -32,6 +32,10 @@
 #include "util/session.h"
 #include "util/svghelper.h"
 
+#define SUPPORT_OLD_POWER_EVENTS 1
+#define PWR_EVENT_EXIT 0xFFFFFFFF
+
+
 static char		const *input_name = "perf.data";
 static char		const *output_name = "output.svg";
 
@@ -298,12 +302,25 @@ struct trace_entry {
 	int			lock_depth;
 };
 
-struct power_entry {
+#if defined(SUPPORT_OLD_POWER_EVENTS)
+struct power_entry_old {
 	struct trace_entry te;
 	u64	type;
 	u64	value;
 	u64	cpu_id;
 };
+#endif
+
+struct power_processor_entry {
+	struct trace_entry te;
+	u32	state;
+	u32	cpu_id;
+};
+
+struct power_suspend_entry {
+	struct trace_entry te;
+	u32	state;
+};
 
 #define TASK_COMM_LEN 16
 struct wakeup_entry {
@@ -489,29 +506,46 @@ static int process_sample_event(event_t *event, struct perf_session *session)
 	te = (void *)data.raw_data;
 	if (session->sample_type & PERF_SAMPLE_RAW && data.raw_size > 0) {
 		char *event_str;
-		struct power_entry *pe;
-
-		pe = (void *)te;
+#if defined(SUPPORT_OLD_POWER_EVENTS)
+		struct power_entry_old *peo;
+		peo = (void *)te;
+#endif
 
 		event_str = perf_header__find_event(te->type);
 
 		if (!event_str)
 			return 0;
 
-		if (strcmp(event_str, "power:power_start") == 0)
-			c_state_start(pe->cpu_id, data.time, pe->value);
-
-		if (strcmp(event_str, "power:power_end") == 0)
-			c_state_end(pe->cpu_id, data.time);
+		if (strcmp(event_str, "power:processor_idle") == 0) {
+			struct power_processor_entry *ppe = (void *)te;
+			if (ppe->state == PWR_EVENT_EXIT)
+				c_state_end(ppe->cpu_id, data.time);
+			else
+				c_state_start(ppe->cpu_id, data.time,
+					      ppe->state);
+		}
 
-		if (strcmp(event_str, "power:power_frequency") == 0)
-			p_state_change(pe->cpu_id, data.time, pe->value);
+		else if (strcmp(event_str, "power:processor_frequency") == 0) {
+			struct power_processor_entry *ppe = (void *)te;
+			p_state_change(ppe->cpu_id, data.time, ppe->state);
+		}
 
-		if (strcmp(event_str, "sched:sched_wakeup") == 0)
+		else if (strcmp(event_str, "sched:sched_wakeup") == 0)
 			sched_wakeup(data.cpu, data.time, data.pid, te);
 
-		if (strcmp(event_str, "sched:sched_switch") == 0)
+		else if (strcmp(event_str, "sched:sched_switch") == 0)
 			sched_switch(data.cpu, data.time, te);
+
+#if defined(SUPPORT_OLD_POWER_EVENTS)
+		else if (strcmp(event_str, "power:power_start") == 0)
+			c_state_start(peo->cpu_id, data.time, peo->value);
+
+		else if (strcmp(event_str, "power:power_end") == 0)
+			c_state_end(peo->cpu_id, data.time);
+
+		else if (strcmp(event_str, "power:power_frequency") == 0)
+			p_state_change(peo->cpu_id, data.time, peo->value);
+#endif
 	}
 	return 0;
 }
@@ -968,7 +1002,8 @@ static const char * const timechart_usage[] = {
 	NULL
 };
 
-static const char *record_args[] = {
+#if defined(SUPPORT_OLD_POWER_EVENTS)
+static const char *record_old_args[] = {
 	"record",
 	"-a",
 	"-R",
@@ -980,16 +1015,38 @@ static const char *record_args[] = {
 	"-e", "sched:sched_wakeup",
 	"-e", "sched:sched_switch",
 };
+#endif
+
+static const char *record_new_args[] = {
+	"record",
+	"-a",
+	"-R",
+	"-f",
+	"-c", "1",
+	"-e", "power:processor_frequency",
+	"-e", "power:processor_idle",
+	"-e", "sched:sched_wakeup",
+	"-e", "sched:sched_switch",
+};
 
 static int __cmd_record(int argc, const char **argv)
 {
 	unsigned int rec_argc, i, j;
 	const char **rec_argv;
+	const char **record_args = record_new_args;
+	unsigned int record_elems = ARRAY_SIZE(record_new_args);
 
-	rec_argc = ARRAY_SIZE(record_args) + argc - 1;
+#if defined(SUPPORT_OLD_POWER_EVENTS)
+	if (is_valid_tracepoint("power:power_start")) {
+		record_args = record_old_args;
+		record_elems = ARRAY_SIZE(record_old_args);
+	}
+#endif
+	
+	rec_argc = record_elems + argc - 1;
 	rec_argv = calloc(rec_argc + 1, sizeof(char *));
 
-	for (i = 0; i < ARRAY_SIZE(record_args); i++)
+	for (i = 0; i < record_elems; i++)
 		rec_argv[i] = strdup(record_args[i]);
 
 	for (j = 1; j < (unsigned int)argc; j++, i++)
diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index 4af5bd5..d706dcb 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -824,7 +824,7 @@ int parse_events(const struct option *opt __used, const char *str, int unset __u
 		if (ret != EVT_HANDLED_ALL) {
 			attrs[nr_counters] = attr;
 			nr_counters++;
-		}
+	}
 
 		if (*str == 0)
 			break;
@@ -906,6 +906,47 @@ static void print_tracepoint_events(void)
 }
 
 /*
+ * Check whether event is in <debugfs_mount_point>/tracing/events
+ */
+
+int is_valid_tracepoint(const char *event_string)
+{
+	DIR *sys_dir, *evt_dir;
+	struct dirent *sys_next, *evt_next, sys_dirent, evt_dirent;
+	char evt_path[MAXPATHLEN];
+	char dir_path[MAXPATHLEN];
+
+	if (debugfs_valid_mountpoint(debugfs_path))
+		return 0;
+
+	sys_dir = opendir(debugfs_path);
+	if (!sys_dir)
+		return 0;
+
+	for_each_subsystem(sys_dir, sys_dirent, sys_next) {
+
+		snprintf(dir_path, MAXPATHLEN, "%s/%s", debugfs_path,
+			 sys_dirent.d_name);
+		evt_dir = opendir(dir_path);
+		if (!evt_dir)
+			continue;
+
+		for_each_event(sys_dirent, evt_dir, evt_dirent, evt_next) {
+			snprintf(evt_path, MAXPATHLEN, "%s:%s",
+				 sys_dirent.d_name, evt_dirent.d_name);
+			if (!strcmp(evt_path, event_string)) {
+				closedir(evt_dir);
+				closedir(sys_dir);
+				return 1;
+			}
+		}
+		closedir(evt_dir);
+	}
+	closedir(sys_dir);
+	return 0;
+}
+
+/*
  * Print the help text for the event symbols:
  */
 void print_events(void)
diff --git a/tools/perf/util/parse-events.h b/tools/perf/util/parse-events.h
index fc4ab3f..7ab4685 100644
--- a/tools/perf/util/parse-events.h
+++ b/tools/perf/util/parse-events.h
@@ -29,6 +29,7 @@ extern int parse_filter(const struct option *opt, const char *str, int unset);
 #define EVENTS_HELP_MAX (128*1024)
 
 extern void print_events(void);
+extern int is_valid_tracepoint(const char *event_string);
 
 extern char debugfs_path[];
 extern int valid_debugfs_mount(const char *debugfs);
-- 
1.6.3

^ permalink raw reply related	[flat|nested] 135+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-25 23:33   ` [PATCH] PERF(kernel): Cleanup power events V2 Thomas Renninger
@ 2010-10-26  1:09     ` Arjan van de Ven
  2010-10-26  1:09     ` Arjan van de Ven
                       ` (6 subsequent siblings)
  7 siblings, 0 replies; 135+ messages in thread
From: Arjan van de Ven @ 2010-10-26  1:09 UTC (permalink / raw)
  To: Thomas Renninger
  Cc: linux-trace-users, Frederic Weisbecker, Pierre Tardy, Jean Pihet,
	Steven Rostedt, Peter Zijlstra, Frank Eigler, Mathieu Desnoyers,
	Ingo Molnar, linux-pm, Masami Hiramatsu, Tejun Heo,
	Andrew Morton, linux-omap, Linus Torvalds, Thomas Gleixner

On 10/25/2010 4:33 PM, Thomas Renninger wrote:
> Changes in V2:
>    - Introduce PWR_EVENT_EXIT instead of 0 to mark non-power state
>    - Use u32 instead of u64 for cpuid, state which is by far enough
>
> New power trace events:
> power:processor_idle
> power:processor_frequency
> power:machine_suspend
>
>
> C-state/idle accounting events:
>    power:power_start
>    power:power_end
> are replaced with:
>    power:processor_idle
>
> and
>    power:power_frequency
> is replaced with:
>    power:processor_frequency
>
> power:machine_suspend
> is newly introduced, a first implementation
> comes from the ARM side, but it's easy to add these events
> in X86 as well if needed.
>
> the type= field got removed from both, it was never
> used and the type is distinguished by the event type itself.
>
> perf timechart
> userspace tool gets adjusted in a separate patch.
>

Acked-by: Arjan van de Ven <arjan@linux.intel.com>

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-25 23:33   ` [PATCH] PERF(kernel): Cleanup power events V2 Thomas Renninger
  2010-10-26  1:09     ` Arjan van de Ven
  2010-10-26  1:09     ` Arjan van de Ven
@ 2010-10-26  7:10     ` Ingo Molnar
  2010-10-26  7:10     ` Ingo Molnar
                       ` (4 subsequent siblings)
  7 siblings, 0 replies; 135+ messages in thread
From: Ingo Molnar @ 2010-10-26  7:10 UTC (permalink / raw)
  To: Thomas Renninger
  Cc: Arjan van de Ven, linux-trace-users, Frederic Weisbecker,
	Pierre Tardy, Jean Pihet, Steven Rostedt, Peter Zijlstra,
	Frank Eigler, Mathieu Desnoyers, linux-pm, Masami Hiramatsu,
	Tejun Heo, Andrew Morton, linux-omap, Linus Torvalds,
	Thomas Gleixner


* Thomas Renninger <trenn@suse.de> wrote:

> Changes in V2:
>   - Introduce PWR_EVENT_EXIT instead of 0 to mark non-power state
>   - Use u32 instead of u64 for cpuid, state which is by far enough
> 
> New power trace events:
> power:processor_idle
> power:processor_frequency
> power:machine_suspend
> 
> 
> C-state/idle accounting events:
>   power:power_start
>   power:power_end
> are replaced with:
>   power:processor_idle
> 
> and
>   power:power_frequency
> is replaced with:
>   power:processor_frequency

Could you please name it power:cpu_idle and power:cpu_frequency instead, for 
shortness? We generally use 'cpu' in the kernel and for events.

> power:machine_suspend

How will future PCI (or other device) power saving tracepoints be called?

Might be more consistent to use:

  power:cpu_idle
  power:machine_idle
  power:device_idle

Where machine_idle is the suspend event.

> the type= field got removed from both, it was never
> used and the type is distinguished by the event type itself.
>
> +#define PWR_EVENT_EXIT 0xFFFFFFFF

Shouldn't this be part of the POWER_ enum? (and you can write -1 there)

> +#ifndef _TRACE_POWER_ENUM_
> +#define _TRACE_POWER_ENUM_
> +enum {
> +	POWER_NONE = 0,
> +	POWER_CSTATE = 1,
> +	POWER_PSTATE = 2,
> +};
> +#endif

Since we are cleaning up all these events, those enum definitions don't really look 
logical. For example, what is 'POWER_NONE'? Can a CPU have 'no power'?

Plus:

> +DECLARE_EVENT_CLASS(processor,
> +
> +	TP_PROTO(unsigned int state, unsigned int cpu_id),
> +
> +	TP_ARGS(state, cpu_id),
> +
> +	TP_STRUCT__entry(
> +		__field(	u32,		state		)
> +		__field(	u32,		cpu_id		)

Trace entries can carry a cpu_id of the current processor already. Can this cpu_id 
ever be different from that CPU id?

> +	),
> +
> +	TP_fast_assign(
> +		__entry->state = state;
> +		__entry->cpu_id = cpu_id;
> +	),
> +
> +	TP_printk("state=%lu cpu_id=%lu", (unsigned long)__entry->state,
> +		  (unsigned long)__entry->cpu_id)
> +);
> +
> +DEFINE_EVENT(processor, processor_idle,
> +
> +	TP_PROTO(unsigned int state, unsigned int cpu_id),
> +
> +	     TP_ARGS(state, cpu_id)
> +);
> +
> +#define PWR_EVENT_EXIT 0xFFFFFFFF
> +
> +DEFINE_EVENT(processor, processor_frequency,
> +
> +	TP_PROTO(unsigned int frequency, unsigned int cpu_id),
> +
> +	TP_ARGS(frequency, cpu_id)
> +);

So, we have a 'state' field in the class, which is used as 'state' by the 
power:cpu_idle event, and as 'frequency' by the power:cpu_frequency event?

Are there any architectures that track frequency in Hz, not in kHz? If yes, might 
there ever be a need for the frequency value to be larger than 4.29 GHz? If yes, 
then it won't fit into u32.
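
To put numbers on the u32 limit, here is a minimal sketch. The helper
names are made up for illustration and are not part of the patch; the
only assumption is that a frequency is passed around as a plain integer
in Hz or kHz.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers, not from the patch. */

/* Returns 1 if a frequency given in Hz can be stored in a u32 field. */
int hz_fits_u32(uint64_t hz)
{
	return hz <= UINT32_MAX;
}

/* What a u32 field would actually hold if a Hz value were stored in it. */
uint32_t hz_stored_in_u32(uint64_t hz)
{
	return (uint32_t)hz;	/* wraps modulo 2^32 */
}

/* The same clock expressed in kHz leaves plenty of headroom in a u32. */
uint32_t hz_to_khz(uint64_t hz)
{
	assert(hz / 1000 <= UINT32_MAX);
	return (uint32_t)(hz / 1000);
}
```

The largest value that fits is 4294967295 Hz, roughly 4.29 GHz. A
4.3 GHz clock stored in Hz would wrap to 5032704, while the same clock
in kHz is 4300000 and fits easily.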

Also, might there be a future need to express different types of frequencies? For 
example, should we decide to track turbo frequencies in Intel CPUs, how would that 
be expressed via these events? Are there any architectures and CPUs that somehow 
have some extra attribute to the frequency value?

> +TRACE_EVENT(machine_suspend,
> +
> +	TP_PROTO(unsigned int state),
> +
> +	TP_ARGS(state),
> +
> +	TP_STRUCT__entry(
> +		__field(	u32,		state		)
> +	),

Hm, this event is not used anywhere in the submitted patches. Where is the patch 
that adds usage, and what are the possible values for 'state'?

	Ingo

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-25 23:33   ` [PATCH] PERF(kernel): Cleanup power events V2 Thomas Renninger
                       ` (4 preceding siblings ...)
  2010-10-26  7:59     ` Jean Pihet
@ 2010-10-26  7:59     ` Jean Pihet
  2010-10-26 18:52     ` Rafael J. Wysocki
  2010-10-26 18:52     ` Rafael J. Wysocki
  7 siblings, 0 replies; 135+ messages in thread
From: Jean Pihet @ 2010-10-26  7:59 UTC (permalink / raw)
  To: Thomas Renninger
  Cc: Arjan van de Ven, linux-trace-users, Frederic Weisbecker,
	Pierre Tardy, Steven Rostedt, Peter Zijlstra, Frank Eigler,
	Mathieu Desnoyers, Ingo Molnar, linux-pm, Masami Hiramatsu,
	Tejun Heo, Andrew Morton, linux-omap, Linus Torvalds,
	Thomas Gleixner

On Tue, Oct 26, 2010 at 1:33 AM, Thomas Renninger <trenn@suse.de> wrote:
> Changes in V2:
>  - Introduce PWR_EVENT_EXIT instead of 0 to mark non-power state
>  - Use u32 instead of u64 for cpuid, state which is by far enough
>
> New power trace events:
> power:processor_idle
> power:processor_frequency
> power:machine_suspend
>
>
> C-state/idle accounting events:
>  power:power_start
>  power:power_end
> are replaced with:
>  power:processor_idle
>
> and
>  power:power_frequency
> is replaced with:
>  power:processor_frequency
>
> power:machine_suspend
> is newly introduced, a first implementation
> comes from the ARM side, but it's easy to add these events
> in X86 as well if needed.
This will come as a separate patch, which fits all platforms. Cf.
http://marc.info/?l=linux-omap&m=128620575300682&w=2.

...

> @@ -481,10 +484,12 @@ static void mwait_idle(void)
>  static void poll_idle(void)
>  {
>        trace_power_start(POWER_CSTATE, 0, smp_processor_id());
> +       trace_processor_idle(1, smp_processor_id());

Should that be:
+       trace_processor_idle(0, smp_processor_id());
instead?
Since state '0' is for the CPU active in polling mode and
PWR_EVENT_EXIT means 'exit from any idle state'.
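
A consumer of these events would depend on that distinction. As a rough
user-space sketch (not kernel code; the idle_event record, its ts_ns
timestamp and the idle_ns() helper are hypothetical, and uint32_t stands in
for the kernel's u32), idle time per CPU could be accumulated by pairing each
state-entry event with the next PWR_EVENT_EXIT:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Value proposed in the patch for "exit from any idle state". */
#define PWR_EVENT_EXIT 0xFFFFFFFFu

/* Hypothetical decoded processor_idle record; ts_ns is an assumed timestamp. */
struct idle_event {
	uint64_t ts_ns;
	uint32_t state;
	uint32_t cpu_id;
};

/*
 * Sum the time one CPU spent between an idle-entry event (any state other
 * than PWR_EVENT_EXIT, including polling state 0) and the following exit.
 */
static uint64_t idle_ns(const struct idle_event *ev, size_t n, uint32_t cpu)
{
	uint64_t total = 0, entered = 0;
	int in_idle = 0;

	for (size_t i = 0; i < n; i++) {
		if (ev[i].cpu_id != cpu)
			continue;
		if (ev[i].state == PWR_EVENT_EXIT) {
			if (in_idle)
				total += ev[i].ts_ns - entered;
			in_idle = 0;
		} else {
			entered = ev[i].ts_ns;
			in_idle = 1;
		}
	}
	return total;
}
```

With state 0 emitted on entry to poll_idle(), the polling interval is counted
like any other idle state; an exit event with no preceding entry is ignored.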

>        local_irq_enable();
>        while (!need_resched())
>                cpu_relax();
> -       trace_power_end(0);
> +       trace_power_end(smp_processor_id());
> +       trace_processor_idle(PWR_EVENT_EXIT, smp_processor_id());
>  }
>
>  /*

...

> diff --git a/include/trace/events/power.h b/include/trace/events/power.h
...

Regards,
Jean

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-25 23:33   ` [PATCH] PERF(kernel): Cleanup power events V2 Thomas Renninger
                       ` (3 preceding siblings ...)
  2010-10-26  7:10     ` Ingo Molnar
@ 2010-10-26  7:59     ` Jean Pihet
  2010-10-26  7:59     ` Jean Pihet
                       ` (2 subsequent siblings)
  7 siblings, 0 replies; 135+ messages in thread
From: Jean Pihet @ 2010-10-26  7:59 UTC (permalink / raw)
  To: Thomas Renninger
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Masami Hiramatsu,
	Frank Eigler, Steven Rostedt, Kevin Hilman, Peter Zijlstra,
	linux-omap, rjw, linux-pm, linux-trace-users, Pierre Tardy,
	Frederic Weisbecker, Tejun Heo, Mathieu Desnoyers,
	Arjan van de Ven, Ingo Molnar

On Tue, Oct 26, 2010 at 1:33 AM, Thomas Renninger <trenn@suse.de> wrote:
> Changes in V2:
>  - Introduce PWR_EVENT_EXIT instead of 0 to mark non-power state
>  - Use u32 instead of u64 for cpuid, state which is by far enough
>
> New power trace events:
> power:processor_idle
> power:processor_frequency
> power:machine_suspend
>
>
> C-state/idle accounting events:
>  power:power_start
>  power:power_end
> are replaced with:
>  power:processor_idle
>
> and
>  power:power_frequency
> is replaced with:
>  power:processor_frequency
>
> power:machine_suspend
> is newly introduced, a first implementation
> comes from the ARM side, but it's easy to add these events
> in X86 as well if needed.
This will come as a separate patch, which fits all platforms. Cf.
http://marc.info/?l=linux-omap&m=128620575300682&w=2.

...

> @@ -481,10 +484,12 @@ static void mwait_idle(void)
>  static void poll_idle(void)
>  {
>        trace_power_start(POWER_CSTATE, 0, smp_processor_id());
> +       trace_processor_idle(1, smp_processor_id());

Should that be:
+       trace_processor_idle(0, smp_processor_id());
instead?
Since state '0' is for the CPU active in polling mode and
PWR_EVENT_EXIT means 'exit from any idle state'.

>        local_irq_enable();
>        while (!need_resched())
>                cpu_relax();
> -       trace_power_end(0);
> +       trace_power_end(smp_processor_id());
> +       trace_processor_idle(PWR_EVENT_EXIT, smp_processor_id());
>  }
>
>  /*

...

> diff --git a/include/trace/events/power.h b/include/trace/events/power.h
...

Regards,
Jean
--
To unsubscribe from this list: send the line "unsubscribe linux-trace-users" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26  7:10     ` Ingo Molnar
  2010-10-26  8:08       ` Jean Pihet
@ 2010-10-26  8:08       ` Jean Pihet
  2010-10-26  9:58       ` Arjan van de Ven
                         ` (5 subsequent siblings)
  7 siblings, 0 replies; 135+ messages in thread
From: Jean Pihet @ 2010-10-26  8:08 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arjan van de Ven, linux-trace-users, Frederic Weisbecker,
	Pierre Tardy, Steven Rostedt, Peter Zijlstra, Frank Eigler,
	Thomas Gleixner, linux-pm, Masami Hiramatsu, Tejun Heo,
	Andrew Morton, linux-omap, Linus Torvalds, Mathieu Desnoyers

Ingo,

On Tue, Oct 26, 2010 at 9:10 AM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Thomas Renninger <trenn@suse.de> wrote:
>
>> Changes in V2:
>>   - Introduce PWR_EVENT_EXIT instead of 0 to mark non-power state
>>   - Use u32 instead of u64 for cpuid, state which is by far enough

...

>>
>> +#define PWR_EVENT_EXIT 0xFFFFFFFF
>
> Shouldnt this be part of the POWER_ enum? (and you can write -1 there)
>
>> +#ifndef _TRACE_POWER_ENUM_
>> +#define _TRACE_POWER_ENUM_
>> +enum {
>> +     POWER_NONE = 0,
>> +     POWER_CSTATE = 1,
>> +     POWER_PSTATE = 2,
>> +};
>> +#endif
>
> Since we are cleaning up all these events, those enum definitions dont really look
> logical. For example, what is 'POWER_NONE'? Can a CPU have 'no power'?
The enum belongs to the deprecated API so I would rather not touch it.
Keeping the deprecated code isolated will make it easier to remove
later.

>
> Plus:
>
>> +DECLARE_EVENT_CLASS(processor,
>> +
>> +     TP_PROTO(unsigned int state, unsigned int cpu_id),
>> +
>> +     TP_ARGS(state, cpu_id),
>> +
>> +     TP_STRUCT__entry(
>> +             __field(        u32,            state           )
>> +             __field(        u32,            cpu_id          )
>
> Trace entries can carry a cpu_id of the current processor already. Can this cpu_id
> ever be different from that CPU id?
I have no evidence of that (power management on SMP ARM is coming real
soon...), but one can imagine one of the CPUs being the master for PM
decisions.

>
>> +     ),
>> +
>> +     TP_fast_assign(
>> +             __entry->state = state;
>> +             __entry->cpu_id = cpu_id;
>> +     ),
>> +
>> +     TP_printk("state=%lu cpu_id=%lu", (unsigned long)__entry->state,
>> +               (unsigned long)__entry->cpu_id)
>> +);
>> +
>> +DEFINE_EVENT(processor, processor_idle,
>> +
>> +     TP_PROTO(unsigned int state, unsigned int cpu_id),
>> +
>> +          TP_ARGS(state, cpu_id)
>> +);
>> +
>> +#define PWR_EVENT_EXIT 0xFFFFFFFF
>> +
>> +DEFINE_EVENT(processor, processor_frequency,
>> +
>> +     TP_PROTO(unsigned int frequency, unsigned int cpu_id),
>> +
>> +     TP_ARGS(frequency, cpu_id)
>> +);
>
> So, we have a 'state' field in the class, which is used as 'state' by the
> power::cpu_idle event, and as 'frequency' by the power::cpu_freq event?
>
> Are there any architectures that track frequency in Hz, not in kHz? If yes, might
> there ever be a need for the frequency value to be larger than 4.29 GHz? If yes,
> then it wont fit into u32.
>
> Also, might there be a future need to express different types of frequencies? For
> example, should we decide to track turbo frequencies in Intel CPUs, how would that
> be expressed via these events? Are there any architectures and CPUs that somehow
> have some extra attribute to the frequency value?
>
>> +TRACE_EVENT(machine_suspend,
>> +
>> +     TP_PROTO(unsigned int state),
>> +
>> +     TP_ARGS(state),
>> +
>> +     TP_STRUCT__entry(
>> +             __field(        u32,            state           )
>> +     ),
>
> Hm, this event is not used anywhere in the submitted patches. Where is the patch
> that adds usage, and what are the possible values for 'state'?
This will come as a separate patch, which fits all platforms. Cf.
http://marc.info/?l=linux-omap&m=128620575300682&w=2.
The state field is of type suspend_state_t, cf. include/linux/suspend.h
>
>        Ingo
>

Jean

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26  7:10     ` Ingo Molnar
@ 2010-10-26  8:08       ` Jean Pihet
  2010-10-26 11:21         ` Ingo Molnar
  2010-10-26 11:21         ` Ingo Molnar
  2010-10-26  8:08       ` Jean Pihet
                         ` (6 subsequent siblings)
  7 siblings, 2 replies; 135+ messages in thread
From: Jean Pihet @ 2010-10-26  8:08 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thomas Renninger, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Masami Hiramatsu, Frank Eigler, Steven Rostedt, Kevin Hilman,
	Peter Zijlstra, linux-omap, rjw, linux-pm, linux-trace-users,
	Pierre Tardy, Frederic Weisbecker, Tejun Heo, Mathieu Desnoyers,
	Arjan van de Ven

Ingo,

On Tue, Oct 26, 2010 at 9:10 AM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Thomas Renninger <trenn@suse.de> wrote:
>
>> Changes in V2:
>>   - Introduce PWR_EVENT_EXIT instead of 0 to mark non-power state
>>   - Use u32 instead of u64 for cpuid, state which is by far enough

...

>>
>> +#define PWR_EVENT_EXIT 0xFFFFFFFF
>
> Shouldnt this be part of the POWER_ enum? (and you can write -1 there)
>
>> +#ifndef _TRACE_POWER_ENUM_
>> +#define _TRACE_POWER_ENUM_
>> +enum {
>> +     POWER_NONE = 0,
>> +     POWER_CSTATE = 1,
>> +     POWER_PSTATE = 2,
>> +};
>> +#endif
>
> Since we are cleaning up all these events, those enum definitions dont really look
> logical. For example, what is 'POWER_NONE'? Can a CPU have 'no power'?
The enum belongs to the deprecated API so I would rather not touch it.
Keeping the deprecated code isolated will make it easier to remove
later.

>
> Plus:
>
>> +DECLARE_EVENT_CLASS(processor,
>> +
>> +     TP_PROTO(unsigned int state, unsigned int cpu_id),
>> +
>> +     TP_ARGS(state, cpu_id),
>> +
>> +     TP_STRUCT__entry(
>> +             __field(        u32,            state           )
>> +             __field(        u32,            cpu_id          )
>
> Trace entries can carry a cpu_id of the current processor already. Can this cpu_id
> ever be different from that CPU id?
I have no evidence of that (power management on SMP ARM is coming real
soon...), but one can imagine one of the CPUs being the master for PM
decisions.

>
>> +     ),
>> +
>> +     TP_fast_assign(
>> +             __entry->state = state;
>> +             __entry->cpu_id = cpu_id;
>> +     ),
>> +
>> +     TP_printk("state=%lu cpu_id=%lu", (unsigned long)__entry->state,
>> +               (unsigned long)__entry->cpu_id)
>> +);
>> +
>> +DEFINE_EVENT(processor, processor_idle,
>> +
>> +     TP_PROTO(unsigned int state, unsigned int cpu_id),
>> +
>> +          TP_ARGS(state, cpu_id)
>> +);
>> +
>> +#define PWR_EVENT_EXIT 0xFFFFFFFF
>> +
>> +DEFINE_EVENT(processor, processor_frequency,
>> +
>> +     TP_PROTO(unsigned int frequency, unsigned int cpu_id),
>> +
>> +     TP_ARGS(frequency, cpu_id)
>> +);
>
> So, we have a 'state' field in the class, which is used as 'state' by the
> power::cpu_idle event, and as 'frequency' by the power::cpu_freq event?
>
> Are there any architectures that track frequency in Hz, not in kHz? If yes, might
> there ever be a need for the frequency value to be larger than 4.29 GHz? If yes,
> then it wont fit into u32.
>
> Also, might there be a future need to express different types of frequencies? For
> example, should we decide to track turbo frequencies in Intel CPUs, how would that
> be expressed via these events? Are there any architectures and CPUs that somehow
> have some extra attribute to the frequency value?
>
>> +TRACE_EVENT(machine_suspend,
>> +
>> +     TP_PROTO(unsigned int state),
>> +
>> +     TP_ARGS(state),
>> +
>> +     TP_STRUCT__entry(
>> +             __field(        u32,            state           )
>> +     ),
>
> Hm, this event is not used anywhere in the submitted patches. Where is the patch
> that adds usage, and what are the possible values for 'state'?
This will come as a separate patch, which fits all platforms. Cf.
http://marc.info/?l=linux-omap&m=128620575300682&w=2.
The state field is of type suspend_state_t, cf. include/linux/suspend.h
>
>        Ingo
>

Jean

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26  7:10     ` Ingo Molnar
                         ` (2 preceding siblings ...)
  2010-10-26  9:58       ` Arjan van de Ven
@ 2010-10-26  9:58       ` Arjan van de Ven
  2010-10-26 10:37       ` Thomas Renninger
                         ` (3 subsequent siblings)
  7 siblings, 0 replies; 135+ messages in thread
From: Arjan van de Ven @ 2010-10-26  9:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-trace-users, Frederic Weisbecker, Pierre Tardy, Jean Pihet,
	Steven Rostedt, Peter Zijlstra, Frank Eigler, Thomas Gleixner,
	linux-pm, Masami Hiramatsu, Tejun Heo, Andrew Morton, linux-omap,
	Linus Torvalds, Mathieu Desnoyers

On 10/26/2010 12:10 AM, Ingo Molnar wrote:
> +
>> +	TP_STRUCT__entry(
>> +		__field(	u32,		state		)
>> +		__field(	u32,		cpu_id		)
> Trace entries can carry a cpu_id of the current processor already. Can this cpu_id
> ever be different from that CPU id?

yes

esp. CPU frequency, which you can change cross-CPU....

originally we did not have this in the API but Thomas added it for that 
reason some time ago.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26  7:10     ` Ingo Molnar
  2010-10-26  8:08       ` Jean Pihet
  2010-10-26  8:08       ` Jean Pihet
@ 2010-10-26  9:58       ` Arjan van de Ven
  2010-10-26 10:19         ` Ingo Molnar
  2010-10-26 10:19         ` Ingo Molnar
  2010-10-26  9:58       ` Arjan van de Ven
                         ` (4 subsequent siblings)
  7 siblings, 2 replies; 135+ messages in thread
From: Arjan van de Ven @ 2010-10-26  9:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thomas Renninger, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Masami Hiramatsu, Frank Eigler, Steven Rostedt, Kevin Hilman,
	Peter Zijlstra, linux-omap, rjw, linux-pm, linux-trace-users,
	Jean Pihet, Pierre Tardy, Frederic Weisbecker, Tejun Heo,
	Mathieu Desnoyers

On 10/26/2010 12:10 AM, Ingo Molnar wrote:
> +
>> +	TP_STRUCT__entry(
>> +		__field(	u32,		state		)
>> +		__field(	u32,		cpu_id		)
> Trace entries can carry a cpu_id of the current processor already. Can this cpu_id
> ever be different from that CPU id?

yes

esp. CPU frequency, which you can change cross-CPU....

originally we did not have this in the API but Thomas added it for that 
reason some time ago.


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26  9:58       ` Arjan van de Ven
  2010-10-26 10:19         ` Ingo Molnar
@ 2010-10-26 10:19         ` Ingo Molnar
  1 sibling, 0 replies; 135+ messages in thread
From: Ingo Molnar @ 2010-10-26 10:19 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: linux-trace-users, Frederic Weisbecker, Pierre Tardy, Jean Pihet,
	Steven Rostedt, Peter Zijlstra, Frank Eigler, Thomas Gleixner,
	linux-pm, Masami Hiramatsu, Tejun Heo, Andrew Morton, linux-omap,
	Linus Torvalds, Mathieu Desnoyers


* Arjan van de Ven <arjan@linux.intel.com> wrote:

> On 10/26/2010 12:10 AM, Ingo Molnar wrote:
> >+
> >>+	TP_STRUCT__entry(
> >>+		__field(	u32,		state		)
> >>+		__field(	u32,		cpu_id		)
> >Trace entries can carry a cpu_id of the current processor already. Can this cpu_id
> >ever be different from that CPU id?
> 
> yes
> 
> esp cpu frequency you can change cross cpu....
> 
> originally we did not have this in the API but Thomas added it for that reason 
> some time ago.

Ok, good! Maybe add this as a comment?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26  9:58       ` Arjan van de Ven
@ 2010-10-26 10:19         ` Ingo Molnar
  2010-10-26 10:19         ` Ingo Molnar
  1 sibling, 0 replies; 135+ messages in thread
From: Ingo Molnar @ 2010-10-26 10:19 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Thomas Renninger, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Masami Hiramatsu, Frank Eigler, Steven Rostedt, Kevin Hilman,
	Peter Zijlstra, linux-omap, rjw, linux-pm, linux-trace-users,
	Jean Pihet, Pierre Tardy, Frederic Weisbecker, Tejun Heo,
	Mathieu Desnoyers


* Arjan van de Ven <arjan@linux.intel.com> wrote:

> On 10/26/2010 12:10 AM, Ingo Molnar wrote:
> >+
> >>+	TP_STRUCT__entry(
> >>+		__field(	u32,		state		)
> >>+		__field(	u32,		cpu_id		)
> >Trace entries can carry a cpu_id of the current processor already. Can this cpu_id
> >ever be different from that CPU id?
> 
> yes
> 
> esp cpu frequency you can change cross cpu....
> 
> originally we did not have this in the API but Thomas added it for that reason 
> some time ago.

Ok, good! Maybe add this as a comment?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26  7:10     ` Ingo Molnar
                         ` (3 preceding siblings ...)
  2010-10-26  9:58       ` Arjan van de Ven
@ 2010-10-26 10:37       ` Thomas Renninger
  2010-10-26 10:37       ` Thomas Renninger
                         ` (2 subsequent siblings)
  7 siblings, 0 replies; 135+ messages in thread
From: Thomas Renninger @ 2010-10-26 10:37 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arjan van de Ven, linux-trace-users, Frederic Weisbecker,
	Pierre Tardy, Jean Pihet, Steven Rostedt, Peter Zijlstra,
	Frank Eigler, Mathieu Desnoyers, linux-pm, Masami Hiramatsu,
	Tejun Heo, Andrew Morton, linux-omap, Linus Torvalds,
	Thomas Gleixner

On Tuesday 26 October 2010 09:10:20 Ingo Molnar wrote:
> 
> * Thomas Renninger <trenn@suse.de> wrote:
> 
> > Changes in V2:
> >   - Introduce PWR_EVENT_EXIT instead of 0 to mark non-power state
> >   - Use u32 instead of u64 for cpuid, state which is by far enough
> > 
> > New power trace events:
> > power:processor_idle
> > power:processor_frequency
> > power:machine_suspend
> > 
> > 
> > C-state/idle accounting events:
> >   power:power_start
> >   power:power_end
> > are replaced with:
> >   power:processor_idle
> > 
> > and
> >   power:power_frequency
> > is replaced with:
> >   power:processor_frequency
> 
> Could you please name it power:cpu_idle and power:cpu_frequency instead, for 
> shortness? We generally use 'cpu' in the kernel and for events.
Sure.
> 
> > power:machine_suspend
> 
> How will future PCI (or other device) power saving tracepoints be called?
> 
> Might be more consistent to use:
> 
>   power:cpu_idle
>   power:machine_idle
>   power:device_idle
device_idle is not accurate. Those may be low power modes
like reduced network throughput or reduced wlan range; the device
need not be idle.
Device power states are probably the most complex area. If such
a thing gets introduced, it would make sense to declare
the interface experimental for some time, until a wider range of
devices uses it (in contrast to these new events,
which should not change that soon anymore...).

Also machine_idle may be true, but machine_suspend sounds more
familiar and everyone immediately knows what the event is about.

-> *_idle convention is not really worth it.

> Where machine_idle is the suspend event.
There you say it yourself: you talk about machine_idle but you mean
the suspend event; better to just name it what it is.

> > the type= field got removed from both, it was never
> > used and the type is differed by the event type itself.
> >
> > +#define PWR_EVENT_EXIT 0xFFFFFFFF
> 
> Shouldnt this be part of the POWER_ enum? (and you can write -1 there)
No, the enum below will vanish, but -1 is nicer.
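
Either spelling yields the same stored value, since -1 converted to a 32-bit
unsigned type is 0xFFFFFFFF. A small user-space sketch of that, where
uint32_t stands in for the kernel's u32 and store_state() and is_idle_exit()
are hypothetical helpers, not part of the patch:

```c
#include <assert.h>
#include <stdint.h>

/* User-space stand-in for the PWR_EVENT_EXIT define quoted above. */
#define PWR_EVENT_EXIT 0xFFFFFFFFu

/* Hypothetical helper: mimics storing a state argument into a u32 field. */
static uint32_t store_state(unsigned int state)
{
	return (uint32_t)state;
}

/* Returns 1 when the stored value marks "exit from any idle state". */
static int is_idle_exit(uint32_t state)
{
	return state == PWR_EVENT_EXIT;
}
```

This assumes unsigned int is 32 bits wide, as on the platforms discussed here.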

...

> Plus:
> 
> > +DECLARE_EVENT_CLASS(processor,
> > +
> > +	TP_PROTO(unsigned int state, unsigned int cpu_id),
> > +
> > +	TP_ARGS(state, cpu_id),
> > +
> > +	TP_STRUCT__entry(
> > +		__field(	u32,		state		)
> > +		__field(	u32,		cpu_id		)
> 
> Trace entries can carry a cpu_id of the current processor already. Can this cpu_id 
> ever be different from that CPU id?
Yes. A core's frequency can depend on another core, which
will get switched as well (by one command/MSR).
Compare with commit 6f4f2723d08534fd4e407e1e.

This can theoretically also be the case for sleep states.
Afaik such HW does not exist yet, but the ACPI spec already
provides interfaces to pass these dependencies from BIOS to OS.
-> We want a stable ABI and should be prepared for such stuff.

> > +	),
> > +
> > +	TP_fast_assign(
> > +		__entry->state = state;
> > +		__entry->cpu_id = cpu_id;
> > +	),
> > +
> > +	TP_printk("state=%lu cpu_id=%lu", (unsigned long)__entry->state,
> > +		  (unsigned long)__entry->cpu_id)
> > +);
> > +
> > +DEFINE_EVENT(processor, processor_idle,
> > +
> > +	TP_PROTO(unsigned int state, unsigned int cpu_id),
> > +
> > +	     TP_ARGS(state, cpu_id)
> > +);
> > +
> > +#define PWR_EVENT_EXIT 0xFFFFFFFF
> > +
> > +DEFINE_EVENT(processor, processor_frequency,
> > +
> > +	TP_PROTO(unsigned int frequency, unsigned int cpu_id),
> > +
> > +	TP_ARGS(frequency, cpu_id)
> > +);
> 
> So, we have a 'state' field in the class, which is used as 'state' by the 
> power::cpu_idle event, and as 'frequency' by the power::cpu_freq event?
Yes, is this a problem?
Definitions are a bit shorter with one power processor class.
As "frequency" is stated in the frequency event definition, everything should
be obvious, and this one looks like the more elegant way to me.
 
> Are there any architectures that track frequency in Hz, not in kHz? If yes, might 
> there ever be a need for the frequency value to be larger than 4.29 GHz? If yes, 
> then it wont fit into u32.
The drivers/cpufreq subsystem is fixed to unsigned int (cf. include/linux/cpufreq.h):
        unsigned int            min;    /* in kHz */
        unsigned int            max;    /* in kHz */
        unsigned int            cur;    /* in kHz,
        ...
that should be fine.
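
The arithmetic behind this: a u32 tops out at 4294967295, which is about
4.29 GHz if the unit were Hz but about 4.29 THz in kHz. A minimal user-space
sketch of the two cases (the helper names are made up for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if a frequency fits into a u32 field when recorded in Hz. */
static int fits_u32_in_hz(uint64_t freq_hz)
{
	return freq_hz <= UINT32_MAX;
}

/* Returns 1 if the same frequency fits when recorded in kHz, as cpufreq does. */
static int fits_u32_in_khz(uint64_t freq_hz)
{
	return freq_hz / 1000 <= UINT32_MAX;
}
```

A 5 GHz clock, for example, fails the first check and passes the second.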

> Also, might there be a future need to express different types of frequencies? For 
> example, should we decide to track turbo frequencies in Intel CPUs, how would that 
> be expressed via these events? Are there any architectures and CPUs that somehow 
> have some extra attribute to the frequency value?
I wonder whether this ever can/will work in a sane way.
Userspace can compare with:
/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq
Everything above that is turbo, so I do not think it's ever needed.
But adding an additional value at the end does not violate
userspace compatibility. This has been done with the cpuid
as well.
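
A userspace check along those lines might look like the sketch below. It
takes the premise above at face value, namely that cpuinfo_max_freq reports
the highest non-turbo frequency, which may not hold for every cpufreq driver.
read_khz() and is_turbo() are hypothetical helpers:

```c
#include <assert.h>
#include <stdio.h>

/* Path quoted in the message above. */
#define CPUINFO_MAX_FREQ \
	"/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq"

/* Read a single kHz value from a sysfs-style file; returns 0 on success. */
static int read_khz(const char *path, unsigned int *khz)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = (fscanf(f, "%u", khz) == 1);
	fclose(f);
	return ok ? 0 : -1;
}

/* Assumed rule: treat anything above the reported maximum as turbo. */
static int is_turbo(unsigned int freq_khz, unsigned int max_khz)
{
	return freq_khz > max_khz;
}
```

A trace consumer would call read_khz(CPUINFO_MAX_FREQ, &max) once and then
apply is_turbo() to the frequency value of each event.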
 
> > +TRACE_EVENT(machine_suspend,
> > +
> > +	TP_PROTO(unsigned int state),
> > +
> > +	TP_ARGS(state),
> > +
> > +	TP_STRUCT__entry(
> > +		__field(	u32,		state		)
> > +	),
> 
> Hm, this event is not used anywhere in the submitted patches. Where is the patch 
> that adds usage, and what are the possible values for 'state'?
Jean wants to make use of it on ARM.
I also had a patch for x86; I can have another look at it, Rafael
already gave me a comment on it. But on x86 you typically notice
when you suspend the machine (I could imagine this is more useful on
ARM-driven mobile phones and similar); still, I can add it.

Values probably should be (include/linux/suspend.h):
#define PM_SUSPEND_ON       0
#define PM_SUSPEND_STANDBY  1
#define PM_SUSPEND_MEM      3
#define PM_SUSPEND_MAX      4

How the strange state Arjan talked about gets passed is up
to these guys. Instead of reusing the predefined values 0 and above,
such arch-specific special states should better get passed as:

#define X86_MOORESTOWN_STANDBY_S0   0x100
..                                  0x101
#define ARM_WHATEVER_STRANGE_THING  0x200
...
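
A consumer could then tell the two ranges apart roughly as follows. This is
only a sketch: the PM_SUSPEND_* values are the ones listed above, while
ARCH_STATE_BASE, the state_kind enum and classify_suspend_state() are
hypothetical names derived from the 0x100 example:

```c
#include <assert.h>
#include <stdint.h>

/* Generic values as listed above (include/linux/suspend.h). */
#define PM_SUSPEND_ON       0
#define PM_SUSPEND_STANDBY  1
#define PM_SUSPEND_MEM      3
#define PM_SUSPEND_MAX      4

/* Hypothetical: first value reserved for arch-specific states. */
#define ARCH_STATE_BASE     0x100u

enum state_kind { STATE_GENERIC, STATE_ARCH_SPECIFIC, STATE_UNKNOWN };

/* Classify the 'state' value carried by a machine_suspend event. */
static enum state_kind classify_suspend_state(uint32_t state)
{
	if (state < PM_SUSPEND_MAX)
		return STATE_GENERIC;       /* inside the generic range */
	if (state >= ARCH_STATE_BASE)
		return STATE_ARCH_SPECIFIC; /* e.g. 0x100, 0x200 */
	return STATE_UNKNOWN;               /* gap between the ranges */
}
```

Keeping a gap between the generic values and the arch-specific base leaves
room for new generic states without renumbering.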

     Thomas

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26  7:10     ` Ingo Molnar
                         ` (4 preceding siblings ...)
  2010-10-26 10:37       ` Thomas Renninger
@ 2010-10-26 10:37       ` Thomas Renninger
  2010-10-26 11:19         ` Ingo Molnar
  2010-10-26 11:19         ` Ingo Molnar
  2010-10-26 15:32       ` Pierre Tardy
  2010-10-26 15:32       ` Pierre Tardy
  7 siblings, 2 replies; 135+ messages in thread
From: Thomas Renninger @ 2010-10-26 10:37 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Masami Hiramatsu,
	Frank Eigler, Steven Rostedt, Kevin Hilman, Peter Zijlstra,
	linux-omap, rjw, linux-pm, linux-trace-users, Jean Pihet,
	Pierre Tardy, Frederic Weisbecker, Tejun Heo, Mathieu Desnoyers,
	Arjan van de Ven

On Tuesday 26 October 2010 09:10:20 Ingo Molnar wrote:
> 
> * Thomas Renninger <trenn@suse.de> wrote:
> 
> > Changes in V2:
> >   - Introduce PWR_EVENT_EXIT instead of 0 to mark non-power state
> >   - Use u32 instead of u64 for cpuid, state which is by far enough
> > 
> > New power trace events:
> > power:processor_idle
> > power:processor_frequency
> > power:machine_suspend
> > 
> > 
> > C-state/idle accounting events:
> >   power:power_start
> >   power:power_end
> > are replaced with:
> >   power:processor_idle
> > 
> > and
> >   power:power_frequency
> > is replaced with:
> >   power:processor_frequency
> 
> Could you please name it power:cpu_idle and power:cpu_frequency instead, for 
> shortness? We generally use 'cpu' in the kernel and for events.
Sure.
> 
> > power:machine_suspend
> 
> How will future PCI (or other device) power saving tracepoints be called?
> 
> Might be more consistent to use:
> 
>   power:cpu_idle
>   power:machine_idle
>   power:device_idle
device_idle is not accurate. Those may be low power modes
like reduced network throughput or reduced wlan range; the device
need not be idle.
Device power states are probably the most complex area. If such
a thing gets introduced, it would make sense to declare
the interface experimental for some time, until a wider range of
devices uses it (in contrast to these new events,
which should not change that soon anymore...).

Also machine_idle may be true, but machine_suspend sounds more
familiar and everyone immediately knows what the event is about.

-> *_idle convention is not really worth it.

> Where machine_idle is the suspend event.
There you say it yourself: you talk about machine_idle but you mean
the suspend event; better to just name it what it is.

> > the type= field got removed from both, it was never
> > used and the type is differed by the event type itself.
> >
> > +#define PWR_EVENT_EXIT 0xFFFFFFFF
> 
> Shouldnt this be part of the POWER_ enum? (and you can write -1 there)
No, the enum below will vanish, but -1 is nicer.

...

> Plus:
> 
> > +DECLARE_EVENT_CLASS(processor,
> > +
> > +	TP_PROTO(unsigned int state, unsigned int cpu_id),
> > +
> > +	TP_ARGS(state, cpu_id),
> > +
> > +	TP_STRUCT__entry(
> > +		__field(	u32,		state		)
> > +		__field(	u32,		cpu_id		)
> 
> Trace entries can carry a cpu_id of the current processor already. Can this cpu_id 
> ever be different from that CPU id?
Yes. A core's frequency can depend on another core, which
will get switched as well (by one command/MSR).
Compare with commit 6f4f2723d08534fd4e407e1e.

This can theoretically also be the case for sleep states.
Afaik such HW does not exist yet, but the ACPI spec already
provides interfaces to pass these dependencies from BIOS to OS.
-> We want a stable ABI and should be prepared for such stuff.

> > +	),
> > +
> > +	TP_fast_assign(
> > +		__entry->state = state;
> > +		__entry->cpu_id = cpu_id;
> > +	),
> > +
> > +	TP_printk("state=%lu cpu_id=%lu", (unsigned long)__entry->state,
> > +		  (unsigned long)__entry->cpu_id)
> > +);
> > +
> > +DEFINE_EVENT(processor, processor_idle,
> > +
> > +	TP_PROTO(unsigned int state, unsigned int cpu_id),
> > +
> > +	     TP_ARGS(state, cpu_id)
> > +);
> > +
> > +#define PWR_EVENT_EXIT 0xFFFFFFFF
> > +
> > +DEFINE_EVENT(processor, processor_frequency,
> > +
> > +	TP_PROTO(unsigned int frequency, unsigned int cpu_id),
> > +
> > +	TP_ARGS(frequency, cpu_id)
> > +);
> 
> So, we have a 'state' field in the class, which is used as 'state' by the 
> power::cpu_idle event, and as 'frequency' by the power::cpu_freq event?
Yes, is this a problem?
Definitions are a bit shorter with one power processor class.
As "frequency" is stated in the frequency event definition, everything should
be obvious, and this one looks like the more elegant way to me.
 
> Are there any architectures that track frequency in Hz, not in kHz? If yes, might 
> there ever be a need for the frequency value to be larger than 4.29 GHz? If yes, 
> then it wont fit into u32.
The drivers/cpufreq subsystem is fixed to unsigned int (cf. include/linux/cpufreq.h):
        unsigned int            min;    /* in kHz */
        unsigned int            max;    /* in kHz */
        unsigned int            cur;    /* in kHz,
        ...
that should be fine.

> Also, might there be a future need to express different types of frequencies? For 
> example, should we decide to track turbo frequencies in Intel CPUs, how would that 
> be expressed via these events? Are there any architectures and CPUs that somehow 
> have some extra attribute to the frequency value?
I wonder whether this ever can/will work in a sane way.
Userspace can compare with:
/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq
Everything above that is turbo, so I do not think it's ever needed.
But adding an additional value at the end does not violate
userspace compatibility. This has been done with the cpuid
as well.
 
> > +TRACE_EVENT(machine_suspend,
> > +
> > +	TP_PROTO(unsigned int state),
> > +
> > +	TP_ARGS(state),
> > +
> > +	TP_STRUCT__entry(
> > +		__field(	u32,		state		)
> > +	),
> 
> Hm, this event is not used anywhere in the submitted patches. Where is the patch 
> that adds usage, and what are the possible values for 'state'?
Jean wants to make use of it on ARM.
I also had a patch for x86; I can have another look at it, Rafael
already gave me a comment on it. But on x86 you typically notice
when you suspend the machine (I could imagine this is more useful on
ARM-driven mobile phones and similar); still, I can add it.

Values probably should be (include/linux/suspend.h):
#define PM_SUSPEND_ON       0
#define PM_SUSPEND_STANDBY  1
#define PM_SUSPEND_MEM      3
#define PM_SUSPEND_MAX      4

How the strange state Arjan talked about gets passed is up
to these guys. Instead of reusing the predefined values 0 and above,
such arch-specific special states should better get passed as:

#define X86_MOORESTOWN_STANDBY_S0   0x100
..                                  0x101
#define ARM_WHATEVER_STRANGE_THING  0x200
...

     Thomas

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 10:37       ` Thomas Renninger
@ 2010-10-26 11:19         ` Ingo Molnar
  2010-10-26 11:19         ` Ingo Molnar
  1 sibling, 0 replies; 135+ messages in thread
From: Ingo Molnar @ 2010-10-26 11:19 UTC (permalink / raw)
  To: Thomas Renninger
  Cc: Arjan van de Ven, linux-trace-users, Frederic Weisbecker,
	Pierre Tardy, Jean Pihet, Steven Rostedt, Peter Zijlstra,
	Frank Eigler, Mathieu Desnoyers, linux-pm, Masami Hiramatsu,
	Tejun Heo, Andrew Morton, linux-omap, Linus Torvalds,
	Thomas Gleixner


* Thomas Renninger <trenn@suse.de> wrote:

> On Tuesday 26 October 2010 09:10:20 Ingo Molnar wrote:
> > 
> > * Thomas Renninger <trenn@suse.de> wrote:
> > 
> > > Changes in V2:
> > >   - Introduce PWR_EVENT_EXIT instead of 0 to mark non-power state
> > >   - Use u32 instead of u64 for cpuid, state which is by far enough
> > > 
> > > New power trace events:
> > > power:processor_idle
> > > power:processor_frequency
> > > power:machine_suspend
> > > 
> > > 
> > > C-state/idle accounting events:
> > >   power:power_start
> > >   power:power_end
> > > are replaced with:
> > >   power:processor_idle
> > > 
> > > and
> > >   power:power_frequency
> > > is replaced with:
> > >   power:processor_frequency
> > 
> > Could you please name it power:cpu_idle and power:cpu_frequency instead, for 
> > shortness? We generally use 'cpu' in the kernel and for events.
> Sure.
> > 
> > > power:machine_suspend
> > 
> > How will future PCI (or other device) power saving tracepoints be called?
> > 
> > Might be more consistent to use:
> > 
> >   power:cpu_idle
> >   power:machine_idle
> >   power:device_idle
>
> "device idle" is not accurate. Those may be low-power modes
> like reduced network throughput or reduced wlan range; the device
> need not be idle.
> Device power states are probably the most complex area. If such
> a thing gets introduced, it would make sense to declare
> the interface experimental for some time until a wider range of
> devices uses it (in contrast to these new ones,
> which should not change that soon anymore...).

Ok.

> Also machine_idle may be true, but machine_suspend sounds more
> familiar and everyone immediately knows what the event is about.

Ok - fair enough.

> > > +#define PWR_EVENT_EXIT 0xFFFFFFFF
> > 
> > Shouldnt this be part of the POWER_ enum? (and you can write -1 there)
>
> No, below enum will vanish, but -1 is nicer.

When it vanishes what will replace it?
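
On the -1 spelling mentioned above: for a 32-bit unsigned field both
spellings give the same bit pattern. A minimal sketch, assuming the
state field stays u32; is_exit_event() and example_exit_marker() are
made-up names for illustration, not part of the patch:

```c
#include <assert.h>
#include <stdint.h>

#define PWR_EVENT_EXIT 0xFFFFFFFF	/* as defined in the patch */

/* Hypothetical helper: does this state value mark "left the state"? */
static int is_exit_event(uint32_t state)
{
	/* -1 converted to u32 wraps to 0xFFFFFFFF */
	return state == (uint32_t)-1;
}

/* Usage example: both spellings are recognized, real states are not. */
static void example_exit_marker(void)
{
	assert(is_exit_event(PWR_EVENT_EXIT));
	assert(!is_exit_event(0));
}
```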

> ...
> 
> > Plus:
> > 
> > > +DECLARE_EVENT_CLASS(processor,
> > > +
> > > +	TP_PROTO(unsigned int state, unsigned int cpu_id),
> > > +
> > > +	TP_ARGS(state, cpu_id),
> > > +
> > > +	TP_STRUCT__entry(
> > > +		__field(	u32,		state		)
> > > +		__field(	u32,		cpu_id		)
> > 
> > Trace entries can carry a cpu_id of the current processor already. Can this cpu_id 
> > ever be different from that CPU id?
>
> Yes. A core's frequency can depend on another one which
> will get switched as well (by one command/MSR).
> Compare with commit 6f4f2723d08534fd4e407e1e.
> 
> This can theoretically also be the case for sleep states.
> Afaik such HW does not exist yet, but ACPI spec already
> provides interfaces to pass these dependency from BIOS to OS.
> -> We want a stable ABI and should be prepared for such stuff.
> 
> > > +	),
> > > +
> > > +	TP_fast_assign(
> > > +		__entry->state = state;
> > > +		__entry->cpu_id = cpu_id;
> > > +	),
> > > +
> > > +	TP_printk("state=%lu cpu_id=%lu", (unsigned long)__entry->state,
> > > +		  (unsigned long)__entry->cpu_id)
> > > +);
> > > +
> > > +DEFINE_EVENT(processor, processor_idle,
> > > +
> > > +	TP_PROTO(unsigned int state, unsigned int cpu_id),
> > > +
> > > +	     TP_ARGS(state, cpu_id)
> > > +);
> > > +
> > > +#define PWR_EVENT_EXIT 0xFFFFFFFF
> > > +
> > > +DEFINE_EVENT(processor, processor_frequency,
> > > +
> > > +	TP_PROTO(unsigned int frequency, unsigned int cpu_id),
> > > +
> > > +	TP_ARGS(frequency, cpu_id)
> > > +);
> > 
> > So, we have a 'state' field in the class, which is used as 'state' by the 
> > power::cpu_idle event, and as 'frequency' by the power::cpu_freq event?
>
> Yes, is this a problem?
>
> Definitions are a bit shorter having one power processor class.
> As "frequency" is stated in frequency event definition everything should
> be obvious and this one looks like the more elegant way to me.
>  
> > Are there any architectures that track frequency in Hz, not in kHz? If yes, might 
> > there ever be a need for the frequency value to be larger than 4.29 GHz? If yes, 
> > then it wont fit into u32.
>
> drivers/cpufreq subsystem is fixed to unsigned int (cmp. include/linux/cpufreq.h):
>         unsigned int            min;    /* in kHz */
>         unsigned int            max;    /* in kHz */
>         unsigned int            cur;    /* in kHz,
>         ...
> that should be fine.

ok, good - so we should be fine up to 4 THz CPUs.
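
For reference, the arithmetic behind that limit, as a small userspace
sketch. u32_max_ghz() and print_u32_freq_limits() are hypothetical
helpers written for this illustration only:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Largest frequency, in GHz, that fits into a u32 when one count is
 * worth unit_hz Hz (1 for Hz, 1000 for kHz).
 */
static double u32_max_ghz(uint32_t unit_hz)
{
	assert(unit_hz > 0);
	return (double)UINT32_MAX * (double)unit_hz / 1e9;
}

/* Print both limits side by side. */
static void print_u32_freq_limits(void)
{
	printf("u32 counting Hz:  max %.2f GHz\n", u32_max_ghz(1));
	printf("u32 counting kHz: max %.2f GHz\n", u32_max_ghz(1000));
}
```

Counting in Hz tops out at about 4.29 GHz, counting in kHz at about
4295 GHz, i.e. roughly 4.29 THz.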

> > Also, might there be a future need to express different types of frequencies? 
> > For example, should we decide to track turbo frequencies in Intel CPUs, how 
> > would that be expressed via these events? Are there any architectures and CPUs 
> > that somehow have some extra attribute to the frequency value?
>
> I wonder whether this ever can/will work in a sane way.
> Userspace can compare with:
> /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq
> everything above is turbo. So I do not think it's ever needed.
> But adding an additional value at the end does not violate
> userspace compatibility. This has been done with the cpuid
> as well.
>  
> > > +TRACE_EVENT(machine_suspend,
> > > +
> > > +	TP_PROTO(unsigned int state),
> > > +
> > > +	TP_ARGS(state),
> > > +
> > > +	TP_STRUCT__entry(
> > > +		__field(	u32,		state		)
> > > +	),
> > 
> > Hm, this event is not used anywhere in the submitted patches. Where is the patch 
> > that adds usage, and what are the possible values for 'state'?
>
> Jean wants to make use of it on ARM.
> I also had a patch for x86; I can have another look at it. Rafael
> already gave me a comment on it. But on X86 you typically realize
> when you suspend the machine (I could imagine this is more useful on
> ARM-driven mobile phones and similar). Still, I can add it.
> 
> Values probably should be (include/linux/suspend.h):
> #define PM_SUSPEND_ON       0
> #define PM_SUSPEND_STANDBY  1
> #define PM_SUSPEND_MEM      3
> #define PM_SUSPEND_MAX      4
> 
> How the strange state Arjan talked about gets passed is up
> to these guys. Instead of reusing the pre-defined values from 0 upwards,
> such arch-specific special states should rather be passed as:
> 
> #define X86_MOORESTOWN_STANDBY_S0   0x100
> ..                                  0x101
> #define ARM_WHATEVER_STRANGE_THING  0x200
> ...

I'd rather see these states given meaningful names, and those names being 
passed, instead of weird platform-specific things. Tooling will try to be as 
generic as possible.

I don't know this stuff, but making a distinction between s2ram and s2disk events 
would seem meaningful.
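
A rough sketch of what generic naming could look like on the tooling
side, assuming the state values quoted above from
include/linux/suspend.h. suspend_state_name() and its labels are made
up for illustration; s2disk is left out because no value for it appears
in this thread:

```c
#include <assert.h>
#include <string.h>

/* Values as quoted above from include/linux/suspend.h */
#define PM_SUSPEND_ON       0
#define PM_SUSPEND_STANDBY  1
#define PM_SUSPEND_MEM      3
#define PM_SUSPEND_MAX      4

/* Hypothetical helper: readable label for a machine_suspend state. */
static const char *suspend_state_name(unsigned int state)
{
	switch (state) {
	case PM_SUSPEND_ON:      return "on";
	case PM_SUSPEND_STANDBY: return "standby";
	case PM_SUSPEND_MEM:     return "mem";	/* suspend to RAM */
	default:                 return "unknown";
	}
}

/* Usage example: tools print names instead of raw numbers. */
static void example_suspend_names(void)
{
	assert(strcmp(suspend_state_name(PM_SUSPEND_MEM), "mem") == 0);
	assert(strcmp(suspend_state_name(0x100), "unknown") == 0);
}
```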

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26  8:08       ` Jean Pihet
  2010-10-26 11:21         ` Ingo Molnar
@ 2010-10-26 11:21         ` Ingo Molnar
  1 sibling, 0 replies; 135+ messages in thread
From: Ingo Molnar @ 2010-10-26 11:21 UTC (permalink / raw)
  To: Jean Pihet
  Cc: Arjan van de Ven, linux-trace-users, Frederic Weisbecker,
	Pierre Tardy, Steven Rostedt, Peter Zijlstra, Frank Eigler,
	Thomas Gleixner, linux-pm, Masami Hiramatsu, Tejun Heo,
	Andrew Morton, linux-omap, Linus Torvalds, Mathieu Desnoyers


* Jean Pihet <jean.pihet@newoldbits.com> wrote:

> Ingo,
> 
> On Tue, Oct 26, 2010 at 9:10 AM, Ingo Molnar <mingo@elte.hu> wrote:
> >
> > * Thomas Renninger <trenn@suse.de> wrote:
> >
> >> Changes in V2:
> >>   - Introduce PWR_EVENT_EXIT instead of 0 to mark non-power state
> >>   - Use u32 instead of u64 for cpuid, state which is by far enough
> 
> ...
> 
> >>
> >> +#define PWR_EVENT_EXIT 0xFFFFFFFF
> >
> > Shouldnt this be part of the POWER_ enum? (and you can write -1 there)
> >
> >> +#ifndef _TRACE_POWER_ENUM_
> >> +#define _TRACE_POWER_ENUM_
> >> +enum {
> >> +     POWER_NONE = 0,
> >> +     POWER_CSTATE = 1,
> >> +     POWER_PSTATE = 2,
> >> +};
> >> +#endif
> >
> > Since we are cleaning up all these events, those enum definitions dont really look
> > logical. For example, what is 'POWER_NONE'? Can a CPU have 'no power'?
>
> The enum belongs to the deprecated API so I would rather not touch it.
> Keeping the deprecated code isolated will make it easier to remove
> later.

So what will replace it? We still have a state field.

Passing in platform specific codes is a step backwards.

> >> +TRACE_EVENT(machine_suspend,
> >> +
> >> +     TP_PROTO(unsigned int state),
> >> +
> >> +     TP_ARGS(state),
> >> +
> >> +     TP_STRUCT__entry(
> >> +             __field(        u32,            state           )
> >> +     ),
> >
> > Hm, this event is not used anywhere in the submitted patches. Where is the patch
> > that adds usage, and what are the possible values for 'state'?
>
> This will come as a separate patch, which fits all platforms. Cf.
> http://marc.info/?l=linux-omap&m=128620575300682&w=2.
> The state field is of type suspend_state_t, cf. include/linux/suspend.h

Ok, that's at least generic. Needs the review of Rafael, to determine whether this 
state value is all we want to know when we enter suspend.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 11:21         ` Ingo Molnar
@ 2010-10-26 11:48           ` Thomas Renninger
  2010-10-26 11:48           ` Thomas Renninger
  1 sibling, 0 replies; 135+ messages in thread
From: Thomas Renninger @ 2010-10-26 11:48 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arjan van de Ven, linux-trace-users, Frederic Weisbecker,
	Pierre Tardy, Jean Pihet, Steven Rostedt, Peter Zijlstra,
	Frank Eigler, Mathieu Desnoyers, linux-pm, Masami Hiramatsu,
	Tejun Heo, Andrew Morton, linux-omap, Linus Torvalds,
	Thomas Gleixner

On Tuesday 26 October 2010 13:21:29 Ingo Molnar wrote:
> 
> * Jean Pihet <jean.pihet@newoldbits.com> wrote:
..
> > >> +#ifndef _TRACE_POWER_ENUM_
> > >> +#define _TRACE_POWER_ENUM_
> > >> +enum {
> > >> +     POWER_NONE = 0,
> > >> +     POWER_CSTATE = 1,
> > >> +     POWER_PSTATE = 2,
> > >> +};
> > >> +#endif
> > >
> > > Since we are cleaning up all these events, those enum definitions dont really look
> > > logical. For example, what is 'POWER_NONE'? Can a CPU have 'no power'?
> >
> > The enum belongs to the deprecated API so I would rather not touch it.
> > Keeping the deprecated code isolated will make it easier to remove
> > later.
> 
> So what will replace it? We still have a state field.
Nothing, this is part of the cleanup.
As you state above: POWER_NONE does not make sense at all.
The whole thing (type= attribute that vanishes now) is
passed to userspace, but never gets used there because the
same info is in the event name:
cpu_frequency <-> frequency_switch      <-> PSTATE
cpu_idle      <-> power_start/power_end <-> CSTATE 

I expect that there was an initial power_start/end which
was also used for frequency switching.
Then it was realized that _start/_end does not work out for that,
and frequency_switch got introduced.
To stay compatible, power_start/end was not renamed
to cpu_idle and the type= field was kept.

This is a guess without even looking at the git history.
Hence my partly harsh comments about the sanity of the
current power tracing events.

> Passing in platform specific codes is a step backwards.
> 
> > >> +TRACE_EVENT(machine_suspend,
> > >> +
> > >> +     TP_PROTO(unsigned int state),
> > >> +
> > >> +     TP_ARGS(state),
> > >> +
> > >> +     TP_STRUCT__entry(
> > >> +             __field(        u32,            state           )
> > >> +     ),
> > >
> > > Hm, this event is not used anywhere in the submitted patches. Where is the patch
> > > that adds usage, and what are the possible values for 'state'?
> >
> > This will come as a separate patch, which fits all platforms. Cf.
> > http://marc.info/?l=linux-omap&m=128620575300682&w=2.
> > The state field is of type suspend_state_t, cf. include/linux/suspend.h
> 
> Ok, that's at least generic. Needs the review of Rafael, to determine
> whether this state value is all we want to know when we enter suspend.
He already gave an acked-by on this generic one here:
Re: [PATCH 3/4] perf: add calls to suspend trace point
Oh no, that was on the X86-specific part, which depends on this one.
One should expect that he's fine with the generic part as well then,
but I agree that he should definitely have a look at this and sign it off.

So as they got signed off already, I'll send the X86 suspend events
on top, once I find these in a tree...

    Thomas

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 11:48           ` Thomas Renninger
@ 2010-10-26 11:54             ` Ingo Molnar
  2010-10-26 11:54             ` Ingo Molnar
                               ` (2 subsequent siblings)
  3 siblings, 0 replies; 135+ messages in thread
From: Ingo Molnar @ 2010-10-26 11:54 UTC (permalink / raw)
  To: Thomas Renninger
  Cc: Arjan van de Ven, linux-trace-users, Frederic Weisbecker,
	Pierre Tardy, Jean Pihet, Steven Rostedt, Peter Zijlstra,
	Frank Eigler, Mathieu Desnoyers, linux-pm, Masami Hiramatsu,
	Tejun Heo, Andrew Morton, linux-omap, Linus Torvalds,
	Thomas Gleixner


* Thomas Renninger <trenn@suse.de> wrote:

> On Tuesday 26 October 2010 13:21:29 Ingo Molnar wrote:
> > 
> > * Jean Pihet <jean.pihet@newoldbits.com> wrote:
> ..
> > > >> +#ifndef _TRACE_POWER_ENUM_
> > > >> +#define _TRACE_POWER_ENUM_
> > > >> +enum {
> > > >> +     POWER_NONE = 0,
> > > >> +     POWER_CSTATE = 1,
> > > >> +     POWER_PSTATE = 2,
> > > >> +};
> > > >> +#endif
> > > >
> > > > Since we are cleaning up all these events, those enum definitions dont really look
> > > > logical. For example, what is 'POWER_NONE'? Can a CPU have 'no power'?
> > >
> > > The enum belongs to the deprecated API so I would rather not touch it.
> > > Keeping the deprecated code isolated will make it easier to remove
> > > later.
> > 
> > So what will replace it? We still have a state field.
>
> Nothing, this is part of the cleanup.

I mean, what will go into the state field of the power:cpu_idle event?

> As you state above: POWER_NONE does not make sense at all.
> The whole thing (type= attribute that vanishes now) is
> passed to userspace, but never gets used there because the
> same info is in the event name:
> cpu_frequency <-> frequency_switch      <-> PSTATE
> cpu_idle      <-> power_start/power_end <-> CSTATE 

Ah, I see, so this 'state' enum went into the type field.

So my question is, and ignore this particular enum for now: what values go into the 
state field, which is still kept in the new events as well?

[ I'd like to avoid us having to define a third set of power events a few years down 
  the road ;-) ]

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 11:54             ` Ingo Molnar
  2010-10-26 13:17               ` Thomas Renninger
@ 2010-10-26 13:17               ` Thomas Renninger
  1 sibling, 0 replies; 135+ messages in thread
From: Thomas Renninger @ 2010-10-26 13:17 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arjan van de Ven, linux-trace-users, Frederic Weisbecker,
	Pierre Tardy, Jean Pihet, Steven Rostedt, Peter Zijlstra,
	Frank Eigler, Mathieu Desnoyers, linux-pm, Masami Hiramatsu,
	Tejun Heo, Andrew Morton, linux-omap, Linus Torvalds,
	Thomas Gleixner

On Tuesday 26 October 2010 13:54:22 Ingo Molnar wrote:
> 
> * Thomas Renninger <trenn@suse.de> wrote:
> 
..
> > As you state above: POWER_NONE does not make sense at all.
> > The whole thing (type= attribute that vanishes now) is
> > passed to userspace, but never gets used there because the
> > same info is in the event name:
> > cpu_frequency <-> frequency_switch      <-> PSTATE
> > cpu_idle      <-> power_start/power_end <-> CSTATE 
> 
> Ah, i see, so this 'state' enum went into the type field.
> 
> So my question is, and ignore this particular enum for now, what values go into the 
> state field, which field is still kept in the new events as well.
Same as before:
cpu_idle:
                trace_power_start(POWER_CSTATE, 1, smp_processor_id());
+               trace_processor_idle(1, smp_processor_id());
(Oops, found a copy-and-paste bug in my patch where I reverted
the poll_idle event, but it should be zero...).

The state in a cpu_idle event is identical to the state registered
with cpuidle. If cpuidle got registered, one should be able to calculate
from state=X (cpu_idle event) the same C-state residency time and usage
which you can grab via:
cat /sys/devices/system/cpu/cpu0/cpuidle/stateX/{time,usage}
The cpu_idle event additionally gives you the timestamps of the state
changes.
This is rather nice as userspace can grab additional info from
cpuidle sysfs layer like:
/sys/devices/system/cpu/cpu0/cpuidle/stateX/{desc,power,name}
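
The residency calculation mentioned above could be sketched like this
in userspace. This is only an illustration under assumptions: struct
idle_event, struct idle_stats, MAX_STATES and account_idle() are made
up here, events are taken to be already decoded, sorted by time and
from a single CPU, and timestamps are taken to be in microseconds:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PWR_EVENT_EXIT 0xFFFFFFFFu	/* "left the idle state" marker */
#define MAX_STATES     8		/* assumption: illustrative bound */

struct idle_event {			/* hypothetical decoded cpu_idle event */
	uint64_t ts_us;			/* timestamp, microseconds */
	uint32_t state;			/* cpuidle state or PWR_EVENT_EXIT */
};

struct idle_stats {
	uint64_t time_us[MAX_STATES];	/* accumulated residency per state */
	uint64_t usage[MAX_STATES];	/* number of entries per state */
};

/* Accumulate per-state residency and usage from an event sequence. */
static void account_idle(const struct idle_event *ev, size_t n,
			 struct idle_stats *st)
{
	uint32_t cur = PWR_EVENT_EXIT;	/* not idle before the first event */
	uint64_t entered_us = 0;
	size_t i;

	assert(ev != NULL && st != NULL);
	memset(st, 0, sizeof(*st));

	for (i = 0; i < n; i++) {
		/* any new event closes the interval of the current state */
		if (cur < MAX_STATES)
			st->time_us[cur] += ev[i].ts_us - entered_us;
		cur = ev[i].state;
		entered_us = ev[i].ts_us;
		/* PWR_EVENT_EXIT is >= MAX_STATES, so it is never counted */
		if (cur < MAX_STATES)
			st->usage[cur]++;
	}
}
```

A state that is still open at the end of the trace is not accounted,
which is one reason the totals may differ slightly from sysfs.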

If cpuidle is not registered, the events you get are arch-specific.
I mean, they are arch-specific anyway, but with cpuidle you can
nicely build an arch-independent userspace framework by looking
up name/desc/power/... of a cpu_idle event in cpuidle sysfs as
described above.
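
Such a lookup could be sketched as follows. read_cpuidle_attr() is a
hypothetical helper, not an existing API; the base directory is a
parameter (normally /sys/devices/system/cpu) so the function can be
tried against a test directory as well:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Read one cpuidle attribute (e.g. "name", "desc", "power") of the
 * given CPU and state into buf. Returns 0 on success, -1 on failure.
 */
static int read_cpuidle_attr(const char *base, unsigned int cpu,
			     unsigned int state, const char *attr,
			     char *buf, size_t len)
{
	char path[256];
	FILE *f;

	assert(base != NULL && attr != NULL && buf != NULL && len > 0);

	snprintf(path, sizeof(path), "%s/cpu%u/cpuidle/state%u/%s",
		 base, cpu, state, attr);

	f = fopen(path, "r");
	if (!f)
		return -1;		/* cpuidle not registered, or no such state */

	if (!fgets(buf, (int)len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	buf[strcspn(buf, "\n")] = '\0';	/* strip the trailing newline */
	return 0;
}
```

A tool would call this with the state=X value taken from a cpu_idle
event, and fall back to printing the raw number if the lookup fails.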

   Thomas

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 11:54             ` Ingo Molnar
@ 2010-10-26 13:17               ` Thomas Renninger
  2010-10-26 13:35                 ` Thomas Renninger
  2010-10-26 13:35                 ` Thomas Renninger
  2010-10-26 13:17               ` Thomas Renninger
  1 sibling, 2 replies; 135+ messages in thread
From: Thomas Renninger @ 2010-10-26 13:17 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jean Pihet, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Masami Hiramatsu, Frank Eigler, Steven Rostedt, Kevin Hilman,
	Peter Zijlstra, linux-omap, rjw, linux-pm, linux-trace-users,
	Pierre Tardy, Frederic Weisbecker, Tejun Heo, Mathieu Desnoyers,
	Arjan van de Ven

On Tuesday 26 October 2010 13:54:22 Ingo Molnar wrote:
> 
> * Thomas Renninger <trenn@suse.de> wrote:
> 
..
> As you state above: POWER_NONE does not make sense at all.
> > The whole thing (type= attribute that vanishes now) is
> > passed to userspace, but never gets used there because the
> > same info is in the event name:
> > cpu_frequency <-> frequency_switch      <-> PSTATE
> > cpu_idle      <-> power_start/power_end <-> CSTATE 
> 
> Ah, i see, so this 'state' enum went into the type field.
> 
> So my question is, and ignore this particular enum for now, what values go into the 
> state field, which field is still kept in the new events as well.
Same as before:
cpu_idle:
                trace_power_start(POWER_CSTATE, 1, smp_processor_id());
+               trace_processor_idle(1, smp_processor_id());
(Oops, I found a copy-and-paste bug in my patch where I reverted
the poll_idle event, but it should be zero...).

The state in cpu_idle is identical to the state registered with
cpuidle. If cpuidle is registered, one should be able to calculate
from state=X (cpu_idle event) the same C-state residency time and
usage which you can grab via:
cat /sys/devices/system/cpu/cpu0/cpuidle/stateX/{time,usage}
The cpu_idle event additionally gives you the timestamps of the state
changes.
This is rather nice as userspace can grab additional info from
cpuidle sysfs layer like:
/sys/devices/system/cpu/cpu0/cpuidle/stateX/{desc,power,name}

If cpuidle is not registered, the events you get are arch specific.
I mean they are arch specific anyway, but with cpuidle you can
build up an arch independent userspace framework nicely by looking
up name/desc/power/... of a cpu_idle event in cpuidle sysfs as
described above.
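
A minimal user-space sketch of that lookup, assuming the sysfs layout
quoted above. The helper names cpuidle_attr_path() and
cpuidle_read_attr() are made up for illustration and are not part of
any existing tool:

```c
#include <stdio.h>
#include <string.h>

/* Build ".../cpuX/cpuidle/stateY/<attr>"; returns 0, or -1 if buf is too small. */
static int cpuidle_attr_path(char *buf, size_t len, int cpu, int state,
			     const char *attr)
{
	int n = snprintf(buf, len,
			 "/sys/devices/system/cpu/cpu%d/cpuidle/state%d/%s",
			 cpu, state, attr);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Read the first line of one attribute into out; returns 0 on success, -1 on error. */
static int cpuidle_read_attr(int cpu, int state, const char *attr,
			     char *out, size_t len)
{
	char path[128];
	FILE *f;

	if (len == 0 || cpuidle_attr_path(path, sizeof(path), cpu, state, attr))
		return -1;
	f = fopen(path, "r");
	if (!f)
		return -1;	/* cpuidle not registered, or no such state */
	if (!fgets(out, (int)len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	out[strcspn(out, "\n")] = '\0';	/* strip the trailing newline */
	return 0;
}
```

A trace consumer could call cpuidle_read_attr(cpu, X, "name", ...) once
per state=X value it sees and cache the result, so the event itself
only needs to carry the state number.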

   Thomas

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 13:17               ` Thomas Renninger
@ 2010-10-26 13:35                 ` Thomas Renninger
  2010-10-26 13:35                 ` Thomas Renninger
  1 sibling, 0 replies; 135+ messages in thread
From: Thomas Renninger @ 2010-10-26 13:35 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arjan van de Ven, linux-trace-users, Frederic Weisbecker,
	Pierre Tardy, Jean Pihet, Steven Rostedt, Peter Zijlstra,
	Frank Eigler, Mathieu Desnoyers, linux-pm, Masami Hiramatsu,
	Tejun Heo, Andrew Morton, linux-omap, Linus Torvalds,
	Thomas Gleixner

On Tuesday 26 October 2010 15:17:43 Thomas Renninger wrote:
> On Tuesday 26 October 2010 13:54:22 Ingo Molnar wrote:
> > 
> > * Thomas Renninger <trenn@suse.de> wrote:
...
> If cpuidle is not registered, the events you get are arch specific.
> I mean they are arch specific anyway, but with cpuidle you can
> build up an arch independent userspace framework nicely by looking
> up name/desc/power/... of an cpu_idle event in cpuidle sysfs as
> described above.
About cpuidle and cpu_idle events:
There is some oddness:
arch specific code which registers with cpuidle has to
throw the cpu_idle "enter sleep state X" event,
while the generic cpuidle framework triggers the "exit" event.

So, as there are cpu_idle events only in drivers/idle/intel_idle.c,
but not in drivers/acpi/processor_idle.c, I expect that the processor.ko
idle driver is broken and only exit states are sent.
Ideally all cpuidle events should be thrown in cpuidle.c like this:

        trace_processor_idle(target_state, smp_processor_id());
        dev->last_residency = target_state->enter(dev, target_state);
        trace_processor_idle(PWR_EVENT_EXIT, smp_processor_id());

My patches do not touch this behavior. If it is broken, it was broken before.
I'll look at it separately.
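
To illustrate why emitting both events from the generic layer keeps
them paired, here is a small user-space mock, not kernel code. The
event log, struct idle_state, demo_enter() and the value chosen for
PWR_EVENT_EXIT are all assumptions made for this sketch:

```c
#include <stddef.h>

#define PWR_EVENT_EXIT (-1)	/* assumed value, for illustration only */
#define MAX_EVENTS 16

struct idle_state {
	int index;				/* state number as registered */
	int (*enter)(struct idle_state *s);	/* returns residency */
};

/* Stand-in for the trace buffer: records the state of each event. */
static int event_log[MAX_EVENTS];
static size_t event_count;

static void trace_idle_event(int state)
{
	if (event_count < MAX_EVENTS)
		event_log[event_count++] = state;
}

/* Generic wrapper: the only place that emits enter and exit events. */
static int idle_enter_traced(struct idle_state *target)
{
	int residency;

	trace_idle_event(target->index);
	residency = target->enter(target);
	trace_idle_event(PWR_EVENT_EXIT);
	return residency;
}

/* Hypothetical driver callback; note that it emits no events itself. */
static int demo_enter(struct idle_state *s)
{
	(void)s;
	return 42;	/* pretend residency */
}
```

With this shape a driver cannot forget the enter event, because it
never emits one; every enter is followed by exactly one exit.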

      Thomas

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 13:17               ` Thomas Renninger
  2010-10-26 13:35                 ` Thomas Renninger
@ 2010-10-26 13:35                 ` Thomas Renninger
  1 sibling, 0 replies; 135+ messages in thread
From: Thomas Renninger @ 2010-10-26 13:35 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jean Pihet, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Masami Hiramatsu, Frank Eigler, Steven Rostedt, Kevin Hilman,
	Peter Zijlstra, linux-omap, rjw, linux-pm, linux-trace-users,
	Pierre Tardy, Frederic Weisbecker, Tejun Heo, Mathieu Desnoyers,
	Arjan van de Ven

On Tuesday 26 October 2010 15:17:43 Thomas Renninger wrote:
> On Tuesday 26 October 2010 13:54:22 Ingo Molnar wrote:
> > 
> > * Thomas Renninger <trenn@suse.de> wrote:
...
> If cpuidle is not registered, the events you get are arch specific.
> I mean they are arch specific anyway, but with cpuidle you can
> build up an arch independent userspace framework nicely by looking
> up name/desc/power/... of an cpu_idle event in cpuidle sysfs as
> described above.
About cpuidle and cpu_idle events:
There is some oddness:
arch specific code which registers with cpuidle has to
throw the cpu_idle "enter sleep state X" event,
while the generic cpuidle framework triggers the "exit" event.

So, as there are cpu_idle events only in drivers/idle/intel_idle.c,
but not in drivers/acpi/processor_idle.c, I expect that the processor.ko
idle driver is broken and only exit states are sent.
Ideally all cpuidle events should be thrown in cpuidle.c like this:

        trace_processor_idle(target_state, smp_processor_id());
        dev->last_residency = target_state->enter(dev, target_state);
        trace_processor_idle(PWR_EVENT_EXIT, smp_processor_id());

My patches do not touch this behavior. If it is broken, it was broken before.
I'll look at it separately.

      Thomas

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26  7:10     ` Ingo Molnar
                         ` (6 preceding siblings ...)
  2010-10-26 15:32       ` Pierre Tardy
@ 2010-10-26 15:32       ` Pierre Tardy
  7 siblings, 0 replies; 135+ messages in thread
From: Pierre Tardy @ 2010-10-26 15:32 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arjan van de Ven, linux-trace-users, Frederic Weisbecker,
	Jean Pihet, Steven Rostedt, Peter Zijlstra, Frank Eigler,
	Thomas Gleixner, linux-pm, Masami Hiramatsu, Tejun Heo,
	Andrew Morton, linux-omap, Linus Torvalds, Mathieu Desnoyers

On Tue, Oct 26, 2010 at 2:10 AM, Ingo Molnar <mingo@elte.hu> wrote:

> How will future PCI (or other device) power saving tracepoints be called?
>
> Might be more consistent to use:
>
>  power:cpu_idle
>  power:machine_idle
>  power:device_idle
Agree with this.

FYI, I have a runtime_pm tracepoint currently cooking. Here is a
preliminary patch.
Can this be a candidate for a "power:device_idle" tracepoint?

Regards,
Pierre
----
From 3d5e03405f590d470ecfa59c8b9759915bf29307 Mon Sep 17 00:00:00 2001
From: Pierre Tardy <pierre.tardy@intel.com>
Date: Fri, 22 Oct 2010 03:07:07 -0500
Subject: [PATCH] trace/runtime_pm: add runtime_pm trace event

Based on the recent hook from Arjan for powertop statistics,
add a tracepoint so that pytimechart can display
the runtime_pm activity over time, and versus other events.

Signed-off-by: Pierre Tardy <pierre.tardy@intel.com>
---
 drivers/base/power/runtime.c |    3 ++-
 include/trace/events/power.h |   35 +++++++++++++++++++++++++++++++++++
 2 files changed, 37 insertions(+), 1 deletions(-)

diff --git a/drivers/base/power/runtime.c b/drivers/base/power/runtime.c
index b78c401..0f38447 100644
--- a/drivers/base/power/runtime.c
+++ b/drivers/base/power/runtime.c
@@ -9,6 +9,7 @@
 #include <linux/sched.h>
 #include <linux/pm_runtime.h>
 #include <linux/jiffies.h>
+#include <trace/events/power.h>

 static int __pm_runtime_resume(struct device *dev, bool from_wq);
 static int __pm_request_idle(struct device *dev);
@@ -159,9 +160,9 @@ void update_pm_runtime_accounting(struct device *dev)
 static void __update_runtime_status(struct device *dev, enum rpm_status status)
 {
 	update_pm_runtime_accounting(dev);
+	trace_runtime_pm_status(dev, status);
 	dev->power.runtime_status = status;
 }
-
 /**
  * __pm_runtime_suspend - Carry out run-time suspend of given device.
  * @dev: Device to suspend.
diff --git a/include/trace/events/power.h b/include/trace/events/power.h
index 286784d..dd57c29 100644
--- a/include/trace/events/power.h
+++ b/include/trace/events/power.h
@@ -6,6 +6,7 @@

 #include <linux/ktime.h>
 #include <linux/tracepoint.h>
+#include <linux/device.h>

 #ifndef _TRACE_POWER_ENUM_
 #define _TRACE_POWER_ENUM_
@@ -75,6 +76,40 @@ TRACE_EVENT(power_end,

 );

+#ifdef CONFIG_PM_RUNTIME
+#define rpm_status_name(status) { RPM_##status, #status }
+#define show_rpm_status_name(val)				\
+	__print_symbolic(val,					\
+		rpm_status_name(SUSPENDED),			\
+		rpm_status_name(SUSPENDING),			\
+		rpm_status_name(RESUMING),			\
+		rpm_status_name(ACTIVE)		                \
+		)
+TRACE_EVENT(runtime_pm_status,
+
+	TP_PROTO(struct device *dev, int new_status),
+
+	TP_ARGS(dev, new_status),
+
+	TP_STRUCT__entry(
+		__string(devname,dev_name(dev))
+		__string(drivername,dev_driver_string(dev))
+		__field(u32, prev_status)
+		__field(u32, status)
+	),
+
+	TP_fast_assign(
+		__assign_str(devname, dev_name(dev));
+		__assign_str(drivername, dev_driver_string(dev));
+		__entry->prev_status = (u32)dev->power.runtime_status;
+		__entry->status = (u32)new_status;
+	),
+
+	TP_printk("driver=%s dev=%s prev_status=%s status=%s", __get_str(drivername),__get_str(devname),
+		  show_rpm_status_name(__entry->prev_status), show_rpm_status_name(__entry->status))
+);
+#endif /* CONFIG_PM_RUNTIME */
+
 /*
  * The clock events are used for clock enable/disable and for
  *  clock rate change
-- 
1.7.2.3

^ permalink raw reply related	[flat|nested] 135+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26  7:10     ` Ingo Molnar
                         ` (5 preceding siblings ...)
  2010-10-26 10:37       ` Thomas Renninger
@ 2010-10-26 15:32       ` Pierre Tardy
  2010-10-26 16:04         ` Arjan van de Ven
  2010-10-26 16:04         ` Arjan van de Ven
  2010-10-26 15:32       ` Pierre Tardy
  7 siblings, 2 replies; 135+ messages in thread
From: Pierre Tardy @ 2010-10-26 15:32 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thomas Renninger, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Masami Hiramatsu, Frank Eigler, Steven Rostedt, Kevin Hilman,
	Peter Zijlstra, linux-omap, rjw, linux-pm, linux-trace-users,
	Jean Pihet, Frederic Weisbecker, Tejun Heo, Mathieu Desnoyers,
	Arjan van de Ven

On Tue, Oct 26, 2010 at 2:10 AM, Ingo Molnar <mingo@elte.hu> wrote:

> How will future PCI (or other device) power saving tracepoints be called?
>
> Might be more consistent to use:
>
>  power:cpu_idle
>  power:machine_idle
>  power:device_idle
Agree with this.

FYI, I have a runtime_pm tracepoint currently cooking. Here is a
preliminary patch.
Can this be a candidate for a "power:device_idle" tracepoint?

Regards,
Pierre
----
From 3d5e03405f590d470ecfa59c8b9759915bf29307 Mon Sep 17 00:00:00 2001
From: Pierre Tardy <pierre.tardy@intel.com>
Date: Fri, 22 Oct 2010 03:07:07 -0500
Subject: [PATCH] trace/runtime_pm: add runtime_pm trace event

Based on the recent hook from Arjan for powertop statistics,
add a tracepoint so that pytimechart can display
the runtime_pm activity over time, and versus other events.

Signed-off-by: Pierre Tardy <pierre.tardy@intel.com>
---
 drivers/base/power/runtime.c |    3 ++-
 include/trace/events/power.h |   35 +++++++++++++++++++++++++++++++++++
 2 files changed, 37 insertions(+), 1 deletions(-)

diff --git a/drivers/base/power/runtime.c b/drivers/base/power/runtime.c
index b78c401..0f38447 100644
--- a/drivers/base/power/runtime.c
+++ b/drivers/base/power/runtime.c
@@ -9,6 +9,7 @@
 #include <linux/sched.h>
 #include <linux/pm_runtime.h>
 #include <linux/jiffies.h>
+#include <trace/events/power.h>

 static int __pm_runtime_resume(struct device *dev, bool from_wq);
 static int __pm_request_idle(struct device *dev);
@@ -159,9 +160,9 @@ void update_pm_runtime_accounting(struct device *dev)
 static void __update_runtime_status(struct device *dev, enum rpm_status status)
 {
 	update_pm_runtime_accounting(dev);
+	trace_runtime_pm_status(dev, status);
 	dev->power.runtime_status = status;
 }
-
 /**
  * __pm_runtime_suspend - Carry out run-time suspend of given device.
  * @dev: Device to suspend.
diff --git a/include/trace/events/power.h b/include/trace/events/power.h
index 286784d..dd57c29 100644
--- a/include/trace/events/power.h
+++ b/include/trace/events/power.h
@@ -6,6 +6,7 @@

 #include <linux/ktime.h>
 #include <linux/tracepoint.h>
+#include <linux/device.h>

 #ifndef _TRACE_POWER_ENUM_
 #define _TRACE_POWER_ENUM_
@@ -75,6 +76,40 @@ TRACE_EVENT(power_end,

 );

+#ifdef CONFIG_PM_RUNTIME
+#define rpm_status_name(status) { RPM_##status, #status }
+#define show_rpm_status_name(val)				\
+	__print_symbolic(val,					\
+		rpm_status_name(SUSPENDED),			\
+		rpm_status_name(SUSPENDING),			\
+		rpm_status_name(RESUMING),			\
+		rpm_status_name(ACTIVE)		                \
+		)
+TRACE_EVENT(runtime_pm_status,
+
+	TP_PROTO(struct device *dev, int new_status),
+
+	TP_ARGS(dev, new_status),
+
+	TP_STRUCT__entry(
+		__string(devname,dev_name(dev))
+		__string(drivername,dev_driver_string(dev))
+		__field(u32, prev_status)
+		__field(u32, status)
+	),
+
+	TP_fast_assign(
+		__assign_str(devname, dev_name(dev));
+		__assign_str(drivername, dev_driver_string(dev));
+		__entry->prev_status = (u32)dev->power.runtime_status;
+		__entry->status = (u32)new_status;
+	),
+
+	TP_printk("driver=%s dev=%s prev_status=%s status=%s", __get_str(drivername),__get_str(devname),
+		  show_rpm_status_name(__entry->prev_status), show_rpm_status_name(__entry->status))
+);
+#endif /* CONFIG_PM_RUNTIME */
+
 /*
  * The clock events are used for clock enable/disable and for
  *  clock rate change
-- 
1.7.2.3

^ permalink raw reply related	[flat|nested] 135+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 15:32       ` Pierre Tardy
@ 2010-10-26 16:04         ` Arjan van de Ven
  2010-10-26 16:04         ` Arjan van de Ven
  1 sibling, 0 replies; 135+ messages in thread
From: Arjan van de Ven @ 2010-10-26 16:04 UTC (permalink / raw)
  To: Pierre Tardy
  Cc: Andrew Morton, linux-trace-users, Frederic Weisbecker,
	Jean Pihet, Steven Rostedt, Peter Zijlstra, Frank Eigler,
	Mathieu Desnoyers, linux-pm, Masami Hiramatsu, Tejun Heo,
	Thomas Gleixner, linux-omap, Linus Torvalds, Ingo Molnar

On 10/26/2010 8:32 AM, Pierre Tardy wrote:
> On Tue, Oct 26, 2010 at 2:10 AM, Ingo Molnar<mingo@elte.hu>  wrote:
>
>> How will future PCI (or other device) power saving tracepoints be called?
>>
>> Might be more consistent to use:
>>
>>   power:cpu_idle
>>   power:machine_idle
>>   power:device_idle
> Agree with this.
>
> FYI, I have a runtime_pm tracepoint currently cooking. Here is
> preliminary patch.
> Can this be a candidate for a "power:device_idle" tracepoint?


I would like to see a slightly more advanced tracepoint for the runtime
pm framework;
specifically I'd like to see the "comm" of the process that's taking a 
refcount on a device
(that way, powertop can track which process keeps a device busy)

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 15:32       ` Pierre Tardy
  2010-10-26 16:04         ` Arjan van de Ven
@ 2010-10-26 16:04         ` Arjan van de Ven
  2010-10-26 16:56           ` Pierre Tardy
  2010-10-26 16:56           ` Pierre Tardy
  1 sibling, 2 replies; 135+ messages in thread
From: Arjan van de Ven @ 2010-10-26 16:04 UTC (permalink / raw)
  To: Pierre Tardy
  Cc: Ingo Molnar, Thomas Renninger, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Masami Hiramatsu, Frank Eigler, Steven Rostedt,
	Kevin Hilman, Peter Zijlstra, linux-omap, rjw, linux-pm,
	linux-trace-users, Jean Pihet, Frederic Weisbecker, Tejun Heo,
	Mathieu Desnoyers

On 10/26/2010 8:32 AM, Pierre Tardy wrote:
> On Tue, Oct 26, 2010 at 2:10 AM, Ingo Molnar<mingo@elte.hu>  wrote:
>
>> How will future PCI (or other device) power saving tracepoints be called?
>>
>> Might be more consistent to use:
>>
>>   power:cpu_idle
>>   power:machine_idle
>>   power:device_idle
> Agree with this.
>
> FYI, I have a runtime_pm tracepoint currently cooking. Here is
> preliminary patch.
> Can this be a candidate for a "power:device_idle" tracepoint?


I would like to see a slightly more advanced tracepoint for the runtime
pm framework;
specifically I'd like to see the "comm" of the process that's taking a 
refcount on a device
(that way, powertop can track which process keeps a device busy)


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 16:04         ` Arjan van de Ven
  2010-10-26 16:56           ` Pierre Tardy
@ 2010-10-26 16:56           ` Pierre Tardy
  1 sibling, 0 replies; 135+ messages in thread
From: Pierre Tardy @ 2010-10-26 16:56 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Andrew Morton, linux-trace-users, Frederic Weisbecker,
	Jean Pihet, Steven Rostedt, Peter Zijlstra, Frank Eigler,
	Mathieu Desnoyers, linux-pm, Masami Hiramatsu, Tejun Heo,
	Thomas Gleixner, linux-omap, Linus Torvalds, Ingo Molnar

> I would like to see a slightly more advanced tracepoint do the runtime pm
> framework;
> specifically I'd like to see the "comm" of the process that's taking a
> refcount on a device
> (that way, powertop can track which process keeps a device busy)
>
>
Yes, the "comm" for this tracepoint should be the runtime_pm workqueue.
To track responsibilities, I'm making another tracepoint, that traces
the rpm_get/put.

diff --git a/drivers/base/power/runtime.c b/drivers/base/power/runtime.c
index 0f38447..54d9911 100644
--- a/drivers/base/power/runtime.c
+++ b/drivers/base/power/runtime.c
@@ -792,7 +792,7 @@ EXPORT_SYMBOL_GPL(pm_request_resume);
 int __pm_runtime_get(struct device *dev, bool sync)
 {
        int retval;

+       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
        atomic_inc(&dev->power.usage_count);
        retval = sync ? pm_runtime_resume(dev) : pm_request_resume(dev);

@@ -813,6 +813,7 @@ int __pm_runtime_put(struct device *dev, bool sync)
 {
        int retval = 0;

+       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)-1);
        if (atomic_dec_and_test(&dev->power.usage_count))
                retval = sync ? pm_runtime_idle(dev) : pm_request_idle(dev);

@@ -1065,6 +1066,7 @@ void pm_runtime_forbid(struct device *dev)
                goto out;

        dev->power.runtime_auto = false;
+       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
        atomic_inc(&dev->power.usage_count);
        __pm_runtime_resume(dev, false);

@@ -1086,6 +1088,7 @@ void pm_runtime_allow(struct device *dev)
                goto out;

        dev->power.runtime_auto = true;
+       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)-1);
        if (atomic_dec_and_test(&dev->power.usage_count))
                __pm_runtime_idle(dev);
diff --git a/include/trace/events/power.h b/include/trace/events/power.h
index ea514eb..813229c 100644
--- a/include/trace/events/power.h
+++ b/include/trace/events/power.h
@@ -100,6 +100,29 @@ TRACE_EVENT(runtime_pm_status,
        TP_printk("driver=%s dev=%s prev_status=%s status=%s", __get_str(drivername),__get_str(devname),
                  show_rpm_status_name(__entry->prev_status), show_rpm_status_name(__entry->status))
 );
+TRACE_EVENT(runtime_pm_usage,
+
+       TP_PROTO(struct device *dev, int new_usage),
+
+       TP_ARGS(dev, new_usage),
+
+       TP_STRUCT__entry(
+               __string(devname,dev_name(dev))
+               __string(drivername,dev_driver_string(dev))
+               __field(u32, prev_usage)
+               __field(u32, usage)
+       ),
+
+       TP_fast_assign(
+               __assign_str(devname, dev_name(dev));
+               __assign_str(drivername, dev_driver_string(dev));
+               __entry->prev_usage = (u32)atomic_read(&dev->power.usage_count);
+               __entry->usage = (u32)new_usage;
+       ),
+
+       TP_printk("driver=%s dev=%s prev_usage=%d usage=%d", __get_str(drivername),__get_str(devname),
+                 __entry->prev_usage, __entry->usage)
+);
-- 
Pierre

^ permalink raw reply related	[flat|nested] 135+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 16:04         ` Arjan van de Ven
@ 2010-10-26 16:56           ` Pierre Tardy
  2010-10-26 17:58             ` Peter Zijlstra
  2010-10-26 17:58             ` Peter Zijlstra
  2010-10-26 16:56           ` Pierre Tardy
  1 sibling, 2 replies; 135+ messages in thread
From: Pierre Tardy @ 2010-10-26 16:56 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Ingo Molnar, Thomas Renninger, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Masami Hiramatsu, Frank Eigler, Steven Rostedt,
	Kevin Hilman, Peter Zijlstra, linux-omap, rjw, linux-pm,
	linux-trace-users, Jean Pihet, Frederic Weisbecker, Tejun Heo,
	Mathieu Desnoyers

> I would like to see a slightly more advanced tracepoint do the runtime pm
> framework;
> specifically I'd like to see the "comm" of the process that's taking a
> refcount on a device
> (that way, powertop can track which process keeps a device busy)
>
>
Yes, the "comm" for this tracepoint should be the runtime_pm workqueue.
To track responsibilities, I'm making another tracepoint, that traces
the rpm_get/put.

diff --git a/drivers/base/power/runtime.c b/drivers/base/power/runtime.c
index 0f38447..54d9911 100644
--- a/drivers/base/power/runtime.c
+++ b/drivers/base/power/runtime.c
@@ -792,7 +792,7 @@ EXPORT_SYMBOL_GPL(pm_request_resume);
 int __pm_runtime_get(struct device *dev, bool sync)
 {
        int retval;

+       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
        atomic_inc(&dev->power.usage_count);
        retval = sync ? pm_runtime_resume(dev) : pm_request_resume(dev);

@@ -813,6 +813,7 @@ int __pm_runtime_put(struct device *dev, bool sync)
 {
        int retval = 0;

+       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)-1);
        if (atomic_dec_and_test(&dev->power.usage_count))
                retval = sync ? pm_runtime_idle(dev) : pm_request_idle(dev);

@@ -1065,6 +1066,7 @@ void pm_runtime_forbid(struct device *dev)
                goto out;

        dev->power.runtime_auto = false;
+       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
        atomic_inc(&dev->power.usage_count);
        __pm_runtime_resume(dev, false);

@@ -1086,6 +1088,7 @@ void pm_runtime_allow(struct device *dev)
                goto out;

        dev->power.runtime_auto = true;
+       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)-1);
        if (atomic_dec_and_test(&dev->power.usage_count))
                __pm_runtime_idle(dev);
diff --git a/include/trace/events/power.h b/include/trace/events/power.h
index ea514eb..813229c 100644
--- a/include/trace/events/power.h
+++ b/include/trace/events/power.h
@@ -100,6 +100,29 @@ TRACE_EVENT(runtime_pm_status,
        TP_printk("driver=%s dev=%s prev_status=%s status=%s", __get_str(drivername),__get_str(devname),
                  show_rpm_status_name(__entry->prev_status), show_rpm_status_name(__entry->status))
 );
+TRACE_EVENT(runtime_pm_usage,
+
+       TP_PROTO(struct device *dev, int new_usage),
+
+       TP_ARGS(dev, new_usage),
+
+       TP_STRUCT__entry(
+               __string(devname,dev_name(dev))
+               __string(drivername,dev_driver_string(dev))
+               __field(u32, prev_usage)
+               __field(u32, usage)
+       ),
+
+       TP_fast_assign(
+               __assign_str(devname, dev_name(dev));
+               __assign_str(drivername, dev_driver_string(dev));
+               __entry->prev_usage = (u32)atomic_read(&dev->power.usage_count);
+               __entry->usage = (u32)new_usage;
+       ),
+
+       TP_printk("driver=%s dev=%s prev_usage=%d usage=%d", __get_str(drivername),__get_str(devname),
+                 __entry->prev_usage, __entry->usage)
+);
-- 
Pierre

^ permalink raw reply related	[flat|nested] 135+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 16:56           ` Pierre Tardy
  2010-10-26 17:58             ` Peter Zijlstra
@ 2010-10-26 17:58             ` Peter Zijlstra
  1 sibling, 0 replies; 135+ messages in thread
From: Peter Zijlstra @ 2010-10-26 17:58 UTC (permalink / raw)
  To: Pierre Tardy
  Cc: Frederic Weisbecker, Andrew Morton, Arjan van de Ven, linux-pm,
	Jean Pihet, Steven Rostedt, linux-trace-users, Frank Eigler,
	Mathieu Desnoyers, Masami Hiramatsu, Tejun Heo,
	Thomas Gleixner, linux-omap, Linus Torvalds, Ingo Molnar

On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
> 
> +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
>         atomic_inc(&dev->power.usage_count); 

That's terribly racy..

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 16:56           ` Pierre Tardy
@ 2010-10-26 17:58             ` Peter Zijlstra
  2010-10-26 18:14               ` Mathieu Desnoyers
                                 ` (3 more replies)
  2010-10-26 17:58             ` Peter Zijlstra
  1 sibling, 4 replies; 135+ messages in thread
From: Peter Zijlstra @ 2010-10-26 17:58 UTC (permalink / raw)
  To: Pierre Tardy
  Cc: Arjan van de Ven, Ingo Molnar, Thomas Renninger, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Masami Hiramatsu, Frank Eigler,
	Steven Rostedt, Kevin Hilman, linux-omap, rjw, linux-pm,
	linux-trace-users, Jean Pihet, Frederic Weisbecker, Tejun Heo,
	Mathieu Desnoyers

On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
> 
> +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
>         atomic_inc(&dev->power.usage_count); 

That's terribly racy..

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 17:58             ` Peter Zijlstra
@ 2010-10-26 18:14               ` Mathieu Desnoyers
  2010-10-26 18:14               ` Mathieu Desnoyers
                                 ` (2 subsequent siblings)
  3 siblings, 0 replies; 135+ messages in thread
From: Mathieu Desnoyers @ 2010-10-26 18:14 UTC (permalink / raw)
  To: Peter Zijlstra, Greg Kroah-Hartman
  Cc: Andrew Morton, Pierre Tardy, Arjan van de Ven,
	Frederic Weisbecker, Jean Pihet, Steven Rostedt,
	linux-trace-users, Frank Eigler, Thomas Gleixner, linux-pm,
	Masami Hiramatsu, Tejun Heo, Ingo Molnar, linux-omap,
	Linus Torvalds

* Peter Zijlstra (peterz@infradead.org) wrote:
> On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
> > 
> > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> >         atomic_inc(&dev->power.usage_count); 
> 
> That's terribly racy..

Looking at the original code, it looks racy even without considering the
tracepoint:

int __pm_runtime_get(struct device *dev, bool sync)
 {
        int retval;

+       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
        atomic_inc(&dev->power.usage_count);
        retval = sync ? pm_runtime_resume(dev) : pm_request_resume(dev);

There is no implied memory barrier after "atomic_inc". So either all these
inc/dec are protected with mutexes or spinlocks, in which case one might wonder
why atomic operations are used at all, or it's a racy mess. (I vote for the
second option)

kref should certainly be used there.

About the instrumentation, well... the only way to have something that's not
racy would be to instrument kref directly, and use atomic_add_return() in both
the get/put paths. But I fear that the performance impact on many architectures
might be significant (turning atomic_add + smp_mb() into a cmpxchg()). Maybe it
could be acceptable as a kernel debug option.
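
As a user-space analogy only (C11 atomics, not the kernel atomic_t API
and not kref), the difference between the two reporting patterns can be
sketched like this. The function names are invented for the example:

```c
#include <stdatomic.h>

static atomic_int usage_count;

/*
 * Racy pattern: the value reported is computed from a separate load,
 * so another thread may change the counter between the load and the add.
 */
static int usage_get_racy(void)
{
	int reported = atomic_load(&usage_count) + 1;

	atomic_fetch_add(&usage_count, 1);
	return reported;
}

/*
 * Single read-modify-write: the reported value is derived from the
 * result of the same atomic operation that changed the counter.
 */
static int usage_get(void)
{
	return atomic_fetch_add(&usage_count, 1) + 1;
}

static int usage_put(void)
{
	return atomic_fetch_sub(&usage_count, 1) - 1;
}
```

Single-threaded, both variants report the same numbers; only under
concurrent get/put can usage_get_racy() report a value the counter
never held as a result of that call.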

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 17:58             ` Peter Zijlstra
  2010-10-26 18:14               ` Mathieu Desnoyers
@ 2010-10-26 18:14               ` Mathieu Desnoyers
  2010-10-26 18:50                 ` [linux-pm] " Alan Stern
                                   ` (3 more replies)
  2010-10-26 18:15               ` Pierre Tardy
  2010-10-26 18:15               ` Pierre Tardy
  3 siblings, 4 replies; 135+ messages in thread
From: Mathieu Desnoyers @ 2010-10-26 18:14 UTC (permalink / raw)
  To: Peter Zijlstra, Greg Kroah-Hartman
  Cc: Pierre Tardy, Arjan van de Ven, Ingo Molnar, Thomas Renninger,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Masami Hiramatsu,
	Frank Eigler, Steven Rostedt, Kevin Hilman, linux-omap, rjw,
	linux-pm, linux-trace-users, Jean Pihet, Frederic Weisbecker,
	Tejun Heo

* Peter Zijlstra (peterz@infradead.org) wrote:
> On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
> > 
> > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> >         atomic_inc(&dev->power.usage_count); 
> 
> That's terribly racy..

Looking at the original code, it looks racy even without considering the
tracepoint:

int __pm_runtime_get(struct device *dev, bool sync)
 {
        int retval;

+       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
        atomic_inc(&dev->power.usage_count);
        retval = sync ? pm_runtime_resume(dev) : pm_request_resume(dev);

There is no implied memory barrier after "atomic_inc". So either all these
inc/dec are protected with mutexes or spinlocks, in which case one might wonder
why atomic operations are used at all, or it's a racy mess. (I vote for the
second option)

kref should certainly be used there.

About the instrumentation, well... the only way to have something that's not
racy would be to instrument kref directly, and use atomic_add_return() in both
the get/put paths. But I fear that the performance impact on many architectures
might be significant (turning atomic_add + smp_mb() into a cmpxchg()). Maybe it
could be acceptable as a kernel debug option.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 17:58             ` Peter Zijlstra
                                 ` (2 preceding siblings ...)
  2010-10-26 18:15               ` Pierre Tardy
@ 2010-10-26 18:15               ` Pierre Tardy
  3 siblings, 0 replies; 135+ messages in thread
From: Pierre Tardy @ 2010-10-26 18:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Arjan van de Ven, Frederic Weisbecker, Jean Pihet,
	Steven Rostedt, linux-trace-users, Frank Eigler,
	Mathieu Desnoyers, linux-pm, Masami Hiramatsu, Tejun Heo,
	Thomas Gleixner, linux-omap, Linus Torvalds, Ingo Molnar

On Tue, Oct 26, 2010 at 12:58 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
>>
>> +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
>>         atomic_inc(&dev->power.usage_count);
>
> That's terribly racy..
>
I know. I'm not proud of this. As I said, this is a preliminary patch.
We don't really need this prev_usage; it is just for debugging.
It will probably end up as something like:

         atomic_inc(&dev->power.usage_count);
+       trace_power_device_usage(dev);

-- 
Pierre

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 17:58             ` Peter Zijlstra
  2010-10-26 18:14               ` Mathieu Desnoyers
  2010-10-26 18:14               ` Mathieu Desnoyers
@ 2010-10-26 18:15               ` Pierre Tardy
  2010-10-26 19:08                 ` Rafael J. Wysocki
  2010-10-26 19:08                 ` Rafael J. Wysocki
  2010-10-26 18:15               ` Pierre Tardy
  3 siblings, 2 replies; 135+ messages in thread
From: Pierre Tardy @ 2010-10-26 18:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Arjan van de Ven, Ingo Molnar, Thomas Renninger, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Masami Hiramatsu, Frank Eigler,
	Steven Rostedt, Kevin Hilman, linux-omap, rjw, linux-pm,
	linux-trace-users, Jean Pihet, Frederic Weisbecker, Tejun Heo,
	Mathieu Desnoyers

On Tue, Oct 26, 2010 at 12:58 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
>>
>> +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
>>         atomic_inc(&dev->power.usage_count);
>
> That's terribly racy..
>
I know. I'm not proud of this. As I said, this is a preliminary patch.
We don't really need this prev_usage; it is just for debugging.
It will probably end up as something like:

         atomic_inc(&dev->power.usage_count);
+       trace_power_device_usage(dev);

-- 
Pierre

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 18:14               ` Mathieu Desnoyers
  2010-10-26 18:50                 ` [linux-pm] " Alan Stern
@ 2010-10-26 18:50                 ` Alan Stern
  2010-10-26 19:04                 ` Rafael J. Wysocki
  2010-10-26 19:04                 ` Rafael J. Wysocki
  3 siblings, 0 replies; 135+ messages in thread
From: Alan Stern @ 2010-10-26 18:50 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Pierre Tardy, Peter Zijlstra, linux-pm, Ingo Molnar, Jean Pihet,
	Steven Rostedt, linux-trace-users, Frank Eigler, Linus Torvalds,
	Frederic Weisbecker, Masami Hiramatsu, Tejun Heo, Andrew Morton,
	linux-omap, Arjan van de Ven, Thomas Gleixner

On Tue, 26 Oct 2010, Mathieu Desnoyers wrote:

> * Peter Zijlstra (peterz@infradead.org) wrote:
> > On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
> > > 
> > > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> > >         atomic_inc(&dev->power.usage_count); 
> > 
> > That's terribly racy..
> 
> Looking at the original code, it looks racy even without considering the
> tracepoint:
> 
> int __pm_runtime_get(struct device *dev, bool sync)
>  {
>         int retval;
> 
> +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
>         atomic_inc(&dev->power.usage_count);
>         retval = sync ? pm_runtime_resume(dev) : pm_request_resume(dev);
> 
> There is no implied memory barrier after "atomic_inc". So either all these
> inc/dec are protected with mutexes or spinlocks, in which case one might wonder
> why atomic operations are used at all, or it's a racy mess. (I vote for the
> second option)

I don't understand.  What's the problem?  The inc/dec are atomic 
because they are not protected by spinlocks, but everything else is 
(aside from the tracepoint, which is new).

> kref should certainly be used there.

What for?

Alan Stern

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [linux-pm] [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 18:14               ` Mathieu Desnoyers
@ 2010-10-26 18:50                 ` Alan Stern
  2010-10-26 21:33                   ` Mathieu Desnoyers
  2010-10-26 21:33                   ` [linux-pm] " Mathieu Desnoyers
  2010-10-26 18:50                 ` Alan Stern
                                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 135+ messages in thread
From: Alan Stern @ 2010-10-26 18:50 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Greg Kroah-Hartman, Andrew Morton, Pierre Tardy,
	Arjan van de Ven, Frederic Weisbecker, Jean Pihet,
	Steven Rostedt, linux-trace-users, Frank Eigler, Thomas Gleixner,
	linux-pm, Masami Hiramatsu, Tejun Heo, Ingo Molnar, linux-omap,
	Linus Torvalds

On Tue, 26 Oct 2010, Mathieu Desnoyers wrote:

> * Peter Zijlstra (peterz@infradead.org) wrote:
> > On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
> > > 
> > > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> > >         atomic_inc(&dev->power.usage_count); 
> > 
> > That's terribly racy..
> 
> Looking at the original code, it looks racy even without considering the
> tracepoint:
> 
> int __pm_runtime_get(struct device *dev, bool sync)
>  {
>         int retval;
> 
> +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
>         atomic_inc(&dev->power.usage_count);
>         retval = sync ? pm_runtime_resume(dev) : pm_request_resume(dev);
> 
> There is no implied memory barrier after "atomic_inc". So either all these
> inc/dec are protected with mutexes or spinlocks, in which case one might wonder
> why atomic operations are used at all, or it's a racy mess. (I vote for the
> second option)

I don't understand.  What's the problem?  The inc/dec are atomic 
because they are not protected by spinlocks, but everything else is 
(aside from the tracepoint, which is new).

> kref should certainly be used there.

What for?

Alan Stern


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-25 23:33   ` [PATCH] PERF(kernel): Cleanup power events V2 Thomas Renninger
                       ` (6 preceding siblings ...)
  2010-10-26 18:52     ` Rafael J. Wysocki
@ 2010-10-26 18:52     ` Rafael J. Wysocki
  7 siblings, 0 replies; 135+ messages in thread
From: Rafael J. Wysocki @ 2010-10-26 18:52 UTC (permalink / raw)
  To: Thomas Renninger
  Cc: Arjan van de Ven, linux-trace-users, Frederic Weisbecker,
	Pierre Tardy, Jean Pihet, Steven Rostedt, Peter Zijlstra,
	Frank Eigler, Mathieu Desnoyers, Ingo Molnar, linux-pm,
	Masami Hiramatsu, Tejun Heo, Andrew Morton, linux-omap,
	Linus Torvalds, Thomas Gleixner

On Tuesday, October 26, 2010, Thomas Renninger wrote:
> Changes in V2:
>   - Introduce PWR_EVENT_EXIT instead of 0 to mark non-power state
>   - Use u32 instead of u64 for cpuid, state which is by far enough
> 
> New power trace events:
> power:processor_idle
> power:processor_frequency
> power:machine_suspend
> 
> 
> C-state/idle accounting events:
>   power:power_start
>   power:power_end
> are replaced with:
>   power:processor_idle
> 
> and
>   power:power_frequency
> is replaced with:
>   power:processor_frequency
> 
> power:machine_suspend
> is newly introduced, a first implementation
> comes from the ARM side, but it's easy to add these events
> in X86 as well if needed.

As I said already somewhere else, I think this one should be done at the
core level rather than in arch-specific code.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-25 23:33   ` [PATCH] PERF(kernel): Cleanup power events V2 Thomas Renninger
                       ` (5 preceding siblings ...)
  2010-10-26  7:59     ` Jean Pihet
@ 2010-10-26 18:52     ` Rafael J. Wysocki
  2010-10-26 18:52     ` Rafael J. Wysocki
  7 siblings, 0 replies; 135+ messages in thread
From: Rafael J. Wysocki @ 2010-10-26 18:52 UTC (permalink / raw)
  To: Thomas Renninger
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Masami Hiramatsu,
	Frank Eigler, Steven Rostedt, Kevin Hilman, Peter Zijlstra,
	linux-omap, linux-pm, linux-trace-users, Jean Pihet,
	Pierre Tardy, Frederic Weisbecker, Tejun Heo, Mathieu Desnoyers,
	Arjan van de Ven, Ingo Molnar

On Tuesday, October 26, 2010, Thomas Renninger wrote:
> Changes in V2:
>   - Introduce PWR_EVENT_EXIT instead of 0 to mark non-power state
>   - Use u32 instead of u64 for cpuid, state which is by far enough
> 
> New power trace events:
> power:processor_idle
> power:processor_frequency
> power:machine_suspend
> 
> 
> C-state/idle accounting events:
>   power:power_start
>   power:power_end
> are replaced with:
>   power:processor_idle
> 
> and
>   power:power_frequency
> is replaced with:
>   power:processor_frequency
> 
> power:machine_suspend
> is newly introduced, a first implementation
> comes from the ARM side, but it's easy to add these events
> in X86 as well if needed.

As I said already somewhere else, I think this one should be done at the
core level rather than in arch-specific code.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 11:48           ` Thomas Renninger
                               ` (2 preceding siblings ...)
  2010-10-26 18:57             ` Rafael J. Wysocki
@ 2010-10-26 18:57             ` Rafael J. Wysocki
  3 siblings, 0 replies; 135+ messages in thread
From: Rafael J. Wysocki @ 2010-10-26 18:57 UTC (permalink / raw)
  To: Thomas Renninger
  Cc: Arjan van de Ven, Andrew Morton, linux-trace-users,
	Frederic Weisbecker, Pierre Tardy, Jean Pihet, Steven Rostedt,
	Peter Zijlstra, Frank Eigler, Mathieu Desnoyers, linux-pm,
	Masami Hiramatsu, Tejun Heo, Thomas Gleixner, linux-omap,
	Linus Torvalds, Ingo Molnar

On Tuesday, October 26, 2010, Thomas Renninger wrote:
> On Tuesday 26 October 2010 13:21:29 Ingo Molnar wrote:
> > 
> > * Jean Pihet <jean.pihet@newoldbits.com> wrote:
> ..
> > > >> +#ifndef _TRACE_POWER_ENUM_
> > > >> +#define _TRACE_POWER_ENUM_
> > > >> +enum {
> > > >> +     POWER_NONE = 0,
> > > >> +     POWER_CSTATE = 1,
> > > >> +     POWER_PSTATE = 2,
> > > >> +};
> > > >> +#endif
> > > >
> > > > Since we are cleaning up all these events, those enum definitions dont really look
> > > > logical. For example, what is 'POWER_NONE'? Can a CPU have 'no power'?
> > >
> > > The enum belongs to the deprecated API so I would rather not touch it.
> > > Keeping the deprecated code isolated will make it easier to remove
> > > later.
> > 
> > So what will replace it? We still have a state field.
> Nothing, this is part of the cleanup.
> As you state above: POWER_NONE does not make sense at all.
> The whole thing (type= attribute that vanishes now) is
> passed to userspace, but never gets used there because the
> same info is in the event name:
> cpu_frequency <-> frequency_switch      <-> PSTATE
> cpu_idle      <-> power_start/power_end <-> CSTATE 
> 
> I expect that there was an initial power_start/end which
> was also used for frequency switching.
> Then it got realized that _start/_end does not work out and
> frequency_switch got introduced.
> To stay compatible the whole power_start/end was not renamed
> to cpu_idle and the type= field was kept.
> 
> This is a guess without even looking at the git history.
> Therefore my partly harsh comments about the sanity of the
> current power tracing events.
> 
> > Passing in platform specific codes is a step backwards.
> > 
> > > >> +TRACE_EVENT(machine_suspend,
> > > >> +
> > > >> +     TP_PROTO(unsigned int state),
> > > >> +
> > > >> +     TP_ARGS(state),
> > > >> +
> > > >> +     TP_STRUCT__entry(
> > > >> +             __field(        u32,            state           )
> > > >> +     ),
> > > >
> > > > Hm, this event is not used anywhere in the submitted patches. Where is the patch
> > > > that adds usage, and what are the possible values for 'state'?
> > >
> > > This will come as a separate patch, which fits all platforms. Cf.
> > > http://marc.info/?l=linux-omap&m=128620575300682&w=2.
> > > The state field is of type suspend_state_t, cf. include/linux/suspend.h
> > 
> > Ok, that's at least generic. Needs the review of Rafael, to determine
> > whether this state value is all we want to know when we enter suspend.
> He already gave an acked-by on this generic one here:
> Re: [PATCH 3/4] perf: add calls to suspend trace point
> Oh now, that was on the X86 specific part which depends on this one.
> One should expect that he's fine with the generic part as well then,
> but I agree that he should definitely have a look at this and sign it off.

What patch exactly do you mean?  I'm not quite sure from your comment above.

Rafael

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 11:48           ` Thomas Renninger
  2010-10-26 11:54             ` Ingo Molnar
  2010-10-26 11:54             ` Ingo Molnar
@ 2010-10-26 18:57             ` Rafael J. Wysocki
  2010-10-27  0:00               ` Thomas Renninger
  2010-10-27  0:00               ` Thomas Renninger
  2010-10-26 18:57             ` Rafael J. Wysocki
  3 siblings, 2 replies; 135+ messages in thread
From: Rafael J. Wysocki @ 2010-10-26 18:57 UTC (permalink / raw)
  To: Thomas Renninger
  Cc: Ingo Molnar, Jean Pihet, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Masami Hiramatsu, Frank Eigler, Steven Rostedt,
	Kevin Hilman, Peter Zijlstra, linux-omap, linux-pm,
	linux-trace-users, Pierre Tardy, Frederic Weisbecker, Tejun Heo,
	Mathieu Desnoyers, Arjan van de Ven

On Tuesday, October 26, 2010, Thomas Renninger wrote:
> On Tuesday 26 October 2010 13:21:29 Ingo Molnar wrote:
> > 
> > * Jean Pihet <jean.pihet@newoldbits.com> wrote:
> ..
> > > >> +#ifndef _TRACE_POWER_ENUM_
> > > >> +#define _TRACE_POWER_ENUM_
> > > >> +enum {
> > > >> +     POWER_NONE = 0,
> > > >> +     POWER_CSTATE = 1,
> > > >> +     POWER_PSTATE = 2,
> > > >> +};
> > > >> +#endif
> > > >
> > > > Since we are cleaning up all these events, those enum definitions dont really look
> > > > logical. For example, what is 'POWER_NONE'? Can a CPU have 'no power'?
> > >
> > > The enum belongs to the deprecated API so I would rather not touch it.
> > > Keeping the deprecated code isolated will make it easier to remove
> > > later.
> > 
> > So what will replace it? We still have a state field.
> Nothing, this is part of the cleanup.
> As you state above: POWER_NONE does not make sense at all.
> The whole thing (type= attribute that vanishes now) is
> passed to userspace, but never gets used there because the
> same info is in the event name:
> cpu_frequency <-> frequency_switch      <-> PSTATE
> cpu_idle      <-> power_start/power_end <-> CSTATE 
> 
> I expect that there was an initial power_start/end which
> was also used for frequency switching.
> Then it got realized that _start/_end does not work out and
> frequency_switch got introduced.
> To stay compatible the whole power_start/end was not renamed
> to cpu_idle and the type= field was kept.
> 
> This is a guess without even looking at the git history.
> Therefore my partly harsh comments about the sanity of the
> current power tracing events.
> 
> > Passing in platform specific codes is a step backwards.
> > 
> > > >> +TRACE_EVENT(machine_suspend,
> > > >> +
> > > >> +     TP_PROTO(unsigned int state),
> > > >> +
> > > >> +     TP_ARGS(state),
> > > >> +
> > > >> +     TP_STRUCT__entry(
> > > >> +             __field(        u32,            state           )
> > > >> +     ),
> > > >
> > > > Hm, this event is not used anywhere in the submitted patches. Where is the patch
> > > > that adds usage, and what are the possible values for 'state'?
> > >
> > > This will come as a separate patch, which fits all platforms. Cf.
> > > http://marc.info/?l=linux-omap&m=128620575300682&w=2.
> > > The state field is of type suspend_state_t, cf. include/linux/suspend.h
> > 
> > Ok, that's at least generic. Needs the review of Rafael, to determine
> > whether this state value is all we want to know when we enter suspend.
> He already gave an acked-by on this generic one here:
> Re: [PATCH 3/4] perf: add calls to suspend trace point
> Oh now, that was on the X86 specific part which depends on this one.
> One should expect that he's fine with the generic part as well then,
> but I agree that he should definitely have a look at this and sign it off.

What patch exactly do you mean?  I'm not quite sure from your comment above.

Rafael

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 11:19         ` Ingo Molnar
  2010-10-26 19:01           ` Rafael J. Wysocki
@ 2010-10-26 19:01           ` Rafael J. Wysocki
  1 sibling, 0 replies; 135+ messages in thread
From: Rafael J. Wysocki @ 2010-10-26 19:01 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arjan van de Ven, linux-trace-users, Frederic Weisbecker,
	Pierre Tardy, Jean Pihet, Steven Rostedt, Peter Zijlstra,
	Frank Eigler, Thomas Gleixner, linux-pm, Masami Hiramatsu,
	Tejun Heo, Andrew Morton, linux-omap, Linus Torvalds,
	Mathieu Desnoyers

On Tuesday, October 26, 2010, Ingo Molnar wrote:
> 
> * Thomas Renninger <trenn@suse.de> wrote:
> 
> > On Tuesday 26 October 2010 09:10:20 Ingo Molnar wrote:
> > > 
> > > * Thomas Renninger <trenn@suse.de> wrote:
> > > 
> > > > Changes in V2:
> > > >   - Introduce PWR_EVENT_EXIT instead of 0 to mark non-power state
> > > >   - Use u32 instead of u64 for cpuid, state which is by far enough
> > > > 
> > > > New power trace events:
> > > > power:processor_idle
> > > > power:processor_frequency
> > > > power:machine_suspend
> > > > 
> > > > 
> > > > C-state/idle accounting events:
> > > >   power:power_start
> > > >   power:power_end
> > > > are replaced with:
> > > >   power:processor_idle
> > > > 
> > > > and
> > > >   power:power_frequency
> > > > is replaced with:
> > > >   power:processor_frequency
> > > 
> > > Could you please name it power:cpu_idle and power:cpu_frequency instead, for 
> > > shortness? We generally use 'cpu' in the kernel and for events.
> > Sure.
> > > 
> > > > power:machine_suspend
> > > 
> > > How will future PCI (or other device) power saving tracepoints be called?
> > > 
> > > Might be more consistent to use:
> > > 
> > >   power:cpu_idle
> > >   power:machine_idle
> > >   power:device_idle
> >
> > device idle is not true. Those may be low power modes
> > like reduced network throughput, reduced wlan range, the device
> > needs not to be idle.
> > Device power states is probably the most complex area, if such
> > a thing gets introduced, it should makes sense to state
> > the interface experimental for some time until a wider range of
> > devices uses it (in contrast to these new ones
> > which should not change that soon anymore...).
> 
> Ok.
> 
> > Also machine_idle may be true, but machine_suspend sounds more
> > familiar and everyone immediately knows what the event is about.
> 
> Ok - fair enough.
> 
> > > > +#define PWR_EVENT_EXIT 0xFFFFFFFF
> > > 
> > > Shouldnt this be part of the POWER_ enum? (and you can write -1 there)
> >
> > No, below enum will vanish, but -1 is nicer.
> 
> When it vanishes what will replace it?
> 
> > ...
> > 
> > > Plus:
> > > 
> > > > +DECLARE_EVENT_CLASS(processor,
> > > > +
> > > > +	TP_PROTO(unsigned int state, unsigned int cpu_id),
> > > > +
> > > > +	TP_ARGS(state, cpu_id),
> > > > +
> > > > +	TP_STRUCT__entry(
> > > > +		__field(	u32,		state		)
> > > > +		__field(	u32,		cpu_id		)
> > > 
> > > Trace entries can carry a cpu_id of the current processor already. Can this cpu_id 
> > > ever be different from that CPU id?
> >
> > Yes. A core's frequency can depend on another one which
> > will get switched as well (by one command/MSR).
> > Compare with commit 6f4f2723d08534fd4e407e1e.
> > 
> > This can theoretically also be the case for sleep states.
> > Afaik such HW does not exist yet, but ACPI spec already
> > provides interfaces to pass these dependency from BIOS to OS.
> > -> We want a stable ABI and should be prepared for such stuff.
> > 
> > > > +	),
> > > > +
> > > > +	TP_fast_assign(
> > > > +		__entry->state = state;
> > > > +		__entry->cpu_id = cpu_id;
> > > > +	),
> > > > +
> > > > +	TP_printk("state=%lu cpu_id=%lu", (unsigned long)__entry->state,
> > > > +		  (unsigned long)__entry->cpu_id)
> > > > +);
> > > > +
> > > > +DEFINE_EVENT(processor, processor_idle,
> > > > +
> > > > +	TP_PROTO(unsigned int state, unsigned int cpu_id),
> > > > +
> > > > +	     TP_ARGS(state, cpu_id)
> > > > +);
> > > > +
> > > > +#define PWR_EVENT_EXIT 0xFFFFFFFF
> > > > +
> > > > +DEFINE_EVENT(processor, processor_frequency,
> > > > +
> > > > +	TP_PROTO(unsigned int frequency, unsigned int cpu_id),
> > > > +
> > > > +	TP_ARGS(frequency, cpu_id)
> > > > +);
> > > 
> > > So, we have a 'state' field in the class, which is used as 'state' by the 
> > > power::cpu_idle event, and as 'frequency' by the power::cpu_freq event?
> >
> > Yes, is this a problem?
> >
> > Definitions are a bit shorter having one power processor class.
> > As "frequency" is stated in frequency event definition everything should
> > be obvious and this one looks like the more elegant way to me.
> >  
> > > Are there any architectures that track frequency in Hz, not in kHz? If yes, might 
> > > there ever be a need for the frequency value to be larger than 4.29 GHz? If yes, 
> > > then it wont fit into u32.
> >
> > drivers/cpufreq subsystem is fixed to unsigned int (cmp. include/linux/cpufreq.h):
> >         unsigned int            min;    /* in kHz */
> >         unsigned int            max;    /* in kHz */
> >         unsigned int            cur;    /* in kHz,
> >         ...
> > that should be fine.
> 
> ok, good - so we should be fine up to 4 THz CPUs.
> 
> > > Also, might there be a future need to express different types of frequencies? 
> > > For example, should we decide to track turbo frequencies in Intel CPUs, how 
> > > would that be expressed via these events? Are there any architectures and CPUs 
> > > that somehow have some extra attribute to the frequency value?
> >
> > I wonder whether this ever can/will work in a sane way.
> > Userspace can compare with:
> > /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq
> > everything above is turbo. So I do not think it's ever needed.
> > But adding an additional value at the end does not violate
> > userspace compatibility. This has been done with the cpuid
> > as well.
> >  
> > > > +TRACE_EVENT(machine_suspend,
> > > > +
> > > > +	TP_PROTO(unsigned int state),
> > > > +
> > > > +	TP_ARGS(state),
> > > > +
> > > > +	TP_STRUCT__entry(
> > > > +		__field(	u32,		state		)
> > > > +	),
> > > 
> > > Hm, this event is not used anywhere in the submitted patches. Where is the patch 
> > > that adds usage, and what are the possible values for 'state'?
> >
> > Jean wants to make use of it on ARM.
> > I also had patch for x86, I can have another look at it, Rafael
> > already gave me a comment on it. But on X86 you typically realize
> > when you suspend the machine (could imagine this is more useful on
> > ARM driven mobile phones and similar), still I can add it..
> > 
> > Values probably should be (include/linux/suspend.h):
> > #define PM_SUSPEND_ON       0
> > #define PM_SUSPEND_STANDBY  1
> > #define PM_SUSPEND_MEM      3
> > #define PM_SUSPEND_MAX      4
> > 
> > How this strange state Arjan talked about is passed is up
> > to these guys. Instead of using 0 and above pre-defined such
> > arch specific special states better should get passed by:
> > 
> > #define X86_MOORESTOWN_STANDBY_S0   0x100
> > ..                                  0x101
> > #define ARM_WHATEVER_STRANGE_THING  0x200
> > ...
> 
> I'd rather like to see a meaningful name to be given to these states and them being 
> passed, instead of weird platform specific things. Tooling will try to be as generic 
> as possible.
> 
> I dont know this stuff, but making a distinction between s2ram and s2disk events 
> would seem meaningful.

Basically, we have 

standby (which is what it says)
mem (s2ram)
disk (s2disk)

These are the transitions the PM core really cares about (if supported).
They can be read from /sys/power/state and I think these names should be used
by the tracing interfaces too.
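
As a rough sketch of what using those names could look like on the tracing
side, assuming a made-up set of state codes (the sketch_* identifiers below
are hypothetical and are not the values from include/linux/suspend.h):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical state codes, for illustration only. */
enum sketch_suspend_state {
	SKETCH_SUSPEND_STANDBY,
	SKETCH_SUSPEND_MEM,
	SKETCH_SUSPEND_DISK,
	SKETCH_SUSPEND_COUNT
};

/* Names as they appear in /sys/power/state. */
static const char *const sketch_state_names[SKETCH_SUSPEND_COUNT] = {
	[SKETCH_SUSPEND_STANDBY] = "standby",
	[SKETCH_SUSPEND_MEM]     = "mem",	/* s2ram */
	[SKETCH_SUSPEND_DISK]    = "disk",	/* s2disk */
};

/* Map a state code to its name; out-of-range codes give "unknown". */
const char *sketch_state_name(unsigned int state)
{
	if (state >= SKETCH_SUSPEND_COUNT)
		return "unknown";
	return sketch_state_names[state];
}

/* Reverse lookup for tooling: returns the code, or -1 if not found. */
int sketch_state_from_name(const char *name)
{
	assert(name != NULL);
	for (int i = 0; i < SKETCH_SUSPEND_COUNT; i++)
		if (strcmp(sketch_state_names[i], name) == 0)
			return i;
	return -1;
}
```

With a table like this, a tool would see "mem" or "disk" in the event output
and would not have to interpret platform-specific numbers.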

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 11:19         ` Ingo Molnar
@ 2010-10-26 19:01           ` Rafael J. Wysocki
  2010-10-26 19:01           ` Rafael J. Wysocki
  1 sibling, 0 replies; 135+ messages in thread
From: Rafael J. Wysocki @ 2010-10-26 19:01 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thomas Renninger, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Masami Hiramatsu, Frank Eigler, Steven Rostedt, Kevin Hilman,
	Peter Zijlstra, linux-omap, linux-pm, linux-trace-users,
	Jean Pihet, Pierre Tardy, Frederic Weisbecker, Tejun Heo,
	Mathieu Desnoyers, Arjan van de Ven

On Tuesday, October 26, 2010, Ingo Molnar wrote:
> 
> * Thomas Renninger <trenn@suse.de> wrote:
> 
> > On Tuesday 26 October 2010 09:10:20 Ingo Molnar wrote:
> > > 
> > > * Thomas Renninger <trenn@suse.de> wrote:
> > > 
> > > > Changes in V2:
> > > >   - Introduce PWR_EVENT_EXIT instead of 0 to mark non-power state
> > > >   - Use u32 instead of u64 for cpuid, state which is by far enough
> > > > 
> > > > New power trace events:
> > > > power:processor_idle
> > > > power:processor_frequency
> > > > power:machine_suspend
> > > > 
> > > > 
> > > > C-state/idle accounting events:
> > > >   power:power_start
> > > >   power:power_end
> > > > are replaced with:
> > > >   power:processor_idle
> > > > 
> > > > and
> > > >   power:power_frequency
> > > > is replaced with:
> > > >   power:processor_frequency
> > > 
> > > Could you please name it power:cpu_idle and power:cpu_frequency instead, for 
> > > shortness? We generally use 'cpu' in the kernel and for events.
> > Sure.
> > > 
> > > > power:machine_suspend
> > > 
> > > How will future PCI (or other device) power saving tracepoints be called?
> > > 
> > > Might be more consistent to use:
> > > 
> > >   power:cpu_idle
> > >   power:machine_idle
> > >   power:device_idle
> >
> > device idle is not true. Those may be low power modes
> > like reduced network throughput, reduced wlan range, the device
> > needs not to be idle.
> > Device power states is probably the most complex area, if such
> > a thing gets introduced, it should makes sense to state
> > the interface experimental for some time until a wider range of
> > devices uses it (in contrast to these new ones
> > which should not change that soon anymore...).
> 
> Ok.
> 
> > Also machine_idle may be true, but machine_suspend sounds more
> > familiar and everyone immediately knows what the event is about.
> 
> Ok - fair enough.
> 
> > > > +#define PWR_EVENT_EXIT 0xFFFFFFFF
> > > 
> > > Shouldnt this be part of the POWER_ enum? (and you can write -1 there)
> >
> > No, below enum will vanish, but -1 is nicer.
> 
> When it vanishes what will replace it?
> 
> > ...
> > 
> > > Plus:
> > > 
> > > > +DECLARE_EVENT_CLASS(processor,
> > > > +
> > > > +	TP_PROTO(unsigned int state, unsigned int cpu_id),
> > > > +
> > > > +	TP_ARGS(state, cpu_id),
> > > > +
> > > > +	TP_STRUCT__entry(
> > > > +		__field(	u32,		state		)
> > > > +		__field(	u32,		cpu_id		)
> > > 
> > > Trace entries can carry a cpu_id of the current processor already. Can this cpu_id 
> > > ever be different from that CPU id?
> >
> > Yes. A core's frequency can depend on another one which
> > will get switched as well (by one command/MSR).
> > Compare with commit 6f4f2723d08534fd4e407e1e.
> > 
> > This can theoretically also be the case for sleep states.
> > Afaik such HW does not exist yet, but ACPI spec already
> > provides interfaces to pass these dependency from BIOS to OS.
> > -> We want a stable ABI and should be prepared for such stuff.
> > 
> > > > +	),
> > > > +
> > > > +	TP_fast_assign(
> > > > +		__entry->state = state;
> > > > +		__entry->cpu_id = cpu_id;
> > > > +	),
> > > > +
> > > > +	TP_printk("state=%lu cpu_id=%lu", (unsigned long)__entry->state,
> > > > +		  (unsigned long)__entry->cpu_id)
> > > > +);
> > > > +
> > > > +DEFINE_EVENT(processor, processor_idle,
> > > > +
> > > > +	TP_PROTO(unsigned int state, unsigned int cpu_id),
> > > > +
> > > > +	     TP_ARGS(state, cpu_id)
> > > > +);
> > > > +
> > > > +#define PWR_EVENT_EXIT 0xFFFFFFFF
> > > > +
> > > > +DEFINE_EVENT(processor, processor_frequency,
> > > > +
> > > > +	TP_PROTO(unsigned int frequency, unsigned int cpu_id),
> > > > +
> > > > +	TP_ARGS(frequency, cpu_id)
> > > > +);
> > > 
> > > So, we have a 'state' field in the class, which is used as 'state' by the 
> > > power::cpu_idle event, and as 'frequency' by the power::cpu_freq event?
> >
> > Yes, is this a problem?
> >
> > Definitions are a bit shorter having one power processor class.
> > As "frequency" is stated in frequency event definition everything should
> > be obvious and this one looks like the more elegant way to me.
> >  
> > > Are there any architectures that track frequency in Hz, not in kHz? If yes, might 
> > > there ever be a need for the frequency value to be larger than 4.29 GHz? If yes, 
> > > then it wont fit into u32.
> >
> > drivers/cpufreq subsystem is fixed to unsigned int (cmp. include/linux/cpufreq.h):
> >         unsigned int            min;    /* in kHz */
> >         unsigned int            max;    /* in kHz */
> >         unsigned int            cur;    /* in kHz,
> >         ...
> > that should be fine.
> 
> ok, good - so we should be fine up to 4 THz CPUs.
> 
> > > Also, might there be a future need to express different types of frequencies? 
> > > For example, should we decide to track turbo frequencies in Intel CPUs, how 
> > > would that be expressed via these events? Are there any architectures and CPUs 
> > > that somehow have some extra attribute to the frequency value?
> >
> > I wonder whether this ever can/will work in a sane way.
> > Userspace can compare with:
> > /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq
> > everything above is turbo. So I do not think it's ever needed.
> > But adding an additional value at the end does not violate
> > userspace compatibility. This has been done with the cpuid
> > as well.
> >  
> > > > +TRACE_EVENT(machine_suspend,
> > > > +
> > > > +	TP_PROTO(unsigned int state),
> > > > +
> > > > +	TP_ARGS(state),
> > > > +
> > > > +	TP_STRUCT__entry(
> > > > +		__field(	u32,		state		)
> > > > +	),
> > > 
> > > Hm, this event is not used anywhere in the submitted patches. Where is the patch 
> > > that adds usage, and what are the possible values for 'state'?
> >
> > Jean wants to make use of it on ARM.
> > I also had patch for x86, I can have another look at it, Rafael
> > already gave me a comment on it. But on X86 you typically realize
> > when you suspend the machine (could imagine this is more useful on
> > ARM driven mobile phones and similar), still I can add it..
> > 
> > Values probably should be (include/linux/suspend.h):
> > #define PM_SUSPEND_ON       0
> > #define PM_SUSPEND_STANDBY  1
> > #define PM_SUSPEND_MEM      3
> > #define PM_SUSPEND_MAX      4
> > 
> > How this strange state Arjan talked about is passed is up
> > to these guys. Instead of using 0 and above pre-defined such
> > arch specific special states better should get passed by:
> > 
> > #define X86_MOORESTOWN_STANDBY_S0   0x100
> > ..                                  0x101
> > #define ARM_WHATEVER_STRANGE_THING  0x200
> > ...
> 
> I'd rather like to see a meaningful name to be given to these states and them being 
> passed, instead of weird platform specific things. Tooling will try to be as generic 
> as possible.
> 
> I dont know this stuff, but making a distinction between s2ram and s2disk events 
> would seem meaningful.

Basically, we have 

standby (which is what it says)
mem (s2ram)
disk (s2disk)

These are the transitions the PM core really cares about (if supported).
They can be read from /sys/power/state and I think these names should be used
by the tracing interfaces too.
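
For illustration only, here is a minimal sketch of a name table keyed by the
three transitions listed above. The enum and function names are hypothetical
(they are not existing kernel symbols); only the strings "standby", "mem" and
"disk" come from /sys/power/state.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical identifiers for illustration; not taken from any kernel header. */
enum trace_sleep_state {
	TRACE_SLEEP_STANDBY,
	TRACE_SLEEP_MEM,	/* s2ram */
	TRACE_SLEEP_DISK,	/* s2disk */
	TRACE_SLEEP_NR
};

/* Names as they appear in /sys/power/state. */
static const char *const trace_sleep_names[TRACE_SLEEP_NR] = {
	[TRACE_SLEEP_STANDBY]	= "standby",
	[TRACE_SLEEP_MEM]	= "mem",
	[TRACE_SLEEP_DISK]	= "disk",
};

/* Map a state value to its name; NULL for anything out of range. */
const char *trace_sleep_state_name(unsigned int state)
{
	if (state >= TRACE_SLEEP_NR)
		return NULL;
	return trace_sleep_names[state];
}

/* Map a name back to its state value; -1 if the name is unknown. */
int trace_sleep_state_from_name(const char *name)
{
	int i;

	assert(name != NULL);
	for (i = 0; i < TRACE_SLEEP_NR; i++)
		if (strcmp(name, trace_sleep_names[i]) == 0)
			return i;
	return -1;
}
```

A tool could then print the name rather than a bare number, and anything
platform specific would simply fall outside the table.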

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 18:14               ` Mathieu Desnoyers
  2010-10-26 18:50                 ` [linux-pm] " Alan Stern
  2010-10-26 18:50                 ` Alan Stern
@ 2010-10-26 19:04                 ` Rafael J. Wysocki
  2010-10-26 19:04                 ` Rafael J. Wysocki
  3 siblings, 0 replies; 135+ messages in thread
From: Rafael J. Wysocki @ 2010-10-26 19:04 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andrew Morton, Pierre Tardy, Peter Zijlstra, linux-trace-users,
	Jean Pihet, Steven Rostedt, Frederic Weisbecker, Linus Torvalds,
	Frank Eigler, Thomas Gleixner, linux-pm, Masami Hiramatsu,
	Tejun Heo, Ingo Molnar, linux-omap, Arjan van de Ven

On Tuesday, October 26, 2010, Mathieu Desnoyers wrote:
> * Peter Zijlstra (peterz@infradead.org) wrote:
> > On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
> > > 
> > > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> > >         atomic_inc(&dev->power.usage_count); 
> > 
> > That's terribly racy..
> 
> Looking at the original code, it looks racy even without considering the
> tracepoint:
> 
> int __pm_runtime_get(struct device *dev, bool sync)
>  {
>         int retval;
> 
> +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
>         atomic_inc(&dev->power.usage_count);
>         retval = sync ? pm_runtime_resume(dev) : pm_request_resume(dev);
> 
> There is no implied memory barrier after "atomic_inc". So either all these
> inc/dec are protected with mutexes or spinlocks, in which case one might wonder
> why atomic operations are used at all, or it's a racy mess. (I vote for the
> second option)

No, it isn't.

> kref should certainly be used there.

No, it shouldn't.

Please try to understand the code you're commenting on first.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 18:14               ` Mathieu Desnoyers
                                   ` (2 preceding siblings ...)
  2010-10-26 19:04                 ` Rafael J. Wysocki
@ 2010-10-26 19:04                 ` Rafael J. Wysocki
  2010-10-26 21:38                   ` Mathieu Desnoyers
  2010-10-26 21:38                   ` Mathieu Desnoyers
  3 siblings, 2 replies; 135+ messages in thread
From: Rafael J. Wysocki @ 2010-10-26 19:04 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Greg Kroah-Hartman, Pierre Tardy,
	Arjan van de Ven, Ingo Molnar, Thomas Renninger, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Masami Hiramatsu, Frank Eigler,
	Steven Rostedt, Kevin Hilman, linux-omap, linux-pm,
	linux-trace-users, Jean Pihet, Frederic Weisbecker, Tejun Heo

On Tuesday, October 26, 2010, Mathieu Desnoyers wrote:
> * Peter Zijlstra (peterz@infradead.org) wrote:
> > On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
> > > 
> > > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> > >         atomic_inc(&dev->power.usage_count); 
> > 
> > That's terribly racy..
> 
> Looking at the original code, it looks racy even without considering the
> tracepoint:
> 
> int __pm_runtime_get(struct device *dev, bool sync)
>  {
>         int retval;
> 
> +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
>         atomic_inc(&dev->power.usage_count);
>         retval = sync ? pm_runtime_resume(dev) : pm_request_resume(dev);
> 
> There is no implied memory barrier after "atomic_inc". So either all these
> inc/dec are protected with mutexes or spinlocks, in which case one might wonder
> why atomic operations are used at all, or it's a racy mess. (I vote for the
> second option)

No, it isn't.

> kref should certainly be used there.

No, it shouldn't.

Please try to understand the code you're commenting on first.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 18:15               ` Pierre Tardy
  2010-10-26 19:08                 ` Rafael J. Wysocki
@ 2010-10-26 19:08                 ` Rafael J. Wysocki
  1 sibling, 0 replies; 135+ messages in thread
From: Rafael J. Wysocki @ 2010-10-26 19:08 UTC (permalink / raw)
  To: Pierre Tardy
  Cc: Andrew Morton, linux-trace-users, Peter Zijlstra,
	Arjan van de Ven, Jean Pihet, Steven Rostedt,
	Frederic Weisbecker, Frank Eigler, Thomas Gleixner, linux-pm,
	Masami Hiramatsu, Tejun Heo, Ingo Molnar, linux-omap,
	Linus Torvalds, Mathieu Desnoyers

On Tuesday, October 26, 2010, Pierre Tardy wrote:
> On Tue, Oct 26, 2010 at 12:58 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
> >>
> >> +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> >>         atomic_inc(&dev->power.usage_count);
> >
> > That's terribly racy..
> >
> I know. I'm not proud of this.. As I said, this is preliminary patch.
> We dont really need to have this prev_usage. This is just for debug.
> It mayprobably endup with something like:
> 
>          atomic_inc(&dev->power.usage_count);
> +       trace_power_device_usage(dev);

Well, please tell me what you're trying to achieve.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 18:15               ` Pierre Tardy
@ 2010-10-26 19:08                 ` Rafael J. Wysocki
  2010-10-26 20:23                   ` Pierre Tardy
  2010-10-26 20:23                   ` Pierre Tardy
  2010-10-26 19:08                 ` Rafael J. Wysocki
  1 sibling, 2 replies; 135+ messages in thread
From: Rafael J. Wysocki @ 2010-10-26 19:08 UTC (permalink / raw)
  To: Pierre Tardy
  Cc: Peter Zijlstra, Arjan van de Ven, Ingo Molnar, Thomas Renninger,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Masami Hiramatsu,
	Frank Eigler, Steven Rostedt, Kevin Hilman, linux-omap, linux-pm,
	linux-trace-users, Jean Pihet, Frederic Weisbecker, Tejun Heo,
	Mathieu Desnoyers

On Tuesday, October 26, 2010, Pierre Tardy wrote:
> On Tue, Oct 26, 2010 at 12:58 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
> >>
> >> +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> >>         atomic_inc(&dev->power.usage_count);
> >
> > That's terribly racy..
> >
> I know. I'm not proud of this.. As I said, this is preliminary patch.
> We dont really need to have this prev_usage. This is just for debug.
> It mayprobably endup with something like:
> 
>          atomic_inc(&dev->power.usage_count);
> +       trace_power_device_usage(dev);

Well, please tell me what you're trying to achieve.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 19:08                 ` Rafael J. Wysocki
@ 2010-10-26 20:23                   ` Pierre Tardy
  2010-10-26 20:23                   ` Pierre Tardy
  1 sibling, 0 replies; 135+ messages in thread
From: Pierre Tardy @ 2010-10-26 20:23 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Andrew Morton, linux-trace-users, Peter Zijlstra,
	Arjan van de Ven, Jean Pihet, Steven Rostedt,
	Frederic Weisbecker, Frank Eigler, Thomas Gleixner, linux-pm,
	Masami Hiramatsu, Tejun Heo, Ingo Molnar, linux-omap,
	Linus Torvalds, Mathieu Desnoyers

[-- Attachment #1: Type: text/plain, Size: 1372 bytes --]

On Tue, Oct 26, 2010 at 2:08 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> On Tuesday, October 26, 2010, Pierre Tardy wrote:
>> On Tue, Oct 26, 2010 at 12:58 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>> > On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
>> >>
>> >> +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
>> >>         atomic_inc(&dev->power.usage_count);
>> >
>> > That's terribly racy..
>> >
>> I know. I'm not proud of this.. As I said, this is preliminary patch.
>> We dont really need to have this prev_usage. This is just for debug.
>> It mayprobably endup with something like:
>>
>>          atomic_inc(&dev->power.usage_count);
>> +       trace_power_device_usage(dev);
>
> Well, please tell me what you're trying to achieve.

Please see attached the kind of pytimechart output I'm trying to
achieve (yes, this chart is not coherent; it seems I'm still missing
some traces).

We basically want to have a tracepoint each time the usage_counter
changes, so that I can display nice timecharts, and Arjan can have the
comm of the process that eventually generated the rpm_get, in order to
pinpoint it in powertop.

What you don't see in the above two lines is that
trace_power_device_usage(dev); actually reads the usage_count, as well
as the driver and device name.

Regards,
-- 
Pierre

[-- Attachment #2: pytimechart_runtime_pm.png --]
[-- Type: image/png, Size: 16247 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 19:08                 ` Rafael J. Wysocki
  2010-10-26 20:23                   ` Pierre Tardy
@ 2010-10-26 20:23                   ` Pierre Tardy
  2010-10-26 20:38                     ` Rafael J. Wysocki
  2010-10-26 20:38                     ` Rafael J. Wysocki
  1 sibling, 2 replies; 135+ messages in thread
From: Pierre Tardy @ 2010-10-26 20:23 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Peter Zijlstra, Arjan van de Ven, Ingo Molnar, Thomas Renninger,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Masami Hiramatsu,
	Frank Eigler, Steven Rostedt, Kevin Hilman, linux-omap, linux-pm,
	linux-trace-users, Jean Pihet, Frederic Weisbecker, Tejun Heo,
	Mathieu Desnoyers

[-- Attachment #1: Type: text/plain, Size: 1372 bytes --]

On Tue, Oct 26, 2010 at 2:08 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> On Tuesday, October 26, 2010, Pierre Tardy wrote:
>> On Tue, Oct 26, 2010 at 12:58 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>> > On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
>> >>
>> >> +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
>> >>         atomic_inc(&dev->power.usage_count);
>> >
>> > That's terribly racy..
>> >
>> I know. I'm not proud of this.. As I said, this is preliminary patch.
>> We dont really need to have this prev_usage. This is just for debug.
>> It mayprobably endup with something like:
>>
>>          atomic_inc(&dev->power.usage_count);
>> +       trace_power_device_usage(dev);
>
> Well, please tell me what you're trying to achieve.

Please see attached the kind of pytimechart output I'm trying to
achieve (yes, this chart is not coherent; it seems I'm still missing
some traces).

We basically want to have a tracepoint each time the usage_counter
changes, so that I can display nice timecharts, and Arjan can have the
comm of the process that eventually generated the rpm_get, in order to
pinpoint it in powertop.

What you don't see in the above two lines is that
trace_power_device_usage(dev); actually reads the usage_count, as well
as the driver and device name.

Regards,
-- 
Pierre

[-- Attachment #2: pytimechart_runtime_pm.png --]
[-- Type: image/png, Size: 16247 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 20:23                   ` Pierre Tardy
@ 2010-10-26 20:38                     ` Rafael J. Wysocki
  2010-10-26 20:38                     ` Rafael J. Wysocki
  1 sibling, 0 replies; 135+ messages in thread
From: Rafael J. Wysocki @ 2010-10-26 20:38 UTC (permalink / raw)
  To: Pierre Tardy
  Cc: Andrew Morton, linux-trace-users, Peter Zijlstra,
	Arjan van de Ven, Jean Pihet, Steven Rostedt,
	Frederic Weisbecker, Frank Eigler, Thomas Gleixner, linux-pm,
	Masami Hiramatsu, Tejun Heo, Ingo Molnar, linux-omap,
	Linus Torvalds, Mathieu Desnoyers

On Tuesday, October 26, 2010, Pierre Tardy wrote:
> On Tue, Oct 26, 2010 at 2:08 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> > On Tuesday, October 26, 2010, Pierre Tardy wrote:
> >> On Tue, Oct 26, 2010 at 12:58 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> >> > On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
> >> >>
> >> >> +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> >> >>         atomic_inc(&dev->power.usage_count);
> >> >
> >> > That's terribly racy..
> >> >
> >> I know. I'm not proud of this.. As I said, this is preliminary patch.
> >> We dont really need to have this prev_usage. This is just for debug.
> >> It mayprobably endup with something like:
> >>
> >>          atomic_inc(&dev->power.usage_count);
> >> +       trace_power_device_usage(dev);
> >
> > Well, please tell me what you're trying to achieve.
> 
> Please see attached the kind of pytimechart output I'm trying to
> achieve (yes, this chart is not coherent, seems I'm still missing some
> traces)
> 
> We basically want to have a trace point eachtime the usage_counter
> changes, so that I can display nice timecharts, and Arjan can have the
> comm of the process that eventually generated the rpm_get, in order to
> pinpoint it in powertop.
> 
> What you dont see in the above two lines is that
> trace_power_device_usage(dev); actually reads the usage_count, as well
> as the driver and device name.

I'm afraid that for this to really work you'd need to put usage_count under a
spinlock along with your trace point, which I'm not really sure I like.

Besides, I'm not really sure the manipulations of usage_count are worth
tracing.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 20:23                   ` Pierre Tardy
  2010-10-26 20:38                     ` Rafael J. Wysocki
@ 2010-10-26 20:38                     ` Rafael J. Wysocki
  2010-10-26 20:52                       ` Arjan van de Ven
  2010-10-26 20:52                       ` Arjan van de Ven
  1 sibling, 2 replies; 135+ messages in thread
From: Rafael J. Wysocki @ 2010-10-26 20:38 UTC (permalink / raw)
  To: Pierre Tardy
  Cc: Peter Zijlstra, Arjan van de Ven, Ingo Molnar, Thomas Renninger,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Masami Hiramatsu,
	Frank Eigler, Steven Rostedt, Kevin Hilman, linux-omap, linux-pm,
	linux-trace-users, Jean Pihet, Frederic Weisbecker, Tejun Heo,
	Mathieu Desnoyers

On Tuesday, October 26, 2010, Pierre Tardy wrote:
> On Tue, Oct 26, 2010 at 2:08 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> > On Tuesday, October 26, 2010, Pierre Tardy wrote:
> >> On Tue, Oct 26, 2010 at 12:58 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> >> > On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
> >> >>
> >> >> +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> >> >>         atomic_inc(&dev->power.usage_count);
> >> >
> >> > That's terribly racy..
> >> >
> >> I know. I'm not proud of this.. As I said, this is preliminary patch.
> >> We dont really need to have this prev_usage. This is just for debug.
> >> It mayprobably endup with something like:
> >>
> >>          atomic_inc(&dev->power.usage_count);
> >> +       trace_power_device_usage(dev);
> >
> > Well, please tell me what you're trying to achieve.
> 
> Please see attached the kind of pytimechart output I'm trying to
> achieve (yes, this chart is not coherent, seems I'm still missing some
> traces)
> 
> We basically want to have a trace point eachtime the usage_counter
> changes, so that I can display nice timecharts, and Arjan can have the
> comm of the process that eventually generated the rpm_get, in order to
> pinpoint it in powertop.
> 
> What you dont see in the above two lines is that
> trace_power_device_usage(dev); actually reads the usage_count, as well
> as the driver and device name.

I'm afraid that for this to really work you'd need to put usage_count under a
spinlock along with your trace point, which I'm not really sure I like.

Besides, I'm not really sure the manipulations of usage_count are worth
tracing.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 20:38                     ` Rafael J. Wysocki
@ 2010-10-26 20:52                       ` Arjan van de Ven
  2010-10-26 20:52                       ` Arjan van de Ven
  1 sibling, 0 replies; 135+ messages in thread
From: Arjan van de Ven @ 2010-10-26 20:52 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Andrew Morton, Pierre Tardy, Peter Zijlstra, Frederic Weisbecker,
	linux-trace-users, Jean Pihet, Steven Rostedt, Frank Eigler,
	Thomas Gleixner, linux-pm, Masami Hiramatsu, Tejun Heo,
	Ingo Molnar, linux-omap, Linus Torvalds, Mathieu Desnoyers

On 10/26/2010 1:38 PM, Rafael J. Wysocki wrote:
> On Tuesday, October 26, 2010, Pierre Tardy wrote:
>> On Tue, Oct 26, 2010 at 2:08 PM, Rafael J. Wysocki<rjw@sisk.pl>  wrote:
>>> On Tuesday, October 26, 2010, Pierre Tardy wrote:
>>>> On Tue, Oct 26, 2010 at 12:58 PM, Peter Zijlstra<peterz@infradead.org>  wrote:
>>>>> On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
>>>>>> +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
>>>>>>          atomic_inc(&dev->power.usage_count);
>>>>> That's terribly racy..
>>>>>
>>>> I know. I'm not proud of this.. As I said, this is preliminary patch.
>>>> We dont really need to have this prev_usage. This is just for debug.
>>>> It mayprobably endup with something like:
>>>>
>>>>           atomic_inc(&dev->power.usage_count);
>>>> +       trace_power_device_usage(dev);
>>> Well, please tell me what you're trying to achieve.
>> Please see attached the kind of pytimechart output I'm trying to
>> achieve (yes, this chart is not coherent, seems I'm still missing some
>> traces)
>>
>> We basically want to have a trace point eachtime the usage_counter
>> changes, so that I can display nice timecharts, and Arjan can have the
>> comm of the process that eventually generated the rpm_get, in order to
>> pinpoint it in powertop.
>>
>> What you dont see in the above two lines is that
>> trace_power_device_usage(dev); actually reads the usage_count, as well
>> as the driver and device name.
> I'm afraid that for this to really work you'd need to put usage_count under a
> spinlock along with your trace point, which I'm not really sure I like.
>
> Besides, I'm not really sure the manipulations of usage_count are worth
> tracing.

What's most interesting are the 0->1 and 1->0 transitions.
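
A rough sketch of how just those two edges can be picked up without the
read-then-increment race: let the atomic operation itself report the previous
value. This is userspace C11 rather than kernel code, and struct usage,
usage_get() and usage_put() are made-up names standing in for the
usage_count handling. It says nothing about whether the device is actually
resumed or suspended at that moment.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for dev->power.usage_count. */
struct usage {
	atomic_int count;
};

/* Returns true when this call performed the 0->1 transition. */
bool usage_get(struct usage *u)
{
	/* atomic_fetch_add() returns the value held before the increment. */
	return atomic_fetch_add(&u->count, 1) == 0;
}

/* Returns true when this call performed the 1->0 transition. */
bool usage_put(struct usage *u)
{
	/* atomic_fetch_sub() returns the value held before the decrement. */
	int prev = atomic_fetch_sub(&u->count, 1);

	assert(prev > 0);	/* catch an unbalanced put */
	return prev == 1;
}
```

A trace hook would then fire only when usage_get() or usage_put() returns
true, and the value it logs is the one the atomic operation saw, not a
separate read.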

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 20:38                     ` Rafael J. Wysocki
  2010-10-26 20:52                       ` Arjan van de Ven
@ 2010-10-26 20:52                       ` Arjan van de Ven
  2010-10-26 21:17                         ` Rafael J. Wysocki
  2010-10-26 21:17                         ` Rafael J. Wysocki
  1 sibling, 2 replies; 135+ messages in thread
From: Arjan van de Ven @ 2010-10-26 20:52 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Pierre Tardy, Peter Zijlstra, Ingo Molnar, Thomas Renninger,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Masami Hiramatsu,
	Frank Eigler, Steven Rostedt, Kevin Hilman, linux-omap, linux-pm,
	linux-trace-users, Jean Pihet, Frederic Weisbecker, Tejun Heo,
	Mathieu Desnoyers

On 10/26/2010 1:38 PM, Rafael J. Wysocki wrote:
> On Tuesday, October 26, 2010, Pierre Tardy wrote:
>> On Tue, Oct 26, 2010 at 2:08 PM, Rafael J. Wysocki<rjw@sisk.pl>  wrote:
>>> On Tuesday, October 26, 2010, Pierre Tardy wrote:
>>>> On Tue, Oct 26, 2010 at 12:58 PM, Peter Zijlstra<peterz@infradead.org>  wrote:
>>>>> On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
>>>>>> +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
>>>>>>          atomic_inc(&dev->power.usage_count);
>>>>> That's terribly racy..
>>>>>
>>>> I know. I'm not proud of this.. As I said, this is preliminary patch.
>>>> We dont really need to have this prev_usage. This is just for debug.
>>>> It mayprobably endup with something like:
>>>>
>>>>           atomic_inc(&dev->power.usage_count);
>>>> +       trace_power_device_usage(dev);
>>> Well, please tell me what you're trying to achieve.
>> Please see attached the kind of pytimechart output I'm trying to
>> achieve (yes, this chart is not coherent, seems I'm still missing some
>> traces)
>>
>> We basically want to have a trace point eachtime the usage_counter
>> changes, so that I can display nice timecharts, and Arjan can have the
>> comm of the process that eventually generated the rpm_get, in order to
>> pinpoint it in powertop.
>>
>> What you dont see in the above two lines is that
>> trace_power_device_usage(dev); actually reads the usage_count, as well
>> as the driver and device name.
> I'm afraid that for this to really work you'd need to put usage_count under a
> spinlock along with your trace point, which I'm not really sure I like.
>
> Besides, I'm not really sure the manipulations of usage_count are worth
> tracing.

What's most interesting are the 0->1 and 1->0 transitions.


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 20:52                       ` Arjan van de Ven
  2010-10-26 21:17                         ` Rafael J. Wysocki
@ 2010-10-26 21:17                         ` Rafael J. Wysocki
  1 sibling, 0 replies; 135+ messages in thread
From: Rafael J. Wysocki @ 2010-10-26 21:17 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Andrew Morton, Pierre Tardy, Peter Zijlstra, Frederic Weisbecker,
	linux-trace-users, Jean Pihet, Steven Rostedt, Frank Eigler,
	Thomas Gleixner, linux-pm, Masami Hiramatsu, Tejun Heo,
	Ingo Molnar, linux-omap, Linus Torvalds, Mathieu Desnoyers

On Tuesday, October 26, 2010, Arjan van de Ven wrote:
> On 10/26/2010 1:38 PM, Rafael J. Wysocki wrote:
> > On Tuesday, October 26, 2010, Pierre Tardy wrote:
> >> On Tue, Oct 26, 2010 at 2:08 PM, Rafael J. Wysocki<rjw@sisk.pl>  wrote:
> >>> On Tuesday, October 26, 2010, Pierre Tardy wrote:
> >>>> On Tue, Oct 26, 2010 at 12:58 PM, Peter Zijlstra<peterz@infradead.org>  wrote:
> >>>>> On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
> >>>>>> +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> >>>>>>          atomic_inc(&dev->power.usage_count);
> >>>>> That's terribly racy..
> >>>>>
> >>>> I know. I'm not proud of this.. As I said, this is preliminary patch.
> >>>> We dont really need to have this prev_usage. This is just for debug.
> >>>> It mayprobably endup with something like:
> >>>>
> >>>>           atomic_inc(&dev->power.usage_count);
> >>>> +       trace_power_device_usage(dev);
> >>> Well, please tell me what you're trying to achieve.
> >> Please see attached the kind of pytimechart output I'm trying to
> >> achieve (yes, this chart is not coherent, seems I'm still missing some
> >> traces)
> >>
> >> We basically want to have a trace point eachtime the usage_counter
> >> changes, so that I can display nice timecharts, and Arjan can have the
> >> comm of the process that eventually generated the rpm_get, in order to
> >> pinpoint it in powertop.
> >>
> >> What you dont see in the above two lines is that
> >> trace_power_device_usage(dev); actually reads the usage_count, as well
> >> as the driver and device name.
> > I'm afraid that for this to really work you'd need to put usage_count under a
> > spinlock along with your trace point, which I'm not really sure I like.
> >
> > Besides, I'm not really sure the manipulations of usage_count are worth
> > tracing.
> 
> what's most interesting is the 0->1  and 1->0 transitions.

But they are only meaningful in specific situations.  For example, if someone
does pm_runtime_get_noresume() when the device is active, there may be
a device suspend already under way at the same time.  So IMO what really
is interesting is when rpm_resume() is called with usage_count > 0 and then
perhaps when rpm_idle() or rpm_suspend() is called after usage_count drops
back to 0.

There are some other interesting cases, but they all need to be checked under
->power.lock and you need to do that cleverly, so that the _functionality_ is
not harmed.

Overall, I think that adding tracepoints to the runtime PM core code is really
premature at this point, given that we've just reworked it quite a bit recently.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 20:52                       ` Arjan van de Ven
@ 2010-10-26 21:17                         ` Rafael J. Wysocki
  2010-10-26 21:17                         ` Rafael J. Wysocki
  1 sibling, 0 replies; 135+ messages in thread
From: Rafael J. Wysocki @ 2010-10-26 21:17 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Pierre Tardy, Peter Zijlstra, Ingo Molnar, Thomas Renninger,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Masami Hiramatsu,
	Frank Eigler, Steven Rostedt, Kevin Hilman, linux-omap, linux-pm,
	linux-trace-users, Jean Pihet, Frederic Weisbecker, Tejun Heo,
	Mathieu Desnoyers

On Tuesday, October 26, 2010, Arjan van de Ven wrote:
> On 10/26/2010 1:38 PM, Rafael J. Wysocki wrote:
> > On Tuesday, October 26, 2010, Pierre Tardy wrote:
> >> On Tue, Oct 26, 2010 at 2:08 PM, Rafael J. Wysocki<rjw@sisk.pl>  wrote:
> >>> On Tuesday, October 26, 2010, Pierre Tardy wrote:
> >>>> On Tue, Oct 26, 2010 at 12:58 PM, Peter Zijlstra<peterz@infradead.org>  wrote:
> >>>>> On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
> >>>>>> +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> >>>>>>          atomic_inc(&dev->power.usage_count);
> >>>>> That's terribly racy..
> >>>>>
> >>>> I know. I'm not proud of this.. As I said, this is preliminary patch.
> >>>> We dont really need to have this prev_usage. This is just for debug.
> >>>> It mayprobably endup with something like:
> >>>>
> >>>>           atomic_inc(&dev->power.usage_count);
> >>>> +       trace_power_device_usage(dev);
> >>> Well, please tell me what you're trying to achieve.
> >> Please see attached the kind of pytimechart output I'm trying to
> >> achieve (yes, this chart is not coherent, seems I'm still missing some
> >> traces)
> >>
> >> We basically want to have a trace point eachtime the usage_counter
> >> changes, so that I can display nice timecharts, and Arjan can have the
> >> comm of the process that eventually generated the rpm_get, in order to
> >> pinpoint it in powertop.
> >>
> >> What you dont see in the above two lines is that
> >> trace_power_device_usage(dev); actually reads the usage_count, as well
> >> as the driver and device name.
> > I'm afraid that for this to really work you'd need to put usage_count under a
> > spinlock along with your trace point, which I'm not really sure I like.
> >
> > Besides, I'm not really sure the manipulations of usage_count are worth
> > tracing.
> 
> what's most interesting is the 0->1  and 1->0 transitions.

But they are only meaningful in specific situations.  For example, if someone
does pm_runtime_get_noresume() when the device is active, there may be
a device suspend already under way at the same time.  So IMO what really
is interesting is when rpm_resume() is called with usage_count > 0 and then
perhaps when rpm_idle() or rpm_suspend() is called after usage_count drops
back to 0.

There are some other interesting cases, but they all need to be checked under
->power.lock and you need to do that cleverly, so that the _functionality_ is
not harmed.

Overall, I think that adding tracepoints to the runtime PM core code is really
premature at this point, given that we've just reworked it quite a bit recently.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 18:50                 ` [linux-pm] " Alan Stern
@ 2010-10-26 21:33                   ` Mathieu Desnoyers
  2010-10-26 21:33                   ` [linux-pm] " Mathieu Desnoyers
  1 sibling, 0 replies; 135+ messages in thread
From: Mathieu Desnoyers @ 2010-10-26 21:33 UTC (permalink / raw)
  To: Alan Stern
  Cc: Paul E. McKenney, Pierre Tardy, Peter Zijlstra, linux-pm,
	Ingo Molnar, Jean Pihet, Steven Rostedt, linux-trace-users,
	Frank Eigler, Linus Torvalds, Frederic Weisbecker,
	Masami Hiramatsu, Tejun Heo, Andrew Morton, linux-omap,
	Arjan van de Ven, Thomas Gleixner

* Alan Stern (stern@rowland.harvard.edu) wrote:
> On Tue, 26 Oct 2010, Mathieu Desnoyers wrote:
> 
> > * Peter Zijlstra (peterz@infradead.org) wrote:
> > > On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
> > > > 
> > > > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> > > >         atomic_inc(&dev->power.usage_count); 
> > > 
> > > That's terribly racy..
> > 
> > Looking at the original code, it looks racy even without considering the
> > tracepoint:
> > 
> > int __pm_runtime_get(struct device *dev, bool sync)
> >  {
> >         int retval;
> > 
> > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> >         atomic_inc(&dev->power.usage_count);
> >         retval = sync ? pm_runtime_resume(dev) : pm_request_resume(dev);
> > 
> > There is no implied memory barrier after "atomic_inc". So either all these
> > inc/dec are protected with mutexes or spinlocks, in which case one might wonder
> > why atomic operations are used at all, or it's a racy mess. (I vote for the
> > second option)
> 
> I don't understand.  What's the problem?  The inc/dec are atomic 
> because they are not protected by spinlocks, but everything else is 
> (aside from the tracepoint, which is new).
> 
> > kref should certainly be used there.
> 
> What for?

kref has the following "get":

        atomic_inc(&kref->refcount);
        smp_mb__after_atomic_inc();

What seems to be missing in __pm_runtime_get() and pm_runtime_get_noresume() is
the memory barrier after the atomic increment. The atomic increment is free to
be reordered into the following spinlock critical section (within
pm_runtime_resume() or pm_request_resume()) because taking a spinlock only acts
as a memory barrier with acquire semantics, not as a full memory barrier.

So AFAIU, the failure scenario would be as follows (sorry for the 80+ columns):

initial conditions: usage_count = 1

CPU A                                                       CPU B
1) __pm_runtime_get() (sync = true)
2)   atomic_inc(&usage_count) (not committed to memory yet)
3)   pm_runtime_resume()
4)     spin_lock_irqsave(&dev->power.lock, flags);
5)     retval = __pm_request_resume(dev);
6)     (execute the body of __pm_request_resume and return)
7)                                                          __pm_runtime_put() (sync = true) 
8)                                                          if (atomic_dec_and_test(&dev->power.usage_count))
                                                              (still see usage_count == 1 before decrement,
                                                               thus decrement to 0)
9)                                                             pm_runtime_idle()
10)  spin_unlock_irqrestore(&dev->power.lock, flags)
11)                                                            spin_lock_irq(&dev->power.lock);
12)                                                            retval = __pm_runtime_idle(dev);
13)                                                            spin_unlock_irq(&dev->power.lock);

So we end up in a situation where CPU A expects the device to be resumed, but
the last action performed has been to bring it to idle.

A smp_mb__after_atomic_inc() between lines 2 and 3 would fix this.
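
A minimal userspace sketch of the ordering being proposed, assuming C11 atomics
in place of the kernel's atomic_t. The names usage_count, get_relaxed() and
get_with_barrier() are hypothetical, and atomic_thread_fence(memory_order_seq_cst)
is only a rough stand-in for smp_mb__after_atomic_inc(), not the kernel
primitive itself:

```c
#include <stdatomic.h>

/* Hypothetical stand-in for dev->power.usage_count (starts at 1, as in the scenario). */
static atomic_int usage_count = 1;

/* Increment with no ordering guarantee, similar in spirit to a bare atomic_inc(). */
static int get_relaxed(void)
{
	return atomic_fetch_add_explicit(&usage_count, 1, memory_order_relaxed) + 1;
}

/*
 * Increment followed by a full fence, similar in spirit to
 * atomic_inc() + smp_mb__after_atomic_inc(): later loads and stores
 * cannot be reordered before the increment.
 */
static int get_with_barrier(void)
{
	int now = atomic_fetch_add_explicit(&usage_count, 1, memory_order_relaxed) + 1;

	atomic_thread_fence(memory_order_seq_cst);
	return now;
}
```

Both helpers return the value of the counter after the increment; only the
second one orders the increment against the code that follows it.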

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

* Re: [linux-pm] [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 18:50                 ` [linux-pm] " Alan Stern
  2010-10-26 21:33                   ` Mathieu Desnoyers
@ 2010-10-26 21:33                   ` Mathieu Desnoyers
  2010-10-26 22:20                     ` Rafael J. Wysocki
  2010-10-26 22:20                     ` [linux-pm] " Rafael J. Wysocki
  1 sibling, 2 replies; 135+ messages in thread
From: Mathieu Desnoyers @ 2010-10-26 21:33 UTC (permalink / raw)
  To: Alan Stern
  Cc: Peter Zijlstra, Greg Kroah-Hartman, Andrew Morton, Pierre Tardy,
	Arjan van de Ven, Frederic Weisbecker, Jean Pihet,
	Steven Rostedt, linux-trace-users, Frank Eigler, Thomas Gleixner,
	linux-pm, Masami Hiramatsu, Tejun Heo, Ingo Molnar, linux-omap,
	Linus Torvalds, Paul E. McKenney

* Alan Stern (stern@rowland.harvard.edu) wrote:
> On Tue, 26 Oct 2010, Mathieu Desnoyers wrote:
> 
> > * Peter Zijlstra (peterz@infradead.org) wrote:
> > > On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
> > > > 
> > > > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> > > >         atomic_inc(&dev->power.usage_count); 
> > > 
> > > That's terribly racy..
> > 
> > Looking at the original code, it looks racy even without considering the
> > tracepoint:
> > 
> > int __pm_runtime_get(struct device *dev, bool sync)
> >  {
> >         int retval;
> > 
> > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> >         atomic_inc(&dev->power.usage_count);
> >         retval = sync ? pm_runtime_resume(dev) : pm_request_resume(dev);
> > 
> > There is no implied memory barrier after "atomic_inc". So either all these
> > inc/dec are protected with mutexes or spinlocks, in which case one might wonder
> > why atomic operations are used at all, or it's a racy mess. (I vote for the
> > second option)
> 
> I don't understand.  What's the problem?  The inc/dec are atomic 
> because they are not protected by spinlocks, but everything else is 
> (aside from the tracepoint, which is new).
> 
> > kref should certainly be used there.
> 
> What for?

kref has the following "get":

        atomic_inc(&kref->refcount);
        smp_mb__after_atomic_inc();

What seems to be missing in __pm_runtime_get() and pm_runtime_get_noresume() is
the memory barrier after the atomic increment. The atomic increment is free to
be reordered into the following spinlock critical section (within
pm_request_resume or pm_runtime_resume execution) because taking a spinlock only
acts as a memory barrier with acquire semantics, not as a full memory barrier.

So AFAIU, the failure scenario would be as follows (sorry for the 80+ columns):

initial conditions: usage_count = 1

CPU A                                                       CPU B
1) __pm_runtime_get() (sync = true)
2)   atomic_inc(&usage_count) (not committed to memory yet)
3)   pm_runtime_resume()
4)     spin_lock_irqsave(&dev->power.lock, flags);
5)     retval = __pm_request_resume(dev);
6)     (execute the body of __pm_request_resume and return)
7)                                                          __pm_runtime_put() (sync = true) 
8)                                                          if (atomic_dec_and_test(&dev->power.usage_count))
                                                              (still see usage_count == 1 before decrement,
                                                               thus decrement to 0)
9)                                                             pm_runtime_idle()
10)  spin_unlock_irqrestore(&dev->power.lock, flags)
11)                                                            spin_lock_irq(&dev->power.lock);
12)                                                            retval = __pm_runtime_idle(dev);
13)                                                            spin_unlock_irq(&dev->power.lock);

So we end up in a situation where CPU A expects the device to be resumed, but
the last action performed has been to bring it to idle.

A smp_mb__after_atomic_inc() between lines 2 and 3 would fix this.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 19:04                 ` Rafael J. Wysocki
@ 2010-10-26 21:38                   ` Mathieu Desnoyers
  2010-10-26 21:38                   ` Mathieu Desnoyers
  1 sibling, 0 replies; 135+ messages in thread
From: Mathieu Desnoyers @ 2010-10-26 21:38 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Andrew Morton, Pierre Tardy, Peter Zijlstra, linux-trace-users,
	Jean Pihet, Steven Rostedt, Frederic Weisbecker, Linus Torvalds,
	Frank Eigler, Thomas Gleixner, linux-pm, Masami Hiramatsu,
	Tejun Heo, Ingo Molnar, linux-omap, Arjan van de Ven

* Rafael J. Wysocki (rjw@sisk.pl) wrote:
> On Tuesday, October 26, 2010, Mathieu Desnoyers wrote:
> > * Peter Zijlstra (peterz@infradead.org) wrote:
> > > On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
> > > > 
> > > > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> > > >         atomic_inc(&dev->power.usage_count); 
> > > 
> > > That's terribly racy..
> > 
> > Looking at the original code, it looks racy even without considering the
> > tracepoint:
> > 
> > int __pm_runtime_get(struct device *dev, bool sync)
> >  {
> >         int retval;
> > 
> > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> >         atomic_inc(&dev->power.usage_count);
> >         retval = sync ? pm_runtime_resume(dev) : pm_request_resume(dev);
> > 
> > There is no implied memory barrier after "atomic_inc". So either all these
> > inc/dec are protected with mutexes or spinlocks, in which case one might wonder
> > why atomic operations are used at all, or it's a racy mess. (I vote for the
> > second option)
> 
> No, it isn't.
> 
> > kref should certainly be used there.
> 
> No, it shouldn't.
> 
> Please try to understand the code you're commenting on first.

Please see my reply to Alan Stern:

http://www.spinics.net/lists/linux-omap/msg39382.html

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 19:04                 ` Rafael J. Wysocki
  2010-10-26 21:38                   ` Mathieu Desnoyers
@ 2010-10-26 21:38                   ` Mathieu Desnoyers
  2010-10-26 22:22                     ` Rafael J. Wysocki
  2010-10-26 22:22                     ` Rafael J. Wysocki
  1 sibling, 2 replies; 135+ messages in thread
From: Mathieu Desnoyers @ 2010-10-26 21:38 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Peter Zijlstra, Greg Kroah-Hartman, Pierre Tardy,
	Arjan van de Ven, Ingo Molnar, Thomas Renninger, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Masami Hiramatsu, Frank Eigler,
	Steven Rostedt, Kevin Hilman, linux-omap, linux-pm,
	linux-trace-users, Jean Pihet, Frederic Weisbecker, Tejun Heo

* Rafael J. Wysocki (rjw@sisk.pl) wrote:
> On Tuesday, October 26, 2010, Mathieu Desnoyers wrote:
> > * Peter Zijlstra (peterz@infradead.org) wrote:
> > > On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
> > > > 
> > > > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> > > >         atomic_inc(&dev->power.usage_count); 
> > > 
> > > That's terribly racy..
> > 
> > Looking at the original code, it looks racy even without considering the
> > tracepoint:
> > 
> > int __pm_runtime_get(struct device *dev, bool sync)
> >  {
> >         int retval;
> > 
> > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> >         atomic_inc(&dev->power.usage_count);
> >         retval = sync ? pm_runtime_resume(dev) : pm_request_resume(dev);
> > 
> > There is no implied memory barrier after "atomic_inc". So either all these
> > inc/dec are protected with mutexes or spinlocks, in which case one might wonder
> > why atomic operations are used at all, or it's a racy mess. (I vote for the
> > second option)
> 
> No, it isn't.
> 
> > kref should certainly be used there.
> 
> No, it shouldn't.
> 
> Please try to understand the code you're commenting on first.

Please see my reply to Alan Stern:

http://www.spinics.net/lists/linux-omap/msg39382.html

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 21:33                   ` [linux-pm] " Mathieu Desnoyers
@ 2010-10-26 22:20                     ` Rafael J. Wysocki
  2010-10-26 22:20                     ` [linux-pm] " Rafael J. Wysocki
  1 sibling, 0 replies; 135+ messages in thread
From: Rafael J. Wysocki @ 2010-10-26 22:20 UTC (permalink / raw)
  To: linux-pm
  Cc: linux-omap, Arjan van de Ven, Thomas Gleixner, Pierre Tardy,
	Peter Zijlstra, Frederic Weisbecker, Jean Pihet, Steven Rostedt,
	linux-trace-users, Frank Eigler, Masami Hiramatsu,
	Mathieu Desnoyers, Ingo Molnar, Paul E. McKenney, Linus Torvalds,
	Andrew Morton, Tejun Heo

On Tuesday, October 26, 2010, Mathieu Desnoyers wrote:
> * Alan Stern (stern@rowland.harvard.edu) wrote:
> > On Tue, 26 Oct 2010, Mathieu Desnoyers wrote:
> > 
> > > * Peter Zijlstra (peterz@infradead.org) wrote:
> > > > On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
> > > > > 
> > > > > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> > > > >         atomic_inc(&dev->power.usage_count); 
> > > > 
> > > > That's terribly racy..
> > > 
> > > Looking at the original code, it looks racy even without considering the
> > > tracepoint:
> > > 
> > > int __pm_runtime_get(struct device *dev, bool sync)
> > >  {
> > >         int retval;
> > > 
> > > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> > >         atomic_inc(&dev->power.usage_count);
> > >         retval = sync ? pm_runtime_resume(dev) : pm_request_resume(dev);
> > > 
> > > There is no implied memory barrier after "atomic_inc". So either all these
> > > inc/dec are protected with mutexes or spinlocks, in which case one might wonder
> > > why atomic operations are used at all, or it's a racy mess. (I vote for the
> > > second option)
> > 
> > I don't understand.  What's the problem?  The inc/dec are atomic 
> > because they are not protected by spinlocks, but everything else is 
> > (aside from the tracepoint, which is new).
> > 
> > > kref should certainly be used there.
> > 
> > What for?
> 
> kref has the following "get":
> 
>         atomic_inc(&kref->refcount);
>         smp_mb__after_atomic_inc();
> 
> What seems to be missing in __pm_runtime_get() and pm_runtime_get_noresume() is
> the memory barrier after the atomic increment. The atomic increment is free to
> be reordered into the following spinlock (within pm_request_resume or pm_request
> resume execution) because taking a spinlock only acts as a memory barrier with
> acquire semantic, not a full memory barrier.
>
> So AFAIU, the failure scenario would be as follows (sorry for the 80+ columns):
> 
> initial conditions: usage_count = 1
> 
> CPU A                                                       CPU B
> 1) __pm_runtime_get() (sync = true)
> 2)   atomic_inc(&usage_count) (not committed to memory yet)
> 3)   pm_runtime_resume()
> 4)     spin_lock_irqsave(&dev->power.lock, flags);
> 5)     retval = __pm_request_resume(dev);

If sync = true this is
           retval = __pm_runtime_resume(dev);
which drops and reacquires the spinlock.  In the meantime it sets
->power.runtime_status so that __pm_runtime_idle() will fail if run at this
point.

> 6)     (execute the body of __pm_request_resume and return)
> 7)                                                          __pm_runtime_put() (sync = true) 
> 8)                                                          if (atomic_dec_and_test(&dev->power.usage_count))
>                                                               (still see usage_count == 1 before decrement,
>                                                                thus decrement to 0)
> 9)                                                             pm_runtime_idle()
> 10)  spin_unlock_irqrestore(&dev->power.lock, flags)
> 11)                                                            spin_lock_irq(&dev->power.lock);
> 12)                                                            retval = __pm_runtime_idle(dev);

Moreover, __pm_runtime_idle() checks ->power.usage_count under the spinlock,
so it will see it's been incremented in the meantime and it will back off.

> 13)                                                            spin_unlock_irq(&dev->power.lock);
> 
> So we end up in a situation where CPU A expects the device to be resumed, but
> the last action performed has been to bring it to idle.
>
> A smp_mb__after_atomic_inc() between lines 2 and 3 would fix this.

I don't think this particular race is possible.  However, there is another one
that seems to be possible (in a different function) that an explicit barrier
will prevent from happening.

It's related to pm_runtime_get_noresume(), but I think it's better to put the
barrier where it's necessary rather than into pm_runtime_get_noresume() itself.

Thanks,
Rafael

* Re: [linux-pm] [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 21:33                   ` [linux-pm] " Mathieu Desnoyers
  2010-10-26 22:20                     ` Rafael J. Wysocki
@ 2010-10-26 22:20                     ` Rafael J. Wysocki
  2010-10-26 22:39                       ` Rafael J. Wysocki
                                         ` (3 more replies)
  1 sibling, 4 replies; 135+ messages in thread
From: Rafael J. Wysocki @ 2010-10-26 22:20 UTC (permalink / raw)
  To: linux-pm
  Cc: Mathieu Desnoyers, Alan Stern, Paul E. McKenney, Pierre Tardy,
	Peter Zijlstra, Ingo Molnar, Jean Pihet, Steven Rostedt,
	linux-trace-users, Frank Eigler, Linus Torvalds,
	Frederic Weisbecker, Masami Hiramatsu, Tejun Heo, Andrew Morton,
	linux-omap, Arjan van de Ven, Thomas Gleixner

On Tuesday, October 26, 2010, Mathieu Desnoyers wrote:
> * Alan Stern (stern@rowland.harvard.edu) wrote:
> > On Tue, 26 Oct 2010, Mathieu Desnoyers wrote:
> > 
> > > * Peter Zijlstra (peterz@infradead.org) wrote:
> > > > On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
> > > > > 
> > > > > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> > > > >         atomic_inc(&dev->power.usage_count); 
> > > > 
> > > > That's terribly racy..
> > > 
> > > Looking at the original code, it looks racy even without considering the
> > > tracepoint:
> > > 
> > > int __pm_runtime_get(struct device *dev, bool sync)
> > >  {
> > >         int retval;
> > > 
> > > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> > >         atomic_inc(&dev->power.usage_count);
> > >         retval = sync ? pm_runtime_resume(dev) : pm_request_resume(dev);
> > > 
> > > There is no implied memory barrier after "atomic_inc". So either all these
> > > inc/dec are protected with mutexes or spinlocks, in which case one might wonder
> > > why atomic operations are used at all, or it's a racy mess. (I vote for the
> > > second option)
> > 
> > I don't understand.  What's the problem?  The inc/dec are atomic 
> > because they are not protected by spinlocks, but everything else is 
> > (aside from the tracepoint, which is new).
> > 
> > > kref should certainly be used there.
> > 
> > What for?
> 
> kref has the following "get":
> 
>         atomic_inc(&kref->refcount);
>         smp_mb__after_atomic_inc();
> 
> What seems to be missing in __pm_runtime_get() and pm_runtime_get_noresume() is
> the memory barrier after the atomic increment. The atomic increment is free to
> be reordered into the following spinlock (within pm_request_resume or pm_request
> resume execution) because taking a spinlock only acts as a memory barrier with
> acquire semantic, not a full memory barrier.
>
> So AFAIU, the failure scenario would be as follows (sorry for the 80+ columns):
> 
> initial conditions: usage_count = 1
> 
> CPU A                                                       CPU B
> 1) __pm_runtime_get() (sync = true)
> 2)   atomic_inc(&usage_count) (not committed to memory yet)
> 3)   pm_runtime_resume()
> 4)     spin_lock_irqsave(&dev->power.lock, flags);
> 5)     retval = __pm_request_resume(dev);

If sync = true this is
           retval = __pm_runtime_resume(dev);
which drops and reacquires the spinlock.  In the meantime it sets
->power.runtime_status so that __pm_runtime_idle() will fail if run at this
point.

> 6)     (execute the body of __pm_request_resume and return)
> 7)                                                          __pm_runtime_put() (sync = true) 
> 8)                                                          if (atomic_dec_and_test(&dev->power.usage_count))
>                                                               (still see usage_count == 1 before decrement,
>                                                                thus decrement to 0)
> 9)                                                             pm_runtime_idle()
> 10)  spin_unlock_irqrestore(&dev->power.lock, flags)
> 11)                                                            spin_lock_irq(&dev->power.lock);
> 12)                                                            retval = __pm_runtime_idle(dev);

Moreover, __pm_runtime_idle() checks ->power.usage_count under the spinlock,
so it will see it's been incremented in the meantime and it will back off.

> 13)                                                            spin_unlock_irq(&dev->power.lock);
> 
> So we end up in a situation where CPU A expects the device to be resumed, but
> the last action performed has been to bring it to idle.
>
> A smp_mb__after_atomic_inc() between lines 2 and 3 would fix this.

I don't think this particular race is possible.  However, there is another one
that seems to be possible (in a different function) that an explicit barrier
will prevent from happening.

It's related to pm_runtime_get_noresume(), but I think it's better to put the
barrier where it's necessary rather than into pm_runtime_get_noresume() itself.

Thanks,
Rafael

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 21:38                   ` Mathieu Desnoyers
  2010-10-26 22:22                     ` Rafael J. Wysocki
@ 2010-10-26 22:22                     ` Rafael J. Wysocki
  1 sibling, 0 replies; 135+ messages in thread
From: Rafael J. Wysocki @ 2010-10-26 22:22 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andrew Morton, Pierre Tardy, Peter Zijlstra, linux-trace-users,
	Jean Pihet, Steven Rostedt, Frederic Weisbecker, Linus Torvalds,
	Frank Eigler, Thomas Gleixner, linux-pm, Masami Hiramatsu,
	Tejun Heo, Ingo Molnar, linux-omap, Arjan van de Ven

On Tuesday, October 26, 2010, Mathieu Desnoyers wrote:
> * Rafael J. Wysocki (rjw@sisk.pl) wrote:
> > On Tuesday, October 26, 2010, Mathieu Desnoyers wrote:
> > > * Peter Zijlstra (peterz@infradead.org) wrote:
> > > > On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
> > > > > 
> > > > > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> > > > >         atomic_inc(&dev->power.usage_count); 
> > > > 
> > > > That's terribly racy..
> > > 
> > > Looking at the original code, it looks racy even without considering the
> > > tracepoint:
> > > 
> > > int __pm_runtime_get(struct device *dev, bool sync)
> > >  {
> > >         int retval;
> > > 
> > > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> > >         atomic_inc(&dev->power.usage_count);
> > >         retval = sync ? pm_runtime_resume(dev) : pm_request_resume(dev);
> > > 
> > > There is no implied memory barrier after "atomic_inc". So either all these
> > > inc/dec are protected with mutexes or spinlocks, in which case one might wonder
> > > why atomic operations are used at all, or it's a racy mess. (I vote for the
> > > second option)
> > 
> > No, it isn't.
> > 
> > > kref should certainly be used there.
> > 
> > No, it shouldn't.
> > 
> > Please try to understand the code you're commenting on first.
> 
> Please see my reply to Alan Stern:
> 
> http://www.spinics.net/lists/linux-omap/msg39382.html

I have and I'm still unimpressed. :-)

Please see my reply to that message.

Thanks,
Rafael

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 21:38                   ` Mathieu Desnoyers
@ 2010-10-26 22:22                     ` Rafael J. Wysocki
  2010-10-26 22:22                     ` Rafael J. Wysocki
  1 sibling, 0 replies; 135+ messages in thread
From: Rafael J. Wysocki @ 2010-10-26 22:22 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Greg Kroah-Hartman, Pierre Tardy,
	Arjan van de Ven, Ingo Molnar, Thomas Renninger, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Masami Hiramatsu, Frank Eigler,
	Steven Rostedt, Kevin Hilman, linux-omap, linux-pm,
	linux-trace-users, Jean Pihet, Frederic Weisbecker, Tejun Heo

On Tuesday, October 26, 2010, Mathieu Desnoyers wrote:
> * Rafael J. Wysocki (rjw@sisk.pl) wrote:
> > On Tuesday, October 26, 2010, Mathieu Desnoyers wrote:
> > > * Peter Zijlstra (peterz@infradead.org) wrote:
> > > > On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
> > > > > 
> > > > > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> > > > >         atomic_inc(&dev->power.usage_count); 
> > > > 
> > > > That's terribly racy..
> > > 
> > > Looking at the original code, it looks racy even without considering the
> > > tracepoint:
> > > 
> > > int __pm_runtime_get(struct device *dev, bool sync)
> > >  {
> > >         int retval;
> > > 
> > > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> > >         atomic_inc(&dev->power.usage_count);
> > >         retval = sync ? pm_runtime_resume(dev) : pm_request_resume(dev);
> > > 
> > > There is no implied memory barrier after "atomic_inc". So either all these
> > > inc/dec are protected with mutexes or spinlocks, in which case one might wonder
> > > why atomic operations are used at all, or it's a racy mess. (I vote for the
> > > second option)
> > 
> > No, it isn't.
> > 
> > > kref should certainly be used there.
> > 
> > No, it shouldn't.
> > 
> > Please try to understand the code you're commenting on first.
> 
> Please see my reply to Alan Stern:
> 
> http://www.spinics.net/lists/linux-omap/msg39382.html

I have and I'm still unimpressed. :-)

Please see my reply to that message.

Thanks,
Rafael

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 22:20                     ` [linux-pm] " Rafael J. Wysocki
@ 2010-10-26 22:39                       ` Rafael J. Wysocki
  2010-10-26 22:39                       ` [linux-pm] " Rafael J. Wysocki
                                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 135+ messages in thread
From: Rafael J. Wysocki @ 2010-10-26 22:39 UTC (permalink / raw)
  To: linux-pm
  Cc: Paul E. McKenney, Andrew Morton, Pierre Tardy, Peter Zijlstra,
	Frederic Weisbecker, Linus Torvalds, Jean Pihet, Steven Rostedt,
	linux-trace-users, Frank Eigler, Mathieu Desnoyers,
	Masami Hiramatsu, Tejun Heo, Thomas Gleixner, linux-omap,
	Arjan van de Ven, Ingo Molnar

On Wednesday, October 27, 2010, Rafael J. Wysocki wrote:
> On Tuesday, October 26, 2010, Mathieu Desnoyers wrote:
> > * Alan Stern (stern@rowland.harvard.edu) wrote:
> > > On Tue, 26 Oct 2010, Mathieu Desnoyers wrote:
> > > 
> > > > * Peter Zijlstra (peterz@infradead.org) wrote:
> > > > > On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
> > > > > > 
> > > > > > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> > > > > >         atomic_inc(&dev->power.usage_count); 
> > > > > 
> > > > > That's terribly racy..
> > > > 
> > > > Looking at the original code, it looks racy even without considering the
> > > > tracepoint:
> > > > 
> > > > int __pm_runtime_get(struct device *dev, bool sync)
> > > >  {
> > > >         int retval;
> > > > 
> > > > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> > > >         atomic_inc(&dev->power.usage_count);
> > > >         retval = sync ? pm_runtime_resume(dev) : pm_request_resume(dev);
> > > > 
> > > > There is no implied memory barrier after "atomic_inc". So either all these
> > > > inc/dec are protected with mutexes or spinlocks, in which case one might wonder
> > > > why atomic operations are used at all, or it's a racy mess. (I vote for the
> > > > second option)
> > > 
> > > I don't understand.  What's the problem?  The inc/dec are atomic 
> > > because they are not protected by spinlocks, but everything else is 
> > > (aside from the tracepoint, which is new).
> > > 
> > > > kref should certainly be used there.
> > > 
> > > What for?
> > 
> > kref has the following "get":
> > 
> >         atomic_inc(&kref->refcount);
> >         smp_mb__after_atomic_inc();
> > 
> > What seems to be missing in __pm_runtime_get() and pm_runtime_get_noresume() is
> > the memory barrier after the atomic increment. The atomic increment is free to
> > be reordered into the following spinlock (within pm_request_resume or pm_request
> > resume execution) because taking a spinlock only acts as a memory barrier with
> > acquire semantic, not a full memory barrier.
> >
> > So AFAIU, the failure scenario would be as follows (sorry for the 80+ columns):
> > 
> > initial conditions: usage_count = 1
> > 
> > CPU A                                                       CPU B
> > 1) __pm_runtime_get() (sync = true)
> > 2)   atomic_inc(&usage_count) (not committed to memory yet)
> > 3)   pm_runtime_resume()
> > 4)     spin_lock_irqsave(&dev->power.lock, flags);
> > 5)     retval = __pm_request_resume(dev);
> 
> If sync = true this is
>            retval = __pm_runtime_resume(dev);
> which drops and reacquires the spinlock.  In the meantime it sets
> ->power.runtime_status so that __pm_runtime_idle() will fail if run at this
> point.
> 
> > 6)     (execute the body of __pm_request_resume and return)
> > 7)                                                          __pm_runtime_put() (sync = true) 
> > 8)                                                          if (atomic_dec_and_test(&dev->power.usage_count))
> >                                                               (still see usage_count == 1 before decrement,
> >                                                                thus decrement to 0)
> > 9)                                                             pm_runtime_idle()
> > 10)  spin_unlock_irqrestore(&dev->power.lock, flags)
> > 11)                                                            spin_lock_irq(&dev->power.lock);
> > 12)                                                            retval = __pm_runtime_idle(dev);
> 
> Moreover, __pm_runtime_idle() checks ->power.usage_count under the spinlock,
> so it will see it's been incremented in the meantime and it will back off.
> 
> > 13)                                                            spin_unlock_irq(&dev->power.lock);
> > 
> > So we end up in a situation where CPU A expects the device to be resumed, but
> > the last action performed has been to bring it to idle.
> >
> > A smp_mb__after_atomic_inc() between lines 2 and 3 would fix this.
> 
> I don't think this particular race is possible.  However, there is another one
> that seems to be possible (in a different function) that an explicit barrier
> will prevent from happening.
> 
> It's related to pm_runtime_get_noresume(), but I think it's better to put the
> barrier where it's necessary rather than into pm_runtime_get_noresume() itself.

Actually, no.  Since rpm_idle() and rpm_suspend() both check usage_count under
the spinlock, the race I was thinking about doesn't appear to be possible after
all.
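
A minimal userspace sketch of that check-under-lock pattern, assuming a pthread
mutex in place of dev->power.lock and C11 atomics in place of atomic_t. The
names struct fake_dev, fake_get(), fake_put_noidle() and fake_idle() are
hypothetical and are not the runtime PM API; the -EAGAIN return value is
likewise an assumption made for the illustration:

```c
#include <errno.h>
#include <pthread.h>
#include <stdatomic.h>

struct fake_dev {
	pthread_mutex_t lock;    /* stands in for dev->power.lock */
	atomic_int usage_count;  /* stands in for dev->power.usage_count */
	int idle;                /* set to 1 once the idle path has run */
};

/* Take a reference without holding the lock. */
static void fake_get(struct fake_dev *dev)
{
	atomic_fetch_add_explicit(&dev->usage_count, 1, memory_order_relaxed);
}

/* Drop a reference without triggering the idle path. */
static void fake_put_noidle(struct fake_dev *dev)
{
	atomic_fetch_sub_explicit(&dev->usage_count, 1, memory_order_relaxed);
}

/*
 * Idle path: the counter is re-checked with the lock held. An increment
 * made before another thread's earlier lock/unlock pair on the same lock
 * is visible here, so this path backs off instead of idling the device.
 */
static int fake_idle(struct fake_dev *dev)
{
	int ret = 0;

	pthread_mutex_lock(&dev->lock);
	if (atomic_load_explicit(&dev->usage_count, memory_order_relaxed) > 0)
		ret = -EAGAIN;  /* still in use: back off */
	else
		dev->idle = 1;
	pthread_mutex_unlock(&dev->lock);

	return ret;
}
```

With a reference held, fake_idle() returns -EAGAIN and leaves the device alone;
once the reference is dropped, the same call succeeds.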

Thanks,
Rafael

* Re: [linux-pm] [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 22:20                     ` [linux-pm] " Rafael J. Wysocki
  2010-10-26 22:39                       ` Rafael J. Wysocki
@ 2010-10-26 22:39                       ` Rafael J. Wysocki
  2010-10-27  0:46                       ` Mathieu Desnoyers
  2010-10-27  0:46                       ` Mathieu Desnoyers
  3 siblings, 0 replies; 135+ messages in thread
From: Rafael J. Wysocki @ 2010-10-26 22:39 UTC (permalink / raw)
  To: linux-pm
  Cc: linux-omap, Arjan van de Ven, Thomas Gleixner, Pierre Tardy,
	Peter Zijlstra, Frederic Weisbecker, Jean Pihet, Steven Rostedt,
	linux-trace-users, Frank Eigler, Masami Hiramatsu,
	Mathieu Desnoyers, Ingo Molnar, Paul E. McKenney, Linus Torvalds,
	Andrew Morton, Tejun Heo

On Wednesday, October 27, 2010, Rafael J. Wysocki wrote:
> On Tuesday, October 26, 2010, Mathieu Desnoyers wrote:
> > * Alan Stern (stern@rowland.harvard.edu) wrote:
> > > On Tue, 26 Oct 2010, Mathieu Desnoyers wrote:
> > > 
> > > > * Peter Zijlstra (peterz@infradead.org) wrote:
> > > > > On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
> > > > > > 
> > > > > > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> > > > > >         atomic_inc(&dev->power.usage_count); 
> > > > > 
> > > > > That's terribly racy..
> > > > 
> > > > Looking at the original code, it looks racy even without considering the
> > > > tracepoint:
> > > > 
> > > > int __pm_runtime_get(struct device *dev, bool sync)
> > > >  {
> > > >         int retval;
> > > > 
> > > > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> > > >         atomic_inc(&dev->power.usage_count);
> > > >         retval = sync ? pm_runtime_resume(dev) : pm_request_resume(dev);
> > > > 
> > > > There is no implied memory barrier after "atomic_inc". So either all these
> > > > inc/dec are protected with mutexes or spinlocks, in which case one might wonder
> > > > why atomic operations are used at all, or it's a racy mess. (I vote for the
> > > > second option)
> > > 
> > > I don't understand.  What's the problem?  The inc/dec are atomic 
> > > because they are not protected by spinlocks, but everything else is 
> > > (aside from the tracepoint, which is new).
> > > 
> > > > kref should certainly be used there.
> > > 
> > > What for?
> > 
> > kref has the following "get":
> > 
> >         atomic_inc(&kref->refcount);
> >         smp_mb__after_atomic_inc();
> > 
> > What seems to be missing in __pm_runtime_get() and pm_runtime_get_noresume() is
> > the memory barrier after the atomic increment. The atomic increment is free to
> > be reordered into the following spinlock (within pm_request_resume or pm_request
> > resume execution) because taking a spinlock only acts as a memory barrier with
> > acquire semantic, not a full memory barrier.
> >
> > So AFAIU, the failure scenario would be as follows (sorry for the 80+ columns):
> > 
> > initial conditions: usage_count = 1
> > 
> > CPU A                                                       CPU B
> > 1) __pm_runtime_get() (sync = true)
> > 2)   atomic_inc(&usage_count) (not committed to memory yet)
> > 3)   pm_runtime_resume()
> > 4)     spin_lock_irqsave(&dev->power.lock, flags);
> > 5)     retval = __pm_request_resume(dev);
> 
> If sync = true this is
>            retval = __pm_runtime_resume(dev);
> which drops and reacquires the spinlock.  In the meantime it sets
> ->power.runtime_status so that __pm_runtime_idle() will fail if run at this
> point.
> 
> > 6)     (execute the body of __pm_request_resume and return)
> > 7)                                                          __pm_runtime_put() (sync = true) 
> > 8)                                                          if (atomic_dec_and_test(&dev->power.usage_count))
> >                                                               (still see usage_count == 1 before decrement,
> >                                                                thus decrement to 0)
> > 9)                                                             pm_runtime_idle()
> > 10)  spin_unlock_irqrestore(&dev->power.lock, flags)
> > 11)                                                            spin_lock_irq(&dev->power.lock);
> > 12)                                                            retval = __pm_runtime_idle(dev);
> 
> Moreover, __pm_runtime_idle() checks ->power.usage_count under the spinlock,
> so it will see it's been incremented in the meantime and it will back off.
> 
> > 13)                                                            spin_unlock_irq(&dev->power.lock);
> > 
> > So we end up in a situation where CPU A expects the device to be resumed, but
> > the last action performed has been to bring it to idle.
> >
> > A smp_mb__after_atomic_inc() between lines 2 and 3 would fix this.
> 
> I don't think this particular race is possible.  However, there is another one
> that seems to be possible (in a different function) that an explicit barrier
> will prevent from happening.
> 
> It's related to pm_runtime_get_noresume(), but I think it's better to put the
> barrier where it's necessary rather than into pm_runtime_get_noresume() itself.

Actually, no.  Since rpm_idle() and rpm_suspend() both check usage_count under
the spinlock, the race I was thinking about doesn't appear to be possible after
all.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 18:57             ` Rafael J. Wysocki
  2010-10-27  0:00               ` Thomas Renninger
@ 2010-10-27  0:00               ` Thomas Renninger
  1 sibling, 0 replies; 135+ messages in thread
From: Thomas Renninger @ 2010-10-27  0:00 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Arjan van de Ven, Andrew Morton, linux-trace-users,
	Frederic Weisbecker, Pierre Tardy, Jean Pihet, Steven Rostedt,
	Peter Zijlstra, Frank Eigler, Mathieu Desnoyers, linux-pm,
	Masami Hiramatsu, Tejun Heo, Thomas Gleixner, linux-omap,
	Linus Torvalds, Ingo Molnar

On Tuesday 26 October 2010 08:57:01 pm Rafael J. Wysocki wrote:
> On Tuesday, October 26, 2010, Thomas Renninger wrote:
> > > 
> > > Ok, that's at least generic. Needs the review of Rafael, to determine
> > > whether this state value is all we want to know when we enter suspend.
> > He already gave an acked-by on this generic one here:
> > Re: [PATCH 3/4] perf: add calls to suspend trace point
> > Oh no, that was on the X86 specific part which depends on this one.
> > One should expect that he's fine with the generic part as well then,
> > but I agree that he should definitely have a look at this and sign it off.
> 
> What patch exactly do you mean?  I'm not quite sure from your comment above.

Eh, Jean's patch, sorry about that.
Needs a tiny change to use PWR_EVENT_EXIT instead of 0 with
my new patch series:

Signed-off-by: Jean Pihet <j-pihet@ti.com>
CC: Thomas Renninger <trenn@suse.de>
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>

---
 kernel/power/suspend.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c
index 7335952..10cad5c 100644
--- a/kernel/power/suspend.c
+++ b/kernel/power/suspend.c
@@ -22,6 +22,7 @@
 #include <linux/mm.h>
 #include <linux/slab.h>
 #include <linux/suspend.h>
+#include <trace/events/power.h>
 
 #include "power.h"
 
@@ -164,7 +165,9 @@ static int suspend_enter(suspend_state_t state)
        error = sysdev_suspend(PMSG_SUSPEND);
        if (!error) {
                if (!suspend_test(TEST_CORE) && pm_check_wakeup_events()) {
+                       trace_machine_suspend(state);
                        error = suspend_ops->enter(state);
+                       trace_machine_suspend(0);
                        events_check_enabled = false;
                }
                sysdev_resume();

^ permalink raw reply related	[flat|nested] 135+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 22:20                     ` [linux-pm] " Rafael J. Wysocki
                                         ` (2 preceding siblings ...)
  2010-10-27  0:46                       ` Mathieu Desnoyers
@ 2010-10-27  0:46                       ` Mathieu Desnoyers
  3 siblings, 0 replies; 135+ messages in thread
From: Mathieu Desnoyers @ 2010-10-27  0:46 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: linux-omap, Arjan van de Ven, Andrew Morton, Thomas Gleixner,
	Pierre Tardy, Peter Zijlstra, Frederic Weisbecker, Jean Pihet,
	Steven Rostedt, linux-trace-users, Frank Eigler,
	Masami Hiramatsu, Tejun Heo, linux-pm, Paul E. McKenney,
	Linus Torvalds, Ingo Molnar

* Rafael J. Wysocki (rjw@sisk.pl) wrote:
> On Tuesday, October 26, 2010, Mathieu Desnoyers wrote:
> > * Alan Stern (stern@rowland.harvard.edu) wrote:
> > > On Tue, 26 Oct 2010, Mathieu Desnoyers wrote:
> > > 
> > > > * Peter Zijlstra (peterz@infradead.org) wrote:
> > > > > On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
> > > > > > 
> > > > > > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> > > > > >         atomic_inc(&dev->power.usage_count); 
> > > > > 
> > > > > That's terribly racy..
> > > > 
> > > > Looking at the original code, it looks racy even without considering the
> > > > tracepoint:
> > > > 
> > > > int __pm_runtime_get(struct device *dev, bool sync)
> > > >  {
> > > >         int retval;
> > > > 
> > > > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> > > >         atomic_inc(&dev->power.usage_count);
> > > >         retval = sync ? pm_runtime_resume(dev) : pm_request_resume(dev);
> > > > 
> > > > There is no implied memory barrier after "atomic_inc". So either all these
> > > > inc/dec are protected with mutexes or spinlocks, in which case one might wonder
> > > > why atomic operations are used at all, or it's a racy mess. (I vote for the
> > > > second option)
> > > 
> > > I don't understand.  What's the problem?  The inc/dec are atomic 
> > > because they are not protected by spinlocks, but everything else is 
> > > (aside from the tracepoint, which is new).
> > > 
> > > > kref should certainly be used there.
> > > 
> > > What for?
> > 
> > kref has the following "get":
> > 
> >         atomic_inc(&kref->refcount);
> >         smp_mb__after_atomic_inc();
> > 
> > What seems to be missing in __pm_runtime_get() and pm_runtime_get_noresume() is
> > the memory barrier after the atomic increment. The atomic increment is free to
> > > be reordered into the following spinlock (within pm_request_resume or
> > > pm_runtime_resume execution) because taking a spinlock only acts as a memory
> > > barrier with acquire semantics, not a full memory barrier.
> >
> > So AFAIU, the failure scenario would be as follows (sorry for the 80+ columns):
> > 
> > initial conditions: usage_count = 1
> > 
> > CPU A                                                       CPU B
> > 1) __pm_runtime_get() (sync = true)
> > 2)   atomic_inc(&usage_count) (not committed to memory yet)
> > 3)   pm_runtime_resume()
> > 4)     spin_lock_irqsave(&dev->power.lock, flags);
> > 5)     retval = __pm_request_resume(dev);
> 
> If sync = true this is
>            retval = __pm_runtime_resume(dev);
> which drops and reacquires the spinlock.

Let's see. Upon entry into __pm_runtime_resume, the following condition holds
(remember, the initial condition is that usage_count == 1):

  dev->power.runtime_status == RPM_ACTIVE

so retval is set to 1 and the code jumps directly to "out", without setting
"parent". So there does not seem to be any spinlock reacquire on this path, or
am I misunderstanding how "runtime_status" works?

> In the meantime it sets
> ->power.runtime_status so that __pm_runtime_idle() will fail if run at this
> point.

runtime_status will be left at "RPM_ACTIVE", which is the appropriate value
expected by __pm_runtime_idle.

> 
> > 6)     (execute the body of __pm_request_resume and return)
> > 7)                                                          __pm_runtime_put() (sync = true) 
> > 8)                                                          if (atomic_dec_and_test(&dev->power.usage_count))
> >                                                               (still see usage_count == 1 before decrement,
> >                                                                thus decrement to 0)
> > 9)                                                             pm_runtime_idle()
> > 10)  spin_unlock_irqrestore(&dev->power.lock, flags)
> > 11)                                                            spin_lock_irq(&dev->power.lock);
> > 12)                                                            retval = __pm_runtime_idle(dev);
> 
> Moreover, __pm_runtime_idle() checks ->power.usage_count under the spinlock,
> so it will see it's been incremented in the meantime and it will back off.

This is a subtle but important point. Yes, my scenario seems to be dealt with by
the extra usage_count check while the spinlock is held.

How about adding a comment under this atomic_inc() stating that the memory
barriers are implicitly dealt with by the following spinlock release and the
extra check while the spinlock is held?

Commenting memory barriers is important, but commenting why memory barriers are
not needed due to a subtle corner-case looks even more important.
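
As a rough illustration of the kind of comment meant here, the following is a
user-space sketch only, not kernel code. It models the counter with C11 atomics
and the device lock with a hand-rolled spinlock. All model_* names and the
-EAGAIN return value are assumptions made for the example, and the real
__pm_runtime_get()/__pm_runtime_idle() paths do considerably more than this.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space model; none of these names exist in the kernel. */
struct model_dev {
	atomic_int usage_count;	/* stands in for dev->power.usage_count */
	atomic_flag lock;	/* stands in for dev->power.lock */
	bool idle;		/* set once the idle path has run */
};

static void model_init(struct model_dev *dev)
{
	atomic_init(&dev->usage_count, 0);
	atomic_flag_clear(&dev->lock);
	dev->idle = false;
}

static void model_lock(struct model_dev *dev)
{
	/* Acquire-only ordering, like taking a spinlock. */
	while (atomic_flag_test_and_set_explicit(&dev->lock,
						 memory_order_acquire))
		;
}

static void model_unlock(struct model_dev *dev)
{
	atomic_flag_clear_explicit(&dev->lock, memory_order_release);
}

/* Models the "get" side with sync = true on an already-active device. */
static void model_get_sync(struct model_dev *dev)
{
	atomic_fetch_add_explicit(&dev->usage_count, 1, memory_order_relaxed);
	/*
	 * No explicit barrier here: the lock release below, together with
	 * the usage_count re-check done under the same lock in
	 * model_put_sync(), orders the increment against the idle path.
	 */
	model_lock(dev);
	dev->idle = false;	/* "resume" */
	model_unlock(dev);
}

/* Models the "put" side. Returns 0 if idled, -EAGAIN if it backed off. */
static int model_put_sync(struct model_dev *dev)
{
	int ret = -EAGAIN;
	int old = atomic_fetch_sub_explicit(&dev->usage_count, 1,
					    memory_order_acq_rel);

	assert(old > 0);	/* catch unbalanced puts */
	if (old != 1)
		return ret;	/* other users remain */

	model_lock(dev);
	/* Re-check under the lock and back off if someone took a reference. */
	if (atomic_load_explicit(&dev->usage_count,
				 memory_order_relaxed) == 0) {
		dev->idle = true;
		ret = 0;
	}
	model_unlock(dev);
	return ret;
}
```

The point of the sketch is only where the comment sits and what it refers to:
the increment is unordered by itself, and correctness relies on the lock
round-trip plus the re-check on the idle side.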

(hrm, but more below considering pm_runtime_get_noresume())

> 
> > 13)                                                            spin_unlock_irq(&dev->power.lock);
> > 
> > So we end up in a situation where CPU A expects the device to be resumed, but
> > the last action performed has been to bring it to idle.
> >
> > A smp_mb__after_atomic_inc() between lines 2 and 3 would fix this.
> 
> I don't think this particular race is possible.  However, there is another one
> that seems to be possible (in a different function) that an explicit barrier
> will prevent from happening.
> 
> It's related to pm_runtime_get_noresume(), but I think it's better to put the
> barrier where it's necessary rather than into pm_runtime_get_noresume() itself.

Quoting your following mail:

> Actually, no.  Since rpm_idle() and rpm_suspend() both check usage_count under
> the spinlock, the race I was thinking about doesn't appear to be possible
> after all.

Hrm, for the extra-usage_count-under-spinlock check to work, all
pm_runtime_get_noresume() callers should grab and release the dev->power.lock
after incrementing the usage_count. This does not seem to be the case though. So
you might really have a race there.

So every code path that does:

1) pm_runtime_get_noresume(dev);

2) ...

3) pm_runtime_put_noidle(dev);

expecting that the device state cannot be changed between 1 and 3 might be
surprised by a concurrent call to __pm_runtime_idle() that would put the device
into idle (or similarly with suspend) due to the lack of a memory barrier after
the atomic increment.
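
A rough user-space sketch of this concern, under stated assumptions: the
model_* names below are hypothetical, the spinlock is a simple C11 atomic_flag,
and the "synced" variant is just one way to express "grab and release the lock
after incrementing", not a proposed patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model; names are illustrative, not kernel API. */
struct model_dev {
	atomic_int usage_count;	/* stands in for dev->power.usage_count */
	atomic_flag lock;	/* stands in for dev->power.lock */
	bool active;		/* false once the idle path has run */
};

static void model_init(struct model_dev *dev)
{
	atomic_init(&dev->usage_count, 0);
	atomic_flag_clear(&dev->lock);
	dev->active = true;
}

static void model_lock(struct model_dev *dev)
{
	while (atomic_flag_test_and_set_explicit(&dev->lock,
						 memory_order_acquire))
		;
}

static void model_unlock(struct model_dev *dev)
{
	atomic_flag_clear_explicit(&dev->lock, memory_order_release);
}

/* Increment only: no lock and no barrier afterwards. */
static void model_get_noresume(struct model_dev *dev)
{
	atomic_fetch_add_explicit(&dev->usage_count, 1, memory_order_relaxed);
}

/*
 * Variant that also passes through the lock, so that an idle check which
 * takes the lock afterwards is guaranteed to observe the increment.
 */
static void model_get_noresume_synced(struct model_dev *dev)
{
	model_get_noresume(dev);
	model_lock(dev);
	model_unlock(dev);
}

static void model_put_noidle(struct model_dev *dev)
{
	int old = atomic_fetch_sub_explicit(&dev->usage_count, 1,
					    memory_order_relaxed);

	assert(old > 0);	/* catch unbalanced puts */
}

/* Idle path: checks usage_count under the lock. Returns true if idled. */
static bool model_idle(struct model_dev *dev)
{
	bool idled = false;

	model_lock(dev);
	if (atomic_load_explicit(&dev->usage_count,
				 memory_order_relaxed) == 0) {
		dev->active = false;
		idled = true;
	}
	model_unlock(dev);
	return idled;
}
```

With model_get_noresume() alone, nothing orders the relaxed increment against a
concurrent model_idle(); with the synced variant, the release/acquire pair on
the lock does.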

Or am I missing something else ?

Thanks,

Mathieu

> 
> Thanks,
> Rafael

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-27  0:00               ` Thomas Renninger
  2010-10-27  9:16                 ` Rafael J. Wysocki
@ 2010-10-27  9:16                 ` Rafael J. Wysocki
  1 sibling, 0 replies; 135+ messages in thread
From: Rafael J. Wysocki @ 2010-10-27  9:16 UTC (permalink / raw)
  To: Thomas Renninger
  Cc: Arjan van de Ven, Andrew Morton, linux-trace-users,
	Frederic Weisbecker, Pierre Tardy, Jean Pihet, Steven Rostedt,
	Peter Zijlstra, Frank Eigler, Mathieu Desnoyers, linux-pm,
	Masami Hiramatsu, Tejun Heo, Thomas Gleixner, linux-omap,
	Linus Torvalds, Ingo Molnar

On Wednesday, October 27, 2010, Thomas Renninger wrote:
> On Tuesday 26 October 2010 08:57:01 pm Rafael J. Wysocki wrote:
> > On Tuesday, October 26, 2010, Thomas Renninger wrote:
> > > > 
> > > > Ok, that's at least generic. Needs the review of Rafael, to determine
> > > > whether this state value is all we want to know when we enter suspend.
> > > He already gave an acked-by on this generic one here:
> > > Re: [PATCH 3/4] perf: add calls to suspend trace point
> > > Oh no, that was on the X86 specific part which depends on this one.
> > > One should expect that he's fine with the generic part as well then,
> > > but I agree that he should definitely have a look at this and sign it off.
> > 
> > What patch exactly do you mean?  I'm not quite sure from your comment above.
> 
> Eh, Jean's patch, sorry about that.
> Needs a tiny change to use PWR_EVENT_EXIT instead of 0 with
> my new patch series:

No problem with that as far as I'm concerned.

Thanks,
Rafael


> Signed-off-by: Jean Pihet <j-pihet@ti.com>
> CC: Thomas Renninger <trenn@suse.de>
> Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
> 
> ---
>  kernel/power/suspend.c |    3 +++
>  1 files changed, 3 insertions(+), 0 deletions(-)
> 
> diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c
> index 7335952..10cad5c 100644
> --- a/kernel/power/suspend.c
> +++ b/kernel/power/suspend.c
> @@ -22,6 +22,7 @@
>  #include <linux/mm.h>
>  #include <linux/slab.h>
>  #include <linux/suspend.h>
> +#include <trace/events/power.h>
>  
>  #include "power.h"
>  
> @@ -164,7 +165,9 @@ static int suspend_enter(suspend_state_t state)
>         error = sysdev_suspend(PMSG_SUSPEND);
>         if (!error) {
>                 if (!suspend_test(TEST_CORE) && pm_check_wakeup_events()) {
> +                       trace_machine_suspend(state);
>                         error = suspend_ops->enter(state);
> +                       trace_machine_suspend(0);
>                         events_check_enabled = false;
>                 }
>                 sysdev_resume();
> 
> 

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-27  0:46                       ` Mathieu Desnoyers
  2010-10-27 10:22                         ` Rafael J. Wysocki
@ 2010-10-27 10:22                         ` Rafael J. Wysocki
  1 sibling, 0 replies; 135+ messages in thread
From: Rafael J. Wysocki @ 2010-10-27 10:22 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-omap, Arjan van de Ven, Andrew Morton, Thomas Gleixner,
	Pierre Tardy, Peter Zijlstra, Frederic Weisbecker, Jean Pihet,
	Steven Rostedt, linux-trace-users, Frank Eigler,
	Masami Hiramatsu, Tejun Heo, linux-pm, Paul E. McKenney,
	Linus Torvalds, Ingo Molnar

On Wednesday, October 27, 2010, Mathieu Desnoyers wrote:
> * Rafael J. Wysocki (rjw@sisk.pl) wrote:
> > On Tuesday, October 26, 2010, Mathieu Desnoyers wrote:
> > > * Alan Stern (stern@rowland.harvard.edu) wrote:
> > > > On Tue, 26 Oct 2010, Mathieu Desnoyers wrote:
> > > > 
> > > > > * Peter Zijlstra (peterz@infradead.org) wrote:
> > > > > > On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
> > > > > > > 
> > > > > > > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> > > > > > >         atomic_inc(&dev->power.usage_count); 
> > > > > > 
> > > > > > That's terribly racy..
> > > > > 
> > > > > Looking at the original code, it looks racy even without considering the
> > > > > tracepoint:
> > > > > 
> > > > > int __pm_runtime_get(struct device *dev, bool sync)
> > > > >  {
> > > > >         int retval;
> > > > > 
> > > > > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> > > > >         atomic_inc(&dev->power.usage_count);
> > > > >         retval = sync ? pm_runtime_resume(dev) : pm_request_resume(dev);
> > > > > 
> > > > > There is no implied memory barrier after "atomic_inc". So either all these
> > > > > inc/dec are protected with mutexes or spinlocks, in which case one might wonder
> > > > > why atomic operations are used at all, or it's a racy mess. (I vote for the
> > > > > second option)
> > > > 
> > > > I don't understand.  What's the problem?  The inc/dec are atomic 
> > > > because they are not protected by spinlocks, but everything else is 
> > > > (aside from the tracepoint, which is new).
> > > > 
> > > > > kref should certainly be used there.
> > > > 
> > > > What for?
> > > 
> > > kref has the following "get":
> > > 
> > >         atomic_inc(&kref->refcount);
> > >         smp_mb__after_atomic_inc();
> > > 
> > > What seems to be missing in __pm_runtime_get() and pm_runtime_get_noresume() is
> > > the memory barrier after the atomic increment. The atomic increment is free to
> > > > be reordered into the following spinlock (within pm_request_resume or
> > > > pm_runtime_resume execution) because taking a spinlock only acts as a memory
> > > > barrier with acquire semantics, not a full memory barrier.
> > >
> > > So AFAIU, the failure scenario would be as follows (sorry for the 80+ columns):
> > > 
> > > initial conditions: usage_count = 1
> > > 
> > > CPU A                                                       CPU B
> > > 1) __pm_runtime_get() (sync = true)
> > > 2)   atomic_inc(&usage_count) (not committed to memory yet)
> > > 3)   pm_runtime_resume()
> > > 4)     spin_lock_irqsave(&dev->power.lock, flags);
> > > 5)     retval = __pm_request_resume(dev);
> > 
> > If sync = true this is
> >            retval = __pm_runtime_resume(dev);
> > which drops and reacquires the spinlock.
> 
> Let's see. Upon entry into __pm_runtime_resume, the following condition holds
> (remember, the initial condition is that usage_count == 1):
> 
>   dev->power.runtime_status == RPM_ACTIVE
> 
> so retval is set to 1 and the code jumps directly to "out", without setting
> "parent". So there does not seem to be any spinlock reacquire on this path, or
> am I misunderstanding how "runtime_status" works?

No, I don't think you are; the above is correct.  I was referring to the
scenario in which the device was RPM_SUSPENDED initially.

> > In the meantime it sets
> > ->power.runtime_status so that __pm_runtime_idle() will fail if run at this
> > point.
> 
> runtime_status will be left at "RPM_ACTIVE", which is the appropriate value
> expected by __pm_runtime_idle.
> 
> > 
> > > 6)     (execute the body of __pm_request_resume and return)
> > > 7)                                                          __pm_runtime_put() (sync = true) 
> > > 8)                                                          if (atomic_dec_and_test(&dev->power.usage_count))
> > >                                                               (still see usage_count == 1 before decrement,
> > >                                                                thus decrement to 0)
> > > 9)                                                             pm_runtime_idle()
> > > 10)  spin_unlock_irqrestore(&dev->power.lock, flags)
> > > 11)                                                            spin_lock_irq(&dev->power.lock);
> > > 12)                                                            retval = __pm_runtime_idle(dev);
> > 
> > Moreover, __pm_runtime_idle() checks ->power.usage_count under the spinlock,
> > so it will see it's been incremented in the meantime and it will back off.
> 
> This is a subtle but important point. Yes, my scenario seems to be dealt with by
> the extra usage_count check while the spinlock is held.
> 
> How about adding a comment under this atomic_inc() stating that the memory
> barriers are implicitly dealt with by the following spinlock release and the
> extra check while spinlock is held ?
> 
> Commenting memory barriers is important, but commenting why memory barriers are
> not needed due to a subtle corner-case looks even more important.

Well, given that this discussion is taking place at all, I admit that it would
be good to document this somehow. :-)

I'll take care of that.
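
The back-off described above can be modeled in userspace.  The following is a
minimal sketch, not the PM core code: struct model_dev and the model_*()
functions are hypothetical stand-ins, a pthread mutex plays the role of
dev->power.lock, and C11 relaxed atomics play the role of the kernel's
atomic_inc()/atomic_dec_and_test().  It assumes, as discussed in this thread,
that the idle path rechecks usage_count while holding the lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the dev->power fields discussed here. */
struct model_dev {
	pthread_mutex_t lock;    /* models dev->power.lock */
	atomic_int usage_count;  /* models dev->power.usage_count */
	bool idle_notified;      /* set once the modeled idle step has run */
};

static void model_dev_init(struct model_dev *d, int initial_count)
{
	pthread_mutex_init(&d->lock, NULL);
	atomic_init(&d->usage_count, initial_count);
	d->idle_notified = false;
}

/*
 * Models the sync "get": the increment happens outside the lock with no
 * barrier of its own, then the lock is taken and released.  The unlock has
 * release semantics, so the increment is visible to whoever takes the lock
 * afterwards.
 */
static void model_get_sync(struct model_dev *d)
{
	atomic_fetch_add_explicit(&d->usage_count, 1, memory_order_relaxed);
	pthread_mutex_lock(&d->lock);
	/* the resume work would happen here */
	pthread_mutex_unlock(&d->lock);
}

/* Models the idle path: recheck the count under the lock, back off if held. */
static bool model_idle(struct model_dev *d)
{
	bool ran = false;

	pthread_mutex_lock(&d->lock);
	if (atomic_load_explicit(&d->usage_count, memory_order_relaxed) == 0) {
		d->idle_notified = true;
		ran = true;
	}
	pthread_mutex_unlock(&d->lock);
	return ran;
}

/* Models the sync "put": only the drop to zero attempts the idle path. */
static bool model_put_sync(struct model_dev *d)
{
	int old = atomic_fetch_sub_explicit(&d->usage_count, 1,
					    memory_order_relaxed);

	assert(old > 0);	/* an unbalanced put is a bug in the caller */
	if (old == 1)
		return model_idle(d);
	return false;
}
```

With usage_count starting at 1, a get followed by a put leaves the count at 1
and the idle step is never entered; a direct model_idle() call takes the lock,
sees a nonzero count and backs off.  Only the put that drops the count to 0
lets the idle step run.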

> (hrm, but more below considering pm_runtime_get_noresume())
> 
> > 
> > > 13)                                                            spin_unlock_irq(&dev->power.lock);
> > > 
> > > So we end up in a situation where CPU A expects the device to be resumed, but
> > > the last action performed has been to bring it to idle.
> > >
> > > A smp_mb__after_atomic_inc() between lines 2 and 3 would fix this.
> > 
> > I don't think this particular race is possible.  However, there is another one
> > that seems to be possible (in a different function) that an explicit barrier
> > will prevent from happening.
> > 
> > It's related to pm_runtime_get_noresume(), but I think it's better to put the
> > barrier where it's necessary rather than into pm_runtime_get_noresume() itself.
> 
> Quoting your following mail:
> 
> > Actually, no.  Since rpm_idle() and rpm_suspend() both check usage_count under
> > the spinlock, the race I was thinking about doesn't appear to be possible
> > after all.
> 
> Hrm, for the extra-usage_count-under-spinlock check to work, all
> pm_runtime_get_noresume() callers should grab and release the dev->power.lock
> after incrementing the usage_count. This does not seem to be the case though. So
> you might really have a race there.
> 
> So every code path that does:
> 
> 1) pm_runtime_get_noresume(dev);
> 
> 2) ...
> 
> 3) pm_runtime_put_noidle(dev);
> 
> expecting that the device state cannot be changed between 1 and 3 might be
> surprised by a concurrent call to __pm_runtime_idle() that would put a device to
> idle (or similarly with suspend) due to lack of memory barrier after the atomic
> increment.
> 
> Or am I missing something else ?

First of all, the device can always be resumed regardless of the usage_count
value.  usage_count is only used to block attempts to suspend the device and
execute its driver's ->runtime_idle() callback after it has been resumed.
That's why the "normal" pm_runtime_get() queues up a resume request.

IOW, the _get() only becomes meaningful after attempting to resume the device
(which is what I tried to tell Arjan in one of the previous messages).

Second, there's no synchronization between pm_runtime_get_noresume() and
pm_runtime_suspend/idle() etc., so calling pm_runtime_get_noresume() is
certainly insufficient to block pm_runtime_suspend/idle() regardless of memory
barriers (there may be one already in progress when _get_noresume() is called).
To limit possible status changes from happening one should (at least) run
pm_runtime_barrier() (surprise, no? ;-)) after pm_runtime_get_noresume().

So if you don't want to resume the device immediately after increasing its
usage_count (in which case it's better to use pm_runtime_get_sync()), you
should do something like this:

1) pm_runtime_get_noresume(dev);
1a) pm_runtime_barrier(dev);  // That takes care of all pending requests etc.

2) ...
 
3) pm_runtime_put_noidle(dev);

[The meaning of pm_runtime_barrier() is that all of the runtime PM activity
started before the barrier has been completed when it returns.]

There's one place in the PM core where that really is necessary, but I wouldn't
recommend anyone doing anything like it in a driver.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-27 10:22                         ` Rafael J. Wysocki
  2010-10-27 12:21                           ` Mathieu Desnoyers
@ 2010-10-27 12:21                           ` Mathieu Desnoyers
  1 sibling, 0 replies; 135+ messages in thread
From: Mathieu Desnoyers @ 2010-10-27 12:21 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: linux-omap, Arjan van de Ven, Andrew Morton, Thomas Gleixner,
	Pierre Tardy, Peter Zijlstra, Frederic Weisbecker, Jean Pihet,
	Steven Rostedt, linux-trace-users, Frank Eigler,
	Masami Hiramatsu, Tejun Heo, linux-pm, Paul E. McKenney,
	Linus Torvalds, Ingo Molnar

* Rafael J. Wysocki (rjw@sisk.pl) wrote:
> On Wednesday, October 27, 2010, Mathieu Desnoyers wrote:
> > * Rafael J. Wysocki (rjw@sisk.pl) wrote:
> > > On Tuesday, October 26, 2010, Mathieu Desnoyers wrote:
> > > > * Alan Stern (stern@rowland.harvard.edu) wrote:
> > > > > On Tue, 26 Oct 2010, Mathieu Desnoyers wrote:
> > > > > 
> > > > > > * Peter Zijlstra (peterz@infradead.org) wrote:
> > > > > > > On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
> > > > > > > > 
> > > > > > > > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> > > > > > > >         atomic_inc(&dev->power.usage_count); 
> > > > > > > 
> > > > > > > That's terribly racy..
> > > > > > 
> > > > > > Looking at the original code, it looks racy even without considering the
> > > > > > tracepoint:
> > > > > > 
> > > > > > int __pm_runtime_get(struct device *dev, bool sync)
> > > > > >  {
> > > > > >         int retval;
> > > > > > 
> > > > > > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> > > > > >         atomic_inc(&dev->power.usage_count);
> > > > > >         retval = sync ? pm_runtime_resume(dev) : pm_request_resume(dev);
> > > > > > 
> > > > > > There is no implied memory barrier after "atomic_inc". So either all these
> > > > > > inc/dec are protected with mutexes or spinlocks, in which case one might wonder
> > > > > > why atomic operations are used at all, or it's a racy mess. (I vote for the
> > > > > > second option)
> > > > > 
> > > > > I don't understand.  What's the problem?  The inc/dec are atomic 
> > > > > because they are not protected by spinlocks, but everything else is 
> > > > > (aside from the tracepoint, which is new).
> > > > > 
> > > > > > kref should certainly be used there.
> > > > > 
> > > > > What for?
> > > > 
> > > > kref has the following "get":
> > > > 
> > > >         atomic_inc(&kref->refcount);
> > > >         smp_mb__after_atomic_inc();
> > > > 
> > > > What seems to be missing in __pm_runtime_get() and pm_runtime_get_noresume() is
> > > > the memory barrier after the atomic increment. The atomic increment is free to
> > > > be reordered into the following spinlock (within pm_runtime_resume or
> > > > pm_request_resume execution) because taking a spinlock only acts as a memory
> > > > barrier with acquire semantics, not a full memory barrier.
> > > >
> > > > So AFAIU, the failure scenario would be as follows (sorry for the 80+ columns):
> > > > 
> > > > initial conditions: usage_count = 1
> > > > 
> > > > CPU A                                                       CPU B
> > > > 1) __pm_runtime_get() (sync = true)
> > > > 2)   atomic_inc(&usage_count) (not committed to memory yet)
> > > > 3)   pm_runtime_resume()
> > > > 4)     spin_lock_irqsave(&dev->power.lock, flags);
> > > > 5)     retval = __pm_request_resume(dev);
> > > 
> > > If sync = true this is
> > >            retval = __pm_runtime_resume(dev);
> > > which drops and reacquires the spinlock.
> > 
> > Let's see. Upon entry in __pm_runtime_resume, the following condition holds
> > (remember, the initial condition is that usage_count == 1):
> > 
> >   dev->power.runtime_status == RPM_ACTIVE
> > 
> > so retval is set to 1 and execution jumps directly to "out", without setting "parent".
> > So there does not seem to be any spinlock reacquire on this path, or am I
> > misunderstanding how the "runtime_status" works ?
> 
> No, you're not I think, the above is correct.  I was referring to the scenario
> in which the device was RPM_SUSPENDED initially.

Good to know I'm not losing it. ;-)

> 
> > > In the meantime it sets
> > > ->power.runtime_status so that __pm_runtime_idle() will fail if run at this
> > > point.
> > 
> > runtime_status will be left at "RPM_ACTIVE", which is the appropriate value
> > expected by __pm_runtime_idle.
> > 
> > > 
> > > > 6)     (execute the body of __pm_request_resume and return)
> > > > 7)                                                          __pm_runtime_put() (sync = true) 
> > > > 8)                                                          if (atomic_dec_and_test(&dev->power.usage_count))
> > > >                                                               (still see usage_count == 1 before decrement,
> > > >                                                                thus decrement to 0)
> > > > 9)                                                             pm_runtime_idle()
> > > > 10)  spin_unlock_irqrestore(&dev->power.lock, flags)
> > > > 11)                                                            spin_lock_irq(&dev->power.lock);
> > > > 12)                                                            retval = __pm_runtime_idle(dev);
> > > 
> > > Moreover, __pm_runtime_idle() checks ->power.usage_count under the spinlock,
> > > so it will see it's been incremented in the meantime and it will back off.
> > 
> > This is a subtle but important point. Yes, my scenario seems to be dealt with by
> > the extra usage_count check while the spinlock is held.
> > 
> > How about adding a comment under this atomic_inc() stating that the memory
> > barriers are implicitly dealt with by the following spinlock release and the
> > extra check while spinlock is held ?
> > 
> > Commenting memory barriers is important, but commenting why memory barriers are
> > not needed due to a subtle corner-case looks even more important.
> 
> Well, given that this discussion is taking place at all, I admit that it would
> be good to document this somehow. :-)

Yep, it's astonishing how a few comments can end up saving lots of emails from
confused reviewers. ;-)

> 
> I'll take care of that.
> 
> > (hrm, but more below considering pm_runtime_get_noresume())
> > 
> > > 
> > > > 13)                                                            spin_unlock_irq(&dev->power.lock);
> > > > 
> > > > So we end up in a situation where CPU A expects the device to be resumed, but
> > > > the last action performed has been to bring it to idle.
> > > >
> > > > A smp_mb__after_atomic_inc() between lines 2 and 3 would fix this.
> > > 
> > > I don't think this particular race is possible.  However, there is another one
> > > that seems to be possible (in a different function) that an explicit barrier
> > > will prevent from happening.
> > > 
> > > It's related to pm_runtime_get_noresume(), but I think it's better to put the
> > > barrier where it's necessary rather than into pm_runtime_get_noresume() itself.
> > 
> > Quoting your following mail:
> > 
> > > Actually, no.  Since rpm_idle() and rpm_suspend() both check usage_count under
> > > the spinlock, the race I was thinking about doesn't appear to be possible
> > > after all.
> > 
> > Hrm, for the extra-usage_count-under-spinlock check to work, all
> > pm_runtime_get_noresume() callers should grab and release the dev->power.lock
> > after incrementing the usage_count. This does not seem to be the case though. So
> > you might really have a race there.
> > 
> > So every code path that does:
> > 
> > 1) pm_runtime_get_noresume(dev);
> > 
> > 2) ...
> > 
> > 3) pm_runtime_put_noidle(dev);
> > 
> > expecting that the device state cannot be changed between 1 and 3 might be
> > surprised by a concurrent call to __pm_runtime_idle() that would put a device to
> > idle (or similarly with suspend) due to lack of memory barrier after the atomic
> > increment.
> > 
> > Or am I missing something else ?
> 
> First of all, the device can always be resumed regardless of the usage_count
> value.  usage_count is only used to block attempts to suspend the device and
> execute its driver's ->runtime_idle() callback after it has been resumed.
> That's why the "normal" pm_runtime_get() queues up a resume request.
> 
> IOW, the _get() only becomes meaningful after attempting to resume the device
> (which is what I tried to tell Arjan in one of the previous messages).

OK

> 
> Second, there's no synchronization between pm_runtime_get_noresume() and
> pm_runtime_suspend/idle() etc., so calling pm_runtime_get_noresume() is
> certainly insufficient to block pm_runtime_suspend/idle() regardless of memory
> barriers (there may be one already in progress when _get_noresume() is called).

Agreed, I was wondering how this was expected to work.

> To limit possible status changes from happening one should (at least) run
> pm_runtime_barrier() (surprise, no? ;-)) after pm_runtime_get_noresume().

Hrm, then why export pm_runtime_get_noresume() at all ? I don't feel comfortable
with some of the pm_runtime_get_noresume() callers.

> 
> So if you don't want to resume the device immediately after increasing its
> usage_count (in which case it's better to use pm_runtime_get_sync()), you
> should do something like this:
> 
> 1) pm_runtime_get_noresume(dev);
> 1a) pm_runtime_barrier(dev);  // That takes care of all pending requests etc.
> 
> 2) ...
>  
> 3) pm_runtime_put_noidle(dev);
> 
> [The meaning of pm_runtime_barrier() is that all of the runtime PM activity
> started before the barrier has been completed when it returns.]
> 
> There's one place in the PM core where that really is necessary, but I wouldn't
> recommend anyone doing anything like it in a driver.

grep -r pm_runtime_get_noresume drivers/    hands out very interesting info.

e.g.:

drivers/usb/core/driver.c: usb_autopm_get_interface_async()

        pm_runtime_get_noresume(&intf->dev);
        s = ACCESS_ONCE(intf->dev.power.runtime_status);
        if (s == RPM_SUSPENDING || s == RPM_SUSPENDED)
                status = pm_request_resume(&intf->dev);

How is this supposed to work ?

If the ACCESS_ONCE can be reordered before the atomic_inc(), then I fear the
device can be suspended even after the check.

My point is that a get/put semantic should imply memory barriers, especially if
these are exported APIs.
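
To make the ordering concern concrete, here is a minimal userspace sketch of a
get that carries a full barrier, in the style of the kref get quoted earlier in
this thread.  All names (struct model_power, model_get_noresume_mb(),
model_needs_resume()) are hypothetical; atomic_thread_fence(memory_order_seq_cst)
stands in for smp_mb__after_atomic_inc(), and the relaxed load stands in for
ACCESS_ONCE().  It only shows where the barrier would sit and does not model
the suspending side, which would need matching ordering of its own.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for the runtime PM status values named above. */
enum model_status { MODEL_ACTIVE, MODEL_SUSPENDING, MODEL_SUSPENDED };

struct model_power {
	atomic_int usage_count;                   /* models power.usage_count */
	_Atomic enum model_status runtime_status; /* models power.runtime_status */
};

/*
 * A "get" with a full barrier after the increment.  The fence keeps later
 * loads (such as the status read below) from being ordered before the
 * increment.
 */
static void model_get_noresume_mb(struct model_power *p)
{
	assert(p != NULL);
	atomic_fetch_add_explicit(&p->usage_count, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
}

/*
 * Mirrors the shape of the quoted USB snippet: take a reference, then read
 * the status once and report whether a resume request would be needed.
 */
static int model_needs_resume(struct model_power *p)
{
	enum model_status s;

	model_get_noresume_mb(p);
	s = atomic_load_explicit(&p->runtime_status, memory_order_relaxed);
	return s == MODEL_SUSPENDING || s == MODEL_SUSPENDED;
}
```

In this model the status read cannot move ahead of the increment.  Whether that
alone closes the window depends on what the suspending side does between its
own usage_count check and its status update, which the sketch leaves out.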

Thanks,

Mathieu


> 
> Thanks,
> Rafael

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [linux-pm] [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-27 10:22                         ` Rafael J. Wysocki
@ 2010-10-27 12:21                           ` Mathieu Desnoyers
  2010-10-27 14:32                             ` Alan Stern
                                               ` (4 more replies)
  2010-10-27 12:21                           ` Mathieu Desnoyers
  1 sibling, 5 replies; 135+ messages in thread
From: Mathieu Desnoyers @ 2010-10-27 12:21 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: linux-pm, Alan Stern, Paul E. McKenney, Pierre Tardy,
	Peter Zijlstra, Ingo Molnar, Jean Pihet, Steven Rostedt,
	linux-trace-users, Frank Eigler, Linus Torvalds,
	Frederic Weisbecker, Masami Hiramatsu, Tejun Heo, Andrew Morton,
	linux-omap, Arjan van de Ven, Thomas Gleixner

* Rafael J. Wysocki (rjw@sisk.pl) wrote:
> On Wednesday, October 27, 2010, Mathieu Desnoyers wrote:
> > * Rafael J. Wysocki (rjw@sisk.pl) wrote:
> > > On Tuesday, October 26, 2010, Mathieu Desnoyers wrote:
> > > > * Alan Stern (stern@rowland.harvard.edu) wrote:
> > > > > On Tue, 26 Oct 2010, Mathieu Desnoyers wrote:
> > > > > 
> > > > > > * Peter Zijlstra (peterz@infradead.org) wrote:
> > > > > > > On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
> > > > > > > > 
> > > > > > > > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> > > > > > > >         atomic_inc(&dev->power.usage_count); 
> > > > > > > 
> > > > > > > That's terribly racy..
> > > > > > 
> > > > > > Looking at the original code, it looks racy even without considering the
> > > > > > tracepoint:
> > > > > > 
> > > > > > int __pm_runtime_get(struct device *dev, bool sync)
> > > > > >  {
> > > > > >         int retval;
> > > > > > 
> > > > > > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> > > > > >         atomic_inc(&dev->power.usage_count);
> > > > > >         retval = sync ? pm_runtime_resume(dev) : pm_request_resume(dev);
> > > > > > 
> > > > > > There is no implied memory barrier after "atomic_inc". So either all these
> > > > > > inc/dec are protected with mutexes or spinlocks, in which case one might wonder
> > > > > > why atomic operations are used at all, or it's a racy mess. (I vote for the
> > > > > > second option)
> > > > > 
> > > > > I don't understand.  What's the problem?  The inc/dec are atomic 
> > > > > because they are not protected by spinlocks, but everything else is 
> > > > > (aside from the tracepoint, which is new).
> > > > > 
> > > > > > kref should certainly be used there.
> > > > > 
> > > > > What for?
> > > > 
> > > > kref has the following "get":
> > > > 
> > > >         atomic_inc(&kref->refcount);
> > > >         smp_mb__after_atomic_inc();
> > > > 
> > > > What seems to be missing in __pm_runtime_get() and pm_runtime_get_noresume() is
> > > > the memory barrier after the atomic increment. The atomic increment is free to
> > > > be reordered into the following spinlock (within pm_runtime_resume or
> > > > pm_request_resume execution) because taking a spinlock only acts as a memory
> > > > barrier with acquire semantics, not a full memory barrier.
> > > >
> > > > So AFAIU, the failure scenario would be as follows (sorry for the 80+ columns):
> > > > 
> > > > initial conditions: usage_count = 1
> > > > 
> > > > CPU A                                                       CPU B
> > > > 1) __pm_runtime_get() (sync = true)
> > > > 2)   atomic_inc(&usage_count) (not committed to memory yet)
> > > > 3)   pm_runtime_resume()
> > > > 4)     spin_lock_irqsave(&dev->power.lock, flags);
> > > > 5)     retval = __pm_request_resume(dev);
> > > 
> > > If sync = true this is
> > >            retval = __pm_runtime_resume(dev);
> > > which drops and reacquires the spinlock.
> > 
> > Let's see. Upon entry in __pm_runtime_resume, the following condition holds
> > (remember, the initial condition is that usage_count == 1):
> > 
> >   dev->power.runtime_status == RPM_ACTIVE
> > 
> > so retval is set to 1 and execution jumps directly to "out", without setting "parent".
> > So there does not seem to be any spinlock reacquire on this path, or am I
> > misunderstanding how the "runtime_status" works ?
> 
> No, you're not I think, the above is correct.  I was referring to the scenario
> in which the device was RPM_SUSPENDED initially.

Good to know I'm not losing it. ;-)

> 
> > > In the meantime it sets
> > > ->power.runtime_status so that __pm_runtime_idle() will fail if run at this
> > > point.
> > 
> > runtime_status will be left at "RPM_ACTIVE", which is the appropriate value
> > expected by __pm_runtime_idle.
> > 
> > > 
> > > > 6)     (execute the body of __pm_request_resume and return)
> > > > 7)                                                          __pm_runtime_put() (sync = true) 
> > > > 8)                                                          if (atomic_dec_and_test(&dev->power.usage_count))
> > > >                                                               (still see usage_count == 1 before decrement,
> > > >                                                                thus decrement to 0)
> > > > 9)                                                             pm_runtime_idle()
> > > > 10)  spin_unlock_irqrestore(&dev->power.lock, flags)
> > > > 11)                                                            spin_lock_irq(&dev->power.lock);
> > > > 12)                                                            retval = __pm_runtime_idle(dev);
> > > 
> > > Moreover, __pm_runtime_idle() checks ->power.usage_count under the spinlock,
> > > so it will see it's been incremented in the meantime and it will back off.
> > 
> > This is a subtle but important point. Yes, my scenario seems to be dealt with by
> > the extra usage_count check while the spinlock is held.
> > 
> > How about adding a comment under this atomic_inc() stating that the memory
> > barriers are implicitly dealt with by the following spinlock release and the
> > extra check while spinlock is held ?
> > 
> > Commenting memory barriers is important, but commenting why memory barriers are
> > not needed due to a subtle corner-case looks even more important.
> 
> Well, given that this discussion is taking place at all, I admit that it would
> be good to document this somehow. :-)

Yep, it's astonishing how a few comments can end up saving lots of emails from
confused reviewers. ;-)

> 
> I'll take care of that.
> 
> > (hrm, but more below considering pm_runtime_get_noresume())
> > 
> > > 
> > > > 13)                                                            spin_unlock_irq(&dev->power.lock);
> > > > 
> > > > So we end up in a situation where CPU A expects the device to be resumed, but
> > > > the last action performed has been to bring it to idle.
> > > >
> > > > A smp_mb__after_atomic_inc() between lines 2 and 3 would fix this.
> > > 
> > > I don't think this particular race is possible.  However, there is another one
> > > that seems to be possible (in a different function) that an explicit barrier
> > > will prevent from happening.
> > > 
> > > It's related to pm_runtime_get_noresume(), but I think it's better to put the
> > > barrier where it's necessary rather than into pm_runtime_get_noresume() itself.
> > 
> > Quoting your following mail:
> > 
> > > Actually, no.  Since rpm_idle() and rpm_suspend() both check usage_count under
> > > the spinlock, the race I was thinking about doesn't appear to be possible
> > > after all.
> > 
> > Hrm, for the extra-usage_count-under-spinlock check to work, all
> > pm_runtime_get_noresume() callers should grab and release the dev->power.lock
> > after incrementing the usage_count. This does not seem to be the case though. So
> > you might really have a race there.
> > 
> > So every code path that does:
> > 
> > 1) pm_runtime_get_noresume(dev);
> > 
> > 2) ...
> > 
> > 3) pm_runtime_put_noidle(dev);
> > 
> > expecting that the device state cannot be changed between 1 and 3 might be
> > surprised by a concurrent call to __pm_runtime_idle() that would put a device to
> > idle (or similarly with suspend) due to lack of memory barrier after the atomic
> > increment.
> > 
> > Or am I missing something else ?
> 
> First of all, the device can always be resumed regardless of the usage_count
> value.  usage_count is only used to block attempts to suspend the device and
> execute its driver's ->runtime_idle() callback after it has been resumed.
> That's why the "normal" pm_runtime_get() queues up a resume request.
> 
> IOW, the _get() only becomes meaningful after attempting to resume the device
> (which is what I tried to tell Arjan in one of the previous messages).

OK

> 
> Second, there's no synchronization between pm_runtime_get_noresume() and
> pm_runtime_suspend/idle() etc., so calling pm_runtime_get_noresume() is
> certainly insufficient to block pm_runtime_suspend/idle() regardless of memory
> barriers (there may be one already in progress when _get_noresume() is called).

Agreed, I was wondering how this was expected to work.

> To limit possible status changes from happening one should (at least) run
> pm_runtime_barrier() (surprise, no? ;-)) after pm_runtime_get_noresume().

Hrm, then why export pm_runtime_get_noresume() at all ? I don't feel comfortable
with some of the pm_runtime_get_noresume() callers.

> 
> So if you don't want to resume the device immediately after increasing its
> usage_count (in which case it's better to use pm_runtime_get_sync()), you
> should do something like this:
> 
> 1) pm_runtime_get_noresume(dev);
> 1a) pm_runtime_barrier(dev);  // That takes care of all pending requests etc.
> 
> 2) ...
>  
> 3) pm_runtime_put_noidle(dev);
> 
> [The meaning of pm_runtime_barrier() is that all of the runtime PM activity
> started before the barrier has been completed when it returns.]
> 
> There's one place in the PM core where that really is necessary, but I wouldn't
> recommend anyone doing anything like it in a driver.

grep -r pm_runtime_get_noresume drivers/    hands out very interesting info.

e.g.:

drivers/usb/core/driver.c: usb_autopm_get_interface_async()

        pm_runtime_get_noresume(&intf->dev);
        s = ACCESS_ONCE(intf->dev.power.runtime_status);
        if (s == RPM_SUSPENDING || s == RPM_SUSPENDED)
                status = pm_request_resume(&intf->dev);

How is this supposed to work ?

If the ACCESS_ONCE can be reordered before the atomic_inc(), then I fear the
device can be suspended even after the check.

My point is that a get/put semantic should imply memory barriers, especially if
these are exported APIs.
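
To make the concern concrete, here is a small userspace sketch in C11 atomics,
not kernel code. The names (model_pm, model_get_and_check, the MODEL_* states)
are made up for illustration, and atomic_thread_fence() stands in for what
smp_mb__after_atomic_inc() would do after the increment. Note that a fence on
this side alone only helps if the suspend path orders its own status update
and usage_count check correspondingly.

```c
#include <stdatomic.h>

/* Hypothetical stand-ins for the runtime PM status values. */
enum model_status { MODEL_ACTIVE, MODEL_SUSPENDING, MODEL_SUSPENDED };

struct model_pm {
	atomic_int usage_count;     /* models dev->power.usage_count */
	atomic_int runtime_status;  /* models dev->power.runtime_status */
};

/*
 * Models the get-then-check sequence.  Returns 1 if the caller would
 * need to request a resume, 0 otherwise.
 */
static int model_get_and_check(struct model_pm *pm)
{
	int s;

	/* The increment itself carries no ordering (relaxed). */
	atomic_fetch_add_explicit(&pm->usage_count, 1, memory_order_relaxed);

	/*
	 * Full fence: stands in for smp_mb__after_atomic_inc(), so the
	 * status load below is not ordered before the increment above.
	 */
	atomic_thread_fence(memory_order_seq_cst);

	s = atomic_load_explicit(&pm->runtime_status, memory_order_relaxed);
	return s == MODEL_SUSPENDING || s == MODEL_SUSPENDED;
}
```

Without the fence, nothing in this model stops the status load from being
satisfied before the increment becomes visible to other CPUs, which is the
window I am worried about.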

Thanks,

Mathieu


> 
> Thanks,
> Rafael

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [linux-pm] [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-27 12:21                           ` Mathieu Desnoyers
  2010-10-27 14:32                             ` Alan Stern
  2010-10-27 14:32                             ` Alan Stern
@ 2010-10-27 14:32                             ` Alan Stern
  2010-10-28 15:22                               ` Alan Stern
  2010-10-28 15:22                               ` [linux-pm] " Alan Stern
  2010-10-27 21:43                             ` Rafael J. Wysocki
  2010-10-27 21:43                             ` Rafael J. Wysocki
  4 siblings, 2 replies; 135+ messages in thread
From: Alan Stern @ 2010-10-27 14:32 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Rafael J. Wysocki, Linux-pm mailing list, Paul E. McKenney,
	Kernel development list

[CC: list trimmed drastically, on the assumption that most of the 
people on it aren't very interested in the details of the PM runtime 
memory barriers.]

On Wed, 27 Oct 2010, Mathieu Desnoyers wrote:

> grep -r pm_runtime_get_noresume drivers/    hands out very interesting info.
> 
> e.g.:
> 
> drivers/usb/core/driver.c: usb_autopm_get_interface_async()
> 
>         pm_runtime_get_noresume(&intf->dev);
>         s = ACCESS_ONCE(intf->dev.power.runtime_status);
>         if (s == RPM_SUSPENDING || s == RPM_SUSPENDED)
>                 status = pm_request_resume(&intf->dev);
> 
> How is this supposed to work ?

It's worth pointing out that this code is going to be removed during
the 2.6.38 development cycle, due to ongoing changes in the runtime PM
core.  It would have been removed already if not for the difficulty of
coordinating cross-subsystem changes.

But it's legitimate to ask how the code _was_ supposed to work...

> If the ACCESS_ONCE can be reordered before the atomic_inc(), then I fear the
> device can be suspended even after the check.

You are correct; the code as written may sometimes fail.  It was a
hack from the beginning; the kind of test it performs should not be
done outside the PM core.  However at the time it was the easiest way 
to do what I wanted.

> My point is that a get/put semantic should imply memory barriers, especially if
> these are exported APIs.

As far as I am aware, apart from the hack above,
pm_runtime_get_noresume is called only in places where either:

	it is purely advisory (e.g., we know that we will use the
	device in the near future so we would prefer to prevent it from
	being suspended, but we don't really care because we're going
	to call pm_runtime_resume_sync before using it anyway);

	or we already know that the usage_count is > 0.

No memory barrier is required for either of these cases.

Alan Stern


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-27 12:21                           ` Mathieu Desnoyers
                                               ` (3 preceding siblings ...)
  2010-10-27 21:43                             ` Rafael J. Wysocki
@ 2010-10-27 21:43                             ` Rafael J. Wysocki
  4 siblings, 0 replies; 135+ messages in thread
From: Rafael J. Wysocki @ 2010-10-27 21:43 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-omap, Arjan van de Ven, Andrew Morton, Thomas Gleixner,
	Pierre Tardy, Peter Zijlstra, Frederic Weisbecker, Jean Pihet,
	Steven Rostedt, linux-trace-users, Frank Eigler,
	Masami Hiramatsu, Tejun Heo, linux-pm, Paul E. McKenney,
	Linus Torvalds, Ingo Molnar

On Wednesday, October 27, 2010, Mathieu Desnoyers wrote:
> * Rafael J. Wysocki (rjw@sisk.pl) wrote:
...
> 
> Hrm, then why export pm_runtime_get_noresume() at all ?

Basically, the PM core needs it for some obscure stuff.  Beyond that people
really should use it with care (preferably avoid using it at all).

> I don't feel comfortable with some of the pm_runtime_get_noresume() callers.
> 
> > 
> > So if you don't want to resume the device immediately after increasing its
> > usage_count (in which case it's better to use pm_runtime_get_sync()), you
> > should do something like this:
> > 
> > 1) pm_runtime_get_noresume(dev);
> > 1a) pm_runtime_barrier(dev);  // That takes care of all pending requests etc.
> > 
> > 2) ...
> >  
> > 3) pm_runtime_put_noidle(dev);
> > 
> > [The meaning of pm_runtime_barrier() is that all of the runtime PM activity
> > started before the barrier has been completed when it returns.]
> > 
> > There's one place in the PM core where that really is necessary, but I wouldn't
> > recommend anyone doing anything like it in a driver.
> 
> grep -r pm_runtime_get_noresume drivers/    hands out very interesting info.
> 
> e.g.:
> 
> drivers/usb/core/driver.c: usb_autopm_get_interface_async()
> 
>         pm_runtime_get_noresume(&intf->dev);
>         s = ACCESS_ONCE(intf->dev.power.runtime_status);
>         if (s == RPM_SUSPENDING || s == RPM_SUSPENDED)
>                 status = pm_request_resume(&intf->dev);
> 
> How is this supposed to work ?
> 
> If the ACCESS_ONCE can be reordered before the atomic_inc(), then I fear the
> device can be suspended even after the check.
> 
> My point is that a get/put semantic should imply memory barriers, especially if
> these are exported APIs.

Well, IMO adding a memory barrier to pm_runtime_get_noresume() wouldn't really
change a lot (it still would be racy with respect to some other runtime PM helper
functions).  That said I guess we should put a "handle with care" sticker on it.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [linux-pm] [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-27 14:32                             ` [linux-pm] " Alan Stern
  2010-10-28 15:22                               ` Alan Stern
@ 2010-10-28 15:22                               ` Alan Stern
  1 sibling, 0 replies; 135+ messages in thread
From: Alan Stern @ 2010-10-28 15:22 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Linux-pm mailing list, Paul E. McKenney, Kernel development list

On Wed, 27 Oct 2010, Alan Stern wrote:

> On Wed, 27 Oct 2010, Mathieu Desnoyers wrote:
> 
> > grep -r pm_runtime_get_noresume drivers/    hands out very interesting info.
> > 
> > e.g.:
> > 
> > drivers/usb/core/driver.c: usb_autopm_get_interface_async()
> > 
> >         pm_runtime_get_noresume(&intf->dev);
> >         s = ACCESS_ONCE(intf->dev.power.runtime_status);
> >         if (s == RPM_SUSPENDING || s == RPM_SUSPENDED)
> >                 status = pm_request_resume(&intf->dev);
> > 
> > How is this supposed to work ?

> > If the ACCESS_ONCE can be reordered before the atomic_inc(), then I fear the
> > device can be suspended even after the check.
> 
> You are correct; the code as written may sometimes fail.  It was a
> hack from the beginning; the kind of test it performs should not be
> done outside the PM core.  However at the time it was the easiest way 
> to do what I wanted.

I forgot to mention one other thing...  The fact that this code will
sometimes behave unexpectedly isn't a bug.  That function is documented
as requiring additional locking when a driver uses it.  The need for
extra locking is unavoidable because I/O requests can arrive at any
time, even while a runtime suspend is in progress.

Therefore the fact that usb_autopm_get_interface_async() can race with 
a runtime suspend doesn't matter.  The driver making the call should 
have sufficient locking to know that the runtime suspend should fail 
because the driver is busy.
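
As a rough userspace sketch of what "sufficient locking" could look like (this
is a model, not code from any real driver): the driver keeps its own lock and
a count of outstanding I/O, and its suspend routine refuses with -EBUSY while
anything is pending.  The names model_driver, model_submit_io,
model_complete_io and model_runtime_suspend are invented for the example.

```c
#include <errno.h>
#include <pthread.h>

/* Hypothetical driver-private state guarded by the driver's own lock. */
struct model_driver {
	pthread_mutex_t lock;
	int pending_io;   /* I/O requests submitted but not yet completed */
	int suspended;    /* nonzero once the suspend routine has succeeded */
};

static void model_driver_init(struct model_driver *drv)
{
	pthread_mutex_init(&drv->lock, NULL);
	drv->pending_io = 0;
	drv->suspended = 0;
}

/* I/O can arrive at any time; record it under the driver lock. */
static void model_submit_io(struct model_driver *drv)
{
	pthread_mutex_lock(&drv->lock);
	drv->pending_io++;
	pthread_mutex_unlock(&drv->lock);
}

static void model_complete_io(struct model_driver *drv)
{
	pthread_mutex_lock(&drv->lock);
	if (drv->pending_io > 0)
		drv->pending_io--;
	pthread_mutex_unlock(&drv->lock);
}

/* Models a suspend callback that fails while the driver is busy. */
static int model_runtime_suspend(struct model_driver *drv)
{
	int ret = 0;

	pthread_mutex_lock(&drv->lock);
	if (drv->pending_io > 0)
		ret = -EBUSY;        /* driver is busy: refuse to suspend */
	else
		drv->suspended = 1;
	pthread_mutex_unlock(&drv->lock);
	return ret;
}
```

Because both paths take the same lock, a suspend attempt that races with a
submission either sees the pending request and fails, or completes before the
request is recorded.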

Alan Stern


^ permalink raw reply	[flat|nested] 135+ messages in thread

end of thread, other threads:[~2010-10-28 15:23 UTC | newest]

Thread overview: 135+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <1287488171-25303-1-git-send-email-trenn@suse.de>
2010-10-19 11:36 ` [PATCH 1/3] PERF: Do not export power_frequency, but power_start event Thomas Renninger
2010-10-19 11:36 ` Thomas Renninger
2010-10-19 11:36 ` [PATCH 2/3] PERF(kernel): Cleanup power events Thomas Renninger
2010-10-25  6:54   ` Arjan van de Ven
2010-10-25  6:54   ` Arjan van de Ven
2010-10-25  9:41     ` Thomas Renninger
2010-10-25 13:55       ` Arjan van de Ven
2010-10-25 13:55       ` Arjan van de Ven
2010-10-25 14:36         ` Thomas Renninger
2010-10-25 14:45           ` Arjan van de Ven
2010-10-25 14:45           ` Arjan van de Ven
2010-10-25 14:56             ` Ingo Molnar
2010-10-25 14:56             ` Ingo Molnar
2010-10-25 15:48               ` Thomas Renninger
2010-10-25 16:00                 ` Arjan van de Ven
2010-10-25 23:32                   ` Thomas Renninger
2010-10-25 23:32                   ` Thomas Renninger
2010-10-25 16:00                 ` Arjan van de Ven
2010-10-25 15:48               ` Thomas Renninger
2010-10-25 14:36         ` Thomas Renninger
2010-10-25  9:41     ` Thomas Renninger
2010-10-25  6:58   ` Arjan van de Ven
2010-10-25  6:58   ` Arjan van de Ven
2010-10-25 10:04   ` Ingo Molnar
2010-10-25 10:04   ` Ingo Molnar
2010-10-25 11:03     ` Thomas Renninger
2010-10-25 11:03     ` Thomas Renninger
2010-10-25 11:55       ` Ingo Molnar
2010-10-25 12:55         ` Thomas Renninger
2010-10-25 14:11           ` Arjan van de Ven
2010-10-25 14:11           ` Arjan van de Ven
2010-10-25 14:51             ` Thomas Renninger
2010-10-25 14:51             ` Thomas Renninger
2010-10-25 12:55         ` Thomas Renninger
2010-10-25 12:58         ` Mathieu Desnoyers
2010-10-25 12:58         ` Mathieu Desnoyers
2010-10-25 20:29           ` Rafael J. Wysocki
2010-10-25 20:29           ` Rafael J. Wysocki
2010-10-25 11:55       ` Ingo Molnar
2010-10-25 13:58       ` Arjan van de Ven
2010-10-25 13:58       ` Arjan van de Ven
2010-10-25 20:33         ` Rafael J. Wysocki
2010-10-25 20:33         ` Rafael J. Wysocki
2010-10-25 23:33   ` [PATCH] PERF(kernel): Cleanup power events V2 Thomas Renninger
2010-10-26  1:09     ` Arjan van de Ven
2010-10-26  1:09     ` Arjan van de Ven
2010-10-26  7:10     ` Ingo Molnar
2010-10-26  7:10     ` Ingo Molnar
2010-10-26  8:08       ` Jean Pihet
2010-10-26 11:21         ` Ingo Molnar
2010-10-26 11:48           ` Thomas Renninger
2010-10-26 11:48           ` Thomas Renninger
2010-10-26 11:54             ` Ingo Molnar
2010-10-26 11:54             ` Ingo Molnar
2010-10-26 13:17               ` Thomas Renninger
2010-10-26 13:35                 ` Thomas Renninger
2010-10-26 13:35                 ` Thomas Renninger
2010-10-26 13:17               ` Thomas Renninger
2010-10-26 18:57             ` Rafael J. Wysocki
2010-10-27  0:00               ` Thomas Renninger
2010-10-27  9:16                 ` Rafael J. Wysocki
2010-10-27  9:16                 ` Rafael J. Wysocki
2010-10-27  0:00               ` Thomas Renninger
2010-10-26 18:57             ` Rafael J. Wysocki
2010-10-26 11:21         ` Ingo Molnar
2010-10-26  8:08       ` Jean Pihet
2010-10-26  9:58       ` Arjan van de Ven
2010-10-26 10:19         ` Ingo Molnar
2010-10-26 10:19         ` Ingo Molnar
2010-10-26  9:58       ` Arjan van de Ven
2010-10-26 10:37       ` Thomas Renninger
2010-10-26 10:37       ` Thomas Renninger
2010-10-26 11:19         ` Ingo Molnar
2010-10-26 11:19         ` Ingo Molnar
2010-10-26 19:01           ` Rafael J. Wysocki
2010-10-26 19:01           ` Rafael J. Wysocki
2010-10-26 15:32       ` Pierre Tardy
2010-10-26 16:04         ` Arjan van de Ven
2010-10-26 16:04         ` Arjan van de Ven
2010-10-26 16:56           ` Pierre Tardy
2010-10-26 17:58             ` Peter Zijlstra
2010-10-26 18:14               ` Mathieu Desnoyers
2010-10-26 18:14               ` Mathieu Desnoyers
2010-10-26 18:50                 ` [linux-pm] " Alan Stern
2010-10-26 21:33                   ` Mathieu Desnoyers
2010-10-26 21:33                   ` [linux-pm] " Mathieu Desnoyers
2010-10-26 22:20                     ` Rafael J. Wysocki
2010-10-26 22:20                     ` [linux-pm] " Rafael J. Wysocki
2010-10-26 22:39                       ` Rafael J. Wysocki
2010-10-26 22:39                       ` [linux-pm] " Rafael J. Wysocki
2010-10-27  0:46                       ` Mathieu Desnoyers
2010-10-27 10:22                         ` Rafael J. Wysocki
2010-10-27 12:21                           ` Mathieu Desnoyers
2010-10-27 14:32                             ` Alan Stern
2010-10-27 14:32                             ` Alan Stern
2010-10-27 14:32                             ` [linux-pm] " Alan Stern
2010-10-28 15:22                               ` Alan Stern
2010-10-28 15:22                               ` [linux-pm] " Alan Stern
2010-10-27 21:43                             ` Rafael J. Wysocki
2010-10-27 21:43                             ` Rafael J. Wysocki
2010-10-27 12:21                           ` Mathieu Desnoyers
2010-10-27 10:22                         ` Rafael J. Wysocki
2010-10-27  0:46                       ` Mathieu Desnoyers
2010-10-26 18:50                 ` Alan Stern
2010-10-26 19:04                 ` Rafael J. Wysocki
2010-10-26 19:04                 ` Rafael J. Wysocki
2010-10-26 21:38                   ` Mathieu Desnoyers
2010-10-26 21:38                   ` Mathieu Desnoyers
2010-10-26 22:22                     ` Rafael J. Wysocki
2010-10-26 22:22                     ` Rafael J. Wysocki
2010-10-26 18:15               ` Pierre Tardy
2010-10-26 19:08                 ` Rafael J. Wysocki
2010-10-26 20:23                   ` Pierre Tardy
2010-10-26 20:23                   ` Pierre Tardy
2010-10-26 20:38                     ` Rafael J. Wysocki
2010-10-26 20:38                     ` Rafael J. Wysocki
2010-10-26 20:52                       ` Arjan van de Ven
2010-10-26 20:52                       ` Arjan van de Ven
2010-10-26 21:17                         ` Rafael J. Wysocki
2010-10-26 21:17                         ` Rafael J. Wysocki
2010-10-26 19:08                 ` Rafael J. Wysocki
2010-10-26 18:15               ` Pierre Tardy
2010-10-26 17:58             ` Peter Zijlstra
2010-10-26 16:56           ` Pierre Tardy
2010-10-26 15:32       ` Pierre Tardy
2010-10-26  7:59     ` Jean Pihet
2010-10-26  7:59     ` Jean Pihet
2010-10-26 18:52     ` Rafael J. Wysocki
2010-10-26 18:52     ` Rafael J. Wysocki
2010-10-25 23:33   ` Thomas Renninger
2010-10-19 11:36 ` [PATCH 2/3] PERF(kernel): Cleanup power events Thomas Renninger
2010-10-19 11:36 ` [PATCH 3/3] PERF(userspace): Adjust perf timechart to the new " Thomas Renninger
2010-10-19 11:36 ` Thomas Renninger
2010-10-26  0:18   ` [PATCH] PERF(userspace): Adjust perf timechart to the new power events V2 Thomas Renninger
2010-10-26  0:18   ` Thomas Renninger
