* [PATCH 1/3] PERF: Do not export power_frequency, but power_start event
       [not found] <1287488171-25303-1-git-send-email-trenn@suse.de>
@ 2010-10-19 11:36 ` Thomas Renninger
  2010-10-19 11:36 ` Thomas Renninger
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 157+ messages in thread
From: Thomas Renninger @ 2010-10-19 11:36 UTC (permalink / raw)
  To: trenn
  Cc: Arjan van de Ven, linux-trace-users, Frederic Weisbecker,
	Pierre Tardy, Jean Pihet, Steven Rostedt, Peter Zijlstra,
	Frank Eigler, Mathieu Desnoyers, Ingo Molnar, linux-pm,
	Masami Hiramatsu, Tejun Heo, Andrew Morton, linux-omap,
	Linus Torvalds, Thomas Gleixner

power_frequency moved to drivers/cpufreq/cpufreq.c, which has
to be compiled in, so there is no need to export it.

intel_idle can be a module though, so export power_start instead
and drop the #ifndef MODULE guard around its call.

Signed-off-by: Thomas Renninger <trenn@suse.de>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
CC: Frank Eigler <fche@redhat.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Kevin Hilman <khilman@deeprootsystems.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: linux-omap@vger.kernel.org
CC: rjw@sisk.pl
CC: linux-pm@lists.linux-foundation.org
CC: linux-trace-users@vger.kernel.org
CC: Jean Pihet <jean.pihet@newoldbits.com>
CC: Pierre Tardy <tardyp@gmail.com>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Tejun Heo <tj@kernel.org>
CC: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Arjan van de Ven <arjan@linux.intel.com>
CC: Ingo Molnar <mingo@elte.hu>
---
 drivers/idle/intel_idle.c   |    2 --
 kernel/trace/power-traces.c |    2 +-
 2 files changed, 1 insertions(+), 3 deletions(-)

diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index c37ef64..21ac077 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -201,9 +201,7 @@ static int intel_idle(struct cpuidle_device *dev, struct cpuidle_state *state)
 	kt_before = ktime_get_real();
 
 	stop_critical_timings();
-#ifndef MODULE
 	trace_power_start(POWER_CSTATE, (eax >> 4) + 1, cpu);
-#endif
 	if (!need_resched()) {
 
 		__monitor((void *)&current_thread_info()->flags, 0, 0);
diff --git a/kernel/trace/power-traces.c b/kernel/trace/power-traces.c
index a22582a..0e0497d 100644
--- a/kernel/trace/power-traces.c
+++ b/kernel/trace/power-traces.c
@@ -13,5 +13,5 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/power.h>
 
-EXPORT_TRACEPOINT_SYMBOL_GPL(power_frequency);
+EXPORT_TRACEPOINT_SYMBOL_GPL(power_start);
 
-- 
1.6.0.2

* [PATCH 2/3] PERF(kernel): Cleanup power events
       [not found] <1287488171-25303-1-git-send-email-trenn@suse.de>
                   ` (2 preceding siblings ...)
  2010-10-19 11:36 ` [PATCH 2/3] PERF(kernel): Cleanup power events Thomas Renninger
@ 2010-10-19 11:36 ` Thomas Renninger
  2010-10-19 11:36 ` [PATCH 3/3] PERF(userspace): Adjust perf timechart to the new " Thomas Renninger
  2010-10-19 11:36 ` Thomas Renninger
  5 siblings, 0 replies; 157+ messages in thread
From: Thomas Renninger @ 2010-10-19 11:36 UTC (permalink / raw)
  To: trenn
  Cc: Arjan van de Ven, linux-trace-users, Frederic Weisbecker,
	Pierre Tardy, Jean Pihet, Steven Rostedt, Peter Zijlstra,
	Frank Eigler, Mathieu Desnoyers, Ingo Molnar, linux-pm,
	Masami Hiramatsu, Tejun Heo, Andrew Morton, linux-omap,
	Linus Torvalds, Thomas Gleixner

New power trace events:
power:processor_idle
power:processor_frequency
power:machine_suspend


C-state/idle accounting events:
  power:power_start
  power:power_end
are replaced with:
  power:processor_idle

and
  power:power_frequency
is replaced with:
  power:processor_frequency

power:machine_suspend
is newly introduced. A first implementation comes from the
ARM side, but it is easy to add this event on x86 as well if needed.

The type= field got removed from both the idle and the frequency
event. It was never used, and the type is already distinguished
by the event itself.

The perf timechart userspace tool gets adjusted in a separate patch.
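
As a rough illustration of the mapping above (not part of this patch),
the old events could be translated to the new ones as sketched below.
The struct, enum and function names are made up for this example; only
the event names and the type/state/cpu_id fields come from the patch.
The sketch assumes that state 0 of processor_idle marks the end of
idle, which is how the call sites in this patch use it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space mirror of the old and new event kinds. */
enum old_kind { OLD_POWER_START, OLD_POWER_END, OLD_POWER_FREQUENCY };
enum new_kind { NEW_PROCESSOR_IDLE, NEW_PROCESSOR_FREQUENCY };

struct old_event {
	enum old_kind kind;
	uint64_t type;		/* POWER_CSTATE/POWER_PSTATE, dropped below */
	uint64_t state;		/* C-state or frequency, unused for power_end */
	uint64_t cpu_id;
};

struct new_event {
	enum new_kind kind;
	uint64_t state;		/* C-state (0 = idle left) or frequency */
	uint64_t cpu_id;
};

/*
 * Translate one old-style event into its new-style replacement.
 * Assumption: an old power_start with state 0 has no equivalent,
 * because state 0 now means "idle left" (the patch drops that call
 * from poll_idle()).
 */
struct new_event translate_power_event(const struct old_event *old)
{
	struct new_event ev;

	assert(old != NULL);

	ev.kind = NEW_PROCESSOR_IDLE;
	ev.state = 0;
	ev.cpu_id = old->cpu_id;

	switch (old->kind) {
	case OLD_POWER_START:
		ev.state = old->state;	/* entering C-state */
		break;
	case OLD_POWER_END:
		ev.state = 0;		/* leaving idle */
		break;
	case OLD_POWER_FREQUENCY:
		ev.kind = NEW_PROCESSOR_FREQUENCY;
		ev.state = old->state;	/* new frequency */
		break;
	}
	return ev;
}
```

A consumer reading only the new events would treat processor_idle
with state != 0 as the start of a C-state and state == 0 as its end.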

Signed-off-by: Thomas Renninger <trenn@suse.de>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
CC: Frank Eigler <fche@redhat.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Kevin Hilman <khilman@deeprootsystems.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: linux-omap@vger.kernel.org
CC: rjw@sisk.pl
CC: linux-pm@lists.linux-foundation.org
CC: linux-trace-users@vger.kernel.org
CC: Jean Pihet <jean.pihet@newoldbits.com>
CC: Pierre Tardy <tardyp@gmail.com>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Tejun Heo <tj@kernel.org>
CC: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Arjan van de Ven <arjan@linux.intel.com>
CC: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/kernel/process.c    |    5 ++-
 arch/x86/kernel/process_64.c |    1 +
 drivers/cpufreq/cpufreq.c    |    1 +
 drivers/cpuidle/cpuidle.c    |    1 +
 drivers/idle/intel_idle.c    |    1 +
 include/trace/events/power.h |   80 +++++++++++++++++++++++++++++++++++++++++-
 kernel/trace/Kconfig         |   14 +++++++
 kernel/trace/power-traces.c  |    3 ++
 8 files changed, 103 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 57d1868..b6b1578 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -374,6 +374,7 @@ void default_idle(void)
 {
 	if (hlt_use_halt()) {
 		trace_power_start(POWER_CSTATE, 1, smp_processor_id());
+		trace_processor_idle(1, smp_processor_id());
 		current_thread_info()->status &= ~TS_POLLING;
 		/*
 		 * TS_POLLING-cleared state must be visible before we
@@ -444,6 +445,7 @@ EXPORT_SYMBOL_GPL(cpu_idle_wait);
 void mwait_idle_with_hints(unsigned long ax, unsigned long cx)
 {
 	trace_power_start(POWER_CSTATE, (ax>>4)+1, smp_processor_id());
+	trace_processor_idle((ax>>4)+1, smp_processor_id());
 	if (!need_resched()) {
 		if (cpu_has(&current_cpu_data, X86_FEATURE_CLFLUSH_MONITOR))
 			clflush((void *)&current_thread_info()->flags);
@@ -460,6 +462,7 @@ static void mwait_idle(void)
 {
 	if (!need_resched()) {
 		trace_power_start(POWER_CSTATE, 1, smp_processor_id());
+		trace_processor_idle(1, smp_processor_id());
 		if (cpu_has(&current_cpu_data, X86_FEATURE_CLFLUSH_MONITOR))
 			clflush((void *)&current_thread_info()->flags);
 
@@ -480,11 +483,9 @@ static void mwait_idle(void)
  */
 static void poll_idle(void)
 {
-	trace_power_start(POWER_CSTATE, 0, smp_processor_id());
 	local_irq_enable();
 	while (!need_resched())
 		cpu_relax();
-	trace_power_end(0);
 }
 
 /*
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 3d9ea53..2c3254c 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -142,6 +142,7 @@ void cpu_idle(void)
 			start_critical_timings();
 
 			trace_power_end(smp_processor_id());
+			trace_processor_idle(0, smp_processor_id());
 
 			/* In many cases the interrupt that ended idle
 			   has already called exit_idle. But some idle
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 199dcb9..33bdc41 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -355,6 +355,7 @@ void cpufreq_notify_transition(struct cpufreq_freqs *freqs, unsigned int state)
 		dprintk("FREQ: %lu - CPU: %lu", (unsigned long)freqs->new,
 			(unsigned long)freqs->cpu);
 		trace_power_frequency(POWER_PSTATE, freqs->new, freqs->cpu);
+		trace_processor_frequency(freqs->new, freqs->cpu);
 		srcu_notifier_call_chain(&cpufreq_transition_notifier_list,
 				CPUFREQ_POSTCHANGE, freqs);
 		if (likely(policy) && likely(policy->cpu == freqs->cpu))
diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index a507108..f79de04 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -107,6 +107,7 @@ static void cpuidle_idle_call(void)
 	if (cpuidle_curr_governor->reflect)
 		cpuidle_curr_governor->reflect(dev);
 	trace_power_end(smp_processor_id());
+	trace_processor_idle(0, smp_processor_id());
 }
 
 /**
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index 21ac077..c78e496 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -202,6 +202,7 @@ static int intel_idle(struct cpuidle_device *dev, struct cpuidle_state *state)
 
 	stop_critical_timings();
 	trace_power_start(POWER_CSTATE, (eax >> 4) + 1, cpu);
+	trace_processor_idle((eax >> 4) + 1, smp_processor_id());
 	if (!need_resched()) {
 
 		__monitor((void *)&current_thread_info()->flags, 0, 0);
diff --git a/include/trace/events/power.h b/include/trace/events/power.h
index 35a2a6e..d5cecd9 100644
--- a/include/trace/events/power.h
+++ b/include/trace/events/power.h
@@ -7,6 +7,60 @@
 #include <linux/ktime.h>
 #include <linux/tracepoint.h>
 
+DECLARE_EVENT_CLASS(processor,
+
+	TP_PROTO(unsigned int state, unsigned int cpu_id),
+
+	TP_ARGS(state, cpu_id),
+
+	TP_STRUCT__entry(
+		__field(	u64,		state		)
+		__field(	u64,		cpu_id		)
+	),
+
+	TP_fast_assign(
+		__entry->state = state;
+		__entry->cpu_id = cpu_id;
+	),
+
+	TP_printk("state=%lu cpu_id=%lu", (unsigned long)__entry->state,
+		  (unsigned long)__entry->cpu_id)
+);
+
+DEFINE_EVENT(processor, processor_idle,
+
+	TP_PROTO(unsigned int state, unsigned int cpu_id),
+
+	     TP_ARGS(state, cpu_id)
+);
+
+DEFINE_EVENT(processor, processor_frequency,
+
+	TP_PROTO(unsigned int frequency, unsigned int cpu_id),
+
+	TP_ARGS(frequency, cpu_id)
+);
+
+TRACE_EVENT(machine_suspend,
+
+	TP_PROTO(unsigned int state),
+
+	TP_ARGS(state),
+
+	TP_STRUCT__entry(
+		__field(	u64,		state		)
+	),
+
+	TP_fast_assign(
+		__entry->state = state;
+	),
+
+	TP_printk("state=%lu", (unsigned long)__entry->state)
+
+);
+
+#ifdef CONFIG_EVENT_POWER_TRACING_DEPRECATED
+
 #ifndef _TRACE_POWER_ENUM_
 #define _TRACE_POWER_ENUM_
 enum {
@@ -69,8 +123,32 @@ TRACE_EVENT(power_end,
 	TP_printk("cpu_id=%lu", (unsigned long)__entry->cpu_id)
 
 );
-
+#endif /* CONFIG_EVENT_POWER_TRACING_DEPRECATED */
 #endif /* _TRACE_POWER_H */
 
+/* Deprecated dummy functions must be protected against multiple declaration */
+#ifndef EVENT_POWER_TRACING_DEPRECATED_PART_H
+#define EVENT_POWER_TRACING_DEPRECATED_PART_H
+
+#ifndef CONFIG_EVENT_POWER_TRACING_DEPRECATED
+
+#ifndef _TRACE_POWER_ENUM_
+#define _TRACE_POWER_ENUM_
+enum {
+	POWER_NONE = 0,
+	POWER_CSTATE = 1,
+	POWER_PSTATE = 2,
+};
+#endif
+
+static inline void trace_power_start(u64 type, u64 state, u64 cpuid) {}
+static inline void trace_power_end(u64 cpuid) {}
+static inline void trace_power_frequency(u64 type, u64 state, u64 cpuid) {}
+#endif /* CONFIG_EVENT_POWER_TRACING_DEPRECATED */
+
+#endif /* EVENT_POWER_TRACING_DEPRECATED_PART_H */
+
+
+
 /* This part must be outside protection */
 #include <trace/define_trace.h>
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 538501c..0b5c841 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -64,6 +64,20 @@ config EVENT_TRACING
 	select CONTEXT_SWITCH_TRACER
 	bool
 
+config EVENT_POWER_TRACING_DEPRECATED
+	depends on EVENT_TRACING
+	bool
+	help
+	  Provides old power event types:
+	  C-state/idle accounting events:
+	  power:power_start
+	  power:power_end
+	  and old cpufreq accounting event:
+	  power:power_frequency
+	  This is for userspace compatibility
+	  and will vanish after 5 kernel iterations,
+	  namely 2.6.41.
+
 config CONTEXT_SWITCH_TRACER
 	bool
 
diff --git a/kernel/trace/power-traces.c b/kernel/trace/power-traces.c
index 0e0497d..6b6da42 100644
--- a/kernel/trace/power-traces.c
+++ b/kernel/trace/power-traces.c
@@ -13,5 +13,8 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/power.h>
 
+#ifdef CONFIG_EVENT_POWER_TRACING_DEPRECATED
 EXPORT_TRACEPOINT_SYMBOL_GPL(power_start);
+#endif
+EXPORT_TRACEPOINT_SYMBOL_GPL(processor_idle);
 
-- 
1.6.0.2

* [PATCH 3/3] PERF(userspace): Adjust perf timechart to the new power events
       [not found] <1287488171-25303-1-git-send-email-trenn@suse.de>
                   ` (3 preceding siblings ...)
  2010-10-19 11:36 ` [PATCH 2/3] PERF(kernel): Cleanup power events Thomas Renninger
@ 2010-10-19 11:36 ` Thomas Renninger
  2010-10-19 11:36 ` Thomas Renninger
  5 siblings, 0 replies; 157+ messages in thread
From: Thomas Renninger @ 2010-10-19 11:36 UTC (permalink / raw)
  To: trenn
  Cc: Arjan van de Ven, linux-trace-users, Frederic Weisbecker,
	Pierre Tardy, Jean Pihet, Steven Rostedt, Peter Zijlstra,
	Frank Eigler, Mathieu Desnoyers, Ingo Molnar, linux-pm,
	Masami Hiramatsu, Tejun Heo, Andrew Morton, linux-omap,
	Linus Torvalds, Thomas Gleixner

The transition was rather smooth. The only part I had to fiddle
with for some time was the check whether a tracepoint/event is
supported by the running kernel.

builtin-timechart must only pass -e power:xy events which
are supported by the running kernel.
For this I added the tiny helper function:
int is_valid_tracepoint(const char *event_string)
to parse-events.[hc].
It could be more generic as an interface and support
hardware/software/... events, not only tracepoints, but someone
else could extend that if needed...

Signed-off-by: Thomas Renninger <trenn@suse.de>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
CC: Frank Eigler <fche@redhat.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Kevin Hilman <khilman@deeprootsystems.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: linux-omap@vger.kernel.org
CC: rjw@sisk.pl
CC: linux-pm@lists.linux-foundation.org
CC: linux-trace-users@vger.kernel.org
CC: Jean Pihet <jean.pihet@newoldbits.com>
CC: Pierre Tardy <tardyp@gmail.com>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Tejun Heo <tj@kernel.org>
CC: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Arjan van de Ven <arjan@linux.intel.com>
CC: Ingo Molnar <mingo@elte.hu>
---
 tools/perf/builtin-timechart.c |   87 ++++++++++++++++++++++++++++++++-------
 tools/perf/util/parse-events.c |   43 +++++++++++++++++++-
 tools/perf/util/parse-events.h |    1 +
 3 files changed, 114 insertions(+), 17 deletions(-)

diff --git a/tools/perf/builtin-timechart.c b/tools/perf/builtin-timechart.c
index 9bcc38f..1304c27 100644
--- a/tools/perf/builtin-timechart.c
+++ b/tools/perf/builtin-timechart.c
@@ -32,6 +32,8 @@
 #include "util/session.h"
 #include "util/svghelper.h"
 
+#define SUPPORT_OLD_POWER_EVENTS 1
+
 static char		const *input_name = "perf.data";
 static char		const *output_name = "output.svg";
 
@@ -298,12 +300,25 @@ struct trace_entry {
 	int			lock_depth;
 };
 
-struct power_entry {
+#if defined(SUPPORT_OLD_POWER_EVENTS)
+struct power_entry_old {
 	struct trace_entry te;
 	u64	type;
 	u64	value;
 	u64	cpu_id;
 };
+#endif
+
+struct power_processor_entry {
+	struct trace_entry te;
+	u64	state;
+	u64	cpu_id;
+};
+
+struct power_suspend_entry {
+	struct trace_entry te;
+	u64	state;
+};
 
 #define TASK_COMM_LEN 16
 struct wakeup_entry {
@@ -489,29 +504,46 @@ static int process_sample_event(event_t *event, struct perf_session *session)
 	te = (void *)data.raw_data;
 	if (session->sample_type & PERF_SAMPLE_RAW && data.raw_size > 0) {
 		char *event_str;
-		struct power_entry *pe;
-
-		pe = (void *)te;
+#if defined(SUPPORT_OLD_POWER_EVENTS)
+		struct power_entry_old *peo;
+		peo = (void *)te;
+#endif
 
 		event_str = perf_header__find_event(te->type);
 
 		if (!event_str)
 			return 0;
 
-		if (strcmp(event_str, "power:power_start") == 0)
-			c_state_start(pe->cpu_id, data.time, pe->value);
-
-		if (strcmp(event_str, "power:power_end") == 0)
-			c_state_end(pe->cpu_id, data.time);
+		if (strcmp(event_str, "power:processor_idle") == 0) {
+			struct power_processor_entry *ppe = (void *)te;
+			if (ppe->state == 0)
+				c_state_end(ppe->cpu_id, data.time);
+			else
+				c_state_start(ppe->cpu_id, data.time,
+					      ppe->state);
+		}
 
-		if (strcmp(event_str, "power:power_frequency") == 0)
-			p_state_change(pe->cpu_id, data.time, pe->value);
+		else if (strcmp(event_str, "power:processor_frequency") == 0) {
+			struct power_processor_entry *ppe = (void *)te;
+			p_state_change(ppe->cpu_id, data.time, ppe->state);
+		}
 
-		if (strcmp(event_str, "sched:sched_wakeup") == 0)
+		else if (strcmp(event_str, "sched:sched_wakeup") == 0)
 			sched_wakeup(data.cpu, data.time, data.pid, te);
 
-		if (strcmp(event_str, "sched:sched_switch") == 0)
+		else if (strcmp(event_str, "sched:sched_switch") == 0)
 			sched_switch(data.cpu, data.time, te);
+
+#if defined(SUPPORT_OLD_POWER_EVENTS)
+		else if (strcmp(event_str, "power:power_start") == 0)
+			c_state_start(peo->cpu_id, data.time, peo->value);
+
+		else if (strcmp(event_str, "power:power_end") == 0)
+			c_state_end(peo->cpu_id, data.time);
+
+		else if (strcmp(event_str, "power:power_frequency") == 0)
+			p_state_change(peo->cpu_id, data.time, peo->value);
+#endif
 	}
 	return 0;
 }
@@ -968,7 +1000,8 @@ static const char * const timechart_usage[] = {
 	NULL
 };
 
-static const char *record_args[] = {
+#if defined(SUPPORT_OLD_POWER_EVENTS)
+static const char *record_old_args[] = {
 	"record",
 	"-a",
 	"-R",
@@ -980,16 +1013,38 @@ static const char *record_args[] = {
 	"-e", "sched:sched_wakeup",
 	"-e", "sched:sched_switch",
 };
+#endif
+
+static const char *record_new_args[] = {
+	"record",
+	"-a",
+	"-R",
+	"-f",
+	"-c", "1",
+	"-e", "power:processor_frequency",
+	"-e", "power:processor_idle",
+	"-e", "sched:sched_wakeup",
+	"-e", "sched:sched_switch",
+};
 
 static int __cmd_record(int argc, const char **argv)
 {
 	unsigned int rec_argc, i, j;
 	const char **rec_argv;
+	const char **record_args = record_new_args;
+	unsigned int record_elems = ARRAY_SIZE(record_new_args);
 
-	rec_argc = ARRAY_SIZE(record_args) + argc - 1;
+#if defined(SUPPORT_OLD_POWER_EVENTS)
+	if (is_valid_tracepoint("power:power_start")) {
+		record_args = record_old_args;
+		record_elems = ARRAY_SIZE(record_old_args);
+	}
+#endif
+	
+	rec_argc = record_elems + argc - 1;
 	rec_argv = calloc(rec_argc + 1, sizeof(char *));
 
-	for (i = 0; i < ARRAY_SIZE(record_args); i++)
+	for (i = 0; i < record_elems; i++)
 		rec_argv[i] = strdup(record_args[i]);
 
 	for (j = 1; j < (unsigned int)argc; j++, i++)
diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index 4af5bd5..d706dcb 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -824,7 +824,7 @@ int parse_events(const struct option *opt __used, const char *str, int unset __u
 		if (ret != EVT_HANDLED_ALL) {
 			attrs[nr_counters] = attr;
 			nr_counters++;
-		}
+	}
 
 		if (*str == 0)
 			break;
@@ -906,6 +906,47 @@ static void print_tracepoint_events(void)
 }
 
 /*
+ * Check whether event is in <debugfs_mount_point>/tracing/events
+ */
+
+int is_valid_tracepoint(const char *event_string)
+{
+	DIR *sys_dir, *evt_dir;
+	struct dirent *sys_next, *evt_next, sys_dirent, evt_dirent;
+	char evt_path[MAXPATHLEN];
+	char dir_path[MAXPATHLEN];
+
+	if (debugfs_valid_mountpoint(debugfs_path))
+		return 0;
+
+	sys_dir = opendir(debugfs_path);
+	if (!sys_dir)
+		return 0;
+
+	for_each_subsystem(sys_dir, sys_dirent, sys_next) {
+
+		snprintf(dir_path, MAXPATHLEN, "%s/%s", debugfs_path,
+			 sys_dirent.d_name);
+		evt_dir = opendir(dir_path);
+		if (!evt_dir)
+			continue;
+
+		for_each_event(sys_dirent, evt_dir, evt_dirent, evt_next) {
+			snprintf(evt_path, MAXPATHLEN, "%s:%s",
+				 sys_dirent.d_name, evt_dirent.d_name);
+			if (!strcmp(evt_path, event_string)) {
+				closedir(evt_dir);
+				closedir(sys_dir);
+				return 1;
+			}
+		}
+		closedir(evt_dir);
+	}
+	closedir(sys_dir);
+	return 0;
+}
+
+/*
  * Print the help text for the event symbols:
  */
 void print_events(void)
diff --git a/tools/perf/util/parse-events.h b/tools/perf/util/parse-events.h
index fc4ab3f..7ab4685 100644
--- a/tools/perf/util/parse-events.h
+++ b/tools/perf/util/parse-events.h
@@ -29,6 +29,7 @@ extern int parse_filter(const struct option *opt, const char *str, int unset);
 #define EVENTS_HELP_MAX (128*1024)
 
 extern void print_events(void);
+extern int is_valid_tracepoint(const char *event_string);
 
 extern char debugfs_path[];
 extern int valid_debugfs_mount(const char *debugfs);
-- 
1.6.0.2

^ permalink raw reply related	[flat|nested] 157+ messages in thread

* [PATCH 3/3] PERF(userspace): Adjust perf timechart to the new power events
       [not found] <1287488171-25303-1-git-send-email-trenn@suse.de>
                   ` (4 preceding siblings ...)
  2010-10-19 11:36 ` [PATCH 3/3] PERF(userspace): Adjust perf timechart to the new " Thomas Renninger
@ 2010-10-19 11:36 ` Thomas Renninger
  2010-10-26  0:18   ` [PATCH] PERF(userspace): Adjust perf timechart to the new power events V2 Thomas Renninger
  2010-10-26  0:18   ` Thomas Renninger
  5 siblings, 2 replies; 157+ messages in thread
From: Thomas Renninger @ 2010-10-19 11:36 UTC (permalink / raw)
  To: trenn
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Masami Hiramatsu,
	Frank Eigler, Steven Rostedt, Kevin Hilman, Peter Zijlstra,
	linux-omap, rjw, linux-pm, linux-trace-users, Jean Pihet,
	Pierre Tardy, Frederic Weisbecker, Tejun Heo, Mathieu Desnoyers,
	Arjan van de Ven, Ingo Molnar

The transition was rather smooth; the only part that took some
fiddling was the check whether a tracepoint/event is
supported by the running kernel.

builtin-timechart must only pass -e power:xy events that
are supported by the running kernel.
For this I added a tiny helper function:
int is_valid_tracepoint(const char *event_string)
to parse-events.[hc].
It could be more generic as an interface and support
hardware/software/... events, not only tracepoints, but someone
else could extend that if needed...

Signed-off-by: Thomas Renninger <trenn@suse.de>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
CC: Frank Eigler <fche@redhat.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Kevin Hilman <khilman@deeprootsystems.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: linux-omap@vger.kernel.org
CC: rjw@sisk.pl
CC: linux-pm@lists.linux-foundation.org
CC: linux-trace-users@vger.kernel.org
CC: Jean Pihet <jean.pihet@newoldbits.com>
CC: Pierre Tardy <tardyp@gmail.com>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Tejun Heo <tj@kernel.org>
CC: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Arjan van de Ven <arjan@linux.intel.com>
CC: Ingo Molnar <mingo@elte.hu>
---
 tools/perf/builtin-timechart.c |   87 ++++++++++++++++++++++++++++++++-------
 tools/perf/util/parse-events.c |   43 +++++++++++++++++++-
 tools/perf/util/parse-events.h |    1 +
 3 files changed, 114 insertions(+), 17 deletions(-)

diff --git a/tools/perf/builtin-timechart.c b/tools/perf/builtin-timechart.c
index 9bcc38f..1304c27 100644
--- a/tools/perf/builtin-timechart.c
+++ b/tools/perf/builtin-timechart.c
@@ -32,6 +32,8 @@
 #include "util/session.h"
 #include "util/svghelper.h"
 
+#define SUPPORT_OLD_POWER_EVENTS 1
+
 static char		const *input_name = "perf.data";
 static char		const *output_name = "output.svg";
 
@@ -298,12 +300,25 @@ struct trace_entry {
 	int			lock_depth;
 };
 
-struct power_entry {
+#if defined(SUPPORT_OLD_POWER_EVENTS)
+struct power_entry_old {
 	struct trace_entry te;
 	u64	type;
 	u64	value;
 	u64	cpu_id;
 };
+#endif
+
+struct power_processor_entry {
+	struct trace_entry te;
+	u64	state;
+	u64	cpu_id;
+};
+
+struct power_suspend_entry {
+	struct trace_entry te;
+	u64	state;
+};
 
 #define TASK_COMM_LEN 16
 struct wakeup_entry {
@@ -489,29 +504,46 @@ static int process_sample_event(event_t *event, struct perf_session *session)
 	te = (void *)data.raw_data;
 	if (session->sample_type & PERF_SAMPLE_RAW && data.raw_size > 0) {
 		char *event_str;
-		struct power_entry *pe;
-
-		pe = (void *)te;
+#if defined(SUPPORT_OLD_POWER_EVENTS)
+		struct power_entry_old *peo;
+		peo = (void *)te;
+#endif
 
 		event_str = perf_header__find_event(te->type);
 
 		if (!event_str)
 			return 0;
 
-		if (strcmp(event_str, "power:power_start") == 0)
-			c_state_start(pe->cpu_id, data.time, pe->value);
-
-		if (strcmp(event_str, "power:power_end") == 0)
-			c_state_end(pe->cpu_id, data.time);
+		if (strcmp(event_str, "power:processor_idle") == 0) {
+			struct power_processor_entry *ppe = (void *)te;
+			if (ppe->state == 0)
+				c_state_end(ppe->cpu_id, data.time);
+			else
+				c_state_start(ppe->cpu_id, data.time,
+					      ppe->state);
+		}
 
-		if (strcmp(event_str, "power:power_frequency") == 0)
-			p_state_change(pe->cpu_id, data.time, pe->value);
+		else if (strcmp(event_str, "power:processor_frequency") == 0) {
+			struct power_processor_entry *ppe = (void *)te;
+			p_state_change(ppe->cpu_id, data.time, ppe->state);
+		}
 
-		if (strcmp(event_str, "sched:sched_wakeup") == 0)
+		else if (strcmp(event_str, "sched:sched_wakeup") == 0)
 			sched_wakeup(data.cpu, data.time, data.pid, te);
 
-		if (strcmp(event_str, "sched:sched_switch") == 0)
+		else if (strcmp(event_str, "sched:sched_switch") == 0)
 			sched_switch(data.cpu, data.time, te);
+
+#if defined(SUPPORT_OLD_POWER_EVENTS)
+		else if (strcmp(event_str, "power:power_start") == 0)
+			c_state_start(peo->cpu_id, data.time, peo->value);
+
+		else if (strcmp(event_str, "power:power_end") == 0)
+			c_state_end(peo->cpu_id, data.time);
+
+		else if (strcmp(event_str, "power:power_frequency") == 0)
+			p_state_change(peo->cpu_id, data.time, peo->value);
+#endif
 	}
 	return 0;
 }
@@ -968,7 +1000,8 @@ static const char * const timechart_usage[] = {
 	NULL
 };
 
-static const char *record_args[] = {
+#if defined(SUPPORT_OLD_POWER_EVENTS)
+static const char *record_old_args[] = {
 	"record",
 	"-a",
 	"-R",
@@ -980,16 +1013,38 @@ static const char *record_args[] = {
 	"-e", "sched:sched_wakeup",
 	"-e", "sched:sched_switch",
 };
+#endif
+
+static const char *record_new_args[] = {
+	"record",
+	"-a",
+	"-R",
+	"-f",
+	"-c", "1",
+	"-e", "power:processor_frequency",
+	"-e", "power:processor_idle",
+	"-e", "sched:sched_wakeup",
+	"-e", "sched:sched_switch",
+};
 
 static int __cmd_record(int argc, const char **argv)
 {
 	unsigned int rec_argc, i, j;
 	const char **rec_argv;
+	const char **record_args = record_new_args;
+	unsigned int record_elems = ARRAY_SIZE(record_new_args);
 
-	rec_argc = ARRAY_SIZE(record_args) + argc - 1;
+#if defined(SUPPORT_OLD_POWER_EVENTS)
+	if (is_valid_tracepoint("power:power_start")) {
+		record_args = record_old_args;
+		record_elems = ARRAY_SIZE(record_old_args);
+	}
+#endif
+	
+	rec_argc = record_elems + argc - 1;
 	rec_argv = calloc(rec_argc + 1, sizeof(char *));
 
-	for (i = 0; i < ARRAY_SIZE(record_args); i++)
+	for (i = 0; i < record_elems; i++)
 		rec_argv[i] = strdup(record_args[i]);
 
 	for (j = 1; j < (unsigned int)argc; j++, i++)
diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index 4af5bd5..d706dcb 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -824,7 +824,7 @@ int parse_events(const struct option *opt __used, const char *str, int unset __u
 		if (ret != EVT_HANDLED_ALL) {
 			attrs[nr_counters] = attr;
 			nr_counters++;
-		}
+	}
 
 		if (*str == 0)
 			break;
@@ -906,6 +906,47 @@ static void print_tracepoint_events(void)
 }
 
 /*
+ * Check whether event is in <debugfs_mount_point>/tracing/events
+ */
+
+int is_valid_tracepoint(const char *event_string)
+{
+	DIR *sys_dir, *evt_dir;
+	struct dirent *sys_next, *evt_next, sys_dirent, evt_dirent;
+	char evt_path[MAXPATHLEN];
+	char dir_path[MAXPATHLEN];
+
+	if (debugfs_valid_mountpoint(debugfs_path))
+		return 0;
+
+	sys_dir = opendir(debugfs_path);
+	if (!sys_dir)
+		return 0;
+
+	for_each_subsystem(sys_dir, sys_dirent, sys_next) {
+
+		snprintf(dir_path, MAXPATHLEN, "%s/%s", debugfs_path,
+			 sys_dirent.d_name);
+		evt_dir = opendir(dir_path);
+		if (!evt_dir)
+			continue;
+
+		for_each_event(sys_dirent, evt_dir, evt_dirent, evt_next) {
+			snprintf(evt_path, MAXPATHLEN, "%s:%s",
+				 sys_dirent.d_name, evt_dirent.d_name);
+			if (!strcmp(evt_path, event_string)) {
+				closedir(evt_dir);
+				closedir(sys_dir);
+				return 1;
+			}
+		}
+		closedir(evt_dir);
+	}
+	closedir(sys_dir);
+	return 0;
+}
+
+/*
  * Print the help text for the event symbols:
  */
 void print_events(void)
diff --git a/tools/perf/util/parse-events.h b/tools/perf/util/parse-events.h
index fc4ab3f..7ab4685 100644
--- a/tools/perf/util/parse-events.h
+++ b/tools/perf/util/parse-events.h
@@ -29,6 +29,7 @@ extern int parse_filter(const struct option *opt, const char *str, int unset);
 #define EVENTS_HELP_MAX (128*1024)
 
 extern void print_events(void);
+extern int is_valid_tracepoint(const char *event_string);
 
 extern char debugfs_path[];
 extern int valid_debugfs_mount(const char *debugfs);
-- 
1.6.0.2
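
The check that is_valid_tracepoint() performs can also be expressed as
a direct path lookup instead of a directory walk. The following is a
minimal sketch of that alternative, not the code from the patch: the
function name tracepoint_dir_exists, the events_dir parameter and the
buffer size are assumptions made for illustration. It expects
events_dir to point at the tracing events directory, e.g.
<debugfs_mount_point>/tracing/events.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

#define EVT_PATH_MAX 4096	/* illustrative buffer size */

/*
 * Hypothetical helper: return 1 if "<events_dir>/<subsystem>/<event>"
 * is a directory, 0 otherwise. event_string uses the
 * "subsystem:event" form that perf takes with -e.
 */
static int tracepoint_dir_exists(const char *events_dir,
				 const char *event_string)
{
	char path[EVT_PATH_MAX];
	const char *colon;
	struct stat st;
	int len;

	assert(events_dir && event_string);

	/* A '/' would let the lookup escape the events directory */
	if (strchr(event_string, '/'))
		return 0;

	colon = strchr(event_string, ':');
	/* Reject "event", ":event" and "subsystem:" */
	if (!colon || colon == event_string || colon[1] == '\0')
		return 0;

	/* Reject "." and ".." style components */
	if (event_string[0] == '.' || colon[1] == '.')
		return 0;

	len = snprintf(path, sizeof(path), "%s/%.*s/%s", events_dir,
		       (int)(colon - event_string), event_string, colon + 1);
	if (len < 0 || (size_t)len >= sizeof(path))
		return 0;	/* path did not fit into the buffer */

	if (stat(path, &st) != 0)
		return 0;

	return S_ISDIR(st.st_mode) ? 1 : 0;
}
```

A single stat() call avoids iterating over every subsystem, at the
cost of not reusing the for_each_subsystem/for_each_event macros that
the patch relies on.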


^ permalink raw reply related	[flat|nested] 157+ messages in thread

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-19 11:36 ` [PATCH 2/3] PERF(kernel): Cleanup power events Thomas Renninger
  2010-10-25  6:54   ` Arjan van de Ven
@ 2010-10-25  6:54   ` Arjan van de Ven
  2010-10-25  9:41     ` Thomas Renninger
  2010-10-25  9:41     ` Thomas Renninger
  2010-10-25  6:58   ` Arjan van de Ven
                     ` (5 subsequent siblings)
  7 siblings, 2 replies; 157+ messages in thread
From: Arjan van de Ven @ 2010-10-25  6:54 UTC (permalink / raw)
  To: Thomas Renninger
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Masami Hiramatsu,
	Frank Eigler, Steven Rostedt, Kevin Hilman, Peter Zijlstra,
	linux-omap, rjw, linux-pm, linux-trace-users, Jean Pihet,
	Pierre Tardy, Frederic Weisbecker, Tejun Heo, Mathieu Desnoyers,
	Ingo Molnar

On 10/19/2010 4:36 AM, Thomas Renninger wrote:
>   static void poll_idle(void)
>   {
> -	trace_power_start(POWER_CSTATE, 0, smp_processor_id());
>   	local_irq_enable();
>   	while (!need_resched())
>   		cpu_relax();
> -	trace_power_end(0);
>   }

Why did you remove the idle tracepoints from this one?


^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-19 11:36 ` [PATCH 2/3] PERF(kernel): Cleanup power events Thomas Renninger
                     ` (2 preceding siblings ...)
  2010-10-25  6:58   ` Arjan van de Ven
@ 2010-10-25  6:58   ` Arjan van de Ven
  2010-10-25 10:04   ` Ingo Molnar
                     ` (3 subsequent siblings)
  7 siblings, 0 replies; 157+ messages in thread
From: Arjan van de Ven @ 2010-10-25  6:58 UTC (permalink / raw)
  To: Thomas Renninger
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Masami Hiramatsu,
	Frank Eigler, Steven Rostedt, Kevin Hilman, Peter Zijlstra,
	linux-omap, rjw, linux-pm, linux-trace-users, Jean Pihet,
	Pierre Tardy, Frederic Weisbecker, Tejun Heo, Mathieu Desnoyers,
	Ingo Molnar

On 10/19/2010 4:36 AM, Thomas Renninger wrote:
> New power trace events:
> power:processor_idle
> power:processor_frequency
> power:machine_suspend
>
>
> C-state/idle accounting events:
>    power:power_start
>    power:power_end
> are replaced with:
>    power:processor_idle
>
I think you need two tracepoints for this:
one to enter idle
one to exit

because using magic encoding games to encode "exit" is a mistake, as can
be seen in this patch.
You're currently trying to use "0" to signal "end of idle", but "0" is
also a valid idle state (namely that of polling).
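
The two-tracepoint model proposed above can be sketched from the
consumer side. This is a minimal illustration under assumptions: the
struct idle_event layout, the IDLE_ENTER/IDLE_EXIT kinds and the
function name are hypothetical and do not correspond to any existing
kernel or perf interface. Because "exit" is its own event kind, state
0 stays available as a real idle state (polling).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum idle_kind { IDLE_ENTER, IDLE_EXIT };	/* hypothetical event kinds */

struct idle_event {
	enum idle_kind kind;
	uint32_t state;		/* only meaningful for IDLE_ENTER; 0 = polling */
	uint64_t time_ns;
};

/* Sum the time spent in idle state `state` over a time-ordered event list */
static uint64_t idle_time_in_state(const struct idle_event *ev, int n,
				   uint32_t state)
{
	uint64_t total = 0, start = 0;
	int in_state = 0;
	int i;

	assert(n == 0 || ev != NULL);

	for (i = 0; i < n; i++) {
		if (in_state) {
			/* Any later event (exit or a new enter) ends the residency */
			total += ev[i].time_ns - start;
			in_state = 0;
		}
		if (ev[i].kind == IDLE_ENTER && ev[i].state == state) {
			start = ev[i].time_ns;
			in_state = 1;
		}
	}
	return total;
}
```

With a single event where state 0 means "exit", the polling residency
counted for state 0 here could not be told apart from an exit.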


^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-25  6:54   ` Arjan van de Ven
@ 2010-10-25  9:41     ` Thomas Renninger
  2010-10-25 13:55       ` Arjan van de Ven
  2010-10-25 13:55       ` Arjan van de Ven
  2010-10-25  9:41     ` Thomas Renninger
  1 sibling, 2 replies; 157+ messages in thread
From: Thomas Renninger @ 2010-10-25  9:41 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Masami Hiramatsu,
	Frank Eigler, Steven Rostedt, Kevin Hilman, Peter Zijlstra,
	linux-omap, rjw, linux-pm, linux-trace-users, Jean Pihet,
	Pierre Tardy, Frederic Weisbecker, Tejun Heo, Mathieu Desnoyers,
	Ingo Molnar

On Monday 25 October 2010 08:54:34 Arjan van de Ven wrote:
> On 10/19/2010 4:36 AM, Thomas Renninger wrote:
> >   static void poll_idle(void)
> >   {
> > -	trace_power_start(POWER_CSTATE, 0, smp_processor_id());
> >   	local_irq_enable();
> >   	while (!need_resched())
> >   		cpu_relax();
> > -	trace_power_end(0);
> >   }
> 
> why did you remove the idle tracepoints from this one ???
Because no idle/sleep state is entered here.
State 0 does not exist as a sleep state; or rather, it means the
machine is not idle.
The new event uses idle state 0, in conformance with the spec, as
"exit sleep state".

If this should still be trackable, some kind of dummy sleep state:
#define IDLE_BUSY_LOOP 0xFE
(or similar) must get defined and passed like this:
trace_processor_idle(IDLE_BUSY_LOOP, smp_processor_id());
    cpu_relax()
trace_processor_idle(0, smp_processor_id());

I could imagine this being somewhat worthwhile, to compare idle results
against the case where no idle state at all is used.
But nobody should ever use idle=poll; comparing deep sleep states
with C1 (idle=halt) should be sufficient?

    Thomas
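
The dummy-state idea from this mail can be sketched from the consumer
side as follows. This is only an illustration under assumptions:
IDLE_BUSY_LOOP and its value 0xFE are taken from the mail ("or
similar"), while IDLE_EXIT_STATE, enum idle_class and
classify_idle_state are hypothetical names that exist neither in the
kernel nor in perf.

```c
#include <assert.h>

#define IDLE_EXIT_STATE	0	/* state 0: the sleep state was left */
#define IDLE_BUSY_LOOP	0xFE	/* dummy state for the polling loop */

enum idle_class {
	IDLE_CLASS_EXIT,	/* CPU is no longer idle */
	IDLE_CLASS_POLL,	/* busy loop, no sleep state entered */
	IDLE_CLASS_SLEEP	/* a real sleep state (C1 and deeper) */
};

/* Map the state value of a processor_idle style event to a class */
static enum idle_class classify_idle_state(unsigned int state)
{
	if (state == IDLE_EXIT_STATE)
		return IDLE_CLASS_EXIT;
	if (state == IDLE_BUSY_LOOP)
		return IDLE_CLASS_POLL;
	return IDLE_CLASS_SLEEP;
}
```

A consumer that does not know the sentinel would treat 0xFE as a very
deep sleep state, so both sides have to agree on the reserved value.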

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-19 11:36 ` [PATCH 2/3] PERF(kernel): Cleanup power events Thomas Renninger
                     ` (4 preceding siblings ...)
  2010-10-25 10:04   ` Ingo Molnar
@ 2010-10-25 10:04   ` Ingo Molnar
  2010-10-25 11:03     ` Thomas Renninger
  2010-10-25 11:03     ` Thomas Renninger
  2010-10-25 23:33   ` [PATCH] PERF(kernel): Cleanup power events V2 Thomas Renninger
  2010-10-25 23:33   ` Thomas Renninger
  7 siblings, 2 replies; 157+ messages in thread
From: Ingo Molnar @ 2010-10-25 10:04 UTC (permalink / raw)
  To: Thomas Renninger
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Masami Hiramatsu,
	Frank Eigler, Steven Rostedt, Kevin Hilman, Peter Zijlstra,
	linux-omap, rjw, linux-pm, linux-trace-users, Jean Pihet,
	Pierre Tardy, Frederic Weisbecker, Tejun Heo, Mathieu Desnoyers,
	Arjan van de Ven


* Thomas Renninger <trenn@suse.de> wrote:

> New power trace events:
> power:processor_idle
> power:processor_frequency
> power:machine_suspend
> 
> 
> C-state/idle accounting events:
>   power:power_start
>   power:power_end
> are replaced with:
>   power:processor_idle

Well, most power saving hw models (and the code implementing them) have this kind of 
model:

 enter power saving mode X
 exit power saving mode

Where X is some sort of 'power saving deepness' attribute, right?

 	Ingo

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-25 10:04   ` Ingo Molnar
  2010-10-25 11:03     ` Thomas Renninger
@ 2010-10-25 11:03     ` Thomas Renninger
  2010-10-25 11:55       ` Ingo Molnar
                         ` (3 more replies)
  1 sibling, 4 replies; 157+ messages in thread
From: Thomas Renninger @ 2010-10-25 11:03 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Masami Hiramatsu,
	Frank Eigler, Steven Rostedt, Kevin Hilman, Peter Zijlstra,
	linux-omap, rjw, linux-pm, linux-trace-users, Jean Pihet,
	Pierre Tardy, Frederic Weisbecker, Tejun Heo, Mathieu Desnoyers,
	Arjan van de Ven

On Monday 25 October 2010 12:04:28 Ingo Molnar wrote:
> 
> * Thomas Renninger <trenn@suse.de> wrote:
> 
> > New power trace events:
> > power:processor_idle
> > power:processor_frequency
> > power:machine_suspend
> > 
> > 
> > C-state/idle accounting events:
> >   power:power_start
> >   power:power_end
> > are replaced with:
> >   power:processor_idle
> 
> Well, most power saving hw models (and the code implementing them) have this kind of 
> model:
> 
>  enter power saving mode X
>  exit power saving mode
> 
> Where X is some sort of 'power saving deepness' attribute, right?
Sure.
But ACPI defines state 0 as the non-power saving mode, and afaik this
model got picked up for PCI and other (sub-)archs as well.
The same is done here with the machine suspend state (S0 is back from
suspend), and this model should get picked up when device sleep states
get tracked at some time.
It's consistent and matches some well known specifications.

Also, tracking processor_idle_{start,end} as separate events
makes no sense, and there is no need to introduce:
processor_idle_start/processor_idle_end
machine_suspend_start/machine_suspend_end
device_power_mode_start/device_power_mode_end
events.
Using state 0 as "exit/end" is much nicer for kernel/
userspace implementations/code and the user.

     Thomas

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-25 11:03     ` Thomas Renninger
@ 2010-10-25 11:55       ` Ingo Molnar
  2010-10-25 12:55         ` Thomas Renninger
                           ` (3 more replies)
  2010-10-25 11:55       ` Ingo Molnar
                         ` (2 subsequent siblings)
  3 siblings, 4 replies; 157+ messages in thread
From: Ingo Molnar @ 2010-10-25 11:55 UTC (permalink / raw)
  To: Thomas Renninger
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Masami Hiramatsu,
	Frank Eigler, Steven Rostedt, Kevin Hilman, Peter Zijlstra,
	linux-omap, rjw, linux-pm, linux-trace-users, Jean Pihet,
	Pierre Tardy, Frederic Weisbecker, Tejun Heo, Mathieu Desnoyers,
	Arjan van de Ven


* Thomas Renninger <trenn@suse.de> wrote:

> On Monday 25 October 2010 12:04:28 Ingo Molnar wrote:
> > 
> > * Thomas Renninger <trenn@suse.de> wrote:
> > 
> > > New power trace events:
> > > power:processor_idle
> > > power:processor_frequency
> > > power:machine_suspend
> > > 
> > > 
> > > C-state/idle accounting events:
> > >   power:power_start
> > >   power:power_end
> > > are replaced with:
> > >   power:processor_idle
> > 
> > Well, most power saving hw models (and the code implementing them) have this kind of 
> > model:
> > 
> >  enter power saving mode X
> >  exit power saving mode
> > 
> > Where X is some sort of 'power saving deepness' attribute, right?
>
> Sure.

Which is the 'saner' model?

> But ACPI and afaik this model got picked up for PCI and other (sub-)archs as well, 
> defines state 0 as the non-power saving mode.

But the actual code does not actually deal with any 'state 0', does it? It enters an 
idle function and then exits it, right?

'power state' might be what is used for devices - but even there, we have:

  - enter power state X
  - exit power state

right?

> Same as done here with machine suspend state (S0 is back from suspend) and
> this model should get picked up when device sleep states get tracked at
> some time.
>
> It's consistent and applies to some well known specifications.

What we want is for it to be the nicest, most understandable, most logical
model - not one matching random hardware specifications.

( Hardware specifications only matter insofar as it should be possible to
  express all the known hardware state transitions via these events efficiently. )

> Also tracking processor_idle_{start,end} as a separate event makes no sense and 
> there is no need to introduce: processor_idle_start/processor_idle_end 
> machine_suspend_start/machine_suspend_end 
> device_power_mode_start/device_power_mode_end events.

What do you mean by "makes no sense"?

Are they superfluous? Inefficient? Illogical?

> Using state 0 as "exit/end", is much nicer for kernel/ userspace 
> implementations/code and the user.

By that argument we should not have separate fork() and exit() syscalls either, but 
a set_process_state(1) and set_process_state(0) interface?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-25 11:55       ` Ingo Molnar
  2010-10-25 12:55         ` Thomas Renninger
@ 2010-10-25 12:55         ` Thomas Renninger
  2010-10-25 12:58         ` Mathieu Desnoyers
  2010-10-25 12:58         ` Mathieu Desnoyers
  3 siblings, 0 replies; 157+ messages in thread
From: Thomas Renninger @ 2010-10-25 12:55 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arjan van de Ven, linux-trace-users, Frederic Weisbecker,
	Pierre Tardy, Jean Pihet, Steven Rostedt, Peter Zijlstra,
	Frank Eigler, Mathieu Desnoyers, linux-pm, Masami Hiramatsu,
	Tejun Heo, Andrew Morton, linux-omap, Linus Torvalds,
	Thomas Gleixner

On Monday 25 October 2010 13:55:25 Ingo Molnar wrote:
> 
> * Thomas Renninger <trenn@suse.de> wrote:
> 
> > On Monday 25 October 2010 12:04:28 Ingo Molnar wrote:
> > > 
> > > * Thomas Renninger <trenn@suse.de> wrote:
> > > 
> > > > New power trace events:
> > > > power:processor_idle
> > > > power:processor_frequency
> > > > power:machine_suspend
> > > > 
> > > > 
> > > > C-state/idle accounting events:
> > > >   power:power_start
> > > >   power:power_end
> > > > are replaced with:
> > > >   power:processor_idle
> > > 
> > > Well, most power saving hw models (and the code implementing them) have this kind of 
> > > model:
> > > 
> > >  enter power saving mode X
> > >  exit power saving mode
> > > 
> > > Where X is some sort of 'power saving deepness' attribute, right?
> >
> > Sure.
> 
> Which is the 'saner' model?
> 
> > But ACPI and afaik this model got picked up for PCI and other (sub-)archs as well, 
> > defines state 0 as the non-power saving mode.
> 
> But the actual code does not actually deal with any 'state 0', does it?
It does. Not being idle is tracked by the cpuidle driver as state 0
(arch independent):
/sys/devices/system/cpu/cpu0/cpuidle/state0/
halt/C1 on X86 is:
/sys/devices/system/cpu/cpu0/cpuidle/state1/
...

> It enters an idle function and then exits it, right?
> 'power state' might be what is used for devices - but even there, we have:
> 
>   - enter power state X
>   - exit power state
> 
> right?
That is not true for PCI, and probably not for others either.
There you have D0 (being the maximum powered state) up to D3.
Same for PCI Bus Power States (B0, B1, B2, and B3).

Look at drivers/pci/pci.c:pci_raw_set_power_state()
To "exit" a power state you call:
pci_raw_set_power_state(dev, PCI_D0);

Same for suspend. "Exit" suspend is:
#define PM_SUSPEND_ON           ((__force suspend_state_t) 0)
so on resume we enter suspend_state_t 0.
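
The convention described above, where state 0 is the fully powered state
and "exit" is just another transition to it, can be sketched in plain C.
This is only a userspace illustration under stated assumptions:
toy_device, toy_set_power_state() and the TOY_D* values are hypothetical
names, not the kernel's PCI or suspend interfaces.

```c
#include <stdbool.h>

/* Hypothetical power states, numbered like PCI D-states: 0 = fully on. */
enum toy_power_state { TOY_D0 = 0, TOY_D1, TOY_D2, TOY_D3 };

struct toy_device {
	enum toy_power_state state;
};

/* One entry point for every transition; "exit" is a request for state 0. */
static void toy_set_power_state(struct toy_device *dev, enum toy_power_state s)
{
	dev->state = s;
}

/* The device counts as power saving whenever its state is non-zero. */
static bool toy_is_power_saving(const struct toy_device *dev)
{
	return dev->state != TOY_D0;
}
```

With this shape there is no separate "leave" call: resuming is
toy_set_power_state(&dev, TOY_D0).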

> > Same as done here with machine suspend state (S0 is back from suspend) and
> > this model should get picked up when device sleep states get tracked at
> > some time.
> >
> > It's consistent and applies to some well known specifications.
> 
> What we want it to be is for it to be the nicest, most understandable, most logical 
> model - not one matching random hardware specifications.
> 
> ( Hardware specifications only matter in so far that it should be possible to 
>   express all the known hardware state transitions via these events efficiently. )
> 
> > Also tracking processor_idle_{start,end} as a separate event makes no sense and 
> > there is no need to introduce: processor_idle_start/processor_idle_end 
> > machine_suspend_start/machine_suspend_end 
> > device_power_mode_start/device_power_mode_end events.
> 
> What do you mean by "makes no sense"?
> 
> Are they superfluous?
Yes, you do not need two different events to track one thing.

> Illogical?
Yes. A user who wants to enable processor idle tracking
wants to enable it via:
echo power:processor_idle >>/sys/kernel/debug/tracing/set_event
what do you intend to track with a:
power:power_start
event?

    Thomas


* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-25 11:55       ` Ingo Molnar
@ 2010-10-25 12:55         ` Thomas Renninger
  2010-10-25 14:11           ` Arjan van de Ven
  2010-10-25 14:11           ` Arjan van de Ven
  2010-10-25 12:55         ` Thomas Renninger
                           ` (2 subsequent siblings)
  3 siblings, 2 replies; 157+ messages in thread
From: Thomas Renninger @ 2010-10-25 12:55 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Masami Hiramatsu,
	Frank Eigler, Steven Rostedt, Kevin Hilman, Peter Zijlstra,
	linux-omap, rjw, linux-pm, linux-trace-users, Jean Pihet,
	Pierre Tardy, Frederic Weisbecker, Tejun Heo, Mathieu Desnoyers,
	Arjan van de Ven

On Monday 25 October 2010 13:55:25 Ingo Molnar wrote:
> 
> * Thomas Renninger <trenn@suse.de> wrote:
> 
> > On Monday 25 October 2010 12:04:28 Ingo Molnar wrote:
> > > 
> > > * Thomas Renninger <trenn@suse.de> wrote:
> > > 
> > > > New power trace events:
> > > > power:processor_idle
> > > > power:processor_frequency
> > > > power:machine_suspend
> > > > 
> > > > 
> > > > C-state/idle accounting events:
> > > >   power:power_start
> > > >   power:power_end
> > > > are replaced with:
> > > >   power:processor_idle
> > > 
> > > Well, most power saving hw models (and the code implementing them) have this kind of 
> > > model:
> > > 
> > >  enter power saving mode X
> > >  exit power saving mode
> > > 
> > > Where X is some sort of 'power saving deepness' attribute, right?
> >
> > Sure.
> 
> Which is the 'saner' model?
> 
> > But ACPI and afaik this model got picked up for PCI and other (sub-)archs as well, 
> > defines state 0 as the non-power saving mode.
> 
> But the actual code does not actually deal with any 'state 0', does it?
It does. Not being idle is tracked by the cpuidle driver as state 0
(arch independent):
/sys/devices/system/cpu/cpu0/cpuidle/state0/
halt/C1 on X86 is:
/sys/devices/system/cpu/cpu0/cpuidle/state1/
...

> It enters an idle function and then exits it, right?
> 'power state' might be what is used for devices - but even there, we have:
> 
>   - enter power state X
>   - exit power state
> 
> right?
That is not true for PCI, and probably not for others either.
There you have D0 (being the maximum powered state) up to D3.
Same for PCI Bus Power States (B0, B1, B2, and B3).

Look at drivers/pci/pci.c:pci_raw_set_power_state()
To "exit" a power state you call:
pci_raw_set_power_state(dev, PCI_D0);

Same for suspend. "Exit" suspend is:
#define PM_SUSPEND_ON           ((__force suspend_state_t) 0)
so on resume we enter suspend_state_t 0.

> > Same as done here with machine suspend state (S0 is back from suspend) and
> > this model should get picked up when device sleep states get tracked at
> > some time.
> >
> > It's consistent and applies to some well known specifications.
> 
> What we want it to be is for it to be the nicest, most understandable, most logical 
> model - not one matching random hardware specifications.
> 
> ( Hardware specifications only matter in so far that it should be possible to 
>   express all the known hardware state transitions via these events efficiently. )
> 
> > Also tracking processor_idle_{start,end} as a separate event makes no sense and 
> > there is no need to introduce: processor_idle_start/processor_idle_end 
> > machine_suspend_start/machine_suspend_end 
> > device_power_mode_start/device_power_mode_end events.
> 
> What do you mean by "makes no sense"?
> 
> Are they superfluous?
Yes, you do not need two different events to track one thing.

> Illogical?
Yes. A user who wants to enable processor idle tracking
wants to enable it via:
echo power:processor_idle >>/sys/kernel/debug/tracing/set_event
what do you intend to track with a:
power:power_start
event?

    Thomas


* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-25 11:55       ` Ingo Molnar
  2010-10-25 12:55         ` Thomas Renninger
  2010-10-25 12:55         ` Thomas Renninger
@ 2010-10-25 12:58         ` Mathieu Desnoyers
  2010-10-25 12:58         ` Mathieu Desnoyers
  3 siblings, 0 replies; 157+ messages in thread
From: Mathieu Desnoyers @ 2010-10-25 12:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arjan van de Ven, linux-trace-users, Frederic Weisbecker,
	Pierre Tardy, Jean Pihet, Steven Rostedt, Peter Zijlstra,
	Frank Eigler, Thomas Gleixner, linux-pm, Masami Hiramatsu,
	Tejun Heo, Andrew Morton, linux-omap, Linus Torvalds

* Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Thomas Renninger <trenn@suse.de> wrote:
> 
> > On Monday 25 October 2010 12:04:28 Ingo Molnar wrote:
> > > 
> > > * Thomas Renninger <trenn@suse.de> wrote:
> > > 
> > > > New power trace events:
> > > > power:processor_idle
> > > > power:processor_frequency
> > > > power:machine_suspend
> > > > 
> > > > 
> > > > C-state/idle accounting events:
> > > >   power:power_start
> > > >   power:power_end
> > > > are replaced with:
> > > >   power:processor_idle
> > > 
> > > Well, most power saving hw models (and the code implementing them) have this kind of 
> > > model:
> > > 
> > >  enter power saving mode X
> > >  exit power saving mode
> > > 
> > > Where X is some sort of 'power saving deepness' attribute, right?
> >
> > Sure.
> 
> Which is the 'saner' model?
> 
> > But ACPI and afaik this model got picked up for PCI and other (sub-)archs as well, 
> > defines state 0 as the non-power saving mode.
> 
> But the actual code does not actually deal with any 'state 0', does it? It enters an 
> idle function and then exits it, right?
> 
> 'power state' might be what is used for devices - but even there, we have:
> 
>   - enter power state X
>   - exit power state
> 
> right?
> 
> > Same as done here with machine suspend state (S0 is back from suspend) and
> > this model should get picked up when device sleep states get tracked at
> > some time.
> >
> > It's consistent and applies to some well known specifications.
> 
> What we want it to be is for it to be the nicest, most understandable, most logical 
> model - not one matching random hardware specifications.
> 
> ( Hardware specifications only matter in so far that it should be possible to 
>   express all the known hardware state transitions via these events efficiently. )
> 
> > Also tracking processor_idle_{start,end} as a separate event makes no sense and 
> > there is no need to introduce: processor_idle_start/processor_idle_end 
> > machine_suspend_start/machine_suspend_end 
> > device_power_mode_start/device_power_mode_end events.
> 
> What do you mean by "makes no sense"?
> 
> Are they superfluous? Inefficient? Illogical?

I think it would require deep understanding of specific power modes of each
architecture to split into this topology. On the bright side, it would bring
clear understanding of which HW resource is being put to sleep, which would make
automated analysis much easier to do. But maybe it's too much pain compared to
the benefit. The related question is also: where is it best to put this logic?
In the kernel code? In per-arch TRACE_EVENT() handlers or in external trace
analysis plugins?

> 
> > Using state 0 as "exit/end", is much nicer for kernel/ userspace 
> > implementations/code and the user.
> 
> By that argument we should not have separate fork() and exit() syscalls either, but 
> a set_process_state(1) and set_process_state(0) interface?

I'm by no means an expert on power saving hardware specs, but if it is possible for
hardware to switch between two power saving states without passing through power
state 0, then using a "set state" rather than an enter/exit would be more
appropriate; even if we go for a scheme introducing

processor_idle_start/processor_idle_end,
machine_suspend_start/machine_suspend_end,
device_power_mode_start/device_power_mode_end.

I must defer to you guys to figure out if some hardware actually does that for
either of CPU idle, suspend or device power modes.
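
To make the "set state" argument concrete, here is a minimal sketch of
how a trace consumer could account residency when states may change
directly from one non-zero state to another. Everything in it
(toy_residency, toy_set_state(), TOY_MAX_STATES) is a hypothetical name
chosen for illustration, not an existing kernel or perf interface.

```c
#include <stdint.h>

#define TOY_MAX_STATES 8

/* Hypothetical per-CPU accounting, fed by "set state" style events. */
struct toy_residency {
	unsigned int cur_state;            /* state entered by the last event */
	uint64_t last_ts;                  /* timestamp of the last event */
	uint64_t time_in[TOY_MAX_STATES];  /* accumulated time per state */
};

/* Charge the elapsed time to the state being left, then switch. */
static void toy_set_state(struct toy_residency *r, unsigned int new_state,
			  uint64_t ts)
{
	if (new_state >= TOY_MAX_STATES)
		return; /* ignore states this sketch has no slot for */
	r->time_in[r->cur_state] += ts - r->last_ts;
	r->cur_state = new_state;
	r->last_ts = ts;
}
```

A direct transition from state 1 to state 3 needs no intermediate
"exit" record here; with an enter/exit pair, the consumer would have to
see (or synthesize) an exit from state 1 before the entry into state 3.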

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-25 11:55       ` Ingo Molnar
                           ` (2 preceding siblings ...)
  2010-10-25 12:58         ` Mathieu Desnoyers
@ 2010-10-25 12:58         ` Mathieu Desnoyers
  2010-10-25 20:29           ` Rafael J. Wysocki
  2010-10-25 20:29           ` Rafael J. Wysocki
  3 siblings, 2 replies; 157+ messages in thread
From: Mathieu Desnoyers @ 2010-10-25 12:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thomas Renninger, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Masami Hiramatsu, Frank Eigler, Steven Rostedt, Kevin Hilman,
	Peter Zijlstra, linux-omap, rjw, linux-pm, linux-trace-users,
	Jean Pihet, Pierre Tardy, Frederic Weisbecker, Tejun Heo,
	Arjan van de Ven

* Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Thomas Renninger <trenn@suse.de> wrote:
> 
> > On Monday 25 October 2010 12:04:28 Ingo Molnar wrote:
> > > 
> > > * Thomas Renninger <trenn@suse.de> wrote:
> > > 
> > > > New power trace events:
> > > > power:processor_idle
> > > > power:processor_frequency
> > > > power:machine_suspend
> > > > 
> > > > 
> > > > C-state/idle accounting events:
> > > >   power:power_start
> > > >   power:power_end
> > > > are replaced with:
> > > >   power:processor_idle
> > > 
> > > Well, most power saving hw models (and the code implementing them) have this kind of 
> > > model:
> > > 
> > >  enter power saving mode X
> > >  exit power saving mode
> > > 
> > > Where X is some sort of 'power saving deepness' attribute, right?
> >
> > Sure.
> 
> Which is the 'saner' model?
> 
> > But ACPI and afaik this model got picked up for PCI and other (sub-)archs as well, 
> > defines state 0 as the non-power saving mode.
> 
> But the actual code does not actually deal with any 'state 0', does it? It enters an 
> idle function and then exits it, right?
> 
> 'power state' might be what is used for devices - but even there, we have:
> 
>   - enter power state X
>   - exit power state
> 
> right?
> 
> > Same as done here with machine suspend state (S0 is back from suspend) and
> > this model should get picked up when device sleep states get tracked at
> > some time.
> >
> > It's consistent and applies to some well known specifications.
> 
> What we want it to be is for it to be the nicest, most understandable, most logical 
> model - not one matching random hardware specifications.
> 
> ( Hardware specifications only matter in so far that it should be possible to 
>   express all the known hardware state transitions via these events efficiently. )
> 
> > Also tracking processor_idle_{start,end} as a separate event makes no sense and 
> > there is no need to introduce: processor_idle_start/processor_idle_end 
> > machine_suspend_start/machine_suspend_end 
> > device_power_mode_start/device_power_mode_end events.
> 
> What do you mean by "makes no sense"?
> 
> Are they superfluous? Inefficient? Illogical?

I think it would require deep understanding of specific power modes of each
architecture to split into this topology. On the bright side, it would bring
clear understanding of which HW resource is being put to sleep, which would make
automated analysis much easier to do. But maybe it's too much pain compared to
the benefit. The related question is also: where is it best to put this logic?
In the kernel code? In per-arch TRACE_EVENT() handlers or in external trace
analysis plugins?

> 
> > Using state 0 as "exit/end", is much nicer for kernel/ userspace 
> > implementations/code and the user.
> 
> By that argument we should not have separate fork() and exit() syscalls either, but 
> a set_process_state(1) and set_process_state(0) interface?

I'm by no means an expert on power saving hardware specs, but if it is possible for
hardware to switch between two power saving states without passing through power
state 0, then using a "set state" rather than an enter/exit would be more
appropriate; even if we go for a scheme introducing

processor_idle_start/processor_idle_end,
machine_suspend_start/machine_suspend_end,
device_power_mode_start/device_power_mode_end.

I must defer to you guys to figure out if some hardware actually does that for
either of CPU idle, suspend or device power modes.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-25  9:41     ` Thomas Renninger
@ 2010-10-25 13:55       ` Arjan van de Ven
  2010-10-25 13:55       ` Arjan van de Ven
  1 sibling, 0 replies; 157+ messages in thread
From: Arjan van de Ven @ 2010-10-25 13:55 UTC (permalink / raw)
  To: Thomas Renninger
  Cc: linux-trace-users, Frederic Weisbecker, Pierre Tardy, Jean Pihet,
	Steven Rostedt, Peter Zijlstra, Frank Eigler, Mathieu Desnoyers,
	Ingo Molnar, linux-pm, Masami Hiramatsu, Tejun Heo,
	Andrew Morton, linux-omap, Linus Torvalds, Thomas Gleixner

On 10/25/2010 2:41 AM, Thomas Renninger wrote:
> On Monday 25 October 2010 08:54:34 Arjan van de Ven wrote:
>> On 10/19/2010 4:36 AM, Thomas Renninger wrote:
>>>    static void poll_idle(void)
>>>    {
>>> -	trace_power_start(POWER_CSTATE, 0, smp_processor_id());
>>>    	local_irq_enable();
>>>    	while (!need_resched())
>>>    		cpu_relax();
>>> -	trace_power_end(0);
>>>    }
>> why did you remove the idle tracepoints from this one ???
> Because no idle/sleep state is entered here.
> State 0 does not exist or say, it means the machine is not idle.
> The new event uses idle state 0 spec conform as "exit sleep state".
>
> If this should still be trackable some kind of dummy sleep state:
> #define IDLE_BUSY_LOOP 0xFE
> (or similar) must get defined and passed like this:
> trace_processor_idle(IDLE_BUSY_LOOP, smp_processor_id());
>      cpu_relax()
> trace_processor_idle(0, smp_processor_id());
>
> I could imagine this is somewhat worth it to compare idle results
> to "no idle state at all" is used.
> But nobody should ever use idle=poll, comparing deep sleep states
> with C1 with (idle=halt) should be sufficient?

this is not idle=poll on the command line only.
this also gets used normally, in two cases
1) during real time operations, for some short periods of time
     (think wallstreet trading)
2) by the menu governor when the next event is less than a few 
microseconds away, so short that even C1 is too much

I know that your new API tries to use "0" as exit, but 0 is already 
taken (in all power terminology at least on x86 it is) for this.

why isn't your "exit" a special define?


also, if you look at many other similar perf events, they have separate
entry/exit points:

process/do_process.cpp:         perf_events->add_event("irq:irq_handler_entry");
process/do_process.cpp:         perf_events->add_event("irq:irq_handler_exit");
process/do_process.cpp:         perf_events->add_event("irq:softirq_entry");
process/do_process.cpp:         perf_events->add_event("irq:softirq_exit");
process/do_process.cpp:         perf_events->add_event("timer:timer_expire_entry");
process/do_process.cpp:         perf_events->add_event("timer:timer_expire_exit");
process/do_process.cpp:         perf_events->add_event("timer:hrtimer_expire_entry");
process/do_process.cpp:         perf_events->add_event("timer:hrtimer_expire_exit");
process/do_process.cpp:         perf_events->add_event("power:power_start");
process/do_process.cpp:         perf_events->add_event("power:power_end");
process/do_process.cpp:         perf_events->add_event("workqueue:workqueue_execute_start");
process/do_process.cpp:         perf_events->add_event("workqueue:workqueue_execute_end");

so there is already an API consistency precedent
(and frankly, trying to multiplex in "exit" via a magic value is asking 
for trouble API wise)


* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-25  9:41     ` Thomas Renninger
  2010-10-25 13:55       ` Arjan van de Ven
@ 2010-10-25 13:55       ` Arjan van de Ven
  2010-10-25 14:36         ` Thomas Renninger
  2010-10-25 14:36         ` Thomas Renninger
  1 sibling, 2 replies; 157+ messages in thread
From: Arjan van de Ven @ 2010-10-25 13:55 UTC (permalink / raw)
  To: Thomas Renninger
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Masami Hiramatsu,
	Frank Eigler, Steven Rostedt, Kevin Hilman, Peter Zijlstra,
	linux-omap, rjw, linux-pm, linux-trace-users, Jean Pihet,
	Pierre Tardy, Frederic Weisbecker, Tejun Heo, Mathieu Desnoyers,
	Ingo Molnar

On 10/25/2010 2:41 AM, Thomas Renninger wrote:
> On Monday 25 October 2010 08:54:34 Arjan van de Ven wrote:
>> On 10/19/2010 4:36 AM, Thomas Renninger wrote:
>>>    static void poll_idle(void)
>>>    {
>>> -	trace_power_start(POWER_CSTATE, 0, smp_processor_id());
>>>    	local_irq_enable();
>>>    	while (!need_resched())
>>>    		cpu_relax();
>>> -	trace_power_end(0);
>>>    }
>> why did you remove the idle tracepoints from this one ???
> Because no idle/sleep state is entered here.
> State 0 does not exist or say, it means the machine is not idle.
> The new event uses idle state 0 spec conform as "exit sleep state".
>
> If this should still be trackable some kind of dummy sleep state:
> #define IDLE_BUSY_LOOP 0xFE
> (or similar) must get defined and passed like this:
> trace_processor_idle(IDLE_BUSY_LOOP, smp_processor_id());
>      cpu_relax()
> trace_processor_idle(0, smp_processor_id());
>
> I could imagine this is somewhat worth it to compare idle results
> to "no idle state at all" is used.
> But nobody should ever use idle=poll, comparing deep sleep states
> with C1 with (idle=halt) should be sufficient?

this is not idle=poll on the command line only.
this also gets used normally, in two cases
1) during real time operations, for some short periods of time
     (think wallstreet trading)
2) by the menu governor when the next event is less than a few 
microseconds away, so short that even C1 is too much

I know that your new API tries to use "0" as exit, but 0 is already 
taken (in all power terminology at least on x86 it is) for this.

why isn't your "exit" a special define?


also, if you look at many other similar perf events, they have separate
entry/exit points:

process/do_process.cpp:         perf_events->add_event("irq:irq_handler_entry");
process/do_process.cpp:         perf_events->add_event("irq:irq_handler_exit");
process/do_process.cpp:         perf_events->add_event("irq:softirq_entry");
process/do_process.cpp:         perf_events->add_event("irq:softirq_exit");
process/do_process.cpp:         perf_events->add_event("timer:timer_expire_entry");
process/do_process.cpp:         perf_events->add_event("timer:timer_expire_exit");
process/do_process.cpp:         perf_events->add_event("timer:hrtimer_expire_entry");
process/do_process.cpp:         perf_events->add_event("timer:hrtimer_expire_exit");
process/do_process.cpp:         perf_events->add_event("power:power_start");
process/do_process.cpp:         perf_events->add_event("power:power_end");
process/do_process.cpp:         perf_events->add_event("workqueue:workqueue_execute_start");
process/do_process.cpp:         perf_events->add_event("workqueue:workqueue_execute_end");

so there is already an API consistency precedent
(and frankly, trying to multiplex in "exit" via a magic value is asking 
for trouble API wise)



* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-25 11:03     ` Thomas Renninger
  2010-10-25 11:55       ` Ingo Molnar
  2010-10-25 11:55       ` Ingo Molnar
@ 2010-10-25 13:58       ` Arjan van de Ven
  2010-10-25 13:58       ` Arjan van de Ven
  3 siblings, 0 replies; 157+ messages in thread
From: Arjan van de Ven @ 2010-10-25 13:58 UTC (permalink / raw)
  To: Thomas Renninger
  Cc: linux-trace-users, Frederic Weisbecker, Pierre Tardy, Jean Pihet,
	Steven Rostedt, Peter Zijlstra, Frank Eigler, Thomas Gleixner,
	linux-pm, Masami Hiramatsu, Tejun Heo, Ingo Molnar, linux-omap,
	Linus Torvalds, Andrew Morton, Mathieu Desnoyers

On 10/25/2010 4:03 AM, Thomas Renninger wrote:
> On Monday 25 October 2010 12:04:28 Ingo Molnar wrote:
>> * Thomas Renninger<trenn@suse.de>  wrote:
>>
>>> New power trace events:
>>> power:processor_idle
>>> power:processor_frequency
>>> power:machine_suspend
>>>
>>>
>>> C-state/idle accounting events:
>>>    power:power_start
>>>    power:power_end
>>> are replaced with:
>>>    power:processor_idle
>> Well, most power saving hw models (and the code implementing them) have this kind of
>> model:
>>
>>   enter power saving mode X
>>   exit power saving mode
>>
>> Where X is some sort of 'power saving deepness' attribute, right?
> Sure.
> But ACPI and afaik this model got picked up for PCI and other (sub-)archs
> as well, defines state 0 as the non-power saving mode.

correct... "C0" is not power efficient... but it's still a valid OS
idle state!

same for "S0"... S0 as standby state is still valid... sure it doesn't
save you much power... but that does not mean it's not valid.
(as indication, the Intel Moorestown platform, which is currently in
production and available to OEMs, has such an S0 standby state)


> Also tracking processor_idle_{start,end} as a separate event
> makes no sense and there is no need to introduce:
> processor_idle_start/processor_idle_end
> machine_suspend_start/machine_suspend_end
> device_power_mode_start/device_power_mode_end
> events.
> Using state 0 as "exit/end", is much nicer for kernel/
> userspace implementations/code and the user.
actually no; having written a few of these in userspace so far, having a 
separate end event is easier to deal with;
the actions you take on entry and exit are completely separate code paths.
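
A minimal sketch of what "separate code paths" can look like in a
userspace consumer, assuming one handler per event. The struct and
function names (toy_idle_stats, toy_on_start(), toy_on_end()) are
hypothetical and only stand in for whatever a real tool would use.

```c
#include <stdint.h>

/* Hypothetical per-CPU bookkeeping for a start/end event pair. */
struct toy_idle_stats {
	uint64_t entered_at;   /* timestamp of the pending start event */
	uint64_t total_idle;   /* accumulated idle time */
	unsigned int entries;  /* number of idle periods seen */
};

/* Entry path: remember when the idle period began. */
static void toy_on_start(struct toy_idle_stats *s, uint64_t ts)
{
	s->entered_at = ts;
	s->entries++;
}

/* Exit path: account the idle period that just completed. */
static void toy_on_end(struct toy_idle_stats *s, uint64_t ts)
{
	s->total_idle += ts - s->entered_at;
}
```

Neither handler has to inspect a state value to find out which of the
two jobs it is supposed to do.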


* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-25 11:03     ` Thomas Renninger
                         ` (2 preceding siblings ...)
  2010-10-25 13:58       ` Arjan van de Ven
@ 2010-10-25 13:58       ` Arjan van de Ven
  2010-10-25 20:33         ` Rafael J. Wysocki
  2010-10-25 20:33         ` Rafael J. Wysocki
  3 siblings, 2 replies; 157+ messages in thread
From: Arjan van de Ven @ 2010-10-25 13:58 UTC (permalink / raw)
  To: Thomas Renninger
  Cc: Ingo Molnar, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Masami Hiramatsu, Frank Eigler, Steven Rostedt, Kevin Hilman,
	Peter Zijlstra, linux-omap, rjw, linux-pm, linux-trace-users,
	Jean Pihet, Pierre Tardy, Frederic Weisbecker, Tejun Heo,
	Mathieu Desnoyers

On 10/25/2010 4:03 AM, Thomas Renninger wrote:
> On Monday 25 October 2010 12:04:28 Ingo Molnar wrote:
>> * Thomas Renninger<trenn@suse.de>  wrote:
>>
>>> New power trace events:
>>> power:processor_idle
>>> power:processor_frequency
>>> power:machine_suspend
>>>
>>>
>>> C-state/idle accounting events:
>>>    power:power_start
>>>    power:power_end
>>> are replaced with:
>>>    power:processor_idle
>> Well, most power saving hw models (and the code implementing them) have this kind of
>> model:
>>
>>   enter power saving mode X
>>   exit power saving mode
>>
>> Where X is some sort of 'power saving deepness' attribute, right?
> Sure.
> But ACPI and afaik this model got picked up for PCI and other (sub-)archs
> as well, defines state 0 as the non-power saving mode.

correct... "C0" is not power efficient... but it's still a valid OS
idle state!

same for "S0"... S0 as standby state is still valid... sure it doesn't
save you much power... but that does not mean it's not valid.
(as indication, the Intel Moorestown platform, which is currently in
production and available to OEMs, has such an S0 standby state)


> Also tracking processor_idle_{start,end} as a separate event
> makes no sense and there is no need to introduce:
> processor_idle_start/processor_idle_end
> machine_suspend_start/machine_suspend_end
> device_power_mode_start/device_power_mode_end
> events.
> Using state 0 as "exit/end", is much nicer for kernel/
> userspace implementations/code and the user.
actually no; having written a few of these in userspace so far, having a 
separate end event is easier to deal with;
the actions you take on entry and exit are completely separate code paths.



* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-25 12:55         ` Thomas Renninger
@ 2010-10-25 14:11           ` Arjan van de Ven
  2010-10-25 14:11           ` Arjan van de Ven
  1 sibling, 0 replies; 157+ messages in thread
From: Arjan van de Ven @ 2010-10-25 14:11 UTC (permalink / raw)
  To: Thomas Renninger
  Cc: linux-trace-users, Frederic Weisbecker, Pierre Tardy, Jean Pihet,
	Steven Rostedt, Peter Zijlstra, Frank Eigler, Thomas Gleixner,
	linux-pm, Masami Hiramatsu, Tejun Heo, Ingo Molnar, linux-omap,
	Linus Torvalds, Andrew Morton, Mathieu Desnoyers

On 10/25/2010 5:55 AM, Thomas Renninger wrote:


>> But the actual code does not actually deal with any 'state 0', does it?
> It does. Not being idle is tracked by cpuidle driver as state 0
> (arch independent):
> /sys/devices/system/cpu/cpu0/cpuidle/state0/
> halt/C1 on X86 is:
> /sys/devices/system/cpu/cpu0/cpuidle/state1/
> ...
state0 is still OS idle!


the API is just weird for this, from a userspace perspective

if the kernel picks this state 0 for the idle handler, the userspace app
gets two events

one for going to state 0 to enter the idle state
one for going to state 0 to exit idle

but they're the exact same event in your API.

rather unpleasant from a userspace program perspective....
now I need to start tracking even more state on top in powertop to be 
able to make a guess at which of the two meanings a state 0 entry has.
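
The extra bookkeeping can be sketched as follows. This is a hedged
illustration, not powertop code: toy_cpu and toy_interpret() are
hypothetical names, and the heuristic shown is just one possible guess
a consumer could make.

```c
#include <stdbool.h>

enum toy_meaning { TOY_ENTER_IDLE, TOY_EXIT_IDLE };

/* Hypothetical consumer state: was the CPU idle after the previous event? */
struct toy_cpu {
	bool idle;
};

/*
 * Interpret one single-event-style record. A non-zero state is always an
 * idle entry. State 0 is ambiguous, so the consumer falls back on history:
 * it is read as "enter poll idle" if the CPU was busy, else as "exit idle".
 */
static enum toy_meaning toy_interpret(struct toy_cpu *cpu, unsigned int state)
{
	if (state != 0 || !cpu->idle) {
		cpu->idle = true;
		return TOY_ENTER_IDLE;
	}
	cpu->idle = false;
	return TOY_EXIT_IDLE;
}
```

Two identical state 0 records in a row come out with different
meanings, purely because of the history the consumer had to keep. The
guess also goes wrong if the CPU moves from a deeper state straight to
state 0 as an idle state.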


* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-25 12:55         ` Thomas Renninger
  2010-10-25 14:11           ` Arjan van de Ven
@ 2010-10-25 14:11           ` Arjan van de Ven
  2010-10-25 14:51             ` Thomas Renninger
  2010-10-25 14:51             ` Thomas Renninger
  1 sibling, 2 replies; 157+ messages in thread
From: Arjan van de Ven @ 2010-10-25 14:11 UTC (permalink / raw)
  To: Thomas Renninger
  Cc: Ingo Molnar, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Masami Hiramatsu, Frank Eigler, Steven Rostedt, Kevin Hilman,
	Peter Zijlstra, linux-omap, rjw, linux-pm, linux-trace-users,
	Jean Pihet, Pierre Tardy, Frederic Weisbecker, Tejun Heo,
	Mathieu Desnoyers

On 10/25/2010 5:55 AM, Thomas Renninger wrote:


>> But the actual code does not actually deal with any 'state 0', does it?
> It does. Not being idle is tracked by cpuidle driver as state 0
> (arch independent):
> /sys/devices/system/cpu/cpu0/cpuidle/state0/
> halt/C1 on X86 is:
> /sys/devices/system/cpu/cpu0/cpuidle/state1/
> ...
state0 is still OS idle!


the API is just weird for this, from a userspace perspective

if the kernel picks this state 0 for the idle handler, the userspace app
gets two events

one for going to state 0 to enter the idle state
one for going to state 0 to exit idle

but they're the exact same event in your API.

rather unpleasant from a userspace program perspective....
now I need to start tracking even more state on top in powertop to be 
able to make a guess at which of the two meanings a state 0 entry has.



* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-25 13:55       ` Arjan van de Ven
  2010-10-25 14:36         ` Thomas Renninger
@ 2010-10-25 14:36         ` Thomas Renninger
  1 sibling, 0 replies; 157+ messages in thread
From: Thomas Renninger @ 2010-10-25 14:36 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: linux-trace-users, Frederic Weisbecker, Pierre Tardy, Jean Pihet,
	Steven Rostedt, Peter Zijlstra, Frank Eigler, Mathieu Desnoyers,
	Ingo Molnar, linux-pm, Masami Hiramatsu, Tejun Heo,
	Andrew Morton, linux-omap, Linus Torvalds, Thomas Gleixner

On Monday 25 October 2010 15:55:08 Arjan van de Ven wrote:
> On 10/25/2010 2:41 AM, Thomas Renninger wrote:
> > On Monday 25 October 2010 08:54:34 Arjan van de Ven wrote:
> >> On 10/19/2010 4:36 AM, Thomas Renninger wrote:
> >>>    static void poll_idle(void)
> >>>    {
> >>> -	trace_power_start(POWER_CSTATE, 0, smp_processor_id());
> >>>    	local_irq_enable();
> >>>    	while (!need_resched())
> >>>    		cpu_relax();
> >>> -	trace_power_end(0);
> >>>    }
> >> why did you remove the idle tracepoints from this one ???
> > Because no idle/sleep state is entered here.
> > State 0 does not exist or say, it means the machine is not idle.
> > The new event uses idle state 0 spec conform as "exit sleep state".
> >
> > If this should still be trackable some kind of dummy sleep state:
> > #define IDLE_BUSY_LOOP 0xFE
> > (or similar) must get defined and passed like this:
> > trace_processor_idle(IDLE_BUSY_LOOP, smp_processor_id());
> >      cpu_relax()
> > trace_processor_idle(0, smp_processor_id());
> >
> > I could imagine this is somewhat worth it to compare idle results
> > to "no idle state at all" is used.
> > But nobody should ever use idle=poll, comparing deep sleep states
> > with C1 with (idle=halt) should be sufficient?
> 
> this is not idle=poll on the command line only.
> this also gets used normally, in two cases
> 1) during real time operations, for some short periods of time
>      (think wallstreet trading)
> 2) by the menu governor when the next event is less than a few 
> microseconds away, so short that even C1 is too much
> 
> I know that your new API tries to use "0" as exit, but 0 is already 
> taken (in all power terminology at least on x86 it is) for this.
cpuidle indeed misuses C0 as "poll idle" state.
That's really bad/misleading, but nothing that can be changed easily.

I agree shifting C0 (cpuidle) <-> POLL_IDLE event
and              "not idle"   <-> real C0 (executing instructions)
or however this gets mapped makes things even worse.

Damn, it could have been that easy and straightforward, but I agree that
this kills the approach of triggering a state 0 event when C0 is entered
(C0 being defined as the operational mode, executing instructions).

     Thomas

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-25 14:36         ` Thomas Renninger
@ 2010-10-25 14:45           ` Arjan van de Ven
  2010-10-25 14:45           ` Arjan van de Ven
  1 sibling, 0 replies; 157+ messages in thread
From: Arjan van de Ven @ 2010-10-25 14:45 UTC (permalink / raw)
  To: Thomas Renninger
  Cc: linux-trace-users, Frederic Weisbecker, Pierre Tardy, Jean Pihet,
	Steven Rostedt, Peter Zijlstra, Frank Eigler, Mathieu Desnoyers,
	Ingo Molnar, linux-pm, Masami Hiramatsu, Tejun Heo,
	Andrew Morton, linux-omap, Linus Torvalds, Thomas Gleixner

On 10/25/2010 7:36 AM, Thomas Renninger wrote:
>> I know that your new API tries to use "0" as exit, but 0 is already
>> taken (in all power terminology at least on x86 it is) for this.
> cpuidle indeed misuses C0 as "poll idle" state.
> That's really bad/misleading, but nothing that can be changed easily.
>
> I agree shifting C0 (cpuidle)<->  POLL_IDLE event
> and              "not idle"<->  real C0 (executing instructions)
> or however this gets mapped makes things even worse.
>
> Damn, it could be that easy and straight forward, but I agree that
> this kills the approach to trigger state 0 event if C0 is entered
> (C0 as defined as operational mode executing instructions).

ok so we have

"C0 idle"
and
"C0 no longer idle"

I'd propose using the number 0 for the first one (it makes the most 
logical sense, it's the least deep idle state etc etc)

we could use "-1" or "INT_MAX" for the latter

but as a user of the API I would rather have a separate "we're no longer
idle" event... but if not, as long as things aren't ambiguous I'll find a
way to code around it.
basically with a separate event, I demultiplex based on event number
between entry and exit.... with a special exit value I would just need a
double demultiplex: one on "idle" and then a second one on the state
number to split between entry/exit.
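
A minimal sketch of the two consumer-side schemes described above, assuming
a hypothetical event record. struct trace_ev, the EV_* kinds and
IDLE_EXIT_STATE are illustrative names, not the kernel's tracepoint ABI.
With separate enter/exit events, one switch on the event kind is enough;
with a single event and a reserved exit value, a second test on the state
number is needed:

```c
#include <assert.h>

/* Hypothetical event kinds; illustrative only, not kernel ABI. */
enum ev_kind { EV_IDLE_ENTER, EV_IDLE_EXIT, EV_IDLE_STATE };

/* Hypothetical reserved value meaning "no longer idle" (single-event scheme). */
#define IDLE_EXIT_STATE (-1)

struct trace_ev {
	enum ev_kind kind;
	int state;	/* idle state number; 0 is the polling idle state */
};

enum idle_edge { EDGE_NONE, EDGE_ENTER, EDGE_EXIT };

/* Classify one event as an idle entry or an idle exit. */
static enum idle_edge classify(const struct trace_ev *ev)
{
	assert(ev);
	switch (ev->kind) {
	case EV_IDLE_ENTER:	/* separate events: the kind alone decides */
		return EDGE_ENTER;
	case EV_IDLE_EXIT:
		return EDGE_EXIT;
	case EV_IDLE_STATE:	/* single event: second demultiplex on state */
		return ev->state == IDLE_EXIT_STATE ? EDGE_EXIT : EDGE_ENTER;
	}
	return EDGE_NONE;
}
```

Note that state 0 stays available as a real (polling) idle state in both
schemes, because the exit is never encoded as 0.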

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-25 14:11           ` Arjan van de Ven
  2010-10-25 14:51             ` Thomas Renninger
@ 2010-10-25 14:51             ` Thomas Renninger
  1 sibling, 0 replies; 157+ messages in thread
From: Thomas Renninger @ 2010-10-25 14:51 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: linux-trace-users, Frederic Weisbecker, Pierre Tardy, Jean Pihet,
	Steven Rostedt, Peter Zijlstra, Frank Eigler, Thomas Gleixner,
	linux-pm, Masami Hiramatsu, Tejun Heo, Ingo Molnar, linux-omap,
	Linus Torvalds, Andrew Morton, Mathieu Desnoyers

On Monday 25 October 2010 16:11:10 Arjan van de Ven wrote:
> On 10/25/2010 5:55 AM, Thomas Renninger wrote:
> 
> 
> >> But the actual code does not actually deal with any 'state 0', does it?
> > It does. Not being idle is tracked by cpuidle driver as state 0
> > (arch independent):
> > /sys/devices/system/cpu/cpu0/cpuidle/state0/
> > halt/C1 on X86 is:
> > /sys/devices/system/cpu/cpu0/cpuidle/state1/
> > ...
> state0 is still OS idle!
Yes, I just realized that.
Which is very unfortunate.
The whole cpuidle stuff is based on ACPI C-states and
cat /sys/devices/system/cpu/cpu0/cpuidle/state0/name
C0
is plain wrong if it's used as "poll idle" time.
C0 is defined as (in the ACPI spec):
----------
2.5 Processor Power State Definitions
C0 Processor Power State
While the processor is in this state, it executes instructions.
----------

> the API is just weird for this, from a userspace perspective
> 
> if the kernel picks this state 0 for the idle handler, the userspace app 
> gets
> two events
> 
> one for going to state 0 to enter the idle state
> one for going to state 0 to exit idle
> 
> but they're the exact same event in your API.
> 
> rather unpleasant from a userspace program perspective....
Yeah. But redefining C0 to mean "Linux poll idle" will confuse users
as well. Not sure whether this should be touched, though.

Thanks for the clarification, I wasn't aware of that...

    Thomas
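
The sysfs name quoted above can be checked with a few lines of C. This is
only a sketch: read_idle_state_name() is a hypothetical helper, and the
base directory is passed in so that nothing is assumed about which states
a given machine exposes.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Read <base>/state<idx>/name into buf and strip the trailing newline.
 * Returns 0 on success, -1 if the path is too long or cannot be read.
 */
static int read_idle_state_name(const char *base, int idx,
				char *buf, size_t len)
{
	char path[256];
	FILE *f;
	int n;

	assert(base && buf && len > 0);
	n = snprintf(path, sizeof(path), "%s/state%d/name", base, idx);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;

	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(buf, (int)len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	buf[strcspn(buf, "\n")] = '\0';	/* drop the newline sysfs appends */
	return 0;
}
```

Called with "/sys/devices/system/cpu/cpu0/cpuidle" and index 0, it returns
whatever name the running kernel reports for state0; the mail above
reports C0 for the kernel discussed in this thread.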

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-25 14:45           ` Arjan van de Ven
@ 2010-10-25 14:56             ` Ingo Molnar
  2010-10-25 14:56             ` Ingo Molnar
  1 sibling, 0 replies; 157+ messages in thread
From: Ingo Molnar @ 2010-10-25 14:56 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: linux-trace-users, Frederic Weisbecker, Pierre Tardy, Jean Pihet,
	Steven Rostedt, Peter Zijlstra, Frank Eigler, Thomas Gleixner,
	linux-pm, Masami Hiramatsu, Tejun Heo, Andrew Morton, linux-omap,
	Linus Torvalds, Mathieu Desnoyers


* Arjan van de Ven <arjan@linux.intel.com> wrote:

> On 10/25/2010 7:36 AM, Thomas Renninger wrote:
> >>I know that your new API tries to use "0" as exit, but 0 is already
> >>taken (in all power terminology at least on x86 it is) for this.
> >cpuidle indeed misuses C0 as "poll idle" state.
> >That's really bad/misleading, but nothing that can be changed easily.
> >
> >I agree shifting C0 (cpuidle)<->  POLL_IDLE event
> >and              "not idle"<->  real C0 (executing instructions)
> >or however this gets mapped makes things even worse.
> >
> >Damn, it could be that easy and straight forward, but I agree that
> >this kills the approach to trigger state 0 event if C0 is entered
> >(C0 as defined as operational mode executing instructions).
> 
> ok so we have
> 
> "C0 idle"
> and
> "C0 no longer idle"
> 
> I'd propose using the number 0 for the first one (it makes the most
> logical sense, it's the least deep idle state etc etc)
> 
> we could use "-1" or "INT_MAX" for the later
> 
> but as a user of the API I rather like a separate "we're no longer idle" event... 
> but if not, as long as things aren't ambigious I'll find a way to code around it.
>
> basically with a separate event, I demultiplex based on event number between entry 
> and exit.... with a special exit value I would just need a double demultiplex,

Hm, does not sound particularly smart.

> one on "idle" and then a second one on the state number to split between 
> entry/exit.

The thing is, in terms of CPU idle state, if the old tracepoints give us all the
information that the new tracepoints do, why don't we simply add the tracepoints to ARM
and be done with it? No app needs to be changed in that case, etc.

Plus, let's express the suspend/resume tracepoints as suspend_enter(X)/suspend_exit()
events as well, to keep it symmetric and consistent with the other enter/exit
events.

The rename alone isn't a strong enough reason really. 'entering idle state X' and
'exiting idle' are pretty much synonymous with 'enter idle state X'.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-25 14:56             ` Ingo Molnar
  2010-10-25 15:48               ` Thomas Renninger
@ 2010-10-25 15:48               ` Thomas Renninger
  1 sibling, 0 replies; 157+ messages in thread
From: Thomas Renninger @ 2010-10-25 15:48 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Frederic Weisbecker, linux-trace-users, Arjan van de Ven,
	Pierre Tardy, Jean Pihet, Steven Rostedt, Peter Zijlstra,
	Frank Eigler, Mathieu Desnoyers, linux-pm, Masami Hiramatsu,
	Tejun Heo, Andrew Morton, linux-omap, Linus Torvalds,
	Thomas Gleixner

On Monday 25 October 2010 16:56:04 Ingo Molnar wrote:
> 
> * Arjan van de Ven <arjan@linux.intel.com> wrote:
> > On 10/25/2010 7:36 AM, Thomas Renninger wrote:
> > ok so we have
> > 
> > "C0 idle"
Ideally this should not be called C0, but expressed
as (#define) POLL_IDLE wherever possible.

In all documentation/specs/white papers about other OSes,
C0 is referred to as not being idle.
Linux misuses it as a self-defined idle state, which
is really confusing.

> > and
> > "C0 no longer idle"
> > 
> > I'd propose using the number 0 for the first one (it makes the most
> > logical sense, it's the least deep idle state etc etc)
I would use a special number for the "Linux only" state.

> > we could use "-1" or "INT_MAX" for the later
> > but as a user of the API I rather like a separate "we're no longer idle" event... 
> > but if not, as long as things aren't ambigious I'll find a way to code around it.
> >
> > basically with a separate event, I demultiplex based on event number between entry 
> > and exit.... with a special exit value I would just need a double demultiplex,
> 
> Hm, does not sound particularly smart.
> 
> > one on "idle" and then a second one on the state number to split between 
> > entry/exit.
> 
> The thing is, in terms of CPU idle state, if the old tracepoints give us all the 
> information that the new tracepoints, why dont we simply add the tracepoints to ARM 
> and be done with it? No app needs to be changed in that case, etc.
> 
> Plus, lets express the suspend/resume tracepoints as suspend_enter(X)/suspend_exit() 
> events as well, to keep it symmetric and consistent with the other enter/exit 
> events.
> 
> The rename alone isnt a strong enough reason really. 'entering idle state X' and 
> 'exiting idle' is pretty much synonymous to 'enter idle state X'.
It's not only that, my patch also:
  - eliminates the never-used type= field
  - uses a better name; currently it's power:power_{start,end}.
    How would you name another power event...

Altogether, that should justify the proposed cleanup(s).
But with this C0 clash, I am not sure which of these should be done:
  1) no cleanup at all, as Ingo said
  2) a minimal cleanup:
       - rename power:power_{start,end} to power:processor_idle_{start,end}
       - get rid of the type= field
  3) or a maximum cleanup:
       - the above, plus not using start/end events but a single
         state transition event.
I think it is best if Jean goes with the current definitions.
Option 2 is far less intrusive, and if you would like to have it, I can
still send another patch.

     Thomas

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-25 15:48               ` Thomas Renninger
  2010-10-25 16:00                 ` Arjan van de Ven
@ 2010-10-25 16:00                 ` Arjan van de Ven
  1 sibling, 0 replies; 157+ messages in thread
From: Arjan van de Ven @ 2010-10-25 16:00 UTC (permalink / raw)
  To: Thomas Renninger
  Cc: linux-trace-users, Frederic Weisbecker, Pierre Tardy, Jean Pihet,
	Steven Rostedt, Peter Zijlstra, Frank Eigler, Thomas Gleixner,
	linux-pm, Masami Hiramatsu, Tejun Heo, Ingo Molnar, linux-omap,
	Linus Torvalds, Andrew Morton, Mathieu Desnoyers

On 10/25/2010 8:48 AM, Thomas Renninger wrote:
> On Monday 25 October 2010 16:56:04 Ingo Molnar wrote:
>> * Arjan van de Ven<arjan@linux.intel.com>  wrote:
>>> On 10/25/2010 7:36 AM, Thomas Renninger wrote:
>>> ok so we have
>>>
>>> "C0 idle"
> Ideally this should not be called C0, but expressed
> as (#define) POLL_IDLE wherever possible.
>
> In all documentations/specs/white papers about other OSes
> C0 is refered to as not being idle.
> Linux mis-uses it as a self-defined idle state which
> is really confusing.

sure naming is one thing
>>> and
>>> "C0 no longer idle"
>>>
>>> I'd propose using the number 0 for the first one (it makes the most
>>> logical sense, it's the least deep idle state etc etc)
> I would use a special number for the "Linux only" state.

that special number is 0 though..
it makes sense in ordering, 0 < 1, 1 < 2 etc

0 makes for a really bad special number for the exit marker; not just here,
but also for your suspend hook, that one definitely needs to change
(since current commercially available SoCs already use 0 for
standby-level states)
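
To illustrate why 0 is a poor exit marker, here is a small sketch assuming
a single transition event that carries only a state number.
count_idle_entries() and the sample streams are hypothetical. If 0 is both
the polling idle state and the exit marker, an entry into polling idle
cannot be told apart from an exit; a reserved value such as -1 keeps the
two apart:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count idle entries in a stream of state values taken from a single
 * "state transition" event, given the value reserved as the exit marker.
 * Every value other than the marker is treated as an idle entry.
 */
static size_t count_idle_entries(const int *states, size_t n, int exit_marker)
{
	size_t entries = 0;
	size_t i;

	assert(states || n == 0);
	for (i = 0; i < n; i++)
		if (states[i] != exit_marker)
			entries++;
	return entries;
}
```

For the sequence "enter poll idle, exit, enter state 1, exit", the stream
is 0, 0, 1, 0 when 0 marks the exit and 0, -1, 1, -1 with a -1 marker. The
function counts one entry in the first case and two in the second, so the
poll idle period is lost when 0 doubles as the exit value.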

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-25 12:58         ` Mathieu Desnoyers
  2010-10-25 20:29           ` Rafael J. Wysocki
@ 2010-10-25 20:29           ` Rafael J. Wysocki
  1 sibling, 0 replies; 157+ messages in thread
From: Rafael J. Wysocki @ 2010-10-25 20:29 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andrew Morton, linux-trace-users, Frederic Weisbecker,
	Pierre Tardy, Jean Pihet, Steven Rostedt, Peter Zijlstra,
	Frank Eigler, Arjan van de Ven, linux-pm, Masami Hiramatsu,
	Tejun Heo, Thomas Gleixner, linux-omap, Linus Torvalds,
	Ingo Molnar

On Monday, October 25, 2010, Mathieu Desnoyers wrote:
> * Ingo Molnar (mingo@elte.hu) wrote:
> > 
> > * Thomas Renninger <trenn@suse.de> wrote:
> > 
> > > On Monday 25 October 2010 12:04:28 Ingo Molnar wrote:
> > > > 
> > > > * Thomas Renninger <trenn@suse.de> wrote:
> > > > 
> > > > > New power trace events:
> > > > > power:processor_idle
> > > > > power:processor_frequency
> > > > > power:machine_suspend
> > > > > 
> > > > > 
> > > > > C-state/idle accounting events:
> > > > >   power:power_start
> > > > >   power:power_end
> > > > > are replaced with:
> > > > >   power:processor_idle
> > > > 
> > > > Well, most power saving hw models (and the code implementing them) have this kind of 
> > > > model:
> > > > 
> > > >  enter power saving mode X
> > > >  exit power saving mode
> > > > 
> > > > Where X is some sort of 'power saving deepness' attribute, right?
> > >
> > > Sure.
> > 
> > Which is is the 'saner' model?
> > 
> > > But ACPI and afaik this model got picked up for PCI and other (sub-)archs as well, 
> > > defines state 0 as the non-power saving mode.
> > 
> > But the actual code does not actually deal with any 'state 0', does it? It enters an 
> > idle function and then exits it, right?
> > 
> > 'power state' might be what is used for devices - but even there, we have:
> > 
> >   - enter power state X
> >   - exit power state
> > 
> > right?
> > 
> > > Same as done here with machine suspend state (S0 is back from suspend) and
> > > this model should get picked up when device sleep states get tracked at
> > > some time.
> > >
> > > It's consistent and applies to some well known specifications.
> > 
> > What we want it to be is for it to be the nicest, most understandable, most logical 
> > model - not one matching random hardware specifications.
> > 
> > ( Hardware specifications only matter in so far that it should be possible to 
> >   express all the known hardware state transitions via these events efficiently. )
> > 
> > > Also tracking processor_idle_{start,end} as a separate event makes no sense and 
> > > there is no need to introduce: processor_idle_start/processor_idle_end 
> > > machine_suspend_start/machine_suspend_end 
> > > device_power_mode_start/device_power_mode_end events.
> > 
> > What do you mean by "makes no sense"?
> > 
> > Are they superfluous? Inefficient? Illogical?
> 
> I think it would require deep understanding of specific power modes of each
> architecture to split into this topology. On the bright side, it would bring
> clear understanding of which HW resource is being put to sleep, which would make
> automated analysis much easier to do. But maybe it's too much pain compared to
> the benefit. The related question is also: where is it best to put this logic ?
> In the kernel code ? In per-arch TRACE_EVENT() handlers or in external trace
> analysis plugins ?
> 
> > 
> > > Using state 0 as "exit/end", is much nicer for kernel/ userspace 
> > > implementations/code and the user.
> > 
> > By that argument we should not have separate fork() and exit() syscalls either, but 
> > a set_process_state(1) and set_process_state(0) interface?
> 
> I'm by no mean expert on power saving hardware specs, but if it is possible for
> hardware to switch between two power saving states without passing through power
> state 0, then using a "set state" rather than an enter/exit would be more
> appropriate; even if we go for a scheme introducing
> 
> processor_idle_start/processor_idle_end,
> machine_suspend_start/machine_suspend_end,
> device_power_mode_start/device_power_mode_end.
> 
> I must defer to you guys to figure out if some hardware actually do that for
> either of CPU idle, suspend or device power modes.

Yes, you can go directly from PCI_D1 to PCI_D2, for one example.

Apart from this, attempting to put system suspend into the same bag as cpuidle
is not going to work in the long run.  They are _fundamentally_ different things
even though the power state we get into as a result of suspend is approximately
the same as we can get into via cpuidle (even in that case the energy savings
will generally be different in both cases due to wakeup events).

Thanks,
Rafael
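
Editorial illustration of the point above, not part of the original mail: a
minimal sketch, in C, of a userspace consumer built around a single
"set state" event. The names (struct tracker, on_set_state, STATE_EXIT) and
the four-entry residency table are hypothetical. Because every event carries
the new state, a direct D1 -> D2 style transition needs no special casing.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical sentinel meaning "no low-power state is active". */
#define STATE_EXIT 0xFFFFFFFFu
#define NR_STATES  4

struct tracker {
	uint32_t cur;                  /* current state, STATE_EXIT if none */
	uint64_t since;                /* timestamp of the last transition */
	uint64_t residency[NR_STATES]; /* time accumulated per state */
};

static void tracker_init(struct tracker *t, uint64_t now)
{
	memset(t, 0, sizeof(*t));
	t->cur = STATE_EXIT;
	t->since = now;
}

/*
 * One handler covers entering a state, leaving it, and moving directly
 * from one low-power state to another: the time since the previous
 * event is credited to whatever state was active until now.
 */
static void on_set_state(struct tracker *t, uint32_t state, uint64_t now)
{
	if (t->cur != STATE_EXIT && t->cur < NR_STATES)
		t->residency[t->cur] += now - t->since;
	t->cur = state;
	t->since = now;
}
```

With separate enter/exit events, the same consumer would have to infer an
implicit exit whenever it sees two enter events in a row.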

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-25 13:58       ` Arjan van de Ven
@ 2010-10-25 20:33         ` Rafael J. Wysocki
  2010-10-25 20:33         ` Rafael J. Wysocki
  1 sibling, 0 replies; 157+ messages in thread
From: Rafael J. Wysocki @ 2010-10-25 20:33 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Andrew Morton, linux-trace-users, Frederic Weisbecker,
	Pierre Tardy, Jean Pihet, Steven Rostedt, Peter Zijlstra,
	Frank Eigler, Mathieu Desnoyers, linux-pm, Masami Hiramatsu,
	Tejun Heo, Thomas Gleixner, linux-omap, Linus Torvalds,
	Ingo Molnar

On Monday, October 25, 2010, Arjan van de Ven wrote:
> On 10/25/2010 4:03 AM, Thomas Renninger wrote:
> > On Monday 25 October 2010 12:04:28 Ingo Molnar wrote:
> >> * Thomas Renninger<trenn@suse.de>  wrote:
> >>
> >>> New power trace events:
> >>> power:processor_idle
> >>> power:processor_frequency
> >>> power:machine_suspend
> >>>
> >>>
> >>> C-state/idle accounting events:
> >>>    power:power_start
> >>>    power:power_end
> >>> are replaced with:
> >>>    power:processor_idle
> >> Well, most power saving hw models (and the code implementing them) have this kind of
> >> model:
> >>
> >>   enter power saving mode X
> >>   exit power saving mode
> >>
> >> Where X is some sort of 'power saving deepness' attribute, right?
> > Sure.
> > But ACPI and afaik this model got picked up for PCI and other (sub-)archs
> > as well, defines state 0 as the non-power saving mode.
> 
> correct ,... "C0" is not power efficient... but it's still a valid OS 
> idle state!
> Also tracking processor_idle_{start,end} as a separate event!
> 
> same for "S0"... S0 as standby state is still valid... sure it doesn't 
> save you much power... but that does not mean it's not valid.

If you mean ACPI S0, it is not a standby state.  It actually is the full-power
state.

> (as indication, the Intel Moorestown platform, which is currently in 
> production and available to OEMs, has such a S0 standby state)

Another naming confusion.  How smart.

> > makes no sense and there is no need to introduce:
> > processor_idle_start/processor_idle_end
> > machine_suspend_start/machine_suspend_end
> > device_power_mode_start/device_power_mode_end
> > events.
> > Using state 0 as "exit/end", is much nicer for kernel/
> > userspace implementations/code and the user.
> actually no; having written a few of these in userspace so far, having a 
> separate end event is easier to deal with;
> the actions you take on entry and exit are complete separate code paths.

That's correct, unless you go directly from one low-power state to another
(which is possible for example for PCI).  We don't do that at the moment,
but it's possible in principle and we may want to start doing that at one
point.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-25 16:00                 ` Arjan van de Ven
  2010-10-25 23:32                   ` Thomas Renninger
@ 2010-10-25 23:32                   ` Thomas Renninger
  1 sibling, 0 replies; 157+ messages in thread
From: Thomas Renninger @ 2010-10-25 23:32 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: linux-trace-users, Frederic Weisbecker, Pierre Tardy, Jean Pihet,
	Steven Rostedt, Peter Zijlstra, Frank Eigler, Thomas Gleixner,
	linux-pm, Masami Hiramatsu, Tejun Heo, Ingo Molnar, linux-omap,
	Linus Torvalds, Andrew Morton, Mathieu Desnoyers

@Ingo: Can you queue up 1/3? It's an independent fix.

On Monday 25 October 2010 06:00:17 pm Arjan van de Ven wrote:
> On 10/25/2010 8:48 AM, Thomas Renninger wrote:
> 
> sure naming is one thing
Yes, it should get renamed so that it does not show:
cat /sys/devices/system/cpu/cpu0/cpuidle/state0/name
C0
This is wrong and confusing.

> >>> and
> >>> "C0 no longer idle"
> >>>
> >>> I'd propose using the number 0 for the first one (it makes the most
> >>> logical sense, it's the least deep idle state etc etc)
> > I would use a special number for the "Linux only" state.
> 
> that special number is 0 though..
> it makes sense in ordering, 0 < 1, 1 < 2 etc
As long as it stays a kernel and perf processor_idle internal number
it does not hurt.
But userspace tools catching the perf idle event of state 0 should never
refer to it as processor idle state 0 (or even worse C0).
Instead they should try to get the name/description of:
/sys/../state0/name
or directly refer to it as "poll idle" state.

Processor idle state C0 is not only defined as "not being idle" in the
specs; turbostat and cpufreq-aperf also use it correctly and refer to C0 when
they show accounted "not idle" time.

Encouraged by your suggestions, I am sending another version.
It's not a big deal to send 0xFFFFFFFF instead of 0 as the "non power saving"
state. If you can handle compatibility with it in powertop, it doesn't make
things more complicated in the kernel and perf timechart, as I first thought it
would.

      Thomas
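
Editorial illustration, not part of the original mail: a minimal sketch of
how a userspace tool might label a processor_idle state along the lines
described above. The helper names are hypothetical, and it is an assumption
that the state number in the event can be used directly as the sysfs stateN
index. 0xFFFFFFFF is treated as "not idle", the sysfs name is preferred when
it can be read, and "poll idle" is the fallback for state 0.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PWR_EVENT_EXIT 0xFFFFFFFFu /* "non power saving" marker discussed above */

/* Build the sysfs path holding the name of a cpuidle state. */
static int cpuidle_name_path(char *buf, size_t len, unsigned int cpu,
			     uint32_t state)
{
	return snprintf(buf, len,
			"/sys/devices/system/cpu/cpu%u/cpuidle/state%u/name",
			cpu, (unsigned int)state);
}

/* Write a human-readable label for an idle event's state into out. */
static void idle_state_label(char *out, size_t len, unsigned int cpu,
			     uint32_t state)
{
	char path[128];
	FILE *f;

	if (state == PWR_EVENT_EXIT) {
		snprintf(out, len, "not idle");
		return;
	}

	cpuidle_name_path(path, sizeof(path), cpu, state);
	f = fopen(path, "r");
	if (f) {
		char *line = fgets(out, (int)len, f);

		fclose(f);
		if (line) {
			out[strcspn(out, "\n")] = '\0'; /* strip newline */
			return;
		}
	}

	/* sysfs not readable: never call state 0 "C0". */
	if (state == 0)
		snprintf(out, len, "poll idle");
	else
		snprintf(out, len, "idle state %u", (unsigned int)state);
}
```

On kernels where state0/name still reads "C0", a tool would have to
special-case state 0 instead of trusting the sysfs name.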

^ permalink raw reply	[flat|nested] 157+ messages in thread

* [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-19 11:36 ` [PATCH 2/3] PERF(kernel): Cleanup power events Thomas Renninger
                     ` (6 preceding siblings ...)
  2010-10-25 23:33   ` [PATCH] PERF(kernel): Cleanup power events V2 Thomas Renninger
@ 2010-10-25 23:33   ` Thomas Renninger
  7 siblings, 0 replies; 157+ messages in thread
From: Thomas Renninger @ 2010-10-25 23:33 UTC (permalink / raw)
  Cc: Arjan van de Ven, linux-trace-users, Frederic Weisbecker,
	Pierre Tardy, Jean Pihet, Steven Rostedt, Peter Zijlstra,
	Frank Eigler, Thomas Gleixner, Ingo Molnar, linux-pm,
	Masami Hiramatsu, Tejun Heo, Andrew Morton, linux-omap,
	Linus Torvalds, Mathieu Desnoyers

Changes in V2:
  - Introduce PWR_EVENT_EXIT instead of 0 to mark the non-power-saving state
  - Use u32 instead of u64 for cpuid and state, which is more than enough

New power trace events:
power:processor_idle
power:processor_frequency
power:machine_suspend


C-state/idle accounting events:
  power:power_start
  power:power_end
are replaced with:
  power:processor_idle

and
  power:power_frequency
is replaced with:
  power:processor_frequency

power:machine_suspend
is newly introduced. A first implementation
comes from the ARM side, but it's easy to add these events
on x86 as well if needed.

The type= field got removed from both events; it was never
used, and the type is distinguished by the event itself.

The perf timechart
userspace tool gets adjusted in a separate patch.

Signed-off-by: Thomas Renninger <trenn@suse.de>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
CC: Frank Eigler <fche@redhat.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Kevin Hilman <khilman@deeprootsystems.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: linux-omap@vger.kernel.org
CC: rjw@sisk.pl
CC: linux-pm@lists.linux-foundation.org
CC: linux-trace-users@vger.kernel.org
CC: Jean Pihet <jean.pihet@newoldbits.com>
CC: Pierre Tardy <tardyp@gmail.com>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Tejun Heo <tj@kernel.org>
CC: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Arjan van de Ven <arjan@linux.intel.com>
CC: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/kernel/process.c    |    7 +++-
 arch/x86/kernel/process_64.c |    2 +
 drivers/cpufreq/cpufreq.c    |    1 +
 drivers/cpuidle/cpuidle.c    |    1 +
 drivers/idle/intel_idle.c    |    1 +
 include/trace/events/power.h |   81 +++++++++++++++++++++++++++++++++++++++++-
 kernel/trace/Kconfig         |   14 +++++++
 kernel/trace/power-traces.c  |    3 ++
 8 files changed, 108 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 57d1868..6a98da3 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -374,6 +374,7 @@ void default_idle(void)
 {
 	if (hlt_use_halt()) {
 		trace_power_start(POWER_CSTATE, 1, smp_processor_id());
+		trace_processor_idle(1, smp_processor_id());
 		current_thread_info()->status &= ~TS_POLLING;
 		/*
 		 * TS_POLLING-cleared state must be visible before we
@@ -444,6 +445,7 @@ EXPORT_SYMBOL_GPL(cpu_idle_wait);
 void mwait_idle_with_hints(unsigned long ax, unsigned long cx)
 {
 	trace_power_start(POWER_CSTATE, (ax>>4)+1, smp_processor_id());
+	trace_processor_idle((ax>>4)+1, smp_processor_id());
 	if (!need_resched()) {
 		if (cpu_has(&current_cpu_data, X86_FEATURE_CLFLUSH_MONITOR))
 			clflush((void *)&current_thread_info()->flags);
@@ -460,6 +462,7 @@ static void mwait_idle(void)
 {
 	if (!need_resched()) {
 		trace_power_start(POWER_CSTATE, 1, smp_processor_id());
+		trace_processor_idle(1, smp_processor_id());
 		if (cpu_has(&current_cpu_data, X86_FEATURE_CLFLUSH_MONITOR))
 			clflush((void *)&current_thread_info()->flags);
 
@@ -481,10 +484,12 @@ static void mwait_idle(void)
 static void poll_idle(void)
 {
 	trace_power_start(POWER_CSTATE, 0, smp_processor_id());
+	trace_processor_idle(0, smp_processor_id());
 	local_irq_enable();
 	while (!need_resched())
 		cpu_relax();
-	trace_power_end(0);
+	trace_power_end(smp_processor_id());
+	trace_processor_idle(PWR_EVENT_EXIT, smp_processor_id());
 }
 
 /*
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 3d9ea53..5f2bb98 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -142,6 +142,8 @@ void cpu_idle(void)
 			start_critical_timings();
 
 			trace_power_end(smp_processor_id());
+			trace_processor_idle(PWR_EVENT_EXIT,
+					     smp_processor_id());
 
 			/* In many cases the interrupt that ended idle
 			   has already called exit_idle. But some idle
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 199dcb9..33bdc41 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -355,6 +355,7 @@ void cpufreq_notify_transition(struct cpufreq_freqs *freqs, unsigned int state)
 		dprintk("FREQ: %lu - CPU: %lu", (unsigned long)freqs->new,
 			(unsigned long)freqs->cpu);
 		trace_power_frequency(POWER_PSTATE, freqs->new, freqs->cpu);
+		trace_processor_frequency(freqs->new, freqs->cpu);
 		srcu_notifier_call_chain(&cpufreq_transition_notifier_list,
 				CPUFREQ_POSTCHANGE, freqs);
 		if (likely(policy) && likely(policy->cpu == freqs->cpu))
diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index a507108..ec703e6 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -107,6 +107,7 @@ static void cpuidle_idle_call(void)
 	if (cpuidle_curr_governor->reflect)
 		cpuidle_curr_governor->reflect(dev);
 	trace_power_end(smp_processor_id());
+	trace_processor_idle(PWR_EVENT_EXIT, smp_processor_id());
 }
 
 /**
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index 21ac077..c78e496 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -202,6 +202,7 @@ static int intel_idle(struct cpuidle_device *dev, struct cpuidle_state *state)
 
 	stop_critical_timings();
 	trace_power_start(POWER_CSTATE, (eax >> 4) + 1, cpu);
+	trace_processor_idle((eax >> 4) + 1, smp_processor_id());
 	if (!need_resched()) {
 
 		__monitor((void *)&current_thread_info()->flags, 0, 0);
diff --git a/include/trace/events/power.h b/include/trace/events/power.h
index 35a2a6e..4b13414 100644
--- a/include/trace/events/power.h
+++ b/include/trace/events/power.h
@@ -7,6 +7,61 @@
 #include <linux/ktime.h>
 #include <linux/tracepoint.h>
 
+DECLARE_EVENT_CLASS(processor,
+
+	TP_PROTO(unsigned int state, unsigned int cpu_id),
+
+	TP_ARGS(state, cpu_id),
+
+	TP_STRUCT__entry(
+		__field(	u32,		state		)
+		__field(	u32,		cpu_id		)
+	),
+
+	TP_fast_assign(
+		__entry->state = state;
+		__entry->cpu_id = cpu_id;
+	),
+
+	TP_printk("state=%lu cpu_id=%lu", (unsigned long)__entry->state,
+		  (unsigned long)__entry->cpu_id)
+);
+
+DEFINE_EVENT(processor, processor_idle,
+
+	TP_PROTO(unsigned int state, unsigned int cpu_id),
+
+	     TP_ARGS(state, cpu_id)
+);
+
+#define PWR_EVENT_EXIT 0xFFFFFFFF
+
+DEFINE_EVENT(processor, processor_frequency,
+
+	TP_PROTO(unsigned int frequency, unsigned int cpu_id),
+
+	TP_ARGS(frequency, cpu_id)
+);
+
+TRACE_EVENT(machine_suspend,
+
+	TP_PROTO(unsigned int state),
+
+	TP_ARGS(state),
+
+	TP_STRUCT__entry(
+		__field(	u32,		state		)
+	),
+
+	TP_fast_assign(
+		__entry->state = state;
+	),
+
+	TP_printk("state=%lu", (unsigned long)__entry->state)
+);
+
+#ifdef CONFIG_EVENT_POWER_TRACING_DEPRECATED
+
 #ifndef _TRACE_POWER_ENUM_
 #define _TRACE_POWER_ENUM_
 enum {
@@ -69,8 +124,32 @@ TRACE_EVENT(power_end,
 	TP_printk("cpu_id=%lu", (unsigned long)__entry->cpu_id)
 
 );
-
+#endif /* CONFIG_EVENT_POWER_TRACING_DEPRECATED */
 #endif /* _TRACE_POWER_H */
 
+/* Deprecated dummy functions must be protected against multiple declaration */
+#ifndef EVENT_POWER_TRACING_DEPRECATED_PART_H
+#define EVENT_POWER_TRACING_DEPRECATED_PART_H
+
+#ifndef CONFIG_EVENT_POWER_TRACING_DEPRECATED
+
+#ifndef _TRACE_POWER_ENUM_
+#define _TRACE_POWER_ENUM_
+enum {
+	POWER_NONE = 0,
+	POWER_CSTATE = 1,
+	POWER_PSTATE = 2,
+};
+#endif
+
+static inline void trace_power_start(u64 type, u64 state, u64 cpuid) {}
+static inline void trace_power_end(u64 cpuid) {}
+static inline void trace_power_frequency(u64 type, u64 state, u64 cpuid) {}
+#endif /* CONFIG_EVENT_POWER_TRACING_DEPRECATED */
+
+#endif /* EVENT_POWER_TRACING_DEPRECATED_PART_H */
+
+
+
 /* This part must be outside protection */
 #include <trace/define_trace.h>
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 538501c..0b5c841 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -64,6 +64,20 @@ config EVENT_TRACING
 	select CONTEXT_SWITCH_TRACER
 	bool
 
+config EVENT_POWER_TRACING_DEPRECATED
+	depends on EVENT_TRACING
+	bool
+	help
+	  Provides old power event types:
+	  C-state/idle accounting events:
+	  power:power_start
+	  power:power_end
+	  and old cpufreq accounting event:
+	  power:power_frequency
+	  This is for userspace compatibility
+	  and will vanish after 5 kernel iterations,
+	  namely 2.6.41.
+
 config CONTEXT_SWITCH_TRACER
 	bool
 
diff --git a/kernel/trace/power-traces.c b/kernel/trace/power-traces.c
index 0e0497d..6b6da42 100644
--- a/kernel/trace/power-traces.c
+++ b/kernel/trace/power-traces.c
@@ -13,5 +13,8 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/power.h>
 
+#ifdef CONFIG_EVENT_POWER_TRACING_DEPRECATED
 EXPORT_TRACEPOINT_SYMBOL_GPL(power_start);
+#endif
+EXPORT_TRACEPOINT_SYMBOL_GPL(processor_idle);
 
-- 
1.6.3
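
Editorial illustration, not part of the patch: a minimal sketch of how a
userspace tool might parse the text produced by the TP_printk() format of
the new "processor" event class ("state=%lu cpu_id=%lu"). It assumes the
caller has already stripped the usual trace-line prefix (task, CPU,
timestamp, event name); struct idle_event and the function names are
hypothetical. PWR_EVENT_EXIT shows up in the text as 4294967295.

```c
#include <stdint.h>
#include <stdio.h>

#define PWR_EVENT_EXIT 0xFFFFFFFFu /* value from the patch above */

struct idle_event {
	uint32_t state;
	uint32_t cpu_id;
};

/* Returns 0 on success, -1 if the payload does not match the format. */
static int parse_processor_event(const char *payload, struct idle_event *ev)
{
	unsigned long state, cpu;

	if (sscanf(payload, "state=%lu cpu_id=%lu", &state, &cpu) != 2)
		return -1;

	ev->state = (uint32_t)state;
	ev->cpu_id = (uint32_t)cpu;
	return 0;
}

/* True when the event marks the CPU leaving its idle state. */
static int is_idle_exit(const struct idle_event *ev)
{
	return ev->state == PWR_EVENT_EXIT;
}
```

The same parser works for power:processor_frequency, since both events
share the event class; there the first field holds the frequency rather
than an idle state.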

^ permalink raw reply related	[flat|nested] 157+ messages in thread

* [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-19 11:36 ` [PATCH 2/3] PERF(kernel): Cleanup power events Thomas Renninger
                     ` (5 preceding siblings ...)
  2010-10-25 10:04   ` Ingo Molnar
@ 2010-10-25 23:33   ` Thomas Renninger
  2010-10-26  1:09     ` Arjan van de Ven
                       ` (7 more replies)
  2010-10-25 23:33   ` Thomas Renninger
  7 siblings, 8 replies; 157+ messages in thread
From: Thomas Renninger @ 2010-10-25 23:33 UTC (permalink / raw)
  Cc: Thomas Renninger, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Masami Hiramatsu, Frank Eigler, Steven Rostedt, Kevin Hilman,
	Peter Zijlstra, linux-omap, rjw, linux-pm, linux-trace-users,
	Jean Pihet, Pierre Tardy, Frederic Weisbecker, Tejun Heo,
	Mathieu Desnoyers, Arjan van de Ven, Ingo Molnar

Changes in V2:
  - Introduce PWR_EVENT_EXIT instead of 0 to mark the non-power-saving state
  - Use u32 instead of u64 for cpuid and state, which is more than enough

New power trace events:
power:processor_idle
power:processor_frequency
power:machine_suspend


C-state/idle accounting events:
  power:power_start
  power:power_end
are replaced with:
  power:processor_idle

and
  power:power_frequency
is replaced with:
  power:processor_frequency

power:machine_suspend
is newly introduced. A first implementation
comes from the ARM side, but it's easy to add these events
on x86 as well if needed.

The type= field got removed from both events; it was never
used, and the type is distinguished by the event itself.

The perf timechart
userspace tool gets adjusted in a separate patch.

Signed-off-by: Thomas Renninger <trenn@suse.de>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
CC: Frank Eigler <fche@redhat.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Kevin Hilman <khilman@deeprootsystems.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: linux-omap@vger.kernel.org
CC: rjw@sisk.pl
CC: linux-pm@lists.linux-foundation.org
CC: linux-trace-users@vger.kernel.org
CC: Jean Pihet <jean.pihet@newoldbits.com>
CC: Pierre Tardy <tardyp@gmail.com>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Tejun Heo <tj@kernel.org>
CC: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Arjan van de Ven <arjan@linux.intel.com>
CC: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/kernel/process.c    |    7 +++-
 arch/x86/kernel/process_64.c |    2 +
 drivers/cpufreq/cpufreq.c    |    1 +
 drivers/cpuidle/cpuidle.c    |    1 +
 drivers/idle/intel_idle.c    |    1 +
 include/trace/events/power.h |   81 +++++++++++++++++++++++++++++++++++++++++-
 kernel/trace/Kconfig         |   14 +++++++
 kernel/trace/power-traces.c  |    3 ++
 8 files changed, 108 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 57d1868..6a98da3 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -374,6 +374,7 @@ void default_idle(void)
 {
 	if (hlt_use_halt()) {
 		trace_power_start(POWER_CSTATE, 1, smp_processor_id());
+		trace_processor_idle(1, smp_processor_id());
 		current_thread_info()->status &= ~TS_POLLING;
 		/*
 		 * TS_POLLING-cleared state must be visible before we
@@ -444,6 +445,7 @@ EXPORT_SYMBOL_GPL(cpu_idle_wait);
 void mwait_idle_with_hints(unsigned long ax, unsigned long cx)
 {
 	trace_power_start(POWER_CSTATE, (ax>>4)+1, smp_processor_id());
+	trace_processor_idle((ax>>4)+1, smp_processor_id());
 	if (!need_resched()) {
 		if (cpu_has(&current_cpu_data, X86_FEATURE_CLFLUSH_MONITOR))
 			clflush((void *)&current_thread_info()->flags);
@@ -460,6 +462,7 @@ static void mwait_idle(void)
 {
 	if (!need_resched()) {
 		trace_power_start(POWER_CSTATE, 1, smp_processor_id());
+		trace_processor_idle(1, smp_processor_id());
 		if (cpu_has(&current_cpu_data, X86_FEATURE_CLFLUSH_MONITOR))
 			clflush((void *)&current_thread_info()->flags);
 
@@ -481,10 +484,12 @@ static void mwait_idle(void)
 static void poll_idle(void)
 {
 	trace_power_start(POWER_CSTATE, 0, smp_processor_id());
+	trace_processor_idle(0, smp_processor_id());
 	local_irq_enable();
 	while (!need_resched())
 		cpu_relax();
-	trace_power_end(0);
+	trace_power_end(smp_processor_id());
+	trace_processor_idle(PWR_EVENT_EXIT, smp_processor_id());
 }
 
 /*
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 3d9ea53..5f2bb98 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -142,6 +142,8 @@ void cpu_idle(void)
 			start_critical_timings();
 
 			trace_power_end(smp_processor_id());
+			trace_processor_idle(PWR_EVENT_EXIT,
+					     smp_processor_id());
 
 			/* In many cases the interrupt that ended idle
 			   has already called exit_idle. But some idle
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 199dcb9..33bdc41 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -355,6 +355,7 @@ void cpufreq_notify_transition(struct cpufreq_freqs *freqs, unsigned int state)
 		dprintk("FREQ: %lu - CPU: %lu", (unsigned long)freqs->new,
 			(unsigned long)freqs->cpu);
 		trace_power_frequency(POWER_PSTATE, freqs->new, freqs->cpu);
+		trace_processor_frequency(freqs->new, freqs->cpu);
 		srcu_notifier_call_chain(&cpufreq_transition_notifier_list,
 				CPUFREQ_POSTCHANGE, freqs);
 		if (likely(policy) && likely(policy->cpu == freqs->cpu))
diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index a507108..ec703e6 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -107,6 +107,7 @@ static void cpuidle_idle_call(void)
 	if (cpuidle_curr_governor->reflect)
 		cpuidle_curr_governor->reflect(dev);
 	trace_power_end(smp_processor_id());
+	trace_processor_idle(PWR_EVENT_EXIT, smp_processor_id());
 }
 
 /**
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index 21ac077..c78e496 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -202,6 +202,7 @@ static int intel_idle(struct cpuidle_device *dev, struct cpuidle_state *state)
 
 	stop_critical_timings();
 	trace_power_start(POWER_CSTATE, (eax >> 4) + 1, cpu);
+	trace_processor_idle((eax >> 4) + 1, smp_processor_id());
 	if (!need_resched()) {
 
 		__monitor((void *)&current_thread_info()->flags, 0, 0);
diff --git a/include/trace/events/power.h b/include/trace/events/power.h
index 35a2a6e..4b13414 100644
--- a/include/trace/events/power.h
+++ b/include/trace/events/power.h
@@ -7,6 +7,61 @@
 #include <linux/ktime.h>
 #include <linux/tracepoint.h>
 
+DECLARE_EVENT_CLASS(processor,
+
+	TP_PROTO(unsigned int state, unsigned int cpu_id),
+
+	TP_ARGS(state, cpu_id),
+
+	TP_STRUCT__entry(
+		__field(	u32,		state		)
+		__field(	u32,		cpu_id		)
+	),
+
+	TP_fast_assign(
+		__entry->state = state;
+		__entry->cpu_id = cpu_id;
+	),
+
+	TP_printk("state=%lu cpu_id=%lu", (unsigned long)__entry->state,
+		  (unsigned long)__entry->cpu_id)
+);
+
+DEFINE_EVENT(processor, processor_idle,
+
+	TP_PROTO(unsigned int state, unsigned int cpu_id),
+
+	     TP_ARGS(state, cpu_id)
+);
+
+#define PWR_EVENT_EXIT 0xFFFFFFFF
+
+DEFINE_EVENT(processor, processor_frequency,
+
+	TP_PROTO(unsigned int frequency, unsigned int cpu_id),
+
+	TP_ARGS(frequency, cpu_id)
+);
+
+TRACE_EVENT(machine_suspend,
+
+	TP_PROTO(unsigned int state),
+
+	TP_ARGS(state),
+
+	TP_STRUCT__entry(
+		__field(	u32,		state		)
+	),
+
+	TP_fast_assign(
+		__entry->state = state;
+	),
+
+	TP_printk("state=%lu", (unsigned long)__entry->state)
+);
+
+#ifdef CONFIG_EVENT_POWER_TRACING_DEPRECATED
+
 #ifndef _TRACE_POWER_ENUM_
 #define _TRACE_POWER_ENUM_
 enum {
@@ -69,8 +124,32 @@ TRACE_EVENT(power_end,
 	TP_printk("cpu_id=%lu", (unsigned long)__entry->cpu_id)
 
 );
-
+#endif /* CONFIG_EVENT_POWER_TRACING_DEPRECATED */
 #endif /* _TRACE_POWER_H */
 
+/* Deprecated dummy functions must be protected against multi-declaration */
+#ifndef EVENT_POWER_TRACING_DEPRECATED_PART_H
+#define EVENT_POWER_TRACING_DEPRECATED_PART_H
+
+#ifndef CONFIG_EVENT_POWER_TRACING_DEPRECATED
+
+#ifndef _TRACE_POWER_ENUM_
+#define _TRACE_POWER_ENUM_
+enum {
+	POWER_NONE = 0,
+	POWER_CSTATE = 1,
+	POWER_PSTATE = 2,
+};
+#endif
+
+static inline void trace_power_start(u64 type, u64 state, u64 cpuid) {}
+static inline void trace_power_end(u64 cpuid) {}
+static inline void trace_power_frequency(u64 type, u64 state, u64 cpuid) {}
+#endif /* CONFIG_EVENT_POWER_TRACING_DEPRECATED */
+
+#endif /* EVENT_POWER_TRACING_DEPRECATED_PART_H */
+
+
+
 /* This part must be outside protection */
 #include <trace/define_trace.h>
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 538501c..0b5c841 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -64,6 +64,20 @@ config EVENT_TRACING
 	select CONTEXT_SWITCH_TRACER
 	bool
 
+config EVENT_POWER_TRACING_DEPRECATED
+	depends on EVENT_TRACING
+	bool
+	help
+	  Provides old power event types:
+	  C-state/idle accounting events:
+	  power:power_start
+	  power:power_end
+	  and old cpufreq accounting event:
+	  power:power_frequency
+	  This is for userspace compatibility
+	  and will vanish after 5 kernel iterations,
+	  namely 2.6.41.
+
 config CONTEXT_SWITCH_TRACER
 	bool
 
diff --git a/kernel/trace/power-traces.c b/kernel/trace/power-traces.c
index 0e0497d..6b6da42 100644
--- a/kernel/trace/power-traces.c
+++ b/kernel/trace/power-traces.c
@@ -13,5 +13,8 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/power.h>
 
+#ifdef CONFIG_EVENT_POWER_TRACING_DEPRECATED
 EXPORT_TRACEPOINT_SYMBOL_GPL(power_start);
+#endif
+EXPORT_TRACEPOINT_SYMBOL_GPL(processor_idle);
 
-- 
1.6.3


^ permalink raw reply related	[flat|nested] 157+ messages in thread

* [PATCH] PERF(userspace): Adjust perf timechart to the new power events V2
  2010-10-19 11:36 ` Thomas Renninger
  2010-10-26  0:18   ` [PATCH] PERF(userspace): Adjust perf timechart to the new power events V2 Thomas Renninger
@ 2010-10-26  0:18   ` Thomas Renninger
  1 sibling, 0 replies; 157+ messages in thread
From: Thomas Renninger @ 2010-10-26  0:18 UTC (permalink / raw)
  Cc: Arjan van de Ven, linux-trace-users, Frederic Weisbecker,
	Pierre Tardy, Jean Pihet, Steven Rostedt, Peter Zijlstra,
	Frank Eigler, Thomas Gleixner, Ingo Molnar, linux-pm,
	Masami Hiramatsu, Tejun Heo, Andrew Morton, linux-omap,
	Linus Torvalds, Mathieu Desnoyers

Changes in V2:
  - Handle PWR_EVENT_EXIT instead of 0 to recognize the non-power state

The transition was rather smooth. The only part I had to fiddle with
for some time was the check whether a tracepoint/event is supported
by the running kernel.

builtin-timechart must only pass -e power:xy events which
are supported by the running kernel.
For this I added a tiny helper function to parse-events.[hc]:
int is_valid_tracepoint(const char *event_string)
It could be made more generic as an interface and support
hardware/software/... events, not only tracepoints, but someone
else could extend that if needed...

Signed-off-by: Thomas Renninger <trenn@suse.de>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
CC: Frank Eigler <fche@redhat.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Kevin Hilman <khilman@deeprootsystems.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: linux-omap@vger.kernel.org
CC: rjw@sisk.pl
CC: linux-pm@lists.linux-foundation.org
CC: linux-trace-users@vger.kernel.org
CC: Jean Pihet <jean.pihet@newoldbits.com>
CC: Pierre Tardy <tardyp@gmail.com>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Tejun Heo <tj@kernel.org>
CC: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Arjan van de Ven <arjan@linux.intel.com>
CC: Ingo Molnar <mingo@elte.hu>
---
 tools/perf/builtin-timechart.c |   89 ++++++++++++++++++++++++++++++++-------
 tools/perf/util/parse-events.c |   43 +++++++++++++++++++-
 tools/perf/util/parse-events.h |    1 +
 3 files changed, 116 insertions(+), 17 deletions(-)

diff --git a/tools/perf/builtin-timechart.c b/tools/perf/builtin-timechart.c
index 9bcc38f..7eaa5b5 100644
--- a/tools/perf/builtin-timechart.c
+++ b/tools/perf/builtin-timechart.c
@@ -32,6 +32,10 @@
 #include "util/session.h"
 #include "util/svghelper.h"
 
+#define SUPPORT_OLD_POWER_EVENTS 1
+#define PWR_EVENT_EXIT 0xFFFFFFFF
+
+
 static char		const *input_name = "perf.data";
 static char		const *output_name = "output.svg";
 
@@ -298,12 +302,25 @@ struct trace_entry {
 	int			lock_depth;
 };
 
-struct power_entry {
+#if defined(SUPPORT_OLD_POWER_EVENTS)
+struct power_entry_old {
 	struct trace_entry te;
 	u64	type;
 	u64	value;
 	u64	cpu_id;
 };
+#endif
+
+struct power_processor_entry {
+	struct trace_entry te;
+	u32	state;
+	u32	cpu_id;
+};
+
+struct power_suspend_entry {
+	struct trace_entry te;
+	u32	state;
+};
 
 #define TASK_COMM_LEN 16
 struct wakeup_entry {
@@ -489,29 +506,46 @@ static int process_sample_event(event_t *event, struct perf_session *session)
 	te = (void *)data.raw_data;
 	if (session->sample_type & PERF_SAMPLE_RAW && data.raw_size > 0) {
 		char *event_str;
-		struct power_entry *pe;
-
-		pe = (void *)te;
+#if defined(SUPPORT_OLD_POWER_EVENTS)
+		struct power_entry_old *peo;
+		peo = (void *)te;
+#endif
 
 		event_str = perf_header__find_event(te->type);
 
 		if (!event_str)
 			return 0;
 
-		if (strcmp(event_str, "power:power_start") == 0)
-			c_state_start(pe->cpu_id, data.time, pe->value);
-
-		if (strcmp(event_str, "power:power_end") == 0)
-			c_state_end(pe->cpu_id, data.time);
+		if (strcmp(event_str, "power:processor_idle") == 0) {
+			struct power_processor_entry *ppe = (void *)te;
+			if (ppe->state == PWR_EVENT_EXIT)
+				c_state_end(ppe->cpu_id, data.time);
+			else
+				c_state_start(ppe->cpu_id, data.time,
+					      ppe->state);
+		}
 
-		if (strcmp(event_str, "power:power_frequency") == 0)
-			p_state_change(pe->cpu_id, data.time, pe->value);
+		else if (strcmp(event_str, "power:processor_frequency") == 0) {
+			struct power_processor_entry *ppe = (void *)te;
+			p_state_change(ppe->cpu_id, data.time, ppe->state);
+		}
 
-		if (strcmp(event_str, "sched:sched_wakeup") == 0)
+		else if (strcmp(event_str, "sched:sched_wakeup") == 0)
 			sched_wakeup(data.cpu, data.time, data.pid, te);
 
-		if (strcmp(event_str, "sched:sched_switch") == 0)
+		else if (strcmp(event_str, "sched:sched_switch") == 0)
 			sched_switch(data.cpu, data.time, te);
+
+#if defined(SUPPORT_OLD_POWER_EVENTS)
+		else if (strcmp(event_str, "power:power_start") == 0)
+			c_state_start(peo->cpu_id, data.time, peo->value);
+
+		else if (strcmp(event_str, "power:power_end") == 0)
+			c_state_end(peo->cpu_id, data.time);
+
+		else if (strcmp(event_str, "power:power_frequency") == 0)
+			p_state_change(peo->cpu_id, data.time, peo->value);
+#endif
 	}
 	return 0;
 }
@@ -968,7 +1002,8 @@ static const char * const timechart_usage[] = {
 	NULL
 };
 
-static const char *record_args[] = {
+#if defined(SUPPORT_OLD_POWER_EVENTS)
+static const char *record_old_args[] = {
 	"record",
 	"-a",
 	"-R",
@@ -980,16 +1015,38 @@ static const char *record_args[] = {
 	"-e", "sched:sched_wakeup",
 	"-e", "sched:sched_switch",
 };
+#endif
+
+static const char *record_new_args[] = {
+	"record",
+	"-a",
+	"-R",
+	"-f",
+	"-c", "1",
+	"-e", "power:processor_frequency",
+	"-e", "power:processor_idle",
+	"-e", "sched:sched_wakeup",
+	"-e", "sched:sched_switch",
+};
 
 static int __cmd_record(int argc, const char **argv)
 {
 	unsigned int rec_argc, i, j;
 	const char **rec_argv;
+	const char **record_args = record_new_args;
+	unsigned int record_elems = ARRAY_SIZE(record_new_args);
 
-	rec_argc = ARRAY_SIZE(record_args) + argc - 1;
+#if defined(SUPPORT_OLD_POWER_EVENTS)
+	if (is_valid_tracepoint("power:power_start")) {
+		record_args = record_old_args;
+		record_elems = ARRAY_SIZE(record_old_args);
+	}
+#endif
+
+	rec_argc = record_elems + argc - 1;
 	rec_argv = calloc(rec_argc + 1, sizeof(char *));
 
-	for (i = 0; i < ARRAY_SIZE(record_args); i++)
+	for (i = 0; i < record_elems; i++)
 		rec_argv[i] = strdup(record_args[i]);
 
 	for (j = 1; j < (unsigned int)argc; j++, i++)
diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index 4af5bd5..d706dcb 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -824,7 +824,7 @@ int parse_events(const struct option *opt __used, const char *str, int unset __u
 		if (ret != EVT_HANDLED_ALL) {
 			attrs[nr_counters] = attr;
 			nr_counters++;
-		}
+	}
 
 		if (*str == 0)
 			break;
@@ -906,6 +906,47 @@ static void print_tracepoint_events(void)
 }
 
 /*
+ * Check whether event is in <debugfs_mount_point>/tracing/events
+ */
+
+int is_valid_tracepoint(const char *event_string)
+{
+	DIR *sys_dir, *evt_dir;
+	struct dirent *sys_next, *evt_next, sys_dirent, evt_dirent;
+	char evt_path[MAXPATHLEN];
+	char dir_path[MAXPATHLEN];
+
+	if (debugfs_valid_mountpoint(debugfs_path))
+		return 0;
+
+	sys_dir = opendir(debugfs_path);
+	if (!sys_dir)
+		return 0;
+
+	for_each_subsystem(sys_dir, sys_dirent, sys_next) {
+
+		snprintf(dir_path, MAXPATHLEN, "%s/%s", debugfs_path,
+			 sys_dirent.d_name);
+		evt_dir = opendir(dir_path);
+		if (!evt_dir)
+			continue;
+
+		for_each_event(sys_dirent, evt_dir, evt_dirent, evt_next) {
+			snprintf(evt_path, MAXPATHLEN, "%s:%s",
+				 sys_dirent.d_name, evt_dirent.d_name);
+			if (!strcmp(evt_path, event_string)) {
+				closedir(evt_dir);
+				closedir(sys_dir);
+				return 1;
+			}
+		}
+		closedir(evt_dir);
+	}
+	closedir(sys_dir);
+	return 0;
+}
+
+/*
  * Print the help text for the event symbols:
  */
 void print_events(void)
diff --git a/tools/perf/util/parse-events.h b/tools/perf/util/parse-events.h
index fc4ab3f..7ab4685 100644
--- a/tools/perf/util/parse-events.h
+++ b/tools/perf/util/parse-events.h
@@ -29,6 +29,7 @@ extern int parse_filter(const struct option *opt, const char *str, int unset);
 #define EVENTS_HELP_MAX (128*1024)
 
 extern void print_events(void);
+extern int is_valid_tracepoint(const char *event_string);
 
 extern char debugfs_path[];
 extern int valid_debugfs_mount(const char *debugfs);
-- 
1.6.3

^ permalink raw reply related	[flat|nested] 157+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-25 23:33   ` [PATCH] PERF(kernel): Cleanup power events V2 Thomas Renninger
@ 2010-10-26  1:09     ` Arjan van de Ven
  2010-10-26  1:09     ` Arjan van de Ven
                       ` (6 subsequent siblings)
  7 siblings, 0 replies; 157+ messages in thread
From: Arjan van de Ven @ 2010-10-26  1:09 UTC (permalink / raw)
  To: Thomas Renninger
  Cc: linux-trace-users, Frederic Weisbecker, Pierre Tardy, Jean Pihet,
	Steven Rostedt, Peter Zijlstra, Frank Eigler, Mathieu Desnoyers,
	Ingo Molnar, linux-pm, Masami Hiramatsu, Tejun Heo,
	Andrew Morton, linux-omap, Linus Torvalds, Thomas Gleixner

On 10/25/2010 4:33 PM, Thomas Renninger wrote:
> Changes in V2:
>    - Introduce PWR_EVENT_EXIT instead of 0 to mark non-power state
>    - Use u32 instead of u64 for cpuid, state which is by far enough
>
> New power trace events:
> power:processor_idle
> power:processor_frequency
> power:machine_suspend
>
>
> C-state/idle accounting events:
>    power:power_start
>    power:power_end
> are replaced with:
>    power:processor_idle
>
> and
>    power:power_frequency
> is replaced with:
>    power:processor_frequency
>
> power:machine_suspend
> is newly introduced, a first implementation
> comes from the ARM side, but it's easy to add these events
> in X86 as well if needed.
>
> the type= field got removed from both, it was never
> used and the type is differed by the event type itself.
>
> perf timechart
> userspace tool gets adjusted in a separate patch.
>

Acked-by: Arjan van de Ven <arjan@linux.intel.com>

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-25 23:33   ` [PATCH] PERF(kernel): Cleanup power events V2 Thomas Renninger
  2010-10-26  1:09     ` Arjan van de Ven
  2010-10-26  1:09     ` Arjan van de Ven
@ 2010-10-26  7:10     ` Ingo Molnar
  2010-10-26  7:10     ` Ingo Molnar
                       ` (4 subsequent siblings)
  7 siblings, 0 replies; 157+ messages in thread
From: Ingo Molnar @ 2010-10-26  7:10 UTC (permalink / raw)
  To: Thomas Renninger
  Cc: Arjan van de Ven, linux-trace-users, Frederic Weisbecker,
	Pierre Tardy, Jean Pihet, Steven Rostedt, Peter Zijlstra,
	Frank Eigler, Mathieu Desnoyers, linux-pm, Masami Hiramatsu,
	Tejun Heo, Andrew Morton, linux-omap, Linus Torvalds,
	Thomas Gleixner


* Thomas Renninger <trenn@suse.de> wrote:

> Changes in V2:
>   - Introduce PWR_EVENT_EXIT instead of 0 to mark non-power state
>   - Use u32 instead of u64 for cpuid, state which is by far enough
> 
> New power trace events:
> power:processor_idle
> power:processor_frequency
> power:machine_suspend
> 
> 
> C-state/idle accounting events:
>   power:power_start
>   power:power_end
> are replaced with:
>   power:processor_idle
> 
> and
>   power:power_frequency
> is replaced with:
>   power:processor_frequency

Could you please name it power:cpu_idle and power:cpu_frequency instead, for 
shortness? We generally use 'cpu' in the kernel and for events.

> power:machine_suspend

How will future PCI (or other device) power saving tracepoints be called?

Might be more consistent to use:

  power:cpu_idle
  power:machine_idle
  power:device_idle

Where machine_idle is the suspend event.

> the type= field got removed from both, it was never
> used and the type is differed by the event type itself.
>
> +#define PWR_EVENT_EXIT 0xFFFFFFFF

Shouldn't this be part of the POWER_ enum? (and you can write -1 there)
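
To illustrate the -1 suggestion, a minimal sketch, assuming the exit
marker moves into an enum. The enum placement and the helper names
store_state() and is_exit_state() are hypothetical and not part of the
patch. Stored into a u32 trace field, -1 ends up as the same bit pattern
as 0xFFFFFFFF:

```c
#include <stdint.h>

/* Assumed placement: the patch itself uses a #define, not an enum. */
enum {
	PWR_EVENT_EXIT = -1,
};

/* Hypothetical helper: mimics assigning the argument to a u32 field. */
static uint32_t store_state(unsigned int state)
{
	return (uint32_t)state;
}

/* Hypothetical helper: how a reader of the u32 field would test for exit. */
static int is_exit_state(uint32_t state)
{
	return state == (uint32_t)PWR_EVENT_EXIT;
}
```

A consumer that reads the field into a 64-bit variable would still need
the explicit (uint32_t) cast, since (uint64_t)-1 is not 0xFFFFFFFF.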

> +#ifndef _TRACE_POWER_ENUM_
> +#define _TRACE_POWER_ENUM_
> +enum {
> +	POWER_NONE = 0,
> +	POWER_CSTATE = 1,
> +	POWER_PSTATE = 2,
> +};
> +#endif

Since we are cleaning up all these events, those enum definitions don't really look 
logical. For example, what is 'POWER_NONE'? Can a CPU have 'no power'?

Plus:

> +DECLARE_EVENT_CLASS(processor,
> +
> +	TP_PROTO(unsigned int state, unsigned int cpu_id),
> +
> +	TP_ARGS(state, cpu_id),
> +
> +	TP_STRUCT__entry(
> +		__field(	u32,		state		)
> +		__field(	u32,		cpu_id		)

Trace entries can carry a cpu_id of the current processor already. Can this cpu_id 
ever be different from that CPU id?

> +	),
> +
> +	TP_fast_assign(
> +		__entry->state = state;
> +		__entry->cpu_id = cpu_id;
> +	),
> +
> +	TP_printk("state=%lu cpu_id=%lu", (unsigned long)__entry->state,
> +		  (unsigned long)__entry->cpu_id)
> +);
> +
> +DEFINE_EVENT(processor, processor_idle,
> +
> +	TP_PROTO(unsigned int state, unsigned int cpu_id),
> +
> +	     TP_ARGS(state, cpu_id)
> +);
> +
> +#define PWR_EVENT_EXIT 0xFFFFFFFF
> +
> +DEFINE_EVENT(processor, processor_frequency,
> +
> +	TP_PROTO(unsigned int frequency, unsigned int cpu_id),
> +
> +	TP_ARGS(frequency, cpu_id)
> +);

So, we have a 'state' field in the class, which is used as 'state' by the 
power:cpu_idle event, and as 'frequency' by the power:cpu_frequency event?

Are there any architectures that track frequency in Hz, not in kHz? If yes, might 
there ever be a need for the frequency value to be larger than 4.29 GHz? If yes, 
then it won't fit into u32.
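
The 4.29 GHz limit is simply the largest u32 value, 4294967295, read as
Hz. A small sketch of the range check (the helper names hz_fits_u32()
and khz_fits_u32() are made up for illustration):

```c
#include <stdint.h>

/* Returns 1 if a frequency given in Hz can be stored in a u32 as Hz. */
static int hz_fits_u32(uint64_t freq_hz)
{
	return freq_hz <= UINT32_MAX;
}

/* Returns 1 if the same frequency, converted to kHz, fits into a u32. */
static int khz_fits_u32(uint64_t freq_hz)
{
	return freq_hz / 1000 <= UINT32_MAX;
}
```

So a 4.3 GHz clock overflows the field when expressed in Hz, while in
kHz the same field has room up to roughly 4.29 THz.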

Also, might there be a future need to express different types of frequencies? For 
example, should we decide to track turbo frequencies in Intel CPUs, how would that 
be expressed via these events? Are there any architectures and CPUs that somehow 
have some extra attribute to the frequency value?

> +TRACE_EVENT(machine_suspend,
> +
> +	TP_PROTO(unsigned int state),
> +
> +	TP_ARGS(state),
> +
> +	TP_STRUCT__entry(
> +		__field(	u32,		state		)
> +	),

Hm, this event is not used anywhere in the submitted patches. Where is the patch 
that adds usage, and what are the possible values for 'state'?

	Ingo

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-25 23:33   ` [PATCH] PERF(kernel): Cleanup power events V2 Thomas Renninger
                       ` (4 preceding siblings ...)
  2010-10-26  7:59     ` Jean Pihet
@ 2010-10-26  7:59     ` Jean Pihet
  2010-10-26 18:52     ` Rafael J. Wysocki
  2010-10-26 18:52     ` Rafael J. Wysocki
  7 siblings, 0 replies; 157+ messages in thread
From: Jean Pihet @ 2010-10-26  7:59 UTC (permalink / raw)
  To: Thomas Renninger
  Cc: Arjan van de Ven, linux-trace-users, Frederic Weisbecker,
	Pierre Tardy, Steven Rostedt, Peter Zijlstra, Frank Eigler,
	Mathieu Desnoyers, Ingo Molnar, linux-pm, Masami Hiramatsu,
	Tejun Heo, Andrew Morton, linux-omap, Linus Torvalds,
	Thomas Gleixner

On Tue, Oct 26, 2010 at 1:33 AM, Thomas Renninger <trenn@suse.de> wrote:
> Changes in V2:
>  - Introduce PWR_EVENT_EXIT instead of 0 to mark non-power state
>  - Use u32 instead of u64 for cpuid, state which is by far enough
>
> New power trace events:
> power:processor_idle
> power:processor_frequency
> power:machine_suspend
>
>
> C-state/idle accounting events:
>  power:power_start
>  power:power_end
> are replaced with:
>  power:processor_idle
>
> and
>  power:power_frequency
> is replaced with:
>  power:processor_frequency
>
> power:machine_suspend
> is newly introduced, a first implementation
> comes from the ARM side, but it's easy to add these events
> in X86 as well if needed.
This will come as a separate patch, which fits all platforms. Cf.
http://marc.info/?l=linux-omap&m=128620575300682&w=2.

...

> @@ -481,10 +484,12 @@ static void mwait_idle(void)
>  static void poll_idle(void)
>  {
>        trace_power_start(POWER_CSTATE, 0, smp_processor_id());
> +       trace_processor_idle(1, smp_processor_id());

Should that be:
+       trace_processor_idle(0, smp_processor_id());
instead?
Since state '0' is for the CPU active in polling mode and
PWR_EVENT_EXIT means 'exit from any idle state'.
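
To illustrate the distinction, here is a minimal userspace sketch of how a
trace consumer might interpret the state argument under the convention
suggested above (0 for polling idle, PWR_EVENT_EXIT for leaving any idle
state). The helper idle_state_name() is hypothetical and not part of the
patch; only PWR_EVENT_EXIT and its value are taken from it.

```c
#include <stdint.h>

/* Value taken from the patch under review. */
#define PWR_EVENT_EXIT 0xFFFFFFFFu

/*
 * Hypothetical helper: classify the 'state' argument of processor_idle.
 * Assumes state 0 means polling idle, as suggested in this reply, and
 * that any other non-sentinel value is a real C-state index.
 */
static const char *idle_state_name(uint32_t state)
{
	if (state == PWR_EVENT_EXIT)
		return "exit";
	if (state == 0)
		return "poll";
	return "cstate";
}
```

With such a consumer, passing 1 from poll_idle() would be reported as a
real C-state rather than as polling, which is the concern raised here.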

>        local_irq_enable();
>        while (!need_resched())
>                cpu_relax();
> -       trace_power_end(0);
> +       trace_power_end(smp_processor_id());
> +       trace_processor_idle(PWR_EVENT_EXIT, smp_processor_id());
>  }
>
>  /*

...

> diff --git a/include/trace/events/power.h b/include/trace/events/power.h
...

Regards,
Jean

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-25 23:33   ` [PATCH] PERF(kernel): Cleanup power events V2 Thomas Renninger
                       ` (3 preceding siblings ...)
  2010-10-26  7:10     ` Ingo Molnar
@ 2010-10-26  7:59     ` Jean Pihet
  2010-10-26  7:59     ` Jean Pihet
                       ` (2 subsequent siblings)
  7 siblings, 0 replies; 157+ messages in thread
From: Jean Pihet @ 2010-10-26  7:59 UTC (permalink / raw)
  To: Thomas Renninger
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Masami Hiramatsu,
	Frank Eigler, Steven Rostedt, Kevin Hilman, Peter Zijlstra,
	linux-omap, rjw, linux-pm, linux-trace-users, Pierre Tardy,
	Frederic Weisbecker, Tejun Heo, Mathieu Desnoyers,
	Arjan van de Ven, Ingo Molnar

On Tue, Oct 26, 2010 at 1:33 AM, Thomas Renninger <trenn@suse.de> wrote:
> Changes in V2:
>  - Introduce PWR_EVENT_EXIT instead of 0 to mark non-power state
>  - Use u32 instead of u64 for cpuid, state which is by far enough
>
> New power trace events:
> power:processor_idle
> power:processor_frequency
> power:machine_suspend
>
>
> C-state/idle accounting events:
>  power:power_start
>  power:power_end
> are replaced with:
>  power:processor_idle
>
> and
>  power:power_frequency
> is replaced with:
>  power:processor_frequency
>
> power:machine_suspend
> is newly introduced, a first implementation
> comes from the ARM side, but it's easy to add these events
> in X86 as well if needed.
This will come as a separate patch, which fits all platforms. Cf.
http://marc.info/?l=linux-omap&m=128620575300682&w=2.

...

> @@ -481,10 +484,12 @@ static void mwait_idle(void)
>  static void poll_idle(void)
>  {
>        trace_power_start(POWER_CSTATE, 0, smp_processor_id());
> +       trace_processor_idle(1, smp_processor_id());

Should that be:
+       trace_processor_idle(0, smp_processor_id());
instead?
Since state '0' is for the CPU active in polling mode and
PWR_EVENT_EXIT means 'exit from any idle state'.

>        local_irq_enable();
>        while (!need_resched())
>                cpu_relax();
> -       trace_power_end(0);
> +       trace_power_end(smp_processor_id());
> +       trace_processor_idle(PWR_EVENT_EXIT, smp_processor_id());
>  }
>
>  /*

...

> diff --git a/include/trace/events/power.h b/include/trace/events/power.h
...

Regards,
Jean
--
To unsubscribe from this list: send the line "unsubscribe linux-trace-users" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26  7:10     ` Ingo Molnar
  2010-10-26  8:08       ` Jean Pihet
@ 2010-10-26  8:08       ` Jean Pihet
  2010-10-26  9:58       ` Arjan van de Ven
                         ` (5 subsequent siblings)
  7 siblings, 0 replies; 157+ messages in thread
From: Jean Pihet @ 2010-10-26  8:08 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arjan van de Ven, linux-trace-users, Frederic Weisbecker,
	Pierre Tardy, Steven Rostedt, Peter Zijlstra, Frank Eigler,
	Thomas Gleixner, linux-pm, Masami Hiramatsu, Tejun Heo,
	Andrew Morton, linux-omap, Linus Torvalds, Mathieu Desnoyers

Ingo,

On Tue, Oct 26, 2010 at 9:10 AM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Thomas Renninger <trenn@suse.de> wrote:
>
>> Changes in V2:
>>   - Introduce PWR_EVENT_EXIT instead of 0 to mark non-power state
>>   - Use u32 instead of u64 for cpuid, state which is by far enough

...

>>
>> +#define PWR_EVENT_EXIT 0xFFFFFFFF
>
> Shouldnt this be part of the POWER_ enum? (and you can write -1 there)
>
>> +#ifndef _TRACE_POWER_ENUM_
>> +#define _TRACE_POWER_ENUM_
>> +enum {
>> +     POWER_NONE = 0,
>> +     POWER_CSTATE = 1,
>> +     POWER_PSTATE = 2,
>> +};
>> +#endif
>
> Since we are cleaning up all these events, those enum definitions dont really look
> logical. For example, what is 'POWER_NONE'? Can a CPU have 'no power'?
The enum belongs to the deprecated API so I would rather not touch it.
Keeping the deprecated code isolated will make it easier to remove
later.

>
> Plus:
>
>> +DECLARE_EVENT_CLASS(processor,
>> +
>> +     TP_PROTO(unsigned int state, unsigned int cpu_id),
>> +
>> +     TP_ARGS(state, cpu_id),
>> +
>> +     TP_STRUCT__entry(
>> +             __field(        u32,            state           )
>> +             __field(        u32,            cpu_id          )
>
> Trace entries can carry a cpu_id of the current processor already. Can this cpu_id
> ever be different from that CPU id?
I have no evidence of that (power management on SMP ARM is coming real
soon...) but one can imagine one of the CPUs being the master for PM
decisions.

>
>> +     ),
>> +
>> +     TP_fast_assign(
>> +             __entry->state = state;
>> +             __entry->cpu_id = cpu_id;
>> +     ),
>> +
>> +     TP_printk("state=%lu cpu_id=%lu", (unsigned long)__entry->state,
>> +               (unsigned long)__entry->cpu_id)
>> +);
>> +
>> +DEFINE_EVENT(processor, processor_idle,
>> +
>> +     TP_PROTO(unsigned int state, unsigned int cpu_id),
>> +
>> +          TP_ARGS(state, cpu_id)
>> +);
>> +
>> +#define PWR_EVENT_EXIT 0xFFFFFFFF
>> +
>> +DEFINE_EVENT(processor, processor_frequency,
>> +
>> +     TP_PROTO(unsigned int frequency, unsigned int cpu_id),
>> +
>> +     TP_ARGS(frequency, cpu_id)
>> +);
>
> So, we have a 'state' field in the class, which is used as 'state' by the
> power::cpu_idle event, and as 'frequency' by the power::cpu_freq event?
>
> Are there any architectures that track frequency in Hz, not in kHz? If yes, might
> there ever be a need for the frequency value to be larger than 4.29 GHz? If yes,
> then it wont fit into u32.
>
> Also, might there be a future need to express different types of frequencies? For
> example, should we decide to track turbo frequencies in Intel CPUs, how would that
> be expressed via these events? Are there any architectures and CPUs that somehow
> have some extra attribute to the frequency value?
>
>> +TRACE_EVENT(machine_suspend,
>> +
>> +     TP_PROTO(unsigned int state),
>> +
>> +     TP_ARGS(state),
>> +
>> +     TP_STRUCT__entry(
>> +             __field(        u32,            state           )
>> +     ),
>
> Hm, this event is not used anywhere in the submitted patches. Where is the patch
> that adds usage, and what are the possible values for 'state'?
This will come as a separate patch, which fits all platforms. Cf.
http://marc.info/?l=linux-omap&m=128620575300682&w=2.
The state field is of type suspend_state_t, cf. include/linux/suspend.h
>
>        Ingo
>

Jean

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26  7:10     ` Ingo Molnar
@ 2010-10-26  8:08       ` Jean Pihet
  2010-10-26 11:21         ` Ingo Molnar
  2010-10-26 11:21         ` Ingo Molnar
  2010-10-26  8:08       ` Jean Pihet
                         ` (6 subsequent siblings)
  7 siblings, 2 replies; 157+ messages in thread
From: Jean Pihet @ 2010-10-26  8:08 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thomas Renninger, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Masami Hiramatsu, Frank Eigler, Steven Rostedt, Kevin Hilman,
	Peter Zijlstra, linux-omap, rjw, linux-pm, linux-trace-users,
	Pierre Tardy, Frederic Weisbecker, Tejun Heo, Mathieu Desnoyers,
	Arjan van de Ven

Ingo,

On Tue, Oct 26, 2010 at 9:10 AM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Thomas Renninger <trenn@suse.de> wrote:
>
>> Changes in V2:
>>   - Introduce PWR_EVENT_EXIT instead of 0 to mark non-power state
>>   - Use u32 instead of u64 for cpuid, state which is by far enough

...

>>
>> +#define PWR_EVENT_EXIT 0xFFFFFFFF
>
> Shouldnt this be part of the POWER_ enum? (and you can write -1 there)
>
>> +#ifndef _TRACE_POWER_ENUM_
>> +#define _TRACE_POWER_ENUM_
>> +enum {
>> +     POWER_NONE = 0,
>> +     POWER_CSTATE = 1,
>> +     POWER_PSTATE = 2,
>> +};
>> +#endif
>
> Since we are cleaning up all these events, those enum definitions dont really look
> logical. For example, what is 'POWER_NONE'? Can a CPU have 'no power'?
The enum belongs to the deprecated API so I would rather not touch it.
Keeping the deprecated code isolated will make it easier to remove
later.

>
> Plus:
>
>> +DECLARE_EVENT_CLASS(processor,
>> +
>> +     TP_PROTO(unsigned int state, unsigned int cpu_id),
>> +
>> +     TP_ARGS(state, cpu_id),
>> +
>> +     TP_STRUCT__entry(
>> +             __field(        u32,            state           )
>> +             __field(        u32,            cpu_id          )
>
> Trace entries can carry a cpu_id of the current processor already. Can this cpu_id
> ever be different from that CPU id?
I have no evidence of that (power management on SMP ARM is coming real
soon...) but one can imagine one of the CPUs being the master for PM
decisions.

>
>> +     ),
>> +
>> +     TP_fast_assign(
>> +             __entry->state = state;
>> +             __entry->cpu_id = cpu_id;
>> +     ),
>> +
>> +     TP_printk("state=%lu cpu_id=%lu", (unsigned long)__entry->state,
>> +               (unsigned long)__entry->cpu_id)
>> +);
>> +
>> +DEFINE_EVENT(processor, processor_idle,
>> +
>> +     TP_PROTO(unsigned int state, unsigned int cpu_id),
>> +
>> +          TP_ARGS(state, cpu_id)
>> +);
>> +
>> +#define PWR_EVENT_EXIT 0xFFFFFFFF
>> +
>> +DEFINE_EVENT(processor, processor_frequency,
>> +
>> +     TP_PROTO(unsigned int frequency, unsigned int cpu_id),
>> +
>> +     TP_ARGS(frequency, cpu_id)
>> +);
>
> So, we have a 'state' field in the class, which is used as 'state' by the
> power::cpu_idle event, and as 'frequency' by the power::cpu_freq event?
>
> Are there any architectures that track frequency in Hz, not in kHz? If yes, might
> there ever be a need for the frequency value to be larger than 4.29 GHz? If yes,
> then it wont fit into u32.
>
> Also, might there be a future need to express different types of frequencies? For
> example, should we decide to track turbo frequencies in Intel CPUs, how would that
> be expressed via these events? Are there any architectures and CPUs that somehow
> have some extra attribute to the frequency value?
>
>> +TRACE_EVENT(machine_suspend,
>> +
>> +     TP_PROTO(unsigned int state),
>> +
>> +     TP_ARGS(state),
>> +
>> +     TP_STRUCT__entry(
>> +             __field(        u32,            state           )
>> +     ),
>
> Hm, this event is not used anywhere in the submitted patches. Where is the patch
> that adds usage, and what are the possible values for 'state'?
This will come as a separate patch, which fits all platforms. Cf.
http://marc.info/?l=linux-omap&m=128620575300682&w=2.
The state field is of type suspend_state_t, cf. include/linux/suspend.h
>
>        Ingo
>

Jean

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26  7:10     ` Ingo Molnar
                         ` (2 preceding siblings ...)
  2010-10-26  9:58       ` Arjan van de Ven
@ 2010-10-26  9:58       ` Arjan van de Ven
  2010-10-26 10:37       ` Thomas Renninger
                         ` (3 subsequent siblings)
  7 siblings, 0 replies; 157+ messages in thread
From: Arjan van de Ven @ 2010-10-26  9:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-trace-users, Frederic Weisbecker, Pierre Tardy, Jean Pihet,
	Steven Rostedt, Peter Zijlstra, Frank Eigler, Thomas Gleixner,
	linux-pm, Masami Hiramatsu, Tejun Heo, Andrew Morton, linux-omap,
	Linus Torvalds, Mathieu Desnoyers

On 10/26/2010 12:10 AM, Ingo Molnar wrote:
> +
>> +	TP_STRUCT__entry(
>> +		__field(	u32,		state		)
>> +		__field(	u32,		cpu_id		)
> Trace entries can carry a cpu_id of the current processor already. Can this cpu_id
> ever be different from that CPU id?

yes

Especially CPU frequency, which you can change cross-CPU.

Originally we did not have this in the API, but Thomas added it for that
reason some time ago.

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26  7:10     ` Ingo Molnar
  2010-10-26  8:08       ` Jean Pihet
  2010-10-26  8:08       ` Jean Pihet
@ 2010-10-26  9:58       ` Arjan van de Ven
  2010-10-26 10:19         ` Ingo Molnar
  2010-10-26 10:19         ` Ingo Molnar
  2010-10-26  9:58       ` Arjan van de Ven
                         ` (4 subsequent siblings)
  7 siblings, 2 replies; 157+ messages in thread
From: Arjan van de Ven @ 2010-10-26  9:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thomas Renninger, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Masami Hiramatsu, Frank Eigler, Steven Rostedt, Kevin Hilman,
	Peter Zijlstra, linux-omap, rjw, linux-pm, linux-trace-users,
	Jean Pihet, Pierre Tardy, Frederic Weisbecker, Tejun Heo,
	Mathieu Desnoyers

On 10/26/2010 12:10 AM, Ingo Molnar wrote:
> +
>> +	TP_STRUCT__entry(
>> +		__field(	u32,		state		)
>> +		__field(	u32,		cpu_id		)
> Trace entries can carry a cpu_id of the current processor already. Can this cpu_id
> ever be different from that CPU id?

yes

Especially CPU frequency, which you can change cross-CPU.

Originally we did not have this in the API, but Thomas added it for that
reason some time ago.


^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26  9:58       ` Arjan van de Ven
  2010-10-26 10:19         ` Ingo Molnar
@ 2010-10-26 10:19         ` Ingo Molnar
  1 sibling, 0 replies; 157+ messages in thread
From: Ingo Molnar @ 2010-10-26 10:19 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: linux-trace-users, Frederic Weisbecker, Pierre Tardy, Jean Pihet,
	Steven Rostedt, Peter Zijlstra, Frank Eigler, Thomas Gleixner,
	linux-pm, Masami Hiramatsu, Tejun Heo, Andrew Morton, linux-omap,
	Linus Torvalds, Mathieu Desnoyers


* Arjan van de Ven <arjan@linux.intel.com> wrote:

> On 10/26/2010 12:10 AM, Ingo Molnar wrote:
> >+
> >>+	TP_STRUCT__entry(
> >>+		__field(	u32,		state		)
> >>+		__field(	u32,		cpu_id		)
> >Trace entries can carry a cpu_id of the current processor already. Can this cpu_id
> >ever be different from that CPU id?
> 
> yes
> 
> esp cpu frequency you can change cross cpu....
> 
> originally we did not have this in the API but Thomas added it for that reason 
> some time ago.

Ok, good! Maybe add this as a comment?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26  9:58       ` Arjan van de Ven
@ 2010-10-26 10:19         ` Ingo Molnar
  2010-10-26 10:19         ` Ingo Molnar
  1 sibling, 0 replies; 157+ messages in thread
From: Ingo Molnar @ 2010-10-26 10:19 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Thomas Renninger, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Masami Hiramatsu, Frank Eigler, Steven Rostedt, Kevin Hilman,
	Peter Zijlstra, linux-omap, rjw, linux-pm, linux-trace-users,
	Jean Pihet, Pierre Tardy, Frederic Weisbecker, Tejun Heo,
	Mathieu Desnoyers


* Arjan van de Ven <arjan@linux.intel.com> wrote:

> On 10/26/2010 12:10 AM, Ingo Molnar wrote:
> >+
> >>+	TP_STRUCT__entry(
> >>+		__field(	u32,		state		)
> >>+		__field(	u32,		cpu_id		)
> >Trace entries can carry a cpu_id of the current processor already. Can this cpu_id
> >ever be different from that CPU id?
> 
> yes
> 
> esp cpu frequency you can change cross cpu....
> 
> originally we did not have this in the API but Thomas added it for that reason 
> some time ago.

Ok, good! Maybe add this as a comment?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26  7:10     ` Ingo Molnar
                         ` (3 preceding siblings ...)
  2010-10-26  9:58       ` Arjan van de Ven
@ 2010-10-26 10:37       ` Thomas Renninger
  2010-10-26 10:37       ` Thomas Renninger
                         ` (2 subsequent siblings)
  7 siblings, 0 replies; 157+ messages in thread
From: Thomas Renninger @ 2010-10-26 10:37 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arjan van de Ven, linux-trace-users, Frederic Weisbecker,
	Pierre Tardy, Jean Pihet, Steven Rostedt, Peter Zijlstra,
	Frank Eigler, Mathieu Desnoyers, linux-pm, Masami Hiramatsu,
	Tejun Heo, Andrew Morton, linux-omap, Linus Torvalds,
	Thomas Gleixner

On Tuesday 26 October 2010 09:10:20 Ingo Molnar wrote:
> 
> * Thomas Renninger <trenn@suse.de> wrote:
> 
> > Changes in V2:
> >   - Introduce PWR_EVENT_EXIT instead of 0 to mark non-power state
> >   - Use u32 instead of u64 for cpuid, state which is by far enough
> > 
> > New power trace events:
> > power:processor_idle
> > power:processor_frequency
> > power:machine_suspend
> > 
> > 
> > C-state/idle accounting events:
> >   power:power_start
> >   power:power_end
> > are replaced with:
> >   power:processor_idle
> > 
> > and
> >   power:power_frequency
> > is replaced with:
> >   power:processor_frequency
> 
> Could you please name it power:cpu_idle and power:cpu_frequency instead, for 
> shortness? We generally use 'cpu' in the kernel and for events.
Sure.
> 
> > power:machine_suspend
> 
> How will future PCI (or other device) power saving tracepoints be called?
> 
> Might be more consistent to use:
> 
>   power:cpu_idle
>   power:machine_idle
>   power:device_idle
device idle is not accurate. Those may be low-power modes
like reduced network throughput or reduced WLAN range; the device
need not be idle.
Device power states are probably the most complex area. If such
a thing gets introduced, it would make sense to declare
the interface experimental for some time until a wider range of
devices uses it (in contrast to these new ones,
which should not change that soon anymore...).

Also machine_idle may be true, but machine_suspend sounds more
familiar and everyone immediately knows what the event is about.

-> *_idle convention is not really worth it.

> Where machine_idle is the suspend event.
Here you name it. You talk about machine_idle but you mean
the suspend event, better just name it what it is.

> > the type= field got removed from both, it was never
> > used and the type is differed by the event type itself.
> >
> > +#define PWR_EVENT_EXIT 0xFFFFFFFF
> 
> Shouldnt this be part of the POWER_ enum? (and you can write -1 there)
No, the enum below will vanish, but -1 is nicer.
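
For reference, a small sketch of why both spellings give the same 32-bit
sentinel: converting -1 to an unsigned 32-bit type yields 0xFFFFFFFF. The
macro and helper names below are made up for illustration and use
uint32_t as a userspace stand-in for the kernel's u32.

```c
#include <stdint.h>

/* Two spellings of the same sentinel; the second is the "-1" form. */
#define PWR_EVENT_EXIT_HEX 0xFFFFFFFFu
#define PWR_EVENT_EXIT_NEG ((uint32_t)-1)

/* Hypothetical helper: returns 1 when a u32 state field holds the sentinel. */
static int is_exit_event(uint32_t state)
{
	return state == PWR_EVENT_EXIT_NEG;
}
```

Either form stored into the u32 state field compares equal on the
consumer side, so the choice is purely about readability.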

...

> Plus:
> 
> > +DECLARE_EVENT_CLASS(processor,
> > +
> > +	TP_PROTO(unsigned int state, unsigned int cpu_id),
> > +
> > +	TP_ARGS(state, cpu_id),
> > +
> > +	TP_STRUCT__entry(
> > +		__field(	u32,		state		)
> > +		__field(	u32,		cpu_id		)
> 
> Trace entries can carry a cpu_id of the current processor already. Can this cpu_id 
> ever be different from that CPU id?
Yes. A core's frequency can depend on another one which
will get switched as well (by one command/MSR).
Compare with commit 6f4f2723d08534fd4e407e1e.

This can theoretically also be the case for sleep states.
Afaik such HW does not exist yet, but the ACPI spec already
provides interfaces to pass these dependencies from BIOS to OS.
-> We want a stable ABI and should be prepared for such stuff.

> > +	),
> > +
> > +	TP_fast_assign(
> > +		__entry->state = state;
> > +		__entry->cpu_id = cpu_id;
> > +	),
> > +
> > +	TP_printk("state=%lu cpu_id=%lu", (unsigned long)__entry->state,
> > +		  (unsigned long)__entry->cpu_id)
> > +);
> > +
> > +DEFINE_EVENT(processor, processor_idle,
> > +
> > +	TP_PROTO(unsigned int state, unsigned int cpu_id),
> > +
> > +	     TP_ARGS(state, cpu_id)
> > +);
> > +
> > +#define PWR_EVENT_EXIT 0xFFFFFFFF
> > +
> > +DEFINE_EVENT(processor, processor_frequency,
> > +
> > +	TP_PROTO(unsigned int frequency, unsigned int cpu_id),
> > +
> > +	TP_ARGS(frequency, cpu_id)
> > +);
> 
> So, we have a 'state' field in the class, which is used as 'state' by the 
> power::cpu_idle event, and as 'frequency' by the power::cpu_freq event?
Yes, is this a problem?
Definitions are a bit shorter having one power processor class.
As "frequency" is stated in frequency event definition everything should
be obvious and this one looks like the more elegant way to me.
 
> Are there any architectures that track frequency in Hz, not in kHz? If yes, might 
> there ever be a need for the frequency value to be larger than 4.29 GHz? If yes, 
> then it wont fit into u32.
The drivers/cpufreq subsystem is fixed to unsigned int (cf. include/linux/cpufreq.h):
        unsigned int            min;    /* in kHz */
        unsigned int            max;    /* in kHz */
        unsigned int            cur;    /* in kHz,
        ...
that should be fine.
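
A quick arithmetic check of that claim, as a userspace sketch (the helper
freq_fits_u32() is hypothetical): a 32-bit counter tops out at
4294967295, so counting in Hz caps at about 4.29 GHz, while counting in
kHz leaves room up to roughly 4294 GHz.

```c
#include <stdint.h>

/*
 * Hypothetical helper: returns 1 if freq_hz, expressed in units of
 * unit_hz (1 for Hz, 1000 for kHz), fits into an unsigned 32-bit field.
 */
static int freq_fits_u32(uint64_t freq_hz, uint64_t unit_hz)
{
	return freq_hz / unit_hz <= UINT32_MAX;
}
```

For example, a 5 GHz clock does not fit when stored in Hz
(freq_fits_u32(5000000000ULL, 1) is 0) but fits easily in kHz
(freq_fits_u32(5000000000ULL, 1000) is 1).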

> Also, might there be a future need to express different types of frequencies? For 
> example, should we decide to track turbo frequencies in Intel CPUs, how would that 
> be expressed via these events? Are there any architectures and CPUs that somehow 
> have some extra attribute to the frequency value?
I wonder whether this ever can/will work in a sane way.
Userspace can compare with:
/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq
everything above is turbo. So I do not think it's ever needed.
But adding an additional value at the end does not violate
userspace compatibility. This has been done with the cpuid
as well.
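
A rough sketch of what such a userspace comparison could look like,
assuming the rule stated above (anything above cpuinfo_max_freq counts as
turbo) holds on the platform in question. Both helper names are
hypothetical; the sysfs file is expected to contain a single value in kHz.

```c
#include <stdint.h>
#include <stdio.h>

/* Reads a single kHz value from a sysfs-style file; returns 0 on failure. */
static uint32_t read_khz(const char *path)
{
	FILE *f = fopen(path, "r");
	unsigned int khz = 0;

	if (!f)
		return 0;
	if (fscanf(f, "%u", &khz) != 1)
		khz = 0;
	fclose(f);
	return khz;
}

/*
 * Hypothetical helper: treat a traced frequency above the value read
 * from cpuinfo_max_freq (both in kHz) as a turbo frequency.
 */
static int is_turbo_khz(uint32_t freq_khz, uint32_t cpuinfo_max_khz)
{
	return freq_khz > cpuinfo_max_khz;
}
```

A tool would call read_khz() once on
/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq and then pass each
traced frequency through is_turbo_khz().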
 
> > +TRACE_EVENT(machine_suspend,
> > +
> > +	TP_PROTO(unsigned int state),
> > +
> > +	TP_ARGS(state),
> > +
> > +	TP_STRUCT__entry(
> > +		__field(	u32,		state		)
> > +	),
> 
> Hm, this event is not used anywhere in the submitted patches. Where is the patch 
> that adds usage, and what are the possible values for 'state'?
Jean wants to make use of it on ARM.
I also had a patch for x86, I can have another look at it, Rafael
already gave me a comment on it. But on X86 you typically realize
when you suspend the machine (could imagine this is more useful on
ARM driven mobile phones and similar), still I can add it..

Values probably should be (include/linux/suspend.h):
#define PM_SUSPEND_ON       0
#define PM_SUSPEND_STANDBY  1
#define PM_SUSPEND_MEM      3
#define PM_SUSPEND_MAX      4

How the strange state Arjan talked about is passed is up
to these guys. Rather than reusing the pre-defined values from 0
upwards, such arch-specific special states should better be passed as:

#define X86_MOORESTOWN_STANDBY_S0   0x100
..                                  0x101
#define ARM_WHATEVER_STRANGE_THING  0x200
...

     Thomas

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26  7:10     ` Ingo Molnar
                         ` (4 preceding siblings ...)
  2010-10-26 10:37       ` Thomas Renninger
@ 2010-10-26 10:37       ` Thomas Renninger
  2010-10-26 11:19         ` Ingo Molnar
  2010-10-26 11:19         ` Ingo Molnar
  2010-10-26 15:32       ` Pierre Tardy
  2010-10-26 15:32       ` Pierre Tardy
  7 siblings, 2 replies; 157+ messages in thread
From: Thomas Renninger @ 2010-10-26 10:37 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Masami Hiramatsu,
	Frank Eigler, Steven Rostedt, Kevin Hilman, Peter Zijlstra,
	linux-omap, rjw, linux-pm, linux-trace-users, Jean Pihet,
	Pierre Tardy, Frederic Weisbecker, Tejun Heo, Mathieu Desnoyers,
	Arjan van de Ven

On Tuesday 26 October 2010 09:10:20 Ingo Molnar wrote:
> 
> * Thomas Renninger <trenn@suse.de> wrote:
> 
> > Changes in V2:
> >   - Introduce PWR_EVENT_EXIT instead of 0 to mark non-power state
> >   - Use u32 instead of u64 for cpuid, state which is by far enough
> > 
> > New power trace events:
> > power:processor_idle
> > power:processor_frequency
> > power:machine_suspend
> > 
> > 
> > C-state/idle accounting events:
> >   power:power_start
> >   power:power_end
> > are replaced with:
> >   power:processor_idle
> > 
> > and
> >   power:power_frequency
> > is replaced with:
> >   power:processor_frequency
> 
> Could you please name it power:cpu_idle and power:cpu_frequency instead, for 
> shortness? We generally use 'cpu' in the kernel and for events.
Sure.
> 
> > power:machine_suspend
> 
> How will future PCI (or other device) power saving tracepoints be called?
> 
> Might be more consistent to use:
> 
>   power:cpu_idle
>   power:machine_idle
>   power:device_idle
device idle is not accurate. Those may be low-power modes
like reduced network throughput or reduced WLAN range; the device
need not be idle.
Device power states are probably the most complex area. If such
a thing gets introduced, it would make sense to declare
the interface experimental for some time until a wider range of
devices uses it (in contrast to these new ones,
which should not change that soon anymore...).

Also machine_idle may be true, but machine_suspend sounds more
familiar and everyone immediately knows what the event is about.

-> *_idle convention is not really worth it.

> Where machine_idle is the suspend event.
Here you name it. You talk about machine_idle but you mean
the suspend event, better just name it what it is.

> > the type= field got removed from both, it was never
> > used and the type is differed by the event type itself.
> >
> > +#define PWR_EVENT_EXIT 0xFFFFFFFF
> 
> Shouldnt this be part of the POWER_ enum? (and you can write -1 there)
No, the enum below will vanish, but -1 is nicer.

...

> Plus:
> 
> > +DECLARE_EVENT_CLASS(processor,
> > +
> > +	TP_PROTO(unsigned int state, unsigned int cpu_id),
> > +
> > +	TP_ARGS(state, cpu_id),
> > +
> > +	TP_STRUCT__entry(
> > +		__field(	u32,		state		)
> > +		__field(	u32,		cpu_id		)
> 
> Trace entries can carry a cpu_id of the current processor already. Can this cpu_id 
> ever be different from that CPU id?
Yes. A core's frequency can depend on another one which
will get switched as well (by one command/MSR).
Compare with commit 6f4f2723d08534fd4e407e1e.

This can theoretically also be the case for sleep states.
Afaik such HW does not exist yet, but the ACPI spec already
provides interfaces to pass these dependencies from BIOS to OS.
-> We want a stable ABI and should be prepared for such stuff.

> > +	),
> > +
> > +	TP_fast_assign(
> > +		__entry->state = state;
> > +		__entry->cpu_id = cpu_id;
> > +	),
> > +
> > +	TP_printk("state=%lu cpu_id=%lu", (unsigned long)__entry->state,
> > +		  (unsigned long)__entry->cpu_id)
> > +);
> > +
> > +DEFINE_EVENT(processor, processor_idle,
> > +
> > +	TP_PROTO(unsigned int state, unsigned int cpu_id),
> > +
> > +	     TP_ARGS(state, cpu_id)
> > +);
> > +
> > +#define PWR_EVENT_EXIT 0xFFFFFFFF
> > +
> > +DEFINE_EVENT(processor, processor_frequency,
> > +
> > +	TP_PROTO(unsigned int frequency, unsigned int cpu_id),
> > +
> > +	TP_ARGS(frequency, cpu_id)
> > +);
> 
> So, we have a 'state' field in the class, which is used as 'state' by the 
> power::cpu_idle event, and as 'frequency' by the power::cpu_freq event?
Yes, is this a problem?
Definitions are a bit shorter having one power processor class.
As "frequency" is stated in frequency event definition everything should
be obvious and this one looks like the more elegant way to me.
 
> Are there any architectures that track frequency in Hz, not in kHz? If yes, might 
> there ever be a need for the frequency value to be larger than 4.29 GHz? If yes, 
> then it wont fit into u32.
The drivers/cpufreq subsystem is fixed to unsigned int (cf. include/linux/cpufreq.h):
        unsigned int            min;    /* in kHz */
        unsigned int            max;    /* in kHz */
        unsigned int            cur;    /* in kHz,
        ...
that should be fine.

> Also, might there be a future need to express different types of frequencies? For 
> example, should we decide to track turbo frequencies in Intel CPUs, how would that 
> be expressed via these events? Are there any architectures and CPUs that somehow 
> have some extra attribute to the frequency value?
I wonder whether this ever can/will work in a sane way.
Userspace can compare with:
/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq
everything above is turbo. So I do not think it's ever needed.
But adding an additional value at the end does not violate
userspace compatibility. This has been done with the cpuid
as well.
 
> > +TRACE_EVENT(machine_suspend,
> > +
> > +	TP_PROTO(unsigned int state),
> > +
> > +	TP_ARGS(state),
> > +
> > +	TP_STRUCT__entry(
> > +		__field(	u32,		state		)
> > +	),
> 
> Hm, this event is not used anywhere in the submitted patches. Where is the patch 
> that adds usage, and what are the possible values for 'state'?
Jean wants to make use of it on ARM.
I also had a patch for x86, I can have another look at it, Rafael
already gave me a comment on it. But on X86 you typically realize
when you suspend the machine (could imagine this is more useful on
ARM driven mobile phones and similar), still I can add it..

Values probably should be (include/linux/suspend.h):
#define PM_SUSPEND_ON       0
#define PM_SUSPEND_STANDBY  1
#define PM_SUSPEND_MEM      3
#define PM_SUSPEND_MAX      4

How the strange state Arjan talked about is passed is up
to these guys. Rather than reusing the pre-defined values from 0
upwards, such arch-specific special states should better be passed as:

#define X86_MOORESTOWN_STANDBY_S0   0x100
..                                  0x101
#define ARM_WHATEVER_STRANGE_THING  0x200
...

     Thomas

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 10:37       ` Thomas Renninger
@ 2010-10-26 11:19         ` Ingo Molnar
  2010-10-26 11:19         ` Ingo Molnar
  1 sibling, 0 replies; 157+ messages in thread
From: Ingo Molnar @ 2010-10-26 11:19 UTC (permalink / raw)
  To: Thomas Renninger
  Cc: Arjan van de Ven, linux-trace-users, Frederic Weisbecker,
	Pierre Tardy, Jean Pihet, Steven Rostedt, Peter Zijlstra,
	Frank Eigler, Mathieu Desnoyers, linux-pm, Masami Hiramatsu,
	Tejun Heo, Andrew Morton, linux-omap, Linus Torvalds,
	Thomas Gleixner


* Thomas Renninger <trenn@suse.de> wrote:

> On Tuesday 26 October 2010 09:10:20 Ingo Molnar wrote:
> > 
> > * Thomas Renninger <trenn@suse.de> wrote:
> > 
> > > Changes in V2:
> > >   - Introduce PWR_EVENT_EXIT instead of 0 to mark non-power state
> > >   - Use u32 instead of u64 for cpuid, state which is by far enough
> > > 
> > > New power trace events:
> > > power:processor_idle
> > > power:processor_frequency
> > > power:machine_suspend
> > > 
> > > 
> > > C-state/idle accounting events:
> > >   power:power_start
> > >   power:power_end
> > > are replaced with:
> > >   power:processor_idle
> > > 
> > > and
> > >   power:power_frequency
> > > is replaced with:
> > >   power:processor_frequency
> > 
> > Could you please name it power:cpu_idle and power:cpu_frequency instead, for 
> > shortness? We generally use 'cpu' in the kernel and for events.
> Sure.
> > 
> > > power:machine_suspend
> > 
> > How will future PCI (or other device) power saving tracepoints be called?
> > 
> > Might be more consistent to use:
> > 
> >   power:cpu_idle
> >   power:machine_idle
> >   power:device_idle
>
> "Device idle" is not accurate. Those may be low power modes
> like reduced network throughput or reduced wlan range; the device
> need not be idle.
> Device power states are probably the most complex area. If such
> a thing gets introduced, it would make sense to declare
> the interface experimental for some time until a wider range of
> devices uses it (in contrast to these new ones
> which should not change that soon anymore...).

Ok.

> Also machine_idle may be true, but machine_suspend sounds more
> familiar and everyone immediately knows what the event is about.

Ok - fair enough.

> > > +#define PWR_EVENT_EXIT 0xFFFFFFFF
> > 
> > > Shouldn't this be part of the POWER_ enum? (and you can write -1 there)
>
> No, below enum will vanish, but -1 is nicer.

When it vanishes what will replace it?
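
A minimal sketch of the "-1 is nicer" point quoted above, assuming the marker is stored in a u32 field as in the quoted TP_STRUCT__entry. The helper names are hypothetical and only illustrate that -1 converted to a 32-bit unsigned value is the same bit pattern as 0xFFFFFFFF:

```c
#include <stdint.h>

/* Marker value taken from the quoted patch hunk. */
#define PWR_EVENT_EXIT 0xFFFFFFFF

/* Hypothetical helper: what "-1" becomes once stored in a u32 field. */
static uint32_t exit_marker_from_minus_one(void)
{
	return (uint32_t)-1;
}

/* Hypothetical helper: true when a traced state means "left idle". */
static int is_idle_exit(uint32_t state)
{
	return state == PWR_EVENT_EXIT;
}
```

Either spelling therefore yields the same value in the trace record; the choice is about readability in the header.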

> ...
> 
> > Plus:
> > 
> > > +DECLARE_EVENT_CLASS(processor,
> > > +
> > > +	TP_PROTO(unsigned int state, unsigned int cpu_id),
> > > +
> > > +	TP_ARGS(state, cpu_id),
> > > +
> > > +	TP_STRUCT__entry(
> > > +		__field(	u32,		state		)
> > > +		__field(	u32,		cpu_id		)
> > 
> > Trace entries can carry a cpu_id of the current processor already. Can this cpu_id 
> > ever be different from that CPU id?
>
> Yes. A core's frequency can depend on another one which
> will get switched as well (by one command/MSR).
> Compare with commit 6f4f2723d08534fd4e407e1e.
> 
> This can theoretically also be the case for sleep states.
> Afaik such HW does not exist yet, but ACPI spec already
> provides interfaces to pass these dependencies from BIOS to OS.
> -> We want a stable ABI and should be prepared for such stuff.
> 
> > > +	),
> > > +
> > > +	TP_fast_assign(
> > > +		__entry->state = state;
> > > +		__entry->cpu_id = cpu_id;
> > > +	),
> > > +
> > > +	TP_printk("state=%lu cpu_id=%lu", (unsigned long)__entry->state,
> > > +		  (unsigned long)__entry->cpu_id)
> > > +);
> > > +
> > > +DEFINE_EVENT(processor, processor_idle,
> > > +
> > > +	TP_PROTO(unsigned int state, unsigned int cpu_id),
> > > +
> > > +	     TP_ARGS(state, cpu_id)
> > > +);
> > > +
> > > +#define PWR_EVENT_EXIT 0xFFFFFFFF
> > > +
> > > +DEFINE_EVENT(processor, processor_frequency,
> > > +
> > > +	TP_PROTO(unsigned int frequency, unsigned int cpu_id),
> > > +
> > > +	TP_ARGS(frequency, cpu_id)
> > > +);
> > 
> > So, we have a 'state' field in the class, which is used as 'state' by the 
> > power::cpu_idle event, and as 'frequency' by the power::cpu_freq event?
>
> Yes, is this a problem?
>
> Definitions are a bit shorter having one power processor class.
> As "frequency" is stated in frequency event definition everything should
> be obvious and this one looks like the more elegant way to me.
>  
> > Are there any architectures that track frequency in Hz, not in kHz? If yes, might 
> > there ever be a need for the frequency value to be larger than 4.29 GHz? If yes, 
> > then it won't fit into u32.
>
> drivers/cpufreq subsystem is fixed to unsigned int (cmp. include/linux/cpufreq.h):
>         unsigned int            min;    /* in kHz */
>         unsigned int            max;    /* in kHz */
>         unsigned int            cur;    /* in kHz,
>         ...
> that should be fine.

ok, good - so we should be fine up to 4 THz CPUs.
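
The arithmetic behind the "4 THz" remark can be checked with a short sketch. It assumes only what the thread states: cpufreq values are in kHz and the trace field is a u32. The function names are hypothetical:

```c
#include <stdint.h>

/* Largest frequency, in kHz, that fits the u32 trace field. */
static uint32_t max_khz_in_u32(void)
{
	return UINT32_MAX;	/* 4294967295 kHz */
}

/* The same limit in whole GHz (1 GHz = 1,000,000 kHz). */
static uint32_t max_whole_ghz_in_u32(void)
{
	return UINT32_MAX / 1000000u;	/* 4294 GHz, roughly 4.29 THz */
}

/* Returns 1 when a frequency expressed in Hz would not fit a u32. */
static int hz_overflows_u32(uint64_t hz)
{
	return hz > UINT32_MAX;
}
```

A value tracked in Hz would overflow above about 4.29 GHz, which is the concern raised in the quoted question; in kHz the same field reaches about 4.29 THz.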

> > Also, might there be a future need to express different types of frequencies? 
> > For example, should we decide to track turbo frequencies in Intel CPUs, how 
> > would that be expressed via these events? Are there any architectures and CPUs 
> > that somehow have some extra attribute to the frequency value?
>
> I wonder whether this ever can/will work in a sane way.
> Userspace can compare with:
> /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq
> everything above is turbo. So I do not think it's ever needed.
> But adding an additional value at the end does not violate
> userspace compatibility. This has been done with the cpuid
> as well.
>  
> > > +TRACE_EVENT(machine_suspend,
> > > +
> > > +	TP_PROTO(unsigned int state),
> > > +
> > > +	TP_ARGS(state),
> > > +
> > > +	TP_STRUCT__entry(
> > > +		__field(	u32,		state		)
> > > +	),
> > 
> > Hm, this event is not used anywhere in the submitted patches. Where is the patch 
> > that adds usage, and what are the possible values for 'state'?
>
> Jean wants to make use of it on ARM.
> I also had a patch for x86; I can have another look at it, Rafael
> already gave me a comment on it. On X86 you typically realize
> when you suspend the machine (I could imagine this is more useful on
> ARM driven mobile phones and similar), but I can still add it.
> 
> Values probably should be (include/linux/suspend.h):
> #define PM_SUSPEND_ON       0
> #define PM_SUSPEND_STANDBY  1
> #define PM_SUSPEND_MEM      3
> #define PM_SUSPEND_MAX      4
> 
> How the strange state Arjan talked about gets passed is up
> to these guys. Rather than reusing 0 and the other pre-defined values,
> such arch-specific special states should better get passed as:
> 
> #define X86_MOORESTOWN_STANDBY_S0   0x100
> ..                                  0x101
> #define ARM_WHATEVER_STRANGE_THING  0x200
> ...

I'd rather see these states given meaningful names, and those names being passed, 
instead of weird platform-specific things. Tooling will try to be as generic 
as possible.

I don't know this stuff, but making a distinction between s2ram and s2disk events 
would seem meaningful.
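
One way generic tooling could attach meaningful names to the state value is sketched below. The numeric values are the ones quoted above from include/linux/suspend.h; the helper and its label strings are hypothetical and not an existing kernel interface. The quoted list has no value for suspend-to-disk, so the sketch does not invent one:

```c
/* Values as quoted in this thread from include/linux/suspend.h. */
#define PM_SUSPEND_ON       0
#define PM_SUSPEND_STANDBY  1
#define PM_SUSPEND_MEM      3
#define PM_SUSPEND_MAX      4

/*
 * Hypothetical helper: map a machine_suspend state value to a generic
 * label. Unlisted values, including any arch-specific codes, fall
 * through to "unknown".
 */
static const char *suspend_state_label(unsigned int state)
{
	switch (state) {
	case PM_SUSPEND_ON:
		return "on";
	case PM_SUSPEND_STANDBY:
		return "standby";
	case PM_SUSPEND_MEM:
		return "s2ram";
	default:
		return "unknown";
	}
}
```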

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26  8:08       ` Jean Pihet
  2010-10-26 11:21         ` Ingo Molnar
@ 2010-10-26 11:21         ` Ingo Molnar
  1 sibling, 0 replies; 157+ messages in thread
From: Ingo Molnar @ 2010-10-26 11:21 UTC (permalink / raw)
  To: Jean Pihet
  Cc: Arjan van de Ven, linux-trace-users, Frederic Weisbecker,
	Pierre Tardy, Steven Rostedt, Peter Zijlstra, Frank Eigler,
	Thomas Gleixner, linux-pm, Masami Hiramatsu, Tejun Heo,
	Andrew Morton, linux-omap, Linus Torvalds, Mathieu Desnoyers


* Jean Pihet <jean.pihet@newoldbits.com> wrote:

> Ingo,
> 
> On Tue, Oct 26, 2010 at 9:10 AM, Ingo Molnar <mingo@elte.hu> wrote:
> >
> > * Thomas Renninger <trenn@suse.de> wrote:
> >
> >> Changes in V2:
> >>   - Introduce PWR_EVENT_EXIT instead of 0 to mark non-power state
> >>   - Use u32 instead of u64 for cpuid, state which is by far enough
> 
> ...
> 
> >>
> >> +#define PWR_EVENT_EXIT 0xFFFFFFFF
> >
> > Shouldn't this be part of the POWER_ enum? (and you can write -1 there)
> >
> >> +#ifndef _TRACE_POWER_ENUM_
> >> +#define _TRACE_POWER_ENUM_
> >> +enum {
> >> +     POWER_NONE = 0,
> >> +     POWER_CSTATE = 1,
> >> +     POWER_PSTATE = 2,
> >> +};
> >> +#endif
> >
> > Since we are cleaning up all these events, those enum definitions don't really look
> > logical. For example, what is 'POWER_NONE'? Can a CPU have 'no power'?
>
> The enum belongs to the deprecated API so I would rather not touch it.
> Keeping the deprecated code isolated will make it easier to remove
> later.

So what will replace it? We still have a state field.

Passing in platform specific codes is a step backwards.

> >> +TRACE_EVENT(machine_suspend,
> >> +
> >> +     TP_PROTO(unsigned int state),
> >> +
> >> +     TP_ARGS(state),
> >> +
> >> +     TP_STRUCT__entry(
> >> +             __field(        u32,            state           )
> >> +     ),
> >
> > Hm, this event is not used anywhere in the submitted patches. Where is the patch
> > that adds usage, and what are the possible values for 'state'?
>
> This will come as a separate patch, which fits all platforms. Cf.
> http://marc.info/?l=linux-omap&m=128620575300682&w=2.
> The state field is of type suspend_state_t, cf. include/linux/suspend.h

Ok, that's at least generic. Needs the review of Rafael, to determine whether this 
state value is all we want to know when we enter suspend.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 11:21         ` Ingo Molnar
@ 2010-10-26 11:48           ` Thomas Renninger
  2010-10-26 11:48           ` Thomas Renninger
  1 sibling, 0 replies; 157+ messages in thread
From: Thomas Renninger @ 2010-10-26 11:48 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arjan van de Ven, linux-trace-users, Frederic Weisbecker,
	Pierre Tardy, Jean Pihet, Steven Rostedt, Peter Zijlstra,
	Frank Eigler, Mathieu Desnoyers, linux-pm, Masami Hiramatsu,
	Tejun Heo, Andrew Morton, linux-omap, Linus Torvalds,
	Thomas Gleixner

On Tuesday 26 October 2010 13:21:29 Ingo Molnar wrote:
> 
> * Jean Pihet <jean.pihet@newoldbits.com> wrote:
..
> > >> +#ifndef _TRACE_POWER_ENUM_
> > >> +#define _TRACE_POWER_ENUM_
> > >> +enum {
> > >> +     POWER_NONE = 0,
> > >> +     POWER_CSTATE = 1,
> > >> +     POWER_PSTATE = 2,
> > >> +};
> > >> +#endif
> > >
> > > Since we are cleaning up all these events, those enum definitions don't really look
> > > logical. For example, what is 'POWER_NONE'? Can a CPU have 'no power'?
> >
> > The enum belongs to the deprecated API so I would rather not touch it.
> > Keeping the deprecated code isolated will make it easier to remove
> > later.
> 
> So what will replace it? We still have a state field.
Nothing, this is part of the cleanup.
As you state above: POWER_NONE does not make sense at all.
The whole thing (type= attribute that vanishes now) is
passed to userspace, but never gets used there because the
same info is in the event name:
cpu_frequency <-> frequency_switch      <-> PSTATE
cpu_idle      <-> power_start/power_end <-> CSTATE 

I expect that there was an initial power_start/end which
was also used for frequency switching.
Then it was realized that _start/_end does not work for that, and
frequency_switch got introduced.
To stay compatible, power_start/end was not renamed
to cpu_idle and the type= field was kept.

This is a guess without even looking at the git history.
Therefore my partly harsh comments about the sanity of the
current power tracing events.
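
For tooling that has to handle both generations of events, the renaming discussed in this thread can be written down as a small lookup table. This is a sketch only: the legacy names are taken from the V2 changelog quoted earlier in the thread, the new names use the cpu_ prefix agreed above, and the struct and function names are hypothetical:

```c
#include <stddef.h>
#include <string.h>

/* One legacy event name and the event that replaces it. */
struct event_rename {
	const char *legacy;
	const char *replacement;
};

static const struct event_rename renames[] = {
	{ "power_start",     "cpu_idle" },
	{ "power_end",       "cpu_idle" },
	{ "power_frequency", "cpu_frequency" },
};

/* Returns the replacement event name, or NULL for unknown names. */
static const char *new_event_name(const char *legacy)
{
	size_t i;

	for (i = 0; i < sizeof(renames) / sizeof(renames[0]); i++) {
		if (strcmp(renames[i].legacy, legacy) == 0)
			return renames[i].replacement;
	}
	return NULL;
}
```

Because the event name already says whether a record is about idle or frequency, no type= value is needed in the table.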

> Passing in platform specific codes is a step backwards.
> 
> > >> +TRACE_EVENT(machine_suspend,
> > >> +
> > >> +     TP_PROTO(unsigned int state),
> > >> +
> > >> +     TP_ARGS(state),
> > >> +
> > >> +     TP_STRUCT__entry(
> > >> +             __field(        u32,            state           )
> > >> +     ),
> > >
> > > Hm, this event is not used anywhere in the submitted patches. Where is the patch
> > > that adds usage, and what are the possible values for 'state'?
> >
> > This will come as a separate patch, which fits all platforms. Cf.
> > http://marc.info/?l=linux-omap&m=128620575300682&w=2.
> > The state field is of type suspend_state_t, cf. include/linux/suspend.h
> 
> Ok, that's at least generic. Needs the review of Rafael, to determine
> whether this state value is all we want to know when we enter suspend.
He already gave an acked-by on this generic one here:
Re: [PATCH 3/4] perf: add calls to suspend trace point
Oh no, that was on the X86 specific part which depends on this one.
One would expect that he's fine with the generic part as well then,
but I agree that he should definitely have a look at this and sign it off.

So as they got signed-off already, I'll send the X86 suspend events
on top, once I find these in a tree...

    Thomas

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 11:48           ` Thomas Renninger
@ 2010-10-26 11:54             ` Ingo Molnar
  2010-10-26 11:54             ` Ingo Molnar
                               ` (2 subsequent siblings)
  3 siblings, 0 replies; 157+ messages in thread
From: Ingo Molnar @ 2010-10-26 11:54 UTC (permalink / raw)
  To: Thomas Renninger
  Cc: Arjan van de Ven, linux-trace-users, Frederic Weisbecker,
	Pierre Tardy, Jean Pihet, Steven Rostedt, Peter Zijlstra,
	Frank Eigler, Mathieu Desnoyers, linux-pm, Masami Hiramatsu,
	Tejun Heo, Andrew Morton, linux-omap, Linus Torvalds,
	Thomas Gleixner


* Thomas Renninger <trenn@suse.de> wrote:

> On Tuesday 26 October 2010 13:21:29 Ingo Molnar wrote:
> > 
> > * Jean Pihet <jean.pihet@newoldbits.com> wrote:
> ..
> > > >> +#ifndef _TRACE_POWER_ENUM_
> > > >> +#define _TRACE_POWER_ENUM_
> > > >> +enum {
> > > >> +     POWER_NONE = 0,
> > > >> +     POWER_CSTATE = 1,
> > > >> +     POWER_PSTATE = 2,
> > > >> +};
> > > >> +#endif
> > > >
> > > > Since we are cleaning up all these events, those enum definitions don't really look
> > > > logical. For example, what is 'POWER_NONE'? Can a CPU have 'no power'?
> > >
> > > The enum belongs to the deprecated API so I would rather not touch it.
> > > Keeping the deprecated code isolated will make it easier to remove
> > > later.
> > 
> > So what will replace it? We still have a state field.
>
> Nothing, this is part of the cleanup.

I mean, what will go into the state field of the power:cpu_idle event?

> As you state above: POWER_NONE does not make sense at all.
> The whole thing (type= attribute that vanishes now) is
> passed to userspace, but never gets used there because the
> same info is in the event name:
> cpu_frequency <-> frequency_switch      <-> PSTATE
> cpu_idle      <-> power_start/power_end <-> CSTATE 

Ah, I see, so this 'state' enum went into the type field.

So my question is (ignoring this particular enum for now): what values go into the 
state field, which is still kept in the new events as well?

[ I'd like to avoid us having to define a third set of power events a few years down 
  the road ;-) ]

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 11:54             ` Ingo Molnar
  2010-10-26 13:17               ` Thomas Renninger
@ 2010-10-26 13:17               ` Thomas Renninger
  1 sibling, 0 replies; 157+ messages in thread
From: Thomas Renninger @ 2010-10-26 13:17 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arjan van de Ven, linux-trace-users, Frederic Weisbecker,
	Pierre Tardy, Jean Pihet, Steven Rostedt, Peter Zijlstra,
	Frank Eigler, Mathieu Desnoyers, linux-pm, Masami Hiramatsu,
	Tejun Heo, Andrew Morton, linux-omap, Linus Torvalds,
	Thomas Gleixner

On Tuesday 26 October 2010 13:54:22 Ingo Molnar wrote:
> 
> * Thomas Renninger <trenn@suse.de> wrote:
> 
..
> > As you state above: POWER_NONE does not make sense at all.
> > The whole thing (type= attribute that vanishes now) is
> > passed to userspace, but never gets used there because the
> > same info is in the event name:
> > cpu_frequency <-> frequency_switch      <-> PSTATE
> > cpu_idle      <-> power_start/power_end <-> CSTATE 
> 
> Ah, i see, so this 'state' enum went into the type field.
> 
> So my question is, and ignore this particular enum for now, what values go into the 
> state field, which field is still kept in the new events as well.
Same as before:
cpu_idle:
                trace_power_start(POWER_CSTATE, 1, smp_processor_id());
+               trace_processor_idle(1, smp_processor_id());
(Oops, found a copy-and-paste bug in my patch where I reverted
the poll_idle event; the state there should be zero...).

The state in the cpu_idle event is identical to the state index
registered with cpuidle. If cpuidle got registered, the state=X values
of the cpu_idle events let one calculate the same C-state residency
time and usage that you can grab via:
cat /sys/devices/system/cpu/cpu0/cpuidle/stateX/{time,usage}
The cpu_idle event additionally gives you the timestamps of the state
changes.
This is rather nice as userspace can grab additional info from
cpuidle sysfs layer like:
/sys/devices/system/cpu/cpu0/cpuidle/stateX/{desc,power,name}
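
The residency and usage bookkeeping described above can be sketched in userspace as follows. This is an illustration under stated assumptions, not code from the patch: the event stream is for a single CPU, timestamps are in microseconds (the unit of the cpuidle sysfs time attribute), leaving idle is marked with PWR_EVENT_EXIT as proposed in the patch, and the struct names, MAX_STATES and accumulate_idle() are hypothetical:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PWR_EVENT_EXIT 0xFFFFFFFFu	/* marker value from the patch */
#define MAX_STATES 8			/* assumption: upper bound on states */

/* One cpu_idle event for a single CPU. */
struct idle_event {
	uint64_t ts_us;		/* timestamp, assumed microseconds */
	uint32_t state;		/* cpuidle state index or PWR_EVENT_EXIT */
};

/* Per-state totals, comparable to the sysfs time and usage attributes. */
struct idle_stats {
	uint64_t time_us[MAX_STATES];
	uint64_t usage[MAX_STATES];
};

/* Walk the events in order and accumulate residency and entry counts. */
static void accumulate_idle(const struct idle_event *ev, size_t n,
			    struct idle_stats *out)
{
	uint32_t cur = PWR_EVENT_EXIT;	/* state the CPU is currently in */
	uint64_t entered = 0;		/* when that state was entered */
	size_t i;

	memset(out, 0, sizeof(*out));
	for (i = 0; i < n; i++) {
		/* Close the interval spent in the previous idle state. */
		if (cur != PWR_EVENT_EXIT && cur < MAX_STATES)
			out->time_us[cur] += ev[i].ts_us - entered;
		/* Count one more entry into the new idle state. */
		if (ev[i].state != PWR_EVENT_EXIT && ev[i].state < MAX_STATES)
			out->usage[ev[i].state]++;
		cur = ev[i].state;
		entered = ev[i].ts_us;
	}
}
```

For events captured over the same interval, the totals should track the deltas of the corresponding sysfs counters, while the event stream additionally keeps the individual timestamps.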

If cpuidle is not registered, the events you get are arch specific.
I mean they are arch specific anyway, but with cpuidle you can
build up an arch independent userspace framework nicely by looking
up name/desc/power/... of a cpu_idle event in cpuidle sysfs as
described above.
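
The sysfs lookup mentioned above could look roughly like this in a userspace tool. The path layout is the one given in this mail; the function names and buffer sizes are hypothetical, and the sketch simply reports failure when the attribute does not exist (for example when cpuidle is not registered):

```c
#include <stdio.h>
#include <string.h>

/* Build the sysfs path of one cpuidle attribute; 0 on success. */
static int cpuidle_attr_path(unsigned int cpu, unsigned int state,
			     const char *attr, char *path, size_t len)
{
	int n = snprintf(path, len,
			 "/sys/devices/system/cpu/cpu%u/cpuidle/state%u/%s",
			 cpu, state, attr);

	/* Fail on formatting errors and on truncated paths. */
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Read one attribute (e.g. "name", "desc", "time", "usage") for the
 * state number carried by a cpu_idle event. Returns 0 on success and
 * -1 if the file is missing or unreadable.
 */
static int read_cpuidle_attr(unsigned int cpu, unsigned int state,
			     const char *attr, char *buf, size_t len)
{
	char path[128];
	FILE *f;

	if (len == 0 || cpuidle_attr_path(cpu, state, attr, path, sizeof(path)))
		return -1;
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(buf, (int)len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	buf[strcspn(buf, "\n")] = '\0';	/* strip the trailing newline */
	return 0;
}
```

A tool would call read_cpuidle_attr(cpu_id, state, "name", ...) with the two values from the event to print a human-readable state name without any arch-specific table.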

   Thomas

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 11:54             ` Ingo Molnar
@ 2010-10-26 13:17               ` Thomas Renninger
  2010-10-26 13:35                 ` Thomas Renninger
  2010-10-26 13:35                 ` Thomas Renninger
  2010-10-26 13:17               ` Thomas Renninger
  1 sibling, 2 replies; 157+ messages in thread
From: Thomas Renninger @ 2010-10-26 13:17 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jean Pihet, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Masami Hiramatsu, Frank Eigler, Steven Rostedt, Kevin Hilman,
	Peter Zijlstra, linux-omap, rjw, linux-pm, linux-trace-users,
	Pierre Tardy, Frederic Weisbecker, Tejun Heo, Mathieu Desnoyers,
	Arjan van de Ven

On Tuesday 26 October 2010 13:54:22 Ingo Molnar wrote:
> 
> * Thomas Renninger <trenn@suse.de> wrote:
> 
..
> As you state above: POWER_NONE does not make sense at all.
> > The whole thing (type= attribute that vanishes now) is
> > passed to userspace, but never gets used there because the
> > same info is in the event name:
> > cpu_frequency <-> frequency_switch      <-> PSTATE
> > cpu_idle      <-> power_start/power_end <-> CSTATE 
> 
> Ah, i see, so this 'state' enum went into the type field.
> 
> So my question is, and ignore this particular enum for now, what values go into the 
> state field, which field is still kept in the new events as well.
Same as before:
cpu_idle:
                trace_power_start(POWER_CSTATE, 1, smp_processor_id());
+               trace_processor_idle(1, smp_processor_id());
(Oops, I found a copy-and-paste bug in my patch where I reverted
the poll_idle event; the state there should be zero...).

The state in cpu_idle is identical to the state registered with cpuidle.
If cpuidle is registered, one should be able to calculate from the
state=X values (cpu_idle event) the same C-state residency time and
usage that you can grab via:
cat /sys/devices/system/cpu/cpu0/cpuidle/stateX/{time,usage}
The cpu_idle event additionally gives you the timestamps of the state
changes.
This is rather nice as userspace can grab additional info from
cpuidle sysfs layer like:
/sys/devices/system/cpu/cpu0/cpuidle/stateX/{desc,power,name}

If cpuidle is not registered, the events you get are arch specific.
I mean they are arch specific anyway, but with cpuidle you can
build up an arch independent userspace framework nicely by looking
up name/desc/power/... of a cpu_idle event in cpuidle sysfs as
described above.
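
As a rough userspace sketch of that lookup (the helper names
cpuidle_attr_path and cpuidle_read_attr are made up for illustration,
only the sysfs layout is taken from the paths above):

```c
#include <stdio.h>
#include <string.h>

/*
 * Build the cpuidle sysfs path for one attribute of one idle state,
 * e.g. /sys/devices/system/cpu/cpu0/cpuidle/state2/name.
 * Returns 0 on success, -1 if the path does not fit into buf.
 */
int cpuidle_attr_path(char *buf, size_t len, unsigned int cpu,
		      unsigned int state, const char *attr)
{
	int n = snprintf(buf, len,
			 "/sys/devices/system/cpu/cpu%u/cpuidle/state%u/%s",
			 cpu, state, attr);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Read the first line of a cpuidle attribute into out, with the
 * trailing newline stripped. Returns 0 on success, -1 if the file is
 * missing (for example because cpuidle is not registered) or cannot
 * be read.
 */
int cpuidle_read_attr(unsigned int cpu, unsigned int state,
		      const char *attr, char *out, size_t len)
{
	char path[128];
	FILE *f;

	if (cpuidle_attr_path(path, sizeof(path), cpu, state, attr) < 0)
		return -1;
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(out, (int)len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	out[strcspn(out, "\n")] = '\0';
	return 0;
}
```

A tool that sees state=2 in a cpu_idle event on CPU 0 could then call
cpuidle_read_attr(0, 2, "name", buf, sizeof(buf)) to label the state,
and fall back to the raw number if the call returns -1.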

   Thomas

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 13:17               ` Thomas Renninger
@ 2010-10-26 13:35                 ` Thomas Renninger
  2010-10-26 13:35                 ` Thomas Renninger
  1 sibling, 0 replies; 157+ messages in thread
From: Thomas Renninger @ 2010-10-26 13:35 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arjan van de Ven, linux-trace-users, Frederic Weisbecker,
	Pierre Tardy, Jean Pihet, Steven Rostedt, Peter Zijlstra,
	Frank Eigler, Mathieu Desnoyers, linux-pm, Masami Hiramatsu,
	Tejun Heo, Andrew Morton, linux-omap, Linus Torvalds,
	Thomas Gleixner

On Tuesday 26 October 2010 15:17:43 Thomas Renninger wrote:
> On Tuesday 26 October 2010 13:54:22 Ingo Molnar wrote:
> > 
> > * Thomas Renninger <trenn@suse.de> wrote:
...
> If cpuidle is not registered, the events you get are arch specific.
> I mean they are arch specific anyway, but with cpuidle you can
> build up an arch independent userspace framework nicely by looking
> up name/desc/power/... of an cpu_idle event in cpuidle sysfs as
> described above.
About cpuidle and cpu_idle events:
There is one oddity:
arch-specific code which registers for cpuidle has to
throw the cpu_idle "enter sleep state X" event,
while the generic cpuidle framework triggers the "exit" event.

So, as there are cpu_idle events only in drivers/idle/intel_idle.c,
but not in drivers/acpi/processor_idle.c, I expect that the processor.ko
idle driver is broken and only exit events are sent.
Ideally all cpuidle events should be thrown in cpuidle.c like this:

        trace_processor_idle(target_state, smp_processor_id());
        dev->last_residency = target_state->enter(dev, target_state);
        trace_processor_idle(PWR_EVENT_EXIT, smp_processor_id());

My patches do not touch this behavior. If it is broken, it was broken before.
I'll look at it separately.

      Thomas

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 13:17               ` Thomas Renninger
  2010-10-26 13:35                 ` Thomas Renninger
@ 2010-10-26 13:35                 ` Thomas Renninger
  1 sibling, 0 replies; 157+ messages in thread
From: Thomas Renninger @ 2010-10-26 13:35 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jean Pihet, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Masami Hiramatsu, Frank Eigler, Steven Rostedt, Kevin Hilman,
	Peter Zijlstra, linux-omap, rjw, linux-pm, linux-trace-users,
	Pierre Tardy, Frederic Weisbecker, Tejun Heo, Mathieu Desnoyers,
	Arjan van de Ven

On Tuesday 26 October 2010 15:17:43 Thomas Renninger wrote:
> On Tuesday 26 October 2010 13:54:22 Ingo Molnar wrote:
> > 
> > * Thomas Renninger <trenn@suse.de> wrote:
...
> If cpuidle is not registered, the events you get are arch specific.
> I mean they are arch specific anyway, but with cpuidle you can
> build up an arch independent userspace framework nicely by looking
> up name/desc/power/... of an cpu_idle event in cpuidle sysfs as
> described above.
About cpuidle and cpu_idle events:
There is one oddity:
arch-specific code which registers for cpuidle has to
throw the cpu_idle "enter sleep state X" event,
while the generic cpuidle framework triggers the "exit" event.

So, as there are cpu_idle events only in drivers/idle/intel_idle.c,
but not in drivers/acpi/processor_idle.c, I expect that the processor.ko
idle driver is broken and only exit events are sent.
Ideally all cpuidle events should be thrown in cpuidle.c like this:

        trace_processor_idle(target_state, smp_processor_id());
        dev->last_residency = target_state->enter(dev, target_state);
        trace_processor_idle(PWR_EVENT_EXIT, smp_processor_id());

My patches do not touch this behavior. If it is broken, it was broken before.
I'll look at it separately.
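
To illustrate the intended ordering only, here is a small userspace
mock, not kernel code. The struct, the logging helper, the fake_enter
callback and the value of PWR_EVENT_EXIT are assumptions made for this
sketch, and the state is represented by its index:

```c
#include <stddef.h>

#define PWR_EVENT_EXIT	(-1)	/* assumed marker value for this sketch */
#define MAX_EVENTS	16

struct idle_state {
	int index;				/* state number, as in state=X */
	int (*enter)(struct idle_state *state);	/* returns residency */
};

int event_log[MAX_EVENTS];
size_t event_count;

/* Stand-in for the tracepoint: record the state in a log. */
void trace_idle(int state)
{
	if (event_count < MAX_EVENTS)
		event_log[event_count++] = state;
}

/* Fake low-level idle routine that pretends to sleep for 100 us. */
int fake_enter(struct idle_state *state)
{
	(void)state;
	return 100;
}

/* Generic caller: emits both events around the driver's enter(). */
int idle_call(struct idle_state *target)
{
	int residency;

	trace_idle(target->index);
	residency = target->enter(target);
	trace_idle(PWR_EVENT_EXIT);
	return residency;
}
```

With both events in the generic caller, every enter event is paired
with an exit event regardless of which idle driver provides enter().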

      Thomas

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26  7:10     ` Ingo Molnar
                         ` (6 preceding siblings ...)
  2010-10-26 15:32       ` Pierre Tardy
@ 2010-10-26 15:32       ` Pierre Tardy
  7 siblings, 0 replies; 157+ messages in thread
From: Pierre Tardy @ 2010-10-26 15:32 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arjan van de Ven, linux-trace-users, Frederic Weisbecker,
	Jean Pihet, Steven Rostedt, Peter Zijlstra, Frank Eigler,
	Thomas Gleixner, linux-pm, Masami Hiramatsu, Tejun Heo,
	Andrew Morton, linux-omap, Linus Torvalds, Mathieu Desnoyers

On Tue, Oct 26, 2010 at 2:10 AM, Ingo Molnar <mingo@elte.hu> wrote:

> How will future PCI (or other device) power saving tracepoints be called?
>
> Might be more consistent to use:
>
>  power:cpu_idle
>  power:machine_idle
>  power:device_idle
Agree with this.

FYI, I have a runtime_pm tracepoint currently cooking. Here is a
preliminary patch.
Can this be a candidate for a "power:device_idle" tracepoint?

Regards,
Pierre
----
From 3d5e03405f590d470ecfa59c8b9759915bf29307 Mon Sep 17 00:00:00 2001
From: Pierre Tardy <pierre.tardy@intel.com>
Date: Fri, 22 Oct 2010 03:07:07 -0500
Subject: [PATCH] trace/runtime_pm: add runtime_pm trace event

based on the recent hook from Arjan for powertop statistics
we add a tracepoint in order for pytimechart to display
the runtime_pm activity over time, and versus other events.

Signed-off-by: Pierre Tardy <pierre.tardy@intel.com>
---
 drivers/base/power/runtime.c |    3 ++-
 include/trace/events/power.h |   35 +++++++++++++++++++++++++++++++++++
 2 files changed, 37 insertions(+), 1 deletions(-)

diff --git a/drivers/base/power/runtime.c b/drivers/base/power/runtime.c
index b78c401..0f38447 100644
--- a/drivers/base/power/runtime.c
+++ b/drivers/base/power/runtime.c
@@ -9,6 +9,7 @@
 #include <linux/sched.h>
 #include <linux/pm_runtime.h>
 #include <linux/jiffies.h>
+#include <trace/events/power.h>

 static int __pm_runtime_resume(struct device *dev, bool from_wq);
 static int __pm_request_idle(struct device *dev);
@@ -159,9 +160,9 @@ void update_pm_runtime_accounting(struct device *dev)
 static void __update_runtime_status(struct device *dev, enum rpm_status status)
 {
 	update_pm_runtime_accounting(dev);
+	trace_runtime_pm_status(dev, status);
 	dev->power.runtime_status = status;
 }
-
 /**
  * __pm_runtime_suspend - Carry out run-time suspend of given device.
  * @dev: Device to suspend.
diff --git a/include/trace/events/power.h b/include/trace/events/power.h
index 286784d..dd57c29 100644
--- a/include/trace/events/power.h
+++ b/include/trace/events/power.h
@@ -6,6 +6,7 @@

 #include <linux/ktime.h>
 #include <linux/tracepoint.h>
+#include <linux/device.h>

 #ifndef _TRACE_POWER_ENUM_
 #define _TRACE_POWER_ENUM_
@@ -75,6 +76,40 @@ TRACE_EVENT(power_end,

 );

+#ifdef CONFIG_PM_RUNTIME
+#define rpm_status_name(status) { RPM_##status, #status }
+#define show_rpm_status_name(val)				\
+	__print_symbolic(val,					\
+		rpm_status_name(SUSPENDED),			\
+		rpm_status_name(SUSPENDING),			\
+		rpm_status_name(RESUMING),			\
+		rpm_status_name(ACTIVE)		                \
+		)
+TRACE_EVENT(runtime_pm_status,
+
+	TP_PROTO(struct device *dev, int new_status),
+
+	TP_ARGS(dev, new_status),
+
+	TP_STRUCT__entry(
+		__string(devname,dev_name(dev))
+		__string(drivername,dev_driver_string(dev))
+		__field(u32, prev_status)
+		__field(u32, status)
+	),
+
+	TP_fast_assign(
+		__assign_str(devname, dev_name(dev));
+		__assign_str(drivername, dev_driver_string(dev));
+		__entry->prev_status = (u32)dev->power.runtime_status;
+		__entry->status = (u32)new_status;
+	),
+
+	TP_printk("driver=%s dev=%s prev_status=%s status=%s", __get_str(drivername),__get_str(devname),
+		  show_rpm_status_name(__entry->prev_status), show_rpm_status_name(__entry->status))
+);
+#endif /* CONFIG_PM_RUNTIME */
+
 /*
  * The clock events are used for clock enable/disable and for
  *  clock rate change
-- 
1.7.2.3

^ permalink raw reply related	[flat|nested] 157+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26  7:10     ` Ingo Molnar
                         ` (5 preceding siblings ...)
  2010-10-26 10:37       ` Thomas Renninger
@ 2010-10-26 15:32       ` Pierre Tardy
  2010-10-26 16:04         ` Arjan van de Ven
  2010-10-26 16:04         ` Arjan van de Ven
  2010-10-26 15:32       ` Pierre Tardy
  7 siblings, 2 replies; 157+ messages in thread
From: Pierre Tardy @ 2010-10-26 15:32 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thomas Renninger, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Masami Hiramatsu, Frank Eigler, Steven Rostedt, Kevin Hilman,
	Peter Zijlstra, linux-omap, rjw, linux-pm, linux-trace-users,
	Jean Pihet, Frederic Weisbecker, Tejun Heo, Mathieu Desnoyers,
	Arjan van de Ven

On Tue, Oct 26, 2010 at 2:10 AM, Ingo Molnar <mingo@elte.hu> wrote:

> How will future PCI (or other device) power saving tracepoints be called?
>
> Might be more consistent to use:
>
>  power:cpu_idle
>  power:machine_idle
>  power:device_idle
Agree with this.

FYI, I have a runtime_pm tracepoint currently cooking. Here is a
preliminary patch.
Can this be a candidate for a "power:device_idle" tracepoint?

Regards,
Pierre
----
From 3d5e03405f590d470ecfa59c8b9759915bf29307 Mon Sep 17 00:00:00 2001
From: Pierre Tardy <pierre.tardy@intel.com>
Date: Fri, 22 Oct 2010 03:07:07 -0500
Subject: [PATCH] trace/runtime_pm: add runtime_pm trace event

based on the recent hook from Arjan for powertop statistics
we add a tracepoint in order for pytimechart to display
the runtime_pm activity over time, and versus other events.

Signed-off-by: Pierre Tardy <pierre.tardy@intel.com>
---
 drivers/base/power/runtime.c |    3 ++-
 include/trace/events/power.h |   35 +++++++++++++++++++++++++++++++++++
 2 files changed, 37 insertions(+), 1 deletions(-)

diff --git a/drivers/base/power/runtime.c b/drivers/base/power/runtime.c
index b78c401..0f38447 100644
--- a/drivers/base/power/runtime.c
+++ b/drivers/base/power/runtime.c
@@ -9,6 +9,7 @@
 #include <linux/sched.h>
 #include <linux/pm_runtime.h>
 #include <linux/jiffies.h>
+#include <trace/events/power.h>

 static int __pm_runtime_resume(struct device *dev, bool from_wq);
 static int __pm_request_idle(struct device *dev);
@@ -159,9 +160,9 @@ void update_pm_runtime_accounting(struct device *dev)
 static void __update_runtime_status(struct device *dev, enum rpm_status status)
 {
 	update_pm_runtime_accounting(dev);
+	trace_runtime_pm_status(dev, status);
 	dev->power.runtime_status = status;
 }
-
 /**
  * __pm_runtime_suspend - Carry out run-time suspend of given device.
  * @dev: Device to suspend.
diff --git a/include/trace/events/power.h b/include/trace/events/power.h
index 286784d..dd57c29 100644
--- a/include/trace/events/power.h
+++ b/include/trace/events/power.h
@@ -6,6 +6,7 @@

 #include <linux/ktime.h>
 #include <linux/tracepoint.h>
+#include <linux/device.h>

 #ifndef _TRACE_POWER_ENUM_
 #define _TRACE_POWER_ENUM_
@@ -75,6 +76,40 @@ TRACE_EVENT(power_end,

 );

+#ifdef CONFIG_PM_RUNTIME
+#define rpm_status_name(status) { RPM_##status, #status }
+#define show_rpm_status_name(val)				\
+	__print_symbolic(val,					\
+		rpm_status_name(SUSPENDED),			\
+		rpm_status_name(SUSPENDING),			\
+		rpm_status_name(RESUMING),			\
+		rpm_status_name(ACTIVE)		                \
+		)
+TRACE_EVENT(runtime_pm_status,
+
+	TP_PROTO(struct device *dev, int new_status),
+
+	TP_ARGS(dev, new_status),
+
+	TP_STRUCT__entry(
+		__string(devname,dev_name(dev))
+		__string(drivername,dev_driver_string(dev))
+		__field(u32, prev_status)
+		__field(u32, status)
+	),
+
+	TP_fast_assign(
+		__assign_str(devname, dev_name(dev));
+		__assign_str(drivername, dev_driver_string(dev));
+		__entry->prev_status = (u32)dev->power.runtime_status;
+		__entry->status = (u32)new_status;
+	),
+
+	TP_printk("driver=%s dev=%s prev_status=%s status=%s", __get_str(drivername),__get_str(devname),
+		  show_rpm_status_name(__entry->prev_status), show_rpm_status_name(__entry->status))
+);
+#endif /* CONFIG_PM_RUNTIME */
+
 /*
  * The clock events are used for clock enable/disable and for
  *  clock rate change
-- 
1.7.2.3

^ permalink raw reply related	[flat|nested] 157+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 15:32       ` Pierre Tardy
@ 2010-10-26 16:04         ` Arjan van de Ven
  2010-10-26 16:04         ` Arjan van de Ven
  1 sibling, 0 replies; 157+ messages in thread
From: Arjan van de Ven @ 2010-10-26 16:04 UTC (permalink / raw)
  To: Pierre Tardy
  Cc: Andrew Morton, linux-trace-users, Frederic Weisbecker,
	Jean Pihet, Steven Rostedt, Peter Zijlstra, Frank Eigler,
	Mathieu Desnoyers, linux-pm, Masami Hiramatsu, Tejun Heo,
	Thomas Gleixner, linux-omap, Linus Torvalds, Ingo Molnar

On 10/26/2010 8:32 AM, Pierre Tardy wrote:
> On Tue, Oct 26, 2010 at 2:10 AM, Ingo Molnar<mingo@elte.hu>  wrote:
>
>> How will future PCI (or other device) power saving tracepoints be called?
>>
>> Might be more consistent to use:
>>
>>   power:cpu_idle
>>   power:machine_idle
>>   power:device_idle
> Agree with this.
>
> FYI, I have a runtime_pm tracepoint currently cooking. Here is
> preliminary patch.
> Can this be a candidate for a "power:device_idle" tracepoint?


I would like to see a slightly more advanced tracepoint for the runtime
pm framework;
specifically I'd like to see the "comm" of the process that's taking a
refcount on a device
(that way, powertop can track which process keeps a device busy)

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 15:32       ` Pierre Tardy
  2010-10-26 16:04         ` Arjan van de Ven
@ 2010-10-26 16:04         ` Arjan van de Ven
  2010-10-26 16:56           ` Pierre Tardy
  2010-10-26 16:56           ` Pierre Tardy
  1 sibling, 2 replies; 157+ messages in thread
From: Arjan van de Ven @ 2010-10-26 16:04 UTC (permalink / raw)
  To: Pierre Tardy
  Cc: Ingo Molnar, Thomas Renninger, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Masami Hiramatsu, Frank Eigler, Steven Rostedt,
	Kevin Hilman, Peter Zijlstra, linux-omap, rjw, linux-pm,
	linux-trace-users, Jean Pihet, Frederic Weisbecker, Tejun Heo,
	Mathieu Desnoyers

On 10/26/2010 8:32 AM, Pierre Tardy wrote:
> On Tue, Oct 26, 2010 at 2:10 AM, Ingo Molnar<mingo@elte.hu>  wrote:
>
>> How will future PCI (or other device) power saving tracepoints be called?
>>
>> Might be more consistent to use:
>>
>>   power:cpu_idle
>>   power:machine_idle
>>   power:device_idle
> Agree with this.
>
> FYI, I have a runtime_pm tracepoint currently cooking. Here is
> preliminary patch.
> Can this be a candidate for a "power:device_idle" tracepoint?


I would like to see a slightly more advanced tracepoint for the runtime
pm framework;
specifically I'd like to see the "comm" of the process that's taking a
refcount on a device
(that way, powertop can track which process keeps a device busy)


^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 16:04         ` Arjan van de Ven
  2010-10-26 16:56           ` Pierre Tardy
@ 2010-10-26 16:56           ` Pierre Tardy
  1 sibling, 0 replies; 157+ messages in thread
From: Pierre Tardy @ 2010-10-26 16:56 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Andrew Morton, linux-trace-users, Frederic Weisbecker,
	Jean Pihet, Steven Rostedt, Peter Zijlstra, Frank Eigler,
	Mathieu Desnoyers, linux-pm, Masami Hiramatsu, Tejun Heo,
	Thomas Gleixner, linux-omap, Linus Torvalds, Ingo Molnar

> I would like to see a slightly more advanced tracepoint do the runtime pm
> framework;
> specifically I'd like to see the "comm" of the process that's taking a
> refcount on a device
> (that way, powertop can track which process keeps a device busy)
>
>
Yes, the "comm" for this tracepoint should be the runtime_pm workqueue.
To track responsibilities, I'm making another tracepoint that traces
the rpm_get/put.

diff --git a/drivers/base/power/runtime.c b/drivers/base/power/runtime.c
index 0f38447..54d9911 100644
--- a/drivers/base/power/runtime.c
+++ b/drivers/base/power/runtime.c
@@ -792,7 +792,7 @@ EXPORT_SYMBOL_GPL(pm_request_resume);
 int __pm_runtime_get(struct device *dev, bool sync)
 {
        int retval;

+       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
        atomic_inc(&dev->power.usage_count);
        retval = sync ? pm_runtime_resume(dev) : pm_request_resume(dev);

@@ -813,6 +813,7 @@ int __pm_runtime_put(struct device *dev, bool sync)
 {
        int retval = 0;

+       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)-1);
        if (atomic_dec_and_test(&dev->power.usage_count))
                retval = sync ? pm_runtime_idle(dev) : pm_request_idle(dev);

@@ -1065,6 +1066,7 @@ void pm_runtime_forbid(struct device *dev)
                goto out;

        dev->power.runtime_auto = false;
+       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
        atomic_inc(&dev->power.usage_count);
        __pm_runtime_resume(dev, false);

@@ -1086,6 +1088,7 @@ void pm_runtime_allow(struct device *dev)
                goto out;

        dev->power.runtime_auto = true;
+       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)-1);
        if (atomic_dec_and_test(&dev->power.usage_count))
                __pm_runtime_idle(dev);
diff --git a/include/trace/events/power.h b/include/trace/events/power.h
index ea514eb..813229c 100644
--- a/include/trace/events/power.h
+++ b/include/trace/events/power.h
@@ -100,6 +100,29 @@ TRACE_EVENT(runtime_pm_status,
        TP_printk("driver=%s dev=%s prev_status=%s status=%s", __get_str(drivername),__get_str(devname),
                  show_rpm_status_name(__entry->prev_status), show_rpm_status_name(__entry->status))
 );
+TRACE_EVENT(runtime_pm_usage,
+
+       TP_PROTO(struct device *dev, int new_usage),
+
+       TP_ARGS(dev, new_usage),
+
+       TP_STRUCT__entry(
+               __string(devname,dev_name(dev))
+               __string(drivername,dev_driver_string(dev))
+               __field(u32, prev_usage)
+               __field(u32, usage)
+       ),
+
+       TP_fast_assign(
+               __assign_str(devname, dev_name(dev));
+               __assign_str(drivername, dev_driver_string(dev));
+               __entry->prev_usage = (u32)atomic_read(&dev->power.usage_count);
+               __entry->usage = (u32)new_usage;
+       ),
+
+       TP_printk("driver=%s dev=%s prev_usage=%d usage=%d", __get_str(drivername),__get_str(devname),
+                 __entry->prev_usage, __entry->usage)
+);
-- 
Pierre

^ permalink raw reply related	[flat|nested] 157+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 16:04         ` Arjan van de Ven
@ 2010-10-26 16:56           ` Pierre Tardy
  2010-10-26 17:58             ` Peter Zijlstra
  2010-10-26 17:58             ` Peter Zijlstra
  2010-10-26 16:56           ` Pierre Tardy
  1 sibling, 2 replies; 157+ messages in thread
From: Pierre Tardy @ 2010-10-26 16:56 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Ingo Molnar, Thomas Renninger, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Masami Hiramatsu, Frank Eigler, Steven Rostedt,
	Kevin Hilman, Peter Zijlstra, linux-omap, rjw, linux-pm,
	linux-trace-users, Jean Pihet, Frederic Weisbecker, Tejun Heo,
	Mathieu Desnoyers

> I would like to see a slightly more advanced tracepoint do the runtime pm
> framework;
> specifically I'd like to see the "comm" of the process that's taking a
> refcount on a device
> (that way, powertop can track which process keeps a device busy)
>
>
Yes, the "comm" for this tracepoint should be the runtime_pm workqueue.
To track responsibilities, I'm making another tracepoint that traces
the rpm_get/put.

diff --git a/drivers/base/power/runtime.c b/drivers/base/power/runtime.c
index 0f38447..54d9911 100644
--- a/drivers/base/power/runtime.c
+++ b/drivers/base/power/runtime.c
@@ -792,7 +792,7 @@ EXPORT_SYMBOL_GPL(pm_request_resume);
 int __pm_runtime_get(struct device *dev, bool sync)
 {
        int retval;

+       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
        atomic_inc(&dev->power.usage_count);
        retval = sync ? pm_runtime_resume(dev) : pm_request_resume(dev);

@@ -813,6 +813,7 @@ int __pm_runtime_put(struct device *dev, bool sync)
 {
        int retval = 0;

+       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)-1);
        if (atomic_dec_and_test(&dev->power.usage_count))
                retval = sync ? pm_runtime_idle(dev) : pm_request_idle(dev);

@@ -1065,6 +1066,7 @@ void pm_runtime_forbid(struct device *dev)
                goto out;

        dev->power.runtime_auto = false;
+       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
        atomic_inc(&dev->power.usage_count);
        __pm_runtime_resume(dev, false);

@@ -1086,6 +1088,7 @@ void pm_runtime_allow(struct device *dev)
                goto out;

        dev->power.runtime_auto = true;
+       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)-1);
        if (atomic_dec_and_test(&dev->power.usage_count))
                __pm_runtime_idle(dev);
diff --git a/include/trace/events/power.h b/include/trace/events/power.h
index ea514eb..813229c 100644
--- a/include/trace/events/power.h
+++ b/include/trace/events/power.h
@@ -100,6 +100,29 @@ TRACE_EVENT(runtime_pm_status,
        TP_printk("driver=%s dev=%s prev_status=%s status=%s", __get_str(drivername),__get_str(devname),
                  show_rpm_status_name(__entry->prev_status), show_rpm_status_name(__entry->status))
 );
+TRACE_EVENT(runtime_pm_usage,
+
+       TP_PROTO(struct device *dev, int new_usage),
+
+       TP_ARGS(dev, new_usage),
+
+       TP_STRUCT__entry(
+               __string(devname,dev_name(dev))
+               __string(drivername,dev_driver_string(dev))
+               __field(u32, prev_usage)
+               __field(u32, usage)
+       ),
+
+       TP_fast_assign(
+               __assign_str(devname, dev_name(dev));
+               __assign_str(drivername, dev_driver_string(dev));
+               __entry->prev_usage = (u32)atomic_read(&dev->power.usage_count);
+               __entry->usage = (u32)new_usage;
+       ),
+
+       TP_printk("driver=%s dev=%s prev_usage=%d usage=%d", __get_str(drivername),__get_str(devname),
+                 __entry->prev_usage, __entry->usage)
+);
-- 
Pierre

^ permalink raw reply related	[flat|nested] 157+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 16:56           ` Pierre Tardy
  2010-10-26 17:58             ` Peter Zijlstra
@ 2010-10-26 17:58             ` Peter Zijlstra
  1 sibling, 0 replies; 157+ messages in thread
From: Peter Zijlstra @ 2010-10-26 17:58 UTC (permalink / raw)
  To: Pierre Tardy
  Cc: Frederic Weisbecker, Andrew Morton, Arjan van de Ven, linux-pm,
	Jean Pihet, Rostedt, linux-trace-users, Frank Eigler,
	Mathieu Desnoyers, Steven, Masami Hiramatsu, Tejun Heo,
	Thomas Gleixner, linux-omap, Linus Torvalds, Ingo Molnar

On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
> 
> +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
>         atomic_inc(&dev->power.usage_count); 

That's terribly racy..

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 16:56           ` Pierre Tardy
@ 2010-10-26 17:58             ` Peter Zijlstra
  2010-10-26 18:14               ` Mathieu Desnoyers
                                 ` (3 more replies)
  2010-10-26 17:58             ` Peter Zijlstra
  1 sibling, 4 replies; 157+ messages in thread
From: Peter Zijlstra @ 2010-10-26 17:58 UTC (permalink / raw)
  To: Pierre Tardy
  Cc: Arjan van de Ven, Ingo Molnar, Thomas Renninger, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Masami Hiramatsu, Frank Eigler,
	Steven Rostedt, Kevin Hilman, linux-omap, rjw, linux-pm,
	linux-trace-users, Jean Pihet, Frederic Weisbecker, Tejun Heo,
	Mathieu Desnoyers

On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
> 
> +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
>         atomic_inc(&dev->power.usage_count); 

That's terribly racy..

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 17:58             ` Peter Zijlstra
@ 2010-10-26 18:14               ` Mathieu Desnoyers
  2010-10-26 18:14               ` Mathieu Desnoyers
                                 ` (2 subsequent siblings)
  3 siblings, 0 replies; 157+ messages in thread
From: Mathieu Desnoyers @ 2010-10-26 18:14 UTC (permalink / raw)
  To: Peter Zijlstra, Greg Kroah-Hartman
  Cc: Andrew Morton, Pierre Tardy, Arjan van de Ven,
	Frederic Weisbecker, Jean Pihet, Steven Rostedt,
	linux-trace-users, Frank Eigler, Thomas Gleixner, linux-pm,
	Masami Hiramatsu, Tejun Heo, Ingo Molnar, linux-omap,
	Linus Torvalds

* Peter Zijlstra (peterz@infradead.org) wrote:
> On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
> > 
> > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> >         atomic_inc(&dev->power.usage_count); 
> 
> That's terribly racy..

Looking at the original code, it looks racy even without considering the
tracepoint:

int __pm_runtime_get(struct device *dev, bool sync)
 {
        int retval;

+       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
        atomic_inc(&dev->power.usage_count);
        retval = sync ? pm_runtime_resume(dev) : pm_request_resume(dev);

There is no implied memory barrier after "atomic_inc". So either all these
inc/dec are protected with mutexes or spinlocks, in which case one might wonder
why atomic operations are used at all, or it's a racy mess. (I vote for the
second option)

kref should certainly be used there.

About the instrumentation, well... the only way to have something that's not
racy would be to instrument kref directly, and use atomic_add_return() in both
the get/put paths. But I fear that the performance impact on many architectures
might be significant (turning atomic_add + smp_mb() into a cmpxchg()). Maybe it
could be acceptable as a kernel debug option.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 17:58             ` Peter Zijlstra
  2010-10-26 18:14               ` Mathieu Desnoyers
@ 2010-10-26 18:14               ` Mathieu Desnoyers
  2010-10-26 18:50                 ` [linux-pm] " Alan Stern
                                   ` (3 more replies)
  2010-10-26 18:15               ` Pierre Tardy
  2010-10-26 18:15               ` Pierre Tardy
  3 siblings, 4 replies; 157+ messages in thread
From: Mathieu Desnoyers @ 2010-10-26 18:14 UTC (permalink / raw)
  To: Peter Zijlstra, Greg Kroah-Hartman
  Cc: Pierre Tardy, Arjan van de Ven, Ingo Molnar, Thomas Renninger,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Masami Hiramatsu,
	Frank Eigler, Steven Rostedt, Kevin Hilman, linux-omap, rjw,
	linux-pm, linux-trace-users, Jean Pihet, Frederic Weisbecker,
	Tejun Heo

* Peter Zijlstra (peterz@infradead.org) wrote:
> On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
> > 
> > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> >         atomic_inc(&dev->power.usage_count); 
> 
> That's terribly racy..

Looking at the original code, it looks racy even without considering the
tracepoint:

int __pm_runtime_get(struct device *dev, bool sync)
 {
        int retval;

+       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
        atomic_inc(&dev->power.usage_count);
        retval = sync ? pm_runtime_resume(dev) : pm_request_resume(dev);

There is no implied memory barrier after "atomic_inc". So either all these
inc/dec are protected with mutexes or spinlocks, in which case one might wonder
why atomic operations are used at all, or it's a racy mess. (I vote for the
second option)

kref should certainly be used there.

About the instrumentation, well... the only way to have something that's not
racy would be to instrument kref directly, and use atomic_add_return() in both
the get/put paths. But I fear that the performance impact on many architectures
might be significant (turning atomic_add + smp_mb() into a cmpxchg()). Maybe it
could be acceptable as a kernel debug option.
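
To make the difference concrete, here is a userspace sketch with C11
atomics rather than the kernel's atomic_t API. The struct and function
names are hypothetical; atomic_fetch_add() plus one plays the role of
atomic_add_return():

```c
#include <stdatomic.h>

/* Hypothetical stand-in for dev->power.usage_count. */
struct usage_counter {
	atomic_int count;
};

/*
 * Racy pattern from the patch: the value handed to the trace is read
 * separately from the increment, so another CPU can change the counter
 * in between and the reported value may never have been the real one.
 */
int usage_get_racy(struct usage_counter *c)
{
	int reported = atomic_load(&c->count) + 1;

	atomic_fetch_add(&c->count, 1);
	return reported;
}

/*
 * Single read-modify-write: atomic_fetch_add() returns the old value,
 * so old + 1 is exactly the value this caller installed.
 */
int usage_get(struct usage_counter *c)
{
	return atomic_fetch_add(&c->count, 1) + 1;
}

/* Same idea for the put path: report the value left after the drop. */
int usage_put(struct usage_counter *c)
{
	return atomic_fetch_sub(&c->count, 1) - 1;
}
```

The value returned by usage_get()/usage_put() is what a tracepoint
could log without a second, separate read of the counter.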

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 17:58             ` Peter Zijlstra
                                 ` (2 preceding siblings ...)
  2010-10-26 18:15               ` Pierre Tardy
@ 2010-10-26 18:15               ` Pierre Tardy
  3 siblings, 0 replies; 157+ messages in thread
From: Pierre Tardy @ 2010-10-26 18:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Arjan van de Ven, Frederic Weisbecker, Jean Pihet,
	Steven Rostedt, linux-trace-users, Frank Eigler,
	Mathieu Desnoyers, linux-pm, Masami Hiramatsu, Tejun Heo,
	Thomas Gleixner, linux-omap, Linus Torvalds, Ingo Molnar

On Tue, Oct 26, 2010 at 12:58 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
>>
>> +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
>>         atomic_inc(&dev->power.usage_count);
>
> That's terribly racy..
>
I know. I'm not proud of this. As I said, this is a preliminary patch.
We don't really need this prev_usage; it is just for debugging.
It will probably end up as something like:

         atomic_inc(&dev->power.usage_count);
+       trace_power_device_usage(dev);

-- 
Pierre

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 17:58             ` Peter Zijlstra
  2010-10-26 18:14               ` Mathieu Desnoyers
  2010-10-26 18:14               ` Mathieu Desnoyers
@ 2010-10-26 18:15               ` Pierre Tardy
  2010-10-26 19:08                 ` Rafael J. Wysocki
  2010-10-26 19:08                 ` Rafael J. Wysocki
  2010-10-26 18:15               ` Pierre Tardy
  3 siblings, 2 replies; 157+ messages in thread
From: Pierre Tardy @ 2010-10-26 18:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Arjan van de Ven, Ingo Molnar, Thomas Renninger, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Masami Hiramatsu, Frank Eigler,
	Steven Rostedt, Kevin Hilman, linux-omap, rjw, linux-pm,
	linux-trace-users, Jean Pihet, Frederic Weisbecker, Tejun Heo,
	Mathieu Desnoyers

On Tue, Oct 26, 2010 at 12:58 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
>>
>> +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
>>         atomic_inc(&dev->power.usage_count);
>
> That's terribly racy..
>
I know. I'm not proud of this. As I said, this is a preliminary patch.
We don't really need this prev_usage; it is just for debugging.
It will probably end up as something like:

         atomic_inc(&dev->power.usage_count);
+       trace_power_device_usage(dev);

-- 
Pierre

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 18:14               ` Mathieu Desnoyers
  2010-10-26 18:50                 ` [linux-pm] " Alan Stern
@ 2010-10-26 18:50                 ` Alan Stern
  2010-10-26 19:04                 ` Rafael J. Wysocki
  2010-10-26 19:04                 ` Rafael J. Wysocki
  3 siblings, 0 replies; 157+ messages in thread
From: Alan Stern @ 2010-10-26 18:50 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Pierre Tardy, Peter Zijlstra, linux-pm, Ingo Molnar, Jean Pihet,
	Steven Rostedt, linux-trace-users, Frank Eigler, Linus Torvalds,
	Frederic Weisbecker, Masami Hiramatsu, Tejun Heo, Andrew Morton,
	linux-omap, Arjan van de Ven, Thomas Gleixner

On Tue, 26 Oct 2010, Mathieu Desnoyers wrote:

> * Peter Zijlstra (peterz@infradead.org) wrote:
> > On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
> > > 
> > > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> > >         atomic_inc(&dev->power.usage_count); 
> > 
> > That's terribly racy..
> 
> Looking at the original code, it looks racy even without considering the
> tracepoint:
> 
> int __pm_runtime_get(struct device *dev, bool sync)
>  {
>         int retval;
> 
> +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
>         atomic_inc(&dev->power.usage_count);
>         retval = sync ? pm_runtime_resume(dev) : pm_request_resume(dev);
> 
> There is no implied memory barrier after "atomic_inc". So either all these
> inc/dec are protected with mutexes or spinlocks, in which case one might wonder
> why atomic operations are used at all, or it's a racy mess. (I vote for the
> second option)

I don't understand.  What's the problem?  The inc/dec are atomic 
because they are not protected by spinlocks, but everything else is 
(aside from the tracepoint, which is new).

> kref should certainly be used there.

What for?

Alan Stern

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [linux-pm] [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 18:14               ` Mathieu Desnoyers
@ 2010-10-26 18:50                 ` Alan Stern
  2010-10-26 21:33                   ` Mathieu Desnoyers
  2010-10-26 21:33                   ` [linux-pm] " Mathieu Desnoyers
  2010-10-26 18:50                 ` Alan Stern
                                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 157+ messages in thread
From: Alan Stern @ 2010-10-26 18:50 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Greg Kroah-Hartman, Andrew Morton, Pierre Tardy,
	Arjan van de Ven, Frederic Weisbecker, Jean Pihet,
	Steven Rostedt, linux-trace-users, Frank Eigler, Thomas Gleixner,
	linux-pm, Masami Hiramatsu, Tejun Heo, Ingo Molnar, linux-omap,
	Linus Torvalds

On Tue, 26 Oct 2010, Mathieu Desnoyers wrote:

> * Peter Zijlstra (peterz@infradead.org) wrote:
> > On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
> > > 
> > > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> > >         atomic_inc(&dev->power.usage_count); 
> > 
> > That's terribly racy..
> 
> Looking at the original code, it looks racy even without considering the
> tracepoint:
> 
> int __pm_runtime_get(struct device *dev, bool sync)
>  {
>         int retval;
> 
> +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
>         atomic_inc(&dev->power.usage_count);
>         retval = sync ? pm_runtime_resume(dev) : pm_request_resume(dev);
> 
> There is no implied memory barrier after "atomic_inc". So either all these
> inc/dec are protected with mutexes or spinlocks, in which case one might wonder
> why atomic operations are used at all, or it's a racy mess. (I vote for the
> second option)

I don't understand.  What's the problem?  The inc/dec are atomic 
because they are not protected by spinlocks, but everything else is 
(aside from the tracepoint, which is new).

> kref should certainly be used there.

What for?

Alan Stern


^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-25 23:33   ` [PATCH] PERF(kernel): Cleanup power events V2 Thomas Renninger
                       ` (6 preceding siblings ...)
  2010-10-26 18:52     ` Rafael J. Wysocki
@ 2010-10-26 18:52     ` Rafael J. Wysocki
  7 siblings, 0 replies; 157+ messages in thread
From: Rafael J. Wysocki @ 2010-10-26 18:52 UTC (permalink / raw)
  To: Thomas Renninger
  Cc: Arjan van de Ven, linux-trace-users, Frederic Weisbecker,
	Pierre Tardy, Jean Pihet, Steven Rostedt, Peter Zijlstra,
	Frank Eigler, Mathieu Desnoyers, Ingo Molnar, linux-pm,
	Masami Hiramatsu, Tejun Heo, Andrew Morton, linux-omap,
	Linus Torvalds, Thomas Gleixner

On Tuesday, October 26, 2010, Thomas Renninger wrote:
> Changes in V2:
>   - Introduce PWR_EVENT_EXIT instead of 0 to mark non-power state
>   - Use u32 instead of u64 for cpuid, state which is by far enough
> 
> New power trace events:
> power:processor_idle
> power:processor_frequency
> power:machine_suspend
> 
> 
> C-state/idle accounting events:
>   power:power_start
>   power:power_end
> are replaced with:
>   power:processor_idle
> 
> and
>   power:power_frequency
> is replaced with:
>   power:processor_frequency
> 
> power:machine_suspend
> is newly introduced, a first implementation
> comes from the ARM side, but it's easy to add these events
> in X86 as well if needed.

As I said already somewhere else, I think this one should be done at the
core level rather than in arch-specific code.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-25 23:33   ` [PATCH] PERF(kernel): Cleanup power events V2 Thomas Renninger
                       ` (5 preceding siblings ...)
  2010-10-26  7:59     ` Jean Pihet
@ 2010-10-26 18:52     ` Rafael J. Wysocki
  2010-10-26 18:52     ` Rafael J. Wysocki
  7 siblings, 0 replies; 157+ messages in thread
From: Rafael J. Wysocki @ 2010-10-26 18:52 UTC (permalink / raw)
  To: Thomas Renninger
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Masami Hiramatsu,
	Frank Eigler, Steven Rostedt, Kevin Hilman, Peter Zijlstra,
	linux-omap, linux-pm, linux-trace-users, Jean Pihet,
	Pierre Tardy, Frederic Weisbecker, Tejun Heo, Mathieu Desnoyers,
	Arjan van de Ven, Ingo Molnar

On Tuesday, October 26, 2010, Thomas Renninger wrote:
> Changes in V2:
>   - Introduce PWR_EVENT_EXIT instead of 0 to mark non-power state
>   - Use u32 instead of u64 for cpuid, state which is by far enough
> 
> New power trace events:
> power:processor_idle
> power:processor_frequency
> power:machine_suspend
> 
> 
> C-state/idle accounting events:
>   power:power_start
>   power:power_end
> are replaced with:
>   power:processor_idle
> 
> and
>   power:power_frequency
> is replaced with:
>   power:processor_frequency
> 
> power:machine_suspend
> is newly introduced, a first implementation
> comes from the ARM side, but it's easy to add these events
> in X86 as well if needed.

As I said already somewhere else, I think this one should be done at the
core level rather than in arch-specific code.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 11:48           ` Thomas Renninger
                               ` (2 preceding siblings ...)
  2010-10-26 18:57             ` Rafael J. Wysocki
@ 2010-10-26 18:57             ` Rafael J. Wysocki
  3 siblings, 0 replies; 157+ messages in thread
From: Rafael J. Wysocki @ 2010-10-26 18:57 UTC (permalink / raw)
  To: Thomas Renninger
  Cc: Arjan van de Ven, Andrew Morton, linux-trace-users,
	Frederic Weisbecker, Pierre Tardy, Jean Pihet, Steven Rostedt,
	Peter Zijlstra, Frank Eigler, Mathieu Desnoyers, linux-pm,
	Masami Hiramatsu, Tejun Heo, Thomas Gleixner, linux-omap,
	Linus Torvalds, Ingo Molnar

On Tuesday, October 26, 2010, Thomas Renninger wrote:
> On Tuesday 26 October 2010 13:21:29 Ingo Molnar wrote:
> > 
> > * Jean Pihet <jean.pihet@newoldbits.com> wrote:
> ..
> > > >> +#ifndef _TRACE_POWER_ENUM_
> > > >> +#define _TRACE_POWER_ENUM_
> > > >> +enum {
> > > >> +     POWER_NONE = 0,
> > > >> +     POWER_CSTATE = 1,
> > > >> +     POWER_PSTATE = 2,
> > > >> +};
> > > >> +#endif
> > > >
> > > > Since we are cleaning up all these events, those enum definitions dont really look
> > > > logical. For example, what is 'POWER_NONE'? Can a CPU have 'no power'?
> > >
> > > The enum belongs to the deprecated API so I would rather not touch it.
> > > Keeping the deprecated code isolated will make it easier to remove
> > > later.
> > 
> > So what will replace it? We still have a state field.
> Nothing, this is part of the cleanup.
> As you state above: POWER_NONE does not make sense at all.
> The whole thing (type= attribute that vanishes now) is
> passed to userspace, but never gets used there because the
> same info is in the event name:
> cpu_frequency <-> frequency_switch      <-> PSTATE
> cpu_idle      <-> power_start/power_end <-> CSTATE 
> 
> I expect that there was an initial power_start/end which
> was also used for frequency switching.
> Then it got realized that _start/_end does not work out and
> frequency_switch got introduced.
> To stay compatible the whole power_start/end was not renamed
> to cpu_idle and the type= field was kept.
> 
> This is a guess without even looking at the git history.
> Therefore my partly harsh comments about the sanity of the
> current power tracing events.
> 
> > Passing in platform specific codes is a step backwards.
> > 
> > > >> +TRACE_EVENT(machine_suspend,
> > > >> +
> > > >> +     TP_PROTO(unsigned int state),
> > > >> +
> > > >> +     TP_ARGS(state),
> > > >> +
> > > >> +     TP_STRUCT__entry(
> > > >> +             __field(        u32,            state           )
> > > >> +     ),
> > > >
> > > > Hm, this event is not used anywhere in the submitted patches. Where is the patch
> > > > that adds usage, and what are the possible values for 'state'?
> > >
> > > This will come as a separate patch, which fits all platforms. Cf.
> > > http://marc.info/?l=linux-omap&m=128620575300682&w=2.
> > > The state field is of type suspend_state_t, cf. include/linux/suspend.h
> > 
> > Ok, that's at least generic. Needs the review of Rafael, to determine
> > whether this state value is all we want to know when we enter suspend.
> He already gave an acked-by on this generic one here:
> Re: [PATCH 3/4] perf: add calls to suspend trace point
> Oh now, that was on the X86 specific part which depends on this one.
> One should expect that he's fine with the generic part as well then,
> but I agree that he should definitely have a look at this and sign it off.

What patch exactly do you mean?  I'm not quite sure from your comment above.

Rafael

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 11:48           ` Thomas Renninger
  2010-10-26 11:54             ` Ingo Molnar
  2010-10-26 11:54             ` Ingo Molnar
@ 2010-10-26 18:57             ` Rafael J. Wysocki
  2010-10-27  0:00               ` Thomas Renninger
  2010-10-27  0:00               ` Thomas Renninger
  2010-10-26 18:57             ` Rafael J. Wysocki
  3 siblings, 2 replies; 157+ messages in thread
From: Rafael J. Wysocki @ 2010-10-26 18:57 UTC (permalink / raw)
  To: Thomas Renninger
  Cc: Ingo Molnar, Jean Pihet, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Masami Hiramatsu, Frank Eigler, Steven Rostedt,
	Kevin Hilman, Peter Zijlstra, linux-omap, linux-pm,
	linux-trace-users, Pierre Tardy, Frederic Weisbecker, Tejun Heo,
	Mathieu Desnoyers, Arjan van de Ven

On Tuesday, October 26, 2010, Thomas Renninger wrote:
> On Tuesday 26 October 2010 13:21:29 Ingo Molnar wrote:
> > 
> > * Jean Pihet <jean.pihet@newoldbits.com> wrote:
> ..
> > > >> +#ifndef _TRACE_POWER_ENUM_
> > > >> +#define _TRACE_POWER_ENUM_
> > > >> +enum {
> > > >> +     POWER_NONE = 0,
> > > >> +     POWER_CSTATE = 1,
> > > >> +     POWER_PSTATE = 2,
> > > >> +};
> > > >> +#endif
> > > >
> > > > Since we are cleaning up all these events, those enum definitions dont really look
> > > > logical. For example, what is 'POWER_NONE'? Can a CPU have 'no power'?
> > >
> > > The enum belongs to the deprecated API so I would rather not touch it.
> > > Keeping the deprecated code isolated will make it easier to remove
> > > later.
> > 
> > So what will replace it? We still have a state field.
> Nothing, this is part of the cleanup.
> As you state above: POWER_NONE does not make sense at all.
> The whole thing (type= attribute that vanishes now) is
> passed to userspace, but never gets used there because the
> same info is in the event name:
> cpu_frequency <-> frequency_switch      <-> PSTATE
> cpu_idle      <-> power_start/power_end <-> CSTATE 
> 
> I expect that there was an initial power_start/end which
> was also used for frequency switching.
> Then it got realized that _start/_end does not work out and
> frequency_switch got introduced.
> To stay compatible the whole power_start/end was not renamed
> to cpu_idle and the type= field was kept.
> 
> This is a guess without even looking at the git history.
> Therefore my partly harsh comments about the sanity of the
> current power tracing events.
> 
> > Passing in platform specific codes is a step backwards.
> > 
> > > >> +TRACE_EVENT(machine_suspend,
> > > >> +
> > > >> +     TP_PROTO(unsigned int state),
> > > >> +
> > > >> +     TP_ARGS(state),
> > > >> +
> > > >> +     TP_STRUCT__entry(
> > > >> +             __field(        u32,            state           )
> > > >> +     ),
> > > >
> > > > Hm, this event is not used anywhere in the submitted patches. Where is the patch
> > > > that adds usage, and what are the possible values for 'state'?
> > >
> > > This will come as a separate patch, which fits all platforms. Cf.
> > > http://marc.info/?l=linux-omap&m=128620575300682&w=2.
> > > The state field is of type suspend_state_t, cf. include/linux/suspend.h
> > 
> > Ok, that's at least generic. Needs the review of Rafael, to determine
> > whether this state value is all we want to know when we enter suspend.
> He already gave an acked-by on this generic one here:
> Re: [PATCH 3/4] perf: add calls to suspend trace point
> Oh now, that was on the X86 specific part which depends on this one.
> One should expect that he's fine with the generic part as well then,
> but I agree that he should definitely have a look at this and sign it off.

What patch exactly do you mean?  I'm not quite sure from your comment above.

Rafael

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 11:19         ` Ingo Molnar
  2010-10-26 19:01           ` Rafael J. Wysocki
@ 2010-10-26 19:01           ` Rafael J. Wysocki
  1 sibling, 0 replies; 157+ messages in thread
From: Rafael J. Wysocki @ 2010-10-26 19:01 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arjan van de Ven, linux-trace-users, Frederic Weisbecker,
	Pierre Tardy, Jean Pihet, Steven Rostedt, Peter Zijlstra,
	Frank Eigler, Thomas Gleixner, linux-pm, Masami Hiramatsu,
	Tejun Heo, Andrew Morton, linux-omap, Linus Torvalds,
	Mathieu Desnoyers

On Tuesday, October 26, 2010, Ingo Molnar wrote:
> 
> * Thomas Renninger <trenn@suse.de> wrote:
> 
> > On Tuesday 26 October 2010 09:10:20 Ingo Molnar wrote:
> > > 
> > > * Thomas Renninger <trenn@suse.de> wrote:
> > > 
> > > > Changes in V2:
> > > >   - Introduce PWR_EVENT_EXIT instead of 0 to mark non-power state
> > > >   - Use u32 instead of u64 for cpuid, state which is by far enough
> > > > 
> > > > New power trace events:
> > > > power:processor_idle
> > > > power:processor_frequency
> > > > power:machine_suspend
> > > > 
> > > > 
> > > > C-state/idle accounting events:
> > > >   power:power_start
> > > >   power:power_end
> > > > are replaced with:
> > > >   power:processor_idle
> > > > 
> > > > and
> > > >   power:power_frequency
> > > > is replaced with:
> > > >   power:processor_frequency
> > > 
> > > Could you please name it power:cpu_idle and power:cpu_frequency instead, for 
> > > shortness? We generally use 'cpu' in the kernel and for events.
> > Sure.
> > > 
> > > > power:machine_suspend
> > > 
> > > How will future PCI (or other device) power saving tracepoints be called?
> > > 
> > > Might be more consistent to use:
> > > 
> > >   power:cpu_idle
> > >   power:machine_idle
> > >   power:device_idle
> >
> > device idle is not true. Those may be low power modes
> > like reduced network throughput, reduced wlan range, the device
> > needs not to be idle.
> > Device power states is probably the most complex area, if such
> > a thing gets introduced, it should makes sense to state
> > the interface experimental for some time until a wider range of
> > devices uses it (in contrast to these new ones
> > which should not change that soon anymore...).
> 
> Ok.
> 
> > Also machine_idle may be true, but machine_suspend sounds more
> > familiar and everyone immediately knows what the event is about.
> 
> Ok - fair enough.
> 
> > > > +#define PWR_EVENT_EXIT 0xFFFFFFFF
> > > 
> > > Shouldnt this be part of the POWER_ enum? (and you can write -1 there)
> >
> > No, below enum will vanish, but -1 is nicer.
> 
> When it vanishes what will replace it?
> 
> > ...
> > 
> > > Plus:
> > > 
> > > > +DECLARE_EVENT_CLASS(processor,
> > > > +
> > > > +	TP_PROTO(unsigned int state, unsigned int cpu_id),
> > > > +
> > > > +	TP_ARGS(state, cpu_id),
> > > > +
> > > > +	TP_STRUCT__entry(
> > > > +		__field(	u32,		state		)
> > > > +		__field(	u32,		cpu_id		)
> > > 
> > > Trace entries can carry a cpu_id of the current processor already. Can this cpu_id 
> > > ever be different from that CPU id?
> >
> > Yes. A core's frequency can depend on another one which
> > will get switched as well (by one command/MSR).
> > Compare with commit 6f4f2723d08534fd4e407e1e.
> > 
> > This can theoretically also be the case for sleep states.
> > Afaik such HW does not exist yet, but ACPI spec already
> > provides interfaces to pass these dependency from BIOS to OS.
> > -> We want a stable ABI and should be prepared for such stuff.
> > 
> > > > +	),
> > > > +
> > > > +	TP_fast_assign(
> > > > +		__entry->state = state;
> > > > +		__entry->cpu_id = cpu_id;
> > > > +	),
> > > > +
> > > > +	TP_printk("state=%lu cpu_id=%lu", (unsigned long)__entry->state,
> > > > +		  (unsigned long)__entry->cpu_id)
> > > > +);
> > > > +
> > > > +DEFINE_EVENT(processor, processor_idle,
> > > > +
> > > > +	TP_PROTO(unsigned int state, unsigned int cpu_id),
> > > > +
> > > > +	     TP_ARGS(state, cpu_id)
> > > > +);
> > > > +
> > > > +#define PWR_EVENT_EXIT 0xFFFFFFFF
> > > > +
> > > > +DEFINE_EVENT(processor, processor_frequency,
> > > > +
> > > > +	TP_PROTO(unsigned int frequency, unsigned int cpu_id),
> > > > +
> > > > +	TP_ARGS(frequency, cpu_id)
> > > > +);
> > > 
> > > So, we have a 'state' field in the class, which is used as 'state' by the 
> > > power::cpu_idle event, and as 'frequency' by the power::cpu_freq event?
> >
> > Yes, is this a problem?
> >
> > Definitions are a bit shorter having one power processor class.
> > As "frequency" is stated in frequency event definition everything should
> > be obvious and this one looks like the more elegant way to me.
> >  
> > > Are there any architectures that track frequency in Hz, not in kHz? If yes, might 
> > > there ever be a need for the frequency value to be larger than 4.29 GHz? If yes, 
> > > then it wont fit into u32.
> >
> > drivers/cpufreq subsystem is fixed to unsigned int (cmp. include/linux/cpufreq.h):
> >         unsigned int            min;    /* in kHz */
> >         unsigned int            max;    /* in kHz */
> >         unsigned int            cur;    /* in kHz,
> >         ...
> > that should be fine.
> 
> ok, good - so we should be fine up to 4 THz CPUs.
> 
> > > Also, might there be a future need to express different types of frequencies? 
> > > For example, should we decide to track turbo frequencies in Intel CPUs, how 
> > > would that be expressed via these events? Are there any architectures and CPUs 
> > > that somehow have some extra attribute to the frequency value?
> >
> > I wonder whether this ever can/will work in a sane way.
> > Userspace can compare with:
> > /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq
> > everything above is turbo. So I do not think it's ever needed.
> > But adding an additional value at the end does not violate
> > userspace compatibility. This has been done with the cpuid
> > as well.
> >  
> > > > +TRACE_EVENT(machine_suspend,
> > > > +
> > > > +	TP_PROTO(unsigned int state),
> > > > +
> > > > +	TP_ARGS(state),
> > > > +
> > > > +	TP_STRUCT__entry(
> > > > +		__field(	u32,		state		)
> > > > +	),
> > > 
> > > Hm, this event is not used anywhere in the submitted patches. Where is the patch 
> > > that adds usage, and what are the possible values for 'state'?
> >
> > Jean wants to make use of it on ARM.
> > I also had patch for x86, I can have another look at it, Rafael
> > already gave me a comment on it. But on X86 you typically realize
> > when you suspend the machine (could imagine this is more useful on
> > ARM driven mobile phones and similar), still I can add it..
> > 
> > Values probably should be (include/linux/suspend.h):
> > #define PM_SUSPEND_ON       0
> > #define PM_SUSPEND_STANDBY  1
> > #define PM_SUSPEND_MEM      3
> > #define PM_SUSPEND_MAX      4
> > 
> > How this strange state Arjan talked about is passed is up
> > to these guys. Instead of using 0 and above pre-defined such
> > arch specific special states better should get passed by:
> > 
> > #define X86_MOORESTOWN_STANDBY_S0   0x100
> > ..                                  0x101
> > #define ARM_WHATEVER_STRANGE_THING  0x200
> > ...
> 
> I'd rather like to see a meaningful name to be given to these states and them being 
> passed, instead of weird platform specific things. Tooling will try to be as generic 
> as possible.
> 
> I dont know this stuff, but making a distinction between s2ram and s2disk events 
> would seem meaningful.

Basically, we have 

standby (which is what it says)
mem (s2ram)
disk (s2disk)

These are the transitions the PM core really cares about (if supported).
They can be read from /sys/power/state and I think these names should be used
by the tracing interfaces too.
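
As a rough sketch of what I mean, a tracing consumer could translate a state value into the same strings that /sys/power/state shows. The enum and the helper below are hypothetical and exist only for illustration; in particular the numeric values are not the ones from include/linux/suspend.h:

```c
#include <stddef.h>

/* Hypothetical identifiers for illustration; not the kernel's values. */
enum sleep_transition {
	SLEEP_STANDBY,
	SLEEP_MEM,
	SLEEP_DISK,
	SLEEP_COUNT
};

/* Names as they appear in /sys/power/state. */
static const char *const sleep_names[SLEEP_COUNT] = {
	[SLEEP_STANDBY] = "standby",
	[SLEEP_MEM]     = "mem",
	[SLEEP_DISK]    = "disk",
};

/* Returns the sysfs-style name, or NULL for an unknown value. */
static const char *sleep_transition_name(unsigned int t)
{
	return t < SLEEP_COUNT ? sleep_names[t] : NULL;
}
```

With something like this, tools would print "mem" or "disk" rather than a platform-specific number.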

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 11:19         ` Ingo Molnar
@ 2010-10-26 19:01           ` Rafael J. Wysocki
  2010-10-26 19:01           ` Rafael J. Wysocki
  1 sibling, 0 replies; 157+ messages in thread
From: Rafael J. Wysocki @ 2010-10-26 19:01 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thomas Renninger, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Masami Hiramatsu, Frank Eigler, Steven Rostedt, Kevin Hilman,
	Peter Zijlstra, linux-omap, linux-pm, linux-trace-users,
	Jean Pihet, Pierre Tardy, Frederic Weisbecker, Tejun Heo,
	Mathieu Desnoyers, Arjan van de Ven

On Tuesday, October 26, 2010, Ingo Molnar wrote:
> 
> * Thomas Renninger <trenn@suse.de> wrote:
> 
> > On Tuesday 26 October 2010 09:10:20 Ingo Molnar wrote:
> > > 
> > > * Thomas Renninger <trenn@suse.de> wrote:
> > > 
> > > > Changes in V2:
> > > >   - Introduce PWR_EVENT_EXIT instead of 0 to mark non-power state
> > > >   - Use u32 instead of u64 for cpuid, state which is by far enough
> > > > 
> > > > New power trace events:
> > > > power:processor_idle
> > > > power:processor_frequency
> > > > power:machine_suspend
> > > > 
> > > > 
> > > > C-state/idle accounting events:
> > > >   power:power_start
> > > >   power:power_end
> > > > are replaced with:
> > > >   power:processor_idle
> > > > 
> > > > and
> > > >   power:power_frequency
> > > > is replaced with:
> > > >   power:processor_frequency
> > > 
> > > Could you please name it power:cpu_idle and power:cpu_frequency instead, for 
> > > shortness? We generally use 'cpu' in the kernel and for events.
> > Sure.
> > > 
> > > > power:machine_suspend
> > > 
> > > How will future PCI (or other device) power saving tracepoints be called?
> > > 
> > > Might be more consistent to use:
> > > 
> > >   power:cpu_idle
> > >   power:machine_idle
> > >   power:device_idle
> >
> > device idle is not true. Those may be low power modes
> > like reduced network throughput, reduced wlan range, the device
> > needs not to be idle.
> > Device power states is probably the most complex area, if such
> > a thing gets introduced, it should makes sense to state
> > the interface experimental for some time until a wider range of
> > devices uses it (in contrast to these new ones
> > which should not change that soon anymore...).
> 
> Ok.
> 
> > Also machine_idle may be true, but machine_suspend sounds more
> > familiar and everyone immediately knows what the event is about.
> 
> Ok - fair enough.
> 
> > > > +#define PWR_EVENT_EXIT 0xFFFFFFFF
> > > 
> > > Shouldnt this be part of the POWER_ enum? (and you can write -1 there)
> >
> > No, below enum will vanish, but -1 is nicer.
> 
> When it vanishes what will replace it?
> 
> > ...
> > 
> > > Plus:
> > > 
> > > > +DECLARE_EVENT_CLASS(processor,
> > > > +
> > > > +	TP_PROTO(unsigned int state, unsigned int cpu_id),
> > > > +
> > > > +	TP_ARGS(state, cpu_id),
> > > > +
> > > > +	TP_STRUCT__entry(
> > > > +		__field(	u32,		state		)
> > > > +		__field(	u32,		cpu_id		)
> > > 
> > > Trace entries can carry a cpu_id of the current processor already. Can this cpu_id 
> > > ever be different from that CPU id?
> >
> > Yes. A core's frequency can depend on another one which
> > will get switched as well (by one command/MSR).
> > Compare with commit 6f4f2723d08534fd4e407e1e.
> > 
> > This can theoretically also be the case for sleep states.
> > Afaik such HW does not exist yet, but ACPI spec already
> > provides interfaces to pass these dependency from BIOS to OS.
> > -> We want a stable ABI and should be prepared for such stuff.
> > 
> > > > +	),
> > > > +
> > > > +	TP_fast_assign(
> > > > +		__entry->state = state;
> > > > +		__entry->cpu_id = cpu_id;
> > > > +	),
> > > > +
> > > > +	TP_printk("state=%lu cpu_id=%lu", (unsigned long)__entry->state,
> > > > +		  (unsigned long)__entry->cpu_id)
> > > > +);
> > > > +
> > > > +DEFINE_EVENT(processor, processor_idle,
> > > > +
> > > > +	TP_PROTO(unsigned int state, unsigned int cpu_id),
> > > > +
> > > > +	     TP_ARGS(state, cpu_id)
> > > > +);
> > > > +
> > > > +#define PWR_EVENT_EXIT 0xFFFFFFFF
> > > > +
> > > > +DEFINE_EVENT(processor, processor_frequency,
> > > > +
> > > > +	TP_PROTO(unsigned int frequency, unsigned int cpu_id),
> > > > +
> > > > +	TP_ARGS(frequency, cpu_id)
> > > > +);
> > > 
> > > So, we have a 'state' field in the class, which is used as 'state' by the 
> > > power::cpu_idle event, and as 'frequency' by the power::cpu_freq event?
> >
> > Yes, is this a problem?
> >
> > Definitions are a bit shorter having one power processor class.
> > As "frequency" is stated in frequency event definition everything should
> > be obvious and this one looks like the more elegant way to me.
> >  
> > > Are there any architectures that track frequency in Hz, not in kHz? If yes, might 
> > > there ever be a need for the frequency value to be larger than 4.29 GHz? If yes, 
> > > then it wont fit into u32.
> >
> > drivers/cpufreq subsystem is fixed to unsigned int (cmp. include/linux/cpufreq.h):
> >         unsigned int            min;    /* in kHz */
> >         unsigned int            max;    /* in kHz */
> >         unsigned int            cur;    /* in kHz,
> >         ...
> > that should be fine.
> 
> ok, good - so we should be fine up to 4 THz CPUs.
> 
> > > Also, might there be a future need to express different types of frequencies? 
> > > For example, should we decide to track turbo frequencies in Intel CPUs, how 
> > > would that be expressed via these events? Are there any architectures and CPUs 
> > > that somehow have some extra attribute to the frequency value?
> >
> > I wonder whether this ever can/will work in a sane way.
> > Userspace can compare with:
> > /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq
> > everything above is turbo. So I do not think it's ever needed.
> > But adding an additional value at the end does not violate
> > userspace compatibility. This has been done with the cpuid
> > as well.
> >  
> > > > +TRACE_EVENT(machine_suspend,
> > > > +
> > > > +	TP_PROTO(unsigned int state),
> > > > +
> > > > +	TP_ARGS(state),
> > > > +
> > > > +	TP_STRUCT__entry(
> > > > +		__field(	u32,		state		)
> > > > +	),
> > > 
> > > Hm, this event is not used anywhere in the submitted patches. Where is the patch 
> > > that adds usage, and what are the possible values for 'state'?
> >
> > Jean wants to make use of it on ARM.
> > I also had patch for x86, I can have another look at it, Rafael
> > already gave me a comment on it. But on X86 you typically realize
> > when you suspend the machine (could imagine this is more useful on
> > ARM driven mobile phones and similar), still I can add it..
> > 
> > Values probably should be (include/linux/suspend.h):
> > #define PM_SUSPEND_ON       0
> > #define PM_SUSPEND_STANDBY  1
> > #define PM_SUSPEND_MEM      3
> > #define PM_SUSPEND_MAX      4
> > 
> > How this strange state Arjan talked about is passed is up
> > to these guys. Instead of using 0 and above pre-defined such
> > arch specific special states better should get passed by:
> > 
> > #define X86_MOORESTOWN_STANDBY_S0   0x100
> > ..                                  0x101
> > #define ARM_WHATEVER_STRANGE_THING  0x200
> > ...
> 
> I'd rather like to see a meaningful name to be given to these states and them being 
> passed, instead of weird platform specific things. Tooling will try to be as generic 
> as possible.
> 
> I dont know this stuff, but making a distinction between s2ram and s2disk events 
> would seem meaningful.

Basically, we have 

standby (which is what it says)
mem (s2ram)
disk (s2disk)

These are the transitions the PM core really cares about (if supported).
They can be read from /sys/power/state and I think these names should be used
by the tracing interfaces too.
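
As a rough illustration of what that could mean for a tool consuming the
event, here is a userspace sketch that maps the numeric state to those names.
The numeric values are the PM_SUSPEND_* ones quoted above; the helper names
suspend_state_name() and check_names() are made up for this example, and
"disk" is left unmapped because no value for it is quoted in this thread.

```c
#include <assert.h>
#include <string.h>

/* Values as quoted earlier in this thread from include/linux/suspend.h. */
#define PM_SUSPEND_ON       0
#define PM_SUSPEND_STANDBY  1
#define PM_SUSPEND_MEM      3

/*
 * Hypothetical helper: translate a traced state value into the name used
 * in /sys/power/state. "on" is not listed there; it only labels the
 * running state here.
 */
const char *suspend_state_name(unsigned int state)
{
	switch (state) {
	case PM_SUSPEND_ON:      return "on";
	case PM_SUSPEND_STANDBY: return "standby";
	case PM_SUSPEND_MEM:     return "mem";
	default:                 return "unknown";
	}
}

/* Usage check: the quoted values map to the sysfs names. */
void check_names(void)
{
	assert(strcmp(suspend_state_name(PM_SUSPEND_STANDBY), "standby") == 0);
	assert(strcmp(suspend_state_name(PM_SUSPEND_MEM), "mem") == 0);
}
```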

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 18:14               ` Mathieu Desnoyers
  2010-10-26 18:50                 ` [linux-pm] " Alan Stern
  2010-10-26 18:50                 ` Alan Stern
@ 2010-10-26 19:04                 ` Rafael J. Wysocki
  2010-10-26 19:04                 ` Rafael J. Wysocki
  3 siblings, 0 replies; 157+ messages in thread
From: Rafael J. Wysocki @ 2010-10-26 19:04 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andrew Morton, Pierre Tardy, Peter Zijlstra, linux-trace-users,
	Jean Pihet, Steven Rostedt, Frederic Weisbecker, Linus Torvalds,
	Frank Eigler, Thomas Gleixner, linux-pm, Masami Hiramatsu,
	Tejun Heo, Ingo Molnar, linux-omap, Arjan van de Ven

On Tuesday, October 26, 2010, Mathieu Desnoyers wrote:
> * Peter Zijlstra (peterz@infradead.org) wrote:
> > On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
> > > 
> > > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> > >         atomic_inc(&dev->power.usage_count); 
> > 
> > That's terribly racy..
> 
> Looking at the original code, it looks racy even without considering the
> tracepoint:
> 
> int __pm_runtime_get(struct device *dev, bool sync)
>  {
>         int retval;
> 
> +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
>         atomic_inc(&dev->power.usage_count);
>         retval = sync ? pm_runtime_resume(dev) : pm_request_resume(dev);
> 
> There is no implied memory barrier after "atomic_inc". So either all these
> inc/dec are protected with mutexes or spinlocks, in which case one might wonder
> why atomic operations are used at all, or it's a racy mess. (I vote for the
> second option)

No, it isn't.

> kref should certainly be used there.

No, it shouldn't.

Please try to understand the code you're commenting on first.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 18:15               ` Pierre Tardy
  2010-10-26 19:08                 ` Rafael J. Wysocki
@ 2010-10-26 19:08                 ` Rafael J. Wysocki
  1 sibling, 0 replies; 157+ messages in thread
From: Rafael J. Wysocki @ 2010-10-26 19:08 UTC (permalink / raw)
  To: Pierre Tardy
  Cc: Andrew Morton, linux-trace-users, Peter Zijlstra,
	Arjan van de Ven, Jean Pihet, Steven Rostedt,
	Frederic Weisbecker, Frank Eigler, Thomas Gleixner, linux-pm,
	Masami Hiramatsu, Tejun Heo, Ingo Molnar, linux-omap,
	Linus Torvalds, Mathieu Desnoyers

On Tuesday, October 26, 2010, Pierre Tardy wrote:
> On Tue, Oct 26, 2010 at 12:58 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
> >>
> >> +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> >>         atomic_inc(&dev->power.usage_count);
> >
> > That's terribly racy..
> >
> I know. I'm not proud of this.. As I said, this is preliminary patch.
> We dont really need to have this prev_usage. This is just for debug.
> It mayprobably endup with something like:
> 
>          atomic_inc(&dev->power.usage_count);
> +       trace_power_device_usage(dev);

Well, please tell me what you're trying to achieve.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 19:08                 ` Rafael J. Wysocki
@ 2010-10-26 20:23                   ` Pierre Tardy
  2010-10-26 20:23                   ` Pierre Tardy
  1 sibling, 0 replies; 157+ messages in thread
From: Pierre Tardy @ 2010-10-26 20:23 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Andrew Morton, linux-trace-users, Peter Zijlstra,
	Arjan van de Ven, Jean Pihet, Steven Rostedt,
	Frederic Weisbecker, Frank Eigler, Thomas Gleixner, linux-pm,
	Masami Hiramatsu, Tejun Heo, Ingo Molnar, linux-omap,
	Linus Torvalds, Mathieu Desnoyers

[-- Attachment #1: Type: text/plain, Size: 1372 bytes --]

On Tue, Oct 26, 2010 at 2:08 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> On Tuesday, October 26, 2010, Pierre Tardy wrote:
>> On Tue, Oct 26, 2010 at 12:58 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>> > On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
>> >>
>> >> +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
>> >>         atomic_inc(&dev->power.usage_count);
>> >
>> > That's terribly racy..
>> >
>> I know. I'm not proud of this.. As I said, this is preliminary patch.
>> We dont really need to have this prev_usage. This is just for debug.
>> It mayprobably endup with something like:
>>
>>          atomic_inc(&dev->power.usage_count);
>> +       trace_power_device_usage(dev);
>
> Well, please tell me what you're trying to achieve.

Please see attached the kind of pytimechart output I'm trying to
achieve (yes, this chart is not coherent; it seems I'm still missing some
traces).

We basically want to have a trace point each time the usage_count
changes, so that I can display nice timecharts, and Arjan can have the
comm of the process that eventually generated the rpm_get, in order to
pinpoint it in powertop.

What you don't see in the above two lines is that
trace_power_device_usage(dev); actually reads the usage_count, as well
as the driver and device name.
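
To make the intent concrete without the separate read that was called racy,
here is a userspace C11 sketch in which the traced value is the one returned
by the atomic operation itself (the kernel counterpart would be something
like atomic_inc_return()). Everything named here (usage_count as a global,
trace_usage(), usage_get(), usage_put()) is a hypothetical stand-in, not the
runtime PM code. It only keeps each logged number consistent with the
increment that produced it; it says nothing about ordering against the
device state.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for dev->power.usage_count. */
static atomic_int usage_count;

/* Hypothetical stand-in for the tracepoint: remembers the last value. */
static int last_traced;

static void trace_usage(int count)
{
	last_traced = count;
}

/* Increment and trace the value this caller produced, in one atomic step. */
int usage_get(void)
{
	/* atomic_fetch_add() returns the old value; +1 is what we stored. */
	int now = atomic_fetch_add(&usage_count, 1) + 1;

	trace_usage(now);
	return now;
}

/* Decrement and trace the value this caller produced. */
int usage_put(void)
{
	int now = atomic_fetch_sub(&usage_count, 1) - 1;

	trace_usage(now);
	return now;
}

/* Value most recently handed to the (fake) tracepoint. */
int usage_last_traced(void)
{
	return last_traced;
}
```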

Regards,
-- 
Pierre

[-- Attachment #2: pytimechart_runtime_pm.png --]
[-- Type: image/png, Size: 16247 bytes --]

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 20:23                   ` Pierre Tardy
@ 2010-10-26 20:38                     ` Rafael J. Wysocki
  2010-10-26 20:38                     ` Rafael J. Wysocki
  1 sibling, 0 replies; 157+ messages in thread
From: Rafael J. Wysocki @ 2010-10-26 20:38 UTC (permalink / raw)
  To: Pierre Tardy
  Cc: Andrew Morton, linux-trace-users, Peter Zijlstra,
	Arjan van de Ven, Jean Pihet, Steven Rostedt,
	Frederic Weisbecker, Frank Eigler, Thomas Gleixner, linux-pm,
	Masami Hiramatsu, Tejun Heo, Ingo Molnar, linux-omap,
	Linus Torvalds, Mathieu Desnoyers

On Tuesday, October 26, 2010, Pierre Tardy wrote:
> On Tue, Oct 26, 2010 at 2:08 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> > On Tuesday, October 26, 2010, Pierre Tardy wrote:
> >> On Tue, Oct 26, 2010 at 12:58 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> >> > On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
> >> >>
> >> >> +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> >> >>         atomic_inc(&dev->power.usage_count);
> >> >
> >> > That's terribly racy..
> >> >
> >> I know. I'm not proud of this.. As I said, this is preliminary patch.
> >> We dont really need to have this prev_usage. This is just for debug.
> >> It mayprobably endup with something like:
> >>
> >>          atomic_inc(&dev->power.usage_count);
> >> +       trace_power_device_usage(dev);
> >
> > Well, please tell me what you're trying to achieve.
> 
> Please see attached the kind of pytimechart output I'm trying to
> achieve (yes, this chart is not coherent, seems I'm still missing some
> traces)
> 
> We basically want to have a trace point eachtime the usage_counter
> changes, so that I can display nice timecharts, and Arjan can have the
> comm of the process that eventually generated the rpm_get, in order to
> pinpoint it in powertop.
> 
> What you dont see in the above two lines is that
> trace_power_device_usage(dev); actually reads the usage_count, as well
> as the driver and device name.

I'm afraid that for this to really work you'd need to put usage_count under a
spinlock along with your trace point, which I'm not really sure I like.

Besides, I'm not really sure the manipulations of usage_count are worth
tracing.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 20:38                     ` Rafael J. Wysocki
@ 2010-10-26 20:52                       ` Arjan van de Ven
  2010-10-26 20:52                       ` Arjan van de Ven
  1 sibling, 0 replies; 157+ messages in thread
From: Arjan van de Ven @ 2010-10-26 20:52 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Andrew Morton, Pierre Tardy, Peter Zijlstra, Frederic Weisbecker,
	linux-trace-users, Jean Pihet, Steven Rostedt, Frank Eigler,
	Thomas Gleixner, linux-pm, Masami Hiramatsu, Tejun Heo,
	Ingo Molnar, linux-omap, Linus Torvalds, Mathieu Desnoyers

On 10/26/2010 1:38 PM, Rafael J. Wysocki wrote:
> On Tuesday, October 26, 2010, Pierre Tardy wrote:
>> On Tue, Oct 26, 2010 at 2:08 PM, Rafael J. Wysocki<rjw@sisk.pl>  wrote:
>>> On Tuesday, October 26, 2010, Pierre Tardy wrote:
>>>> On Tue, Oct 26, 2010 at 12:58 PM, Peter Zijlstra<peterz@infradead.org>  wrote:
>>>>> On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
>>>>>> +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
>>>>>>          atomic_inc(&dev->power.usage_count);
>>>>> That's terribly racy..
>>>>>
>>>> I know. I'm not proud of this.. As I said, this is preliminary patch.
>>>> We dont really need to have this prev_usage. This is just for debug.
>>>> It mayprobably endup with something like:
>>>>
>>>>           atomic_inc(&dev->power.usage_count);
>>>> +       trace_power_device_usage(dev);
>>> Well, please tell me what you're trying to achieve.
>> Please see attached the kind of pytimechart output I'm trying to
>> achieve (yes, this chart is not coherent, seems I'm still missing some
>> traces)
>>
>> We basically want to have a trace point eachtime the usage_counter
>> changes, so that I can display nice timecharts, and Arjan can have the
>> comm of the process that eventually generated the rpm_get, in order to
>> pinpoint it in powertop.
>>
>> What you dont see in the above two lines is that
>> trace_power_device_usage(dev); actually reads the usage_count, as well
>> as the driver and device name.
> I'm afraid that for this to really work you'd need to put usage_count under a
> spinlock along with your trace point, which I'm not really sure I like.
>
> Besides, I'm not really sure the manipulations of usage_count are worth
> tracing.

What's most interesting are the 0->1 and 1->0 transitions.
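
A minimal userspace C11 sketch of picking out just those two transitions,
assuming a plain atomic counter. The names get_was_first() and
put_was_last() are made up here; in the kernel the equivalent tests would be
along the lines of atomic_inc_return() == 1 and atomic_dec_and_test().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for dev->power.usage_count. */
static atomic_int usage_count;

/* True only for the caller whose increment moved the counter 0 -> 1. */
bool get_was_first(void)
{
	return atomic_fetch_add(&usage_count, 1) == 0;
}

/* True only for the caller whose decrement moved the counter 1 -> 0. */
bool put_was_last(void)
{
	return atomic_fetch_sub(&usage_count, 1) == 1;
}
```

Because the test uses the value returned by the atomic operation, exactly one
caller sees each transition even when several CPUs get or put concurrently.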

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 20:52                       ` Arjan van de Ven
  2010-10-26 21:17                         ` Rafael J. Wysocki
@ 2010-10-26 21:17                         ` Rafael J. Wysocki
  1 sibling, 0 replies; 157+ messages in thread
From: Rafael J. Wysocki @ 2010-10-26 21:17 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Andrew Morton, Pierre Tardy, Peter Zijlstra, Frederic Weisbecker,
	linux-trace-users, Jean Pihet, Steven Rostedt, Frank Eigler,
	Thomas Gleixner, linux-pm, Masami Hiramatsu, Tejun Heo,
	Ingo Molnar, linux-omap, Linus Torvalds, Mathieu Desnoyers

On Tuesday, October 26, 2010, Arjan van de Ven wrote:
> On 10/26/2010 1:38 PM, Rafael J. Wysocki wrote:
> > On Tuesday, October 26, 2010, Pierre Tardy wrote:
> >> On Tue, Oct 26, 2010 at 2:08 PM, Rafael J. Wysocki<rjw@sisk.pl>  wrote:
> >>> On Tuesday, October 26, 2010, Pierre Tardy wrote:
> >>>> On Tue, Oct 26, 2010 at 12:58 PM, Peter Zijlstra<peterz@infradead.org>  wrote:
> >>>>> On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
> >>>>>> +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> >>>>>>          atomic_inc(&dev->power.usage_count);
> >>>>> That's terribly racy..
> >>>>>
> >>>> I know. I'm not proud of this.. As I said, this is preliminary patch.
> >>>> We dont really need to have this prev_usage. This is just for debug.
> >>>> It mayprobably endup with something like:
> >>>>
> >>>>           atomic_inc(&dev->power.usage_count);
> >>>> +       trace_power_device_usage(dev);
> >>> Well, please tell me what you're trying to achieve.
> >> Please see attached the kind of pytimechart output I'm trying to
> >> achieve (yes, this chart is not coherent, seems I'm still missing some
> >> traces)
> >>
> >> We basically want to have a trace point eachtime the usage_counter
> >> changes, so that I can display nice timecharts, and Arjan can have the
> >> comm of the process that eventually generated the rpm_get, in order to
> >> pinpoint it in powertop.
> >>
> >> What you dont see in the above two lines is that
> >> trace_power_device_usage(dev); actually reads the usage_count, as well
> >> as the driver and device name.
> > I'm afraid that for this to really work you'd need to put usage_count under a
> > spinlock along with your trace point, which I'm not really sure I like.
> >
> > Besides, I'm not really sure the manipulations of usage_count are worth
> > tracing.
> 
> what's most interesting is the 0->1  and 1->0 transitions.

But they are only meaningful in specific situations.  For example, if someone
does pm_runtime_get_noresume() when the device is active, there may be
a device suspend already under way at the same time.  So IMO what really
is interesting is when rpm_resume() is called with usage_count > 0 and then
perhaps when rpm_idle() or rpm_suspend() is called after usage_count drops
back to 0.

There are some other interesting cases, but they all need to be checked under
->power.lock and you need to do that cleverly, so that the _functionality_ is
not harmed.
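
A deliberately simplified userspace sketch of what "checked under the lock"
means: the state test, the state change and the trace record all happen
inside one critical section, so the record always describes a transition
that really took place. struct fake_dev, fake_resume() and the
traced_resumes counter are hypothetical, a pthread mutex stands in for
->power.lock, and usage_count is a plain int here only to keep the example
short.

```c
#include <assert.h>
#include <pthread.h>

/* Simplified, hypothetical device states. */
enum fake_rpm_state { FAKE_RPM_ACTIVE, FAKE_RPM_SUSPENDED };

struct fake_dev {
	pthread_mutex_t lock;       /* stands in for dev->power.lock */
	int usage_count;
	enum fake_rpm_state state;
	int traced_resumes;         /* stands in for emitted trace records */
};

/* Resume request: check, state change and trace under the same lock. */
void fake_resume(struct fake_dev *d)
{
	pthread_mutex_lock(&d->lock);
	if (d->usage_count > 0 && d->state == FAKE_RPM_SUSPENDED) {
		d->state = FAKE_RPM_ACTIVE;
		d->traced_resumes++;    /* record only a real state change */
	}
	pthread_mutex_unlock(&d->lock);
}
```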

Overall, I think that adding tracepoints to the runtime PM core code is really
premature at this point, given that we've just reworked it quite a bit recently.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 18:50                 ` [linux-pm] " Alan Stern
@ 2010-10-26 21:33                   ` Mathieu Desnoyers
  2010-10-26 21:33                   ` [linux-pm] " Mathieu Desnoyers
  1 sibling, 0 replies; 157+ messages in thread
From: Mathieu Desnoyers @ 2010-10-26 21:33 UTC (permalink / raw)
  To: Alan Stern
  Cc: Paul E. McKenney, Pierre Tardy, Peter Zijlstra, linux-pm,
	Ingo Molnar, Jean Pihet, Steven Rostedt, linux-trace-users,
	Frank Eigler, Linus Torvalds, Frederic Weisbecker,
	Masami Hiramatsu, Tejun Heo, Andrew Morton, linux-omap,
	Arjan van de Ven, Thomas Gleixner

* Alan Stern (stern@rowland.harvard.edu) wrote:
> On Tue, 26 Oct 2010, Mathieu Desnoyers wrote:
> 
> > * Peter Zijlstra (peterz@infradead.org) wrote:
> > > On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
> > > > 
> > > > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> > > >         atomic_inc(&dev->power.usage_count); 
> > > 
> > > That's terribly racy..
> > 
> > Looking at the original code, it looks racy even without considering the
> > tracepoint:
> > 
> > int __pm_runtime_get(struct device *dev, bool sync)
> >  {
> >         int retval;
> > 
> > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> >         atomic_inc(&dev->power.usage_count);
> >         retval = sync ? pm_runtime_resume(dev) : pm_request_resume(dev);
> > 
> > There is no implied memory barrier after "atomic_inc". So either all these
> > inc/dec are protected with mutexes or spinlocks, in which case one might wonder
> > why atomic operations are used at all, or it's a racy mess. (I vote for the
> > second option)
> 
> I don't understand.  What's the problem?  The inc/dec are atomic 
> because they are not protected by spinlocks, but everything else is 
> (aside from the tracepoint, which is new).
> 
> > kref should certainly be used there.
> 
> What for?

kref has the following "get":

        atomic_inc(&kref->refcount);
        smp_mb__after_atomic_inc();

What seems to be missing in __pm_runtime_get() and pm_runtime_get_noresume() is
the memory barrier after the atomic increment. The atomic increment is free to
be reordered into the following spinlock critical section (within
pm_runtime_resume or pm_request_resume execution) because taking a spinlock
only acts as a memory barrier with acquire semantics, not a full memory barrier.

So AFAIU, the failure scenario would be as follows (sorry for the 80+ columns):

initial conditions: usage_count = 1

CPU A                                                       CPU B
1) __pm_runtime_get() (sync = true)
2)   atomic_inc(&usage_count) (not committed to memory yet)
3)   pm_runtime_resume()
4)     spin_lock_irqsave(&dev->power.lock, flags);
5)     retval = __pm_request_resume(dev);
6)     (execute the body of __pm_request_resume and return)
7)                                                          __pm_runtime_put() (sync = true) 
8)                                                          if (atomic_dec_and_test(&dev->power.usage_count))
                                                              (still see usage_count == 1 before decrement,
                                                               thus decrement to 0)
9)                                                             pm_runtime_idle()
10)  spin_unlock_irqrestore(&dev->power.lock, flags)
11)                                                            spin_lock_irq(&dev->power.lock);
12)                                                            retval = __pm_runtime_idle(dev);
13)                                                            spin_unlock_irq(&dev->power.lock);

So we end up in a situation where CPU A expects the device to be resumed, but
the last action performed has been to bring it to idle.

A smp_mb__after_atomic_inc() between lines 2 and 3 would fix this.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

* Re: [linux-pm] [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 18:50                 ` [linux-pm] " Alan Stern
  2010-10-26 21:33                   ` Mathieu Desnoyers
@ 2010-10-26 21:33                   ` Mathieu Desnoyers
  2010-10-26 22:20                     ` Rafael J. Wysocki
  2010-10-26 22:20                     ` [linux-pm] " Rafael J. Wysocki
  1 sibling, 2 replies; 157+ messages in thread
From: Mathieu Desnoyers @ 2010-10-26 21:33 UTC (permalink / raw)
  To: Alan Stern
  Cc: Peter Zijlstra, Greg Kroah-Hartman, Andrew Morton, Pierre Tardy,
	Arjan van de Ven, Frederic Weisbecker, Jean Pihet,
	Steven Rostedt, linux-trace-users, Frank Eigler, Thomas Gleixner,
	linux-pm, Masami Hiramatsu, Tejun Heo, Ingo Molnar, linux-omap,
	Linus Torvalds, Paul E. McKenney

* Alan Stern (stern@rowland.harvard.edu) wrote:
> On Tue, 26 Oct 2010, Mathieu Desnoyers wrote:
> 
> > * Peter Zijlstra (peterz@infradead.org) wrote:
> > > On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
> > > > 
> > > > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> > > >         atomic_inc(&dev->power.usage_count); 
> > > 
> > > That's terribly racy..
> > 
> > Looking at the original code, it looks racy even without considering the
> > tracepoint:
> > 
> > int __pm_runtime_get(struct device *dev, bool sync)
> >  {
> >         int retval;
> > 
> > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> >         atomic_inc(&dev->power.usage_count);
> >         retval = sync ? pm_runtime_resume(dev) : pm_request_resume(dev);
> > 
> > There is no implied memory barrier after "atomic_inc". So either all these
> > inc/dec are protected with mutexes or spinlocks, in which case one might wonder
> > why atomic operations are used at all, or it's a racy mess. (I vote for the
> > second option)
> 
> I don't understand.  What's the problem?  The inc/dec are atomic 
> because they are not protected by spinlocks, but everything else is 
> (aside from the tracepoint, which is new).
> 
> > kref should certainly be used there.
> 
> What for?

kref has the following "get":

        atomic_inc(&kref->refcount);
        smp_mb__after_atomic_inc();

What seems to be missing in __pm_runtime_get() and pm_runtime_get_noresume() is
the memory barrier after the atomic increment. The atomic increment is free to
be reordered into the following spinlock critical section (within
pm_runtime_resume() or pm_request_resume() execution) because taking a spinlock
only acts as a memory barrier with acquire semantics, not a full memory barrier.

So AFAIU, the failure scenario would be as follows (sorry for the 80+ columns):

initial conditions: usage_count = 1

CPU A                                                       CPU B
1) __pm_runtime_get() (sync = true)
2)   atomic_inc(&usage_count) (not committed to memory yet)
3)   pm_runtime_resume()
4)     spin_lock_irqsave(&dev->power.lock, flags);
5)     retval = __pm_request_resume(dev);
6)     (execute the body of __pm_request_resume and return)
7)                                                          __pm_runtime_put() (sync = true) 
8)                                                          if (atomic_dec_and_test(&dev->power.usage_count))
                                                              (still see usage_count == 1 before decrement,
                                                               thus decrement to 0)
9)                                                             pm_runtime_idle()
10)  spin_unlock_irqrestore(&dev->power.lock, flags)
11)                                                            spin_lock_irq(&dev->power.lock);
12)                                                            retval = __pm_runtime_idle(dev);
13)                                                            spin_unlock_irq(&dev->power.lock);

So we end up in a situation where CPU A expects the device to be resumed, but
the last action performed has been to bring it to idle.

A smp_mb__after_atomic_inc() between lines 2 and 3 would fix this.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 19:04                 ` Rafael J. Wysocki
@ 2010-10-26 21:38                   ` Mathieu Desnoyers
  2010-10-26 21:38                   ` Mathieu Desnoyers
  1 sibling, 0 replies; 157+ messages in thread
From: Mathieu Desnoyers @ 2010-10-26 21:38 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Andrew Morton, Pierre Tardy, Peter Zijlstra, linux-trace-users,
	Jean Pihet, Steven Rostedt, Frederic Weisbecker, Linus Torvalds,
	Frank Eigler, Thomas Gleixner, linux-pm, Masami Hiramatsu,
	Tejun Heo, Ingo Molnar, linux-omap, Arjan van de Ven

* Rafael J. Wysocki (rjw@sisk.pl) wrote:
> On Tuesday, October 26, 2010, Mathieu Desnoyers wrote:
> > * Peter Zijlstra (peterz@infradead.org) wrote:
> > > On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
> > > > 
> > > > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> > > >         atomic_inc(&dev->power.usage_count); 
> > > 
> > > That's terribly racy..
> > 
> > Looking at the original code, it looks racy even without considering the
> > tracepoint:
> > 
> > int __pm_runtime_get(struct device *dev, bool sync)
> >  {
> >         int retval;
> > 
> > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> >         atomic_inc(&dev->power.usage_count);
> >         retval = sync ? pm_runtime_resume(dev) : pm_request_resume(dev);
> > 
> > There is no implied memory barrier after "atomic_inc". So either all these
> > inc/dec are protected with mutexes or spinlocks, in which case one might wonder
> > why atomic operations are used at all, or it's a racy mess. (I vote for the
> > second option)
> 
> No, it isn't.
> 
> > kref should certainly be used there.
> 
> No, it shouldn't.
> 
> Please try to understand the code you're commenting on first.

Please see my reply to Alan Stern:

http://www.spinics.net/lists/linux-omap/msg39382.html

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 19:04                 ` Rafael J. Wysocki
  2010-10-26 21:38                   ` Mathieu Desnoyers
@ 2010-10-26 21:38                   ` Mathieu Desnoyers
  2010-10-26 22:22                     ` Rafael J. Wysocki
  2010-10-26 22:22                     ` Rafael J. Wysocki
  1 sibling, 2 replies; 157+ messages in thread
From: Mathieu Desnoyers @ 2010-10-26 21:38 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Peter Zijlstra, Greg Kroah-Hartman, Pierre Tardy,
	Arjan van de Ven, Ingo Molnar, Thomas Renninger, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Masami Hiramatsu, Frank Eigler,
	Steven Rostedt, Kevin Hilman, linux-omap, linux-pm,
	linux-trace-users, Jean Pihet, Frederic Weisbecker, Tejun Heo

* Rafael J. Wysocki (rjw@sisk.pl) wrote:
> On Tuesday, October 26, 2010, Mathieu Desnoyers wrote:
> > * Peter Zijlstra (peterz@infradead.org) wrote:
> > > On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
> > > > 
> > > > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> > > >         atomic_inc(&dev->power.usage_count); 
> > > 
> > > That's terribly racy..
> > 
> > Looking at the original code, it looks racy even without considering the
> > tracepoint:
> > 
> > int __pm_runtime_get(struct device *dev, bool sync)
> >  {
> >         int retval;
> > 
> > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> >         atomic_inc(&dev->power.usage_count);
> >         retval = sync ? pm_runtime_resume(dev) : pm_request_resume(dev);
> > 
> > There is no implied memory barrier after "atomic_inc". So either all these
> > inc/dec are protected with mutexes or spinlocks, in which case one might wonder
> > why atomic operations are used at all, or it's a racy mess. (I vote for the
> > second option)
> 
> No, it isn't.
> 
> > kref should certainly be used there.
> 
> No, it shouldn't.
> 
> Please try to understand the code you're commenting on first.

Please see my reply to Alan Stern:

http://www.spinics.net/lists/linux-omap/msg39382.html

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 21:33                   ` [linux-pm] " Mathieu Desnoyers
@ 2010-10-26 22:20                     ` Rafael J. Wysocki
  2010-10-26 22:20                     ` [linux-pm] " Rafael J. Wysocki
  1 sibling, 0 replies; 157+ messages in thread
From: Rafael J. Wysocki @ 2010-10-26 22:20 UTC (permalink / raw)
  To: linux-pm
  Cc: linux-omap, Arjan van de Ven, Thomas Gleixner, Pierre Tardy,
	Peter Zijlstra, Frederic Weisbecker, Jean Pihet, Steven Rostedt,
	linux-trace-users, Frank Eigler, Masami Hiramatsu,
	Mathieu Desnoyers, Ingo Molnar, Paul E. McKenney, Linus Torvalds,
	Andrew Morton, Tejun Heo

On Tuesday, October 26, 2010, Mathieu Desnoyers wrote:
> * Alan Stern (stern@rowland.harvard.edu) wrote:
> > On Tue, 26 Oct 2010, Mathieu Desnoyers wrote:
> > 
> > > * Peter Zijlstra (peterz@infradead.org) wrote:
> > > > On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
> > > > > 
> > > > > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> > > > >         atomic_inc(&dev->power.usage_count); 
> > > > 
> > > > That's terribly racy..
> > > 
> > > Looking at the original code, it looks racy even without considering the
> > > tracepoint:
> > > 
> > > int __pm_runtime_get(struct device *dev, bool sync)
> > >  {
> > >         int retval;
> > > 
> > > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> > >         atomic_inc(&dev->power.usage_count);
> > >         retval = sync ? pm_runtime_resume(dev) : pm_request_resume(dev);
> > > 
> > > There is no implied memory barrier after "atomic_inc". So either all these
> > > inc/dec are protected with mutexes or spinlocks, in which case one might wonder
> > > why atomic operations are used at all, or it's a racy mess. (I vote for the
> > > second option)
> > 
> > I don't understand.  What's the problem?  The inc/dec are atomic 
> > because they are not protected by spinlocks, but everything else is 
> > (aside from the tracepoint, which is new).
> > 
> > > kref should certainly be used there.
> > 
> > What for?
> 
> kref has the following "get":
> 
>         atomic_inc(&kref->refcount);
>         smp_mb__after_atomic_inc();
> 
> What seems to be missing in __pm_runtime_get() and pm_runtime_get_noresume() is
> the memory barrier after the atomic increment. The atomic increment is free to
> be reordered into the following spinlock critical section (within
> pm_runtime_resume() or pm_request_resume() execution) because taking a spinlock
> only acts as a memory barrier with acquire semantics, not a full memory barrier.
>
> So AFAIU, the failure scenario would be as follows (sorry for the 80+ columns):
> 
> initial conditions: usage_count = 1
> 
> CPU A                                                       CPU B
> 1) __pm_runtime_get() (sync = true)
> 2)   atomic_inc(&usage_count) (not committed to memory yet)
> 3)   pm_runtime_resume()
> 4)     spin_lock_irqsave(&dev->power.lock, flags);
> 5)     retval = __pm_request_resume(dev);

If sync = true this is
           retval = __pm_runtime_resume(dev);
which drops and reacquires the spinlock.  In the meantime it sets
->power.runtime_status so that __pm_runtime_idle() will fail if run at this
point.

> 6)     (execute the body of __pm_request_resume and return)
> 7)                                                          __pm_runtime_put() (sync = true) 
> 8)                                                          if (atomic_dec_and_test(&dev->power.usage_count))
>                                                               (still see usage_count == 1 before decrement,
>                                                                thus decrement to 0)
> 9)                                                             pm_runtime_idle()
> 10)  spin_unlock_irqrestore(&dev->power.lock, flags)
> 11)                                                            spin_lock_irq(&dev->power.lock);
> 12)                                                            retval = __pm_runtime_idle(dev);

Moreover, __pm_runtime_idle() checks ->power.usage_count under the spinlock,
so it will see it's been incremented in the meantime and it will back off.

> 13)                                                            spin_unlock_irq(&dev->power.lock);
> 
> So we end up in a situation where CPU A expects the device to be resumed, but
> the last action performed has been to bring it to idle.
>
> A smp_mb__after_atomic_inc() between lines 2 and 3 would fix this.

I don't think this particular race is possible.  However, there is another one
that seems to be possible (in a different function) that an explicit barrier
will prevent from happening.

It's related to pm_runtime_get_noresume(), but I think it's better to put the
barrier where it's necessary rather than into pm_runtime_get_noresume() itself.

Thanks,
Rafael

* Re: [linux-pm] [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 21:33                   ` [linux-pm] " Mathieu Desnoyers
  2010-10-26 22:20                     ` Rafael J. Wysocki
@ 2010-10-26 22:20                     ` Rafael J. Wysocki
  2010-10-26 22:39                       ` Rafael J. Wysocki
                                         ` (3 more replies)
  1 sibling, 4 replies; 157+ messages in thread
From: Rafael J. Wysocki @ 2010-10-26 22:20 UTC (permalink / raw)
  To: linux-pm
  Cc: Mathieu Desnoyers, Alan Stern, Paul E. McKenney, Pierre Tardy,
	Peter Zijlstra, Ingo Molnar, Jean Pihet, Steven Rostedt,
	linux-trace-users, Frank Eigler, Linus Torvalds,
	Frederic Weisbecker, Masami Hiramatsu, Tejun Heo, Andrew Morton,
	linux-omap, Arjan van de Ven, Thomas Gleixner

On Tuesday, October 26, 2010, Mathieu Desnoyers wrote:
> * Alan Stern (stern@rowland.harvard.edu) wrote:
> > On Tue, 26 Oct 2010, Mathieu Desnoyers wrote:
> > 
> > > * Peter Zijlstra (peterz@infradead.org) wrote:
> > > > On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
> > > > > 
> > > > > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> > > > >         atomic_inc(&dev->power.usage_count); 
> > > > 
> > > > That's terribly racy..
> > > 
> > > Looking at the original code, it looks racy even without considering the
> > > tracepoint:
> > > 
> > > int __pm_runtime_get(struct device *dev, bool sync)
> > >  {
> > >         int retval;
> > > 
> > > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> > >         atomic_inc(&dev->power.usage_count);
> > >         retval = sync ? pm_runtime_resume(dev) : pm_request_resume(dev);
> > > 
> > > There is no implied memory barrier after "atomic_inc". So either all these
> > > inc/dec are protected with mutexes or spinlocks, in which case one might wonder
> > > why atomic operations are used at all, or it's a racy mess. (I vote for the
> > > second option)
> > 
> > I don't understand.  What's the problem?  The inc/dec are atomic 
> > because they are not protected by spinlocks, but everything else is 
> > (aside from the tracepoint, which is new).
> > 
> > > kref should certainly be used there.
> > 
> > What for?
> 
> kref has the following "get":
> 
>         atomic_inc(&kref->refcount);
>         smp_mb__after_atomic_inc();
> 
> What seems to be missing in __pm_runtime_get() and pm_runtime_get_noresume() is
> the memory barrier after the atomic increment. The atomic increment is free to
> be reordered into the following spinlock critical section (within
> pm_runtime_resume() or pm_request_resume() execution) because taking a spinlock
> only acts as a memory barrier with acquire semantics, not a full memory barrier.
>
> So AFAIU, the failure scenario would be as follows (sorry for the 80+ columns):
> 
> initial conditions: usage_count = 1
> 
> CPU A                                                       CPU B
> 1) __pm_runtime_get() (sync = true)
> 2)   atomic_inc(&usage_count) (not committed to memory yet)
> 3)   pm_runtime_resume()
> 4)     spin_lock_irqsave(&dev->power.lock, flags);
> 5)     retval = __pm_request_resume(dev);

If sync = true this is
           retval = __pm_runtime_resume(dev);
which drops and reacquires the spinlock.  In the meantime it sets
->power.runtime_status so that __pm_runtime_idle() will fail if run at this
point.

> 6)     (execute the body of __pm_request_resume and return)
> 7)                                                          __pm_runtime_put() (sync = true) 
> 8)                                                          if (atomic_dec_and_test(&dev->power.usage_count))
>                                                               (still see usage_count == 1 before decrement,
>                                                                thus decrement to 0)
> 9)                                                             pm_runtime_idle()
> 10)  spin_unlock_irqrestore(&dev->power.lock, flags)
> 11)                                                            spin_lock_irq(&dev->power.lock);
> 12)                                                            retval = __pm_runtime_idle(dev);

Moreover, __pm_runtime_idle() checks ->power.usage_count under the spinlock,
so it will see it's been incremented in the meantime and it will back off.

> 13)                                                            spin_unlock_irq(&dev->power.lock);
> 
> So we end up in a situation where CPU A expects the device to be resumed, but
> the last action performed has been to bring it to idle.
>
> A smp_mb__after_atomic_inc() between lines 2 and 3 would fix this.

I don't think this particular race is possible.  However, there is another one
that seems to be possible (in a different function) that an explicit barrier
will prevent from happening.

It's related to pm_runtime_get_noresume(), but I think it's better to put the
barrier where it's necessary rather than into pm_runtime_get_noresume() itself.

Thanks,
Rafael

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 21:38                   ` Mathieu Desnoyers
  2010-10-26 22:22                     ` Rafael J. Wysocki
@ 2010-10-26 22:22                     ` Rafael J. Wysocki
  1 sibling, 0 replies; 157+ messages in thread
From: Rafael J. Wysocki @ 2010-10-26 22:22 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andrew Morton, Pierre Tardy, Peter Zijlstra, linux-trace-users,
	Jean Pihet, Steven Rostedt, Frederic Weisbecker, Linus Torvalds,
	Frank Eigler, Thomas Gleixner, linux-pm, Masami Hiramatsu,
	Tejun Heo, Ingo Molnar, linux-omap, Arjan van de Ven

On Tuesday, October 26, 2010, Mathieu Desnoyers wrote:
> * Rafael J. Wysocki (rjw@sisk.pl) wrote:
> > On Tuesday, October 26, 2010, Mathieu Desnoyers wrote:
> > > * Peter Zijlstra (peterz@infradead.org) wrote:
> > > > On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
> > > > > 
> > > > > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> > > > >         atomic_inc(&dev->power.usage_count); 
> > > > 
> > > > That's terribly racy..
> > > 
> > > Looking at the original code, it looks racy even without considering the
> > > tracepoint:
> > > 
> > > int __pm_runtime_get(struct device *dev, bool sync)
> > >  {
> > >         int retval;
> > > 
> > > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> > >         atomic_inc(&dev->power.usage_count);
> > >         retval = sync ? pm_runtime_resume(dev) : pm_request_resume(dev);
> > > 
> > > There is no implied memory barrier after "atomic_inc". So either all these
> > > inc/dec are protected with mutexes or spinlocks, in which case one might wonder
> > > why atomic operations are used at all, or it's a racy mess. (I vote for the
> > > second option)
> > 
> > No, it isn't.
> > 
> > > kref should certainly be used there.
> > 
> > No, it shouldn't.
> > 
> > Please try to understand the code you're commenting on first.
> 
> Please see my reply to Alan Stern:
> 
> http://www.spinics.net/lists/linux-omap/msg39382.html

I have and I'm still unimpressed. :-)

Please see my reply to that message.

Thanks,
Rafael

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 21:38                   ` Mathieu Desnoyers
@ 2010-10-26 22:22                     ` Rafael J. Wysocki
  2010-10-26 22:22                     ` Rafael J. Wysocki
  1 sibling, 0 replies; 157+ messages in thread
From: Rafael J. Wysocki @ 2010-10-26 22:22 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Greg Kroah-Hartman, Pierre Tardy,
	Arjan van de Ven, Ingo Molnar, Thomas Renninger, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Masami Hiramatsu, Frank Eigler,
	Steven Rostedt, Kevin Hilman, linux-omap, linux-pm,
	linux-trace-users, Jean Pihet, Frederic Weisbecker, Tejun Heo

On Tuesday, October 26, 2010, Mathieu Desnoyers wrote:
> * Rafael J. Wysocki (rjw@sisk.pl) wrote:
> > On Tuesday, October 26, 2010, Mathieu Desnoyers wrote:
> > > * Peter Zijlstra (peterz@infradead.org) wrote:
> > > > On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
> > > > > 
> > > > > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> > > > >         atomic_inc(&dev->power.usage_count); 
> > > > 
> > > > That's terribly racy..
> > > 
> > > Looking at the original code, it looks racy even without considering the
> > > tracepoint:
> > > 
> > > int __pm_runtime_get(struct device *dev, bool sync)
> > >  {
> > >         int retval;
> > > 
> > > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> > >         atomic_inc(&dev->power.usage_count);
> > >         retval = sync ? pm_runtime_resume(dev) : pm_request_resume(dev);
> > > 
> > > There is no implied memory barrier after "atomic_inc". So either all these
> > > inc/dec are protected with mutexes or spinlocks, in which case one might wonder
> > > why atomic operations are used at all, or it's a racy mess. (I vote for the
> > > second option)
> > 
> > No, it isn't.
> > 
> > > kref should certainly be used there.
> > 
> > No, it shouldn't.
> > 
> > Please try to understand the code you're commenting on first.
> 
> Please see my reply to Alan Stern:
> 
> http://www.spinics.net/lists/linux-omap/msg39382.html

I have and I'm still unimpressed. :-)

Please see my reply to that message.

Thanks,
Rafael

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 22:20                     ` [linux-pm] " Rafael J. Wysocki
@ 2010-10-26 22:39                       ` Rafael J. Wysocki
  2010-10-26 22:39                       ` [linux-pm] " Rafael J. Wysocki
                                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 157+ messages in thread
From: Rafael J. Wysocki @ 2010-10-26 22:39 UTC (permalink / raw)
  To: linux-pm
  Cc: Paul E. McKenney, Andrew Morton, Pierre Tardy, Peter Zijlstra,
	Frederic Weisbecker, Linus Torvalds, Jean Pihet, Steven Rostedt,
	linux-trace-users, Frank Eigler, Mathieu Desnoyers,
	Masami Hiramatsu, Tejun Heo, Thomas Gleixner, linux-omap,
	Arjan van de Ven, Ingo Molnar

On Wednesday, October 27, 2010, Rafael J. Wysocki wrote:
> On Tuesday, October 26, 2010, Mathieu Desnoyers wrote:
> > * Alan Stern (stern@rowland.harvard.edu) wrote:
> > > On Tue, 26 Oct 2010, Mathieu Desnoyers wrote:
> > > 
> > > > * Peter Zijlstra (peterz@infradead.org) wrote:
> > > > > On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
> > > > > > 
> > > > > > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> > > > > >         atomic_inc(&dev->power.usage_count); 
> > > > > 
> > > > > That's terribly racy..
> > > > 
> > > > Looking at the original code, it looks racy even without considering the
> > > > tracepoint:
> > > > 
> > > > int __pm_runtime_get(struct device *dev, bool sync)
> > > >  {
> > > >         int retval;
> > > > 
> > > > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> > > >         atomic_inc(&dev->power.usage_count);
> > > >         retval = sync ? pm_runtime_resume(dev) : pm_request_resume(dev);
> > > > 
> > > > There is no implied memory barrier after "atomic_inc". So either all these
> > > > inc/dec are protected with mutexes or spinlocks, in which case one might wonder
> > > > why atomic operations are used at all, or it's a racy mess. (I vote for the
> > > > second option)
> > > 
> > > I don't understand.  What's the problem?  The inc/dec are atomic 
> > > because they are not protected by spinlocks, but everything else is 
> > > (aside from the tracepoint, which is new).
> > > 
> > > > kref should certainly be used there.
> > > 
> > > What for?
> > 
> > kref has the following "get":
> > 
> >         atomic_inc(&kref->refcount);
> >         smp_mb__after_atomic_inc();
> > 
> > What seems to be missing in __pm_runtime_get() and pm_runtime_get_noresume() is
> > the memory barrier after the atomic increment. The atomic increment is free to
> > be reordered into the following spinlock critical section (within
> > pm_runtime_resume() or pm_request_resume() execution) because taking a spinlock
> > only acts as a memory barrier with acquire semantics, not a full memory barrier.
> >
> > So AFAIU, the failure scenario would be as follows (sorry for the 80+ columns):
> > 
> > initial conditions: usage_count = 1
> > 
> > CPU A                                                       CPU B
> > 1) __pm_runtime_get() (sync = true)
> > 2)   atomic_inc(&usage_count) (not committed to memory yet)
> > 3)   pm_runtime_resume()
> > 4)     spin_lock_irqsave(&dev->power.lock, flags);
> > 5)     retval = __pm_request_resume(dev);
> 
> If sync = true this is
>            retval = __pm_runtime_resume(dev);
> which drops and reacquires the spinlock.  In the meantime it sets
> ->power.runtime_status so that __pm_runtime_idle() will fail if run at this
> point.
> 
> > 6)     (execute the body of __pm_request_resume and return)
> > 7)                                                          __pm_runtime_put() (sync = true) 
> > 8)                                                          if (atomic_dec_and_test(&dev->power.usage_count))
> >                                                               (still see usage_count == 1 before decrement,
> >                                                                thus decrement to 0)
> > 9)                                                             pm_runtime_idle()
> > 10)  spin_unlock_irqrestore(&dev->power.lock, flags)
> > 11)                                                            spin_lock_irq(&dev->power.lock);
> > 12)                                                            retval = __pm_runtime_idle(dev);
> 
> Moreover, __pm_runtime_idle() checks ->power.usage_count under the spinlock,
> so it will see it's been incremented in the meantime and it will back off.
> 
> > 13)                                                            spin_unlock_irq(&dev->power.lock);
> > 
> > So we end up in a situation where CPU A expects the device to be resumed, but
> > the last action performed has been to bring it to idle.
> >
> > A smp_mb__after_atomic_inc() between lines 2 and 3 would fix this.
> 
> I don't think this particular race is possible.  However, there is another one
> that seems to be possible (in a different function) that an explicit barrier
> will prevent from happening.
> 
> It's related to pm_runtime_get_noresume(), but I think it's better to put the
> barrier where it's necessary rather than into pm_runtime_get_noresume() itself.

Actually, no.  Since rpm_idle() and rpm_suspend() both check usage_count under
the spinlock, the race I was thinking about doesn't appear to be possible after
all.
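
For illustration only, the "re-check usage_count under the lock" pattern can be
modelled in userspace C. This is a sketch under stated assumptions, not the
runtime PM code: struct model_dev, model_idle(), model_get() and model_put()
are hypothetical names, a pthread mutex stands in for dev->power.lock, the
-EAGAIN return value is this model's choice, and the C11 atomics here default
to sequentially consistent ordering, which is stronger than a bare kernel
atomic_inc().

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model; not the kernel's struct device. */
struct model_dev {
	pthread_mutex_t lock;	/* stands in for dev->power.lock */
	atomic_int usage_count;	/* stands in for dev->power.usage_count */
	bool idle;		/* true once the idle path has completed */
};

/* Idle path: the counter is re-checked while holding the lock. */
static int model_idle(struct model_dev *d)
{
	int ret = 0;

	pthread_mutex_lock(&d->lock);
	if (atomic_load(&d->usage_count) > 0)
		ret = -EAGAIN;	/* a reference is held: back off */
	else
		d->idle = true;
	pthread_mutex_unlock(&d->lock);
	return ret;
}

/* Get path: take a reference, then "resume" under the lock. */
static void model_get(struct model_dev *d)
{
	atomic_fetch_add(&d->usage_count, 1);
	pthread_mutex_lock(&d->lock);
	d->idle = false;
	pthread_mutex_unlock(&d->lock);
}

/* Put path: only the caller that drops the last reference tries idle. */
static void model_put(struct model_dev *d)
{
	if (atomic_fetch_sub(&d->usage_count, 1) == 1)
		model_idle(d);
}
```

In this model, an idle attempt made while a reference is held returns -EAGAIN
and leaves the device state untouched; only a put that drops the count to zero
can mark it idle.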

Thanks,
Rafael

* Re: [linux-pm] [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 22:20                     ` [linux-pm] " Rafael J. Wysocki
  2010-10-26 22:39                       ` Rafael J. Wysocki
@ 2010-10-26 22:39                       ` Rafael J. Wysocki
  2010-10-27  0:46                       ` Mathieu Desnoyers
  2010-10-27  0:46                       ` Mathieu Desnoyers
  3 siblings, 0 replies; 157+ messages in thread
From: Rafael J. Wysocki @ 2010-10-26 22:39 UTC (permalink / raw)
  To: linux-pm
  Cc: linux-omap, Arjan van de Ven, Thomas Gleixner, Pierre Tardy,
	Peter Zijlstra, Frederic Weisbecker, Jean Pihet, Steven Rostedt,
	linux-trace-users, Frank Eigler, Masami Hiramatsu,
	Mathieu Desnoyers, Ingo Molnar, Paul E. McKenney, Linus Torvalds,
	Andrew Morton, Tejun Heo

On Wednesday, October 27, 2010, Rafael J. Wysocki wrote:
> On Tuesday, October 26, 2010, Mathieu Desnoyers wrote:
> > * Alan Stern (stern@rowland.harvard.edu) wrote:
> > > On Tue, 26 Oct 2010, Mathieu Desnoyers wrote:
> > > 
> > > > * Peter Zijlstra (peterz@infradead.org) wrote:
> > > > > On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
> > > > > > 
> > > > > > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> > > > > >         atomic_inc(&dev->power.usage_count); 
> > > > > 
> > > > > That's terribly racy..
> > > > 
> > > > Looking at the original code, it looks racy even without considering the
> > > > tracepoint:
> > > > 
> > > > int __pm_runtime_get(struct device *dev, bool sync)
> > > >  {
> > > >         int retval;
> > > > 
> > > > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> > > >         atomic_inc(&dev->power.usage_count);
> > > >         retval = sync ? pm_runtime_resume(dev) : pm_request_resume(dev);
> > > > 
> > > > There is no implied memory barrier after "atomic_inc". So either all these
> > > > inc/dec are protected with mutexes or spinlocks, in which case one might wonder
> > > > why atomic operations are used at all, or it's a racy mess. (I vote for the
> > > > second option)
> > > 
> > > I don't understand.  What's the problem?  The inc/dec are atomic 
> > > because they are not protected by spinlocks, but everything else is 
> > > (aside from the tracepoint, which is new).
> > > 
> > > > kref should certainly be used there.
> > > 
> > > What for?
> > 
> > kref has the following "get":
> > 
> >         atomic_inc(&kref->refcount);
> >         smp_mb__after_atomic_inc();
> > 
> > What seems to be missing in __pm_runtime_get() and pm_runtime_get_noresume() is
> > the memory barrier after the atomic increment. The atomic increment is free to
> > be reordered into the following spinlock critical section (within
> > pm_runtime_resume() or pm_request_resume() execution) because taking a spinlock
> > only acts as a memory barrier with acquire semantics, not a full memory barrier.
> >
> > So AFAIU, the failure scenario would be as follows (sorry for the 80+ columns):
> > 
> > initial conditions: usage_count = 1
> > 
> > CPU A                                                       CPU B
> > 1) __pm_runtime_get() (sync = true)
> > 2)   atomic_inc(&usage_count) (not committed to memory yet)
> > 3)   pm_runtime_resume()
> > 4)     spin_lock_irqsave(&dev->power.lock, flags);
> > 5)     retval = __pm_request_resume(dev);
> 
> If sync = true this is
>            retval = __pm_runtime_resume(dev);
> which drops and reacquires the spinlock.  In the meantime it sets
> ->power.runtime_status so that __pm_runtime_idle() will fail if run at this
> point.
> 
> > 6)     (execute the body of __pm_request_resume and return)
> > 7)                                                          __pm_runtime_put() (sync = true) 
> > 8)                                                          if (atomic_dec_and_test(&dev->power.usage_count))
> >                                                               (still see usage_count == 1 before decrement,
> >                                                                thus decrement to 0)
> > 9)                                                             pm_runtime_idle()
> > 10)  spin_unlock_irqrestore(&dev->power.lock, flags)
> > 11)                                                            spin_lock_irq(&dev->power.lock);
> > 12)                                                            retval = __pm_runtime_idle(dev);
> 
> Moreover, __pm_runtime_idle() checks ->power.usage_count under the spinlock,
> so it will see it's been incremented in the meantime and it will back off.
> 
> > 13)                                                            spin_unlock_irq(&dev->power.lock);
> > 
> > So we end up in a situation where CPU A expects the device to be resumed, but
> > the last action performed has been to bring it to idle.
> >
> > A smp_mb__after_atomic_inc() between lines 2 and 3 would fix this.
> 
> I don't think this particular race is possible.  However, there is another one
> that seems to be possible (in a different function) that an explicit barrier
> will prevent from happening.
> 
> It's related to pm_runtime_get_noresume(), but I think it's better to put the
> barrier where it's necessary rather than into pm_runtime_get_noresume() itself.

Actually, no.  Since rpm_idle() and rpm_suspend() both check usage_count under
the spinlock, the race I was thinking about doesn't appear to be possible after
all.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 18:57             ` Rafael J. Wysocki
  2010-10-27  0:00               ` Thomas Renninger
@ 2010-10-27  0:00               ` Thomas Renninger
  1 sibling, 0 replies; 157+ messages in thread
From: Thomas Renninger @ 2010-10-27  0:00 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Arjan van de Ven, Andrew Morton, linux-trace-users,
	Frederic Weisbecker, Pierre Tardy, Jean Pihet, Steven Rostedt,
	Peter Zijlstra, Frank Eigler, Mathieu Desnoyers, linux-pm,
	Masami Hiramatsu, Tejun Heo, Thomas Gleixner, linux-omap,
	Linus Torvalds, Ingo Molnar

On Tuesday 26 October 2010 08:57:01 pm Rafael J. Wysocki wrote:
> On Tuesday, October 26, 2010, Thomas Renninger wrote:
> > > 
> > > Ok, that's at least generic. Needs the review of Rafael, to determine
> > > whether this state value is all we want to know when we enter suspend.
> > He already gave an acked-by on this generic one here:
> > Re: [PATCH 3/4] perf: add calls to suspend trace point
> > Oh no, that was on the X86 specific part which depends on this one.
> > One should expect that he's fine with the generic part as well then,
> > but I agree that he should definitely have a look at this and sign it off.
> 
> What patch exactly do you mean?  I'm not quite sure from your comment above.

Eh, Jean's patch, sorry about that.
Needs a tiny change to use PWR_EVENT_EXIT instead of 0 with
my new patch series:

Signed-off-by: Jean Pihet <j-pihet@ti.com>
CC: Thomas Renninger <trenn@suse.de>
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>

---
 kernel/power/suspend.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c
index 7335952..10cad5c 100644
--- a/kernel/power/suspend.c
+++ b/kernel/power/suspend.c
@@ -22,6 +22,7 @@
 #include <linux/mm.h>
 #include <linux/slab.h>
 #include <linux/suspend.h>
+#include <trace/events/power.h>
 
 #include "power.h"
 
@@ -164,7 +165,9 @@ static int suspend_enter(suspend_state_t state)
        error = sysdev_suspend(PMSG_SUSPEND);
        if (!error) {
                if (!suspend_test(TEST_CORE) && pm_check_wakeup_events()) {
+                       trace_machine_suspend(state);
                        error = suspend_ops->enter(state);
+                       trace_machine_suspend(0);
                        events_check_enabled = false;
                }
                sysdev_resume();
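
To make the requested change concrete, here is a minimal user-space sketch of
the begin/end pairing around the enter call, with the end marker switched from
0 to PWR_EVENT_EXIT. Everything prefixed model_ is a hypothetical stand-in
invented for illustration, and the value -1 used for MODEL_PWR_EVENT_EXIT is an
assumption (the mail above only gives the name). The presumed point of a
dedicated constant is that the end marker can no longer be mistaken for a real
state value.

```c
/* Assumed value, for illustration only; the thread names the constant
 * PWR_EVENT_EXIT but does not show its definition. */
#define MODEL_PWR_EVENT_EXIT (-1)

#define MODEL_LOG_MAX 8

/* Tiny in-memory log standing in for the trace buffer. */
static int model_trace_log[MODEL_LOG_MAX];
static int model_trace_len;

/* Hypothetical stand-in for trace_machine_suspend(). */
static void model_trace_machine_suspend(int state)
{
	if (model_trace_len < MODEL_LOG_MAX)
		model_trace_log[model_trace_len++] = state;
}

/* Hypothetical stand-in for suspend_ops->enter(); always succeeds. */
static int model_enter(int state)
{
	(void)state;
	return 0;
}

/* Models the traced part of suspend_enter(): one event before the
 * platform call with the target state, one after it with the exit marker. */
static int model_suspend_enter(int state)
{
	int error;

	model_trace_machine_suspend(state);
	error = model_enter(state);
	model_trace_machine_suspend(MODEL_PWR_EVENT_EXIT);
	return error;
}
```

A consumer of the log can then pair each state event with the following exit
marker without having to special-case a state value of 0.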

^ permalink raw reply related	[flat|nested] 157+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-26 22:20                     ` [linux-pm] " Rafael J. Wysocki
                                         ` (2 preceding siblings ...)
  2010-10-27  0:46                       ` Mathieu Desnoyers
@ 2010-10-27  0:46                       ` Mathieu Desnoyers
  3 siblings, 0 replies; 157+ messages in thread
From: Mathieu Desnoyers @ 2010-10-27  0:46 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: linux-omap, Arjan van de Ven, Andrew Morton, Thomas Gleixner,
	Pierre Tardy, Peter Zijlstra, Frederic Weisbecker, Jean Pihet,
	Steven Rostedt, linux-trace-users, Frank Eigler,
	Masami Hiramatsu, Tejun Heo, linux-pm, Paul E. McKenney,
	Linus Torvalds, Ingo Molnar

* Rafael J. Wysocki (rjw@sisk.pl) wrote:
> On Tuesday, October 26, 2010, Mathieu Desnoyers wrote:
> > * Alan Stern (stern@rowland.harvard.edu) wrote:
> > > On Tue, 26 Oct 2010, Mathieu Desnoyers wrote:
> > > 
> > > > * Peter Zijlstra (peterz@infradead.org) wrote:
> > > > > On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
> > > > > > 
> > > > > > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> > > > > >         atomic_inc(&dev->power.usage_count); 
> > > > > 
> > > > > That's terribly racy..
> > > > 
> > > > Looking at the original code, it looks racy even without considering the
> > > > tracepoint:
> > > > 
> > > > int __pm_runtime_get(struct device *dev, bool sync)
> > > >  {
> > > >         int retval;
> > > > 
> > > > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> > > >         atomic_inc(&dev->power.usage_count);
> > > >         retval = sync ? pm_runtime_resume(dev) : pm_request_resume(dev);
> > > > 
> > > > There is no implied memory barrier after "atomic_inc". So either all these
> > > > inc/dec are protected with mutexes or spinlocks, in which case one might wonder
> > > > why atomic operations are used at all, or it's a racy mess. (I vote for the
> > > > second option)
> > > 
> > > I don't understand.  What's the problem?  The inc/dec are atomic 
> > > because they are not protected by spinlocks, but everything else is 
> > > (aside from the tracepoint, which is new).
> > > 
> > > > kref should certainly be used there.
> > > 
> > > What for?
> > 
> > kref has the following "get":
> > 
> >         atomic_inc(&kref->refcount);
> >         smp_mb__after_atomic_inc();
> > 
> > What seems to be missing in __pm_runtime_get() and pm_runtime_get_noresume() is
> > the memory barrier after the atomic increment. The atomic increment is free to
> > be reordered into the following spinlock (within pm_request_resume or pm_request
> > resume execution) because taking a spinlock only acts as a memory barrier with
> > acquire semantic, not a full memory barrier.
> >
> > So AFAIU, the failure scenario would be as follows (sorry for the 80+ columns):
> > 
> > initial conditions: usage_count = 1
> > 
> > CPU A                                                       CPU B
> > 1) __pm_runtime_get() (sync = true)
> > 2)   atomic_inc(&usage_count) (not committed to memory yet)
> > 3)   pm_runtime_resume()
> > 4)     spin_lock_irqsave(&dev->power.lock, flags);
> > 5)     retval = __pm_request_resume(dev);
> 
> If sync = true this is
>            retval = __pm_runtime_resume(dev);
> which drops and reacquires the spinlock.

Let's see. Upon entry into __pm_runtime_resume, the following condition holds
(remember, the initial condition is that usage_count == 1):

  dev->power.runtime_status == RPM_ACTIVE

so retval is set to 1 and execution jumps directly to "out", without setting
"parent". So there does not seem to be any spinlock reacquire on this path, or
am I misunderstanding how "runtime_status" works?

> In the meantime it sets
> ->power.runtime_status so that __pm_runtime_idle() will fail if run at this
> point.

runtime_status will be left at "RPM_ACTIVE", which is the appropriate value
expected by __pm_runtime_idle.

> 
> > 6)     (execute the body of __pm_request_resume and return)
> > 7)                                                          __pm_runtime_put() (sync = true) 
> > 8)                                                          if (atomic_dec_and_test(&dev->power.usage_count))
> >                                                               (still see usage_count == 1 before decrement,
> >                                                                thus decrement to 0)
> > 9)                                                             pm_runtime_idle()
> > 10)  spin_unlock_irqrestore(&dev->power.lock, flags)
> > 11)                                                            spin_lock_irq(&dev->power.lock);
> > 12)                                                            retval = __pm_runtime_idle(dev);
> 
> Moreover, __pm_runtime_idle() checks ->power.usage_count under the spinlock,
> so it will see it's been incremented in the meantime and it will back off.

This is a subtle but important point. Yes, my scenario seems to be dealt with by
the extra usage_count check while the spinlock is held.

How about adding a comment under this atomic_inc() stating that the memory
barriers are implicitly dealt with by the following spinlock release and the
extra check while the spinlock is held?

Commenting memory barriers is important, but commenting why memory barriers are
not needed due to a subtle corner-case looks even more important.
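
For reference, a user-space sketch of the pattern being discussed: the
increment carries no barrier of its own, but the getter then takes and releases
the lock, and the idle path re-reads the count with the lock held. This is a
model built on C11 atomics, not the kernel code; all model_ names are
hypothetical, the device is assumed to be active already, and returning -EAGAIN
from the back-off path is a choice made for this model.

```c
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical user-space stand-in for the fields discussed in the thread. */
struct model_dev {
	atomic_int usage_count;	/* models dev->power.usage_count */
	atomic_flag lock;	/* models dev->power.lock as a tiny spinlock */
	int idle_calls;		/* times the idle path actually went idle */
};

static void model_init(struct model_dev *dev)
{
	atomic_init(&dev->usage_count, 0);
	atomic_flag_clear(&dev->lock);
	dev->idle_calls = 0;
}

static void model_lock(struct model_dev *dev)
{
	/* Acquire semantics only, like taking a spinlock. */
	while (atomic_flag_test_and_set_explicit(&dev->lock,
						 memory_order_acquire))
		;
}

static void model_unlock(struct model_dev *dev)
{
	atomic_flag_clear_explicit(&dev->lock, memory_order_release);
}

/* Models __pm_runtime_idle(): the count is re-read with the lock held. */
static int model_idle(struct model_dev *dev)
{
	int ret = 0;

	model_lock(dev);
	if (atomic_load_explicit(&dev->usage_count, memory_order_relaxed) > 0)
		ret = -EAGAIN;	/* somebody holds a reference: back off */
	else
		dev->idle_calls++;
	model_unlock(dev);
	return ret;
}

/* Models __pm_runtime_get(dev, true) for a device that is already active. */
static void model_get_sync(struct model_dev *dev)
{
	/*
	 * No explicit barrier after the increment.  The lock/unlock pair
	 * below orders it against any model_idle() that takes the lock
	 * afterwards, and model_idle() re-checks the count under the lock.
	 */
	atomic_fetch_add_explicit(&dev->usage_count, 1, memory_order_relaxed);
	model_lock(dev);
	/* (resume work would go here) */
	model_unlock(dev);
}

/* Models __pm_runtime_put(dev, true): go idle only on the last reference. */
static int model_put_sync(struct model_dev *dev)
{
	if (atomic_fetch_sub_explicit(&dev->usage_count, 1,
				      memory_order_relaxed) == 1)
		return model_idle(dev);
	return 0;
}
```

The comment inside model_get_sync() is roughly the kind of note being asked
for: it records why no barrier follows the increment on this particular path.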

(hrm, but more below considering pm_runtime_get_noresume())

> 
> > 13)                                                            spin_unlock_irq(&dev->power.lock);
> > 
> > So we end up in a situation where CPU A expects the device to be resumed, but
> > the last action performed has been to bring it to idle.
> >
> > A smp_mb__after_atomic_inc() between lines 2 and 3 would fix this.
> 
> I don't think this particular race is possible.  However, there is another one
> that seems to be possible (in a different function) that an explicit barrier
> will prevent from happening.
> 
> It's related to pm_runtime_get_noresume(), but I think it's better to put the
> barrier where it's necessary rather than into pm_runtime_get_noresume() itself.

Quoting your following mail:

> Actually, no.  Since rpm_idle() and rpm_suspend() both check usage_count under
> the spinlock, the race I was thinking about doesn't appear to be possible
> after all.

Hrm, for the extra-usage_count-under-spinlock check to work, all
pm_runtime_get_noresume() callers should grab and release the dev->power.lock
after incrementing the usage_count. This does not seem to be the case though. So
you might really have a race there.

So every code path that does:

1) pm_runtime_get_noresume(dev);

2) ...

3) pm_runtime_put_noidle(dev);

expecting that the device state cannot be changed between 1 and 3 might be
surprised by a concurrent call to __pm_runtime_idle() that would put the device
to idle (or similarly with suspend) due to the lack of a memory barrier after
the atomic increment.

Or am I missing something else ?
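
To illustrate the two variants being compared, here is a small C11 sketch: a
bare increment, and a kref-style increment followed by a full fence. The
seq_cst fence is only a rough user-space analogue of
smp_mb__after_atomic_inc(), the model_ names are hypothetical, and the put side
is simplified. Whether the fence is actually required on this path is the open
question in this thread; the sketch only shows what each variant does.

```c
#include <stdatomic.h>

/* Plain increment: models an atomic_inc() that implies no barrier. */
static void model_get_noresume(atomic_int *usage_count)
{
	atomic_fetch_add_explicit(usage_count, 1, memory_order_relaxed);
}

/* kref-style variant: the same increment followed by a full fence, so
 * later memory accesses cannot be reordered before the increment. */
static void model_get_noresume_mb(atomic_int *usage_count)
{
	atomic_fetch_add_explicit(usage_count, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
}

/* Simplified model of pm_runtime_put_noidle(): drop the reference and
 * never trigger the idle path. */
static void model_put_noidle(atomic_int *usage_count)
{
	atomic_fetch_sub_explicit(usage_count, 1, memory_order_relaxed);
}
```

Both get variants leave the counter in the same state; they differ only in
what ordering a concurrent observer is guaranteed to see.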

Thanks,

Mathieu

> 
> Thanks,
> Rafael

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-27  0:00               ` Thomas Renninger
  2010-10-27  9:16                 ` Rafael J. Wysocki
@ 2010-10-27  9:16                 ` Rafael J. Wysocki
  1 sibling, 0 replies; 157+ messages in thread
From: Rafael J. Wysocki @ 2010-10-27  9:16 UTC (permalink / raw)
  To: Thomas Renninger
  Cc: Arjan van de Ven, Andrew Morton, linux-trace-users,
	Frederic Weisbecker, Pierre Tardy, Jean Pihet, Steven Rostedt,
	Peter Zijlstra, Frank Eigler, Mathieu Desnoyers, linux-pm,
	Masami Hiramatsu, Tejun Heo, Thomas Gleixner, linux-omap,
	Linus Torvalds, Ingo Molnar

On Wednesday, October 27, 2010, Thomas Renninger wrote:
> On Tuesday 26 October 2010 08:57:01 pm Rafael J. Wysocki wrote:
> > On Tuesday, October 26, 2010, Thomas Renninger wrote:
> > > > 
> > > > Ok, that's at least generic. Needs the review of Rafael, to determine
> > > > whether this state value is all we want to know when we enter suspend.
> > > He already gave an acked-by on this generic one here:
> > > Re: [PATCH 3/4] perf: add calls to suspend trace point
> > > > Oh no, that was on the X86 specific part which depends on this one.
> > > One should expect that he's fine with the generic part as well then,
> > > but I agree that he should definitely have a look at this and sign it off.
> > 
> > What patch exactly do you mean?  I'm not quite sure from your comment above.
> 
> Eh, Jean's patch, sorry about that.
> Needs a tiny change to use PWR_EVENT_EXIT instead of 0 with
> my new patch series:

No problem with that as far as I'm concerned.

Thanks,
Rafael


> Signed-off-by: Jean Pihet <j-pihet@ti.com>
> CC: Thomas Renninger <trenn@suse.de>
> Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
> 
> ---
>  kernel/power/suspend.c |    3 +++
>  1 files changed, 3 insertions(+), 0 deletions(-)
> 
> diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c
> index 7335952..10cad5c 100644
> --- a/kernel/power/suspend.c
> +++ b/kernel/power/suspend.c
> @@ -22,6 +22,7 @@
>  #include <linux/mm.h>
>  #include <linux/slab.h>
>  #include <linux/suspend.h>
> +#include <trace/events/power.h>
>  
>  #include "power.h"
>  
> @@ -164,7 +165,9 @@ static int suspend_enter(suspend_state_t state)
>         error = sysdev_suspend(PMSG_SUSPEND);
>         if (!error) {
>                 if (!suspend_test(TEST_CORE) && pm_check_wakeup_events()) {
> +                       trace_machine_suspend(state);
>                         error = suspend_ops->enter(state);
> +                       trace_machine_suspend(0);
>                         events_check_enabled = false;
>                 }
>                 sysdev_resume();
> 
> 

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-27  0:46                       ` Mathieu Desnoyers
  2010-10-27 10:22                         ` Rafael J. Wysocki
@ 2010-10-27 10:22                         ` Rafael J. Wysocki
  1 sibling, 0 replies; 157+ messages in thread
From: Rafael J. Wysocki @ 2010-10-27 10:22 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-omap, Arjan van de Ven, Andrew Morton, Thomas Gleixner,
	Pierre Tardy, Peter Zijlstra, Frederic Weisbecker, Jean Pihet,
	Steven Rostedt, linux-trace-users, Frank Eigler,
	Masami Hiramatsu, Tejun Heo, linux-pm, Paul E. McKenney,
	Linus Torvalds, Ingo Molnar

On Wednesday, October 27, 2010, Mathieu Desnoyers wrote:
> * Rafael J. Wysocki (rjw@sisk.pl) wrote:
> > On Tuesday, October 26, 2010, Mathieu Desnoyers wrote:
> > > * Alan Stern (stern@rowland.harvard.edu) wrote:
> > > > On Tue, 26 Oct 2010, Mathieu Desnoyers wrote:
> > > > 
> > > > > * Peter Zijlstra (peterz@infradead.org) wrote:
> > > > > > On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
> > > > > > > 
> > > > > > > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> > > > > > >         atomic_inc(&dev->power.usage_count); 
> > > > > > 
> > > > > > That's terribly racy..
> > > > > 
> > > > > Looking at the original code, it looks racy even without considering the
> > > > > tracepoint:
> > > > > 
> > > > > int __pm_runtime_get(struct device *dev, bool sync)
> > > > >  {
> > > > >         int retval;
> > > > > 
> > > > > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> > > > >         atomic_inc(&dev->power.usage_count);
> > > > >         retval = sync ? pm_runtime_resume(dev) : pm_request_resume(dev);
> > > > > 
> > > > > There is no implied memory barrier after "atomic_inc". So either all these
> > > > > inc/dec are protected with mutexes or spinlocks, in which case one might wonder
> > > > > why atomic operations are used at all, or it's a racy mess. (I vote for the
> > > > > second option)
> > > > 
> > > > I don't understand.  What's the problem?  The inc/dec are atomic 
> > > > because they are not protected by spinlocks, but everything else is 
> > > > (aside from the tracepoint, which is new).
> > > > 
> > > > > kref should certainly be used there.
> > > > 
> > > > What for?
> > > 
> > > kref has the following "get":
> > > 
> > >         atomic_inc(&kref->refcount);
> > >         smp_mb__after_atomic_inc();
> > > 
> > > What seems to be missing in __pm_runtime_get() and pm_runtime_get_noresume() is
> > > the memory barrier after the atomic increment. The atomic increment is free to
> > > be reordered into the following spinlock (within pm_request_resume or pm_request
> > > resume execution) because taking a spinlock only acts as a memory barrier with
> > > acquire semantic, not a full memory barrier.
> > >
> > > So AFAIU, the failure scenario would be as follows (sorry for the 80+ columns):
> > > 
> > > initial conditions: usage_count = 1
> > > 
> > > CPU A                                                       CPU B
> > > 1) __pm_runtime_get() (sync = true)
> > > 2)   atomic_inc(&usage_count) (not committed to memory yet)
> > > 3)   pm_runtime_resume()
> > > 4)     spin_lock_irqsave(&dev->power.lock, flags);
> > > 5)     retval = __pm_request_resume(dev);
> > 
> > If sync = true this is
> >            retval = __pm_runtime_resume(dev);
> > which drops and reacquires the spinlock.
> 
> Let's see. Upon entry in __pm_runtime_resume, the following condition holds
> (remember, the initial condition is that usage_count == 1):
> 
>   dev->power.runtime_status == RPM_ACTIVE
> 
> so retval is set to 1, which goto directly to "out", without setting "parent".
> So there does not seem to be any spinlock reacquire on this path, or am I
> misunderstanding how the "runtime_status" works ?

No, I don't think you are; the above is correct.  I was referring to the scenario
in which the device was RPM_SUSPENDED initially.

> > In the meantime it sets
> > ->power.runtime_status so that __pm_runtime_idle() will fail if run at this
> > point.
> 
> runtime_status will be left at "RPM_ACTIVE", which is the appropriate value
> expected by __pm_runtime_idle.
> 
> > 
> > > 6)     (execute the body of __pm_request_resume and return)
> > > 7)                                                          __pm_runtime_put() (sync = true) 
> > > 8)                                                          if (atomic_dec_and_test(&dev->power.usage_count))
> > >                                                               (still see usage_count == 1 before decrement,
> > >                                                                thus decrement to 0)
> > > 9)                                                             pm_runtime_idle()
> > > 10)  spin_unlock_irqrestore(&dev->power.lock, flags)
> > > 11)                                                            spin_lock_irq(&dev->power.lock);
> > > 12)                                                            retval = __pm_runtime_idle(dev);
> > 
> > Moreover, __pm_runtime_idle() checks ->power.usage_count under the spinlock,
> > so it will see it's been incremented in the meantime and it will back off.
> 
> This is a subtle but important point. Yes, my scenario seems to be dealt with by
> the extra usage_count check while the spinlock is held.
> 
> How about adding a comment under this atomic_inc() stating that the memory
> barriers are implicitly dealt with by the following spinlock release and the
> extra check while spinlock is held ?
> 
> Commenting memory barriers is important, but commenting why memory barriers are
> not needed due to a subtle corner-case looks even more important.

Well, given that this discussion is taking place at all, I admit that it would
be good to document this somehow. :-)

I'll take care of that.
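
For illustration only, here is a minimal user-space sketch of the argument
above. It is not the PM core code: struct model_dev, model_get_sync(),
model_idle() and model_put_sync() are hypothetical names, and a C11 atomic_flag
spinlock stands in for dev->power.lock. The increment is relaxed and may sink
into the critical section that follows it, but it cannot move past the unlock,
so an idle attempt that takes the lock afterwards sees it and backs off.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical user-space model; none of these names exist in the kernel. */
enum model_status { MODEL_ACTIVE, MODEL_SUSPENDED };

struct model_dev {
	atomic_int usage_count;
	atomic_flag lock;		/* stands in for dev->power.lock */
	enum model_status status;
	int idle_calls;			/* how often the "idle callback" ran */
};

static void model_lock(struct model_dev *d)
{
	/* Acquire semantics only, like taking a spinlock. */
	while (atomic_flag_test_and_set_explicit(&d->lock, memory_order_acquire))
		;
}

static void model_unlock(struct model_dev *d)
{
	/* Release: earlier accesses, including the increment, are visible. */
	atomic_flag_clear_explicit(&d->lock, memory_order_release);
}

void model_init(struct model_dev *d, int count)
{
	atomic_init(&d->usage_count, count);
	atomic_flag_clear(&d->lock);
	d->status = MODEL_ACTIVE;
	d->idle_calls = 0;
}

/* Returns 1 if the device was already active, 0 if it had to be resumed. */
int model_get_sync(struct model_dev *d)
{
	int was_active;

	/* No barrier here: the increment may be reordered past the lock... */
	atomic_fetch_add_explicit(&d->usage_count, 1, memory_order_relaxed);
	model_lock(d);
	was_active = (d->status == MODEL_ACTIVE);
	d->status = MODEL_ACTIVE;
	model_unlock(d);	/* ...but not past this release. */
	return was_active;
}

/* Returns 1 if the idle callback ran, 0 if the attempt backed off. */
int model_idle(struct model_dev *d)
{
	int ran = 0;

	model_lock(d);
	/* The extra check under the lock that makes the barrier unnecessary. */
	if (atomic_load_explicit(&d->usage_count, memory_order_relaxed) == 0) {
		d->idle_calls++;
		ran = 1;
	}
	model_unlock(d);
	return ran;
}

/* Returns 1 if dropping the reference led to the idle callback running. */
int model_put_sync(struct model_dev *d)
{
	if (atomic_fetch_sub_explicit(&d->usage_count, 1,
				      memory_order_relaxed) != 1)
		return 0;	/* not the last reference */
	return model_idle(d);
}
```

In this model an idle attempt either takes the lock before the getter does (and
the resume then runs last), or it takes the lock after the getter's unlock and
observes the incremented count.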

> (hrm, but more below considering pm_runtime_get_noresume())
> 
> > 
> > > 13)                                                            spin_unlock_irq(&dev->power.lock);
> > > 
> > > So we end up in a situation where CPU A expects the device to be resumed, but
> > > the last action performed has been to bring it to idle.
> > >
> > > A smp_mb__after_atomic_inc() between lines 2 and 3 would fix this.
> > 
> > I don't think this particular race is possible.  However, there is another one
> > that seems to be possible (in a different function) that an explicit barrier
> > will prevent from happening.
> > 
> > It's related to pm_runtime_get_noresume(), but I think it's better to put the
> > barrier where it's necessary rather than into pm_runtime_get_noresume() itself.
> 
> Quoting your following mail:
> 
> > Actually, no.  Since rpm_idle() and rpm_suspend() both check usage_count under
> > the spinlock, the race I was thinking about doesn't appear to be possible
> > after all.
> 
> Hrm, for the extra-usage_count-under-spinlock check to work, all
> pm_runtime_get_noresume() callers should grab and release the dev->power.lock
> after incrementing the usage_count. This does not seem to be the case though. So
> you might really have a race there.
> 
> So every code path that does:
> 
> 1) pm_runtime_get_noresume(dev);
> 
> 2) ...
> 
> 3) pm_runtime_put_noidle(dev);
> 
> expecting that the device state cannot be changed between 1 and 3 might be
> surprised by a concurrent call to __pm_runtime_idle() that would put a device to
> idle (or similarly with suspend) due to lack of memory barrier after the atomic
> increment.
> 
> Or am I missing something else ?

First of all, the device can always be resumed regardless of the usage_count
value.  usage_count is only used to block attempts to suspend the device and
execute its driver's ->runtime_idle() callback after it has been resumed.
That's why the "normal" pm_runtime_get() queues up a resume request.

IOW, the _get() only becomes meaningful after attempting to resume the device
(which is what I tried to tell Arjan in one of the previous messages).

Second, there's no synchronization between pm_runtime_get_noresume() and
pm_runtime_suspend/idle() etc., so calling pm_runtime_get_noresume() is
certainly insufficient to block pm_runtime_suspend/idle() regardless of memory
barriers (there may be one already in progress when _get_noresume() is called).
To limit possible status changes from happening one should (at least) run
pm_runtime_barrier() (surprise, no? ;-)) after pm_runtime_get_noresume().

So if you don't want to resume the device immediately after increasing its
usage_count (in which case it's better to use pm_runtime_get_sync()), you
should do something like this:

1) pm_runtime_get_noresume(dev);
1a) pm_runtime_barrier(dev);  // That takes care of all pending requests etc.

2) ...
 
3) pm_runtime_put_noidle(dev);

[The meaning of pm_runtime_barrier() is that all of the runtime PM activity
started before the barrier has been completed when it returns.]

There's one place in the PM core where that really is necessary, but I wouldn't
recommend anyone doing anything like it in a driver.
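
As a rough illustration of the two points above, the following user-space
sketch models a usage count that only blocks suspend attempts, while a resume is
always allowed. All rpm_model_* names are hypothetical, and
rpm_model_barrier() is a deliberate simplification: it only waits for a status
change that is in flight by taking and dropping the lock, whereas the real
pm_runtime_barrier() also deals with pending requests.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical model; these are not the kernel's types or functions. */
enum rpm_model_status { RPM_MODEL_ACTIVE, RPM_MODEL_SUSPENDED };

struct rpm_model {
	atomic_int usage_count;
	atomic_flag lock;		/* stands in for dev->power.lock */
	enum rpm_model_status status;
};

static void rpm_model_lock(struct rpm_model *m)
{
	while (atomic_flag_test_and_set_explicit(&m->lock, memory_order_acquire))
		;
}

static void rpm_model_unlock(struct rpm_model *m)
{
	atomic_flag_clear_explicit(&m->lock, memory_order_release);
}

void rpm_model_init(struct rpm_model *m)
{
	atomic_init(&m->usage_count, 0);
	atomic_flag_clear(&m->lock);
	m->status = RPM_MODEL_ACTIVE;
}

/* Only bumps the counter; it does not touch the status at all. */
void rpm_model_get_noresume(struct rpm_model *m)
{
	atomic_fetch_add_explicit(&m->usage_count, 1, memory_order_relaxed);
}

void rpm_model_put_noidle(struct rpm_model *m)
{
	atomic_fetch_sub_explicit(&m->usage_count, 1, memory_order_relaxed);
}

/* Simplified barrier: wait until no status change holds the lock. */
void rpm_model_barrier(struct rpm_model *m)
{
	rpm_model_lock(m);
	rpm_model_unlock(m);
}

/* Returns 0 on success, -1 if the usage count blocks the suspend. */
int rpm_model_suspend(struct rpm_model *m)
{
	int ret = -1;

	rpm_model_lock(m);
	if (atomic_load_explicit(&m->usage_count, memory_order_relaxed) == 0) {
		m->status = RPM_MODEL_SUSPENDED;
		ret = 0;
	}
	rpm_model_unlock(m);
	return ret;
}

/* A resume is always allowed, whatever the usage count is. */
void rpm_model_resume(struct rpm_model *m)
{
	rpm_model_lock(m);
	m->status = RPM_MODEL_ACTIVE;
	rpm_model_unlock(m);
}
```

In this model a device that was suspended before rpm_model_get_noresume() stays
suspended afterwards; the reference only starts to matter once the device has
been resumed, at which point further suspend attempts back off until the
reference is dropped.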

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-27 10:22                         ` Rafael J. Wysocki
  2010-10-27 12:21                           ` Mathieu Desnoyers
@ 2010-10-27 12:21                           ` Mathieu Desnoyers
  1 sibling, 0 replies; 157+ messages in thread
From: Mathieu Desnoyers @ 2010-10-27 12:21 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: linux-omap, Arjan van de Ven, Andrew Morton, Thomas Gleixner,
	Pierre Tardy, Peter Zijlstra, Frederic Weisbecker, Jean Pihet,
	Steven Rostedt, linux-trace-users, Frank Eigler,
	Masami Hiramatsu, Tejun Heo, linux-pm, Paul E. McKenney,
	Linus Torvalds, Ingo Molnar

* Rafael J. Wysocki (rjw@sisk.pl) wrote:
> On Wednesday, October 27, 2010, Mathieu Desnoyers wrote:
> > * Rafael J. Wysocki (rjw@sisk.pl) wrote:
> > > On Tuesday, October 26, 2010, Mathieu Desnoyers wrote:
> > > > * Alan Stern (stern@rowland.harvard.edu) wrote:
> > > > > On Tue, 26 Oct 2010, Mathieu Desnoyers wrote:
> > > > > 
> > > > > > * Peter Zijlstra (peterz@infradead.org) wrote:
> > > > > > > On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
> > > > > > > > 
> > > > > > > > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> > > > > > > >         atomic_inc(&dev->power.usage_count); 
> > > > > > > 
> > > > > > > That's terribly racy..
> > > > > > 
> > > > > > Looking at the original code, it looks racy even without considering the
> > > > > > tracepoint:
> > > > > > 
> > > > > > int __pm_runtime_get(struct device *dev, bool sync)
> > > > > >  {
> > > > > >         int retval;
> > > > > > 
> > > > > > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> > > > > >         atomic_inc(&dev->power.usage_count);
> > > > > >         retval = sync ? pm_runtime_resume(dev) : pm_request_resume(dev);
> > > > > > 
> > > > > > There is no implied memory barrier after "atomic_inc". So either all these
> > > > > > inc/dec are protected with mutexes or spinlocks, in which case one might wonder
> > > > > > why atomic operations are used at all, or it's a racy mess. (I vote for the
> > > > > > second option)
> > > > > 
> > > > > I don't understand.  What's the problem?  The inc/dec are atomic 
> > > > > because they are not protected by spinlocks, but everything else is 
> > > > > (aside from the tracepoint, which is new).
> > > > > 
> > > > > > kref should certainly be used there.
> > > > > 
> > > > > What for?
> > > > 
> > > > kref has the following "get":
> > > > 
> > > >         atomic_inc(&kref->refcount);
> > > >         smp_mb__after_atomic_inc();
> > > > 
> > > > What seems to be missing in __pm_runtime_get() and pm_runtime_get_noresume() is
> > > > the memory barrier after the atomic increment. The atomic increment is free to
> > > > be reordered into the following spinlock (within pm_runtime_resume or
> > > > pm_request_resume execution) because taking a spinlock only acts as a memory
> > > > barrier with acquire semantics, not a full memory barrier.
> > > >
> > > > So AFAIU, the failure scenario would be as follows (sorry for the 80+ columns):
> > > > 
> > > > initial conditions: usage_count = 1
> > > > 
> > > > CPU A                                                       CPU B
> > > > 1) __pm_runtime_get() (sync = true)
> > > > 2)   atomic_inc(&usage_count) (not committed to memory yet)
> > > > 3)   pm_runtime_resume()
> > > > 4)     spin_lock_irqsave(&dev->power.lock, flags);
> > > > 5)     retval = __pm_request_resume(dev);
> > > 
> > > If sync = true this is
> > >            retval = __pm_runtime_resume(dev);
> > > which drops and reacquires the spinlock.
> > 
> > Let's see. Upon entry in __pm_runtime_resume, the following condition holds
> > (remember, the initial condition is that usage_count == 1):
> > 
> >   dev->power.runtime_status == RPM_ACTIVE
> > 
> > so retval is set to 1 and execution jumps directly to "out", without setting "parent".
> > So there does not seem to be any spinlock reacquire on this path, or am I
> > misunderstanding how the "runtime_status" works ?
> 
> No, you're not I think, the above is correct.  I was referring to the scenario
> in which the device was RPM_SUSPENDED initially.

Good to know I'm not losing it. ;-)

> 
> > > In the meantime it sets
> > > ->power.runtime_status so that __pm_runtime_idle() will fail if run at this
> > > point.
> > 
> > runtime_status will be left at "RPM_ACTIVE", which is the appropriate value
> > expected by __pm_runtime_idle.
> > 
> > > 
> > > > 6)     (execute the body of __pm_request_resume and return)
> > > > 7)                                                          __pm_runtime_put() (sync = true) 
> > > > 8)                                                          if (atomic_dec_and_test(&dev->power.usage_count))
> > > >                                                               (still see usage_count == 1 before decrement,
> > > >                                                                thus decrement to 0)
> > > > 9)                                                             pm_runtime_idle()
> > > > 10)  spin_unlock_irqrestore(&dev->power.lock, flags)
> > > > 11)                                                            spin_lock_irq(&dev->power.lock);
> > > > 12)                                                            retval = __pm_runtime_idle(dev);
> > > 
> > > Moreover, __pm_runtime_idle() checks ->power.usage_count under the spinlock,
> > > so it will see it's been incremented in the meantime and it will back off.
> > 
> > This is a subtle but important point. Yes, my scenario seems to be dealt with by
> > the extra usage_count check while the spinlock is held.
> > 
> > How about adding a comment under this atomic_inc() stating that the memory
> > barriers are implicitly dealt with by the following spinlock release and the
> > extra check while spinlock is held ?
> > 
> > Commenting memory barriers is important, but commenting why memory barriers are
> > not needed due to a subtle corner-case looks even more important.
> 
> Well, given that this discussion is taking place at all, I admit that it would
> be good to document this somehow. :-)

Yep, it's astonishing how a few comments can end up saving lots of emails from
confused reviewers. ;-)

> 
> I'll take care of that.
> 
> > (hrm, but more below considering pm_runtime_get_noresume())
> > 
> > > 
> > > > 13)                                                            spin_unlock_irq(&dev->power.lock);
> > > > 
> > > > So we end up in a situation where CPU A expects the device to be resumed, but
> > > > the last action performed has been to bring it to idle.
> > > >
> > > > A smp_mb__after_atomic_inc() between lines 2 and 3 would fix this.
> > > 
> > > I don't think this particular race is possible.  However, there is another one
> > > that seems to be possible (in a different function) that an explicit barrier
> > > will prevent from happening.
> > > 
> > > It's related to pm_runtime_get_noresume(), but I think it's better to put the
> > > barrier where it's necessary rather than into pm_runtime_get_noresume() itself.
> > 
> > Quoting your following mail:
> > 
> > > Actually, no.  Since rpm_idle() and rpm_suspend() both check usage_count under
> > > the spinlock, the race I was thinking about doesn't appear to be possible
> > > after all.
> > 
> > Hrm, for the extra-usage_count-under-spinlock check to work, all
> > pm_runtime_get_noresume() callers should grab and release the dev->power.lock
> > after incrementing the usage_count. This does not seem to be the case though. So
> > you might really have a race there.
> > 
> > So every code path that does:
> > 
> > 1) pm_runtime_get_noresume(dev);
> > 
> > 2) ...
> > 
> > 3) pm_runtime_put_noidle(dev);
> > 
> > expecting that the device state cannot be changed between 1 and 3 might be
> > surprised by a concurrent call to __pm_runtime_idle() that would put a device to
> > idle (or similarly with suspend) due to lack of memory barrier after the atomic
> > increment.
> > 
> > Or am I missing something else ?
> 
> First of all, the device can always be resumed regardless of the usage_count
> value.  usage_count is only used to block attempts to suspend the device and
> execute its driver's ->runtime_idle() callback after it has been resumed.
> That's why the "normal" pm_runtime_get() queues up a resume request.
> 
> IOW, the _get() only becomes meaningful after attempting to resume the device
> (which is what I tried to tell Arjan in one of the previous messages).

OK

> 
> Second, there's no synchronization between pm_runtime_get_noresume() and
> pm_runtime_suspend/idle() etc., so calling pm_runtime_get_noresume() is
> certainly insufficient to block pm_runtime_suspend/idle() regardless of memory
> barriers (there may be one already in progress when _get_noresume() is called).

Agreed, I was wondering how this was expected to work.

> To limit possible status changes from happening one should (at least) run
> pm_runtime_barrier() (surprise, no? ;-)) after pm_runtime_get_noresume().

Hrm, then why export pm_runtime_get_noresume() at all ? I don't feel comfortable
with some of the pm_runtime_get_noresume() callers.

> 
> So if you don't want to resume the device immediately after increasing its
> usage_count (in which case it's better to use pm_runtime_get_sync()), you
> should do something like this:
> 
> 1) pm_runtime_get_noresume(dev);
> 1a) pm_runtime_barrier(dev);  // That takes care of all pending requests etc.
> 
> 2) ...
>  
> 3) pm_runtime_put_noidle(dev);
> 
> [The meaning of pm_runtime_barrier() is that all of the runtime PM activity
> started before the barrier has been completed when it returns.]
> 
> There's one place in the PM core where that really is necessary, but I wouldn't
> recommend anyone doing anything like it in a driver.

grep -r pm_runtime_get_noresume drivers/    turns up very interesting info.

e.g.:

drivers/usb/core/driver.c: usb_autopm_get_interface_async()

        pm_runtime_get_noresume(&intf->dev);
        s = ACCESS_ONCE(intf->dev.power.runtime_status);
        if (s == RPM_SUSPENDING || s == RPM_SUSPENDED)
                status = pm_request_resume(&intf->dev);

How is this supposed to work ?

If the ACCESS_ONCE can be reordered before the atomic_inc(), then I fear the
device can be suspended even after the check.

My point is that get/put semantics should imply memory barriers, especially if
these are exported APIs.
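
To make the ordering concern concrete, here is a user-space sketch in C11
atomics of what a "get" with an implied full barrier could look like. It is a
hypothetical model, not the runtime PM code: sb_dev, sb_get_and_check() and
sb_try_suspend() are made-up names, and the suspend side here uses a fence
rather than dev->power.lock. With a full fence on both sides, the two sides
cannot both miss each other's store, so either the getter sees the suspend in
progress or the suspender sees the raised count.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical model of the get-then-check-status pattern. */
enum sb_status { SB_ACTIVE, SB_SUSPENDING };

struct sb_dev {
	atomic_int usage_count;
	atomic_int status;
};

/* Returns 1 if the caller saw a suspend in progress and must request a resume. */
int sb_get_and_check(struct sb_dev *d)
{
	int s;

	atomic_fetch_add_explicit(&d->usage_count, 1, memory_order_relaxed);
	/* Full barrier: the status load below cannot move before the increment. */
	atomic_thread_fence(memory_order_seq_cst);
	s = atomic_load_explicit(&d->status, memory_order_relaxed);
	return s == SB_SUSPENDING;
}

/* Returns 1 if the suspend may proceed, 0 if it backed off. */
int sb_try_suspend(struct sb_dev *d)
{
	atomic_store_explicit(&d->status, SB_SUSPENDING, memory_order_relaxed);
	/* Pairs with the fence in sb_get_and_check(). */
	atomic_thread_fence(memory_order_seq_cst);
	if (atomic_load_explicit(&d->usage_count, memory_order_relaxed) > 0) {
		/* Somebody holds a reference: undo and back off. */
		atomic_store_explicit(&d->status, SB_ACTIVE, memory_order_relaxed);
		return 0;
	}
	return 1;
}
```

Without the fence in sb_get_and_check(), the status load could be satisfied
before the increment becomes visible, which is the reordering described above.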

Thanks,

Mathieu


> 
> Thanks,
> Rafael

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [linux-pm] [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-27 10:22                         ` Rafael J. Wysocki
@ 2010-10-27 12:21                           ` Mathieu Desnoyers
  2010-10-27 14:32                             ` Alan Stern
                                               ` (4 more replies)
  2010-10-27 12:21                           ` Mathieu Desnoyers
  1 sibling, 5 replies; 157+ messages in thread
From: Mathieu Desnoyers @ 2010-10-27 12:21 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: linux-pm, Alan Stern, Paul E. McKenney, Pierre Tardy,
	Peter Zijlstra, Ingo Molnar, Jean Pihet, Steven Rostedt,
	linux-trace-users, Frank Eigler, Linus Torvalds,
	Frederic Weisbecker, Masami Hiramatsu, Tejun Heo, Andrew Morton,
	linux-omap, Arjan van de Ven, Thomas Gleixner

* Rafael J. Wysocki (rjw@sisk.pl) wrote:
> On Wednesday, October 27, 2010, Mathieu Desnoyers wrote:
> > * Rafael J. Wysocki (rjw@sisk.pl) wrote:
> > > On Tuesday, October 26, 2010, Mathieu Desnoyers wrote:
> > > > * Alan Stern (stern@rowland.harvard.edu) wrote:
> > > > > On Tue, 26 Oct 2010, Mathieu Desnoyers wrote:
> > > > > 
> > > > > > * Peter Zijlstra (peterz@infradead.org) wrote:
> > > > > > > On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
> > > > > > > > 
> > > > > > > > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> > > > > > > >         atomic_inc(&dev->power.usage_count); 
> > > > > > > 
> > > > > > > That's terribly racy..
> > > > > > 
> > > > > > Looking at the original code, it looks racy even without considering the
> > > > > > tracepoint:
> > > > > > 
> > > > > > int __pm_runtime_get(struct device *dev, bool sync)
> > > > > >  {
> > > > > >         int retval;
> > > > > > 
> > > > > > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> > > > > >         atomic_inc(&dev->power.usage_count);
> > > > > >         retval = sync ? pm_runtime_resume(dev) : pm_request_resume(dev);
> > > > > > 
> > > > > > There is no implied memory barrier after "atomic_inc". So either all these
> > > > > > inc/dec are protected with mutexes or spinlocks, in which case one might wonder
> > > > > > why atomic operations are used at all, or it's a racy mess. (I vote for the
> > > > > > second option)
> > > > > 
> > > > > I don't understand.  What's the problem?  The inc/dec are atomic 
> > > > > because they are not protected by spinlocks, but everything else is 
> > > > > (aside from the tracepoint, which is new).
> > > > > 
> > > > > > kref should certainly be used there.
> > > > > 
> > > > > What for?
> > > > 
> > > > kref has the following "get":
> > > > 
> > > >         atomic_inc(&kref->refcount);
> > > >         smp_mb__after_atomic_inc();
> > > > 
> > > > What seems to be missing in __pm_runtime_get() and pm_runtime_get_noresume() is
> > > > the memory barrier after the atomic increment. The atomic increment is free to
> > > > be reordered into the following spinlock (within pm_runtime_resume or
> > > > pm_request_resume execution) because taking a spinlock only acts as a memory
> > > > barrier with acquire semantics, not a full memory barrier.
> > > >
> > > > So AFAIU, the failure scenario would be as follows (sorry for the 80+ columns):
> > > > 
> > > > initial conditions: usage_count = 1
> > > > 
> > > > CPU A                                                       CPU B
> > > > 1) __pm_runtime_get() (sync = true)
> > > > 2)   atomic_inc(&usage_count) (not committed to memory yet)
> > > > 3)   pm_runtime_resume()
> > > > 4)     spin_lock_irqsave(&dev->power.lock, flags);
> > > > 5)     retval = __pm_request_resume(dev);
> > > 
> > > If sync = true this is
> > >            retval = __pm_runtime_resume(dev);
> > > which drops and reacquires the spinlock.
> > 
> > Let's see. Upon entry in __pm_runtime_resume, the following condition holds
> > (remember, the initial condition is that usage_count == 1):
> > 
> >   dev->power.runtime_status == RPM_ACTIVE
> > 
> > so retval is set to 1 and execution jumps directly to "out", without setting "parent".
> > So there does not seem to be any spinlock reacquire on this path, or am I
> > misunderstanding how the "runtime_status" works ?
> 
> No, you're not I think, the above is correct.  I was referring to the scenario
> in which the device was RPM_SUSPENDED initially.

Good to know I'm not losing it. ;-)

> 
> > > In the meantime it sets
> > > ->power.runtime_status so that __pm_runtime_idle() will fail if run at this
> > > point.
> > 
> > runtime_status will be left at "RPM_ACTIVE", which is the appropriate value
> > expected by __pm_runtime_idle.
> > 
> > > 
> > > > 6)     (execute the body of __pm_request_resume and return)
> > > > 7)                                                          __pm_runtime_put() (sync = true) 
> > > > 8)                                                          if (atomic_dec_and_test(&dev->power.usage_count))
> > > >                                                               (still see usage_count == 1 before decrement,
> > > >                                                                thus decrement to 0)
> > > > 9)                                                             pm_runtime_idle()
> > > > 10)  spin_unlock_irqrestore(&dev->power.lock, flags)
> > > > 11)                                                            spin_lock_irq(&dev->power.lock);
> > > > 12)                                                            retval = __pm_runtime_idle(dev);
> > > 
> > > Moreover, __pm_runtime_idle() checks ->power.usage_count under the spinlock,
> > > so it will see it's been incremented in the meantime and it will back off.
> > 
> > This is a subtle but important point. Yes, my scenario seems to be dealt with by
> > the extra usage_count check while the spinlock is held.
> > 
> > How about adding a comment under this atomic_inc() stating that the memory
> > barriers are implicitly dealt with by the following spinlock release and the
> > extra check while spinlock is held ?
> > 
> > Commenting memory barriers is important, but commenting why memory barriers are
> > not needed due to a subtle corner-case looks even more important.
> 
> Well, given that this discussion is taking place at all, I admit that it would
> be good to document this somehow. :-)

Yep, it's astonishing how a few comments can end up saving lots of emails from
confused reviewers. ;-)

> 
> I'll take care of that.
> 
> > (hrm, but more below considering pm_runtime_get_noresume())
> > 
> > > 
> > > > 13)                                                            spin_unlock_irq(&dev->power.lock);
> > > > 
> > > > So we end up in a situation where CPU A expects the device to be resumed, but
> > > > the last action performed has been to bring it to idle.
> > > >
> > > > A smp_mb__after_atomic_inc() between lines 2 and 3 would fix this.
> > > 
> > > I don't think this particular race is possible.  However, there is another one
> > > that seems to be possible (in a different function) that an explicit barrier
> > > will prevent from happening.
> > > 
> > > It's related to pm_runtime_get_noresume(), but I think it's better to put the
> > > barrier where it's necessary rather than into pm_runtime_get_noresume() itself.
> > 
> > Quoting your following mail:
> > 
> > > Actually, no.  Since rpm_idle() and rpm_suspend() both check usage_count under
> > > the spinlock, the race I was thinking about doesn't appear to be possible
> > > after all.
> > 
> > Hrm, for the extra-usage_count-under-spinlock check to work, all
> > pm_runtime_get_noresume() callers should grab and release the dev->power.lock
> > after incrementing the usage_count. This does not seem to be the case though. So
> > you might really have a race there.
> > 
> > So every code path that does:
> > 
> > 1) pm_runtime_get_noresume(dev);
> > 
> > 2) ...
> > 
> > 3) pm_runtime_put_noidle(dev);
> > 
> > expecting that the device state cannot be changed between 1 and 3 might be
> > surprised by a concurrent call to __pm_runtime_idle() that would put a device to
> > idle (or similarly with suspend) due to lack of memory barrier after the atomic
> > increment.
> > 
> > Or am I missing something else ?
> 
> First of all, the device can always be resumed regardless of the usage_count
> value.  usage_count is only used to block attempts to suspend the device and
> execute its driver's ->runtime_idle() callback after it has been resumed.
> That's why the "normal" pm_runtime_get() queues up a resume request.
> 
> IOW, the _get() only becomes meaningful after attempting to resume the device
> (which is what I tried to tell Arjan in one of the previous messages).

OK

> 
> Second, there's no synchronization between pm_runtime_get_noresume() and
> pm_runtime_suspend/idle() etc., so calling pm_runtime_get_noresume() is
> certainly insufficient to block pm_runtime_suspend/idle() regardless of memory
> barriers (there may be one already in progress when _get_noresume() is called).

Agreed, I was wondering how this was expected to work.

> To limit possible status changes from happening one should (at least) run
> pm_runtime_barrier() (surprise, no? ;-)) after pm_runtime_get_noresume().

Hrm, then why export pm_runtime_get_noresume() at all ? I don't feel comfortable
with some of the pm_runtime_get_noresume() callers.

> 
> So if you don't want to resume the device immediately after increasing its
> usage_count (in which case it's better to use pm_runtime_get_sync()), you
> should do something like this:
> 
> 1) pm_runtime_get_noresume(dev);
> 1a) pm_runtime_barrier(dev);  // That takes care of all pending requests etc.
> 
> 2) ...
>  
> 3) pm_runtime_put_noidle(dev);
> 
> [The meaning of pm_runtime_barrier() is that all of the runtime PM activity
> started before the barrier has been completed when it returns.]
> 
> There's one place in the PM core where that really is necessary, but I wouldn't
> recommend anyone doing anything like it in a driver.

grep -r pm_runtime_get_noresume drivers/    hands out very interesting info.

e.g.:

drivers/usb/core/drivers.c: usb_autopm_get_interface_async()

        pm_runtime_get_noresume(&intf->dev);
        s = ACCESS_ONCE(intf->dev.power.runtime_status);
        if (s == RPM_SUSPENDING || s == RPM_SUSPENDED)
                status = pm_request_resume(&intf->dev);

How is this supposed to work ?

If the ACCESS_ONCE can be reordered before the atomic_inc(), then I fear the
device can be suspended even after the check.

My point is that a get/put semantic should imply memory barriers, especially if
these are exported APIs.
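
To make the reordering concern concrete, here is a minimal userspace sketch
using C11 atomics instead of the kernel primitives. The names model_dev,
get_noresume_relaxed() and get_then_check() are hypothetical and exist only
for illustration. The sketch shows where a full fence would sit between the
counter increment and the status load; it does not model a suspend that is
already in progress when the increment happens.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the runtime PM status values. */
enum model_status { MODEL_ACTIVE, MODEL_SUSPENDING, MODEL_SUSPENDED };

/* Hypothetical model of the two fields discussed in the thread. */
struct model_dev {
	atomic_int usage_count;
	atomic_int status;
};

/* Bare increment with no ordering guarantee, like a plain atomic_inc(). */
void get_noresume_relaxed(struct model_dev *dev)
{
	atomic_fetch_add_explicit(&dev->usage_count, 1, memory_order_relaxed);
}

/*
 * Increment, full fence, then load the status. The fence keeps the
 * status load from being reordered before the increment.
 * Returns true when the caller would need to request a resume.
 */
bool get_then_check(struct model_dev *dev)
{
	int s;

	get_noresume_relaxed(dev);
	atomic_thread_fence(memory_order_seq_cst);
	s = atomic_load_explicit(&dev->status, memory_order_relaxed);
	return s == MODEL_SUSPENDING || s == MODEL_SUSPENDED;
}
```

The fence only orders this CPU's accesses. The suspending side would need a
matching ordering (or a lock) for the check to mean anything.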

Thanks,

Mathieu


> 
> Thanks,
> Rafael

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [linux-pm] [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-27 12:21                           ` Mathieu Desnoyers
  2010-10-27 14:32                             ` Alan Stern
  2010-10-27 14:32                             ` Alan Stern
@ 2010-10-27 14:32                             ` Alan Stern
  2010-10-28 15:22                               ` Alan Stern
  2010-10-28 15:22                               ` [linux-pm] " Alan Stern
  2010-10-27 21:43                             ` Rafael J. Wysocki
  2010-10-27 21:43                             ` Rafael J. Wysocki
  4 siblings, 2 replies; 157+ messages in thread
From: Alan Stern @ 2010-10-27 14:32 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Rafael J. Wysocki, Linux-pm mailing list, Paul E. McKenney,
	Kernel development list

[CC: list trimmed drastically, on the assumption that most of the 
people on it aren't very interested in the details of the PM runtime 
memory barriers.]

On Wed, 27 Oct 2010, Mathieu Desnoyers wrote:

> grep -r pm_runtime_get_noresume drivers/    hands out very interesting info.
> 
> e.g.:
> 
> drivers/usb/core/drivers.c: usb_autopm_get_interface_async()
> 
>         pm_runtime_get_noresume(&intf->dev);
>         s = ACCESS_ONCE(intf->dev.power.runtime_status);
>         if (s == RPM_SUSPENDING || s == RPM_SUSPENDED)
>                 status = pm_request_resume(&intf->dev);
> 
> How is this supposed to work ?

It's worth pointing out that this code is going to be removed during
the 2.6.38 development cycle, due to ongoing changes in the runtime PM
core.  It would have been removed already if not for the difficulty of
coordinating cross-subsystem changes.

But it's legitimate to ask how the code _was_ supposed to work...

> If the ACCESS_ONCE can be reordered before the atomic_inc(), then I fear the
> device can be suspended even after the check.

You are correct; the code as written may sometimes fail.  It was a
hack from the beginning; the kind of test it performs should not be
done outside the PM core.  However at the time it was the easiest way 
to do what I wanted.

> My point is that a get/put semantic should imply memory barriers, especially if
> these are exported APIs.

As far as I am aware, apart from the hack above,
pm_runtime_get_noresume is called only in places where either:

	it is purely advisory (e.g., we know that we will use the
	device in the near future so we would prefer to prevent it from
	being suspended, but we don't really care because we're going
	to call pm_runtime_resume_sync before using it anyway);

	or we already know that the usage_count is > 0.

No memory barrier is required for either of these cases.

Alan Stern


^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-27 12:21                           ` Mathieu Desnoyers
                                               ` (3 preceding siblings ...)
  2010-10-27 21:43                             ` Rafael J. Wysocki
@ 2010-10-27 21:43                             ` Rafael J. Wysocki
  4 siblings, 0 replies; 157+ messages in thread
From: Rafael J. Wysocki @ 2010-10-27 21:43 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-omap, Arjan van de Ven, Andrew Morton, Thomas Gleixner,
	Pierre Tardy, Peter Zijlstra, Frederic Weisbecker, Jean Pihet,
	Steven Rostedt, linux-trace-users, Frank Eigler,
	Masami Hiramatsu, Tejun Heo, linux-pm, Paul E. McKenney,
	Linus Torvalds, Ingo Molnar

On Wednesday, October 27, 2010, Mathieu Desnoyers wrote:
> * Rafael J. Wysocki (rjw@sisk.pl) wrote:
...
> 
> Hrm, then why export pm_runtime_get_noresume() at all ?

Basically, the PM core needs it for some obscure stuff.  Beyond that people
really should use it with care (preferably avoid using it at all).

> I don't feel comfortable with some of the pm_runtime_get_noresume() callers.
> 
> > 
> > So if you don't want to resume the device immediately after increasing its
> > usage_count (in which case it's better to use pm_runtime_get_sync()), you
> > should do something like this:
> > 
> > 1) pm_runtime_get_noresume(dev);
> > 1a) pm_runtime_barrier(dev);  // That takes care of all pending requests etc.
> > 
> > 2) ...
> >  
> > 3) pm_runtime_put_noidle(dev);
> > 
> > [The meaning of pm_runtime_barrier() is that all of the runtime PM activity
> > started before the barrier has been completed when it returns.]
> > 
> > There's one place in the PM core where that really is necessary, but I wouldn't
> > recommend anyone doing anything like it in a driver.
> 
> grep -r pm_runtime_get_noresume drivers/    hands out very interesting info.
> 
> e.g.:
> 
> drivers/usb/core/drivers.c: usb_autopm_get_interface_async()
> 
>         pm_runtime_get_noresume(&intf->dev);
>         s = ACCESS_ONCE(intf->dev.power.runtime_status);
>         if (s == RPM_SUSPENDING || s == RPM_SUSPENDED)
>                 status = pm_request_resume(&intf->dev);
> 
> How is this supposed to work ?
> 
> If the ACCESS_ONCE can be reordered before the atomic_inc(), then I fear the
> device can be suspended even after the check.
> 
> My point is that a get/put semantic should imply memory barriers, especially if
> these are exported APIs.

Well, IMO adding a memory barrier to pm_runtime_get_noresume() wouldn't really
change a lot (it still would be racy with respect to some other runtime PM helper
functions).  That said I guess we should put a "handle with care" sticker on it.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [linux-pm] [PATCH] PERF(kernel): Cleanup power events V2
  2010-10-27 14:32                             ` [linux-pm] " Alan Stern
  2010-10-28 15:22                               ` Alan Stern
@ 2010-10-28 15:22                               ` Alan Stern
  1 sibling, 0 replies; 157+ messages in thread
From: Alan Stern @ 2010-10-28 15:22 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Linux-pm mailing list, Paul E. McKenney, Kernel development list

On Wed, 27 Oct 2010, Alan Stern wrote:

> On Wed, 27 Oct 2010, Mathieu Desnoyers wrote:
> 
> > grep -r pm_runtime_get_noresume drivers/    hands out very interesting info.
> > 
> > e.g.:
> > 
> > drivers/usb/core/drivers.c: usb_autopm_get_interface_async()
> > 
> >         pm_runtime_get_noresume(&intf->dev);
> >         s = ACCESS_ONCE(intf->dev.power.runtime_status);
> >         if (s == RPM_SUSPENDING || s == RPM_SUSPENDED)
> >                 status = pm_request_resume(&intf->dev);
> > 
> > How is this supposed to work ?

> > If the ACCESS_ONCE can be reordered before the atomic_inc(), then I fear the
> > device can be suspended even after the check.
> 
> You are correct; the code as written may sometimes fail.  It was a
> hack from the beginning; the kind of test it performs should not be
> done outside the PM core.  However at the time it was the easiest way 
> to do what I wanted.

I forgot to mention one other thing...  The fact that this code will
sometimes behave unexpectedly isn't a bug.  That function is documented
as requiring additional locking when a driver uses it.  The need for
extra locking is unavoidable because I/O requests can arrive at any
time, even while a runtime suspend is in progress.

Therefore the fact that usb_autopm_get_interface_async() can race with 
a runtime suspend doesn't matter.  The driver making the call should 
have sufficient locking to know that the runtime suspend should fail 
because the driver is busy.
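
The driver-side locking described above can be pictured with a small
userspace model. Everything in it is hypothetical (model_drv,
model_submit_io(), model_complete_io(), model_runtime_suspend()), and a
pthread mutex stands in for whatever lock a real driver would use. The only
point is that the suspend path checks a busy count under the same lock the
I/O path takes, so a racing suspend fails instead of succeeding.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical driver state, protected by 'lock'. */
struct model_drv {
	pthread_mutex_t lock;
	int io_in_flight;
	bool suspended;
};

void model_init(struct model_drv *drv)
{
	pthread_mutex_init(&drv->lock, NULL);
	drv->io_in_flight = 0;
	drv->suspended = false;
}

/* I/O path: returns 0, or -EAGAIN if the device is already suspended. */
int model_submit_io(struct model_drv *drv)
{
	int ret = 0;

	pthread_mutex_lock(&drv->lock);
	if (drv->suspended)
		ret = -EAGAIN;
	else
		drv->io_in_flight++;
	pthread_mutex_unlock(&drv->lock);
	return ret;
}

/* Completion path: one outstanding request has finished. */
void model_complete_io(struct model_drv *drv)
{
	pthread_mutex_lock(&drv->lock);
	drv->io_in_flight--;
	pthread_mutex_unlock(&drv->lock);
}

/* Suspend path: refuse with -EBUSY while I/O is outstanding. */
int model_runtime_suspend(struct model_drv *drv)
{
	int ret = 0;

	pthread_mutex_lock(&drv->lock);
	if (drv->io_in_flight > 0)
		ret = -EBUSY;
	else
		drv->suspended = true;
	pthread_mutex_unlock(&drv->lock);
	return ret;
}
```

In this model the outcome of the race is decided by who takes the lock
first, not by the usage counter.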

Alan Stern


^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-11-18 16:34                   ` Jean Pihet
@ 2010-11-19  0:14                     ` Thomas Renninger
  0 siblings, 0 replies; 157+ messages in thread
From: Thomas Renninger @ 2010-11-19  0:14 UTC (permalink / raw)
  To: Jean Pihet; +Cc: Ingo Molnar, rjw, linux-kernel, arjan

On Thursday 18 November 2010 05:34:15 pm Jean Pihet wrote:
> On Thu, Nov 18, 2010 at 11:52 AM, Ingo Molnar <mingo@elte.hu> wrote:
> >
...
> The problem is that power.h gets included multiple times, so
> the POWER_ enum and the empty deprecated functions need to be
> protected from that.
> 
> Here is a patch below that fixes it, compile tested with and without
> CONFIG_EVENT_POWER_TRACING_DEPRECATED set.
Yep, that should be the correct fix.
While I tested both options before, after the pre-processor mess ups, it
looks like I did test a lot of different archs/flavors through our
build service, but all .configs were set by default to
CONFIG_EVENT_POWER_TRACING_DEPRECATED=y
Stupid, sorry about that.
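
For readers unfamiliar with the failure mode: trace event headers are read
more than once (see the TRACE_HEADER_MULTI_READ comment in the patch below),
so plain C definitions inside them need their own guard. The sketch below
simulates two reads of the same header body in a single file. The MODEL_
names are hypothetical and are not the identifiers used in power.h.

```c
/* First "read" of the header body: the guard is not set yet. */
#ifndef MODEL_PWR_AVOID_DOUBLE_DEFINING
#define MODEL_PWR_AVOID_DOUBLE_DEFINING
enum {
	MODEL_POWER_NONE = 0,
	MODEL_POWER_CSTATE = 1,
	MODEL_POWER_PSTATE = 2,
};

/* Empty stub, similar in spirit to the deprecated dummy functions. */
static inline void model_trace_power_end(unsigned long long cpuid)
{
	(void)cpuid;
}
#endif

/*
 * Second "read" of the same body: skipped because the guard is set.
 * Without the guard the enumerators and the stub would be redefined,
 * which is a compile error.
 */
#ifndef MODEL_PWR_AVOID_DOUBLE_DEFINING
#define MODEL_PWR_AVOID_DOUBLE_DEFINING
enum {
	MODEL_POWER_NONE = 0,
	MODEL_POWER_CSTATE = 1,
	MODEL_POWER_PSTATE = 2,
};

static inline void model_trace_power_end(unsigned long long cpuid)
{
	(void)cpuid;
}
#endif
```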
 
> Ingo, Thomas, please let me know if you want me to refresh the patches
> with that fix.
I'll add it to the end, based on the one Ingo sent.
Ingo: As you started fiddling with it, is that enough or do you prefer
another whole patch series resend?

...

> > From b989c51b6f1989a834eecd9a64a7bd52ed230ea0 Mon Sep 17 00:00:00 2001
> > From: Thomas Renninger <trenn@suse.de>
> > Date: Thu, 18 Nov 2010 10:25:12 +0100
> > Subject: [PATCH] perf: Do not export power_frequency, but power_start
> > event 
This is an independent fix, you can just push it.

Thanks Jean/Ingo, hope that was the last remaining issue...

          Thomas

--------
perf: Clean up power events

Add these new power trace events:

 power:cpu_idle
 power:cpu_frequency
 power:machine_suspend

The old C-state/idle accounting events:
  power:power_start
  power:power_end

now have a replacement (but we are still keeping the old
tracepoints for compatibility):

  power:cpu_idle

and
  power:power_frequency

is replaced with:
  power:cpu_frequency

power:machine_suspend is newly introduced.

Jean Pihet has a patch integrated into the generic layer
(kernel/power/suspend.c) which will make use of it.

The type= field was removed from both; it was never used, and the
type is already distinguished by the event itself.

perf timechart userspace tool gets adjusted in a separate patch.

Signed-off-by: Thomas Renninger <trenn@suse.de>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Jean Pihet <jean.pihet@newoldbits.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: rjw@sisk.pl
LKML-Reference: <1290072314-31155-2-git-send-email-trenn@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 57d1868..155d975 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -374,6 +374,7 @@ void default_idle(void)
 {
 	if (hlt_use_halt()) {
 		trace_power_start(POWER_CSTATE, 1, smp_processor_id());
+		trace_cpu_idle(1, smp_processor_id());
 		current_thread_info()->status &= ~TS_POLLING;
 		/*
 		 * TS_POLLING-cleared state must be visible before we
@@ -444,6 +445,7 @@ EXPORT_SYMBOL_GPL(cpu_idle_wait);
 void mwait_idle_with_hints(unsigned long ax, unsigned long cx)
 {
 	trace_power_start(POWER_CSTATE, (ax>>4)+1, smp_processor_id());
+	trace_cpu_idle((ax>>4)+1, smp_processor_id());
 	if (!need_resched()) {
 		if (cpu_has(&current_cpu_data, X86_FEATURE_CLFLUSH_MONITOR))
 			clflush((void *)&current_thread_info()->flags);
@@ -460,6 +462,7 @@ static void mwait_idle(void)
 {
 	if (!need_resched()) {
 		trace_power_start(POWER_CSTATE, 1, smp_processor_id());
+		trace_cpu_idle(1, smp_processor_id());
 		if (cpu_has(&current_cpu_data, X86_FEATURE_CLFLUSH_MONITOR))
 			clflush((void *)&current_thread_info()->flags);
 
@@ -481,10 +484,12 @@ static void mwait_idle(void)
 static void poll_idle(void)
 {
 	trace_power_start(POWER_CSTATE, 0, smp_processor_id());
+	trace_cpu_idle(0, smp_processor_id());
 	local_irq_enable();
 	while (!need_resched())
 		cpu_relax();
-	trace_power_end(0);
+	trace_power_end(smp_processor_id());
+	trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
 }
 
 /*
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 96586c3..4b9befa 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -113,8 +113,8 @@ void cpu_idle(void)
 			stop_critical_timings();
 			pm_idle();
 			start_critical_timings();
-
 			trace_power_end(smp_processor_id());
+			trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
 		}
 		tick_nohz_restart_sched_tick();
 		preempt_enable_no_resched();
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index b3d7a3a..4c818a7 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -142,6 +142,8 @@ void cpu_idle(void)
 			start_critical_timings();
 
 			trace_power_end(smp_processor_id());
+			trace_cpu_idle(PWR_EVENT_EXIT,
+				       smp_processor_id());
 
 			/* In many cases the interrupt that ended idle
 			   has already called exit_idle. But some idle
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index c63a438..1109f68 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -355,6 +355,7 @@ void cpufreq_notify_transition(struct cpufreq_freqs *freqs, unsigned int state)
 		dprintk("FREQ: %lu - CPU: %lu", (unsigned long)freqs->new,
 			(unsigned long)freqs->cpu);
 		trace_power_frequency(POWER_PSTATE, freqs->new, freqs->cpu);
+		trace_cpu_frequency(freqs->new, freqs->cpu);
 		srcu_notifier_call_chain(&cpufreq_transition_notifier_list,
 				CPUFREQ_POSTCHANGE, freqs);
 		if (likely(policy) && likely(policy->cpu == freqs->cpu))
diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index a507108..08d5f05 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -107,6 +107,7 @@ static void cpuidle_idle_call(void)
 	if (cpuidle_curr_governor->reflect)
 		cpuidle_curr_governor->reflect(dev);
 	trace_power_end(smp_processor_id());
+	trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
 }
 
 /**
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index 3c95325..ba5134f 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -221,6 +221,7 @@ static int intel_idle(struct cpuidle_device *dev, struct cpuidle_state *state)
 
 	stop_critical_timings();
 	trace_power_start(POWER_CSTATE, (eax >> 4) + 1, cpu);
+	trace_cpu_idle((eax >> 4) + 1, cpu);
 	if (!need_resched()) {
 
 		__monitor((void *)&current_thread_info()->flags, 0, 0);
diff --git a/include/trace/events/power.h b/include/trace/events/power.h
index 286784d..46596ad 100644
--- a/include/trace/events/power.h
+++ b/include/trace/events/power.h
@@ -7,16 +7,67 @@
 #include <linux/ktime.h>
 #include <linux/tracepoint.h>
 
-#ifndef _TRACE_POWER_ENUM_
-#define _TRACE_POWER_ENUM_
-enum {
-	POWER_NONE	= 0,
-	POWER_CSTATE	= 1,	/* C-State */
-	POWER_PSTATE	= 2,	/* Fequency change or DVFS */
-	POWER_SSTATE	= 3,	/* Suspend */
-};
+DECLARE_EVENT_CLASS(cpu,
+
+	TP_PROTO(unsigned int state, unsigned int cpu_id),
+
+	TP_ARGS(state, cpu_id),
+
+	TP_STRUCT__entry(
+		__field(	u32,		state		)
+		__field(	u32,		cpu_id		)
+	),
+
+	TP_fast_assign(
+		__entry->state = state;
+		__entry->cpu_id = cpu_id;
+	),
+
+	TP_printk("state=%lu cpu_id=%lu", (unsigned long)__entry->state,
+		  (unsigned long)__entry->cpu_id)
+);
+
+DEFINE_EVENT(cpu, cpu_idle,
+
+	TP_PROTO(unsigned int state, unsigned int cpu_id),
+
+	TP_ARGS(state, cpu_id)
+);
+
+/* This file can get included multiple times, TRACE_HEADER_MULTI_READ at top */
+#ifndef _PWR_EVENT_AVOID_DOUBLE_DEFINING
+#define _PWR_EVENT_AVOID_DOUBLE_DEFINING
+
+#define PWR_EVENT_EXIT -1
 #endif
 
+DEFINE_EVENT(cpu, cpu_frequency,
+
+	TP_PROTO(unsigned int frequency, unsigned int cpu_id),
+
+	TP_ARGS(frequency, cpu_id)
+);
+
+TRACE_EVENT(machine_suspend,
+
+	TP_PROTO(unsigned int state),
+
+	TP_ARGS(state),
+
+	TP_STRUCT__entry(
+		__field(	u32,		state		)
+	),
+
+	TP_fast_assign(
+		__entry->state = state;
+	),
+
+	TP_printk("state=%lu", (unsigned long)__entry->state)
+);
+
+/* This code will be removed after deprecation time exceeded (2.6.41) */
+#ifdef CONFIG_EVENT_POWER_TRACING_DEPRECATED
+
 /*
  * The power events are used for cpuidle & suspend (power_start, power_end)
  *  and for cpufreq (power_frequency)
@@ -75,6 +126,35 @@ TRACE_EVENT(power_end,
 
 );
 
+/* Deprecated dummy functions must be protected against multiple declaration */
+#ifndef _PWR_EVENT_AVOID_DOUBLE_DEFINING_DEPRECATED
+#define _PWR_EVENT_AVOID_DOUBLE_DEFINING_DEPRECATED
+
+enum {
+	POWER_NONE = 0,
+	POWER_CSTATE = 1,
+	POWER_PSTATE = 2,
+};
+#endif /* _PWR_EVENT_AVOID_DOUBLE_DEFINING_DEPRECATED */
+
+#else /* CONFIG_EVENT_POWER_TRACING_DEPRECATED */
+
+#ifndef _PWR_EVENT_AVOID_DOUBLE_DEFINING_DEPRECATED
+#define _PWR_EVENT_AVOID_DOUBLE_DEFINING_DEPRECATED
+enum {
+	POWER_NONE = 0,
+	POWER_CSTATE = 1,
+	POWER_PSTATE = 2,
+};
+
+/* These dummy declarations have to be ripped out when the deprecated
+   events get removed */
+static inline void trace_power_start(u64 type, u64 state, u64 cpuid) {};
+static inline void trace_power_end(u64 cpuid) {};
+static inline void trace_power_frequency(u64 type, u64 state, u64 cpuid) {};
+#endif /* _PWR_EVENT_AVOID_DOUBLE_DEFINING_DEPRECATED */
+#endif /* CONFIG_EVENT_POWER_TRACING_DEPRECATED */
+
 /*
  * The clock events are used for clock enable/disable and for
  *  clock rate change
@@ -153,7 +233,6 @@ DEFINE_EVENT(power_domain, power_domain_target,
 
 	TP_ARGS(name, state, cpu_id)
 );
-
 #endif /* _TRACE_POWER_H */
 
 /* This part must be outside protection */
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index ea37e2f..14674dc 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -69,6 +69,21 @@ config EVENT_TRACING
 	select CONTEXT_SWITCH_TRACER
 	bool
 
+config EVENT_POWER_TRACING_DEPRECATED
+	depends on EVENT_TRACING
+	bool "Deprecated power event trace API, to be removed"
+	default y
+	help
+	  Provides old power event types:
+	  C-state/idle accounting events:
+	  power:power_start
+	  power:power_end
+	  and old cpufreq accounting event:
+	  power:power_frequency
+	  This is for userspace compatibility
+	  and will vanish after 5 kernel iterations,
+	  namely 2.6.41.
+
 config CONTEXT_SWITCH_TRACER
 	bool
 
diff --git a/kernel/trace/power-traces.c b/kernel/trace/power-traces.c
index 0e0497d..f55fcf6 100644
--- a/kernel/trace/power-traces.c
+++ b/kernel/trace/power-traces.c
@@ -13,5 +13,8 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/power.h>
 
+#ifdef CONFIG_EVENT_POWER_TRACING_DEPRECATED
 EXPORT_TRACEPOINT_SYMBOL_GPL(power_start);
+#endif
+EXPORT_TRACEPOINT_SYMBOL_GPL(cpu_idle);
 


* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-11-18 10:52                 ` Ingo Molnar
@ 2010-11-18 16:34                   ` Jean Pihet
  2010-11-19  0:14                     ` Thomas Renninger
  0 siblings, 1 reply; 157+ messages in thread
From: Jean Pihet @ 2010-11-18 16:34 UTC (permalink / raw)
  To: Ingo Molnar, Thomas Renninger; +Cc: rjw, linux-kernel, arjan

On Thu, Nov 18, 2010 at 11:52 AM, Ingo Molnar <mingo@elte.hu> wrote:
>
> I am also getting build failures:
>
> drivers/cpufreq/cpufreq.c:357: error: 'POWER_PSTATE' undeclared (first use in this function)
> drivers/cpufreq/cpufreq.c:357: error: (Each undeclared identifier is reported only once
> drivers/cpufreq/cpufreq.c:357: error: for each function it appears in.)
> arch/x86/kernel/process.c:375: error: 'POWER_CSTATE' undeclared (first use in this function)
> arch/x86/kernel/process.c:375: error: (Each undeclared identifier is reported only once
> arch/x86/kernel/process.c:375: error: for each function it appears in.)
> arch/x86/kernel/process.c:446: error: 'POWER_CSTATE' undeclared (first use in this function)
> arch/x86/kernel/process.c:463: error: 'POWER_CSTATE' undeclared (first use in this function)
> arch/x86/kernel/process.c:485: error: 'POWER_CSTATE' undeclared (first use in this function)
> include/trace/events/power.h:142: error: redefinition of 'trace_power_start'
>
> Config attached.

The problem is that power.h gets included multiple times, so the
POWER_ enum and the empty deprecated functions need to be
protected against that.
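
A minimal sketch of the failure mode, not kernel code: the two guarded
blocks below stand in for two passes over the same header body. All
demo_/DEMO_ names are made up for illustration, and only the "option
disabled" branch (enum plus empty stub) is modelled. Without the guard,
the second pass would redeclare the enumerators and redefine the stub,
which is the 'redefinition of trace_power_start' error quoted below.

```c
#include <stdint.h>

/* ---- first pass over the "header" body ---- */
#ifndef DEMO_PWR_EVENT_AVOID_DOUBLE_DEFINING
#define DEMO_PWR_EVENT_AVOID_DOUBLE_DEFINING
enum {
	POWER_NONE = 0,
	POWER_CSTATE = 1,
	POWER_PSTATE = 2,
};

/* Empty stub standing in for the deprecated tracepoint (hypothetical name). */
static inline void demo_trace_power_start(uint64_t type, uint64_t state,
					  uint64_t cpuid)
{
	(void)type;
	(void)state;
	(void)cpuid;
}
#endif /* DEMO_PWR_EVENT_AVOID_DOUBLE_DEFINING */

/* ---- second pass: skipped entirely, because the guard is already set ---- */
#ifndef DEMO_PWR_EVENT_AVOID_DOUBLE_DEFINING
#define DEMO_PWR_EVENT_AVOID_DOUBLE_DEFINING
enum {
	POWER_NONE = 0,
	POWER_CSTATE = 1,
	POWER_PSTATE = 2,
};

static inline void demo_trace_power_start(uint64_t type, uint64_t state,
					  uint64_t cpuid)
{
	(void)type;
	(void)state;
	(void)cpuid;
}
#endif /* DEMO_PWR_EVENT_AVOID_DOUBLE_DEFINING */

/* A caller that needs both the enum and the stub, like the idle code. */
int demo_idle_entry(void)
{
	demo_trace_power_start(POWER_CSTATE, 1, 0);
	return POWER_CSTATE;
}
```

The enum has to sit inside the same guard as the stubs in the disabled
branch; otherwise callers that still pass POWER_CSTATE or POWER_PSTATE
hit the "undeclared" errors instead.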

Here is a patch below that fixes it, compile tested with and without
CONFIG_EVENT_POWER_TRACING_DEPRECATED set.

Ingo, Thomas, please let me know if you want me to refresh the patches
with that fix.

diff --git a/include/trace/events/power.h b/include/trace/events/power.h
index 00d9819..89db5a1 100644
--- a/include/trace/events/power.h
+++ b/include/trace/events/power.h
@@ -136,12 +136,24 @@ enum {
        POWER_PSTATE = 2,
 };
 #endif /* _PWR_EVENT_AVOID_DOUBLE_DEFINING_DEPRECATED */
-#else
+
+#else /* CONFIG_EVENT_POWER_TRACING_DEPRECATED */
+
+#ifndef _PWR_EVENT_AVOID_DOUBLE_DEFINING_DEPRECATED
+#define _PWR_EVENT_AVOID_DOUBLE_DEFINING_DEPRECATED
+enum {
+       POWER_NONE = 0,
+       POWER_CSTATE = 1,
+       POWER_PSTATE = 2,
+};
+
 /* These dummy declaration have to be ripped out when the deprecated
    events get removed */
 static inline void trace_power_start(u64 type, u64 state, u64 cpuid) {};
 static inline void trace_power_end(u64 cpuid) {};
 static inline void trace_power_frequency(u64 type, u64 state, u64 cpuid) {};
+#endif /* _PWR_EVENT_AVOID_DOUBLE_DEFINING_DEPRECATED */
+
 #endif /* CONFIG_EVENT_POWER_TRACING_DEPRECATED */

 /*

Thanks,
Jean

>
> Note: please reuse the two commits from below for further work, i did some small
> cleanups to the commit text and to the patches.
>
> Thanks,
>
>        Ingo
>
> ---------------->
> From 87a2cfbda3f53c3bf00c424ce18d97b03b0c3aa0 Mon Sep 17 00:00:00 2001
> From: Thomas Renninger <trenn@suse.de>
> Date: Thu, 18 Nov 2010 10:25:13 +0100
> Subject: [PATCH] perf: Clean up power events
>
> Add these new power trace events:
>
>  power:cpu_idle
>  power:cpu_frequency
>  power:machine_suspend
>
> The old C-state/idle accounting events:
>  power:power_start
>  power:power_end
>
> Now have a replacement (but we are still keeping the old
> tracepoints for compatibility):
>
>  power:cpu_idle
>
> and
>  power:power_frequency
>
> is replaced with:
>  power:cpu_frequency
>
> power:machine_suspend is newly introduced.
>
> Jean Pihet has a patch integrated into the generic layer
> (kernel/power/suspend.c) which will make use of it.
>
> The type= field got removed from both; it was never
> used, and the type is distinguished by the event type itself.
>
> perf timechart userspace tool gets adjusted in a separate patch.
>
> Signed-off-by: Thomas Renninger <trenn@suse.de>
> Acked-by: Arjan van de Ven <arjan@linux.intel.com>
> Acked-by: Jean Pihet <jean.pihet@newoldbits.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: rjw@sisk.pl
> LKML-Reference: <1290072314-31155-2-git-send-email-trenn@suse.de>
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
> ---
>  arch/x86/kernel/process.c    |    7 +++-
>  arch/x86/kernel/process_32.c |    2 +-
>  arch/x86/kernel/process_64.c |    2 +
>  drivers/cpufreq/cpufreq.c    |    1 +
>  drivers/cpuidle/cpuidle.c    |    1 +
>  drivers/idle/intel_idle.c    |    1 +
>  include/trace/events/power.h |   86 +++++++++++++++++++++++++++++++++++++----
>  kernel/trace/Kconfig         |   15 +++++++
>  kernel/trace/power-traces.c  |    3 +
>  9 files changed, 107 insertions(+), 11 deletions(-)
>
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index 57d1868..155d975 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -374,6 +374,7 @@ void default_idle(void)
>  {
>        if (hlt_use_halt()) {
>                trace_power_start(POWER_CSTATE, 1, smp_processor_id());
> +               trace_cpu_idle(1, smp_processor_id());
>                current_thread_info()->status &= ~TS_POLLING;
>                /*
>                 * TS_POLLING-cleared state must be visible before we
> @@ -444,6 +445,7 @@ EXPORT_SYMBOL_GPL(cpu_idle_wait);
>  void mwait_idle_with_hints(unsigned long ax, unsigned long cx)
>  {
>        trace_power_start(POWER_CSTATE, (ax>>4)+1, smp_processor_id());
> +       trace_cpu_idle((ax>>4)+1, smp_processor_id());
>        if (!need_resched()) {
>                if (cpu_has(&current_cpu_data, X86_FEATURE_CLFLUSH_MONITOR))
>                        clflush((void *)&current_thread_info()->flags);
> @@ -460,6 +462,7 @@ static void mwait_idle(void)
>  {
>        if (!need_resched()) {
>                trace_power_start(POWER_CSTATE, 1, smp_processor_id());
> +               trace_cpu_idle(1, smp_processor_id());
>                if (cpu_has(&current_cpu_data, X86_FEATURE_CLFLUSH_MONITOR))
>                        clflush((void *)&current_thread_info()->flags);
>
> @@ -481,10 +484,12 @@ static void mwait_idle(void)
>  static void poll_idle(void)
>  {
>        trace_power_start(POWER_CSTATE, 0, smp_processor_id());
> +       trace_cpu_idle(0, smp_processor_id());
>        local_irq_enable();
>        while (!need_resched())
>                cpu_relax();
> -       trace_power_end(0);
> +       trace_power_end(smp_processor_id());
> +       trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
>  }
>
>  /*
> diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
> index 96586c3..4b9befa 100644
> --- a/arch/x86/kernel/process_32.c
> +++ b/arch/x86/kernel/process_32.c
> @@ -113,8 +113,8 @@ void cpu_idle(void)
>                        stop_critical_timings();
>                        pm_idle();
>                        start_critical_timings();
> -
>                        trace_power_end(smp_processor_id());
> +                       trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
>                }
>                tick_nohz_restart_sched_tick();
>                preempt_enable_no_resched();
> diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
> index b3d7a3a..4c818a7 100644
> --- a/arch/x86/kernel/process_64.c
> +++ b/arch/x86/kernel/process_64.c
> @@ -142,6 +142,8 @@ void cpu_idle(void)
>                        start_critical_timings();
>
>                        trace_power_end(smp_processor_id());
> +                       trace_cpu_idle(PWR_EVENT_EXIT,
> +                                      smp_processor_id());
>
>                        /* In many cases the interrupt that ended idle
>                           has already called exit_idle. But some idle
> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> index c63a438..1109f68 100644
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -355,6 +355,7 @@ void cpufreq_notify_transition(struct cpufreq_freqs *freqs, unsigned int state)
>                dprintk("FREQ: %lu - CPU: %lu", (unsigned long)freqs->new,
>                        (unsigned long)freqs->cpu);
>                trace_power_frequency(POWER_PSTATE, freqs->new, freqs->cpu);
> +               trace_cpu_frequency(freqs->new, freqs->cpu);
>                srcu_notifier_call_chain(&cpufreq_transition_notifier_list,
>                                CPUFREQ_POSTCHANGE, freqs);
>                if (likely(policy) && likely(policy->cpu == freqs->cpu))
> diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
> index a507108..08d5f05 100644
> --- a/drivers/cpuidle/cpuidle.c
> +++ b/drivers/cpuidle/cpuidle.c
> @@ -107,6 +107,7 @@ static void cpuidle_idle_call(void)
>        if (cpuidle_curr_governor->reflect)
>                cpuidle_curr_governor->reflect(dev);
>        trace_power_end(smp_processor_id());
> +       trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
>  }
>
>  /**
> diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
> index 3c95325..ba5134f 100644
> --- a/drivers/idle/intel_idle.c
> +++ b/drivers/idle/intel_idle.c
> @@ -221,6 +221,7 @@ static int intel_idle(struct cpuidle_device *dev, struct cpuidle_state *state)
>
>        stop_critical_timings();
>        trace_power_start(POWER_CSTATE, (eax >> 4) + 1, cpu);
> +       trace_cpu_idle((eax >> 4) + 1, cpu);
>        if (!need_resched()) {
>
>                __monitor((void *)&current_thread_info()->flags, 0, 0);
> diff --git a/include/trace/events/power.h b/include/trace/events/power.h
> index 286784d..00d9819 100644
> --- a/include/trace/events/power.h
> +++ b/include/trace/events/power.h
> @@ -7,16 +7,67 @@
>  #include <linux/ktime.h>
>  #include <linux/tracepoint.h>
>
> -#ifndef _TRACE_POWER_ENUM_
> -#define _TRACE_POWER_ENUM_
> -enum {
> -       POWER_NONE      = 0,
> -       POWER_CSTATE    = 1,    /* C-State */
> -       POWER_PSTATE    = 2,    /* Fequency change or DVFS */
> -       POWER_SSTATE    = 3,    /* Suspend */
> -};
> +DECLARE_EVENT_CLASS(cpu,
> +
> +       TP_PROTO(unsigned int state, unsigned int cpu_id),
> +
> +       TP_ARGS(state, cpu_id),
> +
> +       TP_STRUCT__entry(
> +               __field(        u32,            state           )
> +               __field(        u32,            cpu_id          )
> +       ),
> +
> +       TP_fast_assign(
> +               __entry->state = state;
> +               __entry->cpu_id = cpu_id;
> +       ),
> +
> +       TP_printk("state=%lu cpu_id=%lu", (unsigned long)__entry->state,
> +                 (unsigned long)__entry->cpu_id)
> +);
> +
> +DEFINE_EVENT(cpu, cpu_idle,
> +
> +       TP_PROTO(unsigned int state, unsigned int cpu_id),
> +
> +       TP_ARGS(state, cpu_id)
> +);
> +
> +/* This file can get included multiple times, TRACE_HEADER_MULTI_READ at top */
> +#ifndef _PWR_EVENT_AVOID_DOUBLE_DEFINING
> +#define _PWR_EVENT_AVOID_DOUBLE_DEFINING
> +
> +#define PWR_EVENT_EXIT -1
>  #endif
>
> +DEFINE_EVENT(cpu, cpu_frequency,
> +
> +       TP_PROTO(unsigned int frequency, unsigned int cpu_id),
> +
> +       TP_ARGS(frequency, cpu_id)
> +);
> +
> +TRACE_EVENT(machine_suspend,
> +
> +       TP_PROTO(unsigned int state),
> +
> +       TP_ARGS(state),
> +
> +       TP_STRUCT__entry(
> +               __field(        u32,            state           )
> +       ),
> +
> +       TP_fast_assign(
> +               __entry->state = state;
> +       ),
> +
> +       TP_printk("state=%lu", (unsigned long)__entry->state)
> +);
> +
> +/* This code will be removed after deprecation time exceeded (2.6.41) */
> +#ifdef CONFIG_EVENT_POWER_TRACING_DEPRECATED
> +
>  /*
>  * The power events are used for cpuidle & suspend (power_start, power_end)
>  *  and for cpufreq (power_frequency)
> @@ -75,6 +126,24 @@ TRACE_EVENT(power_end,
>
>  );
>
> +/* Deprecated dummy functions must be protected against multi-declaration */
> +#ifndef _PWR_EVENT_AVOID_DOUBLE_DEFINING_DEPRECATED
> +#define _PWR_EVENT_AVOID_DOUBLE_DEFINING_DEPRECATED
> +
> +enum {
> +       POWER_NONE = 0,
> +       POWER_CSTATE = 1,
> +       POWER_PSTATE = 2,
> +};
> +#endif /* _PWR_EVENT_AVOID_DOUBLE_DEFINING_DEPRECATED */
> +#else
> +/* These dummy declaration have to be ripped out when the deprecated
> +   events get removed */
> +static inline void trace_power_start(u64 type, u64 state, u64 cpuid) {};
> +static inline void trace_power_end(u64 cpuid) {};
> +static inline void trace_power_frequency(u64 type, u64 state, u64 cpuid) {};
> +#endif /* CONFIG_EVENT_POWER_TRACING_DEPRECATED */
> +
>  /*
>  * The clock events are used for clock enable/disable and for
>  *  clock rate change
> @@ -153,7 +222,6 @@ DEFINE_EVENT(power_domain, power_domain_target,
>
>        TP_ARGS(name, state, cpu_id)
>  );
> -
>  #endif /* _TRACE_POWER_H */
>
>  /* This part must be outside protection */
> diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
> index e04b8bc..59b44a1 100644
> --- a/kernel/trace/Kconfig
> +++ b/kernel/trace/Kconfig
> @@ -69,6 +69,21 @@ config EVENT_TRACING
>        select CONTEXT_SWITCH_TRACER
>        bool
>
> +config EVENT_POWER_TRACING_DEPRECATED
> +       depends on EVENT_TRACING
> +       bool "Deprecated power event trace API, to be removed"
> +       default y
> +       help
> +         Provides old power event types:
> +         C-state/idle accounting events:
> +         power:power_start
> +         power:power_end
> +         and old cpufreq accounting event:
> +         power:power_frequency
> +         This is for userspace compatibility
> +         and will vanish after 5 kernel iterations,
> +         namely 2.6.41.
> +
>  config CONTEXT_SWITCH_TRACER
>        bool
>
> diff --git a/kernel/trace/power-traces.c b/kernel/trace/power-traces.c
> index 0e0497d..f55fcf6 100644
> --- a/kernel/trace/power-traces.c
> +++ b/kernel/trace/power-traces.c
> @@ -13,5 +13,8 @@
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/power.h>
>
> +#ifdef CONFIG_EVENT_POWER_TRACING_DEPRECATED
>  EXPORT_TRACEPOINT_SYMBOL_GPL(power_start);
> +#endif
> +EXPORT_TRACEPOINT_SYMBOL_GPL(cpu_idle);
>
>
> From b989c51b6f1989a834eecd9a64a7bd52ed230ea0 Mon Sep 17 00:00:00 2001
> From: Thomas Renninger <trenn@suse.de>
> Date: Thu, 18 Nov 2010 10:25:12 +0100
> Subject: [PATCH] perf: Do not export power_frequency, but power_start event
>
> power_frequency moved to drivers/cpufreq/cpufreq.c which has
> to be compiled in, no need to export it.
>
> intel_idle can be a module though...
>
> Signed-off-by: Thomas Renninger <trenn@suse.de>
> Acked-by: Jean Pihet <jean.pihet@newoldbits.com>
> CC: Arjan van de Ven <arjan@linux.intel.com>
> Cc: rjw@sisk.pl
> LKML-Reference: <1290072314-31155-2-git-send-email-trenn@suse.de>
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
> ---
>  drivers/idle/intel_idle.c   |    2 --
>  kernel/trace/power-traces.c |    2 +-
>  2 files changed, 1 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
> index 41665d2..3c95325 100644
> --- a/drivers/idle/intel_idle.c
> +++ b/drivers/idle/intel_idle.c
> @@ -220,9 +220,7 @@ static int intel_idle(struct cpuidle_device *dev, struct cpuidle_state *state)
>        kt_before = ktime_get_real();
>
>        stop_critical_timings();
> -#ifndef MODULE
>        trace_power_start(POWER_CSTATE, (eax >> 4) + 1, cpu);
> -#endif
>        if (!need_resched()) {
>
>                __monitor((void *)&current_thread_info()->flags, 0, 0);
> diff --git a/kernel/trace/power-traces.c b/kernel/trace/power-traces.c
> index a22582a..0e0497d 100644
> --- a/kernel/trace/power-traces.c
> +++ b/kernel/trace/power-traces.c
> @@ -13,5 +13,5 @@
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/power.h>
>
> -EXPORT_TRACEPOINT_SYMBOL_GPL(power_frequency);
> +EXPORT_TRACEPOINT_SYMBOL_GPL(power_start);
>
>


* [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-11-18 13:01 Power trace event cleanup by still providing old interface for some time Thomas Renninger
@ 2010-11-18 13:01 ` Thomas Renninger
  0 siblings, 0 replies; 157+ messages in thread
From: Thomas Renninger @ 2010-11-18 13:01 UTC (permalink / raw)
  To: j-pihet, arjan, mingo, linux-kernel, trenn; +Cc: rjw

Recent changes:
  - Enable EVENT_POWER_TRACING_DEPRECATED by default

New power trace events:
power:cpu_idle
power:cpu_frequency
power:machine_suspend


C-state/idle accounting events:
  power:power_start
  power:power_end
are replaced with:
  power:cpu_idle

and
  power:power_frequency
is replaced with:
  power:cpu_frequency

power:machine_suspend
is newly introduced.
Jean Pihet has a patch integrated into the generic layer
(kernel/power/suspend.c) which will make use of it.

The type= field got removed from both; it was never
used, and the type is distinguished by the event type itself.

perf timechart
userspace tool gets adjusted in a separate patch.

Signed-off-by: Thomas Renninger <trenn@suse.de>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Jean Pihet <j-pihet@ti.com>
CC: Jean Pihet <j-pihet@ti.com>
CC: Arjan van de Ven <arjan@linux.intel.com>
CC: Ingo Molnar <mingo@elte.hu>
CC: rjw@sisk.pl
---
 arch/x86/kernel/process.c    |    7 +++-
 arch/x86/kernel/process_32.c |    2 +-
 arch/x86/kernel/process_64.c |    2 +
 drivers/cpufreq/cpufreq.c    |    1 +
 drivers/cpuidle/cpuidle.c    |    1 +
 drivers/idle/intel_idle.c    |    1 +
 include/trace/events/power.h |   86 +++++++++++++++++++++++++++++++++++++----
 kernel/trace/Kconfig         |   15 +++++++
 kernel/trace/power-traces.c  |    3 +
 9 files changed, 107 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 57d1868..155d975 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -374,6 +374,7 @@ void default_idle(void)
 {
 	if (hlt_use_halt()) {
 		trace_power_start(POWER_CSTATE, 1, smp_processor_id());
+		trace_cpu_idle(1, smp_processor_id());
 		current_thread_info()->status &= ~TS_POLLING;
 		/*
 		 * TS_POLLING-cleared state must be visible before we
@@ -444,6 +445,7 @@ EXPORT_SYMBOL_GPL(cpu_idle_wait);
 void mwait_idle_with_hints(unsigned long ax, unsigned long cx)
 {
 	trace_power_start(POWER_CSTATE, (ax>>4)+1, smp_processor_id());
+	trace_cpu_idle((ax>>4)+1, smp_processor_id());
 	if (!need_resched()) {
 		if (cpu_has(&current_cpu_data, X86_FEATURE_CLFLUSH_MONITOR))
 			clflush((void *)&current_thread_info()->flags);
@@ -460,6 +462,7 @@ static void mwait_idle(void)
 {
 	if (!need_resched()) {
 		trace_power_start(POWER_CSTATE, 1, smp_processor_id());
+		trace_cpu_idle(1, smp_processor_id());
 		if (cpu_has(&current_cpu_data, X86_FEATURE_CLFLUSH_MONITOR))
 			clflush((void *)&current_thread_info()->flags);
 
@@ -481,10 +484,12 @@ static void mwait_idle(void)
 static void poll_idle(void)
 {
 	trace_power_start(POWER_CSTATE, 0, smp_processor_id());
+	trace_cpu_idle(0, smp_processor_id());
 	local_irq_enable();
 	while (!need_resched())
 		cpu_relax();
-	trace_power_end(0);
+	trace_power_end(smp_processor_id());
+	trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
 }
 
 /*
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 96586c3..4b9befa 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -113,8 +113,8 @@ void cpu_idle(void)
 			stop_critical_timings();
 			pm_idle();
 			start_critical_timings();
-
 			trace_power_end(smp_processor_id());
+			trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
 		}
 		tick_nohz_restart_sched_tick();
 		preempt_enable_no_resched();
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index b3d7a3a..4c818a7 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -142,6 +142,8 @@ void cpu_idle(void)
 			start_critical_timings();
 
 			trace_power_end(smp_processor_id());
+			trace_cpu_idle(PWR_EVENT_EXIT,
+				       smp_processor_id());
 
 			/* In many cases the interrupt that ended idle
 			   has already called exit_idle. But some idle
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index c63a438..1109f68 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -355,6 +355,7 @@ void cpufreq_notify_transition(struct cpufreq_freqs *freqs, unsigned int state)
 		dprintk("FREQ: %lu - CPU: %lu", (unsigned long)freqs->new,
 			(unsigned long)freqs->cpu);
 		trace_power_frequency(POWER_PSTATE, freqs->new, freqs->cpu);
+		trace_cpu_frequency(freqs->new, freqs->cpu);
 		srcu_notifier_call_chain(&cpufreq_transition_notifier_list,
 				CPUFREQ_POSTCHANGE, freqs);
 		if (likely(policy) && likely(policy->cpu == freqs->cpu))
diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index a507108..08d5f05 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -107,6 +107,7 @@ static void cpuidle_idle_call(void)
 	if (cpuidle_curr_governor->reflect)
 		cpuidle_curr_governor->reflect(dev);
 	trace_power_end(smp_processor_id());
+	trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
 }
 
 /**
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index 3c95325..ba5134f 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -221,6 +221,7 @@ static int intel_idle(struct cpuidle_device *dev, struct cpuidle_state *state)
 
 	stop_critical_timings();
 	trace_power_start(POWER_CSTATE, (eax >> 4) + 1, cpu);
+	trace_cpu_idle((eax >> 4) + 1, cpu);
 	if (!need_resched()) {
 
 		__monitor((void *)&current_thread_info()->flags, 0, 0);
diff --git a/include/trace/events/power.h b/include/trace/events/power.h
index 286784d..00d9819 100644
--- a/include/trace/events/power.h
+++ b/include/trace/events/power.h
@@ -7,16 +7,67 @@
 #include <linux/ktime.h>
 #include <linux/tracepoint.h>
 
-#ifndef _TRACE_POWER_ENUM_
-#define _TRACE_POWER_ENUM_
-enum {
-	POWER_NONE	= 0,
-	POWER_CSTATE	= 1,	/* C-State */
-	POWER_PSTATE	= 2,	/* Fequency change or DVFS */
-	POWER_SSTATE	= 3,	/* Suspend */
-};
+DECLARE_EVENT_CLASS(cpu,
+
+	TP_PROTO(unsigned int state, unsigned int cpu_id),
+
+	TP_ARGS(state, cpu_id),
+
+	TP_STRUCT__entry(
+		__field(	u32,		state		)
+		__field(	u32,		cpu_id		)
+	),
+
+	TP_fast_assign(
+		__entry->state = state;
+		__entry->cpu_id = cpu_id;
+	),
+
+	TP_printk("state=%lu cpu_id=%lu", (unsigned long)__entry->state,
+		  (unsigned long)__entry->cpu_id)
+);
+
+DEFINE_EVENT(cpu, cpu_idle,
+
+	TP_PROTO(unsigned int state, unsigned int cpu_id),
+
+	TP_ARGS(state, cpu_id)
+);
+
+/* This file can get included multiple times, TRACE_HEADER_MULTI_READ at top */
+#ifndef _PWR_EVENT_AVOID_DOUBLE_DEFINING
+#define _PWR_EVENT_AVOID_DOUBLE_DEFINING
+
+#define PWR_EVENT_EXIT -1
 #endif
 
+DEFINE_EVENT(cpu, cpu_frequency,
+
+	TP_PROTO(unsigned int frequency, unsigned int cpu_id),
+
+	TP_ARGS(frequency, cpu_id)
+);
+
+TRACE_EVENT(machine_suspend,
+
+	TP_PROTO(unsigned int state),
+
+	TP_ARGS(state),
+
+	TP_STRUCT__entry(
+		__field(	u32,		state		)
+	),
+
+	TP_fast_assign(
+		__entry->state = state;
+	),
+
+	TP_printk("state=%lu", (unsigned long)__entry->state)
+);
+
+/* This code will be removed after deprecation time exceeded (2.6.41) */
+#ifdef CONFIG_EVENT_POWER_TRACING_DEPRECATED
+
 /*
  * The power events are used for cpuidle & suspend (power_start, power_end)
  *  and for cpufreq (power_frequency)
@@ -75,6 +126,24 @@ TRACE_EVENT(power_end,
 
 );
 
+/* Deprecated dummy functions must be protected against multi-declaration */
+#ifndef _PWR_EVENT_AVOID_DOUBLE_DEFINING_DEPRECATED
+#define _PWR_EVENT_AVOID_DOUBLE_DEFINING_DEPRECATED
+
+enum {
+	POWER_NONE = 0,
+	POWER_CSTATE = 1,
+	POWER_PSTATE = 2,
+};
+#endif /* _PWR_EVENT_AVOID_DOUBLE_DEFINING_DEPRECATED */
+#else
+/* These dummy declaration have to be ripped out when the deprecated
+   events get removed */
+static inline void trace_power_start(u64 type, u64 state, u64 cpuid) {};
+static inline void trace_power_end(u64 cpuid) {};
+static inline void trace_power_frequency(u64 type, u64 state, u64 cpuid) {};
+#endif /* CONFIG_EVENT_POWER_TRACING_DEPRECATED */
+
 /*
  * The clock events are used for clock enable/disable and for
  *  clock rate change
@@ -153,7 +222,6 @@ DEFINE_EVENT(power_domain, power_domain_target,
 
 	TP_ARGS(name, state, cpu_id)
 );
-
 #endif /* _TRACE_POWER_H */
 
 /* This part must be outside protection */
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index e04b8bc..59b44a1 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -69,6 +69,21 @@ config EVENT_TRACING
 	select CONTEXT_SWITCH_TRACER
 	bool
 
+config EVENT_POWER_TRACING_DEPRECATED
+	depends on EVENT_TRACING
+	bool "Deprecated power event trace API, to be removed"
+	default y
+	help
+	  Provides old power event types:
+	  C-state/idle accounting events:
+	  power:power_start
+	  power:power_end
+	  and old cpufreq accounting event:
+	  power:power_frequency
+	  This is for userspace compatibility
+	  and will vanish after 5 kernel iterations,
+	  namely 2.6.41.
+
 config CONTEXT_SWITCH_TRACER
 	bool
 
diff --git a/kernel/trace/power-traces.c b/kernel/trace/power-traces.c
index 0e0497d..f55fcf6 100644
--- a/kernel/trace/power-traces.c
+++ b/kernel/trace/power-traces.c
@@ -13,5 +13,8 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/power.h>
 
+#ifdef CONFIG_EVENT_POWER_TRACING_DEPRECATED
 EXPORT_TRACEPOINT_SYMBOL_GPL(power_start);
+#endif
+EXPORT_TRACEPOINT_SYMBOL_GPL(cpu_idle);
 
-- 
1.6.3



* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-11-18  9:36               ` Ingo Molnar
  2010-11-18  9:44                 ` Jean Pihet
@ 2010-11-18 10:52                 ` Ingo Molnar
  2010-11-18 16:34                   ` Jean Pihet
  1 sibling, 1 reply; 157+ messages in thread
From: Ingo Molnar @ 2010-11-18 10:52 UTC (permalink / raw)
  To: Thomas Renninger; +Cc: Jean Pihet, rjw, linux-kernel, arjan

[-- Attachment #1: Type: text/plain, Size: 12091 bytes --]


I am also getting build failures:

drivers/cpufreq/cpufreq.c:357: error: 'POWER_PSTATE' undeclared (first use in this function)
drivers/cpufreq/cpufreq.c:357: error: (Each undeclared identifier is reported only once
drivers/cpufreq/cpufreq.c:357: error: for each function it appears in.)
arch/x86/kernel/process.c:375: error: 'POWER_CSTATE' undeclared (first use in this function)
arch/x86/kernel/process.c:375: error: (Each undeclared identifier is reported only once
arch/x86/kernel/process.c:375: error: for each function it appears in.)
arch/x86/kernel/process.c:446: error: 'POWER_CSTATE' undeclared (first use in this function)
arch/x86/kernel/process.c:463: error: 'POWER_CSTATE' undeclared (first use in this function)
arch/x86/kernel/process.c:485: error: 'POWER_CSTATE' undeclared (first use in this function)
include/trace/events/power.h:142: error: redefinition of 'trace_power_start'

Config attached.

Note: please reuse the two commits from below for further work, I did some small
cleanups to the commit text and to the patches.

Thanks,

	Ingo

---------------->
From 87a2cfbda3f53c3bf00c424ce18d97b03b0c3aa0 Mon Sep 17 00:00:00 2001
From: Thomas Renninger <trenn@suse.de>
Date: Thu, 18 Nov 2010 10:25:13 +0100
Subject: [PATCH] perf: Clean up power events

Add these new power trace events:

 power:cpu_idle
 power:cpu_frequency
 power:machine_suspend

The old C-state/idle accounting events:
  power:power_start
  power:power_end

Now have a replacement (but we are still keeping the old
tracepoints for compatibility):

  power:cpu_idle

and
  power:power_frequency

is replaced with:
  power:cpu_frequency

power:machine_suspend is newly introduced.

Jean Pihet has a patch integrated into the generic layer
(kernel/power/suspend.c) which will make use of it.

The type= field got removed from both; it was never
used, and the type is distinguished by the event type itself.

perf timechart userspace tool gets adjusted in a separate patch.

Signed-off-by: Thomas Renninger <trenn@suse.de>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Jean Pihet <jean.pihet@newoldbits.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: rjw@sisk.pl
LKML-Reference: <1290072314-31155-2-git-send-email-trenn@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/kernel/process.c    |    7 +++-
 arch/x86/kernel/process_32.c |    2 +-
 arch/x86/kernel/process_64.c |    2 +
 drivers/cpufreq/cpufreq.c    |    1 +
 drivers/cpuidle/cpuidle.c    |    1 +
 drivers/idle/intel_idle.c    |    1 +
 include/trace/events/power.h |   86 +++++++++++++++++++++++++++++++++++++----
 kernel/trace/Kconfig         |   15 +++++++
 kernel/trace/power-traces.c  |    3 +
 9 files changed, 107 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 57d1868..155d975 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -374,6 +374,7 @@ void default_idle(void)
 {
 	if (hlt_use_halt()) {
 		trace_power_start(POWER_CSTATE, 1, smp_processor_id());
+		trace_cpu_idle(1, smp_processor_id());
 		current_thread_info()->status &= ~TS_POLLING;
 		/*
 		 * TS_POLLING-cleared state must be visible before we
@@ -444,6 +445,7 @@ EXPORT_SYMBOL_GPL(cpu_idle_wait);
 void mwait_idle_with_hints(unsigned long ax, unsigned long cx)
 {
 	trace_power_start(POWER_CSTATE, (ax>>4)+1, smp_processor_id());
+	trace_cpu_idle((ax>>4)+1, smp_processor_id());
 	if (!need_resched()) {
 		if (cpu_has(&current_cpu_data, X86_FEATURE_CLFLUSH_MONITOR))
 			clflush((void *)&current_thread_info()->flags);
@@ -460,6 +462,7 @@ static void mwait_idle(void)
 {
 	if (!need_resched()) {
 		trace_power_start(POWER_CSTATE, 1, smp_processor_id());
+		trace_cpu_idle(1, smp_processor_id());
 		if (cpu_has(&current_cpu_data, X86_FEATURE_CLFLUSH_MONITOR))
 			clflush((void *)&current_thread_info()->flags);
 
@@ -481,10 +484,12 @@ static void mwait_idle(void)
 static void poll_idle(void)
 {
 	trace_power_start(POWER_CSTATE, 0, smp_processor_id());
+	trace_cpu_idle(0, smp_processor_id());
 	local_irq_enable();
 	while (!need_resched())
 		cpu_relax();
-	trace_power_end(0);
+	trace_power_end(smp_processor_id());
+	trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
 }
 
 /*
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 96586c3..4b9befa 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -113,8 +113,8 @@ void cpu_idle(void)
 			stop_critical_timings();
 			pm_idle();
 			start_critical_timings();
-
 			trace_power_end(smp_processor_id());
+			trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
 		}
 		tick_nohz_restart_sched_tick();
 		preempt_enable_no_resched();
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index b3d7a3a..4c818a7 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -142,6 +142,8 @@ void cpu_idle(void)
 			start_critical_timings();
 
 			trace_power_end(smp_processor_id());
+			trace_cpu_idle(PWR_EVENT_EXIT,
+				       smp_processor_id());
 
 			/* In many cases the interrupt that ended idle
 			   has already called exit_idle. But some idle
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index c63a438..1109f68 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -355,6 +355,7 @@ void cpufreq_notify_transition(struct cpufreq_freqs *freqs, unsigned int state)
 		dprintk("FREQ: %lu - CPU: %lu", (unsigned long)freqs->new,
 			(unsigned long)freqs->cpu);
 		trace_power_frequency(POWER_PSTATE, freqs->new, freqs->cpu);
+		trace_cpu_frequency(freqs->new, freqs->cpu);
 		srcu_notifier_call_chain(&cpufreq_transition_notifier_list,
 				CPUFREQ_POSTCHANGE, freqs);
 		if (likely(policy) && likely(policy->cpu == freqs->cpu))
diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index a507108..08d5f05 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -107,6 +107,7 @@ static void cpuidle_idle_call(void)
 	if (cpuidle_curr_governor->reflect)
 		cpuidle_curr_governor->reflect(dev);
 	trace_power_end(smp_processor_id());
+	trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
 }
 
 /**
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index 3c95325..ba5134f 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -221,6 +221,7 @@ static int intel_idle(struct cpuidle_device *dev, struct cpuidle_state *state)
 
 	stop_critical_timings();
 	trace_power_start(POWER_CSTATE, (eax >> 4) + 1, cpu);
+	trace_cpu_idle((eax >> 4) + 1, cpu);
 	if (!need_resched()) {
 
 		__monitor((void *)&current_thread_info()->flags, 0, 0);
diff --git a/include/trace/events/power.h b/include/trace/events/power.h
index 286784d..00d9819 100644
--- a/include/trace/events/power.h
+++ b/include/trace/events/power.h
@@ -7,16 +7,67 @@
 #include <linux/ktime.h>
 #include <linux/tracepoint.h>
 
-#ifndef _TRACE_POWER_ENUM_
-#define _TRACE_POWER_ENUM_
-enum {
-	POWER_NONE	= 0,
-	POWER_CSTATE	= 1,	/* C-State */
-	POWER_PSTATE	= 2,	/* Fequency change or DVFS */
-	POWER_SSTATE	= 3,	/* Suspend */
-};
+DECLARE_EVENT_CLASS(cpu,
+
+	TP_PROTO(unsigned int state, unsigned int cpu_id),
+
+	TP_ARGS(state, cpu_id),
+
+	TP_STRUCT__entry(
+		__field(	u32,		state		)
+		__field(	u32,		cpu_id		)
+	),
+
+	TP_fast_assign(
+		__entry->state = state;
+		__entry->cpu_id = cpu_id;
+	),
+
+	TP_printk("state=%lu cpu_id=%lu", (unsigned long)__entry->state,
+		  (unsigned long)__entry->cpu_id)
+);
+
+DEFINE_EVENT(cpu, cpu_idle,
+
+	TP_PROTO(unsigned int state, unsigned int cpu_id),
+
+	TP_ARGS(state, cpu_id)
+);
+
+/* This file can get included multiple times, TRACE_HEADER_MULTI_READ at top */
+#ifndef _PWR_EVENT_AVOID_DOUBLE_DEFINING
+#define _PWR_EVENT_AVOID_DOUBLE_DEFINING
+
+#define PWR_EVENT_EXIT -1
 #endif
 
+DEFINE_EVENT(cpu, cpu_frequency,
+
+	TP_PROTO(unsigned int frequency, unsigned int cpu_id),
+
+	TP_ARGS(frequency, cpu_id)
+);
+
+TRACE_EVENT(machine_suspend,
+
+	TP_PROTO(unsigned int state),
+
+	TP_ARGS(state),
+
+	TP_STRUCT__entry(
+		__field(	u32,		state		)
+	),
+
+	TP_fast_assign(
+		__entry->state = state;
+	),
+
+	TP_printk("state=%lu", (unsigned long)__entry->state)
+);
+
+/* This code will be removed once the deprecation period ends (2.6.41) */
+#ifdef CONFIG_EVENT_POWER_TRACING_DEPRECATED
+
 /*
  * The power events are used for cpuidle & suspend (power_start, power_end)
  *  and for cpufreq (power_frequency)
@@ -75,6 +126,24 @@ TRACE_EVENT(power_end,
 
 );
 
+/* Deprecated dummy functions must be protected against multi-declaration */
+#ifndef _PWR_EVENT_AVOID_DOUBLE_DEFINING_DEPRECATED
+#define _PWR_EVENT_AVOID_DOUBLE_DEFINING_DEPRECATED
+
+enum {
+	POWER_NONE = 0,
+	POWER_CSTATE = 1,
+	POWER_PSTATE = 2,
+};
+#endif /* _PWR_EVENT_AVOID_DOUBLE_DEFINING_DEPRECATED */
+#else
+/* These dummy declarations have to be ripped out when the deprecated
+   events get removed */
+static inline void trace_power_start(u64 type, u64 state, u64 cpuid) {}
+static inline void trace_power_end(u64 cpuid) {}
+static inline void trace_power_frequency(u64 type, u64 state, u64 cpuid) {}
+#endif /* CONFIG_EVENT_POWER_TRACING_DEPRECATED */
+
 /*
  * The clock events are used for clock enable/disable and for
  *  clock rate change
@@ -153,7 +222,6 @@ DEFINE_EVENT(power_domain, power_domain_target,
 
 	TP_ARGS(name, state, cpu_id)
 );
-
 #endif /* _TRACE_POWER_H */
 
 /* This part must be outside protection */
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index e04b8bc..59b44a1 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -69,6 +69,21 @@ config EVENT_TRACING
 	select CONTEXT_SWITCH_TRACER
 	bool
 
+config EVENT_POWER_TRACING_DEPRECATED
+	depends on EVENT_TRACING
+	bool "Deprecated power event trace API, to be removed"
+	default y
+	help
+	  Provides old power event types:
+	  C-state/idle accounting events:
+	  power:power_start
+	  power:power_end
+	  and old cpufreq accounting event:
+	  power:power_frequency
+	  This is for userspace compatibility
+	  and will vanish after 5 kernel iterations,
+	  namely 2.6.41.
+
 config CONTEXT_SWITCH_TRACER
 	bool
 
diff --git a/kernel/trace/power-traces.c b/kernel/trace/power-traces.c
index 0e0497d..f55fcf6 100644
--- a/kernel/trace/power-traces.c
+++ b/kernel/trace/power-traces.c
@@ -13,5 +13,8 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/power.h>
 
+#ifdef CONFIG_EVENT_POWER_TRACING_DEPRECATED
 EXPORT_TRACEPOINT_SYMBOL_GPL(power_start);
+#endif
+EXPORT_TRACEPOINT_SYMBOL_GPL(cpu_idle);
 

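A note for anyone consuming the new events from userspace: the "cpu" event
class above prints its payload as "state=%lu cpu_id=%lu", and state is
stored as a u32. PWR_EVENT_EXIT is defined as -1, so an idle-exit event
should show up as state=4294967295. Below is a minimal sketch of decoding
that payload. The names parse_cpu_event(), is_idle_exit() and
PWR_EVENT_EXIT_U32 are made up for illustration and are not part of the
patch; only the format string and the -1 value come from it.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Illustrative only: "#define PWR_EVENT_EXIT -1" from the patch, as it
 * looks after being stored in the event's u32 "state" field.
 */
#define PWR_EVENT_EXIT_U32 ((uint32_t)-1)

/*
 * Hypothetical helper: parse the payload printed by the "cpu" event
 * class ("state=%lu cpu_id=%lu").
 * Returns 0 on success, -1 if the payload does not match.
 */
static int parse_cpu_event(const char *payload, uint32_t *state,
			   uint32_t *cpu_id)
{
	unsigned long s, c;

	if (sscanf(payload, "state=%lu cpu_id=%lu", &s, &c) != 2)
		return -1;

	*state = (uint32_t)s;
	*cpu_id = (uint32_t)c;
	return 0;
}

/* A cpu_idle event whose state equals (u32)-1 marks the exit from idle. */
static int is_idle_exit(uint32_t state)
{
	return state == PWR_EVENT_EXIT_U32;
}
```

A consumer would call parse_cpu_event() on the text after the event name
and treat any cpu_idle record for which is_idle_exit() is true as the end
of the idle period opened by the previous cpu_idle record on that CPU.
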
From b989c51b6f1989a834eecd9a64a7bd52ed230ea0 Mon Sep 17 00:00:00 2001
From: Thomas Renninger <trenn@suse.de>
Date: Thu, 18 Nov 2010 10:25:12 +0100
Subject: [PATCH] perf: Do not export power_frequency, but power_start event

power_frequency moved to drivers/cpufreq/cpufreq.c which has
to be compiled in, no need to export it.

intel_idle can be a module though...

Signed-off-by: Thomas Renninger <trenn@suse.de>
Acked-by: Jean Pihet <jean.pihet@newoldbits.com>
CC: Arjan van de Ven <arjan@linux.intel.com>
Cc: rjw@sisk.pl
LKML-Reference: <1290072314-31155-2-git-send-email-trenn@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 drivers/idle/intel_idle.c   |    2 --
 kernel/trace/power-traces.c |    2 +-
 2 files changed, 1 insertions(+), 3 deletions(-)

diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index 41665d2..3c95325 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -220,9 +220,7 @@ static int intel_idle(struct cpuidle_device *dev, struct cpuidle_state *state)
 	kt_before = ktime_get_real();
 
 	stop_critical_timings();
-#ifndef MODULE
 	trace_power_start(POWER_CSTATE, (eax >> 4) + 1, cpu);
-#endif
 	if (!need_resched()) {
 
 		__monitor((void *)&current_thread_info()->flags, 0, 0);
diff --git a/kernel/trace/power-traces.c b/kernel/trace/power-traces.c
index a22582a..0e0497d 100644
--- a/kernel/trace/power-traces.c
+++ b/kernel/trace/power-traces.c
@@ -13,5 +13,5 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/power.h>
 
-EXPORT_TRACEPOINT_SYMBOL_GPL(power_frequency);
+EXPORT_TRACEPOINT_SYMBOL_GPL(power_start);
 

[-- Attachment #2: config --]
[-- Type: text/plain, Size: 39692 bytes --]

#
# Automatically generated make config: don't edit
# Linux/i386 2.6.37-rc2 Kernel Configuration
# Thu Nov 18 12:29:46 2010
#
# CONFIG_64BIT is not set
CONFIG_X86_32=y
# CONFIG_X86_64 is not set
CONFIG_X86=y
CONFIG_INSTRUCTION_DECODER=y
CONFIG_OUTPUT_FORMAT="elf32-i386"
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/i386_defconfig"
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
# CONFIG_NEED_DMA_MAP_STATE is not set
CONFIG_NEED_SG_DMA_LENGTH=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_GENERIC_GPIO=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
# CONFIG_RWSEM_GENERIC_SPINLOCK is not set
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_ARCH_HAS_CPU_IDLE_WAIT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
# CONFIG_GENERIC_TIME_VSYSCALL is not set
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_DEFAULT_IDLE=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y
# CONFIG_HAVE_CPUMASK_OF_CPU_MAP is not set
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
# CONFIG_ZONE_DMA32 is not set
CONFIG_ARCH_POPULATES_NODE_MAP=y
# CONFIG_AUDIT_ARCH is not set
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_USE_GENERIC_SMP_HELPERS=y
CONFIG_X86_32_SMP=y
CONFIG_X86_HT=y
CONFIG_X86_TRAMPOLINE=y
CONFIG_ARCH_HWEIGHT_CFLAGS="-fcall-saved-ecx -fcall-saved-edx"
CONFIG_KTIME_SCALAR=y
CONFIG_BOOTPARAM_SUPPORT_NOT_WANTED=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
CONFIG_CONSTRUCTORS=y
CONFIG_HAVE_IRQ_WORK=y
CONFIG_IRQ_WORK=y

#
# General setup
#
# CONFIG_EXPERIMENTAL is not set
CONFIG_BROKEN_BOOT_ALLOWED4=y
CONFIG_BROKEN_BOOT_ALLOWED3=y
CONFIG_BROKEN_BOOT_ALLOWED2=y
CONFIG_BROKEN_BOOT_ALLOWED=y
CONFIG_BROKEN_BOOT_EUROPE=y
CONFIG_BROKEN_BOOT_TITAN=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_CROSS_COMPILE=""
CONFIG_LOCALVERSION=""
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_LZO=y
# CONFIG_KERNEL_GZIP is not set
# CONFIG_KERNEL_BZIP2 is not set
# CONFIG_KERNEL_LZMA is not set
CONFIG_KERNEL_LZO=y
# CONFIG_SWAP is not set
# CONFIG_SYSVIPC is not set
# CONFIG_BSD_PROCESS_ACCT is not set
# CONFIG_TASKSTATS is not set
CONFIG_AUDIT=y
CONFIG_AUDITSYSCALL=y
CONFIG_AUDIT_WATCH=y
CONFIG_AUDIT_TREE=y
CONFIG_HAVE_GENERIC_HARDIRQS=y

#
# IRQ subsystem
#
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_HARDIRQS_NO__DO_IRQ=y
# CONFIG_GENERIC_HARDIRQS_NO_DEPRECATED is not set
CONFIG_HAVE_SPARSE_IRQ=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_PENDING_IRQ=y
# CONFIG_AUTO_IRQ_AFFINITY is not set
# CONFIG_IRQ_PER_CPU is not set
# CONFIG_HARDIRQS_SW_RESEND is not set
CONFIG_SPARSE_IRQ=y

#
# RCU Subsystem
#
CONFIG_TREE_RCU=y
# CONFIG_PREEMPT_RCU is not set
CONFIG_RCU_TRACE=y
CONFIG_RCU_FANOUT=32
CONFIG_RCU_FANOUT_EXACT=y
CONFIG_TREE_RCU_TRACE=y
CONFIG_IKCONFIG=m
CONFIG_IKCONFIG_PROC=y
CONFIG_LOG_BUF_SHIFT=20
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
CONFIG_NAMESPACES=y
# CONFIG_UTS_NS is not set
CONFIG_PID_NS=y
# CONFIG_NET_NS is not set
# CONFIG_SYSFS_DEPRECATED is not set
CONFIG_RELAY=y
# CONFIG_BLK_DEV_INITRD is not set
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
CONFIG_SYSCTL=y
CONFIG_ANON_INODES=y
CONFIG_EMBEDDED=y
CONFIG_UID16=y
# CONFIG_SYSCTL_SYSCALL is not set
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_ALL=y
# CONFIG_KALLSYMS_EXTRA_PASS is not set
CONFIG_HOTPLUG=y
# CONFIG_PRINTK is not set
# CONFIG_BUG is not set
# CONFIG_ELF_CORE is not set
# CONFIG_PCSPKR_PLATFORM is not set
CONFIG_BASE_FULL=y
# CONFIG_FUTEX is not set
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
# CONFIG_TIMERFD is not set
# CONFIG_EVENTFD is not set
# CONFIG_SHMEM is not set
# CONFIG_AIO is not set
CONFIG_HAVE_PERF_EVENTS=y

#
# Kernel Performance Events And Counters
#
CONFIG_PERF_EVENTS=y
# CONFIG_PERF_COUNTERS is not set
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_SLUB_DEBUG=y
CONFIG_COMPAT_BRK=y
# CONFIG_SLAB is not set
CONFIG_SLUB=y
# CONFIG_SLOB is not set
# CONFIG_PROFILING is not set
CONFIG_TRACEPOINTS=y
CONFIG_HAVE_OPROFILE=y
# CONFIG_KPROBES is not set
# CONFIG_JUMP_LABEL is not set
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y
CONFIG_HAVE_IOREMAP_PROT=y
CONFIG_HAVE_KPROBES=y
CONFIG_HAVE_KRETPROBES=y
CONFIG_HAVE_OPTPROBES=y
CONFIG_HAVE_ARCH_TRACEHOOK=y
CONFIG_HAVE_DMA_ATTRS=y
CONFIG_HAVE_REGS_AND_STACK_ACCESS_API=y
CONFIG_HAVE_DMA_API_DEBUG=y
CONFIG_HAVE_HW_BREAKPOINT=y
CONFIG_HAVE_MIXED_BREAKPOINTS_REGS=y
CONFIG_HAVE_USER_RETURN_NOTIFIER=y
CONFIG_HAVE_PERF_EVENTS_NMI=y
CONFIG_HAVE_ARCH_JUMP_LABEL=y

#
# GCOV-based kernel profiling
#
CONFIG_HAVE_GENERIC_DMA_COHERENT=y
CONFIG_SLABINFO=y
CONFIG_RT_MUTEXES=y
CONFIG_BASE_SMALL=0
CONFIG_MODULES=y
CONFIG_MODULE_FORCE_LOAD=y
CONFIG_MODULE_UNLOAD=y
CONFIG_MODVERSIONS=y
CONFIG_MODULE_SRCVERSION_ALL=y
CONFIG_STOP_MACHINE=y
CONFIG_BLOCK=y
# CONFIG_LBDAF is not set
CONFIG_BLK_DEV_BSG=y
CONFIG_BLK_DEV_INTEGRITY=y

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_DEADLINE=m
CONFIG_IOSCHED_CFQ=m
CONFIG_DEFAULT_NOOP=y
CONFIG_DEFAULT_IOSCHED="noop"
# CONFIG_INLINE_SPIN_TRYLOCK is not set
# CONFIG_INLINE_SPIN_TRYLOCK_BH is not set
# CONFIG_INLINE_SPIN_LOCK is not set
# CONFIG_INLINE_SPIN_LOCK_BH is not set
# CONFIG_INLINE_SPIN_LOCK_IRQ is not set
# CONFIG_INLINE_SPIN_LOCK_IRQSAVE is not set
CONFIG_INLINE_SPIN_UNLOCK=y
# CONFIG_INLINE_SPIN_UNLOCK_BH is not set
CONFIG_INLINE_SPIN_UNLOCK_IRQ=y
# CONFIG_INLINE_SPIN_UNLOCK_IRQRESTORE is not set
# CONFIG_INLINE_READ_TRYLOCK is not set
# CONFIG_INLINE_READ_LOCK is not set
# CONFIG_INLINE_READ_LOCK_BH is not set
# CONFIG_INLINE_READ_LOCK_IRQ is not set
# CONFIG_INLINE_READ_LOCK_IRQSAVE is not set
CONFIG_INLINE_READ_UNLOCK=y
# CONFIG_INLINE_READ_UNLOCK_BH is not set
CONFIG_INLINE_READ_UNLOCK_IRQ=y
# CONFIG_INLINE_READ_UNLOCK_IRQRESTORE is not set
# CONFIG_INLINE_WRITE_TRYLOCK is not set
# CONFIG_INLINE_WRITE_LOCK is not set
# CONFIG_INLINE_WRITE_LOCK_BH is not set
# CONFIG_INLINE_WRITE_LOCK_IRQ is not set
# CONFIG_INLINE_WRITE_LOCK_IRQSAVE is not set
CONFIG_INLINE_WRITE_UNLOCK=y
# CONFIG_INLINE_WRITE_UNLOCK_BH is not set
CONFIG_INLINE_WRITE_UNLOCK_IRQ=y
# CONFIG_INLINE_WRITE_UNLOCK_IRQRESTORE is not set
CONFIG_MUTEX_SPIN_ON_OWNER=y
# CONFIG_FREEZER is not set

#
# Processor type and features
#
CONFIG_TICK_ONESHOT=y
# CONFIG_NO_HZ is not set
CONFIG_HIGH_RES_TIMERS=y
CONFIG_GENERIC_CLOCKEVENTS_BUILD=y
CONFIG_SMP_SUPPORT=y
CONFIG_X86_MPPARSE=y
CONFIG_X86_BIGSMP=y
CONFIG_X86_EXTENDED_PLATFORM=y
# CONFIG_X86_ELAN is not set
CONFIG_X86_RDC321X=y
# CONFIG_X86_32_NON_STANDARD is not set
CONFIG_X86_SUPPORTS_MEMORY_FAILURE=y
CONFIG_SCHED_OMIT_FRAME_POINTER=y
CONFIG_PARAVIRT_GUEST=y
# CONFIG_XEN_PRIVILEGED_GUEST is not set
CONFIG_KVM_CLOCK=y
# CONFIG_KVM_GUEST is not set
# CONFIG_LGUEST_GUEST is not set
CONFIG_PARAVIRT=y
CONFIG_PARAVIRT_CLOCK=y
CONFIG_NO_BOOTMEM=y
# CONFIG_MEMTEST is not set
# CONFIG_M386 is not set
# CONFIG_M486 is not set
# CONFIG_M586 is not set
# CONFIG_M586TSC is not set
# CONFIG_M586MMX is not set
CONFIG_M686=y
# CONFIG_MPENTIUMII is not set
# CONFIG_MPENTIUMIII is not set
# CONFIG_MPENTIUMM is not set
# CONFIG_MPENTIUM4 is not set
# CONFIG_MK6 is not set
# CONFIG_MK7 is not set
# CONFIG_MK8 is not set
# CONFIG_MCRUSOE is not set
# CONFIG_MEFFICEON is not set
# CONFIG_MWINCHIPC6 is not set
# CONFIG_MWINCHIP3D is not set
# CONFIG_MGEODEGX1 is not set
# CONFIG_MGEODE_LX is not set
# CONFIG_MCYRIXIII is not set
# CONFIG_MVIAC3_2 is not set
# CONFIG_MVIAC7 is not set
# CONFIG_MCORE2 is not set
# CONFIG_MATOM is not set
# CONFIG_X86_GENERIC is not set
CONFIG_X86_CPU=y
CONFIG_X86_INTERNODE_CACHE_SHIFT=5
CONFIG_X86_CMPXCHG=y
CONFIG_X86_L1_CACHE_SHIFT=5
CONFIG_X86_XADD=y
# CONFIG_X86_PPRO_FENCE is not set
CONFIG_X86_WP_WORKS_OK=y
CONFIG_X86_INVLPG=y
CONFIG_X86_BSWAP=y
CONFIG_X86_POPAD_OK=y
CONFIG_X86_USE_PPRO_CHECKSUM=y
CONFIG_X86_TSC=y
CONFIG_X86_CMPXCHG64=y
CONFIG_X86_CMOV=y
CONFIG_X86_MINIMUM_CPU_FAMILY=5
CONFIG_X86_DEBUGCTLMSR=y
CONFIG_PROCESSOR_SELECT=y
CONFIG_CPU_SUP_INTEL=y
# CONFIG_CPU_SUP_CYRIX_32 is not set
# CONFIG_CPU_SUP_AMD is not set
CONFIG_CPU_SUP_CENTAUR=y
CONFIG_CPU_SUP_TRANSMETA_32=y
# CONFIG_CPU_SUP_UMC_32 is not set
# CONFIG_HPET_TIMER is not set
CONFIG_DMI=y
# CONFIG_IOMMU_HELPER is not set
# CONFIG_IOMMU_API is not set
CONFIG_NR_CPUS=32
CONFIG_SCHED_SMT=y
CONFIG_SCHED_MC=y
# CONFIG_IRQ_TIME_ACCOUNTING is not set
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS=y
CONFIG_X86_MCE=y
# CONFIG_X86_MCE_INTEL is not set
# CONFIG_X86_MCE_AMD is not set
CONFIG_X86_ANCIENT_MCE=y
CONFIG_X86_MCE_INJECT=m
CONFIG_VM86=y
# CONFIG_TOSHIBA is not set
# CONFIG_I8K is not set
CONFIG_X86_REBOOTFIXUPS=y
CONFIG_MICROCODE=m
CONFIG_MICROCODE_INTEL=y
# CONFIG_MICROCODE_AMD is not set
CONFIG_MICROCODE_OLD_INTERFACE=y
CONFIG_X86_MSR=m
CONFIG_X86_CPUID=m
# CONFIG_UP_WANTED_1 is not set
CONFIG_SMP=y
# CONFIG_NOHIGHMEM is not set
CONFIG_HIGHMEM4G=y
# CONFIG_HIGHMEM64G is not set
CONFIG_PAGE_OFFSET=0xC0000000
CONFIG_HIGHMEM=y
# CONFIG_ARCH_PHYS_ADDR_T_64BIT is not set
# CONFIG_ARCH_DMA_ADDR_T_64BIT is not set
CONFIG_ILLEGAL_POINTER_VALUE=0
CONFIG_FLATMEM=y
CONFIG_FLAT_NODE_MEM_MAP=y
CONFIG_HAVE_MEMBLOCK=y
CONFIG_PAGEFLAGS_EXTENDED=y
CONFIG_SPLIT_PTLOCK_CPUS=4
# CONFIG_PHYS_ADDR_T_64BIT is not set
CONFIG_ZONE_DMA_FLAG=1
CONFIG_BOUNCE=y
CONFIG_VIRT_TO_BUS=y
CONFIG_KSM=y
CONFIG_DEFAULT_MMAP_MIN_ADDR=4096
CONFIG_ARCH_SUPPORTS_MEMORY_FAILURE=y
# CONFIG_MEMORY_FAILURE is not set
# CONFIG_HIGHPTE is not set
# CONFIG_X86_CHECK_BIOS_CORRUPTION is not set
CONFIG_X86_RESERVE_LOW=64
CONFIG_MATH_EMULATION=y
CONFIG_MTRR=y
CONFIG_MTRR_SANITIZER=y
CONFIG_MTRR_SANITIZER_ENABLE_DEFAULT=0
CONFIG_MTRR_SANITIZER_SPARE_REG_NR_DEFAULT=1
# CONFIG_X86_PAT is not set
CONFIG_SECCOMP=y
CONFIG_CC_STACKPROTECTOR=y
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
CONFIG_HZ_300=y
# CONFIG_HZ_1000 is not set
CONFIG_HZ=300
CONFIG_SCHED_HRTICK=y
# CONFIG_KEXEC is not set
CONFIG_CRASH_DUMP=y
CONFIG_PHYSICAL_START=0x1000000
# CONFIG_RELOCATABLE is not set
CONFIG_PHYSICAL_ALIGN=0x1000000
# CONFIG_HOTPLUG_CPU is not set
# CONFIG_COMPAT_VDSO is not set
# CONFIG_CMDLINE_BOOL is not set
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y

#
# Power management and ACPI options
#
CONFIG_PM=y
# CONFIG_PM_DEBUG is not set
# CONFIG_SUSPEND is not set
# CONFIG_PM_RUNTIME is not set
# CONFIG_SFI is not set

#
# CPU Frequency scaling
#
CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_TABLE=m
CONFIG_CPU_FREQ_DEBUG=y
CONFIG_CPU_FREQ_STAT=m
CONFIG_CPU_FREQ_STAT_DETAILS=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set
CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE is not set
CONFIG_CPU_FREQ_GOV_PERFORMANCE=m
CONFIG_CPU_FREQ_GOV_POWERSAVE=y
CONFIG_CPU_FREQ_GOV_USERSPACE=m
CONFIG_CPU_FREQ_GOV_ONDEMAND=m
CONFIG_CPU_FREQ_GOV_CONSERVATIVE=m

#
# CPUFreq processor drivers
#
CONFIG_X86_POWERNOW_K6=m
CONFIG_X86_POWERNOW_K7=m
# CONFIG_X86_SPEEDSTEP_CENTRINO is not set
CONFIG_X86_SPEEDSTEP_ICH=m
# CONFIG_X86_P4_CLOCKMOD is not set
CONFIG_X86_LONGRUN=m

#
# shared options
#
CONFIG_X86_SPEEDSTEP_LIB=m
# CONFIG_X86_SPEEDSTEP_RELAXED_CAP_CHECK is not set
CONFIG_CPU_IDLE=y
CONFIG_CPU_IDLE_GOV_LADDER=y
CONFIG_INTEL_IDLE=y

#
# Bus options (PCI etc.)
#
# CONFIG_PCI is not set
CONFIG_PCI_BIOS=y
# CONFIG_ARCH_SUPPORTS_MSI is not set
CONFIG_ISA_DMA_API=y
# CONFIG_ISA is not set
CONFIG_MCA=y
CONFIG_MCA_LEGACY=y
# CONFIG_MCA_PROC_FS is not set
# CONFIG_SCx200 is not set
CONFIG_OLPC=y
CONFIG_OLPC_OPENFIRMWARE=y
CONFIG_PCCARD=m
# CONFIG_PCMCIA is not set

#
# PC-card bridges
#

#
# Executable file formats / Emulations
#
# CONFIG_BINFMT_ELF is not set
CONFIG_HAVE_AOUT=y
# CONFIG_BINFMT_AOUT is not set
CONFIG_BINFMT_MISC=m
CONFIG_HAVE_ATOMIC_IOMAP=y
CONFIG_HAVE_TEXT_POKE_SMP=y
CONFIG_NET=y

#
# Networking options
#
# CONFIG_PACKET is not set
CONFIG_UNIX=m
CONFIG_XFRM=y
CONFIG_XFRM_USER=m
CONFIG_XFRM_IPCOMP=m
# CONFIG_NET_KEY is not set
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
CONFIG_IP_ADVANCED_ROUTER=y
CONFIG_ASK_IP_FIB_HASH=y
# CONFIG_IP_FIB_TRIE is not set
CONFIG_IP_FIB_HASH=y
# CONFIG_IP_MULTIPLE_TABLES is not set
CONFIG_IP_ROUTE_MULTIPATH=y
CONFIG_IP_ROUTE_VERBOSE=y
# CONFIG_IP_PNP is not set
# CONFIG_NET_IPIP is not set
CONFIG_NET_IPGRE_DEMUX=m
# CONFIG_NET_IPGRE is not set
CONFIG_IP_MROUTE=y
CONFIG_IP_MROUTE_MULTIPLE_TABLES=y
# CONFIG_IP_PIMSM_V1 is not set
# CONFIG_IP_PIMSM_V2 is not set
CONFIG_ARPD=y
# CONFIG_SYN_COOKIES is not set
# CONFIG_INET_AH is not set
CONFIG_INET_ESP=m
CONFIG_INET_IPCOMP=m
CONFIG_INET_XFRM_TUNNEL=m
CONFIG_INET_TUNNEL=m
CONFIG_INET_XFRM_MODE_TRANSPORT=m
CONFIG_INET_XFRM_MODE_TUNNEL=m
CONFIG_INET_XFRM_MODE_BEET=m
# CONFIG_INET_LRO is not set
# CONFIG_INET_DIAG is not set
# CONFIG_TCP_CONG_ADVANCED is not set
CONFIG_TCP_CONG_CUBIC=y
CONFIG_DEFAULT_TCP_CONG="cubic"
CONFIG_IPV6=m
# CONFIG_IPV6_PRIVACY is not set
CONFIG_IPV6_ROUTER_PREF=y
# CONFIG_INET6_AH is not set
CONFIG_INET6_ESP=m
CONFIG_INET6_IPCOMP=m
CONFIG_INET6_XFRM_TUNNEL=m
CONFIG_INET6_TUNNEL=m
CONFIG_INET6_XFRM_MODE_TRANSPORT=m
# CONFIG_INET6_XFRM_MODE_TUNNEL is not set
# CONFIG_INET6_XFRM_MODE_BEET is not set
CONFIG_IPV6_SIT=m
CONFIG_IPV6_NDISC_NODETYPE=y
CONFIG_IPV6_TUNNEL=m
CONFIG_NETLABEL=y
# CONFIG_NETWORK_SECMARK is not set
# CONFIG_NETFILTER is not set
CONFIG_ATM=m
# CONFIG_ATM_CLIP is not set
# CONFIG_ATM_LANE is not set
CONFIG_ATM_BR2684=m
CONFIG_ATM_BR2684_IPFILTER=y
CONFIG_L2TP=m
CONFIG_L2TP_DEBUGFS=m
# CONFIG_BRIDGE is not set
CONFIG_VLAN_8021Q=m
# CONFIG_VLAN_8021Q_GVRP is not set
CONFIG_DECNET=m
CONFIG_LLC=m
CONFIG_LLC2=m
# CONFIG_PHONET is not set
CONFIG_NET_SCHED=y

#
# Queueing/Scheduling
#
CONFIG_NET_SCH_CBQ=m
CONFIG_NET_SCH_HTB=m
CONFIG_NET_SCH_HFSC=m
CONFIG_NET_SCH_ATM=m
CONFIG_NET_SCH_PRIO=m
CONFIG_NET_SCH_MULTIQ=m
CONFIG_NET_SCH_RED=m
CONFIG_NET_SCH_SFQ=m
# CONFIG_NET_SCH_TEQL is not set
# CONFIG_NET_SCH_TBF is not set
CONFIG_NET_SCH_GRED=m
# CONFIG_NET_SCH_DSMARK is not set
# CONFIG_NET_SCH_NETEM is not set
CONFIG_NET_SCH_DRR=m

#
# Classification
#
CONFIG_NET_CLS=y
CONFIG_NET_CLS_BASIC=m
# CONFIG_NET_CLS_TCINDEX is not set
# CONFIG_NET_CLS_ROUTE4 is not set
CONFIG_NET_CLS_FW=m
# CONFIG_NET_CLS_U32 is not set
# CONFIG_NET_CLS_RSVP is not set
CONFIG_NET_CLS_RSVP6=m
CONFIG_NET_CLS_FLOW=m
CONFIG_NET_EMATCH=y
CONFIG_NET_EMATCH_STACK=32
CONFIG_NET_EMATCH_CMP=m
CONFIG_NET_EMATCH_NBYTE=m
CONFIG_NET_EMATCH_U32=m
CONFIG_NET_EMATCH_META=m
CONFIG_NET_EMATCH_TEXT=m
# CONFIG_NET_CLS_ACT is not set
# CONFIG_NET_CLS_IND is not set
CONFIG_NET_SCH_FIFO=y
# CONFIG_DCB is not set
CONFIG_RPS=y

#
# Network testing
#
CONFIG_NET_PKTGEN=m
CONFIG_HAMRADIO=y

#
# Packet Radio protocols
#
# CONFIG_AX25 is not set
CONFIG_CAN=m
CONFIG_CAN_RAW=m
CONFIG_CAN_BCM=m

#
# CAN Device Drivers
#
# CONFIG_CAN_VCAN is not set
CONFIG_CAN_DEV=m
# CONFIG_CAN_CALC_BITTIMING is not set
# CONFIG_CAN_SJA1000 is not set
CONFIG_CAN_DEBUG_DEVICES=y
CONFIG_IRDA=m

#
# IrDA protocols
#
CONFIG_IRLAN=m
# CONFIG_IRNET is not set
# CONFIG_IRCOMM is not set
CONFIG_IRDA_ULTRA=y

#
# IrDA options
#
# CONFIG_IRDA_CACHE_LAST_LSAP is not set
# CONFIG_IRDA_FAST_RR is not set
CONFIG_IRDA_DEBUG=y

#
# Infrared-port device drivers
#

#
# SIR device drivers
#
# CONFIG_IRTTY_SIR is not set

#
# Dongle support
#

#
# FIR device drivers
#
# CONFIG_NSC_FIR is not set
# CONFIG_WINBOND_FIR is not set
CONFIG_VIA_FIR=m
CONFIG_BT=m
# CONFIG_BT_L2CAP is not set
# CONFIG_BT_SCO is not set

#
# Bluetooth device drivers
#
# CONFIG_BT_HCIUART is not set
CONFIG_BT_HCIVHCI=m
# CONFIG_BT_MRVL is not set
CONFIG_FIB_RULES=y
# CONFIG_WIRELESS is not set
# CONFIG_WIMAX is not set
# CONFIG_RFKILL is not set
# CONFIG_CAIF is not set

#
# Device Drivers
#

#
# Generic Driver Options
#
CONFIG_UEVENT_HELPER_PATH=""
CONFIG_DEVTMPFS=y
CONFIG_STANDALONE=y
CONFIG_PREVENT_FIRMWARE_BUILD=y
CONFIG_FW_LOADER=m
# CONFIG_FIRMWARE_IN_KERNEL is not set
CONFIG_EXTRA_FIRMWARE=""
# CONFIG_SYS_HYPERVISOR is not set
# CONFIG_CONNECTOR is not set
CONFIG_MTD=m
CONFIG_MTD_DEBUG=y
CONFIG_MTD_DEBUG_VERBOSE=0
# CONFIG_MTD_TESTS is not set
# CONFIG_MTD_CONCAT is not set
CONFIG_MTD_PARTITIONS=y
# CONFIG_MTD_REDBOOT_PARTS is not set
CONFIG_MTD_AR7_PARTS=m

#
# User Modules And Translation Layers
#
CONFIG_MTD_CHAR=m
CONFIG_MTD_BLKDEVS=m
# CONFIG_MTD_BLOCK is not set
# CONFIG_MTD_BLOCK_RO is not set
# CONFIG_FTL is not set
CONFIG_NFTL=m
# CONFIG_NFTL_RW is not set
# CONFIG_INFTL is not set
# CONFIG_RFD_FTL is not set
# CONFIG_SSFDC is not set
CONFIG_MTD_OOPS=m

#
# RAM/ROM/Flash chip drivers
#
CONFIG_MTD_CFI=m
# CONFIG_MTD_JEDECPROBE is not set
CONFIG_MTD_GEN_PROBE=m
# CONFIG_MTD_CFI_ADV_OPTIONS is not set
CONFIG_MTD_MAP_BANK_WIDTH_1=y
CONFIG_MTD_MAP_BANK_WIDTH_2=y
CONFIG_MTD_MAP_BANK_WIDTH_4=y
# CONFIG_MTD_MAP_BANK_WIDTH_8 is not set
# CONFIG_MTD_MAP_BANK_WIDTH_16 is not set
# CONFIG_MTD_MAP_BANK_WIDTH_32 is not set
CONFIG_MTD_CFI_I1=y
CONFIG_MTD_CFI_I2=y
# CONFIG_MTD_CFI_I4 is not set
# CONFIG_MTD_CFI_I8 is not set
# CONFIG_MTD_CFI_INTELEXT is not set
# CONFIG_MTD_CFI_AMDSTD is not set
# CONFIG_MTD_CFI_STAA is not set
CONFIG_MTD_CFI_UTIL=m
# CONFIG_MTD_RAM is not set
CONFIG_MTD_ROM=m
# CONFIG_MTD_ABSENT is not set

#
# Mapping drivers for chip access
#
CONFIG_MTD_COMPLEX_MAPPINGS=y
CONFIG_MTD_PHYSMAP=m
CONFIG_MTD_PHYSMAP_COMPAT=y
CONFIG_MTD_PHYSMAP_START=0x8000000
CONFIG_MTD_PHYSMAP_LEN=0
CONFIG_MTD_PHYSMAP_BANKWIDTH=2
# CONFIG_MTD_NETSC520 is not set
# CONFIG_MTD_TS5500 is not set
# CONFIG_MTD_GPIO_ADDR is not set
# CONFIG_MTD_PLATRAM is not set

#
# Self-contained MTD device drivers
#
CONFIG_MTD_SLRAM=m
CONFIG_MTD_PHRAM=m
CONFIG_MTD_MTDRAM=m
CONFIG_MTDRAM_TOTAL_SIZE=4096
CONFIG_MTDRAM_ERASE_SIZE=128
CONFIG_MTD_BLOCK2MTD=m

#
# Disk-On-Chip Device Drivers
#
# CONFIG_MTD_DOC2000 is not set
# CONFIG_MTD_DOC2001 is not set
CONFIG_MTD_DOC2001PLUS=m
CONFIG_MTD_DOCPROBE=m
CONFIG_MTD_DOCECC=m
CONFIG_MTD_DOCPROBE_ADVANCED=y
CONFIG_MTD_DOCPROBE_ADDRESS=0x0000
# CONFIG_MTD_DOCPROBE_HIGH is not set
# CONFIG_MTD_DOCPROBE_55AA is not set
# CONFIG_MTD_NAND is not set
CONFIG_MTD_NAND_IDS=m
# CONFIG_MTD_ONENAND is not set

#
# LPDDR flash memory drivers
#
CONFIG_MTD_LPDDR=m
CONFIG_MTD_QINFO_PROBE=m
CONFIG_MTD_UBI=m
CONFIG_MTD_UBI_WL_THRESHOLD=4096
CONFIG_MTD_UBI_BEB_RESERVE=1
# CONFIG_MTD_UBI_GLUEBI is not set

#
# UBI debugging options
#
# CONFIG_MTD_UBI_DEBUG is not set
CONFIG_PARPORT=m
CONFIG_PARPORT_PC=m
# CONFIG_PARPORT_GSC is not set
CONFIG_PARPORT_AX88796=m
# CONFIG_PARPORT_1284 is not set
CONFIG_PARPORT_NOT_PC=y
# CONFIG_BLK_DEV is not set
CONFIG_MISC_DEVICES=y
CONFIG_AD525X_DPOT=m
# CONFIG_AD525X_DPOT_I2C is not set
CONFIG_ENCLOSURE_SERVICES=m
CONFIG_APDS9802ALS=m
CONFIG_ISL29003=m
CONFIG_ISL29020=m
# CONFIG_SENSORS_TSL2550 is not set
# CONFIG_SENSORS_BH1780 is not set
CONFIG_SENSORS_BH1770=m
CONFIG_SENSORS_APDS990X=m
CONFIG_HMC6352=m
# CONFIG_VMWARE_BALLOON is not set
# CONFIG_BMP085 is not set

#
# EEPROM support
#
# CONFIG_EEPROM_AT24 is not set
CONFIG_EEPROM_LEGACY=m
# CONFIG_EEPROM_93CX6 is not set

#
# Texas Instruments shared transport line discipline
#
CONFIG_HAVE_IDE=y
# CONFIG_IDE is not set

#
# SCSI device support
#
CONFIG_SCSI_MOD=m
# CONFIG_RAID_ATTRS is not set
CONFIG_SCSI=m
CONFIG_SCSI_DMA=y
CONFIG_SCSI_NETLINK=y
CONFIG_SCSI_PROC_FS=y

#
# SCSI support type (disk, tape, CD-ROM)
#
# CONFIG_BLK_DEV_SD is not set
# CONFIG_CHR_DEV_ST is not set
# CONFIG_CHR_DEV_OSST is not set
# CONFIG_BLK_DEV_SR is not set
# CONFIG_CHR_DEV_SG is not set
CONFIG_CHR_DEV_SCH=m
CONFIG_SCSI_ENCLOSURE=m
CONFIG_SCSI_MULTI_LUN=y
# CONFIG_SCSI_CONSTANTS is not set
# CONFIG_SCSI_LOGGING is not set
# CONFIG_SCSI_SCAN_ASYNC is not set
CONFIG_SCSI_WAIT_SCAN=m

#
# SCSI Transports
#
CONFIG_SCSI_SPI_ATTRS=m
CONFIG_SCSI_FC_ATTRS=m
# CONFIG_SCSI_ISCSI_ATTRS is not set
CONFIG_SCSI_SAS_ATTRS=m
CONFIG_SCSI_SAS_LIBSAS=m
# CONFIG_SCSI_SAS_HOST_SMP is not set
CONFIG_SCSI_SAS_LIBSAS_DEBUG=y
# CONFIG_SCSI_SRP_ATTRS is not set
CONFIG_SCSI_LOWLEVEL=y
# CONFIG_ISCSI_TCP is not set
CONFIG_ISCSI_BOOT_SYSFS=m
CONFIG_SCSI_BUSLOGIC=m
CONFIG_LIBFC=m
CONFIG_LIBFCOE=m
# CONFIG_SCSI_FD_MCS is not set
# CONFIG_SCSI_IBMMCA is not set
# CONFIG_SCSI_PPA is not set
CONFIG_SCSI_IMM=m
# CONFIG_SCSI_IZIP_EPP16 is not set
CONFIG_SCSI_IZIP_SLOW_CTR=y
CONFIG_SCSI_NCR_D700=m
# CONFIG_SCSI_NCR_Q720 is not set
# CONFIG_SCSI_SIM710 is not set
# CONFIG_SCSI_DH is not set
# CONFIG_SCSI_OSD_INITIATOR is not set
# CONFIG_ATA is not set
CONFIG_MD=y
CONFIG_BLK_DEV_MD=m
# CONFIG_MD_LINEAR is not set
CONFIG_MD_RAID0=m
CONFIG_MD_RAID1=m
# CONFIG_MD_RAID10 is not set
# CONFIG_MD_RAID456 is not set
# CONFIG_MD_MULTIPATH is not set
# CONFIG_MD_FAULTY is not set
# CONFIG_BLK_DEV_DM is not set
CONFIG_MACINTOSH_DRIVERS=y
CONFIG_NETDEVICES=y
CONFIG_DUMMY=m
CONFIG_BONDING=m
CONFIG_EQUALIZER=m
# CONFIG_TUN is not set
CONFIG_VETH=m
# CONFIG_MII is not set
CONFIG_PHYLIB=m

#
# MII PHY device drivers
#
# CONFIG_MARVELL_PHY is not set
# CONFIG_DAVICOM_PHY is not set
CONFIG_QSEMI_PHY=m
CONFIG_LXT_PHY=m
CONFIG_CICADA_PHY=m
CONFIG_VITESSE_PHY=m
CONFIG_SMSC_PHY=m
CONFIG_BROADCOM_PHY=m
# CONFIG_BCM63XX_PHY is not set
CONFIG_ICPLUS_PHY=m
CONFIG_REALTEK_PHY=m
# CONFIG_NATIONAL_PHY is not set
# CONFIG_STE10XP is not set
CONFIG_LSI_ET1011C_PHY=m
CONFIG_MICREL_PHY=m
CONFIG_MDIO_BITBANG=m
CONFIG_MDIO_GPIO=m
# CONFIG_NET_ETHERNET is not set
CONFIG_NETDEV_1000=y
# CONFIG_STMMAC_ETH is not set
# CONFIG_NETDEV_10000 is not set
CONFIG_TR=m
# CONFIG_IBMTR is not set
CONFIG_TMS380TR=m
# CONFIG_MADGEMC is not set
# CONFIG_SMCTR is not set
# CONFIG_WLAN is not set

#
# Enable WiMAX (Networking options) to see the WiMAX drivers
#
CONFIG_WAN=y
# CONFIG_HDLC is not set
# CONFIG_DLCI is not set
# CONFIG_SBNI is not set
CONFIG_ATM_DRIVERS=y
# CONFIG_ATM_DUMMY is not set
# CONFIG_ATM_TCP is not set

#
# CAIF transport drivers
#
# CONFIG_PLIP is not set
CONFIG_PPP=m
CONFIG_PPP_FILTER=y
CONFIG_PPP_ASYNC=m
# CONFIG_PPP_SYNC_TTY is not set
CONFIG_PPP_DEFLATE=m
CONFIG_PPP_BSDCOMP=m
CONFIG_PPPOATM=m
# CONFIG_SLIP is not set
CONFIG_SLHC=m
# CONFIG_NETCONSOLE is not set
# CONFIG_NETPOLL is not set
# CONFIG_NET_POLL_CONTROLLER is not set
# CONFIG_ISDN is not set
CONFIG_PHONE=m

#
# Input device support
#
# CONFIG_INPUT is not set

#
# Hardware I/O ports
#
CONFIG_SERIO=m
# CONFIG_SERIO_I8042 is not set
CONFIG_SERIO_SERPORT=m
# CONFIG_SERIO_CT82C710 is not set
CONFIG_SERIO_PARKBD=m
# CONFIG_SERIO_LIBPS2 is not set
# CONFIG_SERIO_RAW is not set
# CONFIG_SERIO_ALTERA_PS2 is not set
CONFIG_SERIO_PS2MULT=m
# CONFIG_GAMEPORT is not set

#
# Character devices
#
# CONFIG_VT is not set
# CONFIG_DEVKMEM is not set
CONFIG_SERIAL_NONSTANDARD=y
CONFIG_N_HDLC=m
CONFIG_RISCOM8=m
# CONFIG_SPECIALIX is not set
CONFIG_STALDRV=y

#
# Serial drivers
#
CONFIG_SERIAL_8250=m
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_SERIAL_8250_NR_UARTS=4
CONFIG_SERIAL_8250_RUNTIME_UARTS=4
CONFIG_SERIAL_8250_EXTENDED=y
CONFIG_SERIAL_8250_MANY_PORTS=y
CONFIG_SERIAL_8250_SHARE_IRQ=y
# CONFIG_SERIAL_8250_DETECT_IRQ is not set
# CONFIG_SERIAL_8250_RSA is not set
CONFIG_SERIAL_8250_MCA=m

#
# Non-8250 serial port support
#
CONFIG_SERIAL_CORE=m
CONFIG_SERIAL_TIMBERDALE=m
# CONFIG_SERIAL_ALTERA_JTAGUART is not set
CONFIG_SERIAL_ALTERA_UART=m
CONFIG_SERIAL_ALTERA_UART_MAXPORTS=4
CONFIG_SERIAL_ALTERA_UART_BAUDRATE=115200
CONFIG_UNIX98_PTYS=y
CONFIG_DEVPTS_MULTIPLE_INSTANCES=y
CONFIG_LEGACY_PTYS=y
CONFIG_LEGACY_PTY_COUNT=256
CONFIG_TTY_PRINTK=y
# CONFIG_PRINTER is not set
# CONFIG_PPDEV is not set
CONFIG_HVC_DRIVER=y
CONFIG_VIRTIO_CONSOLE=m
CONFIG_IPMI_HANDLER=m
# CONFIG_IPMI_PANIC_EVENT is not set
# CONFIG_IPMI_DEVICE_INTERFACE is not set
CONFIG_IPMI_SI=m
CONFIG_IPMI_WATCHDOG=m
CONFIG_IPMI_POWEROFF=m
CONFIG_HW_RANDOM=m
# CONFIG_HW_RANDOM_TIMERIOMEM is not set
CONFIG_HW_RANDOM_VIA=m
CONFIG_HW_RANDOM_VIRTIO=m
CONFIG_NVRAM=m
# CONFIG_RTC is not set
# CONFIG_GEN_RTC is not set
# CONFIG_R3964 is not set
# CONFIG_MWAVE is not set
# CONFIG_PC8736x_GPIO is not set
CONFIG_NSC_GPIO=m
CONFIG_CS5535_GPIO=m
CONFIG_RAW_DRIVER=m
CONFIG_MAX_RAW_DEVS=256
# CONFIG_HANGCHECK_TIMER is not set
# CONFIG_RAMOOPS is not set
CONFIG_I2C=m
CONFIG_I2C_BOARDINFO=y
CONFIG_I2C_COMPAT=y
# CONFIG_I2C_CHARDEV is not set
# CONFIG_I2C_HELPER_AUTO is not set
CONFIG_I2C_SMBUS=m

#
# I2C Algorithms
#
CONFIG_I2C_ALGOBIT=m
CONFIG_I2C_ALGOPCF=m
CONFIG_I2C_ALGOPCA=m

#
# I2C Hardware Bus support
#

#
# I2C system bus drivers (mostly embedded / system-on-chip)
#
CONFIG_I2C_GPIO=m
CONFIG_I2C_PCA_PLATFORM=m
CONFIG_I2C_SIMTEC=m

#
# External I2C/SMBus adapter drivers
#
# CONFIG_I2C_PARPORT is not set
CONFIG_I2C_PARPORT_LIGHT=m

#
# Other I2C/SMBus bus drivers
#
CONFIG_I2C_DEBUG_CORE=y
CONFIG_I2C_DEBUG_ALGO=y
CONFIG_I2C_DEBUG_BUS=y
# CONFIG_SPI is not set

#
# PPS support
#
CONFIG_ARCH_WANT_OPTIONAL_GPIOLIB=y
CONFIG_GPIOLIB=y

#
# Memory mapped GPIO expanders:
#
CONFIG_GPIO_BASIC_MMIO=m
# CONFIG_GPIO_IT8761E is not set
CONFIG_GPIO_VX855=m

#
# I2C GPIO expanders:
#
# CONFIG_GPIO_MAX7300 is not set
CONFIG_GPIO_MAX732X=m
# CONFIG_GPIO_PCA953X is not set
# CONFIG_GPIO_PCF857X is not set
CONFIG_GPIO_ADP5588=m

#
# PCI GPIO expanders:
#

#
# SPI GPIO expanders:
#

#
# AC97 GPIO expanders:
#

#
# MODULbus GPIO expanders:
#
# CONFIG_W1 is not set
# CONFIG_POWER_SUPPLY is not set
CONFIG_HWMON=m
CONFIG_HWMON_VID=m
CONFIG_HWMON_DEBUG_CHIP=y

#
# Native drivers
#
CONFIG_SENSORS_ADM1021=m
CONFIG_SENSORS_ADM1025=m
# CONFIG_SENSORS_ADM1026 is not set
CONFIG_SENSORS_ADM1029=m
# CONFIG_SENSORS_ADM1031 is not set
# CONFIG_SENSORS_ADM9240 is not set
# CONFIG_SENSORS_ADT7475 is not set
# CONFIG_SENSORS_ASC7621 is not set
# CONFIG_SENSORS_DS1621 is not set
CONFIG_SENSORS_F71805F=m
# CONFIG_SENSORS_F71882FG is not set
CONFIG_SENSORS_F75375S=m
CONFIG_SENSORS_FSCHMD=m
# CONFIG_SENSORS_G760A is not set
# CONFIG_SENSORS_GL518SM is not set
CONFIG_SENSORS_GL520SM=m
# CONFIG_SENSORS_GPIO_FAN is not set
# CONFIG_SENSORS_IBMAEM is not set
CONFIG_SENSORS_IBMPEX=m
CONFIG_SENSORS_IT87=m
CONFIG_SENSORS_JC42=m
CONFIG_SENSORS_LM63=m
# CONFIG_SENSORS_LM73 is not set
CONFIG_SENSORS_LM75=m
# CONFIG_SENSORS_LM77 is not set
# CONFIG_SENSORS_LM78 is not set
# CONFIG_SENSORS_LM80 is not set
CONFIG_SENSORS_LM83=m
CONFIG_SENSORS_LM85=m
# CONFIG_SENSORS_LM87 is not set
CONFIG_SENSORS_LM90=m
CONFIG_SENSORS_LM92=m
CONFIG_SENSORS_LM93=m
# CONFIG_SENSORS_LM95241 is not set
# CONFIG_SENSORS_MAX1619 is not set
CONFIG_SENSORS_PC87360=m
# CONFIG_SENSORS_PC87427 is not set
# CONFIG_SENSORS_PCF8591 is not set
CONFIG_SENSORS_SHT15=m
CONFIG_SENSORS_EMC1403=m
CONFIG_SENSORS_EMC2103=m
CONFIG_SENSORS_SMSC47M1=m
# CONFIG_SENSORS_SMSC47M192 is not set
CONFIG_SENSORS_ADS7828=m
# CONFIG_SENSORS_THMC50 is not set
CONFIG_SENSORS_VIA_CPUTEMP=m
CONFIG_SENSORS_VT1211=m
# CONFIG_SENSORS_W83781D is not set
# CONFIG_SENSORS_W83791D is not set
CONFIG_SENSORS_W83792D=m
CONFIG_SENSORS_W83627HF=m
# CONFIG_SENSORS_W83627EHF is not set
# CONFIG_THERMAL is not set
# CONFIG_WATCHDOG is not set
CONFIG_SSB_POSSIBLE=y

#
# Sonics Silicon Backplane
#
# CONFIG_SSB is not set
CONFIG_MFD_SUPPORT=y
CONFIG_MFD_CORE=m
CONFIG_MFD_SM501=m
# CONFIG_MFD_SM501_GPIO is not set
CONFIG_HTC_PASIC3=m
# CONFIG_TPS65010 is not set
# CONFIG_TPS6507X is not set
# CONFIG_MFD_TMIO is not set
# CONFIG_MFD_WM8400 is not set
# CONFIG_MFD_PCF50633 is not set
CONFIG_ABX500_CORE=y
CONFIG_MFD_VX855=m
CONFIG_REGULATOR=y
# CONFIG_REGULATOR_DEBUG is not set
# CONFIG_REGULATOR_DUMMY is not set
# CONFIG_REGULATOR_FIXED_VOLTAGE is not set
CONFIG_REGULATOR_VIRTUAL_CONSUMER=m
# CONFIG_REGULATOR_USERSPACE_CONSUMER is not set
# CONFIG_REGULATOR_BQ24022 is not set
# CONFIG_REGULATOR_MAX1586 is not set
# CONFIG_REGULATOR_MAX8649 is not set
# CONFIG_REGULATOR_MAX8660 is not set
CONFIG_REGULATOR_MAX8952=m
CONFIG_REGULATOR_LP3971=m
# CONFIG_REGULATOR_LP3972 is not set
# CONFIG_REGULATOR_TPS65023 is not set
CONFIG_REGULATOR_TPS6507X=m
CONFIG_REGULATOR_ISL6271A=m
# CONFIG_REGULATOR_AD5398 is not set
# CONFIG_MEDIA_SUPPORT is not set

#
# Graphics support
#
# CONFIG_DRM is not set
CONFIG_VGASTATE=m
# CONFIG_VIDEO_OUTPUT_CONTROL is not set
CONFIG_FB=m
# CONFIG_FIRMWARE_EDID is not set
# CONFIG_FB_DDC is not set
# CONFIG_FB_BOOT_VESA_SUPPORT is not set
CONFIG_FB_CFB_FILLRECT=m
CONFIG_FB_CFB_COPYAREA=m
CONFIG_FB_CFB_IMAGEBLIT=m
# CONFIG_FB_CFB_REV_PIXELS_IN_BYTE is not set
CONFIG_FB_SYS_FILLRECT=m
CONFIG_FB_SYS_COPYAREA=m
CONFIG_FB_SYS_IMAGEBLIT=m
# CONFIG_FB_FOREIGN_ENDIAN is not set
CONFIG_FB_SYS_FOPS=m
CONFIG_FB_DEFERRED_IO=y
CONFIG_FB_HECUBA=m
# CONFIG_FB_SVGALIB is not set
# CONFIG_FB_MACMODES is not set
# CONFIG_FB_BACKLIGHT is not set
# CONFIG_FB_MODE_HELPERS is not set
CONFIG_FB_TILEBLITTING=y

#
# Frame buffer hardware drivers
#
CONFIG_FB_ARC=m
CONFIG_FB_VGA16=m
CONFIG_FB_N411=m
CONFIG_FB_HGA=m
CONFIG_FB_S1D13XXX=m
# CONFIG_FB_TMIO is not set
# CONFIG_FB_SM501 is not set
# CONFIG_FB_VIRTUAL is not set
CONFIG_FB_METRONOME=m
CONFIG_FB_MB862XX=m
# CONFIG_FB_BROADSHEET is not set
CONFIG_BACKLIGHT_LCD_SUPPORT=y
# CONFIG_LCD_CLASS_DEVICE is not set
CONFIG_BACKLIGHT_CLASS_DEVICE=m
CONFIG_BACKLIGHT_GENERIC=m
CONFIG_BACKLIGHT_MBP_NVIDIA=m
CONFIG_BACKLIGHT_SAHARA=m
CONFIG_BACKLIGHT_ADP8860=m

#
# Display device support
#
CONFIG_DISPLAY_SUPPORT=m

#
# Display hardware drivers
#
# CONFIG_LOGO is not set
# CONFIG_SOUND is not set
CONFIG_USB_SUPPORT=y
# CONFIG_USB_ARCH_HAS_HCD is not set
# CONFIG_USB_ARCH_HAS_OHCI is not set
# CONFIG_USB_ARCH_HAS_EHCI is not set
# CONFIG_USB_OTG_WHITELIST is not set
CONFIG_USB_OTG_BLACKLIST_HUB=y

#
# NOTE: USB_STORAGE depends on SCSI but BLK_DEV_SD may
#
CONFIG_USB_GADGET=m
# CONFIG_USB_GADGET_DEBUG_FILES is not set
# CONFIG_USB_GADGET_DEBUG_FS is not set
CONFIG_USB_GADGET_VBUS_DRAW=2
CONFIG_USB_GADGET_SELECTED=y
CONFIG_USB_GADGET_R8A66597=y
CONFIG_USB_R8A66597=m
# CONFIG_USB_GADGET_M66592 is not set
CONFIG_USB_GADGET_DUALSPEED=y
# CONFIG_USB_ZERO is not set
CONFIG_USB_ETH=m
CONFIG_USB_ETH_RNDIS=y
CONFIG_USB_ETH_EEM=y
# CONFIG_USB_FILE_STORAGE is not set
# CONFIG_USB_MASS_STORAGE is not set
CONFIG_USB_G_SERIAL=m
# CONFIG_USB_G_PRINTER is not set
# CONFIG_USB_CDC_COMPOSITE is not set
# CONFIG_USB_G_MULTI is not set
CONFIG_USB_G_HID=m
CONFIG_USB_G_DBGP=m
# CONFIG_USB_G_DBGP_PRINTK is not set
CONFIG_USB_G_DBGP_SERIAL=y

#
# OTG and related infrastructure
#
CONFIG_USB_OTG_UTILS=y
CONFIG_USB_GPIO_VBUS=m
CONFIG_NOP_USB_XCEIV=m
# CONFIG_MMC is not set
CONFIG_MEMSTICK=m
CONFIG_MEMSTICK_DEBUG=y

#
# MemoryStick drivers
#
CONFIG_MEMSTICK_UNSAFE_RESUME=y
CONFIG_MSPRO_BLOCK=m

#
# MemoryStick Host Controller Drivers
#
CONFIG_NEW_LEDS=y
CONFIG_LEDS_CLASS=y

#
# LED drivers
#
# CONFIG_LEDS_ALIX2 is not set
CONFIG_LEDS_GPIO=m
CONFIG_LEDS_GPIO_PLATFORM=y
CONFIG_LEDS_LP3944=m
# CONFIG_LEDS_LP5521 is not set
# CONFIG_LEDS_LP5523 is not set
# CONFIG_LEDS_PCA955X is not set
CONFIG_LEDS_REGULATOR=m
# CONFIG_LEDS_BD2802 is not set
CONFIG_LEDS_LT3593=m
CONFIG_LEDS_TRIGGERS=y

#
# LED Triggers
#
# CONFIG_LEDS_TRIGGER_TIMER is not set
# CONFIG_LEDS_TRIGGER_HEARTBEAT is not set
# CONFIG_LEDS_TRIGGER_BACKLIGHT is not set
# CONFIG_LEDS_TRIGGER_GPIO is not set
# CONFIG_LEDS_TRIGGER_DEFAULT_ON is not set

#
# iptables trigger is under Netfilter config (LED target)
#
CONFIG_ACCESSIBILITY=y
CONFIG_EDAC=y

#
# Reporting subsystems
#
# CONFIG_EDAC_DEBUG is not set
# CONFIG_EDAC_MM_EDAC is not set
# CONFIG_RTC_CLASS is not set
CONFIG_DMADEVICES=y
CONFIG_DMADEVICES_DEBUG=y
# CONFIG_DMADEVICES_VDEBUG is not set

#
# DMA Devices
#
# CONFIG_TIMB_DMA is not set
# CONFIG_AUXDISPLAY is not set
CONFIG_UIO=m
CONFIG_UIO_PDRV=m
CONFIG_UIO_PDRV_GENIRQ=m
CONFIG_X86_PLATFORM_DEVICES=y

#
# Firmware Drivers
#
CONFIG_EDD=m
CONFIG_EDD_OFF=y
CONFIG_FIRMWARE_MEMMAP=y
# CONFIG_DELL_RBU is not set
# CONFIG_DCDBAS is not set
CONFIG_DMIID=y
# CONFIG_ISCSI_IBFT_FIND is not set

#
# File systems
#
CONFIG_EXT2_FS=m
# CONFIG_EXT2_FS_XATTR is not set
CONFIG_EXT2_FS_XIP=y
# CONFIG_EXT3_FS is not set
# CONFIG_EXT4_FS is not set
CONFIG_FS_XIP=y
CONFIG_REISERFS_FS=m
CONFIG_REISERFS_CHECK=y
# CONFIG_REISERFS_PROC_INFO is not set
CONFIG_REISERFS_FS_XATTR=y
# CONFIG_REISERFS_FS_POSIX_ACL is not set
# CONFIG_REISERFS_FS_SECURITY is not set
# CONFIG_JFS_FS is not set
# CONFIG_FS_POSIX_ACL is not set
CONFIG_XFS_FS=m
CONFIG_XFS_QUOTA=y
# CONFIG_XFS_POSIX_ACL is not set
CONFIG_XFS_RT=y
# CONFIG_OCFS2_FS is not set
CONFIG_EXPORTFS=m
CONFIG_FILE_LOCKING=y
CONFIG_FSNOTIFY=y
# CONFIG_DNOTIFY is not set
# CONFIG_INOTIFY_USER is not set
# CONFIG_FANOTIFY is not set
# CONFIG_QUOTA is not set
# CONFIG_QUOTA_NETLINK_INTERFACE is not set
CONFIG_QUOTACTL=y
CONFIG_AUTOFS4_FS=m
# CONFIG_FUSE_FS is not set

#
# Caches
#
CONFIG_FSCACHE=m
# CONFIG_FSCACHE_STATS is not set
# CONFIG_FSCACHE_HISTOGRAM is not set
# CONFIG_FSCACHE_DEBUG is not set
CONFIG_FSCACHE_OBJECT_LIST=y
CONFIG_CACHEFILES=m
# CONFIG_CACHEFILES_DEBUG is not set
# CONFIG_CACHEFILES_HISTOGRAM is not set

#
# CD-ROM/DVD Filesystems
#
# CONFIG_ISO9660_FS is not set

#
# DOS/FAT/NT Filesystems
#
# CONFIG_MSDOS_FS is not set
# CONFIG_VFAT_FS is not set
# CONFIG_NTFS_FS is not set

#
# Pseudo filesystems
#
CONFIG_PROC_FS=y
CONFIG_PROC_KCORE=y
CONFIG_PROC_VMCORE=y
CONFIG_PROC_SYSCTL=y
# CONFIG_PROC_PAGE_MONITOR is not set
CONFIG_SYSFS=y
# CONFIG_HUGETLBFS is not set
# CONFIG_HUGETLB_PAGE is not set
CONFIG_CONFIGFS_FS=m
CONFIG_MISC_FILESYSTEMS=y
# CONFIG_HFSPLUS_FS is not set
# CONFIG_JFFS2_FS is not set
CONFIG_UBIFS_FS=m
# CONFIG_UBIFS_FS_XATTR is not set
CONFIG_UBIFS_FS_ADVANCED_COMPR=y
CONFIG_UBIFS_FS_LZO=y
CONFIG_UBIFS_FS_ZLIB=y
CONFIG_UBIFS_FS_DEBUG=y
CONFIG_UBIFS_FS_DEBUG_MSG_LVL=0
# CONFIG_UBIFS_FS_DEBUG_CHKS is not set
CONFIG_CRAMFS=m
CONFIG_SQUASHFS=m
# CONFIG_SQUASHFS_XATTR is not set
# CONFIG_SQUASHFS_LZO is not set
CONFIG_SQUASHFS_EMBEDDED=y
CONFIG_SQUASHFS_FRAGMENT_CACHE_SIZE=3
# CONFIG_VXFS_FS is not set
CONFIG_MINIX_FS=m
CONFIG_OMFS_FS=m
# CONFIG_QNX4FS_FS is not set
# CONFIG_ROMFS_FS is not set
CONFIG_SYSV_FS=m
# CONFIG_NETWORK_FILESYSTEMS is not set

#
# Partition Types
#
# CONFIG_PARTITION_ADVANCED is not set
CONFIG_MSDOS_PARTITION=y
# CONFIG_NLS is not set

#
# Kernel hacking
#
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
# CONFIG_ENABLE_WARN_DEPRECATED is not set
CONFIG_ENABLE_MUST_CHECK=y
CONFIG_FRAME_WARN=1024
CONFIG_MAGIC_SYSRQ=y
CONFIG_STRIP_ASM_SYMS=y
# CONFIG_UNUSED_SYMBOLS is not set
CONFIG_DEBUG_FS=y
# CONFIG_HEADERS_CHECK is not set
# CONFIG_DEBUG_KERNEL is not set
# CONFIG_HARDLOCKUP_DETECTOR is not set
# CONFIG_SLUB_DEBUG_ON is not set
CONFIG_SLUB_STATS=y
# CONFIG_BKL is not set
# CONFIG_SPARSE_RCU_POINTER is not set
CONFIG_STACKTRACE=y
# CONFIG_DEBUG_MEMORY_INIT is not set
CONFIG_ARCH_WANT_FRAME_POINTERS=y
CONFIG_FRAME_POINTER=y
# CONFIG_RCU_CPU_STALL_DETECTOR is not set
# CONFIG_LKDTM is not set
# CONFIG_SYSCTL_SYSCALL_CHECK is not set
CONFIG_USER_STACKTRACE_SUPPORT=y
CONFIG_NOP_TRACER=y
CONFIG_HAVE_FTRACE_NMI_ENTER=y
CONFIG_HAVE_FUNCTION_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_FP_TEST=y
CONFIG_HAVE_FUNCTION_TRACE_MCOUNT_TEST=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y
CONFIG_HAVE_SYSCALL_TRACEPOINTS=y
CONFIG_HAVE_C_RECORDMCOUNT=y
CONFIG_RING_BUFFER=y
CONFIG_FTRACE_NMI_ENTER=y
CONFIG_EVENT_TRACING=y
# CONFIG_EVENT_POWER_TRACING_DEPRECATED is not set
CONFIG_CONTEXT_SWITCH_TRACER=y
CONFIG_TRACING=y
CONFIG_GENERIC_TRACER=y
CONFIG_TRACING_SUPPORT=y
CONFIG_FTRACE=y
CONFIG_FUNCTION_TRACER=y
# CONFIG_IRQSOFF_TRACER is not set
# CONFIG_SCHED_TRACER is not set
CONFIG_FTRACE_SYSCALLS=y
CONFIG_BRANCH_PROFILE_NONE=y
# CONFIG_PROFILE_ANNOTATED_BRANCHES is not set
# CONFIG_PROFILE_ALL_BRANCHES is not set
# CONFIG_STACK_TRACER is not set
# CONFIG_BLK_DEV_IO_TRACE is not set
CONFIG_DYNAMIC_FTRACE=y
CONFIG_FUNCTION_PROFILER=y
CONFIG_FTRACE_MCOUNT_RECORD=y
CONFIG_FTRACE_SELFTEST=y
CONFIG_FTRACE_STARTUP_TEST=y
# CONFIG_EVENT_TRACE_TEST_SYSCALLS is not set
# CONFIG_RING_BUFFER_BENCHMARK is not set
# CONFIG_DMA_API_DEBUG is not set
CONFIG_ATOMIC64_SELFTEST=y
CONFIG_SAMPLES=y
# CONFIG_SAMPLE_TRACEPOINTS is not set
CONFIG_SAMPLE_TRACE_EVENTS=m
CONFIG_SAMPLE_KOBJECT=m
CONFIG_SAMPLE_HW_BREAKPOINT=m
CONFIG_SAMPLE_KFIFO=m
CONFIG_HAVE_ARCH_KGDB=y
CONFIG_HAVE_ARCH_KMEMCHECK=y
# CONFIG_STRICT_DEVMEM is not set
# CONFIG_X86_VERBOSE_BOOTUP is not set
CONFIG_EARLY_PRINTK=y
CONFIG_DOUBLEFAULT=y
# CONFIG_IOMMU_STRESS is not set
CONFIG_HAVE_MMIOTRACE_SUPPORT=y
CONFIG_IO_DELAY_TYPE_0X80=0
CONFIG_IO_DELAY_TYPE_0XED=1
CONFIG_IO_DELAY_TYPE_UDELAY=2
CONFIG_IO_DELAY_TYPE_NONE=3
CONFIG_IO_DELAY_0X80=y
# CONFIG_IO_DELAY_0XED is not set
# CONFIG_IO_DELAY_UDELAY is not set
# CONFIG_IO_DELAY_NONE is not set
CONFIG_DEFAULT_IO_DELAY_TYPE=0
# CONFIG_OPTIMIZE_INLINING is not set

#
# Security options
#
# CONFIG_KEYS is not set
# CONFIG_SECURITY_DMESG_RESTRICT is not set
CONFIG_SECURITY=y
CONFIG_SECURITYFS=y
# CONFIG_SECURITY_NETWORK is not set
CONFIG_SECURITY_PATH=y
CONFIG_SECURITY_TOMOYO=y
# CONFIG_SECURITY_APPARMOR is not set
# CONFIG_IMA is not set
CONFIG_DEFAULT_SECURITY_TOMOYO=y
# CONFIG_DEFAULT_SECURITY_DAC is not set
CONFIG_DEFAULT_SECURITY="tomoyo"
CONFIG_CRYPTO=y

#
# Crypto core or helper
#
CONFIG_CRYPTO_ALGAPI=m
CONFIG_CRYPTO_ALGAPI2=m
CONFIG_CRYPTO_AEAD=m
CONFIG_CRYPTO_AEAD2=m
CONFIG_CRYPTO_BLKCIPHER=m
CONFIG_CRYPTO_BLKCIPHER2=m
CONFIG_CRYPTO_HASH=m
CONFIG_CRYPTO_HASH2=m
CONFIG_CRYPTO_RNG2=m
CONFIG_CRYPTO_PCOMP=m
CONFIG_CRYPTO_PCOMP2=m
CONFIG_CRYPTO_MANAGER=m
CONFIG_CRYPTO_MANAGER2=m
CONFIG_CRYPTO_MANAGER_DISABLE_TESTS=y
CONFIG_CRYPTO_GF128MUL=m
CONFIG_CRYPTO_NULL=m
CONFIG_CRYPTO_WORKQUEUE=m
# CONFIG_CRYPTO_CRYPTD is not set
CONFIG_CRYPTO_AUTHENC=m
CONFIG_CRYPTO_TEST=m

#
# Authenticated Encryption with Associated Data
#
# CONFIG_CRYPTO_CCM is not set
# CONFIG_CRYPTO_GCM is not set
# CONFIG_CRYPTO_SEQIV is not set

#
# Block modes
#
CONFIG_CRYPTO_CBC=m
# CONFIG_CRYPTO_CTR is not set
CONFIG_CRYPTO_CTS=m
# CONFIG_CRYPTO_ECB is not set
# CONFIG_CRYPTO_PCBC is not set

#
# Hash modes
#
CONFIG_CRYPTO_HMAC=m

#
# Digest
#
CONFIG_CRYPTO_CRC32C=m
CONFIG_CRYPTO_CRC32C_INTEL=m
CONFIG_CRYPTO_GHASH=m
CONFIG_CRYPTO_MD4=m
CONFIG_CRYPTO_MD5=m
CONFIG_CRYPTO_MICHAEL_MIC=m
# CONFIG_CRYPTO_RMD128 is not set
CONFIG_CRYPTO_RMD160=m
CONFIG_CRYPTO_RMD256=m
# CONFIG_CRYPTO_RMD320 is not set
CONFIG_CRYPTO_SHA1=m
CONFIG_CRYPTO_SHA256=m
CONFIG_CRYPTO_SHA512=m
# CONFIG_CRYPTO_TGR192 is not set
CONFIG_CRYPTO_WP512=m

#
# Ciphers
#
CONFIG_CRYPTO_AES=m
CONFIG_CRYPTO_AES_586=m
CONFIG_CRYPTO_ANUBIS=m
CONFIG_CRYPTO_ARC4=m
CONFIG_CRYPTO_BLOWFISH=m
CONFIG_CRYPTO_CAMELLIA=m
CONFIG_CRYPTO_CAST5=m
CONFIG_CRYPTO_CAST6=m
CONFIG_CRYPTO_DES=m
# CONFIG_CRYPTO_FCRYPT is not set
# CONFIG_CRYPTO_KHAZAD is not set
# CONFIG_CRYPTO_SEED is not set
CONFIG_CRYPTO_SERPENT=m
CONFIG_CRYPTO_TEA=m
CONFIG_CRYPTO_TWOFISH=m
CONFIG_CRYPTO_TWOFISH_COMMON=m
CONFIG_CRYPTO_TWOFISH_586=m

#
# Compression
#
CONFIG_CRYPTO_DEFLATE=m
CONFIG_CRYPTO_ZLIB=m
CONFIG_CRYPTO_LZO=m

#
# Random Number Generation
#
# CONFIG_CRYPTO_ANSI_CPRNG is not set
# CONFIG_CRYPTO_HW is not set
CONFIG_HAVE_KVM=y
CONFIG_VIRTUALIZATION=y
CONFIG_VIRTIO=m
CONFIG_VIRTIO_RING=m
CONFIG_VIRTIO_BALLOON=m
CONFIG_BINARY_PRINTF=y

#
# Library routines
#
CONFIG_BITREVERSE=m
CONFIG_GENERIC_FIND_FIRST_BIT=y
CONFIG_GENERIC_FIND_NEXT_BIT=y
CONFIG_GENERIC_FIND_LAST_BIT=y
CONFIG_CRC_CCITT=m
CONFIG_CRC16=m
CONFIG_CRC_T10DIF=m
CONFIG_CRC_ITU_T=m
CONFIG_CRC32=m
CONFIG_CRC7=m
CONFIG_LIBCRC32C=m
CONFIG_AUDIT_GENERIC=y
CONFIG_ZLIB_INFLATE=m
CONFIG_ZLIB_DEFLATE=m
CONFIG_LZO_COMPRESS=m
CONFIG_LZO_DECOMPRESS=m
CONFIG_TEXTSEARCH=y
CONFIG_TEXTSEARCH_KMP=m
CONFIG_TEXTSEARCH_BM=m
CONFIG_TEXTSEARCH_FSM=m
CONFIG_HAS_IOMEM=y
CONFIG_HAS_IOPORT=y
CONFIG_HAS_DMA=y
CONFIG_NLATTR=y
CONFIG_FORCE_SUCCESSFUL_BUILD=y
CONFIG_X86_32_ALWAYS_ON=y

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-11-18  9:36               ` Ingo Molnar
@ 2010-11-18  9:44                 ` Jean Pihet
  2010-11-18 10:52                 ` Ingo Molnar
  1 sibling, 0 replies; 157+ messages in thread
From: Jean Pihet @ 2010-11-18  9:44 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Thomas Renninger, rjw, linux-kernel, arjan

On Thu, Nov 18, 2010 at 10:36 AM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Thomas Renninger <trenn@suse.de> wrote:
>
>> On Thursday 18 November 2010 09:01:32 Ingo Molnar wrote:
>> ...
>> > > @Ingo: If this does not go into x86/tip, but perf or whatever tree, it would
>> > > be great if you can ping me as soon as this stuff is in.
>> >
>> > Mind sending the latest version which has been adjusted/fixed and all acks added?
>> Done.
>> This time with lkml excluded as there were only cleanups/fixes
>> due to a messed merge.
>
> Please do not exclude lkml from such iterations of patches in the future - every
> modification to patches is relevant - often pure resends get resent to lkml as well.

Ok for me!

Acked-by: Jean Pihet <j-pihet@ti.com>

Note the ti.com email address to be used for Sign-offs and Acks.

Thanks,
Jean

>
> Thanks,
>
>        Ingo
>

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-11-18  9:27             ` Thomas Renninger
@ 2010-11-18  9:36               ` Ingo Molnar
  2010-11-18  9:44                 ` Jean Pihet
  2010-11-18 10:52                 ` Ingo Molnar
  0 siblings, 2 replies; 157+ messages in thread
From: Ingo Molnar @ 2010-11-18  9:36 UTC (permalink / raw)
  To: Thomas Renninger; +Cc: Jean Pihet, rjw, linux-kernel, arjan


* Thomas Renninger <trenn@suse.de> wrote:

> On Thursday 18 November 2010 09:01:32 Ingo Molnar wrote:
> ...
> > > @Ingo: If this does not go into x86/tip, but perf or whatever tree, it would
> > > be great if you can ping me as soon as this stuff is in.
> > 
> > Mind sending the latest version which has been adjusted/fixed and all acks added?
> Done.
> This time with lkml excluded as there were only cleanups/fixes
> due to a messed merge.

Please do not exclude lkml from such iterations of patches in the future - every 
modification to patches is relevant - often pure resends get resent to lkml as well.

Thanks,

	Ingo

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-11-18  8:01           ` Ingo Molnar
@ 2010-11-18  9:27             ` Thomas Renninger
  2010-11-18  9:36               ` Ingo Molnar
  0 siblings, 1 reply; 157+ messages in thread
From: Thomas Renninger @ 2010-11-18  9:27 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Jean Pihet, rjw, linux-kernel, arjan

On Thursday 18 November 2010 09:01:32 Ingo Molnar wrote:
...
> > @Ingo: If this does not go into x86/tip, but perf or whatever tree, it would
> > be great if you can ping me as soon as this stuff is in.
> 
> Mind sending the latest version which has been adjusted/fixed and all acks added?
Done.
This time with lkml excluded as there were only cleanups/fixes
due to a messed merge.

Thanks,

  Thomas

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-11-14 13:34         ` Thomas Renninger
@ 2010-11-18  8:01           ` Ingo Molnar
  2010-11-18  9:27             ` Thomas Renninger
  0 siblings, 1 reply; 157+ messages in thread
From: Ingo Molnar @ 2010-11-18  8:01 UTC (permalink / raw)
  To: Thomas Renninger; +Cc: Jean Pihet, rjw, linux-kernel, arjan


* Thomas Renninger <trenn@suse.de> wrote:

> On Friday 12 November 2010 03:50:21 pm Jean Pihet wrote:
> > On Fri, Nov 12, 2010 at 7:17 PM, Thomas Renninger <trenn@suse.de> wrote:
> ...
> > >> > +
> > >> > +#ifdef CONFIG_EVENT_POWER_TRACING_DEPRECATED
> > >> > +
> > >> >  #ifndef _TRACE_POWER_ENUM_
> > >> >  #define _TRACE_POWER_ENUM_
> > >> >  enum {
> > >> > @@ -153,8 +214,32 @@ DEFINE_EVENT(power_domain, power_domain_target,
> > >> >
> > >> >        TP_ARGS(name, state, cpu_id)
> > >> >  );
> > >> > -
> > >> > +#endif /* CONFIG_EVENT_POWER_TRACING_DEPRECATED */
> > >> The clock and power_domain events have been recently introduced and so
> > >> must be part of the new API. Can this #endif be moved right after the
> > >> definition of power_end?
> > > Oops, I pulled again meanwhile and the patches still patched without fuzz,
> > > but probably with some offset.
> > > I'll look at that and resend this one.
> > Ok
> Thanks for pointing this out. Because only pre-processor conditionals had
> been moved around, my test build after pulling still succeeded even though
> the #ifdefs/#endifs were rather messed up.
> 
> I adjusted these parts and successfully test-built quite a lot of .config
> flavors on i386, x86_64, different ppc, ia64 and s390.
> 
> > >> A string is needed here. Without it it is impossible to have the option
> > >> unset.
> > >> This does the trick: +bool "Deprecated power event trace API, to be
> > >> removed" 
> Adjusted, thanks.
> 
> > > I am currently rebuilding on several archs/flavors and hope to be able
> > > to re-send this one today or on Tue.
> Done.
> 
> @Ingo: If this does not go into x86/tip, but perf or whatever tree, it would
> be great if you can ping me as soon as this stuff is in.

Mind sending the latest version which has been adjusted/fixed and all acks added?

Thanks,

	Ingo

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-11-14 13:22   ` Thomas Renninger
@ 2010-11-15 15:49     ` Jean Pihet
  0 siblings, 0 replies; 157+ messages in thread
From: Jean Pihet @ 2010-11-15 15:49 UTC (permalink / raw)
  To: Thomas Renninger; +Cc: mingo, rjw, linux-kernel, arjan

Acked-by: Jean Pihet <j-pihet@ti.com>

On Sun, Nov 14, 2010 at 2:22 PM, Thomas Renninger <trenn@suse.de> wrote:
> PERF(kernel): Cleanup power events
>
> Recent changes:
>  - Fix pre-processor conditionals which got messed up silently by a recent merge/pull
>  - Add a comment to EVENT_POWER_TRACING_DEPRECATED .config option
>
> New power trace events:
> power:cpu_idle
> power:cpu_frequency
> power:machine_suspend
>
>
> C-state/idle accounting events:
>  power:power_start
>  power:power_end
> are replaced with:
>  power:cpu_idle
>
> and
>  power:power_frequency
> is replaced with:
>  power:cpu_frequency
>
> power:machine_suspend
> is newly introduced.
> Jean Pihet has a patch integrated into the generic layer
> (kernel/power/suspend.c) which will make use of it.
>
> The type= field got removed from both; it was never
> used and the type is distinguished by the event type itself.
>
> perf timechart
> userspace tool gets adjusted in a separate patch.
>
> Signed-off-by: Thomas Renninger <trenn@suse.de>
> Acked-by: Arjan van de Ven <arjan@linux.intel.com>
> Acked-by: Jean Pihet <jean.pihet@newoldbits.com>
> CC: Arjan van de Ven <arjan@linux.intel.com>
> CC: Ingo Molnar <mingo@elte.hu>
> CC: rjw@sisk.pl
> CC: linux-kernel@vger.kernel.org
>
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index 57d1868..155d975 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -374,6 +374,7 @@ void default_idle(void)
>  {
>        if (hlt_use_halt()) {
>                trace_power_start(POWER_CSTATE, 1, smp_processor_id());
> +               trace_cpu_idle(1, smp_processor_id());
>                current_thread_info()->status &= ~TS_POLLING;
>                /*
>                 * TS_POLLING-cleared state must be visible before we
> @@ -444,6 +445,7 @@ EXPORT_SYMBOL_GPL(cpu_idle_wait);
>  void mwait_idle_with_hints(unsigned long ax, unsigned long cx)
>  {
>        trace_power_start(POWER_CSTATE, (ax>>4)+1, smp_processor_id());
> +       trace_cpu_idle((ax>>4)+1, smp_processor_id());
>        if (!need_resched()) {
>                if (cpu_has(&current_cpu_data, X86_FEATURE_CLFLUSH_MONITOR))
>                        clflush((void *)&current_thread_info()->flags);
> @@ -460,6 +462,7 @@ static void mwait_idle(void)
>  {
>        if (!need_resched()) {
>                trace_power_start(POWER_CSTATE, 1, smp_processor_id());
> +               trace_cpu_idle(1, smp_processor_id());
>                if (cpu_has(&current_cpu_data, X86_FEATURE_CLFLUSH_MONITOR))
>                        clflush((void *)&current_thread_info()->flags);
>
> @@ -481,10 +484,12 @@ static void mwait_idle(void)
>  static void poll_idle(void)
>  {
>        trace_power_start(POWER_CSTATE, 0, smp_processor_id());
> +       trace_cpu_idle(0, smp_processor_id());
>        local_irq_enable();
>        while (!need_resched())
>                cpu_relax();
> -       trace_power_end(0);
> +       trace_power_end(smp_processor_id());
> +       trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
>  }
>
>  /*
> diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
> index 96586c3..4b9befa 100644
> --- a/arch/x86/kernel/process_32.c
> +++ b/arch/x86/kernel/process_32.c
> @@ -113,8 +113,8 @@ void cpu_idle(void)
>                        stop_critical_timings();
>                        pm_idle();
>                        start_critical_timings();
> -
>                        trace_power_end(smp_processor_id());
> +                       trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
>                }
>                tick_nohz_restart_sched_tick();
>                preempt_enable_no_resched();
> diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
> index b3d7a3a..4c818a7 100644
> --- a/arch/x86/kernel/process_64.c
> +++ b/arch/x86/kernel/process_64.c
> @@ -142,6 +142,8 @@ void cpu_idle(void)
>                        start_critical_timings();
>
>                        trace_power_end(smp_processor_id());
> +                       trace_cpu_idle(PWR_EVENT_EXIT,
> +                                      smp_processor_id());
>
>                        /* In many cases the interrupt that ended idle
>                           has already called exit_idle. But some idle
> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> index c63a438..1109f68 100644
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -355,6 +355,7 @@ void cpufreq_notify_transition(struct cpufreq_freqs *freqs, unsigned int state)
>                dprintk("FREQ: %lu - CPU: %lu", (unsigned long)freqs->new,
>                        (unsigned long)freqs->cpu);
>                trace_power_frequency(POWER_PSTATE, freqs->new, freqs->cpu);
> +               trace_cpu_frequency(freqs->new, freqs->cpu);
>                srcu_notifier_call_chain(&cpufreq_transition_notifier_list,
>                                CPUFREQ_POSTCHANGE, freqs);
>                if (likely(policy) && likely(policy->cpu == freqs->cpu))
> diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
> index a507108..08d5f05 100644
> --- a/drivers/cpuidle/cpuidle.c
> +++ b/drivers/cpuidle/cpuidle.c
> @@ -107,6 +107,7 @@ static void cpuidle_idle_call(void)
>        if (cpuidle_curr_governor->reflect)
>                cpuidle_curr_governor->reflect(dev);
>        trace_power_end(smp_processor_id());
> +       trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
>  }
>
>  /**
> diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
> index 3c95325..ba5134f 100644
> --- a/drivers/idle/intel_idle.c
> +++ b/drivers/idle/intel_idle.c
> @@ -221,6 +221,7 @@ static int intel_idle(struct cpuidle_device *dev, struct cpuidle_state *state)
>
>        stop_critical_timings();
>        trace_power_start(POWER_CSTATE, (eax >> 4) + 1, cpu);
> +       trace_cpu_idle((eax >> 4) + 1, cpu);
>        if (!need_resched()) {
>
>                __monitor((void *)&current_thread_info()->flags, 0, 0);
> diff --git a/include/trace/events/power.h b/include/trace/events/power.h
> index 286784d..00d9819 100644
> --- a/include/trace/events/power.h
> +++ b/include/trace/events/power.h
> @@ -7,16 +7,67 @@
>  #include <linux/ktime.h>
>  #include <linux/tracepoint.h>
>
> -#ifndef _TRACE_POWER_ENUM_
> -#define _TRACE_POWER_ENUM_
> -enum {
> -       POWER_NONE      = 0,
> -       POWER_CSTATE    = 1,    /* C-State */
> -       POWER_PSTATE    = 2,    /* Fequency change or DVFS */
> -       POWER_SSTATE    = 3,    /* Suspend */
> -};
> +DECLARE_EVENT_CLASS(cpu,
> +
> +       TP_PROTO(unsigned int state, unsigned int cpu_id),
> +
> +       TP_ARGS(state, cpu_id),
> +
> +       TP_STRUCT__entry(
> +               __field(        u32,            state           )
> +               __field(        u32,            cpu_id          )
> +       ),
> +
> +       TP_fast_assign(
> +               __entry->state = state;
> +               __entry->cpu_id = cpu_id;
> +       ),
> +
> +       TP_printk("state=%lu cpu_id=%lu", (unsigned long)__entry->state,
> +                 (unsigned long)__entry->cpu_id)
> +);
> +
> +DEFINE_EVENT(cpu, cpu_idle,
> +
> +       TP_PROTO(unsigned int state, unsigned int cpu_id),
> +
> +       TP_ARGS(state, cpu_id)
> +);
> +
> +/* This file can get included multiple times, TRACE_HEADER_MULTI_READ at top */
> +#ifndef _PWR_EVENT_AVOID_DOUBLE_DEFINING
> +#define _PWR_EVENT_AVOID_DOUBLE_DEFINING
> +
> +#define PWR_EVENT_EXIT -1
>  #endif
>
> +DEFINE_EVENT(cpu, cpu_frequency,
> +
> +       TP_PROTO(unsigned int frequency, unsigned int cpu_id),
> +
> +       TP_ARGS(frequency, cpu_id)
> +);
> +
> +TRACE_EVENT(machine_suspend,
> +
> +       TP_PROTO(unsigned int state),
> +
> +       TP_ARGS(state),
> +
> +       TP_STRUCT__entry(
> +               __field(        u32,            state           )
> +       ),
> +
> +       TP_fast_assign(
> +               __entry->state = state;
> +       ),
> +
> +       TP_printk("state=%lu", (unsigned long)__entry->state)
> +);
> +
> +/* This code will be removed after deprecation time exceeded (2.6.41) */
> +#ifdef CONFIG_EVENT_POWER_TRACING_DEPRECATED
> +
>  /*
>  * The power events are used for cpuidle & suspend (power_start, power_end)
>  *  and for cpufreq (power_frequency)
> @@ -75,6 +126,24 @@ TRACE_EVENT(power_end,
>
>  );
>
> +/* Deprecated dummy functions must be protected against multi-declaration */
> +#ifndef _PWR_EVENT_AVOID_DOUBLE_DEFINING_DEPRECATED
> +#define _PWR_EVENT_AVOID_DOUBLE_DEFINING_DEPRECATED
> +
> +enum {
> +       POWER_NONE = 0,
> +       POWER_CSTATE = 1,
> +       POWER_PSTATE = 2,
> +};
> +#endif /* _PWR_EVENT_AVOID_DOUBLE_DEFINING_DEPRECATED */
> +#else
> +/* These dummy declarations have to be ripped out when the deprecated
> +   events get removed */
> +static inline void trace_power_start(u64 type, u64 state, u64 cpuid) {};
> +static inline void trace_power_end(u64 cpuid) {};
> +static inline void trace_power_frequency(u64 type, u64 state, u64 cpuid) {};
> +#endif /* CONFIG_EVENT_POWER_TRACING_DEPRECATED */
> +
>  /*
>  * The clock events are used for clock enable/disable and for
>  *  clock rate change
> @@ -153,7 +222,6 @@ DEFINE_EVENT(power_domain, power_domain_target,
>
>        TP_ARGS(name, state, cpu_id)
>  );
> -
>  #endif /* _TRACE_POWER_H */
>
>  /* This part must be outside protection */
> diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
> index e04b8bc..59b44a1 100644
> --- a/kernel/trace/Kconfig
> +++ b/kernel/trace/Kconfig
> @@ -69,6 +69,21 @@ config EVENT_TRACING
>        select CONTEXT_SWITCH_TRACER
>        bool
>
> +config EVENT_POWER_TRACING_DEPRECATED
> +       depends on EVENT_TRACING
> +       bool "Deprecated power event trace API, to be removed"
> +       default y
> +       help
> +         Provides old power event types:
> +         C-state/idle accounting events:
> +         power:power_start
> +         power:power_end
> +         and old cpufreq accounting event:
> +         power:power_frequency
> +         This is for userspace compatibility
> +         and will vanish after 5 kernel iterations,
> +         namely 2.6.41.
> +
>  config CONTEXT_SWITCH_TRACER
>        bool
>
> diff --git a/kernel/trace/power-traces.c b/kernel/trace/power-traces.c
> index 0e0497d..f55fcf6 100644
> --- a/kernel/trace/power-traces.c
> +++ b/kernel/trace/power-traces.c
> @@ -13,5 +13,8 @@
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/power.h>
>
> +#ifdef CONFIG_EVENT_POWER_TRACING_DEPRECATED
>  EXPORT_TRACEPOINT_SYMBOL_GPL(power_start);
> +#endif
> +EXPORT_TRACEPOINT_SYMBOL_GPL(cpu_idle);
>
>
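
One detail of the patch above that userspace consumers may want to keep in
mind: PWR_EVENT_EXIT is defined as -1, while the cpu event class stores
state in a u32 field and prints it with %lu. The following is a minimal
userspace sketch of what that implies, assuming a 32-bit unsigned int. The
helpers stored_state(), printed_state(), is_idle_exit() and
demo_exit_marker() are hypothetical names used only for illustration; they
are not part of the patch or of any kernel API.

```c
#include <assert.h>
#include <stdint.h>

#define PWR_EVENT_EXIT -1	/* value as defined in the patch */

/* Mimics the u32 'state' field of the cpu event class (hypothetical helper). */
uint32_t stored_state(unsigned int state)
{
	return (uint32_t)state;
}

/* Mimics the (unsigned long) cast that TP_printk applies to the field. */
unsigned long printed_state(uint32_t stored)
{
	return (unsigned long)stored;
}

/* Hypothetical consumer-side check: does this record mark an idle exit? */
int is_idle_exit(uint32_t stored)
{
	return stored == (uint32_t)PWR_EVENT_EXIT;
}

/* Usage: the exit marker shows up as the largest 32-bit value. */
void demo_exit_marker(void)
{
	uint32_t s = stored_state(PWR_EVENT_EXIT);

	assert(printed_state(s) == 4294967295UL);
	assert(is_idle_exit(s));
	assert(!is_idle_exit(stored_state(1)));	/* state 1 is an idle entry */
}
```

Under these assumptions a parser would compare the state value against
4294967295 (or cast it back to a signed 32-bit integer) to recognise an
idle exit, rather than looking for a literal -1 in the trace output.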

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-11-12 21:50       ` Jean Pihet
@ 2010-11-14 13:34         ` Thomas Renninger
  2010-11-18  8:01           ` Ingo Molnar
  0 siblings, 1 reply; 157+ messages in thread
From: Thomas Renninger @ 2010-11-14 13:34 UTC (permalink / raw)
  To: Jean Pihet; +Cc: mingo, rjw, linux-kernel, arjan

On Friday 12 November 2010 03:50:21 pm Jean Pihet wrote:
> On Fri, Nov 12, 2010 at 7:17 PM, Thomas Renninger <trenn@suse.de> wrote:
...
> >> > +
> >> > +#ifdef CONFIG_EVENT_POWER_TRACING_DEPRECATED
> >> > +
> >> >  #ifndef _TRACE_POWER_ENUM_
> >> >  #define _TRACE_POWER_ENUM_
> >> >  enum {
> >> > @@ -153,8 +214,32 @@ DEFINE_EVENT(power_domain, power_domain_target,
> >> >
> >> >        TP_ARGS(name, state, cpu_id)
> >> >  );
> >> > -
> >> > +#endif /* CONFIG_EVENT_POWER_TRACING_DEPRECATED */
> >> The clock and power_domain events have been recently introduced and so
> >> must be part of the new API. Can this #endif be moved right after the
> >> definition of power_end?
> > Oops, I pulled again meanwhile and the patches still patched without fuzz,
> > but probably with some offset.
> > I'll look at that and resend this one.
> Ok
Thanks for pointing this out. Because only pre-processor conditionals had
been moved around, my test build after pulling still succeeded even though
the #ifdefs/#endifs were rather messed up.

I adjusted these parts and successfully test-built a large number of .config
flavors on i386, x86_64, several ppc variants, ia64 and s390.
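
As a rough illustration of why a single successful build proves little here:
each .config flavor compiles only one branch of a conditional, so both
flavors have to be built. The sketch below is plain user-space C, not the
kernel header. All demo_* names and DEMO_POWER_TRACING_DEPRECATED are made
up for this example; only the idea of empty static inline stubs for the
deprecated events is taken from the patch.

```c
/* DEMO_POWER_TRACING_DEPRECATED is a made-up stand-in for the Kconfig
 * symbol; build with and without -DDEMO_POWER_TRACING_DEPRECATED. */
#ifdef DEMO_POWER_TRACING_DEPRECATED
static unsigned int demo_power_end_calls;

/* "Real" deprecated event: here it only counts invocations. */
static inline void demo_trace_power_end(unsigned int cpu_id)
{
	(void)cpu_id;
	demo_power_end_calls++;
}
#else
/* Empty stub so that existing call sites keep compiling. */
static inline void demo_trace_power_end(unsigned int cpu_id)
{
	(void)cpu_id;
}
#endif

static unsigned int demo_cpu_idle_calls;

/* Stand-in for the new event, which is always available. */
static inline void demo_trace_cpu_idle(unsigned int state, unsigned int cpu_id)
{
	(void)state;
	(void)cpu_id;
	demo_cpu_idle_calls++;
}

/* Call site: identical source for both build flavors. */
unsigned int demo_idle_exit(unsigned int cpu_id)
{
	demo_trace_power_end(cpu_id);
	demo_trace_cpu_idle((unsigned int)-1, cpu_id);
	return demo_cpu_idle_calls;
}
```

The call site never changes; only the definition behind the deprecated call
does, which is why a misplaced #endif can go unnoticed until the other
flavor gets built.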

> >> A string is needed here. Without it it is impossible to have the option
> >> unset.
> >> This does the trick: +bool "Deprecated power event trace API, to be
> >> removed" 
Adjusted, thanks.

> > I am currently rebuilding on several archs/flavors and hope to be able
> > to re-send this one today or on Tue.
Done.

@Ingo: If this does not go into x86/tip but into perf or some other tree, it
would be great if you could ping me as soon as this stuff is in.
I want to clean up the "double cpu_idle events" issue on top and make this
more architecture independent (throw cpu_idle events from the cpuidle framework
instead of throwing very x86-specific mwait states, etc.).

Thanks,

       Thomas


* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-11-11 18:03 ` [PATCH 2/3] PERF(kernel): Cleanup power events Thomas Renninger
  2010-11-12 14:20   ` Jean Pihet
@ 2010-11-14 13:22   ` Thomas Renninger
  2010-11-15 15:49     ` Jean Pihet
  1 sibling, 1 reply; 157+ messages in thread
From: Thomas Renninger @ 2010-11-14 13:22 UTC (permalink / raw)
  To: mingo; +Cc: rjw, linux-kernel, arjan, jean.pihet

PERF(kernel): Cleanup power events

Recent changes:
  - Fix pre-processor conditionals which got messed up silently by a recent merge/pull
  - Add a comment to EVENT_POWER_TRACING_DEPRECATED .config option

New power trace events:
power:cpu_idle
power:cpu_frequency
power:machine_suspend


C-state/idle accounting events:
  power:power_start
  power:power_end
are replaced with:
  power:cpu_idle

and
  power:power_frequency
is replaced with:
  power:cpu_frequency

power:machine_suspend
is newly introduced.
Jean Pihet has a patch integrated into the generic layer
(kernel/power/suspend.c) which will make use of it.

The type= field was removed from both; it was never
used, and the type is distinguished by the event type itself.
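
A consumer-side sketch of how the single power:cpu_idle event can replace
the power_start/power_end pair: any state other than PWR_EVENT_EXIT marks
idle entry, and PWR_EVENT_EXIT marks idle exit. This is plain user-space C
under stated assumptions. struct cpu_idle_sample, its ts_ns timestamp and
idle_time_ns() are hypothetical; only the state and cpu_id fields and
PWR_EVENT_EXIT come from the patch.

```c
#include <stdint.h>

#define PWR_EVENT_EXIT -1

/* Hypothetical decoded record: state and cpu_id follow the patch,
 * the timestamp field is an assumption made for this example. */
struct cpu_idle_sample {
	uint64_t ts_ns;
	uint32_t state;
	uint32_t cpu_id;
};

/* Sum the time one CPU spent idle. An entry is any sample whose state
 * is not PWR_EVENT_EXIT; the next PWR_EVENT_EXIT on that CPU closes it.
 * Repeated entries without an exit in between keep the first timestamp. */
uint64_t idle_time_ns(const struct cpu_idle_sample *s, int n, uint32_t cpu)
{
	uint64_t total = 0, entered = 0;
	int idle = 0;
	int i;

	for (i = 0; i < n; i++) {
		if (s[i].cpu_id != cpu)
			continue;
		/* The field is a u32, so compare against (u32)-1. */
		if (s[i].state == (uint32_t)PWR_EVENT_EXIT) {
			if (idle)
				total += s[i].ts_ns - entered;
			idle = 0;
		} else if (!idle) {
			entered = s[i].ts_ns;
			idle = 1;
		}
	}
	return total;
}
```

No type= argument is needed anywhere: the event name already says that
these are idle samples, and the state value alone separates entry from exit.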

perf timechart
userspace tool gets adjusted in a separate patch.

Signed-off-by: Thomas Renninger <trenn@suse.de>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Jean Pihet <jean.pihet@newoldbits.com>
CC: Arjan van de Ven <arjan@linux.intel.com>
CC: Ingo Molnar <mingo@elte.hu>
CC: rjw@sisk.pl
CC: linux-kernel@vger.kernel.org

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 57d1868..155d975 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -374,6 +374,7 @@ void default_idle(void)
 {
 	if (hlt_use_halt()) {
 		trace_power_start(POWER_CSTATE, 1, smp_processor_id());
+		trace_cpu_idle(1, smp_processor_id());
 		current_thread_info()->status &= ~TS_POLLING;
 		/*
 		 * TS_POLLING-cleared state must be visible before we
@@ -444,6 +445,7 @@ EXPORT_SYMBOL_GPL(cpu_idle_wait);
 void mwait_idle_with_hints(unsigned long ax, unsigned long cx)
 {
 	trace_power_start(POWER_CSTATE, (ax>>4)+1, smp_processor_id());
+	trace_cpu_idle((ax>>4)+1, smp_processor_id());
 	if (!need_resched()) {
 		if (cpu_has(&current_cpu_data, X86_FEATURE_CLFLUSH_MONITOR))
 			clflush((void *)&current_thread_info()->flags);
@@ -460,6 +462,7 @@ static void mwait_idle(void)
 {
 	if (!need_resched()) {
 		trace_power_start(POWER_CSTATE, 1, smp_processor_id());
+		trace_cpu_idle(1, smp_processor_id());
 		if (cpu_has(&current_cpu_data, X86_FEATURE_CLFLUSH_MONITOR))
 			clflush((void *)&current_thread_info()->flags);
 
@@ -481,10 +484,12 @@ static void mwait_idle(void)
 static void poll_idle(void)
 {
 	trace_power_start(POWER_CSTATE, 0, smp_processor_id());
+	trace_cpu_idle(0, smp_processor_id());
 	local_irq_enable();
 	while (!need_resched())
 		cpu_relax();
-	trace_power_end(0);
+	trace_power_end(smp_processor_id());
+	trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
 }
 
 /*
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 96586c3..4b9befa 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -113,8 +113,8 @@ void cpu_idle(void)
 			stop_critical_timings();
 			pm_idle();
 			start_critical_timings();
-
 			trace_power_end(smp_processor_id());
+			trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
 		}
 		tick_nohz_restart_sched_tick();
 		preempt_enable_no_resched();
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index b3d7a3a..4c818a7 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -142,6 +142,8 @@ void cpu_idle(void)
 			start_critical_timings();
 
 			trace_power_end(smp_processor_id());
+			trace_cpu_idle(PWR_EVENT_EXIT,
+				       smp_processor_id());
 
 			/* In many cases the interrupt that ended idle
 			   has already called exit_idle. But some idle
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index c63a438..1109f68 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -355,6 +355,7 @@ void cpufreq_notify_transition(struct cpufreq_freqs *freqs, unsigned int state)
 		dprintk("FREQ: %lu - CPU: %lu", (unsigned long)freqs->new,
 			(unsigned long)freqs->cpu);
 		trace_power_frequency(POWER_PSTATE, freqs->new, freqs->cpu);
+		trace_cpu_frequency(freqs->new, freqs->cpu);
 		srcu_notifier_call_chain(&cpufreq_transition_notifier_list,
 				CPUFREQ_POSTCHANGE, freqs);
 		if (likely(policy) && likely(policy->cpu == freqs->cpu))
diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index a507108..08d5f05 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -107,6 +107,7 @@ static void cpuidle_idle_call(void)
 	if (cpuidle_curr_governor->reflect)
 		cpuidle_curr_governor->reflect(dev);
 	trace_power_end(smp_processor_id());
+	trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
 }
 
 /**
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index 3c95325..ba5134f 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -221,6 +221,7 @@ static int intel_idle(struct cpuidle_device *dev, struct cpuidle_state *state)
 
 	stop_critical_timings();
 	trace_power_start(POWER_CSTATE, (eax >> 4) + 1, cpu);
+	trace_cpu_idle((eax >> 4) + 1, cpu);
 	if (!need_resched()) {
 
 		__monitor((void *)&current_thread_info()->flags, 0, 0);
diff --git a/include/trace/events/power.h b/include/trace/events/power.h
index 286784d..00d9819 100644
--- a/include/trace/events/power.h
+++ b/include/trace/events/power.h
@@ -7,16 +7,67 @@
 #include <linux/ktime.h>
 #include <linux/tracepoint.h>
 
-#ifndef _TRACE_POWER_ENUM_
-#define _TRACE_POWER_ENUM_
-enum {
-	POWER_NONE	= 0,
-	POWER_CSTATE	= 1,	/* C-State */
-	POWER_PSTATE	= 2,	/* Fequency change or DVFS */
-	POWER_SSTATE	= 3,	/* Suspend */
-};
+DECLARE_EVENT_CLASS(cpu,
+
+	TP_PROTO(unsigned int state, unsigned int cpu_id),
+
+	TP_ARGS(state, cpu_id),
+
+	TP_STRUCT__entry(
+		__field(	u32,		state		)
+		__field(	u32,		cpu_id		)
+	),
+
+	TP_fast_assign(
+		__entry->state = state;
+		__entry->cpu_id = cpu_id;
+	),
+
+	TP_printk("state=%lu cpu_id=%lu", (unsigned long)__entry->state,
+		  (unsigned long)__entry->cpu_id)
+);
+
+DEFINE_EVENT(cpu, cpu_idle,
+
+	TP_PROTO(unsigned int state, unsigned int cpu_id),
+
+	TP_ARGS(state, cpu_id)
+);
+
+/* This file can get included multiple times, TRACE_HEADER_MULTI_READ at top */
+#ifndef _PWR_EVENT_AVOID_DOUBLE_DEFINING
+#define _PWR_EVENT_AVOID_DOUBLE_DEFINING
+
+#define PWR_EVENT_EXIT -1
 #endif
 
+DEFINE_EVENT(cpu, cpu_frequency,
+
+	TP_PROTO(unsigned int frequency, unsigned int cpu_id),
+
+	TP_ARGS(frequency, cpu_id)
+);
+
+TRACE_EVENT(machine_suspend,
+
+	TP_PROTO(unsigned int state),
+
+	TP_ARGS(state),
+
+	TP_STRUCT__entry(
+		__field(	u32,		state		)
+	),
+
+	TP_fast_assign(
+		__entry->state = state;
+	),
+
+	TP_printk("state=%lu", (unsigned long)__entry->state)
+);
+
+/* This code will be removed once the deprecation period has expired (2.6.41) */
+#ifdef CONFIG_EVENT_POWER_TRACING_DEPRECATED
+
 /*
  * The power events are used for cpuidle & suspend (power_start, power_end)
  *  and for cpufreq (power_frequency)
@@ -75,6 +126,24 @@ TRACE_EVENT(power_end,
 
 );
 
+/* Deprecated dummy functions must be protected against multi-declaration */
+#ifndef _PWR_EVENT_AVOID_DOUBLE_DEFINING_DEPRECATED
+#define _PWR_EVENT_AVOID_DOUBLE_DEFINING_DEPRECATED
+
+enum {
+	POWER_NONE = 0,
+	POWER_CSTATE = 1,
+	POWER_PSTATE = 2,
+};
+#endif /* _PWR_EVENT_AVOID_DOUBLE_DEFINING_DEPRECATED */
+#else
+/* These dummy declarations have to be ripped out when the deprecated
+   events get removed */
+static inline void trace_power_start(u64 type, u64 state, u64 cpuid) {}
+static inline void trace_power_end(u64 cpuid) {}
+static inline void trace_power_frequency(u64 type, u64 state, u64 cpuid) {}
+#endif /* CONFIG_EVENT_POWER_TRACING_DEPRECATED */
+
 /*
  * The clock events are used for clock enable/disable and for
  *  clock rate change
@@ -153,7 +222,6 @@ DEFINE_EVENT(power_domain, power_domain_target,
 
 	TP_ARGS(name, state, cpu_id)
 );
-
 #endif /* _TRACE_POWER_H */
 
 /* This part must be outside protection */
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index e04b8bc..59b44a1 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -69,6 +69,21 @@ config EVENT_TRACING
 	select CONTEXT_SWITCH_TRACER
 	bool
 
+config EVENT_POWER_TRACING_DEPRECATED
+	depends on EVENT_TRACING
+	bool "Deprecated power event trace API, to be removed"
+	default y
+	help
+	  Provides old power event types:
+	  C-state/idle accounting events:
+	  power:power_start
+	  power:power_end
+	  and old cpufreq accounting event:
+	  power:power_frequency
+	  This is for userspace compatibility
+	  and will vanish after 5 kernel iterations,
+	  namely 2.6.41.
+
 config CONTEXT_SWITCH_TRACER
 	bool
 
diff --git a/kernel/trace/power-traces.c b/kernel/trace/power-traces.c
index 0e0497d..f55fcf6 100644
--- a/kernel/trace/power-traces.c
+++ b/kernel/trace/power-traces.c
@@ -13,5 +13,8 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/power.h>
 
+#ifdef CONFIG_EVENT_POWER_TRACING_DEPRECATED
 EXPORT_TRACEPOINT_SYMBOL_GPL(power_start);
+#endif
+EXPORT_TRACEPOINT_SYMBOL_GPL(cpu_idle);
 


* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-11-12 18:17     ` Thomas Renninger
@ 2010-11-12 21:50       ` Jean Pihet
  2010-11-14 13:34         ` Thomas Renninger
  0 siblings, 1 reply; 157+ messages in thread
From: Jean Pihet @ 2010-11-12 21:50 UTC (permalink / raw)
  To: Thomas Renninger; +Cc: mingo, rjw, linux-kernel, arjan

On Fri, Nov 12, 2010 at 7:17 PM, Thomas Renninger <trenn@suse.de> wrote:
> On Friday 12 November 2010 08:20:47 am Jean Pihet wrote:
>> Thomas,
> ...
>> > +
>> > +       TP_printk("state=%lu cpu_id=%lu", (unsigned long)__entry->state,
>> > +                 (unsigned long)__entry->cpu_id)
>> Using %lu for the state field causes PWR_EVENT_EXIT to appear as
>> 4294967295 instead of -1. Can the field be of a signed type?
> This is intended, what exactly is the problem?
There is no problem, I just wanted to warn about it. I am fine with it.

>
> ...
>> > +       TP_printk("state=%lu", (unsigned long)__entry->state)
>> Same remark about the unsigned type for the state field.
> Same.
>>
>> > +);
>> > +
>> > +#ifdef CONFIG_EVENT_POWER_TRACING_DEPRECATED
>> > +
>> >  #ifndef _TRACE_POWER_ENUM_
>> >  #define _TRACE_POWER_ENUM_
>> >  enum {
>> > @@ -153,8 +214,32 @@ DEFINE_EVENT(power_domain, power_domain_target,
>> >
>> >        TP_ARGS(name, state, cpu_id)
>> >  );
>> > -
>> > +#endif /* CONFIG_EVENT_POWER_TRACING_DEPRECATED */
>> The clock and power_domain events have been recently introduced and so
>> must be part of the new API. Can this #endif be moved right after the
>> definition of power_end?
> Oops, I pulled again meanwhile and the patches still patched without fuzz,
> but probably with some offset.
> I'll look at that and resend this one.
Ok

>
>> >  #endif /* _TRACE_POWER_H */
>> Should this be at the very end of the file?
> Not sure whether this also came from merge issues, but yes, several
> #ifdef conditions need to get corrected.
Ok

>
> ...
>
>> A string is needed here. Without it it is impossible to have the option
>> unset.
>> This does the trick: +bool "Deprecated power event trace API, to be removed"
> Ok, thanks.
>
> I am currently rebuilding on several archs/flavors and hope to be able
> to re-send this one today or on Tue.
>
> Thanks,
>
>    Thomas
>
Thanks!

Jean


* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-11-12 14:20   ` Jean Pihet
@ 2010-11-12 18:17     ` Thomas Renninger
  2010-11-12 21:50       ` Jean Pihet
  0 siblings, 1 reply; 157+ messages in thread
From: Thomas Renninger @ 2010-11-12 18:17 UTC (permalink / raw)
  To: Jean Pihet; +Cc: mingo, rjw, linux-kernel, arjan

On Friday 12 November 2010 08:20:47 am Jean Pihet wrote:
> Thomas,
...
> > +
> > +       TP_printk("state=%lu cpu_id=%lu", (unsigned long)__entry->state,
> > +                 (unsigned long)__entry->cpu_id)
> Using %lu for the state field causes PWR_EVENT_EXIT to appear as
> 4294967295 instead of -1. Can the field be of a signed type?
This is intended, what exactly is the problem?
 
...
> > +       TP_printk("state=%lu", (unsigned long)__entry->state)
> Same remark about the unsigned type for the state field.
Same.
> 
> > +);
> > +
> > +#ifdef CONFIG_EVENT_POWER_TRACING_DEPRECATED
> > +
> >  #ifndef _TRACE_POWER_ENUM_
> >  #define _TRACE_POWER_ENUM_
> >  enum {
> > @@ -153,8 +214,32 @@ DEFINE_EVENT(power_domain, power_domain_target,
> >
> >        TP_ARGS(name, state, cpu_id)
> >  );
> > -
> > +#endif /* CONFIG_EVENT_POWER_TRACING_DEPRECATED */
> The clock and power_domain events have been recently introduced and so
> must be part of the new API. Can this #endif be moved right after the
> definition of power_end?
Oops, I pulled again in the meantime and the patches still applied without fuzz,
but probably with some offset.
I'll look at that and resend this one.

> >  #endif /* _TRACE_POWER_H */
> Should this be at the very end of the file?
Not sure whether this also came from merge issues, but yes, several
#ifdef conditions need to get corrected.

...

> A string is needed here. Without it it is impossible to have the option
> unset. 
> This does the trick: +bool "Deprecated power event trace API, to be removed"
Ok, thanks.

I am currently rebuilding on several archs/flavors and hope to be able
to re-send this one today or on Tue.

Thanks,

    Thomas


* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-11-11 18:03 ` [PATCH 2/3] PERF(kernel): Cleanup power events Thomas Renninger
@ 2010-11-12 14:20   ` Jean Pihet
  2010-11-12 18:17     ` Thomas Renninger
  2010-11-14 13:22   ` Thomas Renninger
  1 sibling, 1 reply; 157+ messages in thread
From: Jean Pihet @ 2010-11-12 14:20 UTC (permalink / raw)
  To: Thomas Renninger; +Cc: mingo, rjw, linux-kernel, arjan

Thomas,

Thanks for re-spinning the patches!

My comments are inline below.

On Thu, Nov 11, 2010 at 7:03 PM, Thomas Renninger <trenn@suse.de> wrote:
> Recent changes:
>  - Enable EVENT_POWER_TRACING_DEPRECATED by default
>
> New power trace events:
> power:cpu_idle
> power:cpu_frequency
> power:machine_suspend
>
>
> C-state/idle accounting events:
>  power:power_start
>  power:power_end
> are replaced with:
>  power:cpu_idle
>
> and
>  power:power_frequency
> is replaced with:
>  power:cpu_frequency
>
> power:machine_suspend
> is newly introduced.
> Jean Pihet has a patch integrated into the generic layer
> (kernel/power/suspend.c) which will make use of it.
>
> the type= field got removed from both, it was never
> used and the type is differed by the event type itself.
>
> perf timechart
> userspace tool gets adjusted in a separate patch.
>
> Signed-off-by: Thomas Renninger <trenn@suse.de>
> Acked-by: Arjan van de Ven <arjan@linux.intel.com>
> Acked-by: Jean Pihet <jean.pihet@newoldbits.com>
> CC: Arjan van de Ven <arjan@linux.intel.com>
> CC: Ingo Molnar <mingo@elte.hu>
> CC: rjw@sisk.pl
> CC: linux-kernel@vger.kernel.org
> ---
>  arch/x86/kernel/process.c    |    7 +++-
>  arch/x86/kernel/process_32.c |    2 +-
>  arch/x86/kernel/process_64.c |    2 +
>  drivers/cpufreq/cpufreq.c    |    1 +
>  drivers/cpuidle/cpuidle.c    |    1 +
>  drivers/idle/intel_idle.c    |    1 +
>  include/trace/events/power.h |   87 +++++++++++++++++++++++++++++++++++++++++-
>  kernel/trace/Kconfig         |   15 +++++++
>  kernel/trace/power-traces.c  |    3 +
>  9 files changed, 116 insertions(+), 3 deletions(-)
>
...
> diff --git a/include/trace/events/power.h b/include/trace/events/power.h
> index 286784d..ab26d8e 100644
> --- a/include/trace/events/power.h
> +++ b/include/trace/events/power.h
> @@ -7,6 +7,67 @@
>  #include <linux/ktime.h>
>  #include <linux/tracepoint.h>
>
> +DECLARE_EVENT_CLASS(cpu,
> +
> +       TP_PROTO(unsigned int state, unsigned int cpu_id),
> +
> +       TP_ARGS(state, cpu_id),
> +
> +       TP_STRUCT__entry(
> +               __field(        u32,            state           )
> +               __field(        u32,            cpu_id          )
> +       ),
> +
> +       TP_fast_assign(
> +               __entry->state = state;
> +               __entry->cpu_id = cpu_id;
> +       ),
> +
> +       TP_printk("state=%lu cpu_id=%lu", (unsigned long)__entry->state,
> +                 (unsigned long)__entry->cpu_id)
Using %lu for the state field causes PWR_EVENT_EXIT to appear as
4294967295 instead of -1. Can the field be of a signed type?
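
To illustrate the point, a minimal user-space sketch (plain C, not kernel
code). state_as_printed() and print_state() are made-up names; only
PWR_EVENT_EXIT, the u32 field and the "%lu" format with the unsigned long
cast are taken from the patch.

```c
#include <stdint.h>
#include <stdio.h>

#define PWR_EVENT_EXIT -1

/* Value that ends up in the trace output for a given state argument. */
unsigned long state_as_printed(unsigned int state)
{
	uint32_t field = state;		/* mirrors __field(u32, state) */

	return (unsigned long)field;	/* mirrors the cast in TP_printk() */
}

/* Print the state the same way the TP_printk() format string does. */
void print_state(unsigned int state)
{
	printf("state=%lu\n", state_as_printed(state));
}
```

With a 32-bit unsigned int, print_state(PWR_EVENT_EXIT) prints
state=4294967295, so a consumer of the text output has to match that value
(or compare against (u32)-1) rather than look for -1.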

> +);
> +
> +DEFINE_EVENT(cpu, cpu_idle,
> +
> +       TP_PROTO(unsigned int state, unsigned int cpu_id),
> +
> +       TP_ARGS(state, cpu_id)
> +);
> +
> +/* This file can get included multiple times, TRACE_HEADER_MULTI_READ at top */
> +#ifndef _PWR_EVENT_AVOID_DOUBLE_DEFINING
> +#define _PWR_EVENT_AVOID_DOUBLE_DEFINING
> +
> +#define PWR_EVENT_EXIT -1
> +
> +#endif
> +
> +DEFINE_EVENT(cpu, cpu_frequency,
> +
> +       TP_PROTO(unsigned int frequency, unsigned int cpu_id),
> +
> +       TP_ARGS(frequency, cpu_id)
> +);
> +
> +TRACE_EVENT(machine_suspend,
> +
> +       TP_PROTO(unsigned int state),
> +
> +       TP_ARGS(state),
> +
> +       TP_STRUCT__entry(
> +               __field(        u32,            state           )
> +       ),
> +
> +       TP_fast_assign(
> +               __entry->state = state;
> +       ),
> +
> +       TP_printk("state=%lu", (unsigned long)__entry->state)
Same remark about the unsigned type for the state field.

> +);
> +
> +#ifdef CONFIG_EVENT_POWER_TRACING_DEPRECATED
> +
>  #ifndef _TRACE_POWER_ENUM_
>  #define _TRACE_POWER_ENUM_
>  enum {
> @@ -153,8 +214,32 @@ DEFINE_EVENT(power_domain, power_domain_target,
>
>        TP_ARGS(name, state, cpu_id)
>  );
> -
> +#endif /* CONFIG_EVENT_POWER_TRACING_DEPRECATED */
The clock and power_domain events have been recently introduced and so
must be part of the new API. Can this #endif be moved right after the
definition of power_end?

>  #endif /* _TRACE_POWER_H */
Should this be at the very end of the file?

>
> +/* Deprecated dummy functions must be protected against multi-declartion */
> +#ifndef EVENT_POWER_TRACING_DEPRECATED_PART_H
> +#define EVENT_POWER_TRACING_DEPRECATED_PART_H
> +
> +#ifndef CONFIG_EVENT_POWER_TRACING_DEPRECATED
> +
> +#ifndef _TRACE_POWER_ENUM_
> +#define _TRACE_POWER_ENUM_
> +enum {
> +       POWER_NONE = 0,
> +       POWER_CSTATE = 1,
> +       POWER_PSTATE = 2,
> +};
> +#endif
> +
> +static inline void trace_power_start(u64 type, u64 state, u64 cpuid) {};
> +static inline void trace_power_end(u64 cpuid) {};
> +static inline void trace_power_frequency(u64 type, u64 state, u64 cpuid) {};
> +#endif /* CONFIG_EVENT_POWER_TRACING_DEPRECATED */
> +
> +#endif /* EVENT_POWER_TRACING_DEPRECATED_PART_H */
> +
> +
> +
>  /* This part must be outside protection */
>  #include <trace/define_trace.h>
> diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
> index e04b8bc..0be2e7f 100644
> --- a/kernel/trace/Kconfig
> +++ b/kernel/trace/Kconfig
> @@ -69,6 +69,21 @@ config EVENT_TRACING
>        select CONTEXT_SWITCH_TRACER
>        bool
>
> +config EVENT_POWER_TRACING_DEPRECATED
> +       depends on EVENT_TRACING
> +       bool
A prompt string is needed here. Without it, it is impossible to unset the option.
This does the trick: +bool "Deprecated power event trace API, to be removed"

> +       default y
> +       help
> +         Provides old power event types:
> +         C-state/idle accounting events:
> +         power:power_start
> +         power:power_end
> +         and old cpufreq accounting event:
> +         power:power_frequency
> +         This is for userspace compatibility
> +         and will vanish after 5 kernel iterations,
> +         namely 2.6.41.
> +
>  config CONTEXT_SWITCH_TRACER
>        bool
>
...

Thanks,
Jean


* [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-11-11 18:03 [RESEND] Power trace event cleanup by still providing old interface for some time Thomas Renninger
@ 2010-11-11 18:03 ` Thomas Renninger
  2010-11-12 14:20   ` Jean Pihet
  2010-11-14 13:22   ` Thomas Renninger
  0 siblings, 2 replies; 157+ messages in thread
From: Thomas Renninger @ 2010-11-11 18:03 UTC (permalink / raw)
  To: mingo; +Cc: trenn, rjw, linux-kernel, arjan, jean.pihet

Recent changes:
  - Enable EVENT_POWER_TRACING_DEPRECATED by default

New power trace events:
power:cpu_idle
power:cpu_frequency
power:machine_suspend


C-state/idle accounting events:
  power:power_start
  power:power_end
are replaced with:
  power:cpu_idle

and
  power:power_frequency
is replaced with:
  power:cpu_frequency

power:machine_suspend
is newly introduced.
Jean Pihet has a patch integrated into the generic layer
(kernel/power/suspend.c) which will make use of it.

The type= field was removed from both; it was never
used, and the type is distinguished by the event type itself.

perf timechart
userspace tool gets adjusted in a separate patch.

Signed-off-by: Thomas Renninger <trenn@suse.de>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Jean Pihet <jean.pihet@newoldbits.com>
CC: Arjan van de Ven <arjan@linux.intel.com>
CC: Ingo Molnar <mingo@elte.hu>
CC: rjw@sisk.pl
CC: linux-kernel@vger.kernel.org
---
 arch/x86/kernel/process.c    |    7 +++-
 arch/x86/kernel/process_32.c |    2 +-
 arch/x86/kernel/process_64.c |    2 +
 drivers/cpufreq/cpufreq.c    |    1 +
 drivers/cpuidle/cpuidle.c    |    1 +
 drivers/idle/intel_idle.c    |    1 +
 include/trace/events/power.h |   87 +++++++++++++++++++++++++++++++++++++++++-
 kernel/trace/Kconfig         |   15 +++++++
 kernel/trace/power-traces.c  |    3 +
 9 files changed, 116 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 57d1868..155d975 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -374,6 +374,7 @@ void default_idle(void)
 {
 	if (hlt_use_halt()) {
 		trace_power_start(POWER_CSTATE, 1, smp_processor_id());
+		trace_cpu_idle(1, smp_processor_id());
 		current_thread_info()->status &= ~TS_POLLING;
 		/*
 		 * TS_POLLING-cleared state must be visible before we
@@ -444,6 +445,7 @@ EXPORT_SYMBOL_GPL(cpu_idle_wait);
 void mwait_idle_with_hints(unsigned long ax, unsigned long cx)
 {
 	trace_power_start(POWER_CSTATE, (ax>>4)+1, smp_processor_id());
+	trace_cpu_idle((ax>>4)+1, smp_processor_id());
 	if (!need_resched()) {
 		if (cpu_has(&current_cpu_data, X86_FEATURE_CLFLUSH_MONITOR))
 			clflush((void *)&current_thread_info()->flags);
@@ -460,6 +462,7 @@ static void mwait_idle(void)
 {
 	if (!need_resched()) {
 		trace_power_start(POWER_CSTATE, 1, smp_processor_id());
+		trace_cpu_idle(1, smp_processor_id());
 		if (cpu_has(&current_cpu_data, X86_FEATURE_CLFLUSH_MONITOR))
 			clflush((void *)&current_thread_info()->flags);
 
@@ -481,10 +484,12 @@ static void mwait_idle(void)
 static void poll_idle(void)
 {
 	trace_power_start(POWER_CSTATE, 0, smp_processor_id());
+	trace_cpu_idle(0, smp_processor_id());
 	local_irq_enable();
 	while (!need_resched())
 		cpu_relax();
-	trace_power_end(0);
+	trace_power_end(smp_processor_id());
+	trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
 }
 
 /*
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 96586c3..4b9befa 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -113,8 +113,8 @@ void cpu_idle(void)
 			stop_critical_timings();
 			pm_idle();
 			start_critical_timings();
-
 			trace_power_end(smp_processor_id());
+			trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
 		}
 		tick_nohz_restart_sched_tick();
 		preempt_enable_no_resched();
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index b3d7a3a..4c818a7 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -142,6 +142,8 @@ void cpu_idle(void)
 			start_critical_timings();
 
 			trace_power_end(smp_processor_id());
+			trace_cpu_idle(PWR_EVENT_EXIT,
+				       smp_processor_id());
 
 			/* In many cases the interrupt that ended idle
 			   has already called exit_idle. But some idle
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index c63a438..1109f68 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -355,6 +355,7 @@ void cpufreq_notify_transition(struct cpufreq_freqs *freqs, unsigned int state)
 		dprintk("FREQ: %lu - CPU: %lu", (unsigned long)freqs->new,
 			(unsigned long)freqs->cpu);
 		trace_power_frequency(POWER_PSTATE, freqs->new, freqs->cpu);
+		trace_cpu_frequency(freqs->new, freqs->cpu);
 		srcu_notifier_call_chain(&cpufreq_transition_notifier_list,
 				CPUFREQ_POSTCHANGE, freqs);
 		if (likely(policy) && likely(policy->cpu == freqs->cpu))
diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index a507108..08d5f05 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -107,6 +107,7 @@ static void cpuidle_idle_call(void)
 	if (cpuidle_curr_governor->reflect)
 		cpuidle_curr_governor->reflect(dev);
 	trace_power_end(smp_processor_id());
+	trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
 }
 
 /**
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index 3c95325..ba5134f 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -221,6 +221,7 @@ static int intel_idle(struct cpuidle_device *dev, struct cpuidle_state *state)
 
 	stop_critical_timings();
 	trace_power_start(POWER_CSTATE, (eax >> 4) + 1, cpu);
+	trace_cpu_idle((eax >> 4) + 1, cpu);
 	if (!need_resched()) {
 
 		__monitor((void *)&current_thread_info()->flags, 0, 0);
diff --git a/include/trace/events/power.h b/include/trace/events/power.h
index 286784d..ab26d8e 100644
--- a/include/trace/events/power.h
+++ b/include/trace/events/power.h
@@ -7,6 +7,67 @@
 #include <linux/ktime.h>
 #include <linux/tracepoint.h>
 
+DECLARE_EVENT_CLASS(cpu,
+
+	TP_PROTO(unsigned int state, unsigned int cpu_id),
+
+	TP_ARGS(state, cpu_id),
+
+	TP_STRUCT__entry(
+		__field(	u32,		state		)
+		__field(	u32,		cpu_id		)
+	),
+
+	TP_fast_assign(
+		__entry->state = state;
+		__entry->cpu_id = cpu_id;
+	),
+
+	TP_printk("state=%lu cpu_id=%lu", (unsigned long)__entry->state,
+		  (unsigned long)__entry->cpu_id)
+);
+
+DEFINE_EVENT(cpu, cpu_idle,
+
+	TP_PROTO(unsigned int state, unsigned int cpu_id),
+
+	TP_ARGS(state, cpu_id)
+);
+
+/* This file can get included multiple times, TRACE_HEADER_MULTI_READ at top */
+#ifndef _PWR_EVENT_AVOID_DOUBLE_DEFINING
+#define _PWR_EVENT_AVOID_DOUBLE_DEFINING
+
+#define PWR_EVENT_EXIT -1
+
+#endif
+
+DEFINE_EVENT(cpu, cpu_frequency,
+
+	TP_PROTO(unsigned int frequency, unsigned int cpu_id),
+
+	TP_ARGS(frequency, cpu_id)
+);
+
+TRACE_EVENT(machine_suspend,
+
+	TP_PROTO(unsigned int state),
+
+	TP_ARGS(state),
+
+	TP_STRUCT__entry(
+		__field(	u32,		state		)
+	),
+
+	TP_fast_assign(
+		__entry->state = state;
+	),
+
+	TP_printk("state=%lu", (unsigned long)__entry->state)
+);
+
+#ifdef CONFIG_EVENT_POWER_TRACING_DEPRECATED
+
 #ifndef _TRACE_POWER_ENUM_
 #define _TRACE_POWER_ENUM_
 enum {
@@ -153,8 +214,32 @@ DEFINE_EVENT(power_domain, power_domain_target,
 
 	TP_ARGS(name, state, cpu_id)
 );
-
+#endif /* CONFIG_EVENT_POWER_TRACING_DEPRECATED */
 #endif /* _TRACE_POWER_H */
 
+/* Deprecated dummy functions must be protected against multi-declaration */
+#ifndef EVENT_POWER_TRACING_DEPRECATED_PART_H
+#define EVENT_POWER_TRACING_DEPRECATED_PART_H
+
+#ifndef CONFIG_EVENT_POWER_TRACING_DEPRECATED
+
+#ifndef _TRACE_POWER_ENUM_
+#define _TRACE_POWER_ENUM_
+enum {
+	POWER_NONE = 0,
+	POWER_CSTATE = 1,
+	POWER_PSTATE = 2,
+};
+#endif
+
+static inline void trace_power_start(u64 type, u64 state, u64 cpuid) {}
+static inline void trace_power_end(u64 cpuid) {}
+static inline void trace_power_frequency(u64 type, u64 state, u64 cpuid) {}
+#endif /* CONFIG_EVENT_POWER_TRACING_DEPRECATED */
+
+#endif /* EVENT_POWER_TRACING_DEPRECATED_PART_H */
+
+
+
 /* This part must be outside protection */
 #include <trace/define_trace.h>
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index e04b8bc..0be2e7f 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -69,6 +69,21 @@ config EVENT_TRACING
 	select CONTEXT_SWITCH_TRACER
 	bool
 
+config EVENT_POWER_TRACING_DEPRECATED
+	depends on EVENT_TRACING
+	bool
+	default y
+	help
+	  Provides old power event types:
+	  C-state/idle accounting events:
+	  power:power_start
+	  power:power_end
+	  and old cpufreq accounting event:
+	  power:power_frequency
+	  This is for userspace compatibility
+	  and will vanish after 5 kernel iterations,
+	  namely 2.6.41.
+
 config CONTEXT_SWITCH_TRACER
 	bool
 
diff --git a/kernel/trace/power-traces.c b/kernel/trace/power-traces.c
index 0e0497d..f55fcf6 100644
--- a/kernel/trace/power-traces.c
+++ b/kernel/trace/power-traces.c
@@ -13,5 +13,8 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/power.h>
 
+#ifdef CONFIG_EVENT_POWER_TRACING_DEPRECATED
 EXPORT_TRACEPOINT_SYMBOL_GPL(power_start);
+#endif
+EXPORT_TRACEPOINT_SYMBOL_GPL(cpu_idle);
 
-- 
1.6.3



* [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-28 11:31     ` [linux-pm] " Rafael J. Wysocki
  2010-10-28 11:37       ` Thomas Renninger
@ 2010-10-28 11:37       ` Thomas Renninger
  1 sibling, 0 replies; 157+ messages in thread
From: Thomas Renninger @ 2010-10-28 11:37 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: jean.pihet, linux-trace-users, linux-pm, linux-omap, arjan, mingo

Recent changes:
  - Enable EVENT_POWER_TRACING_DEPRECATED by default

New power trace events:
power:cpu_idle
power:cpu_frequency
power:machine_suspend


C-state/idle accounting events:
  power:power_start
  power:power_end
are replaced with:
  power:cpu_idle

and
  power:power_frequency
is replaced with:
  power:cpu_frequency

power:machine_suspend
is newly introduced.
Jean Pihet has a patch integrated into the generic layer
(kernel/power/suspend.c) which will make use of it.

The type= field was removed from both; it was never
used, and the type is distinguished by the event type itself.

perf timechart
userspace tool gets adjusted in a separate patch.

Signed-off-by: Thomas Renninger <trenn@suse.de>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
CC: linux-omap@vger.kernel.org
CC: linux-pm@lists.linux-foundation.org
CC: linux-trace-users@vger.kernel.org
CC: Jean Pihet <jean.pihet@newoldbits.com>
CC: Arjan van de Ven <arjan@linux.intel.com>
CC: Ingo Molnar <mingo@elte.hu>
CC: rjw@sisk.pl

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 57d1868..155d975 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -374,6 +374,7 @@ void default_idle(void)
 {
 	if (hlt_use_halt()) {
 		trace_power_start(POWER_CSTATE, 1, smp_processor_id());
+		trace_cpu_idle(1, smp_processor_id());
 		current_thread_info()->status &= ~TS_POLLING;
 		/*
 		 * TS_POLLING-cleared state must be visible before we
@@ -444,6 +445,7 @@ EXPORT_SYMBOL_GPL(cpu_idle_wait);
 void mwait_idle_with_hints(unsigned long ax, unsigned long cx)
 {
 	trace_power_start(POWER_CSTATE, (ax>>4)+1, smp_processor_id());
+	trace_cpu_idle((ax>>4)+1, smp_processor_id());
 	if (!need_resched()) {
 		if (cpu_has(&current_cpu_data, X86_FEATURE_CLFLUSH_MONITOR))
 			clflush((void *)&current_thread_info()->flags);
@@ -460,6 +462,7 @@ static void mwait_idle(void)
 {
 	if (!need_resched()) {
 		trace_power_start(POWER_CSTATE, 1, smp_processor_id());
+		trace_cpu_idle(1, smp_processor_id());
 		if (cpu_has(&current_cpu_data, X86_FEATURE_CLFLUSH_MONITOR))
 			clflush((void *)&current_thread_info()->flags);
 
@@ -481,10 +484,12 @@ static void mwait_idle(void)
 static void poll_idle(void)
 {
 	trace_power_start(POWER_CSTATE, 0, smp_processor_id());
+	trace_cpu_idle(0, smp_processor_id());
 	local_irq_enable();
 	while (!need_resched())
 		cpu_relax();
-	trace_power_end(0);
+	trace_power_end(smp_processor_id());
+	trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
 }
 
 /*
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 96586c3..4b9befa 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -113,8 +113,8 @@ void cpu_idle(void)
 			stop_critical_timings();
 			pm_idle();
 			start_critical_timings();
-
 			trace_power_end(smp_processor_id());
+			trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
 		}
 		tick_nohz_restart_sched_tick();
 		preempt_enable_no_resched();
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 3d9ea53..28153a9 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -142,6 +142,8 @@ void cpu_idle(void)
 			start_critical_timings();
 
 			trace_power_end(smp_processor_id());
+			trace_cpu_idle(PWR_EVENT_EXIT,
+				       smp_processor_id());
 
 			/* In many cases the interrupt that ended idle
 			   has already called exit_idle. But some idle
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 199dcb9..ed4919e 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -355,6 +355,7 @@ void cpufreq_notify_transition(struct cpufreq_freqs *freqs, unsigned int state)
 		dprintk("FREQ: %lu - CPU: %lu", (unsigned long)freqs->new,
 			(unsigned long)freqs->cpu);
 		trace_power_frequency(POWER_PSTATE, freqs->new, freqs->cpu);
+		trace_cpu_frequency(freqs->new, freqs->cpu);
 		srcu_notifier_call_chain(&cpufreq_transition_notifier_list,
 				CPUFREQ_POSTCHANGE, freqs);
 		if (likely(policy) && likely(policy->cpu == freqs->cpu))
diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index a507108..08d5f05 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -107,6 +107,7 @@ static void cpuidle_idle_call(void)
 	if (cpuidle_curr_governor->reflect)
 		cpuidle_curr_governor->reflect(dev);
 	trace_power_end(smp_processor_id());
+	trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
 }
 
 /**
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index 21ac077..d3701bf 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -202,6 +202,7 @@ static int intel_idle(struct cpuidle_device *dev, struct cpuidle_state *state)
 
 	stop_critical_timings();
 	trace_power_start(POWER_CSTATE, (eax >> 4) + 1, cpu);
+	trace_cpu_idle((eax >> 4) + 1, cpu);
 	if (!need_resched()) {
 
 		__monitor((void *)&current_thread_info()->flags, 0, 0);
diff --git a/include/trace/events/power.h b/include/trace/events/power.h
index 35a2a6e..f10de41 100644
--- a/include/trace/events/power.h
+++ b/include/trace/events/power.h
@@ -7,6 +7,67 @@
 #include <linux/ktime.h>
 #include <linux/tracepoint.h>
 
+DECLARE_EVENT_CLASS(cpu,
+
+	TP_PROTO(unsigned int state, unsigned int cpu_id),
+
+	TP_ARGS(state, cpu_id),
+
+	TP_STRUCT__entry(
+		__field(	u32,		state		)
+		__field(	u32,		cpu_id		)
+	),
+
+	TP_fast_assign(
+		__entry->state = state;
+		__entry->cpu_id = cpu_id;
+	),
+
+	TP_printk("state=%lu cpu_id=%lu", (unsigned long)__entry->state,
+		  (unsigned long)__entry->cpu_id)
+);
+
+DEFINE_EVENT(cpu, cpu_idle,
+
+	TP_PROTO(unsigned int state, unsigned int cpu_id),
+
+	TP_ARGS(state, cpu_id)
+);
+
+/* This file can get included multiple times, TRACE_HEADER_MULTI_READ at top */
+#ifndef _PWR_EVENT_AVOID_DOUBLE_DEFINING
+#define _PWR_EVENT_AVOID_DOUBLE_DEFINING
+
+#define PWR_EVENT_EXIT -1
+
+#endif
+
+DEFINE_EVENT(cpu, cpu_frequency,
+
+	TP_PROTO(unsigned int frequency, unsigned int cpu_id),
+
+	TP_ARGS(frequency, cpu_id)
+);
+
+TRACE_EVENT(machine_suspend,
+
+	TP_PROTO(unsigned int state),
+
+	TP_ARGS(state),
+
+	TP_STRUCT__entry(
+		__field(	u32,		state		)
+	),
+
+	TP_fast_assign(
+		__entry->state = state;
+	),
+
+	TP_printk("state=%lu", (unsigned long)__entry->state)
+);
+
+#ifdef CONFIG_EVENT_POWER_TRACING_DEPRECATED
+
 #ifndef _TRACE_POWER_ENUM_
 #define _TRACE_POWER_ENUM_
 enum {
@@ -69,8 +130,32 @@ TRACE_EVENT(power_end,
 	TP_printk("cpu_id=%lu", (unsigned long)__entry->cpu_id)
 
 );
-
+#endif /* CONFIG_EVENT_POWER_TRACING_DEPRECATED */
 #endif /* _TRACE_POWER_H */
 
+/* Deprecated dummy functions must be protected against multi-declaration */
+#ifndef EVENT_POWER_TRACING_DEPRECATED_PART_H
+#define EVENT_POWER_TRACING_DEPRECATED_PART_H
+
+#ifndef CONFIG_EVENT_POWER_TRACING_DEPRECATED
+
+#ifndef _TRACE_POWER_ENUM_
+#define _TRACE_POWER_ENUM_
+enum {
+	POWER_NONE = 0,
+	POWER_CSTATE = 1,
+	POWER_PSTATE = 2,
+};
+#endif
+
+static inline void trace_power_start(u64 type, u64 state, u64 cpuid) {};
+static inline void trace_power_end(u64 cpuid) {};
+static inline void trace_power_frequency(u64 type, u64 state, u64 cpuid) {};
+#endif /* CONFIG_EVENT_POWER_TRACING_DEPRECATED */
+
+#endif /* EVENT_POWER_TRACING_DEPRECATED_PART_H */
+
+
+
 /* This part must be outside protection */
 #include <trace/define_trace.h>
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 538501c..8ccbedd 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -64,6 +64,21 @@ config EVENT_TRACING
 	select CONTEXT_SWITCH_TRACER
 	bool
 
+config EVENT_POWER_TRACING_DEPRECATED
+	depends on EVENT_TRACING
+	bool
+	default y
+	help
+	  Provides old power event types:
+	  C-state/idle accounting events:
+	  power:power_start
+	  power:power_end
+	  and old cpufreq accounting event:
+	  power:power_frequency
+	  This is for userspace compatibility
+	  and will vanish after 5 kernel iterations,
+	  namely 2.6.41.
+
 config CONTEXT_SWITCH_TRACER
 	bool
 
diff --git a/kernel/trace/power-traces.c b/kernel/trace/power-traces.c
index 0e0497d..f55fcf6 100644
--- a/kernel/trace/power-traces.c
+++ b/kernel/trace/power-traces.c
@@ -13,5 +13,8 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/power.h>
 
+#ifdef CONFIG_EVENT_POWER_TRACING_DEPRECATED
 EXPORT_TRACEPOINT_SYMBOL_GPL(power_start);
+#endif
+EXPORT_TRACEPOINT_SYMBOL_GPL(cpu_idle);
 

^ permalink raw reply related	[flat|nested] 157+ messages in thread

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-28 11:17   ` Rafael J. Wysocki
@ 2010-10-28 11:31     ` Rafael J. Wysocki
  2010-10-28 11:31     ` [linux-pm] " Rafael J. Wysocki
  1 sibling, 0 replies; 157+ messages in thread
From: Rafael J. Wysocki @ 2010-10-28 11:31 UTC (permalink / raw)
  To: linux-pm; +Cc: jean.pihet, linux-trace-users, mingo, linux-omap, arjan

On Thursday, October 28, 2010, Rafael J. Wysocki wrote:
> On Thursday, October 28, 2010, Thomas Renninger wrote:
> > Recent changes:
> >   - Enable EVENT_POWER_TRACING_DEPRECATED by default
> > 
> > New power trace events:
> > power:cpu_idle
> > power:cpu_frequency
> > power:machine_suspend
> > 
> > 
> > C-state/idle accounting events:
> >   power:power_start
> >   power:power_end
> > are replaced with:
> >   power:cpu_idle
> > 
> > and
> >   power:power_frequency
> > is replaced with:
> >   power:cpu_frequency
> > 
> > power:machine_suspend
> > is newly introduced, a first implementation
> > comes from the ARM side, but it's easy to add these events
> > in X86 as well if needed.
> 
> Can you please check that changelog, please?

Sorry s/check/modify/

In fact, there won't be any ARM implementation, because it's going to be
added at the core level.

> I've asked you for that already once.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-28  9:02 ` [PATCH 2/3] PERF(kernel): Cleanup power events Thomas Renninger
@ 2010-10-28 11:17   ` Rafael J. Wysocki
  2010-10-28 11:17   ` Rafael J. Wysocki
  1 sibling, 0 replies; 157+ messages in thread
From: Rafael J. Wysocki @ 2010-10-28 11:17 UTC (permalink / raw)
  To: Thomas Renninger
  Cc: jean.pihet, linux-trace-users, linux-pm, linux-omap, arjan, mingo

On Thursday, October 28, 2010, Thomas Renninger wrote:
> Recent changes:
>   - Enable EVENT_POWER_TRACING_DEPRECATED by default
> 
> New power trace events:
> power:cpu_idle
> power:cpu_frequency
> power:machine_suspend
> 
> 
> C-state/idle accounting events:
>   power:power_start
>   power:power_end
> are replaced with:
>   power:cpu_idle
> 
> and
>   power:power_frequency
> is replaced with:
>   power:cpu_frequency
> 
> power:machine_suspend
> is newly introduced, a first implementation
> comes from the ARM side, but it's easy to add these events
> in X86 as well if needed.

Can you please check that changelog, please?

I've asked you for that already once.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 157+ messages in thread

* [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-28  9:02 Cleanup and enhance power trace events Thomas Renninger
  2010-10-28  9:02 ` [PATCH 2/3] PERF(kernel): Cleanup power events Thomas Renninger
@ 2010-10-28  9:02 ` Thomas Renninger
  1 sibling, 0 replies; 157+ messages in thread
From: Thomas Renninger @ 2010-10-28  9:02 UTC (permalink / raw)
  To: trenn, linux-omap, linux-pm, linux-trace-users, jean.pihet, arjan, mingo

Recent changes:
  - Enable EVENT_POWER_TRACING_DEPRECATED by default

New power trace events:
power:cpu_idle
power:cpu_frequency
power:machine_suspend


C-state/idle accounting events:
  power:power_start
  power:power_end
are replaced with:
  power:cpu_idle

and
  power:power_frequency
is replaced with:
  power:cpu_frequency

power:machine_suspend
is newly introduced, a first implementation
comes from the ARM side, but it's easy to add these events
in X86 as well if needed.

The type= field was dropped from both the idle and the frequency
event; it was never used, and the event name itself already
distinguishes the type.

The perf timechart userspace tool is adjusted in a separate patch.
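
For illustration only (not part of this patch): the old-to-new event
mapping described above, written as a small userspace sketch. It
mirrors the paired trace calls in the diff: power_start becomes
cpu_idle with the same state, power_end becomes cpu_idle with
PWR_EVENT_EXIT, and power_frequency becomes cpu_frequency. The enum
new_event, struct new_record and the from_*() helpers are hypothetical
names, not kernel interfaces.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the patch. */
enum { POWER_NONE = 0, POWER_CSTATE = 1, POWER_PSTATE = 2 };
#define PWR_EVENT_EXIT -1

/* Hypothetical tags for the new events; not kernel identifiers. */
enum new_event { NEW_NONE, NEW_CPU_IDLE, NEW_CPU_FREQUENCY };

/* Hypothetical in-memory form of a translated record. */
struct new_record {
	enum new_event event;
	uint32_t state;		/* C-state, frequency, or (u32)PWR_EVENT_EXIT */
	uint32_t cpu_id;
};

/* power_start(type, state, cpu) -> cpu_idle(state, cpu) */
static struct new_record from_power_start(uint64_t type, uint64_t state,
					   uint64_t cpu_id)
{
	struct new_record r = { NEW_NONE, 0, 0 };

	if (type == POWER_CSTATE) {
		r.event = NEW_CPU_IDLE;
		r.state = (uint32_t)state;
		r.cpu_id = (uint32_t)cpu_id;
	}
	return r;
}

/* power_end(cpu) -> cpu_idle(PWR_EVENT_EXIT, cpu) */
static struct new_record from_power_end(uint64_t cpu_id)
{
	struct new_record r = { NEW_CPU_IDLE, (uint32_t)PWR_EVENT_EXIT,
				(uint32_t)cpu_id };
	return r;
}

/* power_frequency(type, freq, cpu) -> cpu_frequency(freq, cpu) */
static struct new_record from_power_frequency(uint64_t type, uint64_t freq,
					       uint64_t cpu_id)
{
	struct new_record r = { NEW_NONE, 0, 0 };

	if (type == POWER_PSTATE) {
		r.event = NEW_CPU_FREQUENCY;
		r.state = (uint32_t)freq;
		r.cpu_id = (uint32_t)cpu_id;
	}
	return r;
}
```

The type argument is only consulted to reject records that do not
match the event, which reflects the point above that the event name
already carries the type.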

Signed-off-by: Thomas Renninger <trenn@suse.de>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
CC: linux-omap@vger.kernel.org
CC: linux-pm@lists.linux-foundation.org
CC: linux-trace-users@vger.kernel.org
CC: Jean Pihet <jean.pihet@newoldbits.com>
CC: Arjan van de Ven <arjan@linux.intel.com>
CC: Ingo Molnar <mingo@elte.hu>
CC: rjw@sisk.pl
---
 arch/x86/kernel/process.c    |    7 +++-
 arch/x86/kernel/process_32.c |    2 +-
 arch/x86/kernel/process_64.c |    2 +
 drivers/cpufreq/cpufreq.c    |    1 +
 drivers/cpuidle/cpuidle.c    |    1 +
 drivers/idle/intel_idle.c    |    1 +
 include/trace/events/power.h |   87 +++++++++++++++++++++++++++++++++++++++++-
 kernel/trace/Kconfig         |   15 +++++++
 kernel/trace/power-traces.c  |    3 +
 9 files changed, 116 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 57d1868..155d975 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -374,6 +374,7 @@ void default_idle(void)
 {
 	if (hlt_use_halt()) {
 		trace_power_start(POWER_CSTATE, 1, smp_processor_id());
+		trace_cpu_idle(1, smp_processor_id());
 		current_thread_info()->status &= ~TS_POLLING;
 		/*
 		 * TS_POLLING-cleared state must be visible before we
@@ -444,6 +445,7 @@ EXPORT_SYMBOL_GPL(cpu_idle_wait);
 void mwait_idle_with_hints(unsigned long ax, unsigned long cx)
 {
 	trace_power_start(POWER_CSTATE, (ax>>4)+1, smp_processor_id());
+	trace_cpu_idle((ax>>4)+1, smp_processor_id());
 	if (!need_resched()) {
 		if (cpu_has(&current_cpu_data, X86_FEATURE_CLFLUSH_MONITOR))
 			clflush((void *)&current_thread_info()->flags);
@@ -460,6 +462,7 @@ static void mwait_idle(void)
 {
 	if (!need_resched()) {
 		trace_power_start(POWER_CSTATE, 1, smp_processor_id());
+		trace_cpu_idle(1, smp_processor_id());
 		if (cpu_has(&current_cpu_data, X86_FEATURE_CLFLUSH_MONITOR))
 			clflush((void *)&current_thread_info()->flags);
 
@@ -481,10 +484,12 @@ static void mwait_idle(void)
 static void poll_idle(void)
 {
 	trace_power_start(POWER_CSTATE, 0, smp_processor_id());
+	trace_cpu_idle(0, smp_processor_id());
 	local_irq_enable();
 	while (!need_resched())
 		cpu_relax();
-	trace_power_end(0);
+	trace_power_end(smp_processor_id());
+	trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
 }
 
 /*
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 96586c3..4b9befa 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -113,8 +113,8 @@ void cpu_idle(void)
 			stop_critical_timings();
 			pm_idle();
 			start_critical_timings();
-
 			trace_power_end(smp_processor_id());
+			trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
 		}
 		tick_nohz_restart_sched_tick();
 		preempt_enable_no_resched();
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 3d9ea53..28153a9 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -142,6 +142,8 @@ void cpu_idle(void)
 			start_critical_timings();
 
 			trace_power_end(smp_processor_id());
+			trace_cpu_idle(PWR_EVENT_EXIT,
+				       smp_processor_id());
 
 			/* In many cases the interrupt that ended idle
 			   has already called exit_idle. But some idle
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 199dcb9..ed4919e 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -355,6 +355,7 @@ void cpufreq_notify_transition(struct cpufreq_freqs *freqs, unsigned int state)
 		dprintk("FREQ: %lu - CPU: %lu", (unsigned long)freqs->new,
 			(unsigned long)freqs->cpu);
 		trace_power_frequency(POWER_PSTATE, freqs->new, freqs->cpu);
+		trace_cpu_frequency(freqs->new, freqs->cpu);
 		srcu_notifier_call_chain(&cpufreq_transition_notifier_list,
 				CPUFREQ_POSTCHANGE, freqs);
 		if (likely(policy) && likely(policy->cpu == freqs->cpu))
diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index a507108..08d5f05 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -107,6 +107,7 @@ static void cpuidle_idle_call(void)
 	if (cpuidle_curr_governor->reflect)
 		cpuidle_curr_governor->reflect(dev);
 	trace_power_end(smp_processor_id());
+	trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
 }
 
 /**
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index 21ac077..d3701bf 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -202,6 +202,7 @@ static int intel_idle(struct cpuidle_device *dev, struct cpuidle_state *state)
 
 	stop_critical_timings();
 	trace_power_start(POWER_CSTATE, (eax >> 4) + 1, cpu);
+	trace_cpu_idle((eax >> 4) + 1, cpu);
 	if (!need_resched()) {
 
 		__monitor((void *)&current_thread_info()->flags, 0, 0);
diff --git a/include/trace/events/power.h b/include/trace/events/power.h
index 35a2a6e..f10de41 100644
--- a/include/trace/events/power.h
+++ b/include/trace/events/power.h
@@ -7,6 +7,67 @@
 #include <linux/ktime.h>
 #include <linux/tracepoint.h>
 
+DECLARE_EVENT_CLASS(cpu,
+
+	TP_PROTO(unsigned int state, unsigned int cpu_id),
+
+	TP_ARGS(state, cpu_id),
+
+	TP_STRUCT__entry(
+		__field(	u32,		state		)
+		__field(	u32,		cpu_id		)
+	),
+
+	TP_fast_assign(
+		__entry->state = state;
+		__entry->cpu_id = cpu_id;
+	),
+
+	TP_printk("state=%lu cpu_id=%lu", (unsigned long)__entry->state,
+		  (unsigned long)__entry->cpu_id)
+);
+
+DEFINE_EVENT(cpu, cpu_idle,
+
+	TP_PROTO(unsigned int state, unsigned int cpu_id),
+
+	TP_ARGS(state, cpu_id)
+);
+
+/* This file can get included multiple times, TRACE_HEADER_MULTI_READ at top */
+#ifndef _PWR_EVENT_AVOID_DOUBLE_DEFINING
+#define _PWR_EVENT_AVOID_DOUBLE_DEFINING
+
+#define PWR_EVENT_EXIT -1
+
+#endif
+
+DEFINE_EVENT(cpu, cpu_frequency,
+
+	TP_PROTO(unsigned int frequency, unsigned int cpu_id),
+
+	TP_ARGS(frequency, cpu_id)
+);
+
+TRACE_EVENT(machine_suspend,
+
+	TP_PROTO(unsigned int state),
+
+	TP_ARGS(state),
+
+	TP_STRUCT__entry(
+		__field(	u32,		state		)
+	),
+
+	TP_fast_assign(
+		__entry->state = state;
+	),
+
+	TP_printk("state=%lu", (unsigned long)__entry->state)
+);
+
+#ifdef CONFIG_EVENT_POWER_TRACING_DEPRECATED
+
 #ifndef _TRACE_POWER_ENUM_
 #define _TRACE_POWER_ENUM_
 enum {
@@ -69,8 +130,32 @@ TRACE_EVENT(power_end,
 	TP_printk("cpu_id=%lu", (unsigned long)__entry->cpu_id)
 
 );
-
+#endif /* CONFIG_EVENT_POWER_TRACING_DEPRECATED */
 #endif /* _TRACE_POWER_H */
 
+/* Deprecated dummy functions must be protected against multi-declaration */
+#ifndef EVENT_POWER_TRACING_DEPRECATED_PART_H
+#define EVENT_POWER_TRACING_DEPRECATED_PART_H
+
+#ifndef CONFIG_EVENT_POWER_TRACING_DEPRECATED
+
+#ifndef _TRACE_POWER_ENUM_
+#define _TRACE_POWER_ENUM_
+enum {
+	POWER_NONE = 0,
+	POWER_CSTATE = 1,
+	POWER_PSTATE = 2,
+};
+#endif
+
+static inline void trace_power_start(u64 type, u64 state, u64 cpuid) {};
+static inline void trace_power_end(u64 cpuid) {};
+static inline void trace_power_frequency(u64 type, u64 state, u64 cpuid) {};
+#endif /* CONFIG_EVENT_POWER_TRACING_DEPRECATED */
+
+#endif /* EVENT_POWER_TRACING_DEPRECATED_PART_H */
+
+
+
 /* This part must be outside protection */
 #include <trace/define_trace.h>
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 538501c..8ccbedd 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -64,6 +64,21 @@ config EVENT_TRACING
 	select CONTEXT_SWITCH_TRACER
 	bool
 
+config EVENT_POWER_TRACING_DEPRECATED
+	depends on EVENT_TRACING
+	bool
+	default y
+	help
+	  Provides old power event types:
+	  C-state/idle accounting events:
+	  power:power_start
+	  power:power_end
+	  and old cpufreq accounting event:
+	  power:power_frequency
+	  This is for userspace compatibility
+	  and will vanish after 5 kernel iterations,
+	  namely 2.6.41.
+
 config CONTEXT_SWITCH_TRACER
 	bool
 
diff --git a/kernel/trace/power-traces.c b/kernel/trace/power-traces.c
index 0e0497d..f55fcf6 100644
--- a/kernel/trace/power-traces.c
+++ b/kernel/trace/power-traces.c
@@ -13,5 +13,8 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/power.h>
 
+#ifdef CONFIG_EVENT_POWER_TRACING_DEPRECATED
 EXPORT_TRACEPOINT_SYMBOL_GPL(power_start);
+#endif
+EXPORT_TRACEPOINT_SYMBOL_GPL(cpu_idle);
 
-- 
1.6.3

^ permalink raw reply related	[flat|nested] 157+ messages in thread

* [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-28  9:02 Cleanup and enhance power trace events Thomas Renninger
@ 2010-10-28  9:02 ` Thomas Renninger
  2010-10-28 11:17   ` Rafael J. Wysocki
  2010-10-28 11:17   ` Rafael J. Wysocki
  2010-10-28  9:02 ` Thomas Renninger
  1 sibling, 2 replies; 157+ messages in thread
From: Thomas Renninger @ 2010-10-28  9:02 UTC (permalink / raw)
  To: trenn, linux-omap, linux-pm, linux-trace-users, jean.pihet,
	arjan, mingo, rjw

Recent changes:
  - Enable EVENT_POWER_TRACING_DEPRECATED by default

New power trace events:
power:cpu_idle
power:cpu_frequency
power:machine_suspend


C-state/idle accounting events:
  power:power_start
  power:power_end
are replaced with:
  power:cpu_idle

and
  power:power_frequency
is replaced with:
  power:cpu_frequency

power:machine_suspend
is newly introduced, a first implementation
comes from the ARM side, but it's easy to add these events
in X86 as well if needed.

The type= field was dropped from both the idle and the frequency
event; it was never used, and the event name itself already
distinguishes the type.

The perf timechart userspace tool is adjusted in a separate patch.

Signed-off-by: Thomas Renninger <trenn@suse.de>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
CC: linux-omap@vger.kernel.org
CC: linux-pm@lists.linux-foundation.org
CC: linux-trace-users@vger.kernel.org
CC: Jean Pihet <jean.pihet@newoldbits.com>
CC: Arjan van de Ven <arjan@linux.intel.com>
CC: Ingo Molnar <mingo@elte.hu>
CC: rjw@sisk.pl
---
 arch/x86/kernel/process.c    |    7 +++-
 arch/x86/kernel/process_32.c |    2 +-
 arch/x86/kernel/process_64.c |    2 +
 drivers/cpufreq/cpufreq.c    |    1 +
 drivers/cpuidle/cpuidle.c    |    1 +
 drivers/idle/intel_idle.c    |    1 +
 include/trace/events/power.h |   87 +++++++++++++++++++++++++++++++++++++++++-
 kernel/trace/Kconfig         |   15 +++++++
 kernel/trace/power-traces.c  |    3 +
 9 files changed, 116 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 57d1868..155d975 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -374,6 +374,7 @@ void default_idle(void)
 {
 	if (hlt_use_halt()) {
 		trace_power_start(POWER_CSTATE, 1, smp_processor_id());
+		trace_cpu_idle(1, smp_processor_id());
 		current_thread_info()->status &= ~TS_POLLING;
 		/*
 		 * TS_POLLING-cleared state must be visible before we
@@ -444,6 +445,7 @@ EXPORT_SYMBOL_GPL(cpu_idle_wait);
 void mwait_idle_with_hints(unsigned long ax, unsigned long cx)
 {
 	trace_power_start(POWER_CSTATE, (ax>>4)+1, smp_processor_id());
+	trace_cpu_idle((ax>>4)+1, smp_processor_id());
 	if (!need_resched()) {
 		if (cpu_has(&current_cpu_data, X86_FEATURE_CLFLUSH_MONITOR))
 			clflush((void *)&current_thread_info()->flags);
@@ -460,6 +462,7 @@ static void mwait_idle(void)
 {
 	if (!need_resched()) {
 		trace_power_start(POWER_CSTATE, 1, smp_processor_id());
+		trace_cpu_idle(1, smp_processor_id());
 		if (cpu_has(&current_cpu_data, X86_FEATURE_CLFLUSH_MONITOR))
 			clflush((void *)&current_thread_info()->flags);
 
@@ -481,10 +484,12 @@ static void mwait_idle(void)
 static void poll_idle(void)
 {
 	trace_power_start(POWER_CSTATE, 0, smp_processor_id());
+	trace_cpu_idle(0, smp_processor_id());
 	local_irq_enable();
 	while (!need_resched())
 		cpu_relax();
-	trace_power_end(0);
+	trace_power_end(smp_processor_id());
+	trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
 }
 
 /*
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 96586c3..4b9befa 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -113,8 +113,8 @@ void cpu_idle(void)
 			stop_critical_timings();
 			pm_idle();
 			start_critical_timings();
-
 			trace_power_end(smp_processor_id());
+			trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
 		}
 		tick_nohz_restart_sched_tick();
 		preempt_enable_no_resched();
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 3d9ea53..28153a9 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -142,6 +142,8 @@ void cpu_idle(void)
 			start_critical_timings();
 
 			trace_power_end(smp_processor_id());
+			trace_cpu_idle(PWR_EVENT_EXIT,
+				       smp_processor_id());
 
 			/* In many cases the interrupt that ended idle
 			   has already called exit_idle. But some idle
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 199dcb9..ed4919e 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -355,6 +355,7 @@ void cpufreq_notify_transition(struct cpufreq_freqs *freqs, unsigned int state)
 		dprintk("FREQ: %lu - CPU: %lu", (unsigned long)freqs->new,
 			(unsigned long)freqs->cpu);
 		trace_power_frequency(POWER_PSTATE, freqs->new, freqs->cpu);
+		trace_cpu_frequency(freqs->new, freqs->cpu);
 		srcu_notifier_call_chain(&cpufreq_transition_notifier_list,
 				CPUFREQ_POSTCHANGE, freqs);
 		if (likely(policy) && likely(policy->cpu == freqs->cpu))
diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index a507108..08d5f05 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -107,6 +107,7 @@ static void cpuidle_idle_call(void)
 	if (cpuidle_curr_governor->reflect)
 		cpuidle_curr_governor->reflect(dev);
 	trace_power_end(smp_processor_id());
+	trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
 }
 
 /**
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index 21ac077..d3701bf 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -202,6 +202,7 @@ static int intel_idle(struct cpuidle_device *dev, struct cpuidle_state *state)
 
 	stop_critical_timings();
 	trace_power_start(POWER_CSTATE, (eax >> 4) + 1, cpu);
+	trace_cpu_idle((eax >> 4) + 1, cpu);
 	if (!need_resched()) {
 
 		__monitor((void *)&current_thread_info()->flags, 0, 0);
diff --git a/include/trace/events/power.h b/include/trace/events/power.h
index 35a2a6e..f10de41 100644
--- a/include/trace/events/power.h
+++ b/include/trace/events/power.h
@@ -7,6 +7,67 @@
 #include <linux/ktime.h>
 #include <linux/tracepoint.h>
 
+DECLARE_EVENT_CLASS(cpu,
+
+	TP_PROTO(unsigned int state, unsigned int cpu_id),
+
+	TP_ARGS(state, cpu_id),
+
+	TP_STRUCT__entry(
+		__field(	u32,		state		)
+		__field(	u32,		cpu_id		)
+	),
+
+	TP_fast_assign(
+		__entry->state = state;
+		__entry->cpu_id = cpu_id;
+	),
+
+	TP_printk("state=%lu cpu_id=%lu", (unsigned long)__entry->state,
+		  (unsigned long)__entry->cpu_id)
+);
+
+DEFINE_EVENT(cpu, cpu_idle,
+
+	TP_PROTO(unsigned int state, unsigned int cpu_id),
+
+	TP_ARGS(state, cpu_id)
+);
+
+/* This file can get included multiple times, TRACE_HEADER_MULTI_READ at top */
+#ifndef _PWR_EVENT_AVOID_DOUBLE_DEFINING
+#define _PWR_EVENT_AVOID_DOUBLE_DEFINING
+
+#define PWR_EVENT_EXIT -1
+
+#endif
+
+DEFINE_EVENT(cpu, cpu_frequency,
+
+	TP_PROTO(unsigned int frequency, unsigned int cpu_id),
+
+	TP_ARGS(frequency, cpu_id)
+);
+
+TRACE_EVENT(machine_suspend,
+
+	TP_PROTO(unsigned int state),
+
+	TP_ARGS(state),
+
+	TP_STRUCT__entry(
+		__field(	u32,		state		)
+	),
+
+	TP_fast_assign(
+		__entry->state = state;
+	),
+
+	TP_printk("state=%lu", (unsigned long)__entry->state)
+);
+
+#ifdef CONFIG_EVENT_POWER_TRACING_DEPRECATED
+
 #ifndef _TRACE_POWER_ENUM_
 #define _TRACE_POWER_ENUM_
 enum {
@@ -69,8 +130,32 @@ TRACE_EVENT(power_end,
 	TP_printk("cpu_id=%lu", (unsigned long)__entry->cpu_id)
 
 );
-
+#endif /* CONFIG_EVENT_POWER_TRACING_DEPRECATED */
 #endif /* _TRACE_POWER_H */
 
+/* Deprecated dummy functions must be protected against multi-declaration */
+#ifndef EVENT_POWER_TRACING_DEPRECATED_PART_H
+#define EVENT_POWER_TRACING_DEPRECATED_PART_H
+
+#ifndef CONFIG_EVENT_POWER_TRACING_DEPRECATED
+
+#ifndef _TRACE_POWER_ENUM_
+#define _TRACE_POWER_ENUM_
+enum {
+	POWER_NONE = 0,
+	POWER_CSTATE = 1,
+	POWER_PSTATE = 2,
+};
+#endif
+
+static inline void trace_power_start(u64 type, u64 state, u64 cpuid) {};
+static inline void trace_power_end(u64 cpuid) {};
+static inline void trace_power_frequency(u64 type, u64 state, u64 cpuid) {};
+#endif /* CONFIG_EVENT_POWER_TRACING_DEPRECATED */
+
+#endif /* EVENT_POWER_TRACING_DEPRECATED_PART_H */
+
+
+
 /* This part must be outside protection */
 #include <trace/define_trace.h>
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 538501c..8ccbedd 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -64,6 +64,21 @@ config EVENT_TRACING
 	select CONTEXT_SWITCH_TRACER
 	bool
 
+config EVENT_POWER_TRACING_DEPRECATED
+	depends on EVENT_TRACING
+	bool
+	default y
+	help
+	  Provides old power event types:
+	  C-state/idle accounting events:
+	  power:power_start
+	  power:power_end
+	  and old cpufreq accounting event:
+	  power:power_frequency
+	  This is for userspace compatibility
+	  and will vanish after 5 kernel iterations,
+	  namely 2.6.41.
+
 config CONTEXT_SWITCH_TRACER
 	bool
 
diff --git a/kernel/trace/power-traces.c b/kernel/trace/power-traces.c
index 0e0497d..f55fcf6 100644
--- a/kernel/trace/power-traces.c
+++ b/kernel/trace/power-traces.c
@@ -13,5 +13,8 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/power.h>
 
+#ifdef CONFIG_EVENT_POWER_TRACING_DEPRECATED
 EXPORT_TRACEPOINT_SYMBOL_GPL(power_start);
+#endif
+EXPORT_TRACEPOINT_SYMBOL_GPL(cpu_idle);
 
-- 
1.6.3
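
The new cpu event class stores state as a u32 and prints it with "%lu", while PWR_EVENT_EXIT is defined as -1. The following is a minimal userspace sketch of what that implies for the text a consumer sees on idle exit. It is not kernel code: struct cpu_event, format_cpu_event() and is_idle_exit() are hypothetical names made up for illustration, and only the format string and the PWR_EVENT_EXIT value are taken from the patch.

```c
#include <stdint.h>
#include <stdio.h>

/* Same value as in the patch: idle exit is signalled with state == -1. */
#define PWR_EVENT_EXIT -1

/* Hypothetical userspace stand-in for a record of the "cpu" event class. */
struct cpu_event {
	uint32_t state;   /* C-state on entry, or (u32)PWR_EVENT_EXIT on exit */
	uint32_t cpu_id;
};

/* Format a record with the same format string the patch passes to TP_printk(). */
static int format_cpu_event(char *buf, size_t len, const struct cpu_event *ev)
{
	return snprintf(buf, len, "state=%lu cpu_id=%lu",
			(unsigned long)ev->state, (unsigned long)ev->cpu_id);
}

/* Return 1 if the record marks an idle exit, 0 otherwise. */
static int is_idle_exit(const struct cpu_event *ev)
{
	/* -1 stored in a 32-bit unsigned field wraps to 0xffffffff. */
	return ev->state == (uint32_t)PWR_EVENT_EXIT;
}
```

Under these assumptions, a record with state 2 on CPU 1 formats as "state=2 cpu_id=1", and an exit record on the same CPU formats as "state=4294967295 cpu_id=1". A tool that parses the text output would therefore have to compare against the unsigned 32-bit value rather than against -1.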



end of thread, other threads:[~2010-11-19  0:14 UTC | newest]

Thread overview: 157+ messages
     [not found] <1287488171-25303-1-git-send-email-trenn@suse.de>
2010-10-19 11:36 ` [PATCH 1/3] PERF: Do not export power_frequency, but power_start event Thomas Renninger
2010-10-19 11:36 ` Thomas Renninger
2010-10-19 11:36 ` [PATCH 2/3] PERF(kernel): Cleanup power events Thomas Renninger
2010-10-25  6:54   ` Arjan van de Ven
2010-10-25  6:54   ` Arjan van de Ven
2010-10-25  9:41     ` Thomas Renninger
2010-10-25 13:55       ` Arjan van de Ven
2010-10-25 13:55       ` Arjan van de Ven
2010-10-25 14:36         ` Thomas Renninger
2010-10-25 14:45           ` Arjan van de Ven
2010-10-25 14:45           ` Arjan van de Ven
2010-10-25 14:56             ` Ingo Molnar
2010-10-25 14:56             ` Ingo Molnar
2010-10-25 15:48               ` Thomas Renninger
2010-10-25 16:00                 ` Arjan van de Ven
2010-10-25 23:32                   ` Thomas Renninger
2010-10-25 23:32                   ` Thomas Renninger
2010-10-25 16:00                 ` Arjan van de Ven
2010-10-25 15:48               ` Thomas Renninger
2010-10-25 14:36         ` Thomas Renninger
2010-10-25  9:41     ` Thomas Renninger
2010-10-25  6:58   ` Arjan van de Ven
2010-10-25  6:58   ` Arjan van de Ven
2010-10-25 10:04   ` Ingo Molnar
2010-10-25 10:04   ` Ingo Molnar
2010-10-25 11:03     ` Thomas Renninger
2010-10-25 11:03     ` Thomas Renninger
2010-10-25 11:55       ` Ingo Molnar
2010-10-25 12:55         ` Thomas Renninger
2010-10-25 14:11           ` Arjan van de Ven
2010-10-25 14:11           ` Arjan van de Ven
2010-10-25 14:51             ` Thomas Renninger
2010-10-25 14:51             ` Thomas Renninger
2010-10-25 12:55         ` Thomas Renninger
2010-10-25 12:58         ` Mathieu Desnoyers
2010-10-25 12:58         ` Mathieu Desnoyers
2010-10-25 20:29           ` Rafael J. Wysocki
2010-10-25 20:29           ` Rafael J. Wysocki
2010-10-25 11:55       ` Ingo Molnar
2010-10-25 13:58       ` Arjan van de Ven
2010-10-25 13:58       ` Arjan van de Ven
2010-10-25 20:33         ` Rafael J. Wysocki
2010-10-25 20:33         ` Rafael J. Wysocki
2010-10-25 23:33   ` [PATCH] PERF(kernel): Cleanup power events V2 Thomas Renninger
2010-10-26  1:09     ` Arjan van de Ven
2010-10-26  1:09     ` Arjan van de Ven
2010-10-26  7:10     ` Ingo Molnar
2010-10-26  7:10     ` Ingo Molnar
2010-10-26  8:08       ` Jean Pihet
2010-10-26 11:21         ` Ingo Molnar
2010-10-26 11:48           ` Thomas Renninger
2010-10-26 11:48           ` Thomas Renninger
2010-10-26 11:54             ` Ingo Molnar
2010-10-26 11:54             ` Ingo Molnar
2010-10-26 13:17               ` Thomas Renninger
2010-10-26 13:35                 ` Thomas Renninger
2010-10-26 13:35                 ` Thomas Renninger
2010-10-26 13:17               ` Thomas Renninger
2010-10-26 18:57             ` Rafael J. Wysocki
2010-10-27  0:00               ` Thomas Renninger
2010-10-27  9:16                 ` Rafael J. Wysocki
2010-10-27  9:16                 ` Rafael J. Wysocki
2010-10-27  0:00               ` Thomas Renninger
2010-10-26 18:57             ` Rafael J. Wysocki
2010-10-26 11:21         ` Ingo Molnar
2010-10-26  8:08       ` Jean Pihet
2010-10-26  9:58       ` Arjan van de Ven
2010-10-26 10:19         ` Ingo Molnar
2010-10-26 10:19         ` Ingo Molnar
2010-10-26  9:58       ` Arjan van de Ven
2010-10-26 10:37       ` Thomas Renninger
2010-10-26 10:37       ` Thomas Renninger
2010-10-26 11:19         ` Ingo Molnar
2010-10-26 11:19         ` Ingo Molnar
2010-10-26 19:01           ` Rafael J. Wysocki
2010-10-26 19:01           ` Rafael J. Wysocki
2010-10-26 15:32       ` Pierre Tardy
2010-10-26 16:04         ` Arjan van de Ven
2010-10-26 16:04         ` Arjan van de Ven
2010-10-26 16:56           ` Pierre Tardy
2010-10-26 17:58             ` Peter Zijlstra
2010-10-26 18:14               ` Mathieu Desnoyers
2010-10-26 18:14               ` Mathieu Desnoyers
2010-10-26 18:50                 ` [linux-pm] " Alan Stern
2010-10-26 21:33                   ` Mathieu Desnoyers
2010-10-26 21:33                   ` [linux-pm] " Mathieu Desnoyers
2010-10-26 22:20                     ` Rafael J. Wysocki
2010-10-26 22:20                     ` [linux-pm] " Rafael J. Wysocki
2010-10-26 22:39                       ` Rafael J. Wysocki
2010-10-26 22:39                       ` [linux-pm] " Rafael J. Wysocki
2010-10-27  0:46                       ` Mathieu Desnoyers
2010-10-27 10:22                         ` Rafael J. Wysocki
2010-10-27 12:21                           ` Mathieu Desnoyers
2010-10-27 14:32                             ` Alan Stern
2010-10-27 14:32                             ` Alan Stern
2010-10-27 14:32                             ` [linux-pm] " Alan Stern
2010-10-28 15:22                               ` Alan Stern
2010-10-28 15:22                               ` [linux-pm] " Alan Stern
2010-10-27 21:43                             ` Rafael J. Wysocki
2010-10-27 21:43                             ` Rafael J. Wysocki
2010-10-27 12:21                           ` Mathieu Desnoyers
2010-10-27 10:22                         ` Rafael J. Wysocki
2010-10-27  0:46                       ` Mathieu Desnoyers
2010-10-26 18:50                 ` Alan Stern
2010-10-26 19:04                 ` Rafael J. Wysocki
2010-10-26 19:04                 ` Rafael J. Wysocki
2010-10-26 21:38                   ` Mathieu Desnoyers
2010-10-26 21:38                   ` Mathieu Desnoyers
2010-10-26 22:22                     ` Rafael J. Wysocki
2010-10-26 22:22                     ` Rafael J. Wysocki
2010-10-26 18:15               ` Pierre Tardy
2010-10-26 19:08                 ` Rafael J. Wysocki
2010-10-26 20:23                   ` Pierre Tardy
2010-10-26 20:23                   ` Pierre Tardy
2010-10-26 20:38                     ` Rafael J. Wysocki
2010-10-26 20:38                     ` Rafael J. Wysocki
2010-10-26 20:52                       ` Arjan van de Ven
2010-10-26 20:52                       ` Arjan van de Ven
2010-10-26 21:17                         ` Rafael J. Wysocki
2010-10-26 21:17                         ` Rafael J. Wysocki
2010-10-26 19:08                 ` Rafael J. Wysocki
2010-10-26 18:15               ` Pierre Tardy
2010-10-26 17:58             ` Peter Zijlstra
2010-10-26 16:56           ` Pierre Tardy
2010-10-26 15:32       ` Pierre Tardy
2010-10-26  7:59     ` Jean Pihet
2010-10-26  7:59     ` Jean Pihet
2010-10-26 18:52     ` Rafael J. Wysocki
2010-10-26 18:52     ` Rafael J. Wysocki
2010-10-25 23:33   ` Thomas Renninger
2010-10-19 11:36 ` [PATCH 2/3] PERF(kernel): Cleanup power events Thomas Renninger
2010-10-19 11:36 ` [PATCH 3/3] PERF(userspace): Adjust perf timechart to the new " Thomas Renninger
2010-10-19 11:36 ` Thomas Renninger
2010-10-26  0:18   ` [PATCH] PERF(userspace): Adjust perf timechart to the new power events V2 Thomas Renninger
2010-10-26  0:18   ` Thomas Renninger
2010-10-28  9:02 Cleanup and enhance power trace events Thomas Renninger
2010-10-28  9:02 ` [PATCH 2/3] PERF(kernel): Cleanup power events Thomas Renninger
2010-10-28 11:17   ` Rafael J. Wysocki
2010-10-28 11:17   ` Rafael J. Wysocki
2010-10-28 11:31     ` Rafael J. Wysocki
2010-10-28 11:31     ` [linux-pm] " Rafael J. Wysocki
2010-10-28 11:37       ` Thomas Renninger
2010-10-28 11:37       ` Thomas Renninger
2010-10-28  9:02 ` Thomas Renninger
2010-11-11 18:03 [RESEND] Power trace event cleanup by still providing old interface for some time Thomas Renninger
2010-11-11 18:03 ` [PATCH 2/3] PERF(kernel): Cleanup power events Thomas Renninger
2010-11-12 14:20   ` Jean Pihet
2010-11-12 18:17     ` Thomas Renninger
2010-11-12 21:50       ` Jean Pihet
2010-11-14 13:34         ` Thomas Renninger
2010-11-18  8:01           ` Ingo Molnar
2010-11-18  9:27             ` Thomas Renninger
2010-11-18  9:36               ` Ingo Molnar
2010-11-18  9:44                 ` Jean Pihet
2010-11-18 10:52                 ` Ingo Molnar
2010-11-18 16:34                   ` Jean Pihet
2010-11-19  0:14                     ` Thomas Renninger
2010-11-14 13:22   ` Thomas Renninger
2010-11-15 15:49     ` Jean Pihet
2010-11-18 13:01 Power trace event cleanup by still providing old interface for some time Thomas Renninger
2010-11-18 13:01 ` [PATCH 2/3] PERF(kernel): Cleanup power events Thomas Renninger
