All of lore.kernel.org
 help / color / mirror / Atom feed
* Fix powerTOP regression with 2.6.39-rc5
@ 2011-05-06 20:08 Arjan van de Ven
  2011-05-06 20:20 ` Steven Rostedt
  0 siblings, 1 reply; 40+ messages in thread
From: Arjan van de Ven @ 2011-05-06 20:08 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Ingo Molnar, Steven Rostedt, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2269 bytes --]

 From c2db3ab8cff0d1cfb9582fc149df2794978a332e Mon Sep 17 00:00:00 2001
From: Arjan van de Ven <arjan@linux.intel.com>
Date: Thu, 5 May 2011 23:55:18 -0400
Subject: [PATCH] Regression: partial revert "tracing: Remove lock_depth 
from event entry"

This patch partially reverts commit 
e6e1e2593592a8f6f6380496655d8c6f67431266.
That commit changed the structure layout of the trace structure, which in
turn broke PowerTOP (1.9x generation) quite badly.

I appreciate not wanting to expose the variable in question, and 
PowerTOP was
not using it, so I've replaced the variable with just a padding field....
... that way if in the future a new field is needed it can just use this 
padding
variable.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
  include/linux/ftrace_event.h |    1 +
  kernel/trace/trace.c         |    1 +
  kernel/trace/trace_events.c  |    1 +
  3 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
index 22b32af1..b5a550a 100644
--- a/include/linux/ftrace_event.h
+++ b/include/linux/ftrace_event.h
@@ -37,6 +37,7 @@ struct trace_entry {
      unsigned char        flags;
      unsigned char        preempt_count;
      int            pid;
+    int            padding;
  };

  #define FTRACE_MAX_EVENT                        \
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index d38c16a..1cb49be 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1110,6 +1110,7 @@ tracing_generic_entry_update(struct trace_entry 
*entry, unsigned long flags,

      entry->preempt_count        = pc & 0xff;
      entry->pid            = (tsk) ? tsk->pid : 0;
+    entry->padding            = 0;
      entry->flags =
  #ifdef CONFIG_TRACE_IRQFLAGS_SUPPORT
          (irqs_disabled_flags(flags) ? TRACE_FLAG_IRQS_OFF : 0) |
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index e88f74f..2fe1103 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -116,6 +116,7 @@ static int trace_define_common_fields(void)
      __common_field(unsigned char, flags);
      __common_field(unsigned char, preempt_count);
      __common_field(int, pid);
+    __common_field(int, padding);

      return ret;
  }
-- 
1.7.2.2


[-- Attachment #2: 0001-Regression-partial-revert-tracing-Remove-lock_depth.patch --]
[-- Type: text/plain, Size: 2126 bytes --]

>From c2db3ab8cff0d1cfb9582fc149df2794978a332e Mon Sep 17 00:00:00 2001
From: Arjan van de Ven <arjan@linux.intel.com>
Date: Thu, 5 May 2011 23:55:18 -0400
Subject: [PATCH] Regression: partial revert "tracing: Remove lock_depth from event entry"

This patch partially reverts commit e6e1e2593592a8f6f6380496655d8c6f67431266.
That commit changed the structure layout of the trace structure, which in
turn broke PowerTOP (1.9x generation) quite badly.

I appreciate not wanting to expose the variable in question, and PowerTOP was
not using it, so I've replaced the variable with just a padding byte....
... that way if in the future a new field is needed it can just use this padding
byte.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 include/linux/ftrace_event.h |    1 +
 kernel/trace/trace.c         |    1 +
 kernel/trace/trace_events.c  |    1 +
 3 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
index 22b32af1..b5a550a 100644
--- a/include/linux/ftrace_event.h
+++ b/include/linux/ftrace_event.h
@@ -37,6 +37,7 @@ struct trace_entry {
 	unsigned char		flags;
 	unsigned char		preempt_count;
 	int			pid;
+	int			padding;
 };
 
 #define FTRACE_MAX_EVENT						\
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index d38c16a..1cb49be 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1110,6 +1110,7 @@ tracing_generic_entry_update(struct trace_entry *entry, unsigned long flags,
 
 	entry->preempt_count		= pc & 0xff;
 	entry->pid			= (tsk) ? tsk->pid : 0;
+	entry->padding			= 0;
 	entry->flags =
 #ifdef CONFIG_TRACE_IRQFLAGS_SUPPORT
 		(irqs_disabled_flags(flags) ? TRACE_FLAG_IRQS_OFF : 0) |
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index e88f74f..2fe1103 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -116,6 +116,7 @@ static int trace_define_common_fields(void)
 	__common_field(unsigned char, flags);
 	__common_field(unsigned char, preempt_count);
 	__common_field(int, pid);
+	__common_field(int, padding);
 
 	return ret;
 }
-- 
1.7.2.2


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: Fix powerTOP regression with 2.6.39-rc5
  2011-05-06 20:08 Fix powerTOP regression with 2.6.39-rc5 Arjan van de Ven
@ 2011-05-06 20:20 ` Steven Rostedt
  2011-05-06 20:51   ` Linus Torvalds
  0 siblings, 1 reply; 40+ messages in thread
From: Steven Rostedt @ 2011-05-06 20:20 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Linus Torvalds, Ingo Molnar, linux-kernel, Frederic Weisbecker,
	Peter Zijlstra, Thomas Gleixner

On Fri, 2011-05-06 at 13:08 -0700, Arjan van de Ven wrote:
> From c2db3ab8cff0d1cfb9582fc149df2794978a332e Mon Sep 17 00:00:00 2001
> From: Arjan van de Ven <arjan@linux.intel.com>
> Date: Thu, 5 May 2011 23:55:18 -0400
> Subject: [PATCH] Regression: partial revert "tracing: Remove lock_depth 
> from event entry"
> 
> This patch partially reverts commit 
> e6e1e2593592a8f6f6380496655d8c6f67431266.
> That commit changed the structure layout of the trace structure, which in
> turn broke PowerTOP (1.9x generation) quite badly.
> 
> I appreciate not wanting to expose the variable in question, and 
> PowerTOP was
> not using it, so I've replaced the variable with just a padding field....
> ... that way if in the future a new field is needed it can just use this 
> padding
> variable.
> 

I strongly NACK this!

We can not be locked down in the format of the trace events.

THAT'S WHY I EXPORTED THE FORMATS IN THE FIRST PLACE

User tools that do not parse the formats, are broken. Look at this
patch. It adds nothing but padding, wasted space in the ring buffer.
Why??? Because a tool read raw binary data without using the correct
means to extract it. I already have a user space library that does the
work for you. Perf currently uses it (although, an older version of it,
and it is hardcoded into perf).

Sure, perhaps event names should not be modfied, nor should some fields
without true reason. But come on, lockdepth?  This was added by Frederic
to help with the BKL removal. I think he only used it for a very short
time. It probably should have never been there in the first place. But I
do not want to waste 4 bytes of the ring buffer for every event because
of a broken tool.

Note, coming soon, we will probably be removing preempt count, flags,
and even the pid. What then? Is this broken tool going to prevent us
from moving forward?

I could understand this if we did not give you the means to parse the
data. But we did. It's been there as long as the trace events
themselves.

The format of the event format files are the ABI, not the raw data of
the events themselves.

-- Steve

> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
> ---
>   include/linux/ftrace_event.h |    1 +
>   kernel/trace/trace.c         |    1 +
>   kernel/trace/trace_events.c  |    1 +
>   3 files changed, 3 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
> index 22b32af1..b5a550a 100644
> --- a/include/linux/ftrace_event.h
> +++ b/include/linux/ftrace_event.h
> @@ -37,6 +37,7 @@ struct trace_entry {
>       unsigned char        flags;
>       unsigned char        preempt_count;
>       int            pid;
> +    int            padding;
>   };
> 
>   #define FTRACE_MAX_EVENT                        \
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index d38c16a..1cb49be 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -1110,6 +1110,7 @@ tracing_generic_entry_update(struct trace_entry 
> *entry, unsigned long flags,
> 
>       entry->preempt_count        = pc & 0xff;
>       entry->pid            = (tsk) ? tsk->pid : 0;
> +    entry->padding            = 0;
>       entry->flags =
>   #ifdef CONFIG_TRACE_IRQFLAGS_SUPPORT
>           (irqs_disabled_flags(flags) ? TRACE_FLAG_IRQS_OFF : 0) |
> diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
> index e88f74f..2fe1103 100644
> --- a/kernel/trace/trace_events.c
> +++ b/kernel/trace/trace_events.c
> @@ -116,6 +116,7 @@ static int trace_define_common_fields(void)
>       __common_field(unsigned char, flags);
>       __common_field(unsigned char, preempt_count);
>       __common_field(int, pid);
> +    __common_field(int, padding);
> 
>       return ret;
>   }



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Fix powerTOP regression with 2.6.39-rc5
  2011-05-06 20:20 ` Steven Rostedt
@ 2011-05-06 20:51   ` Linus Torvalds
  2011-05-06 21:10     ` Steven Rostedt
                       ` (3 more replies)
  0 siblings, 4 replies; 40+ messages in thread
From: Linus Torvalds @ 2011-05-06 20:51 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Arjan van de Ven, Ingo Molnar, linux-kernel, Frederic Weisbecker,
	Peter Zijlstra, Thomas Gleixner

On Fri, May 6, 2011 at 1:20 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> I strongly NACK this!

Doesn't matter.

Binary compatibility is more important.

And if binaries don't use the interface to parse the format (or just
parse it wrongly - see the fairly recent example of adding uuid's to
/proc/self/mountinfo), then it's a regression.

And regressions get reverted, unless there are security issues or
similar that makes us go "Oh Gods, we really have to break things".

I don't understand why this simple logic is so hard for some kernel
developers to understand. Reality matters. Your personal wishes matter
NOT AT ALL.

If you made an interface that can be used without parsing the
interface description, then we're stuck with the interface. Theory
simply doesn't matter.

You could help fix the tools, and try to avoid the compatibility
issues that way. There aren't that many of them.

                         Linus

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Fix powerTOP regression with 2.6.39-rc5
  2011-05-06 20:51   ` Linus Torvalds
@ 2011-05-06 21:10     ` Steven Rostedt
  2011-05-06 21:24       ` Linus Torvalds
  2011-05-06 21:14     ` Steven Rostedt
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 40+ messages in thread
From: Steven Rostedt @ 2011-05-06 21:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Arjan van de Ven, Ingo Molnar, linux-kernel, Frederic Weisbecker,
	Peter Zijlstra, Thomas Gleixner

On Fri, 2011-05-06 at 13:51 -0700, Linus Torvalds wrote:
> On Fri, May 6, 2011 at 1:20 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > I strongly NACK this!
> 
> Doesn't matter.
> 
> Binary compatibility is more important.
> 
> And if binaries don't use the interface to parse the format (or just
> parse it wrongly - see the fairly recent example of adding uuid's to
> /proc/self/mountinfo), then it's a regression.
> 
> And regressions get reverted, unless there are security issues or
> similar that makes us go "Oh Gods, we really have to break things".

Um, this is an internal tracepoint. Does this mean that all internal
data inside the kernel that is exported with trace events are locked
down?

> 
> I don't understand why this simple logic is so hard for some kernel
> developers to understand. Reality matters. Your personal wishes matter
> NOT AT ALL.

This isn't a personal wish. This brings every advancement that I was
planning on making this year to a dead halt. We were really about to
restructure the events to make them lighter weight and faster.

Also, this isn't the first time this structure has changed. It just
happens that something started using it. This field did not even exist
until recently.


> 
> If you made an interface that can be used without parsing the
> interface description, then we're stuck with the interface. Theory
> simply doesn't matter.

I never had an interface used this way. It was just by luck. Damn, I
should have listened to Peter Zijlstra when he recommended that every
boot restructures the data in the format differently. Then this would
never have happened. But doing that would have slowed things down
tremendously (or remove the ease of TRACE_EVENT).

> 
> You could help fix the tools, and try to avoid the compatibility
> issues that way. There aren't that many of them.

As I said, I have a library (.so even) that does the parsing for you. If
I get powertop to use it, can we hold off on this patch?

Note, I'm about to leave to Budapest. I could try to get this done on
the trip.

-- Steve



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Fix powerTOP regression with 2.6.39-rc5
  2011-05-06 20:51   ` Linus Torvalds
  2011-05-06 21:10     ` Steven Rostedt
@ 2011-05-06 21:14     ` Steven Rostedt
  2011-05-06 21:28       ` Linus Torvalds
  2011-05-06 21:29     ` Arjan van de Ven
  2011-05-07  6:58     ` Ingo Molnar
  3 siblings, 1 reply; 40+ messages in thread
From: Steven Rostedt @ 2011-05-06 21:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Arjan van de Ven, Ingo Molnar, linux-kernel, Frederic Weisbecker,
	Peter Zijlstra, Thomas Gleixner

On Fri, 2011-05-06 at 13:51 -0700, Linus Torvalds wrote:


> If you made an interface that can be used without parsing the
> interface description, then we're stuck with the interface. Theory
> simply doesn't matter.

Note, at last Kernel Summit, you pointed at my slide of the trace event
ASCII format, and stated, "That's the ABI". What happend now?

-- Steve



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Fix powerTOP regression with 2.6.39-rc5
  2011-05-06 21:10     ` Steven Rostedt
@ 2011-05-06 21:24       ` Linus Torvalds
  0 siblings, 0 replies; 40+ messages in thread
From: Linus Torvalds @ 2011-05-06 21:24 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Arjan van de Ven, Ingo Molnar, linux-kernel, Frederic Weisbecker,
	Peter Zijlstra, Thomas Gleixner

On 5/6/11, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Fri, 2011-05-06 at 13:51 -0700, Linus Torvalds wrote:
>> On Fri, May 6, 2011 at 1:20 PM, Steven Rostedt <rostedt@goodmis.org>
>
> Um, this is an internal tracepoint. Does this mean that all internal
> data inside the kernel that is exported with trace events are locked
> down?

 it's clearly NOT an internal tracepoint. By definition. It's being
used by powertop.

What part of "reality" vs "your wishes" didn't you understand?

               Linus

>
>>
>> I

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Fix powerTOP regression with 2.6.39-rc5
  2011-05-06 21:14     ` Steven Rostedt
@ 2011-05-06 21:28       ` Linus Torvalds
  0 siblings, 0 replies; 40+ messages in thread
From: Linus Torvalds @ 2011-05-06 21:28 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Arjan van de Ven, Ingo Molnar, linux-kernel, Frederic Weisbecker,
	Peter Zijlstra, Thomas Gleixner

On 5/6/11, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> Note, at last Kernel Summit, you pointed at my slide of the trace event
> ASCII format, and stated, "That's the ABI". What happend now?

That whole reality thing again?

Clearly it didn't end up as the ABI. We have programs that use that
ABI and thus it's a regression if they break.

          Linus

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Fix powerTOP regression with 2.6.39-rc5
  2011-05-06 20:51   ` Linus Torvalds
  2011-05-06 21:10     ` Steven Rostedt
  2011-05-06 21:14     ` Steven Rostedt
@ 2011-05-06 21:29     ` Arjan van de Ven
  2011-05-06 21:57       ` Steven Rostedt
  2011-05-07  6:58     ` Ingo Molnar
  3 siblings, 1 reply; 40+ messages in thread
From: Arjan van de Ven @ 2011-05-06 21:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, Ingo Molnar, linux-kernel, Frederic Weisbecker,
	Peter Zijlstra, Thomas Gleixner

On 5/6/2011 1:51 PM, Linus Torvalds wrote:
>
> You could help fix the tools, and try to avoid the compatibility
> issues that way. There aren't that many of them.

once we have a usable library for these (and by usable, I mean that, eg 
something that someone like Fedora or Ubuntu can package and install)
I will be more than happy to use it in PowerTOP. Today that does not exist.

If such library becomes reality soon, then I'll adjust PowerTOP soon as 
well, and we can remove the padding some time after that even if that is 
really desired.
But we're talking 4 bytes here....


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Fix powerTOP regression with 2.6.39-rc5
  2011-05-06 21:29     ` Arjan van de Ven
@ 2011-05-06 21:57       ` Steven Rostedt
  0 siblings, 0 replies; 40+ messages in thread
From: Steven Rostedt @ 2011-05-06 21:57 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Linus Torvalds, Ingo Molnar, linux-kernel, Frederic Weisbecker,
	Peter Zijlstra, Thomas Gleixner

On Fri, 2011-05-06 at 14:29 -0700, Arjan van de Ven wrote:
> On 5/6/2011 1:51 PM, Linus Torvalds wrote:
> >
> > You could help fix the tools, and try to avoid the compatibility
> > issues that way. There aren't that many of them.
> 
> once we have a usable library for these (and by usable, I mean that, eg 
> something that someone like Fedora or Ubuntu can package and install)
> I will be more than happy to use it in PowerTOP. Today that does not exist.
> 
> If such library becomes reality soon, then I'll adjust PowerTOP soon as 
> well, and we can remove the padding some time after that even if that is 
> really desired.
> But we're talking 4 bytes here....

I'll work on getting the library out to distros. The 4 bytes is not the
issue, it's the implication that the header is fixed in stone. As we
want to get rid of the pid, and flags as well. Making this in stone just
killed any more progress that ftrace/perf can do in becoming a more
robust tracer.

-- Steve




^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Fix powerTOP regression with 2.6.39-rc5
  2011-05-06 20:51   ` Linus Torvalds
                       ` (2 preceding siblings ...)
  2011-05-06 21:29     ` Arjan van de Ven
@ 2011-05-07  6:58     ` Ingo Molnar
  2011-05-07 10:45       ` Steven Rostedt
  3 siblings, 1 reply; 40+ messages in thread
From: Ingo Molnar @ 2011-05-07  6:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, Arjan van de Ven, linux-kernel,
	Frederic Weisbecker, Peter Zijlstra, Thomas Gleixner


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Fri, May 6, 2011 at 1:20 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > I strongly NACK this!
> 
> Doesn't matter.
> 
> Binary compatibility is more important.

Yes, absolutely, violently agreed.

Acked-by: Ingo Molnar <mingo@elte.hu>

Steve, we had this argument again and again internally, and you still do not 
seem to understand it: viable tooling is *way* more important than the 
short-term, marginal cleanliness interests of kernel developers. We wont be 
able to merge ftrace into perf until you understand this principle ...

Arjan, Steve, i think we need to create a 'perf test' testcase for ftrace 
events as well, to catch such ABI breakages faster, hm? It took a couple of 
months for this breakage to surface and that's clearly too slow.

> And if binaries don't use the interface to parse the format (or just parse it 
> wrongly - see the fairly recent example of adding uuid's to 
> /proc/self/mountinfo), then it's a regression.
> 
> And regressions get reverted, unless there are security issues or similar 
> that makes us go "Oh Gods, we really have to break things".
> 
> I don't understand why this simple logic is so hard for some kernel 
> developers to understand. Reality matters. Your personal wishes matter NOT AT 
> ALL.

You have just summed up the main philosophical difference between perf and 
ftrace: with perf we have a "sane tooling first" approach, while ftrace is 
still the old "kernel developers first" approach.

In the past 10 years i pushed tons of instrumentation code upstream and for a 
long time the kernel-integrated ftrace approach looked like the technical best 
solution to me, but after 2 years of sane instrumentation tooling via a proper 
user-space ABI and tools/perf/ i'm not looking back.

I am strongly convinced that we need to bite the bullet and unify the two 
approaches to enable even better tooling: expose the remaining bits of tracing 
functionality not available via perf yet via the perf ABI and move it under a 
single umbrella, slowly phase out the ABI-unstable /debug/tracing/ debugfs crap 
for new features and use the strict perf ABI approach. Steve?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Fix powerTOP regression with 2.6.39-rc5
  2011-05-07  6:58     ` Ingo Molnar
@ 2011-05-07 10:45       ` Steven Rostedt
  2011-05-07 14:44         ` Ingo Molnar
  0 siblings, 1 reply; 40+ messages in thread
From: Steven Rostedt @ 2011-05-07 10:45 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Arjan van de Ven, linux-kernel,
	Frederic Weisbecker, Peter Zijlstra, Thomas Gleixner

On Sat, 2011-05-07 at 08:58 +0200, Ingo Molnar wrote:
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:

> You have just summed up the main philosophical difference between perf and 
> ftrace: with perf we have a "sane tooling first" approach, while ftrace is 
> still the old "kernel developers first" approach.

I actually believe that the opposite is true.

> 
> In the past 10 years i pushed tons of instrumentation code upstream and for a 
> long time the kernel-integrated ftrace approach looked like the technical best 
> solution to me, but after 2 years of sane instrumentation tooling via a proper 
> user-space ABI and tools/perf/ i'm not looking back.
> 

I would like to point out that the problem with the ABI breakage came
through perf and not ftrace. From what I gathered from Linus's response,
is that, although I made a robust interface (the format of the events)
for tools to use, but it was possible for the tools to use another
interface to directly interact with the raw binary data. Since it was
easier to just map the raw binary data instead of using the exported
format, they did that instead, even though a library already existed to
parse the format and keep the events robust. And the "reality" is that
the raw binary format became the ABI.

With ftrace, there was no easy way to get at that raw format. It was
perf that exposed the raw binary formats that tools like powerTop used.
The "easy" way was just to use the raw binary format as perf made it
easy to access. Thus, instead of spending the time to use the proper
robust format, tools just mapped the raw binary format instead. Peter
Zijlstra, wisely saw this problem and asked me to randomize the fields
to prevent the raw mappings. But that would have broken the ease of use
of TRACE_EVENTS() for kernel developers, or would have drastically
slowed down the trace recording. We both reluctantly kept the fields the
same. Once again, I feel burned because I didn't listen to Peter ;)

Now the end of Linus's email, he gave a slight "but". It seems as though
not many tools are currently accessing the raw data, if all those tools
agree to convert to the proper format before too many others start, then
he may allow this change to take place. I already discussed this with
Arjan, and he agreed to use the libparsevent.so if I can get it packaged
with Fedora and Ubuntu. This is a robust solution, so that we do not get
stuck with things like recording for every single event, the pid,
preempt count, interrupt flags and other things in the kernel forever.


> I am strongly convinced that we need to bite the bullet and unify the two 
> approaches to enable even better tooling: expose the remaining bits of tracing 
> functionality not available via perf yet via the perf ABI and move it under a 
> single umbrella, slowly phase out the ABI-unstable /debug/tracing/ debugfs crap 
> for new features and use the strict perf ABI approach. Steve?

Actually, I now want to separate ftrace from perf even more. This
problem is not a ftrace problem but a perf one. The raw abi that tools
uses is from perf. Thus, that "padding" can be added to perf directly
instead of using the ftrace code, and powertop will still work, and
ftrace can change on the fly as all its tools use the libparsevent
libary.

Here's the choices then:

1) we get libparsevent.so out into the world and all tools can use it,
and the raw formats of the trace events will no longer be an issue as
long as the names of events and fields stay the same.

2) we separate perf from ftrace and keep the "stable" ABI for perf, and
let ftrace advance into a more efficient tracer.

-- Steve



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Fix powerTOP regression with 2.6.39-rc5
  2011-05-07 10:45       ` Steven Rostedt
@ 2011-05-07 14:44         ` Ingo Molnar
  2011-05-07 17:20           ` Steven Rostedt
  0 siblings, 1 reply; 40+ messages in thread
From: Ingo Molnar @ 2011-05-07 14:44 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linus Torvalds, Arjan van de Ven, linux-kernel,
	Frederic Weisbecker, Peter Zijlstra, Thomas Gleixner


* Steven Rostedt <rostedt@goodmis.org> wrote:

> 2) we separate perf from ftrace and keep the "stable" ABI for perf, and let 
> ftrace advance into a more efficient tracer.

The thing is, ftrace is still largely separated from perf, and this is why this 
regression came in: a random tracing 'cleanup' churn was done to 'tracing' 
which broke PowerTop.

Look at the commit itself:

  e6e1e2593592: tracing: Remove lock_depth from event entry

Clearly you didnt even *realize* that there's a whole tooling world behind this 
mechanism ...

The core perf ABI is set up in a way that makes it rather hard to break the ABI 
accidentally. I can bisect back kernel release after kernel release and old 
tooling will work on new kernel and new tooling will work on old kernels.

PowerTop uses the perf ABI because it's a rather convenient and unified method 
to get a rich selection of events via the same facility, same ring-buffer, 
using a system call ABI, etc.

The ftrace event bits on the other hand, still somewhat glued on to the perf 
ABI are still very fragile to such spurious changes like the one that caused 
the regression here.

I raised this issue in the past. ftrace and perf has to be unified sooner 
rather than later.

> Here's the choices then:
> 
> 1) we get libparsevent.so out into the world and all tools can use it, and 
> the raw formats of the trace events will no longer be an issue as long as the 
> names of events and fields stay the same.

Firstly, such an event parser already exists in 
tools/perf/util/parse-events.[ch], so if you want to librarize it please talk 
to Arnaldo to create tools/perf/lib/ and a libperf.so.

There's 10 separate contributors to that file already:

 earth4:~/tip> git-authors-email tools/perf/util/parse-events.c 
      2  Peter Zijlstra <peterz@infradead.org>
      2  Stephane Eranian <eranian@google.com>
      2  Ulrich Drepper <drepper@redhat.com>
      3  Jason Baron <jbaron@redhat.com>
      3  Li Zefan <lizf@cn.fujitsu.com>
      3  Peter Zijlstra <a.p.zijlstra@chello.nl>
      7  Frederic Weisbecker <fweisbec@gmail.com>
      7  Jaswinder Singh Rajput <jaswinder@kernel.org>
     13  Arnaldo Carvalho de Melo <acme@redhat.com>
     15  Ingo Molnar <mingo@elte.hu>

Most notably *you* are not amongst them. Not a single commit out of close to a 
hundred commits. Why not? This really demonstrates the level of disinterest you 
are showing towards perf based tooling, still you keep modifying the underlying 
code in the kernel.

Secondly, you are solving the wrong problem and you are not seeing the real 
problems. We can keep and we *will* keep ABIs, it's not hard. 4 bytes padding 
is not an issue and it never was for PowerTop nor for any other real person who 
relies on tracing.

As i see it the problem is the thought-less ftrace churn and the fragility of 
how TRACE_EVENT() can be changed.

Really, ftrace and in particular you are showing a huge disconnect and i'm 
increasingly unhappy about it. Look at this very thread: you fought tooth and 
nail to even *acknowledge* that there is a problem...

As things look like from my side it appears you want to keep ftrace a messy, 
forked project with no regard to perf based tooling and this will fragment 
Linux instrumentation, the many technical disadvantages be damned.

I simply do not see that you understand this whole problem space, i do not see 
that you are driving towards unifying ftrace and perf - i only see that you are 
hacking in random, sometimes harmful directions and that you are stubornly 
ignoring my negative feedback about this. If problems like this continue i will 
have to stop pulling new ftrace features from you.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Fix powerTOP regression with 2.6.39-rc5
  2011-05-07 14:44         ` Ingo Molnar
@ 2011-05-07 17:20           ` Steven Rostedt
  2011-05-07 17:59             ` Arjan van de Ven
                               ` (2 more replies)
  0 siblings, 3 replies; 40+ messages in thread
From: Steven Rostedt @ 2011-05-07 17:20 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David Sharp, Vaibhav Nagarnaik, Michael Rubin, Linus Torvalds,
	Arjan van de Ven, linux-kernel, Frederic Weisbecker,
	Peter Zijlstra, Thomas Gleixner

On Sat, 2011-05-07 at 16:44 +0200, Ingo Molnar wrote: 
> * Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > 2) we separate perf from ftrace and keep the "stable" ABI for perf, and let 
> > ftrace advance into a more efficient tracer.
> 
> The thing is, ftrace is still largely separated from perf, and this is why this 
> regression came in: a random tracing 'cleanup' churn was done to 'tracing' 
> which broke PowerTop.
> 
> Look at the commit itself:
> 
>   e6e1e2593592: tracing: Remove lock_depth from event entry
> 
> Clearly you didnt even *realize* that there's a whole tooling world behind this 
> mechanism ...

Note, I discussed this change with Frederic and he totally agreed with
me on removing it. In fact, we are in discussions about getting rid of
pid, preempt-count, and irq flags as well. But according to your logic,
that is a no go. I guess Frederic also does not *realize* there's a
whole tooling world behind this mechanism too.

> 
> The core perf ABI is set up in a way that makes it rather hard to break the ABI 
> accidentally. I can bisect back kernel release after kernel release and old 
> tooling will work on new kernel and new tooling will work on old kernels.

and I could do the same with ftrace/trace-cmd/kernelshark

> 
> PowerTop uses the perf ABI because it's a rather convenient and unified method 
> to get a rich selection of events via the same facility, same ring-buffer, 
> using a system call ABI, etc.

It seams that Arjan read the perf kernel code to see how the raw data
was laid out and read it directly.

> 
> The ftrace event bits on the other hand, still somewhat glued on to the perf 
> ABI are still very fragile to such spurious changes like the one that caused 
> the regression here.

Note also that perf glued to ftrace, ftrace did not glue to perf.
Perhaps perf should have not used the ftrace infrastructure. Or more
exactly, it should not have exported the ftrace infrastructure out to
userspace.

I wrote the TRACE_EVENT() macros not to be tied down with ftrace, but
that it could benefit other utilities, including LTTng, and every one
knows that Mathieu and I do not agree on that project. But
never-the-less, I did not make TRACE_EVENT() bound to ftrace, but made
it agnostic to all utilities.

Ironically, perf benefited from this agnostic approach and easily bound
to the TRACE_EVENT() macro. But instead of writing its own
include/trace/perf.h code, it just hacked into the already existing
ftrace user of TRACE_EVENT(). This caused perf to read unnecessary
things like pid, preempt_count, interrupt flags and even lockdepth.


> 
> I raised this issue in the past. ftrace and perf has to be unified sooner 
> rather than later.

We agree on a unification, just that we do not agree on the path it
takes. Every time I take one step forward, you slam me backwards two
steps. At kernel summit, we all agreed on a stable event layout, or just
a new way to move the events out of debugfs, but you nacked it. You
wanted me to work with Peter Zijlstra with sysfs, and that was just too
complex. Peter seemed to agree with me as he wasn't working on software
events for it, but hardware events as that's where hardware lives.


> 
> > Here's the choices then:
> > 
> > 1) we get libparsevent.so out into the world and all tools can use it, and 
> > the raw formats of the trace events will no longer be an issue as long as the 
> > names of events and fields stay the same.
> 
> Firstly, such an event parser already exists in 
> tools/perf/util/parse-events.[ch], so if you want to librarize it please talk 
> to Arnaldo to create tools/perf/lib/ and a libperf.so.

The pares-events.[ch] file in perf just parses the command line options
to denote what events need to be recorded. The real work lies in
trace-event-parse.c that does the real parsing, and that file was copied
from libparsevent.so. I purposely wrote that as a library that perf
could use (again, being open to other ideologies), but instead of using
the library, the code was hard coded into perf, which guaranteed the
forking of this code.

> 
> There's 10 separate contributors to that file already:
> 
>  earth4:~/tip> git-authors-email tools/perf/util/parse-events.c 
>       2  Peter Zijlstra <peterz@infradead.org>
>       2  Stephane Eranian <eranian@google.com>
>       2  Ulrich Drepper <drepper@redhat.com>
>       3  Jason Baron <jbaron@redhat.com>
>       3  Li Zefan <lizf@cn.fujitsu.com>
>       3  Peter Zijlstra <a.p.zijlstra@chello.nl>
>       7  Frederic Weisbecker <fweisbec@gmail.com>
>       7  Jaswinder Singh Rajput <jaswinder@kernel.org>
>      13  Arnaldo Carvalho de Melo <acme@redhat.com>
>      15  Ingo Molnar <mingo@elte.hu>
> 
> Most notably *you* are not amongst them. Not a single commit out of close to a 
> hundred commits. Why not? This really demonstrates the level of disinterest you 
> are showing towards perf based tooling, still you keep modifying the underlying 
> code in the kernel.

You are judging my interest with a single file that is used to parse
command line options to perf?

I just recently posted a patch set to the function tracer that allows it
to have users agnostic to each other. That took me more than a week to
write. The main reason I wrote that was to allow perf to use function
tracing.

> Secondly, you are solving the wrong problem and you are not seeing the real 
> problems. We can keep and we *will* keep ABIs, it's not hard. 4 bytes padding 
> is not an issue and it never was for PowerTop nor for any other real person who 
> relies on tracing.

I've Cc'd the Google folks that are very interested in tracing, to let
them respond to that comment.

Do you think that "other real person"s are only kernel developers or
desktop users that are not using production systems?

And it's not just 4 bytes, its the entire useless header. Who cares
about preempt counts? I examine that field only 1% of the time. In most
other cases its totally useless. Same with interrupt flags, although I
do look at them more often than preempt count. We (Frederic and I) still
want to get rid of the pid for every event.


> 
> As i see it the problem is the thought-less ftrace churn and the fragility of 
> how TRACE_EVENT() can be changed.

Now you are just insulting me. There has not been any "thought-less"
churn.

I *designed* TRACE_EVENT() to be changed. That's why I wrote all that
code to export the event formats. If you think all raw data of events
are to be an ABI, then lets rip out all the event formats and make
everything hard-coded.

I'm sorry, but that mentality seems to encourage thoughtless churn.

> 
> Really, ftrace and in particular you are showing a huge disconnect and i'm 
> increasingly unhappy about it. Look at this very thread: you fought tooth and 
> nail to even *acknowledge* that there is a problem...

I agree there is a problem, but what each of us think the problem is, is
different. I say there's a problem with tools depending on a layout of
raw data that is internal to the kernel, especially when it was designed
to allow robustness. If we make it easy for tools to extract the data
properly, then there should not be any issues if the raw format changes.

Linus said:

> If you made an interface that can be used without parsing the
> interface description, then we're stuck with the interface. Theory
> simply doesn't matter.
> 
> You could help fix the tools, and try to avoid the compatibility
> issues that way. There aren't that many of them.

To me this seems that we have a way to have the tools do the right
thing. If a library can be used that allows a more robust interface,
then why not use it? The library already exists, I talked to Arjan, and
he's willing to use it. I'm willing to put the effort in fixing powerTop
and pushing this library to distributions. What's the problem?

You are entering a very dangerous precedence by stating that the raw
format is now the ABI, end of story. This will bite us in the future. It
just did, and it will just get worse.
 
> 
> As things look like from my side it appears you want to keep ftrace a messy, 
> forked project with no regard to perf based tooling and this will fragment 
> Linux instrumentation, the many technical disadvantages be damned.

ftrace is not a fork and never was. To be a fork, we need a common
ancestor. Ftrace and perf do not have that. Perf was created (after
Ftrace) to profile events, and did so very well. It just happened to
expand into the tracing area, where you want me to abandon all my ftrace
work and rewrite it on top of something that was not designed to do
tracing.

Now that perf has entered the tracing field, I would be happy to bring
the two together. But we disagree on how to do that. I will not drop
ftrace totally just to work on perf. There's too many users of ftrace
that want enhancements, and I will still support that. The reason being
is that I honestly do not believe that perf can do what these users want
anytime in the near future (if at all). I will not abandon a successful
project just because you feel that it is a fork.


> 
> I simply do not see that you understand this whole problem space, i do not see 
> that you are driving towards unifying ftrace and perf - i only see that you are 
> hacking in random, sometimes harmful directions and that you are stubornly 
> ignoring my negative feedback about this.


Exactly, *you* do not see it. I've been bending over backwards trying to
find common ground to get to an end result that will satisfy everyone.
When I talk with Frederic, Peter, Arnaldo, or Thomas, we move forward.
For some reason, I hit a wall with you. I can't move forward with you
because, unless we do it entirely your way, we can't do it at all.

You do not even realize that I worked my butt off this past week to get
function tracing for perf.

> 
> If problems like this continue i will have to stop pulling new ftrace
> features from you.

This is the difference between us. You are the gate-keeper. I can't get
in the kernel unless you agree. This means that, because you think
ftrace is a fork of perf, and that unless I do everything with the sole
purpose of making perf do what ftrace can, none of my work will get into
the kernel. Thus, even though I want to support ftrace, you will not let
me. I'm glad that's out in the public.


> Thanks,

Your welcome.

Note, I'm about to hop on a plane to start my journey to Budapest.
Perhaps we can sit down and discuss this over a beer. I'm still
optimistic that we can work this out.

-- Steve



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Fix powerTOP regression with 2.6.39-rc5
  2011-05-07 17:20           ` Steven Rostedt
@ 2011-05-07 17:59             ` Arjan van de Ven
  2011-05-08 21:08               ` Frederic Weisbecker
  2011-05-07 19:00             ` Ingo Molnar
  2011-05-09 23:37             ` David Sharp
  2 siblings, 1 reply; 40+ messages in thread
From: Arjan van de Ven @ 2011-05-07 17:59 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, David Sharp, Vaibhav Nagarnaik, Michael Rubin,
	Linus Torvalds, linux-kernel, Frederic Weisbecker,
	Peter Zijlstra, Thomas Gleixner

On 5/7/2011 10:20 AM, Steven Rostedt wrote:
> On Sat, 2011-05-07 at 16:44 +0200, Ingo Molnar wrote:
>> * Steven Rostedt<rostedt@goodmis.org>  wrote:
>>
>>> 2) we separate perf from ftrace and keep the "stable" ABI for perf, and let
>>> ftrace advance into a more efficient tracer.
>> The thing is, ftrace is still largely separated from perf, and this is why this
>> regression came in: a random tracing 'cleanup' churn was done to 'tracing'
>> which broke PowerTop.
>>
>> Look at the commit itself:
>>
>>    e6e1e2593592: tracing: Remove lock_depth from event entry
>>
>> Clearly you didnt even *realize* that there's a whole tooling world behind this
>> mechanism ...
> Note, I discussed this change with Frederic and he totally agreed with
> me on removing it. In fact, we are in discussions about getting rid of
> pid, preempt-count, and irq flags as well. But according to your logic,
> that is a no go. I guess Frederic also does not *realize* there's a
> whole tooling world behind this mechanism too.

btw if you remove some of these, how is userland supposed to find out if 
an event happened in irq context?



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Fix powerTOP regression with 2.6.39-rc5
  2011-05-07 17:20           ` Steven Rostedt
  2011-05-07 17:59             ` Arjan van de Ven
@ 2011-05-07 19:00             ` Ingo Molnar
  2011-05-10  3:07               ` Steven Rostedt
  2011-05-09 23:37             ` David Sharp
  2 siblings, 1 reply; 40+ messages in thread
From: Ingo Molnar @ 2011-05-07 19:00 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: David Sharp, Vaibhav Nagarnaik, Michael Rubin, Linus Torvalds,
	Arjan van de Ven, linux-kernel, Frederic Weisbecker,
	Peter Zijlstra, Thomas Gleixner


* Steven Rostedt <rostedt@goodmis.org> wrote:

> On Sat, 2011-05-07 at 16:44 +0200, Ingo Molnar wrote: 
> > * Steven Rostedt <rostedt@goodmis.org> wrote:
> > 
> > > 2) we separate perf from ftrace and keep the "stable" ABI for perf, and let 
> > > ftrace advance into a more efficient tracer.
> > 
> > The thing is, ftrace is still largely separated from perf, and this is why this 
> > regression came in: a random tracing 'cleanup' churn was done to 'tracing' 
> > which broke PowerTop.
> > 
> > Look at the commit itself:
> > 
> >   e6e1e2593592: tracing: Remove lock_depth from event entry
> > 
> > Clearly you didnt even *realize* that there's a whole tooling world behind this 
> > mechanism ...
> 
> Note, I discussed this change with Frederic and he totally agreed with
> me on removing it. In fact, we are in discussions about getting rid of
> pid, preempt-count, and irq flags as well. But according to your logic,
> that is a no go. I guess Frederic also does not *realize* there's a
> whole tooling world behind this mechanism too.

Well, there's no sign of that in the changelog, Frederic did not write this 
change nor did he ack it or otherwise sign off on it. There is a whole world of 
difference between 'agreeing on IRC' and actually pushing such a commit 
upstream.

> > The core perf ABI is set up in a way that makes it rather hard to break the 
> > ABI accidentally. I can bisect back kernel release after kernel release and 
> > old tooling will work on new kernel and new tooling will work on old 
> > kernels.
> 
> and I could do the same with ftrace/trace-cmd/kernelshark

All created by you well after my repeated request for you to unify ftrace and 
perf.

You could have done all that based on a unified ABI. Instead you created forked 
tools.

> > PowerTop uses the perf ABI because it's a rather convenient and unified 
> > method to get a rich selection of events via the same facility, same 
> > ring-buffer, using a system call ABI, etc.
> 
> It seams that Arjan read the perf kernel code to see how the raw data was 
> laid out and read it directly.

Yes, of course - i also have code that uses the syscall directly, it's easier 
to code up ad-hoc than to parse some XML-lookalike descriptor from 
/debug/tracing/events/.

> [...]
> 
> > I raised this issue in the past. ftrace and perf has to be unified sooner 
> > rather than later.
> 
> We agree on a unification, just that we do not agree on the path it takes. 
> Every time I take one step forward, you slam me backwards two steps. At 
> kernel summit, we all agreed on a stable event layout, or just a new way to 
> move the events out of debugfs, but you nacked it. You wanted me to work with 
> Peter Zijlstra with sysfs, and that was just too complex. Peter seemed to 
> agree with me as he wasn't working on software events for it, but hardware 
> events as that's where hardware lives.

I think its rather obvious how the unification should be done: check 
tip:tmp.perf/trace for the 'trace' command that does tracing.

Check whether there's any feature missing from it that you'd like to see, add 
it. Rinse, repeat.

There is nothing inherently hard nor complex about it.

> > As i see it the problem is the thought-less ftrace churn and the fragility 
> > of how TRACE_EVENT() can be changed.
> 
> Now you are just insulting me. There has not been any "thought-less" churn.
> 
> I *designed* TRACE_EVENT() to be changed. That's why I wrote all that code to 
> export the event formats. If you think all raw data of events are to be an 
> ABI, then lets rip out all the event formats and make everything hard-coded.

You are quite mistaken there.

There are two main advantages of TRACE_EVENT():

 - it allows easy, C-alike tracepoint definitions that are unintrusive to
   kernel developers

 - it is easy to *EXTEND* it, so that tools can pick up new events

And yes, you must remember that we two kept iterating the early prototype of 
TRACE_EVENT() until those two requirements were met.

'Changing' a tracepoint is obviously easy if extending it transparently is 
easy, it's a side effect - an unfortunate one.

Changing tracepoints is definitely frowned upon. We do not change tracepoints 
if we can avoid it - we introduce new ones and we *maybe* phase out old, 
obsolete ones.

We had this exact discussion about the power events a couple of months ago!

> Linus said:
> 
> > If you made an interface that can be used without parsing the interface 
> > description, then we're stuck with the interface. Theory simply doesn't 
> > matter.
> > 
> > You could help fix the tools, and try to avoid the compatibility issues 
> > that way. There aren't that many of them.
> 
> To me this seems that we have a way to have the tools do the right thing. 
> [...]

Your whole premise is that we want to churn the tracepoints - and that premise 
is *utterly wrong*.

In practice we very rarely want to change tracepoints: in the past 2 years we 
had maybe 2-3 attempts.

*Adding* new tracepoints and *extending* functionality is the main action, 
dozens are added per year. We can also phase out tracepoints. Both of these 
things can be done without breaking the ABI.

Especially when there's tooling use it's not really desirable to fiddle with 
the tracepoint, even if in theory we could reorder fields and not see tools 
break. It's too easy to break the tool.

> [...] If a library can be used that allows a more robust interface, then why 
> not use it? [...]

But there is no library available in distros and realistically it wont be 
widely available within a year, even if you released and packaged one up 
overnight. Nor do you really seem to see the problem that changing tracepoints 
brings with itself.

The main property of TRACE_EVENT() is that new tracepoints can be picked up by 
scripting engines and other tools. You are concentrating on an aspect that is 
rarely done, unimportant and causes pain. Why?

> [...] The library already exists, I talked to Arjan, and he's willing to use 
> it. I'm willing to put the effort in fixing powerTop and pushing this library 
> to distributions. What's the problem?

I can see a couple of problems right away:

 - i do not actually want to see people change trace events all that often. It's
   a bad practice and even with a library it can break stuff.

 - old binaries and other tools that might already make use of these events.

 - you are complicating an otherwise really simple and dependable facility.

So you can write the library if it's more convenient to some people, that's not 
a problem (it is good) - just do not use it as an *excuse* to break the ABI. We 
cannot break the ABI today and we will likely not be able to do it for a long 
time.

> You are entering a very dangerous precedence by stating that the raw format 
> is now the ABI, end of story. This will bite us in the future. It just did, 
> and it will just get worse.

What is 'dangerous' about a stable ABI? I can only see many upsides and few 
downsides.

Again, your whole premise is wrong.

> > As things look like from my side it appears you want to keep ftrace a 
> > messy, forked project with no regard to perf based tooling and this will 
> > fragment Linux instrumentation, the many technical disadvantages be damned.
> 
> ftrace is not a fork and never was. To be a fork, we need a common ancestor. 
> Ftrace and perf do not have that. Perf was created (after Ftrace) to profile 
> events, and did so very well. It just happened to expand into the tracing 
> area, where you want me to abandon all my ftrace work and rewrite it on top 
> of something that was not designed to do tracing.

perf did not 'expand into tracing' - it was always a tracer from day 1 on, you 
cannot do profiling without having a trace to build a histogram out of :-)

I told you that as early as 1 months into the perf project. I also told you why 
we didnt use the ftrace ring buffer, 2 months into the perf project and asked 
you to please help out. We are now 30+ months later ...

perf is basically the ftrace UI and APIs done better, cleaner and more 
robustly. Look at all the tooling that sprang up around that ABI, almost 
overnight.

ftrace evolved through many iterations in the past and perf was simply the next 
logical step. You were happy when the original iotrace code was replaced by 
ftrace, right? Now you seem to be a lot more reluctant to let go of ftrace's 
current iteration in favor of a clearly superior tooling approach, and that is 
rather sad to see.

> Now that perf has entered the tracing field, I would be happy to bring the 
> two together. [...]

Great - please see tip:tmp.perf/trace, that would be a very good point to 
start. It's a working prototype for an ftrace-alike tracing workflow.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Fix powerTOP regression with 2.6.39-rc5
  2011-05-07 17:59             ` Arjan van de Ven
@ 2011-05-08 21:08               ` Frederic Weisbecker
  2011-05-08 21:56                 ` Arjan van de Ven
  0 siblings, 1 reply; 40+ messages in thread
From: Frederic Weisbecker @ 2011-05-08 21:08 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Steven Rostedt, Ingo Molnar, David Sharp, Vaibhav Nagarnaik,
	Michael Rubin, Linus Torvalds, linux-kernel, Peter Zijlstra,
	Thomas Gleixner

On Sat, May 07, 2011 at 10:59:36AM -0700, Arjan van de Ven wrote:
> On 5/7/2011 10:20 AM, Steven Rostedt wrote:
> >On Sat, 2011-05-07 at 16:44 +0200, Ingo Molnar wrote:
> >>* Steven Rostedt<rostedt@goodmis.org>  wrote:
> >>
> >>>2) we separate perf from ftrace and keep the "stable" ABI for perf, and let
> >>>ftrace advance into a more efficient tracer.
> >>The thing is, ftrace is still largely separated from perf, and this is why this
> >>regression came in: a random tracing 'cleanup' churn was done to 'tracing'
> >>which broke PowerTop.
> >>
> >>Look at the commit itself:
> >>
> >>   e6e1e2593592: tracing: Remove lock_depth from event entry
> >>
> >>Clearly you didnt even *realize* that there's a whole tooling world behind this
> >>mechanism ...
> >Note, I discussed this change with Frederic and he totally agreed with
> >me on removing it. In fact, we are in discussions about getting rid of
> >pid, preempt-count, and irq flags as well. But according to your logic,
> >that is a no go. I guess Frederic also does not *realize* there's a
> >whole tooling world behind this mechanism too.
> 
> btw if you remove some of these, how is userland supposed to find
> out if an event happened in irq context?

You can use the irq events for that.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Fix powerTOP regression with 2.6.39-rc5
  2011-05-08 21:08               ` Frederic Weisbecker
@ 2011-05-08 21:56                 ` Arjan van de Ven
  0 siblings, 0 replies; 40+ messages in thread
From: Arjan van de Ven @ 2011-05-08 21:56 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Steven Rostedt, Ingo Molnar, David Sharp, Vaibhav Nagarnaik,
	Michael Rubin, Linus Torvalds, linux-kernel, Peter Zijlstra,
	Thomas Gleixner

On 5
>> btw if you remove some of these, how is userland supposed to find
>> out if an event happened in irq context?
> You can use the irq events for that.

so I have to follow 2 "thousands of timers per second" events just to 
find out if my "once a week" event happened in interrupt context?
I assume you are making a joke with your answer....


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Fix powerTOP regression with 2.6.39-rc5
  2011-05-07 17:20           ` Steven Rostedt
  2011-05-07 17:59             ` Arjan van de Ven
  2011-05-07 19:00             ` Ingo Molnar
@ 2011-05-09 23:37             ` David Sharp
  2011-05-10  7:39               ` Ingo Molnar
  2 siblings, 1 reply; 40+ messages in thread
From: David Sharp @ 2011-05-09 23:37 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, Vaibhav Nagarnaik, Michael Rubin, Linus Torvalds,
	Arjan van de Ven, linux-kernel, Frederic Weisbecker,
	Peter Zijlstra, Thomas Gleixner

On Sat, May 7, 2011 at 10:20 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Sat, 2011-05-07 at 16:44 +0200, Ingo Molnar wrote:
>
> <snip>
>
>> * Steven Rostedt <rostedt@goodmis.org> wrote:
>> > Here's the choices then:
>> >
>> > 1) we get libparsevent.so out into the world and all tools can use it, and
>> > the raw formats of the trace events will no longer be an issue as long as the
>> > names of events and fields stay the same.
>>
>> Firstly, such an event parser already exists in
>> tools/perf/util/parse-events.[ch], so if you want to librarize it please talk
>> to Arnaldo to create tools/perf/lib/ and a libperf.so.
>
> The pares-events.[ch] file in perf just parses the command line options
> to denote what events need to be recorded. The real work lies in
> trace-event-parse.c that does the real parsing, and that file was copied
> from libparsevent.so. I purposely wrote that as a library that perf
> could use (again, being open to other ideologies), but instead of using
> the library, the code was hard coded into perf, which guaranteed the
> forking of this code.

A small note that this has also created some minor frustration for me,
as I had to send patches to extend this parsing to two different trees
and maintainers. (I don't think the perf patch ever was applied... I
should follow up on that.)

I believe it's been suggested that trace-cmd should be part of the
kernel tree, just like the perf tool. This would be nice, and would
more easily allow them to share this parsing code. It would also give
them a common place to work on their unification.


>
> <snip>
>
>> Secondly, you are solving the wrong problem and you are not seeing the real
>> problems. We can keep and we *will* keep ABIs, it's not hard. 4 bytes padding
>> is not an issue and it never was for PowerTop nor for any other real person who
>> relies on tracing.
>
> I've Cc'd the Google folks that are very interested in tracing, to let
> them respond to that comment.

Thanks for raising it to my attention.

The size of events is a *huge* issue for us. Please look at the
patches we have been sending out for tracing: A lot of them are about
reducing the size of events. Most of the patches we carry internally
are about reducing the size of events. Memory is the most scarce
resource on our systems, so we *cannot* afford to use large trace
buffers. This means that with a 8MB/cpu buffer (this is about what we
can hope to allocate on a heavily loaded system), we can only get on
the order of 10 seconds of trace data at best when tracing
systemcalls, irqs, and sched_switch. This is not enough when we don't
know what exactly we're looking for.

We have gone so far as to add an entire second set of "_tiny" events
for syscalls that records only 16 bits of arg0, and for sched_switch
that records only next_pid. These alone have roughly doubled the trace
period, and it is still not enough.

I really can't stress enough how big an issue the size of events is
for us. It is our number one issue with tracing.

>
> Do you think that "other real person"s are only kernel developers or
> desktop users that are not using production systems?
>
> And it's not just 4 bytes, its the entire useless header. Who cares
> about preempt counts? I examine that field only 1% of the time. In most
> other cases its totally useless. Same with interrupt flags, although I
> do look at them more often than preempt count. We (Frederic and I) still
> want to get rid of the pid for every event.

Internally we have dropped all but event type and pid (and changed pid
to 16 bits), and we have plans and patches in development to drop pid.

>
>>
>> As i see it the problem is the thought-less ftrace churn and the fragility of
>> how TRACE_EVENT() can be changed.
>
> Now you are just insulting me. There has not been any "thought-less"
> churn.
>
> I *designed* TRACE_EVENT() to be changed. That's why I wrote all that
> code to export the event formats. If you think all raw data of events
> are to be an ABI, then lets rip out all the event formats and make
> everything hard-coded.

We have tools that rely on TRACE_EVENT formats being exported. This
was a factor in our choice to use ftrace, and I consider parsing the
formats to be part of the ABI.

It may be true that some fields of some events should not change
incompatibly, but having the formats exported allows a wide definition
of "compatible": mostly that the name should not change. Even changing
the width of an integer field works perfectly well.

>
> I'm sorry, but that mentality seems to encourage thoughtless churn.
>
>>
>> Really, ftrace and in particular you are showing a huge disconnect and i'm
>> increasingly unhappy about it. Look at this very thread: you fought tooth and
>> nail to even *acknowledge* that there is a problem...
>
> I agree there is a problem, but what each of us think the problem is, is
> different. I say there's a problem with tools depending on a layout of
> raw data that is internal to the kernel, especially when it was designed
> to allow robustness. If we make it easy for tools to extract the data
> properly, then there should not be any issues if the raw format changes.

Agreed. It also allows forward compatibility when new events or new
fields are added, or when an event changes.

I see tracing as primarily a debugging tool: It is about inspecting
kernel internals. You cannot expect kernel internals to change, and
not expect something that inspects those internals not to change the
format of the data it exports. Kernel variables, structures and
parameters will change, disappear, or become meaningless or useless
(eg, lock_depth); and they are supposed to as much as the kernel is
supposed to change improve. Luckily (or actually, by design), we have
a way to cope with this: the event formats are exported for tools to
read.

I think ftrace has an abi, although I'm not sure how recorded it is.
In my view it includes:
- the debugfs control files (events dir, buffer_size_kb, options,
set_event, etc)
- the format of the ring buffer pages and ring buffer event headers
- the format and meaning of the event format files.
- For a few select trace events, yet to be enumerated, the presence,
name, and wide-sense field type (integer, array, dyn. array) of some
select fields (eg "next_pid" in "sched_switch").

In my view it explicitly does *not* include:
- the exact content of the event format files, except as noted above.

>
> Linus said:
>
>> If you made an interface that can be used without parsing the
>> interface description, then we're stuck with the interface. Theory
>> simply doesn't matter.
>>
>> You could help fix the tools, and try to avoid the compatibility
>> issues that way. There aren't that many of them.
>
> To me this seems that we have a way to have the tools do the right
> thing. If a library can be used that allows a more robust interface,
> then why not use it? The library already exists, I talked to Arjan, and
> he's willing to use it. I'm willing to put the effort in fixing powerTop
> and pushing this library to distributions. What's the problem?
>
> You are entering a very dangerous precedence by stating that the raw
> format is now the ABI, end of story. This will bite us in the future. It
> just did, and it will just get worse.
>
>>
>> As things look like from my side it appears you want to keep ftrace a messy,
>> forked project with no regard to perf based tooling and this will fragment
>> Linux instrumentation, the many technical disadvantages be damned.
>
> ftrace is not a fork and never was. To be a fork, we need a common
> ancestor. Ftrace and perf do not have that. Perf was created (after
> Ftrace) to profile events, and did so very well. It just happened to
> expand into the tracing area, where you want me to abandon all my ftrace
> work and rewrite it on top of something that was not designed to do
> tracing.
>
> Now that perf has entered the tracing field, I would be happy to bring
> the two together. But we disagree on how to do that. I will not drop
> ftrace totally just to work on perf. There's too many users of ftrace
> that want enhancements, and I will still support that. The reason being
> is that I honestly do not believe that perf can do what these users want
> anytime in the near future (if at all). I will not abandon a successful
> project just because you feel that it is a fork.

We have invested heavily in using ftrace. We chose to use ftrace
because it was maintained upstream, fast, simple to use, and had all
the features we were already relying on from ktrace, and then some.
When we last measured (admittedly quite a while ago now), perf had
about 5x slower write latency than ftrace. This was probably the
biggest thing that stopped us from considering it.

Perf might improve its tracing story (I'm sure it already has, but
we've been playing ostrich a bit in order to get work done), and I'm
also in agreement that bringing them closer together is a Good Thing.
But if ftrace simply disappears, that would create a lot of work for
us. We are every day depending on ftrace and infrastructure built on
top of ftrace more and more to make Google faster. We can cope with
incremental improvements. Wholesale ripping would force us to fork
this part of the kernel, as we have too much invested in ftrace.

>
> <snip>
>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Fix powerTOP regression with 2.6.39-rc5
  2011-05-07 19:00             ` Ingo Molnar
@ 2011-05-10  3:07               ` Steven Rostedt
  2011-05-10  4:44                 ` Dave Chinner
                                   ` (4 more replies)
  0 siblings, 5 replies; 40+ messages in thread
From: Steven Rostedt @ 2011-05-10  3:07 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David Sharp, Vaibhav Nagarnaik, Michael Rubin, Linus Torvalds,
	Arjan van de Ven, linux-kernel, Frederic Weisbecker,
	Peter Zijlstra, Thomas Gleixner, Christoph Hellwig,
	Arnd Bergmann

On Sat, 2011-05-07 at 21:00 +0200, Ingo Molnar wrote:

> > Note, I discussed this change with Frederic and he totally agreed with
> > me on removing it. In fact, we are in discussions about getting rid of
> > pid, preempt-count, and irq flags as well. But according to your logic,
> > that is a no go. I guess Frederic also does not *realize* there's a
> > whole tooling world behind this mechanism too.
> 
> Well, there's no sign of that in the changelog, Frederic did not write this 
> change nor did he ack it or otherwise sign off on it. There is a whole world of 
> difference between 'agreeing on IRC' and actually pushing such a commit 
> upstream.

You're right that this discussion was not in the change log, but
Frederic did recommend this change in email.

https://lkml.org/lkml/2010/12/9/195

"The first thing is that we need to get rid of the lock_depth field, the
bkl is dying."


> > > PowerTop uses the perf ABI because it's a rather convenient and unified 
> > > method to get a rich selection of events via the same facility, same 
> > > ring-buffer, using a system call ABI, etc.
> > 
> > It seams that Arjan read the perf kernel code to see how the raw data was 
> > laid out and read it directly.
> 
> Yes, of course - i also have code that uses the syscall directly, it's easier 
> to code up ad-hoc than to parse some XML-lookalike descriptor from 
> /debug/tracing/events/.

So you admit this is a ad-hoc way of doing things. Thus a library is a
perfect solution. And the event formats are far from XML-look-a-like.
You really think this:

name: sched_switch
ID: 57
format:
	field:unsigned short common_type;	offset:0;	size:2;
	field:unsigned char common_flags;	offset:2;	size:1;
	field:unsigned char common_preempt_count;	offset:3;	size:1;
	field:int common_pid;	offset:4;	size:4;
	field:int common_lock_depth;	offset:8;	size:4;

	field:char prev_comm[TASK_COMM_LEN];	offset:12;	size:16;
	field:pid_t prev_pid;	offset:28;	size:4;
	field:int prev_prio;	offset:32;	size:4;
	field:long prev_state;	offset:40;	size:8;
	field:char next_comm[TASK_COMM_LEN];	offset:48;	size:16;
	field:pid_t next_pid;	offset:64;	size:4;
	field:int next_prio;	offset:68;	size:4;

Is equivalent to this:

<?xml version="1.0" encoding="UTF-8"?>
<KernelShark><CaptureSettings><Events><CaptureType>Events</CaptureType><System>sched</System></Events><Plugin>function_graph</Plugin><File>/tmp/trace.dat</File></CaptureSettings></KernelShark>

??


> > > I raised this issue in the past. ftrace and perf has to be unified sooner 
> > > rather than later.
> > 
> > We agree on a unification, just that we do not agree on the path it takes. 
> > Every time I take one step forward, you slam me backwards two steps. At 
> > kernel summit, we all agreed on a stable event layout, or just a new way to 
> > move the events out of debugfs, but you nacked it. You wanted me to work with 
> > Peter Zijlstra with sysfs, and that was just too complex. Peter seemed to 
> > agree with me as he wasn't working on software events for it, but hardware 
> > events as that's where hardware lives.
> 
> I think its rather obvious how the unification should be done: check 
> tip:tmp.perf/trace for the 'trace' command that does tracing.

I'll tell you what. I've been talking with other developers and one
thing we came up with that we all seem to agree with is that ftrace is
designed to trace the entire system, and it does it very well. Perf is
designed to trace individual tasks, and it does it very well (trace is
an example of this. It's focus is on tasks not the system). Ftrace can
also trace individual tasks and perf can also trace the entire system,
but they both do those poorly.

If we can agree to keep ftrace as the "system view" and perf as the
"task view", I will gladly work on perf, and specifically this trace
utility. And where I can share infrastructures of the two, I will also
do that as well. Thus I can work on getting things like function tracing
for tasks in perf, and even PMU recording into ftrace. I always say,
"use the proper tool for the job". I believe this could work and I would
make a big effort in doing so.


> 
> Check whether there's any feature missing from it that you'd like to see, add 
> it. Rinse, repeat.

Again, the design of trace/perf is task oriented. Ftrace is system
oriented. Could we agree on that?

> 
> There is nothing inherently hard nor complex about it.
> 
> > > As i see it the problem is the thought-less ftrace churn and the fragility 
> > > of how TRACE_EVENT() can be changed.
> > 
> > Now you are just insulting me. There has not been any "thought-less" churn.
> > 
> > I *designed* TRACE_EVENT() to be changed. That's why I wrote all that code to 
> > export the event formats. If you think all raw data of events are to be an 
> > ABI, then lets rip out all the event formats and make everything hard-coded.
> 
> You are quite mistaken there.
> 
> There are two main advantages of TRACE_EVENT():
> 
>  - it allows easy, C-alike tracepoint definitions that are unintrusive to
>    kernel developers
> 
>  - it is easy to *EXTEND* it, so that tools can pick up new events
> 
> And yes, you must remember that we two kept iterating the early prototype of 
> TRACE_EVENT() until those two requirements were met.

But the work to create the event format files was for the sole purpose
of allowing a way to have events change to reflect kernel design
changes.

> 
> 'Changing' a tracepoint is obviously easy if extending it transparently is 
> easy, it's a side effect - an unfortunate one.
> 
> Changing tracepoints is definitely frowned upon. We do not change tracepoints 
> if we can avoid it - we introduce new ones and we *maybe* phase out old, 
> obsolete ones.

You mean that we should have two trace-points in code at the same
location? And maintaining the old one, especially when the old one no
longer reflects what the kernel is doing?

The reluctance to trace points in the first place was this fear that
they would be immutable, and be even more intrusive and a major burden
than maintaining old syscalls. Arnd told me that keeping things like
old_readdir around is a total burden on every filesystem. Can you
imagine what would happen if a user tool started depending on
information inside the kernel. This information would have to be
maintained indefinitely! Over time, tracepoints will start getting loads
of useless data.

> 
> We had this exact discussion about the power events a couple of months ago!

And it was a discussion of making some events and fields "stable", and
at KS, we all agreed to designate what fields and events make sense in a
separate file system (or anything). But you disagreed with that it it
never happened. Now we are discussing that the entire raw event format
is the ABI. Thus if a tool raw maps an event just to get one single
field from that event, if that is not the first field of the event, this
means the offset of that field must stay the same, and makes all fields
before it permanent even though those other fields become useless data
over time.

Fields that get used by tools are probably good candidates of those
things that should not change, because some tool found use with them
(like PowerTop has with the power and sched events). But even PowerTop
does not use all the fields (like lock_depth). If we keep this raw
binary requirement, all fields of an event, where that event is being
used by a tool, are now permanent ABI.

> 
> > Linus said:
> > 
> > > If you made an interface that can be used without parsing the interface 
> > > description, then we're stuck with the interface. Theory simply doesn't 
> > > matter.
> > > 
> > > You could help fix the tools, and try to avoid the compatibility issues 
> > > that way. There aren't that many of them.
> > 
> > To me this seems that we have a way to have the tools do the right thing. 
> > [...]
> 
> Your whole premise is that we want to churn the tracepoints - and that premise 
> is *utterly wrong*.
> 
> In practice we very rarely want to change tracepoints: in the past 2 years we 
> had maybe 2-3 attempts.
> 
> *Adding* new tracepoints and *extending* functionality is the main action, 
> dozens are added per year. We can also phase out tracepoints. Both of these 
> things can be done without breaking the ABI.

Phasing out tracepoints is not something likely if it becomes an ABI.
Look at the example of getting rid of old_readdir. It may be the case
that no one uses it anymore, but we can't be sure.


> 
> Especially when there's tooling use it's not really desirable to fiddle with 
> the tracepoint, even if in theory we could reorder fields and not see tools 
> break. It's too easy to break the tool.

"even if in theory we could reorder fields and not see tools 
break. It's too easy to break the tool."

I'm sorry but that sounds like a contradiction to me.

> 
> > [...] If a library can be used that allows a more robust interface, then why 
> > not use it? [...]
> 
> But there is no library available in distros and realistically it wont be 
> widely available within a year, even if you released and packaged one up 
> overnight.

Luckily I came to Budapest, which happens to be having the Linaro@UDS
conference going on, and the Ubuntu package maintainer is here. I
already talked to him and his willing to get this out ASAP (going
through the proper procedure). You can come and discuss this with us
too.

Also having discussions with various people here I was thinking of,
instead of a "libparsevent.so" make a full "libperf.so". I could work
with Arnaldo and others on this. Not only could this have an interface
to read the events, but it will basically get rid of the need for
PowerTop to have its own perf.cpp files. Have the entire interaction
with perf be a user library shipped with the distros. I'm willing to
make this happen. Imagine the tools that will appear using the perf ABI
if we have a library for it!


>  Nor do you really seem to see the problem that changing tracepoints 
> brings with itself.

I am not for changing tracepoints on a whim. But I would like
tracepoints to change as the kernel design changes. The reason
tracepoints have currently been stable is that kernel design changes do
not happen often. But they do happen, and I foresee that in the future,
the kernel will have a large number of "legacy tracepoints", and we will
be stuck maintaining them forever.

What happens if someone designs a tool that analyzes the XFS
filesystem's 200+ tracepoints? Will all those tracepoints now become
ABI?


> 
> The main property of TRACE_EVENT() is that new tracepoints can be picked up by 
> scripting engines and other tools. You are concentrating on an aspect that is 
> rarely done, unimportant and causes pain. Why?

Because it is rarely done, and if done correctly through a library, it
will not cause pain.

Why? Because I worry that this precedence that we set today will have
huge consequences for Linux in the future. Trace events are too easy to
create. Much easier to create than a new syscall. And we do not put the
time nor effort into these events to verify that they can last as a long
term ABI.

As I stated, with keeping the "raw format" as an ABI, then an event can
never change. Which will lead to dozens of these legacy tracepoints all
over the place. As tracepoints become more popular (as they are quickly
becoming), more tools will be developed that will interact with them.
Lets give these tools a library now that they can use to easily read
these tracepoints the proper way. We could also mark fields in the
TRACE_EVENT() that we would consider "stable" for tools to use. Then
this library could have an interface to read "stable" events/fields, and
"debug" events/fields. This will allow developers to mark events and
fields that they plan on maintaining for the future.

The approach I'm suggesting is to give tools a generic way to extract
these "useful" events, and not be bothered if that event has a "non
useful" field removed. And I would like a way to let applications know
which fields are stable and which are not. If an application developer
wants one of the non stable fields to become stable, then can request
it, the maintainer of that event could make the decision or help the
user find a way to not need this field as stable. We can document this
all in the man pages of this library.

To enforce the stable vs debug events even more, the library could
provide two interfaces. One is to get stable events and would print a
warning if they do not exist. The other is to get debug events, where
the idea behind the call is that these events/fields may not exist, and
it would not warn about them not existing, but instead quietly fail
(with a error return code).

> 
> > [...] The library already exists, I talked to Arjan, and he's willing to use 
> > it. I'm willing to put the effort in fixing powerTop and pushing this library 
> > to distributions. What's the problem?
> 
> I can see a couple of problems right away:
> 
>  - i do not actually want to see people change trace events all that often. It's
>    a bad practice and even with a library it can break stuff.

I do not see trace events changing very often either, but I do see them
changing. If we have a library it can still break, but if we implement
the "stable" vs "debug" perhaps we could avoid this issue too.

> 
>  - old binaries and other tools that might already make use of these events.

Currently it's perf and PowerTop. Perf uses and old version of the
libparsevent.so library, and is not affected by the lock_depth change.

PowerTop parsed the raw binary, which required the skills of a competent
kernel developer to be skilled enough to dig into the kernel source and
find how the event layout is done. There's not many userspace tools out
there that are developed by competent kernel developers. My fear is
someone may cut and paste the code of PowerTop and start with that.
Although mapping new events may be beyond non-competent kernel
developers abilities.

If we can make PowerTop use a library ASAP, then hopefully those
copy-cats will do it the right thing.


> 
>  - you are complicating an otherwise really simple and dependable facility.

It's simple today. But will be a burden in the future. And I actually
think that a library will make PowerTop (et al) even simpler.

> 
> So you can write the library if it's more convenient to some people, that's not 
> a problem (it is good) - just do not use it as an *excuse* to break the ABI. We 
> cannot break the ABI today and we will likely not be able to do it for a long 
> time.

It's not about breaking an ABI, it's about coming up with a robust ABI.
If we implement a library, and perhaps even implement "stable" events
and fields, then I believe in the long term, Linux will be much better
off.

> 
> > You are entering a very dangerous precedence by stating that the raw format 
> > is now the ABI, end of story. This will bite us in the future. It just did, 
> > and it will just get worse.
> 
> What is 'dangerous' about a stable ABI? I can only see many upsides and few 
> downsides.

If the stable ABI is not correctly done (and new trace events are not
added with the thought that they are a stable ABI), it will haunt us
forever.

> 
> Again, your whole premise is wrong.
> 
> > > As things look like from my side it appears you want to keep ftrace a 
> > > messy, forked project with no regard to perf based tooling and this will 
> > > fragment Linux instrumentation, the many technical disadvantages be damned.
> > 
> > ftrace is not a fork and never was. To be a fork, we need a common ancestor. 
> > Ftrace and perf do not have that. Perf was created (after Ftrace) to profile 
> > events, and did so very well. It just happened to expand into the tracing 
> > area, where you want me to abandon all my ftrace work and rewrite it on top 
> > of something that was not designed to do tracing.
> 
> perf did not 'expand into tracing' - it was always a tracer from day 1 on, you 
> cannot do profiling without having a trace to build a histogram out of :-)
> 
> I told you that as early as 1 months into the perf project. I also told you why 
> we didnt use the ftrace ring buffer, 2 months into the perf project and asked 
> you to please help out. We are now 30+ months later ...
> 
> perf is basically the ftrace UI and APIs done better, cleaner and more 
> robustly. Look at all the tooling that sprang up around that ABI, almost 
> overnight.

I could pretty much say the same for the ftrace tracers.  But as we
realize now, there was better ways of doing it (events). The reason for
perf's UI and APIs taking off was that it was so much easier than
oprofile and friends. But perf is still struggling with the tracing
front. Perhaps "trace" will fix that.

> 
> ftrace evolved through many iterations in the past and perf was simply the next 
> logical step.

Perf did not seem to follow any of the "lessons learned" of ftrace, but
instead it was a "lessons learned" of oprofile.

>  You were happy when the original iotrace code was replaced by 
> ftrace, right?

Honestly, my reaction was nothing more that "cool!", and then I forgot
about it. I was also happy when I found that oprofile used the ftrace
ring buffer.


>  Now you seem to be a lot more reluctant to let go of ftrace's 
> current iteration in favor of a clearly superior tooling approach, and that is 
> rather sad to see.

Unfortunately, I do not see it as a "clearly superior tooling approach".
In fact, I see it as quite the opposite, and we've had these discussions
before. Sure, I think a userspace tool to read ftrace is required,
whether it is trace-cmd or perf. But I find myself constantly using the
debugfs system as I still find it sometimes more convenient.

If the only way to read the ftrace data is through a tool (and this
includes ftrace_dump_on_oops console output), there would be times that
I would go back to using logdev. This means, I would stop using ftrace
for certain tasks. I find that bad if the maintainer doesn't use the
tool that he maintains.

Perhaps the only real upside of keeping the debugfs pretty printing is
convenience in kernel development/debugging.

But I see no upside for getting rid of it, besides some developers
saying "I don't like it".


> 
> > Now that perf has entered the tracing field, I would be happy to bring the 
> > two together. [...]
> 
> Great - please see tip:tmp.perf/trace, that would be a very good point to 
> start. It's a working prototype for an ftrace-alike tracing workflow.

I'll do it, if we can agree about the ftrace as system
tracing/debugging, and trace can focus on user specific tracing.

-- Steve



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Fix powerTOP regression with 2.6.39-rc5
  2011-05-10  3:07               ` Steven Rostedt
@ 2011-05-10  4:44                 ` Dave Chinner
  2011-05-10  5:39                   ` Steven Rostedt
  2011-05-10  7:54                 ` Ingo Molnar
                                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 40+ messages in thread
From: Dave Chinner @ 2011-05-10  4:44 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, David Sharp, Vaibhav Nagarnaik, Michael Rubin,
	Linus Torvalds, Arjan van de Ven, linux-kernel,
	Frederic Weisbecker, Peter Zijlstra, Thomas Gleixner,
	Christoph Hellwig, Arnd Bergmann

On Mon, May 09, 2011 at 11:07:27PM -0400, Steven Rostedt wrote:
> >  Nor do you really seem to see the problem that changing tracepoints 
> > brings with itself.
> 
> I am not for changing tracepoints on a whim. But I would like
> tracepoints to change as the kernel design changes. The reason
> tracepoints have currently been stable is that kernel design changes do
> not happen often. But they do happen, and I foresee that in the future,
> the kernel will have a large number of "legacy tracepoints", and we will
> be stuck maintaining them forever.
> 
> What happens if someone designs a tool that analyzes the XFS
> filesystem's 200+ tracepoints? Will all those tracepoints now become
> ABI?

That's crazy talk.

XFS tracepoints are _not ever_ guaranteed to be consistent from one
kernel to another - they are highly dependent on the implementation
of the code, and as such will change *without warning*. This has
been the case for the XFS event subsystem since back in the days of
Irix (yes, that's where most of the events were originally
implemented). The fact that they are now exported via TRACE_EVENT()
(so no kernel debugger is needed) does not change the fact the
information is really for developer use only and as such are
volatile....

So, if someone wants to write an application that parses the XFS
tracepoints directly, then they have to live with the fact that
tracepoints will come and go and change size and shape all the
time.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Fix powerTOP regression with 2.6.39-rc5
  2011-05-10  4:44                 ` Dave Chinner
@ 2011-05-10  5:39                   ` Steven Rostedt
  2011-05-10  7:36                     ` Dave Chinner
  0 siblings, 1 reply; 40+ messages in thread
From: Steven Rostedt @ 2011-05-10  5:39 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ingo Molnar, David Sharp, Vaibhav Nagarnaik, Michael Rubin,
	Linus Torvalds, Arjan van de Ven, linux-kernel,
	Frederic Weisbecker, Peter Zijlstra, Thomas Gleixner,
	Christoph Hellwig, Arnd Bergmann

On Tue, 2011-05-10 at 14:44 +1000, Dave Chinner wrote:
> On Mon, May 09, 2011 at 11:07:27PM -0400, Steven Rostedt wrote:
> > What happens if someone designs a tool that analyzes the XFS
> > filesystem's 200+ tracepoints? Will all those tracepoints now become
> > ABI?
> 
> That's crazy talk.

Right!

> 
> XFS tracepoints are _not ever_ guaranteed to be consistent from one
> kernel to another - they are highly dependent on the implementation
> of the code, and as such will change *without warning*. This has
> been the case for the XFS event subsystem since back in the days of
> Irix (yes, that's where most of the events were originally
> implemented). The fact that they are now exported via TRACE_EVENT()
> (so no kernel debugger is needed) does not change the fact the
> information is really for developer use only and as such are
> volatile....

But what makes these tracepoints any different from any other
tracepoint? Like power manament.

> 
> So, if someone wants to write an application that parses the XFS
> tracepoints directly, then they have to live with the fact that
> tracepoints will come and go and change size and shape all the
> time.

I totally agree. But that is our "wish" and may not reflect reality. The
whole point of this thread is if the kernel exports something to
userspace (in a released kernel), and userspace tools start to depend on
that data, the "reality" is that data just became an ABI, and Linus will
revert any changes that breaks that tool.

This is the precedence that I want to avoid. Yes, this may be "crazy
talk", but the possibility of it happening exists. In this case, I
rather be crazy than right.

-- Steve



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Fix powerTOP regression with 2.6.39-rc5
  2011-05-10  5:39                   ` Steven Rostedt
@ 2011-05-10  7:36                     ` Dave Chinner
  0 siblings, 0 replies; 40+ messages in thread
From: Dave Chinner @ 2011-05-10  7:36 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, David Sharp, Vaibhav Nagarnaik, Michael Rubin,
	Linus Torvalds, Arjan van de Ven, linux-kernel,
	Frederic Weisbecker, Peter Zijlstra, Thomas Gleixner,
	Christoph Hellwig, Arnd Bergmann

On Tue, May 10, 2011 at 01:39:50AM -0400, Steven Rostedt wrote:
> On Tue, 2011-05-10 at 14:44 +1000, Dave Chinner wrote:
> > On Mon, May 09, 2011 at 11:07:27PM -0400, Steven Rostedt wrote:
> > > What happens if someone designs a tool that analyzes the XFS
> > > filesystem's 200+ tracepoints? Will all those tracepoints now become
> > > ABI?
> > 
> > That's crazy talk.
> 
> Right!
> 
> > 
> > XFS tracepoints are _not ever_ guaranteed to be consistent from one
> > kernel to another - they are highly dependent on the implementation
> > of the code, and as such will change *without warning*. This has
> > been the case for the XFS event subsystem since back in the days of
> > Irix (yes, that's where most of the events were originally
> > implemented). The fact that they are now exported via TRACE_EVENT()
> > (so no kernel debugger is needed) does not change the fact the
> > information is really for developer use only and as such are
> > volatile....
> 
> But what makes these tracepoints any different from any other
> tracepoint? Like power manament.

Perhaps the intent of the tracepoints needs to be taken into
account? The XFS tracepoints are primarily for developers debugging
the filesystem on a specific kernel revision, whereas something like
power management has userspace diagnostic tools (like powertop) that
want to work across multiple kernel revisions. IOWs, we have two
different and conflicting requirements from two separate and
non-interacting subsystems, both of which currently use the same
(unstable) event interface to provide the functionality.

So, if you are going to define tracepoints to be a stable ABI, then
really we need a TRACE_EVENT_STABLE() extension to define all the
events that are intended to be stable.  That would make the code
self documenting w.r.t. intent, enable unstable->stable transitions
when applications require events to be stable (on a case by case
basis) and, best of all, not require the majority of tracepoint
users to change anything.

> > So, if someone wants to write an application that parses the XFS
> > tracepoints directly, then they have to live with the fact that
> > tracepoints will come and go and change size and shape all the
> > time.
> 
> I totally agree. But that is our "wish" and may not reflect reality.

It has nothing to do with "wish". The reality is that the XFS
tracepoints are unstable and we change them at will.

> The
> whole point of this thread is if the kernel exports something to
> userspace (in a released kernel), and userspace tools start to depend on
> that data, the "reality" is that data just became an ABI, and Linus will
> revert any changes that breaks that tool.

Fmeh. Retrofitting a "one size fits all" model that clearly does not
fit the existing tracepoint use cases is a sure way to annoy people.
Trace points are currently unstable, and if we want them to be
stable we need to define a new interface with new semantics and
rules about how, when and what type of modifications are allowed.

> This is the precedence that I want to avoid. Yes, this may be "crazy
> talk", but the possibility of it happening exists. In this case, I
> rather be crazy than right.

So define a new a stable tracepoint ABI and convert everything that
userspace wants to be stable over to that.  That way if someone
writes an application that uses XFS tracepoints and wants them to be
stable across multiple releases, that's a discussion and decision
that can be made wholly within the XFS community. That is, stability
of tracepoints is (and should always be) a subsystem maintainer
decision, not a royal decree sent down from above that all serfs
must obey.

Define the model for a stable ABI, provide the infrastructure to
make it easy to define stable tracepoints, but leave the definition
of which tracepoints are stable up to those working intimately with
the tracepoints in question....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Fix powerTOP regression with 2.6.39-rc5
  2011-05-09 23:37             ` David Sharp
@ 2011-05-10  7:39               ` Ingo Molnar
  0 siblings, 0 replies; 40+ messages in thread
From: Ingo Molnar @ 2011-05-10  7:39 UTC (permalink / raw)
  To: David Sharp
  Cc: Steven Rostedt, Vaibhav Nagarnaik, Michael Rubin, Linus Torvalds,
	Arjan van de Ven, linux-kernel, Frederic Weisbecker,
	Peter Zijlstra, Thomas Gleixner


* David Sharp <dhsharp@google.com> wrote:

> I believe it's been suggested that trace-cmd should be part of the kernel 
> tree, just like the perf tool. This would be nice, and would more easily 
> allow them to share this parsing code. It would also give them a common place 
> to work on their unification.

Thomas and me has done some work on providing that functionality, see the 
tip:perf/trace2 branch:

  http://people.redhat.com/mingo/tip.git/README

Do 'trace -a' to do system-wide tracing. It still needs quite some work, help 
is welcome :)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Fix powerTOP regression with 2.6.39-rc5
  2011-05-10  3:07               ` Steven Rostedt
  2011-05-10  4:44                 ` Dave Chinner
@ 2011-05-10  7:54                 ` Ingo Molnar
  2011-05-10  8:09                 ` Ingo Molnar
                                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 40+ messages in thread
From: Ingo Molnar @ 2011-05-10  7:54 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: David Sharp, Vaibhav Nagarnaik, Michael Rubin, Linus Torvalds,
	Arjan van de Ven, linux-kernel, Frederic Weisbecker,
	Peter Zijlstra, Thomas Gleixner, Christoph Hellwig,
	Arnd Bergmann


* Steven Rostedt <rostedt@goodmis.org> wrote:

> > I think its rather obvious how the unification should be done: check 
> > tip:tmp.perf/trace for the 'trace' command that does tracing.
> 
> I'll tell you what. I've been talking with other developers and one thing we 
> came up with that we all seem to agree with is that ftrace is designed to 
> trace the entire system, and it does it very well. Perf is designed to trace 
> individual tasks, and it does it very well (trace is an example of this. It's 
> focus is on tasks not the system). Ftrace can also trace individual tasks and 
> perf can also trace the entire system, but they both do those poorly.

Not sure where you picked that up but it's 100% nonsense and you could not be 
more wrong.

The reason why you see most instrumentation users use per task tracing and 
profiling is very simple: they *can* do it and local views are what most 
developer are interested in!

Otherwise perf has been designed to do system-wide (global) tracing pretty much 
from day one on. In fact one of the first applications of perf: kerneltop, the 
tool that evolved into 'perf top' has a system-wide view and never had any 
other default but system-wide tracing+profiling ...

'perf top' is what many kernel developers use and it's very popular because the 
kernel itself is 'system-wide' so obviously kernel developers want to have (and 
need to have) a system-wide view.

ftrace uses system-wide tracing because that's pretty much the only model it 
has. That is one of its many design mistakes, not a feature.

But the world is a lot more than just kernel focused workflows and perf 
supports various other popular views:

  - per task
  - per task hierarchy (tree spanning fork()/exec()/clone() trees of tasks)
  - per cgroup
  - system-wide

And you want to keep ftrace a forked identity on the weird notion that somehow 
perf can not do system-wide event collection and that somehow fundamentally 
instrumentation can not serve these goals of event grouping?

Steve, your opinion is, sadly, very narrow.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Fix powerTOP regression with 2.6.39-rc5
  2011-05-10  3:07               ` Steven Rostedt
  2011-05-10  4:44                 ` Dave Chinner
  2011-05-10  7:54                 ` Ingo Molnar
@ 2011-05-10  8:09                 ` Ingo Molnar
  2011-05-10  8:32                   ` Arjan van de Ven
  2011-05-10  8:41                 ` Ingo Molnar
  2011-05-10  8:47                 ` Ingo Molnar
  4 siblings, 1 reply; 40+ messages in thread
From: Ingo Molnar @ 2011-05-10  8:09 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: David Sharp, Vaibhav Nagarnaik, Michael Rubin, Linus Torvalds,
	Arjan van de Ven, linux-kernel, Frederic Weisbecker,
	Peter Zijlstra, Thomas Gleixner, Christoph Hellwig,
	Arnd Bergmann


* Steven Rostedt <rostedt@goodmis.org> wrote:

> > > > PowerTop uses the perf ABI because it's a rather convenient and unified 
> > > > method to get a rich selection of events via the same facility, same 
> > > > ring-buffer, using a system call ABI, etc.
> > > 
> > > It seams that Arjan read the perf kernel code to see how the raw data was 
> > > laid out and read it directly.
> > 
> > Yes, of course - i also have code that uses the syscall directly, it's 
> > easier to code up ad-hoc than to parse some XML-lookalike descriptor from 
> > /debug/tracing/events/.
> 
> So you admit this is a ad-hoc way of doing things. [...]

There's nothing wrong with being able to code and make use of instrumentation 
ad-hoc, using a simple interface.

The *worst* possible thing is to force user-space to use something complex.
So we kept the perf ABI simple.

> [...] Thus a library is a perfect solution. [...]

That's a non sequitor.

> [...] And the event formats are far from XML-look-a-like. You really think 
> this:
> 
> name: sched_switch
> ID: 57
> format:
> 	field:unsigned short common_type;	offset:0;	size:2;
> 	field:unsigned char common_flags;	offset:2;	size:1;
> 	field:unsigned char common_preempt_count;	offset:3;	size:1;
> 	field:int common_pid;	offset:4;	size:4;
> 	field:int common_lock_depth;	offset:8;	size:4;
> 
> 	field:char prev_comm[TASK_COMM_LEN];	offset:12;	size:16;
> 	field:pid_t prev_pid;	offset:28;	size:4;
> 	field:int prev_prio;	offset:32;	size:4;
> 	field:long prev_state;	offset:40;	size:8;
> 	field:char next_comm[TASK_COMM_LEN];	offset:48;	size:16;
> 	field:pid_t next_pid;	offset:64;	size:4;
> 	field:int next_prio;	offset:68;	size:4;
> 
> Is equivalent to this:
> 
> <?xml version="1.0" encoding="UTF-8"?>
> <KernelShark><CaptureSettings><Events><CaptureType>Events</CaptureType><System>sched</System></Events><Plugin>function_graph</Plugin><File>/tmp/trace.dat</File></CaptureSettings></KernelShark>

I did not say that it's equivalent, i said it's XML look-alike.

Steve, we even joked about that, that if we continue like this we'll end up 
with an XML parser ... I requested several changes to the description format so 
that it becomes more human readable.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Fix powerTOP regression with 2.6.39-rc5
  2011-05-10  8:09                 ` Ingo Molnar
@ 2011-05-10  8:32                   ` Arjan van de Ven
  2011-05-10  8:44                     ` Ingo Molnar
  0 siblings, 1 reply; 40+ messages in thread
From: Arjan van de Ven @ 2011-05-10  8:32 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Steven Rostedt, David Sharp, Vaibhav Nagarnaik, Michael Rubin,
	Linus Torvalds, linux-kernel, Frederic Weisbecker,
	Peter Zijlstra, Thomas Gleixner, Christoph Hellwig,
	Arnd Bergmann

On
>> name: sched_switch
>> ID: 57
>> format:
>> 	field:unsigned short common_type;	offset:0;	size:2;
>> 	field:unsigned char common_flags;	offset:2;	size:1;
>> 	field:unsigned char common_preempt_count;	offset:3;	size:1;
>> 	field:int common_pid;	offset:4;	size:4;
>> 	field:int common_lock_depth;	offset:8;	size:4;
>>
>> 	field:char prev_comm[TASK_COMM_LEN];	offset:12;	size:16;
>> 	field:pid_t prev_pid;	offset:28;	size:4;
>> 	field:int prev_prio;	offset:32;	size:4;
>> 	field:long prev_state;	offset:40;	size:8;
>> 	field:char next_comm[TASK_COMM_LEN];	offset:48;	size:16;
>> 	field:pid_t next_pid;	offset:64;	size:4;
>> 	field:int next_prio;	offset:68;	size:4;
>>
>> Is equivalent to this:
>>
>> <?xml version="1.0" encoding="UTF-8"?>
>> <KernelShark><CaptureSettings><Events><CaptureType>Events</CaptureType><System>sched</System></Events><Plugin>function_graph</Plugin><File>/tmp/trace.dat</File></CaptureSettings></KernelShark>
> I did not say that it's equivalent, i said it's XML look-alike.
>
> Steve, we even joked about that, that if we continue like this we'll end up
> with an XML parser ... I requested several changes to the description format so
> that it becomes more human readable.

frankly, for software, XML is easier to deal with than the human 
readable form.

if we are serious about wanting software to parse this stuff.. maybe 
exposing it in an easy to parse format as well is not a bad idea....


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Fix powerTOP regression with 2.6.39-rc5
  2011-05-10  3:07               ` Steven Rostedt
                                   ` (2 preceding siblings ...)
  2011-05-10  8:09                 ` Ingo Molnar
@ 2011-05-10  8:41                 ` Ingo Molnar
  2011-05-10 13:06                   ` Steven Rostedt
  2011-05-10  8:47                 ` Ingo Molnar
  4 siblings, 1 reply; 40+ messages in thread
From: Ingo Molnar @ 2011-05-10  8:41 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: David Sharp, Vaibhav Nagarnaik, Michael Rubin, Linus Torvalds,
	Arjan van de Ven, linux-kernel, Frederic Weisbecker,
	Peter Zijlstra, Thomas Gleixner, Christoph Hellwig,
	Arnd Bergmann


* Steven Rostedt <rostedt@goodmis.org> wrote:

> > Check whether there's any feature missing from it that you'd like to see, add 
> > it. Rinse, repeat.
> 
> Again, the design of trace/perf is task oriented. Ftrace is system
> oriented. Could we agree on that?

Like i said in the previous mail, i don't know where you got this nonsensical 
idea from. ftrace is indeed system oriented and that's hardcoded at the design 
- i.e. its a design mistake.

perf is fundamentally *event* oriented - and various levels of grouping and 
buffering can be applied to events.

'system wide', 'per cpu', 'per workload', 'per task' or 'per cgroup' are just 
one of the many natural groupings of events that users/developers would like to 
see - and we offer these.

 - that is why sysprof is using perf events to collect system-wide events.

 - that is why PowerTOP uses perf events in system-wide event collection mode.

 - that is why 'perf top' uses system wide profiling by default (but can do per 
   CPU or per task profiling as well)

 - that is why 'perf record' defaults to a per workload (not a per task as you 
   claim) mode of event collection

 - that is why 'perf stat' defalts to per workload events

Do you see that it is ftrace that remained behind the times, by stubbornly 
forcing some nonsensical global view and encoding it not only in its design but 
in its APIs as well?

I really meant it when i told you that perf events were the natural next step 
after ftrace, in the evolution of Linux tracing/instrumentation.

> > > Now that perf has entered the tracing field, I would be happy to bring 
> > > the two together. [...]
> > 
> > Great - please see tip:tmp.perf/trace, that would be a very good point to 
> > start. It's a working prototype for an ftrace-alike tracing workflow.
> 
> I'll do it, if we can agree about the ftrace as system tracing/debugging, and 
> trace can focus on user specific tracing.

Ok, you've finally admitted that you do not really want 'unification' between 
ftrace and perf - which was my suspicion all along. I really prefer 100% honest 
discussions with people from whom i pull and it took quite some time for you to 
admit to this position ...

Despite what you say perf and 'trace' can do system-wide tracing just fine:

  $ trace record -a
  ^C
  # trace recorded [205.108 MB] - try 'trace summary' to get an overview

( and note that the code in tip:tmp.perf/trace2 is a very early prototype, 
  barely tested - it just demonstrates the idea. )

In fact we could make 'trace' default to system-wide tracing by default and it 
would fall back to workload level tracing only if it does not have the 
privileges to trace the whole system.

Why not use the correctly designed tracing approach and enhance it, and merge 
all the remaining useful bits of ftrace into it?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Fix powerTOP regression with 2.6.39-rc5
  2011-05-10  8:32                   ` Arjan van de Ven
@ 2011-05-10  8:44                     ` Ingo Molnar
  2011-05-10  9:14                       ` Pekka Enberg
  0 siblings, 1 reply; 40+ messages in thread
From: Ingo Molnar @ 2011-05-10  8:44 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Steven Rostedt, David Sharp, Vaibhav Nagarnaik, Michael Rubin,
	Linus Torvalds, linux-kernel, Frederic Weisbecker,
	Peter Zijlstra, Thomas Gleixner, Christoph Hellwig,
	Arnd Bergmann


* Arjan van de Ven <arjan@linux.intel.com> wrote:

> On
> >>name: sched_switch
> >>ID: 57
> >>format:
> >>	field:unsigned short common_type;	offset:0;	size:2;
> >>	field:unsigned char common_flags;	offset:2;	size:1;
> >>	field:unsigned char common_preempt_count;	offset:3;	size:1;
> >>	field:int common_pid;	offset:4;	size:4;
> >>	field:int common_lock_depth;	offset:8;	size:4;
> >>
> >>	field:char prev_comm[TASK_COMM_LEN];	offset:12;	size:16;
> >>	field:pid_t prev_pid;	offset:28;	size:4;
> >>	field:int prev_prio;	offset:32;	size:4;
> >>	field:long prev_state;	offset:40;	size:8;
> >>	field:char next_comm[TASK_COMM_LEN];	offset:48;	size:16;
> >>	field:pid_t next_pid;	offset:64;	size:4;
> >>	field:int next_prio;	offset:68;	size:4;
> >>
> >>Is equivalent to this:
> >>
> >><?xml version="1.0" encoding="UTF-8"?>
> >><KernelShark><CaptureSettings><Events><CaptureType>Events</CaptureType><System>sched</System></Events><Plugin>function_graph</Plugin><File>/tmp/trace.dat</File></CaptureSettings></KernelShark>
> >I did not say that it's equivalent, i said it's XML look-alike.
> >
> >Steve, we even joked about that, that if we continue like this we'll end up
> >with an XML parser ... I requested several changes to the description format so
> >that it becomes more human readable.
> 
> frankly, for software, XML is easier to deal with than the human
> readable form.

Yes, absolutely - still i think keeping it human readable is important.

> if we are serious about wanting software to parse this stuff.. maybe exposing 
> it in an easy to parse format as well is not a bad idea....

Well, the code to parse it intelligently already exists so i dont think we are 
forced to go back to some harder to read (and easier to parse) format.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Fix powerTOP regression with 2.6.39-rc5
  2011-05-10  3:07               ` Steven Rostedt
                                   ` (3 preceding siblings ...)
  2011-05-10  8:41                 ` Ingo Molnar
@ 2011-05-10  8:47                 ` Ingo Molnar
  2011-05-10 10:33                   ` Steven Rostedt
  4 siblings, 1 reply; 40+ messages in thread
From: Ingo Molnar @ 2011-05-10  8:47 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: David Sharp, Vaibhav Nagarnaik, Michael Rubin, Linus Torvalds,
	Arjan van de Ven, linux-kernel, Frederic Weisbecker,
	Peter Zijlstra, Thomas Gleixner, Christoph Hellwig,
	Arnd Bergmann


* Steven Rostedt <rostedt@goodmis.org> wrote:

> [...] Thus a library is a perfect solution. [...]

Btw., just to make things clear, if we indeed have a library to parse things 
and if all apps use that then the ABI moves to another (library) level.

The requirement from my maintenance POV is very, very simple: apps should not 
break on new kernels. If this is achieved by making apps smarter then that's a 
valid solution.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Fix powerTOP regression with 2.6.39-rc5
  2011-05-10  8:44                     ` Ingo Molnar
@ 2011-05-10  9:14                       ` Pekka Enberg
  0 siblings, 0 replies; 40+ messages in thread
From: Pekka Enberg @ 2011-05-10  9:14 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arjan van de Ven, Steven Rostedt, David Sharp, Vaibhav Nagarnaik,
	Michael Rubin, Linus Torvalds, linux-kernel, Frederic Weisbecker,
	Peter Zijlstra, Thomas Gleixner, Christoph Hellwig,
	Arnd Bergmann

On Tue, May 10, 2011 at 11:44 AM, Ingo Molnar <mingo@elte.hu> wrote:
>> >Steve, we even joked about that, that if we continue like this we'll end up
>> >with an XML parser ... I requested several changes to the description format so
>> >that it becomes more human readable.
>>
>> frankly, for software, XML is easier to deal with than the human
>> readable form.
>
> Yes, absolutely - still i think keeping it human readable is important.

This is getting slightly off-topic but there are good human readable
formats that are also easily parseable by computers. JSON and YAML
come to mind.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Fix powerTOP regression with 2.6.39-rc5
  2011-05-10  8:47                 ` Ingo Molnar
@ 2011-05-10 10:33                   ` Steven Rostedt
  2011-05-10 19:13                     ` David Sharp
  0 siblings, 1 reply; 40+ messages in thread
From: Steven Rostedt @ 2011-05-10 10:33 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David Sharp, Vaibhav Nagarnaik, Michael Rubin, Linus Torvalds,
	Arjan van de Ven, linux-kernel, Frederic Weisbecker,
	Peter Zijlstra, Thomas Gleixner, Christoph Hellwig,
	Arnd Bergmann

On Tue, 2011-05-10 at 10:47 +0200, Ingo Molnar wrote:
> * Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > [...] Thus a library is a perfect solution. [...]
> 
> Btw., just to make things clear, if we indeed have a library to parse things 
> and if all apps use that then the ABI moves to another (library) level.
> 
> The requirement from my maintenance POV is very, very simple: apps should not 
> break on new kernels. If this is achieved by making apps smarter then that's a 
> valid solution.
> 

Great! Because this is what I want. I would also want a way to designate
events as stable. I'll add a TRACE_EVENT_STABLE() that can only have the
events that maintainers agree to maintain. And give the apps an ability
to only see these. Have the other events need either a separate library,
or perhaps just separate calls from within the same library, so the XFS
developers can feel safe that their tracepoints will not be depended on.
And perhaps have two tracepoints for sched_switch such that Peter
Zijlsta is happy that he's not bound by an tracepoint that keeps him
from getting rid of FIFO ;)

I'm happy to write a libperf.so and I can discuss with Arnaldo, Arjan
and yourself what is the best way of doing this.

Thanks,

-- Steve



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Fix powerTOP regression with 2.6.39-rc5
  2011-05-10  8:41                 ` Ingo Molnar
@ 2011-05-10 13:06                   ` Steven Rostedt
  2011-05-11 21:51                     ` Ingo Molnar
  0 siblings, 1 reply; 40+ messages in thread
From: Steven Rostedt @ 2011-05-10 13:06 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David Sharp, Vaibhav Nagarnaik, Michael Rubin, Linus Torvalds,
	Arjan van de Ven, linux-kernel, Frederic Weisbecker,
	Peter Zijlstra, Thomas Gleixner, Christoph Hellwig,
	Arnd Bergmann

On Tue, 2011-05-10 at 10:41 +0200, Ingo Molnar wrote: 
> * Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > > Check whether there's any feature missing from it that you'd like to see, add 
> > > it. Rinse, repeat.
> > 
> > Again, the design of trace/perf is task oriented. Ftrace is system
> > oriented. Could we agree on that?
> 
> Like i said in the previous mail, i don't know where you got this nonsensical 
> idea from. ftrace is indeed system oriented and that's hardcoded at the design 
> - i.e. its a design mistake.

Actually, it would not be too hard to implement some of the same ideas
of perf into ftrace for user focused tracing. The design is flexible
enough to do so. The only reason I never submitted patches to allow
ftrace to do so was because that would have been a direct competition
with perf, and unnecessary.

> 
> perf is fundamentally *event* oriented - and various levels of grouping and 
> buffering can be applied to events.

How do you trace all events for the entire system? There is no "enable
all events" in perf (that I know of). But I see that it can't even
handle all syscalls:

[root@bxf perf]# ~/bin/perf record -a -e 'syscalls:*'

  Error: sys_perf_event_open() syscall returned with 24 (Too many open files).  /bin/dmesg may provide additional information.

  Fatal: No CONFIG_PERF_EVENTS=y kernel support configured?

[root@bxf perf]# dmesg | tail
NET: Registered protocol family 10
ip6_tables: (C) 2000-2006 Netfilter Core Team
p4-clockmod: P4/Xeon(TM) CPU On-Demand Clock Modulation available
RPC: Registered udp transport module.
RPC: Registered tcp transport module.
RPC: Registered tcp NFSv4.1 backchannel transport module.
ADDRCONF(NETDEV_UP): eth0: link is not ready
e1000: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
eth0: no IPv6 routers present


And yes CONFIG_PERF_EVENTS is enabled and a record -a -e 'sched:*' works.


With ftrace, this has never been an issue:

[root@bxf perf]# trace-cmd record -e all
[root@bxf perf]# trace-cmd report
version = 6
cpus=4
       trace-cmd-3016  [003]  1007.136631: lock_release:         0xffff88003fe72c98 &(&zone->lru_lock)->rlock
       trace-cmd-3017  [001]  1007.136631: lock_acquire:         0xffff88003d6f00c8 &(&fs->lock)->rlock
       trace-cmd-3015  [002]  1007.136633: lock_acquire:         0xffffffff825df1d8 read &fsnotify_mark_srcu
       trace-cmd-3018  [000]  1007.136635: mm_page_alloc:        page=0xffffea00005a13c0 pfn=5903296 order=0 migratetype=1 gfp_flags=GFP_TEMPORARY|GFP_NOWARN|GFP_NORETRY|GFP_THISNODE
       trace-cmd-3017  [001]  1007.136643: lock_acquire:         0xffff880039f91348 &(&dentry->d_lock)->rlock
       trace-cmd-3015  [002]  1007.136644: lock_release:         0xffffffff825df1d8 &fsnotify_mark_srcu
       trace-cmd-3016  [003]  1007.136645: lock_acquire:         0xffff88003fe72c98 &(&zone->lru_lock)->rlock
       trace-cmd-3018  [000]  1007.136648: lock_acquire:         0xffff88003d5fd948 &(&parent->list_lock)->rlock



> 
> 'system wide', 'per cpu', 'per workload', 'per task' or 'per cgroup' are just 
> one of the many natural groupings of events that users/developers would like to 
> see - and we offer these.
> 
>  - that is why sysprof is using perf events to collect system-wide events.
> 
>  - that is why PowerTOP uses perf events in system-wide event collection mode.
> 
>  - that is why 'perf top' uses system wide profiling by default (but can do per 
>    CPU or per task profiling as well)
> 
>  - that is why 'perf record' defaults to a per workload (not a per task as you 
>    claim) mode of event collection
> 
>  - that is why 'perf stat' defalts to per workload events

I should have been more specific of not just system wide events, but
many more types of events. It will be interesting to see how perf
handles function tracing.

Also, the tools that you show are usually used by non critical paths.
I've done the benchmarks before (I'll post the LKML link if you like)
and perf has significant overhead. This is something I tried hard in
ftrace to avoid.

> 
> Do you see that it is ftrace that remained behind the times, by stubbornly 
> forcing some nonsensical global view and encoding it not only in its design but 
> in its APIs as well?

There's nothing in the ABI that keeps it global focused. It would be
easy to make ftrace user/event focused, but I just never did because I
did not want us to fight any more. Would you have accepted patches from
me that extended ftrace to do this?

I was never asked to have it user focused before perf came around, and
by then, the thing preventing ftrace from being user focused was more
social than technical.

> 
> I really meant it when i told you that perf events were the natural next step 
> after ftrace, in the evolution of Linux tracing/instrumentation.

I know you meant that, but I don't see nor feel it myself. Maybe I'm
mistaken but I don't have the belief that I can just jump on faith into
perf and abandon all the work of current ftrace. But I'm happy to help
unify the kernel infrastructure. That is the important part.

> 
> > > > Now that perf has entered the tracing field, I would be happy to bring 
> > > > the two together. [...]
> > > 
> > > Great - please see tip:tmp.perf/trace, that would be a very good point to 
> > > start. It's a working prototype for an ftrace-alike tracing workflow.
> > 
> > I'll do it, if we can agree about the ftrace as system tracing/debugging, and 
> > trace can focus on user specific tracing.
> 
> Ok, you've finally admitted that you do not really want 'unification' between 
> ftrace and perf - which was my suspicion all along. I really prefer 100% honest 
> discussions with people from whom i pull and it took quite some time for you to 
> admit to this position ...


Ingo, I think this is a communication problem more than an honesty
problem. This is why we really need to speak face to face. I'm not
always the best at expressing my thoughts through email and IRC. It's
too easy to get into flames and start attacking each other personally.
Having a discussion over a beer is probably something that would help
us.

I've always been 100% honest with you, but when I've tried to express
myself we end up flaming each other. I'll admit, I've avoided having
more conversations with you because I'm tired of the flames. I don't
know what it is between us, but for some reason we can push each other's
buttons just right and the conversation moves from being technical to
personal.

I'm not conspiring to under mind either you nor perf. The problem with
us is that we have two different ideas of where we want to go. From day
one, I've fought for the debugfs interface. I've said that I will let it
disappear if (and only if) perf is so convenient that it is totally
unneeded. But this point has always caused us to fight with each other.

trace-cmd started as a proof of concept for perf, but you and Peter
nak'd the idea of using the ftrace ring buffer. I still find (and
others, like Google also) that the ftrace ring buffer is superior in
tracing than perf's. Maybe it's not just the ring buffer itself, but the
other overhead of recording perf data. I don't know, the perf ring
buffer is extremely coupled with perf so it's hard to measure without
the rest of perf.

I'll be truly honest here. I continued with trace-cmd hoping that it
would eventually impress you and the two tools could merge. Obviously
that didn't occur, and you took it that I did the trace-cmd work as a
way to compete against perf. That was not my intent.

I've mentioned earlier, that I broke up trace-cmd (libparsevent.so) so
that perf could *use* the features of trace-cmd. Heck, Frederic ported
the code from it to perf. I was hoping for perf to use the library but
I'm not sure why it never did. libparsevent.so is totally agnostic to
ftrace as it only focuses on the event data parsing. I have a separate
libtracecmd.so that implemented the ftrace side. I was hoping that
libperf.so would do the perf side.

Now that trace-cmd is out, and used by many users, its interface is an
ABI, so we are stuck with it regardless. I don't think this really did
hurt perf. In fact, I think it can help perf.

> 
> Despite what you say perf and 'trace' can do system-wide tracing just fine:
> 
>   $ trace record -a
>   ^C
>   # trace recorded [205.108 MB] - try 'trace summary' to get an overview
> 
> ( and note that the code in tip:tmp.perf/trace2 is a very early prototype, 
>   barely tested - it just demonstrates the idea. )
> 
> In fact we could make 'trace' default to system-wide tracing by default and it 
> would fall back to workload level tracing only if it does not have the 
> privileges to trace the whole system.
> 
> Why not use the correctly designed tracing approach and enhance it, and merge 
> all the remaining useful bits of ftrace into it?

The problem we have is that we disagree on what a correctly designed
tracing approach is. Tracing is one of those things that everyone has a
different idea of what is important. As you stated, you do not care
about 4 bytes in an event. If you have 4 million events that is 4
million bytes. A typical event size could be 20 bytes, that 4 bytes is
1/5th of the event that is wasted space.

I believing in an evolutionary approach to merging as suppose to an
intellectual design. I've always said, lets start merging piece by
piece, and hopefully we end up with a great product. I don't care if
this end product is perf or ftrace, but if it is designed properly I'd
be happy with it.

But we need to take it step by step. You are correct that lately I've
been avoiding working directly on perf, but instead started working on
the ftrace side to make it easier to integrate the two. The reason is
that I'm scared to email you anymore, because I don't know what email is
going to trigger another flame war.

-- Steve
 


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Fix powerTOP regression with 2.6.39-rc5
  2011-05-10 10:33                   ` Steven Rostedt
@ 2011-05-10 19:13                     ` David Sharp
  0 siblings, 0 replies; 40+ messages in thread
From: David Sharp @ 2011-05-10 19:13 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, Vaibhav Nagarnaik, Michael Rubin, Linus Torvalds,
	Arjan van de Ven, linux-kernel, Frederic Weisbecker,
	Peter Zijlstra, Thomas Gleixner, Christoph Hellwig,
	Arnd Bergmann

On Tue, May 10, 2011 at 3:33 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Tue, 2011-05-10 at 10:47 +0200, Ingo Molnar wrote:
>> * Steven Rostedt <rostedt@goodmis.org> wrote:
>>
>> > [...] Thus a library is a perfect solution. [...]
>>
>> Btw., just to make things clear, if we indeed have a library to parse things
>> and if all apps use that then the ABI moves to another (library) level.
>>
>> The requirement from my maintenance POV is very, very simple: apps should not
>> break on new kernels. If this is achieved by making apps smarter then that's a
>> valid solution.
>>
>
> Great! Because this is what I want. I would also want a way to designate
> events as stable. I'll add a TRACE_EVENT_STABLE() that can only have the
> events that maintainers agree to maintain. And give the apps an ability
> to only see these.

A TRACE_EVENT_STABLE() would mark the entire event as stable. I was
wondering if we should instead mark fields within events as stable.
Even within a "stable" event, certain fields we might not want to
guarantee to be in the next release. This might also make it clearer
that the position of a field (stable or not) in an event can change,
and tools really should parse the event format.

> Have the other events need either a separate library,
> or perhaps just separate calls from within the same library, so the XFS
> developers can feel safe that their tracepoints will not be depended on.
> And perhaps have two tracepoints for sched_switch such that Peter
> Zijlsta is happy that he's not bound by an tracepoint that keeps him
> from getting rid of FIFO ;)
>
> I'm happy to write a libperf.so and I can discuss with Arnaldo, Arjan
> and yourself what is the best way of doing this.
>
> Thanks,
>
> -- Steve
>
>
>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Fix powerTOP regression with 2.6.39-rc5
  2011-05-10 13:06                   ` Steven Rostedt
@ 2011-05-11 21:51                     ` Ingo Molnar
  2011-05-11 22:36                       ` Steven Rostedt
  2011-05-17  7:15                       ` Michael Rubin
  0 siblings, 2 replies; 40+ messages in thread
From: Ingo Molnar @ 2011-05-11 21:51 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: David Sharp, Vaibhav Nagarnaik, Michael Rubin, Linus Torvalds,
	Arjan van de Ven, linux-kernel, Frederic Weisbecker,
	Peter Zijlstra, Thomas Gleixner, Christoph Hellwig,
	Arnd Bergmann


* Steven Rostedt <rostedt@goodmis.org> wrote:

> [root@bxf perf]# ~/bin/perf record -a -e 'syscalls:*'
> 
>   Error: sys_perf_event_open() syscall returned with 24 (Too many open files).  /bin/dmesg may provide additional information.
> 
>   Fatal: No CONFIG_PERF_EVENTS=y kernel support configured?

Yeah, this is a known bug, have you seen Peter's patch that addresses this?

People who run into this bug will go the way of least resistence: not fix it 
and use ftrace. This is sadly how 'splitting a small pond into two' tends to 
work out in practice: both halves stink a little bit more than they would if 
they were kept together ;-)

This is why lttng as a separate project within the kernel was and is a bad idea 
IMO.

I think this further strengthens the idea that we should join stuff and not 
keep it split!

> > I really meant it when i told you that perf events were the natural next 
> > step after ftrace, in the evolution of Linux tracing/instrumentation.
> 
> I know you meant that, but I don't see nor feel it myself. [...]

My position is very simple: right now we have two tracing tools while for many 
years we (including you!) always worked hard to have unified infrastructure.

For years ftrace was maintained and pushed upstream optimistically on the 
assumption that we are reasonable people who can agree on technical solutions 
objectively.

My technical point, at its core, is even simpler:

 - If the ftrace UI/API/ABI design is better then perf can be migrated to it 
   and we can use the ftrace APIs to do more tooling goodness.
   Everyone will be happy.

 - If the perf UI/API/ABI design is better then ftrace can be migrated to it
   and we can use the perf APIs to do more tooling goodness.
   Everyone will be happy.

 - If we do neither we will have continued tooling badness, tooling pain and
   kernel-churn-without-a-clear-purpose. I will be sad.

Call me an egoist but i do not like being sad, i'd like to see one of the 
options implemented where everyone is happy! :-)

So we could really have a dedicated tracing tool that can do what ftrace and 
perf trace can do and much more. I fully expect that it would have an ftrace 
work-alike workflow.

What we do not want is the current nightmare-ish design and schizm that we have 
two different tracers and two different APIs trying to do the same thing 
really. And that's been going on for two and a half years and counting and i do 
not see much progress there so i'm getting worried about it ...

> [...] Maybe I'm mistaken but I don't have the belief that I can just jump on 
> faith into perf and abandon all the work of current ftrace. But I'm happy to 
> help unify the kernel infrastructure. That is the important part.

Well, nobody suggests any extreme of immediately 'throwing away everything', 
especially as there's no clear replacement, why would we want to do that?

But at least having a very specific *idea* how to bring the two tracing tools 
together quickly, and doing the first steps towards that, after a painful 
period of 2.5 years, looks pretty essential to me.

I'd like to see the tracing pond grow, not fragment. Shrinking it by 10% to 90% 
in the first step would still be much better if it can then have the focus and 
clarity to grow to 300% or more - opposed to splitting it into two 50% parts 
and see both halves rot in their own unique ways! :-)

> > Why not use the correctly designed tracing approach and enhance it, and 
> > merge all the remaining useful bits of ftrace into it?
> 
> The problem we have is that we disagree on what a correctly designed tracing 
> approach is. Tracing is one of those things that everyone has a different 
> idea of what is important. As you stated, you do not care about 4 bytes in an 
> event. If you have 4 million events that is 4 million bytes. A typical event 
> size could be 20 bytes, that 4 bytes is 1/5th of the event that is wasted 
> space.

Well, look at the context:

 - In the context of useful tools like PowerTop, which is driving *tons* of 
   useful new code upstream, 4 bytes is very little cost. It strongly filters 
   events to not be too intrusive to the system to begin with.

 - In the context of perf record/report, which easily receives millions of
   events, 4 bytes is still not measurable overhead.

 - In the context of tracing workflows where you generate hundreds of millions
   of events in a short timespan and store the stream as-is as gigabytes of
   data, 4 bytes is probably measurable overhead.

So yes, there are definitely contexts/niches where 4 bytes are probably 
measurable, but if weighted against the regression of *PowerTop* the cost is 
negligible and it's not even a question which way we want to lean.

Also note that regardless of how tracing will look like in two years time, the 
no regressions policy will always have *way* higher priority than any 
micro-cost concerns.

Note that we are in fact are happy that applications use us, we are *happy* 
that they do indeed *break* if we didnt continue to do the goodness that we are 
doing today.

Consider the alternative: if we did things that no app and no developer is 
interested in. It would just not matter to anyone. We could break it freely, 
nobody would give a damn.

So i really prefer the 'apps are using us' situation we are in today, and not 
breaking them is a *small* price to pay and it is a very small loss of the near 
infinite degrees of development freedom we still enjoy in the kernel.

Also note that IMO there is no long-term technical problem really: i agree with 
you that we can eventually get rid of the 4 bytes bkl field as well, if all 
affected apps migrate to libperf.so in an orderly fashion.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Fix powerTOP regression with 2.6.39-rc5
  2011-05-11 21:51                     ` Ingo Molnar
@ 2011-05-11 22:36                       ` Steven Rostedt
  2011-05-17  7:15                       ` Michael Rubin
  1 sibling, 0 replies; 40+ messages in thread
From: Steven Rostedt @ 2011-05-11 22:36 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David Sharp, Vaibhav Nagarnaik, Michael Rubin, Linus Torvalds,
	Arjan van de Ven, linux-kernel, Frederic Weisbecker,
	Peter Zijlstra, Thomas Gleixner, Christoph Hellwig,
	Arnd Bergmann

On Wed, 2011-05-11 at 23:51 +0200, Ingo Molnar wrote:
> * Steven Rostedt <rostedt@goodmis.org> wrote:

> Also note that IMO there is no long-term technical problem really: i agree with 
> you that we can eventually get rid of the 4 bytes bkl field as well, if all 
> affected apps migrate to libperf.so in an orderly fashion.

OK Ingo,

I'll make this my top priority. Then who knows? Maybe I'll start looking
at what I can make of your trace program. ;)

Thanks,

-- Steve



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Fix powerTOP regression with 2.6.39-rc5
  2011-05-11 21:51                     ` Ingo Molnar
  2011-05-11 22:36                       ` Steven Rostedt
@ 2011-05-17  7:15                       ` Michael Rubin
  2011-05-17 11:19                         ` Steven Rostedt
  1 sibling, 1 reply; 40+ messages in thread
From: Michael Rubin @ 2011-05-17  7:15 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Steven Rostedt, David Sharp, Vaibhav Nagarnaik, Linus Torvalds,
	Arjan van de Ven, linux-kernel, Frederic Weisbecker,
	Peter Zijlstra, Thomas Gleixner, Christoph Hellwig,
	Arnd Bergmann

This thread is unsettling for at least one customer of kernel tracing.

Google spent a lot of time writing their own kernel tracing
infrastructure ktrace. It was working just fine for us but we
abandoned it in order to work with the community. We evaluated perf,
ftrace and LTTNG and opted for ftrace. We saw it as a efficient kernel
system that had been around long enough we could depend on it to
continue to be around. Also we could share our work this way. We
started sending patches and tried to be good Open Source citizens.

Switching from ktrace to ftrace was very painful for us. In order to
monitor the tens of thousands of computers Google maintains we wrote a
lot of tools on top of ftrace that are very specific to Google's user
space technology. What was not fun was to ask engineers to make
changes to their working systems to accommodate the switch from ktrace
to ftrace. We are not going to do this again in the near future.

On Wed, 2011-05-11 at 23:51 +0200, Ingo Molnar wrote:
> This is sadly how 'splitting a small pond into two' tends to
> work out in practice: both halves stink a little bit more than they would if
> they were kept together ;-)

I heavily agree with this statement. Having duplicate solutions
doesn't help anything.

On Wed, 2011-05-11 at 23:51 +0200, Ingo Molnar wrote:
> - If the perf UI/API/ABI design is better then ftrace can be migrated to it
>  and we can use the perf APIs to do more tooling goodness.
>  Everyone will be happy.

But I don't agree here. Everyone will _not_ be happy. Existing
customers will have to migrate to a new system, API or even worse new
semantics.

On Wed, 2011-05-11 at 23:51 +0200, Ingo Molnar wrote:
> So i really prefer the 'apps are using us' situation we are in today, and not
> breaking them is a *small* price to pay and it is a very small loss of the near
> infinite degrees of development freedom we still enjoy in the kernel.

I really prefer the 'apps are using us' situation too. Both as someone
who is working with ftrace development and also working with kernel
tracing customers.

What is the plan for customers going forward? Is it going to involve
removing ftrace in favor for perf? Removing perf in favor for ftrace?
We love perf and don't want to see it go away either.  We tend to use
the two systems differently. Do customers basically have to wait a few
years to see not only which system wins but which ones stays on top?

I apologize if this is obvious to others but I am confused.

mrubin

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Fix powerTOP regression with 2.6.39-rc5
  2011-05-17  7:15                       ` Michael Rubin
@ 2011-05-17 11:19                         ` Steven Rostedt
  2011-05-17 13:24                           ` David Ahern
  0 siblings, 1 reply; 40+ messages in thread
From: Steven Rostedt @ 2011-05-17 11:19 UTC (permalink / raw)
  To: Michael Rubin
  Cc: Ingo Molnar, David Sharp, Vaibhav Nagarnaik, Linus Torvalds,
	Arjan van de Ven, linux-kernel, Frederic Weisbecker,
	Peter Zijlstra, Thomas Gleixner, Christoph Hellwig,
	Arnd Bergmann

On Tue, 2011-05-17 at 00:15 -0700, Michael Rubin wrote:

> What is the plan for customers going forward? Is it going to involve
> removing ftrace in favor for perf? Removing perf in favor for ftrace?
> We love perf and don't want to see it go away either.  We tend to use
> the two systems differently. Do customers basically have to wait a few
> years to see not only which system wins but which ones stays on top?

My plan is:

1) get libperf.so out for user tools to use.

2) Start hacking on code again :)

But I'll make sure that this will not be a burden on Google. There's no
reason that Google should be punished for using something that is
mainline, and using the proper ABIs. The code in ftrace is very flexible
and tools that use ftrace should still work even if we make internal
changes.

I'll work closely with you to make sure that Google's tools will always
work with future kernels.

> 
> I apologize if this is obvious to others but I am confused.

No need to apologize, it's a very confusing situation.

-- Steve



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Fix powerTOP regression with 2.6.39-rc5
  2011-05-17 11:19                         ` Steven Rostedt
@ 2011-05-17 13:24                           ` David Ahern
  2011-05-17 13:27                             ` Steven Rostedt
  0 siblings, 1 reply; 40+ messages in thread
From: David Ahern @ 2011-05-17 13:24 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Michael Rubin, Ingo Molnar, David Sharp, Vaibhav Nagarnaik,
	Linus Torvalds, Arjan van de Ven, linux-kernel,
	Frederic Weisbecker, Peter Zijlstra, Thomas Gleixner,
	Christoph Hellwig, Arnd Bergmann



On 05/17/11 05:19, Steven Rostedt wrote:
> On Tue, 2011-05-17 at 00:15 -0700, Michael Rubin wrote:
> 
>> What is the plan for customers going forward? Is it going to involve
>> removing ftrace in favor for perf? Removing perf in favor for ftrace?
>> We love perf and don't want to see it go away either.  We tend to use
>> the two systems differently. Do customers basically have to wait a few
>> years to see not only which system wins but which ones stays on top?
> 
> My plan is:
> 
> 1) get libperf.so out for user tools to use.

libparsevent or libperf? If you really meant libperf will it be a
superset of the event parsing -- like the inclusion of the plugins infra?

David

> 
> 2) Start hacking on code again :)
> 
> But I'll make sure that this will not be a burden on Google. There's no
> reason that Google should be punished for using something that is
> mainline, and using the proper ABIs. The code in ftrace is very flexible
> and tools that use ftrace should still work even if we make internal
> changes.
> 
> I'll work closely with you to make sure that Google's tools will always
> work with future kernels.
> 
>>
>> I apologize if this is obvious to others but I am confused.
> 
> No need to apologize, it's a very confusing situation.
> 
> -- Steve
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Fix powerTOP regression with 2.6.39-rc5
  2011-05-17 13:24                           ` David Ahern
@ 2011-05-17 13:27                             ` Steven Rostedt
  2011-05-17 13:30                               ` Ingo Molnar
  0 siblings, 1 reply; 40+ messages in thread
From: Steven Rostedt @ 2011-05-17 13:27 UTC (permalink / raw)
  To: David Ahern
  Cc: Michael Rubin, Ingo Molnar, David Sharp, Vaibhav Nagarnaik,
	Linus Torvalds, Arjan van de Ven, linux-kernel,
	Frederic Weisbecker, Peter Zijlstra, Thomas Gleixner,
	Christoph Hellwig, Arnd Bergmann

On Tue, 2011-05-17 at 07:24 -0600, David Ahern wrote:

> libparsevent or libperf? If you really meant libperf will it be a
> superset of the event parsing -- like the inclusion of the plugins infra?
> 

Yes, although not all the plugin infrastructure was in libparsevent.so.

I plan on taking libparsevent.so and adding perf wrappers to its
interface.

-- Steve



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Fix powerTOP regression with 2.6.39-rc5
  2011-05-17 13:27                             ` Steven Rostedt
@ 2011-05-17 13:30                               ` Ingo Molnar
  0 siblings, 0 replies; 40+ messages in thread
From: Ingo Molnar @ 2011-05-17 13:30 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: David Ahern, Michael Rubin, David Sharp, Vaibhav Nagarnaik,
	Linus Torvalds, Arjan van de Ven, linux-kernel,
	Frederic Weisbecker, Peter Zijlstra, Thomas Gleixner,
	Christoph Hellwig, Arnd Bergmann


* Steven Rostedt <rostedt@goodmis.org> wrote:

> On Tue, 2011-05-17 at 07:24 -0600, David Ahern wrote:
> 
> > libparsevent or libperf? If you really meant libperf will it be a
> > superset of the event parsing -- like the inclusion of the plugins infra?
> > 
> 
> Yes, although not all the plugin infrastructure was in libparsevent.so.

Please post the patches in a finegrained manner that makes it easy to review 
each aspect of it.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2011-05-17 13:30 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-05-06 20:08 Fix powerTOP regression with 2.6.39-rc5 Arjan van de Ven
2011-05-06 20:20 ` Steven Rostedt
2011-05-06 20:51   ` Linus Torvalds
2011-05-06 21:10     ` Steven Rostedt
2011-05-06 21:24       ` Linus Torvalds
2011-05-06 21:14     ` Steven Rostedt
2011-05-06 21:28       ` Linus Torvalds
2011-05-06 21:29     ` Arjan van de Ven
2011-05-06 21:57       ` Steven Rostedt
2011-05-07  6:58     ` Ingo Molnar
2011-05-07 10:45       ` Steven Rostedt
2011-05-07 14:44         ` Ingo Molnar
2011-05-07 17:20           ` Steven Rostedt
2011-05-07 17:59             ` Arjan van de Ven
2011-05-08 21:08               ` Frederic Weisbecker
2011-05-08 21:56                 ` Arjan van de Ven
2011-05-07 19:00             ` Ingo Molnar
2011-05-10  3:07               ` Steven Rostedt
2011-05-10  4:44                 ` Dave Chinner
2011-05-10  5:39                   ` Steven Rostedt
2011-05-10  7:36                     ` Dave Chinner
2011-05-10  7:54                 ` Ingo Molnar
2011-05-10  8:09                 ` Ingo Molnar
2011-05-10  8:32                   ` Arjan van de Ven
2011-05-10  8:44                     ` Ingo Molnar
2011-05-10  9:14                       ` Pekka Enberg
2011-05-10  8:41                 ` Ingo Molnar
2011-05-10 13:06                   ` Steven Rostedt
2011-05-11 21:51                     ` Ingo Molnar
2011-05-11 22:36                       ` Steven Rostedt
2011-05-17  7:15                       ` Michael Rubin
2011-05-17 11:19                         ` Steven Rostedt
2011-05-17 13:24                           ` David Ahern
2011-05-17 13:27                             ` Steven Rostedt
2011-05-17 13:30                               ` Ingo Molnar
2011-05-10  8:47                 ` Ingo Molnar
2011-05-10 10:33                   ` Steven Rostedt
2011-05-10 19:13                     ` David Sharp
2011-05-09 23:37             ` David Sharp
2011-05-10  7:39               ` Ingo Molnar

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.