* [RFC/Requirements/Design] h/w error reporting
@ 2010-11-10  0:56 Luck, Tony
  2010-11-10 10:14 ` Ingo Molnar
  0 siblings, 1 reply; 50+ messages in thread
From: Luck, Tony @ 2010-11-10  0:56 UTC (permalink / raw)
  To: linux-kernel; +Cc: ying.huang, mingo, bp, tglx, akpm, mchehab

At the Linux plumbers conference we had an interesting discussion on
the current state and future direction for hardware error reporting.
Thanks to Mauro for setting up the session, and to all those who
attended.  Cc: list created by looking at the most vocal on the last
thread on this subject - but everyone is invited to chime in.

Here are my notes on what was said - please add anything that I
missed/forgot ... or with your own thoughts on this topic.

The current situation
---------------------

We have several subsystems & methods for reporting hardware errors:

1) EDAC ("Error Detection and Correction").  In its original form
this consisted of a platform specific driver that read topology
information and error counts from chipset registers and reported
the results via a sysfs interface.  For example:

# ls -l /sys/devices/system/edac/mc/mc0
total 0
-r--r--r-- 1 root root 4096 Nov  8 14:48 ce_count
-r--r--r-- 1 root root 4096 Nov  8 14:48 ce_noinfo_count
drwxr-xr-x 2 root root    0 Nov  8 14:47 csrow0
drwxr-xr-x 2 root root    0 Nov  8 14:47 csrow1
lrwxrwxrwx 1 root root    0 Nov  8 14:48 device -> ../../../../pci0000:00/0000:00:10.0
-r--r--r-- 1 root root 4096 Nov  8 14:48 mc_name
--w------- 1 root root 4096 Nov  8 14:48 reset_counters
-rw-r--r-- 1 root root 4096 Nov  8 14:48 sdram_scrub_rate
-r--r--r-- 1 root root 4096 Nov  8 14:48 seconds_since_reset
-r--r--r-- 1 root root 4096 Nov  8 14:48 size_mb
-r--r--r-- 1 root root 4096 Nov  8 14:48 ue_count
-r--r--r-- 1 root root 4096 Nov  8 14:48 ue_noinfo_count

Some chipset drivers also report some PCI device error information,
and others provide mechanisms to inject errors for testing.
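
As a rough illustration of how a user-space tool might consume the
classic EDAC sysfs interface, here is a minimal Python sketch that reads
the per-controller counters. The counter file names come from the
listing above; the helper name read_edac_counts() and the choice to skip
unreadable files are assumptions made for the example, not part of EDAC.

```python
from pathlib import Path

# Counter files taken from the sysfs listing above.
COUNTER_FILES = ("ce_count", "ce_noinfo_count", "ue_count", "ue_noinfo_count")


def read_edac_counts(mc_dir):
    """Return {counter name: int} for one memory controller directory.

    mc_dir is a path such as /sys/devices/system/edac/mc/mc0.
    Counters that are missing or unparsable are skipped (an assumption:
    not every platform is expected to expose every file).
    """
    counts = {}
    for name in COUNTER_FILES:
        try:
            counts[name] = int((Path(mc_dir) / name).read_text().strip())
        except (OSError, ValueError):
            continue
    return counts
```

Calling read_edac_counts("/sys/devices/system/edac/mc/mc0") on a machine
with a loaded EDAC driver would return the four counters as integers; on
a machine without one it simply returns an empty dict.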

2) mcelog - x86 specific decoding of machine check bank registers,
reported in binary form via /dev/mcelog. Recent additions make use
of the APEI extensions that were documented in version 4.0a of the
ACPI specification to acquire more information about errors without
having to rely on reading chipset registers directly. A user-level
program decodes this into a somewhat human-readable format.

3) drivers/edac/mce_amd.c  A recent addition - this driver hooks into
the mcelog path and decodes errors reported via machine check bank
registers in AMD processors to the console log using printk() [despite
being in the drivers/edac directory, this seems completely different
from classic EDAC to me].

Each of these mechanisms has a band of followers ... and none
of them appear to meet all the needs of all users. Some of the
issues are:
1) New EDAC drivers need to be written for each chipset. Documentation
is often opaque, so there is often a delay between the introduction
of a new platform and availability of EDAC drivers.
2) Some platforms do not allow the OS to read chipset error counters.
3) Some parts of mcelog use ACPI - which taints the whole subsystem
(somewhat unfairly - most of it depends on machine check bank registers).
4) Some large cluster users are not happy about parsing console logs
looking for patterns of warning messages that indicate possible
future problems.

Taking a cue from the tracing session from the previous day (where
the "perf" vs. "ftrace" vs. "lttng" war was ended by proposing a
new tracing methodology that would overcome the shortcomings of
both of the merged subsystems while also addressing the requirements
of the lttng users) we explored whether the solution would be to
define a new "system health" subsystem that could be used by any
part of the kernel to report hardware issues in a coherent way so
that end users would have a single place to look for all error
information.

Use cases:
----------
There are a number of things that people may want to do with h/w
error data:

*) Corrected errors -> Look for patterns, or for rates above a particular
   threshold and use this for "predictive failure analysis" (i.e. to
   schedule replacement of a component before it is the source of an
   uncorrectable error).
*) Uncorrected errors -> Identify failing component for replacement.
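
To make the "rates above a particular threshold" case concrete, here is
a minimal Python sketch of a sliding-window rate check for corrected
errors. The class name and both tunables (max_errors, window_seconds)
are hypothetical; real predictive failure analysis would likely use
per-component policies that are not specified here.

```python
from collections import deque


class CorrectedErrorMonitor:
    """Flag a component when corrected errors exceed a rate threshold.

    max_errors and window_seconds are illustrative tunables, not values
    taken from any existing tool.
    """

    def __init__(self, max_errors, window_seconds):
        self.max_errors = max_errors
        self.window_seconds = window_seconds
        self._times = deque()

    def record(self, timestamp):
        """Record one corrected error; return True if over threshold."""
        self._times.append(timestamp)
        # Drop events that have fallen out of the sliding window.
        cutoff = timestamp - self.window_seconds
        while self._times and self._times[0] <= cutoff:
            self._times.popleft()
        return len(self._times) > self.max_errors
```

One monitor instance per DIMM (or other replaceable unit) would let an
agent schedule replacement once record() starts returning True.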


Requirements (woefully incomplete - please help):
-------------------------------------------------

*) Architecture independent (both Power and ARM are potentially interested)

*) Report errors against human readable labels (e.g. using motherboard
   labels to identify DIMM or PCI slots).  This is hard (will often need
   some platform-specific mapping table to provide, or override, detailed
   information).

*) General interface available for any kind of h/w error report (e.g.
   device driver might use it for board level problems, or IPMI might
   report fan speed problems or over-temperature events).

*) Useful to make it easy to adapt existing EDAC drivers, machine-check
   bank decoders and other existing error reporters to use this new
   mechanism.

*) Robust - should not lose error information.  If the platform provides
   some sort of persistent storage, should make use of it to preserve
   details for fatal errors across reboot.  But may need some threshold
   mechanism that copes with floods of errors from a failed object.

*) Flexible: Errors may be discovered by polling, or reported by some
   interrupt/exception

Open questions:
---------------

For error sources that require polling to collect information, who should
initiate polling?  Kernel (from a timer) or user?

There are lots of potential configuration options and tunables - so how
to keep to the minimum necessary?

How should reports of h/w error events get from kernel to user? (In
earlier instantiations of this discussion, Ingo suggested "perf".)

What should each error report look like? Some sort of record structure
would seem to be needed - but needs to be flexible to cope with
different needs from different types of device.
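
One possible shape for such a record, sketched in Python purely for
discussion: a small fixed header that every reporter fills in, plus a
free-form payload for device-specific data. All field names here are
assumptions for illustration; nothing in this thread has settled on a
format.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class HwErrorRecord:
    # Common header every error source would fill in (hypothetical fields).
    source: str       # e.g. "edac", "mce", "ipmi"
    severity: str     # "corrected", "uncorrected" or "fatal"
    timestamp: float  # reporter's clock; units are left to the reporter
    label: str = ""   # human readable location, e.g. a motherboard label
    # Device-specific payload, free-form so new sources need no format change.
    details: dict = field(default_factory=dict)

    def to_json(self):
        """Serialize the record for logging or transport."""
        return json.dumps(asdict(self), sort_keys=True)
```

The split between a fixed header and an open "details" payload is one
way to get flexibility without making every consumer understand every
device type.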

If multiple user agents are interested in looking at errors, how to
ensure that every agent gets a chance to see every error?

Some errors may be found before userspace has been started. How/where
to hold these reliably until daemons are running?
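
One way to think about the two questions above is a log in which every
agent keeps its own read position, so no agent consumes a record on
behalf of another, and a late-starting daemon can begin from the oldest
retained record. The Python sketch below is only a model of that idea;
the class and method names are hypothetical, and it ignores the real
problem of bounding memory.

```python
class ErrorLog:
    """Append-only log where each reader keeps its own position."""

    def __init__(self):
        self._records = []
        self._cursors = {}

    def register(self, agent, from_start=True):
        # from_start=True lets a late-starting daemon see earlier errors,
        # including those reported before any agent existed.
        self._cursors[agent] = 0 if from_start else len(self._records)

    def report(self, record):
        """Called by an error source; never blocks on readers."""
        self._records.append(record)

    def fetch(self, agent):
        """Return all records this agent has not yet seen."""
        start = self._cursors[agent]
        self._cursors[agent] = len(self._records)
        return self._records[start:]
```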

Where should platform specific tables that map from hard to interpret
"device numbers" to actual numbered slots reside?  Should this be left
to user-mode to tidy up? Or should we somehow load mapping information
into the kernel?

-Tony


* Re: [RFC/Requirements/Design] h/w error reporting
  2010-11-10  0:56 [RFC/Requirements/Design] h/w error reporting Luck, Tony
@ 2010-11-10 10:14 ` Ingo Molnar
  2010-11-10 14:40   ` Steven Rostedt
  0 siblings, 1 reply; 50+ messages in thread
From: Ingo Molnar @ 2010-11-10 10:14 UTC (permalink / raw)
  To: Luck, Tony
  Cc: linux-kernel, ying.huang, bp, tglx, akpm, mchehab,
	Frédéric Weisbecker, Steven Rostedt,
	Arnaldo Carvalho de Melo, Peter Zijlstra, Arjan van de Ven


* Luck, Tony <tony.luck@intel.com> wrote:

> Taking a cue from the tracing session from the previous day (where the "perf" vs. 
> "ftrace" vs. "lttng" war was ended by proposing a new tracing methodology that 
> would overcome the shortcomings of both of the merged subsystems while also 
> addressing the requirements of the lttng users) [...]

Well, the direction is that we are unifying ftrace and perf events and we are 
actively phasing out individual ftrace plugins as matching events become available 
(we already removed a few).

Most new tools use the perf syscall and tool writers have expressed the very 
understandable desire that all events (and their reporting facility) be enumerated 
and accessible via a unified API/ABI.

It often seems easier for subsystems to just do their own ad-hoc 
logging/reporting in the short run (every subsystem tends to think it has its own 
very specific requirements for logging - while users/tool-authors can only shake 
their heads in disbelief when looking at the myriad incompatible and inconsistent 
facilities). But the tooling requirement for unification is strong here and cannot 
be ignored.

> [...] we explored whether the solution would be to define a new "system health" 
> subsystem that could be used by any part of the kernel to report hardware issues 
> in a coherent way so that end users would have a single place to look for all 
> error information.

Note that Boris has been working on extending perf events into this area as well, 
see this recent submission of patches on lkml:

  [PATCH 20/20] ras: Add RAS daemon

One thing is clear: any 'health subsystem' should not do its own flavor of error 
reporting - instead we want to unify various forms of event logging into a common 
facility.

RAS/EDAC could do its own hardware-specific settings via a separate subsystem - 
although even many of those can be expressed via their respective events. (and we 
are open on the perf events side to give callbacks/facilities for such use)

The synergies of unified event reporting are very strong.

Thanks,

	Ingo


* Re: [RFC/Requirements/Design] h/w error reporting
  2010-11-10 10:14 ` Ingo Molnar
@ 2010-11-10 14:40   ` Steven Rostedt
  2010-11-10 14:43     ` Peter Zijlstra
  0 siblings, 1 reply; 50+ messages in thread
From: Steven Rostedt @ 2010-11-10 14:40 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Luck, Tony, linux-kernel, ying.huang, bp, tglx, akpm, mchehab,
	Frédéric Weisbecker, Arnaldo Carvalho de Melo,
	Peter Zijlstra, Arjan van de Ven, Mathieu Desnoyers

On Wed, 2010-11-10 at 11:14 +0100, Ingo Molnar wrote:
> * Luck, Tony <tony.luck@intel.com> wrote:
> 
> > Taking a cue from the tracing session from the previous day (where the "perf" vs. 
> > "ftrace" vs. "lttng" war was ended by proposing a new tracing methodology that 
> > would overcome the shortcomings of both of the merged subsystems while also 
> > addressing the requirements of the lttng users) [...]
> 
> Well, the direction is that we are unifying ftrace and perf events and we are 
> actively phasing out individual ftrace plugins as matching events become available 
> (we already removed a few).
> 
> Most new tools use the perf syscall and tool writers have expressed the very 
> understandable desire that all events (and their reporting facility) be enumerated 
> and accessible via a unified API/ABI.
> 
> While it often seems easier for subsystems to just do their own ad-hoc 
> logging/reporting in the short run (every subsystem tends to think it has its own 
> very specific requirements for logging - while users/tool-authors can only shake 
> their head in disbelief when looking at the myriads of incompatible and inconsistent 
> facilities). The tooling requirement for unification is strong here and can not be 
> ignored.

Hi Ingo,

The thing that was brought up at KS was the problems between the way
perf and ftrace get their data. We have two buffer systems and two
interfaces for that. Forget the debugfs for now, the ftrace ring buffer
was designed for fast high speed tracing, with or without a reader. The
perf buffer was designed for analyzing a specific task (although it can
do more, but for a single task it shines).

The format of data that ftrace uses and the format perf uses is also
currently incompatible.

Linus said flat out that if he gets one complaint that a tool breaks
because a format change or an ABI disappears, he will revert the patch
that did that change immediately.

During the tracing summit at Linux Plumbers, Thomas stated that we have
two choices.

1) We can keep the status-quo and just have two separate interfaces
(whether both would be supported by the perf user tool was not
discussed)

2) We come up with a new syscall (or syscalls) that can be designed for
both the needs of perf and ftrace. This syscall would be kept out of
mainline until everyone was happy with it. After we are happy with it
and have tools that work well with it, we will push it to mainline. Then
the old interfaces would still be supported but nothing new added to
them. And all new development would be with the new syscall(s) and
eventually we deprecate the old interface. This would truly unify ftrace
and perf.

The second option was agreed upon by myself, Thomas, Frederic, and
Peter, and it was OK'd by Linus.

What do you think about it?

-- Steve

> 
> > [...] we explored whether the solution would be to define a new "system health" 
> > subsystem that could be used by any part of the kernel to report hardware issues 
> > in a coherent way so that end users would have a single place to look for all 
> > error information.
> 
> Note that Boris has been working on extending perf events into this area as well, 
> see this recent submission of patches on lkml:
> 
>   [PATCH 20/20] ras: Add RAS daemon
> 
> One thing is clear: any 'health subsystem' should not do its own flavor of error 
> reporting - instead we want to unify various forms of event logging into a common 
> facility.
> 
> RAS/EDAC could do its own hardware-specific settings via a separate subsystem - 
> although even many of those can be expressed via their respective events. (and we 
> are open on the perf events side to give callbacks/facilities for such use)
> 
> The synergies of unified event reporting are very strong.
> 
> Thanks,
> 
> 	Ingo




* Re: [RFC/Requirements/Design] h/w error reporting
  2010-11-10 14:40   ` Steven Rostedt
@ 2010-11-10 14:43     ` Peter Zijlstra
  2010-11-10 15:09       ` Steven Rostedt
  0 siblings, 1 reply; 50+ messages in thread
From: Peter Zijlstra @ 2010-11-10 14:43 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, Luck, Tony, linux-kernel, ying.huang, bp, tglx,
	akpm, mchehab, Frédéric Weisbecker,
	Arnaldo Carvalho de Melo, Arjan van de Ven, Mathieu Desnoyers

On Wed, 2010-11-10 at 09:40 -0500, Steven Rostedt wrote:
> The second option was agreed upon by myself, Thomas, Frederic, and
> Peter, and it was OK'd by Linus.
> 
> What do you think about it? 

I would still like to see lots more detail before we commit to anything,
sure the second way is the only way out, but you still need to come up
with a trace data format and a control ABI.

Without those it's pretty pointless to even talk about stuff.




* Re: [RFC/Requirements/Design] h/w error reporting
  2010-11-10 14:43     ` Peter Zijlstra
@ 2010-11-10 15:09       ` Steven Rostedt
  2010-11-10 15:28         ` Mathieu Desnoyers
  2010-11-10 15:30         ` Peter Zijlstra
  0 siblings, 2 replies; 50+ messages in thread
From: Steven Rostedt @ 2010-11-10 15:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Luck, Tony, linux-kernel, ying.huang, bp, tglx,
	akpm, mchehab, Frédéric Weisbecker,
	Arnaldo Carvalho de Melo, Arjan van de Ven, Mathieu Desnoyers

On Wed, 2010-11-10 at 15:43 +0100, Peter Zijlstra wrote:
> On Wed, 2010-11-10 at 09:40 -0500, Steven Rostedt wrote:
> > The second option was agreed upon by myself, Thomas, Frederic, and
> > Peter, and it was OK'd by Linus.
> > 
> > What do you think about it? 
> 
> I would still like to see lots more detail before we commit to anything,

Keeping it out of tree means we don't commit to anything :-)

> sure the second way is the only way out, but you still need to come up
> with a trace data format and a control ABI.

Great! Let's start with that then. Could you list some of the basic
needs of perf? And then I can start talking about what I need for
ftrace, and we also should keep in mind things we may want to do in the
future.

> 
> Without those it's pretty pointless to even talk about stuff.

Agreed, but we really do want to find a way out, thus let's start the
conversation. Basically, start from square one. We now have both ftrace
and perf, and we know what we need. We can start working on something
with both in mind, and perhaps keeping track of other things.

I added Mathieu too. I know to you LTTng does not exist, but he can at
least give ideas about something we may not have thought about and may
want to do in the future.

-- Steve





* Re: [RFC/Requirements/Design] h/w error reporting
  2010-11-10 15:09       ` Steven Rostedt
@ 2010-11-10 15:28         ` Mathieu Desnoyers
  2010-11-10 15:30         ` Peter Zijlstra
  1 sibling, 0 replies; 50+ messages in thread
From: Mathieu Desnoyers @ 2010-11-10 15:28 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Ingo Molnar, Luck, Tony, linux-kernel,
	ying.huang, bp, tglx, akpm, mchehab,
	Frédéric Weisbecker, Arnaldo Carvalho de Melo,
	Arjan van de Ven

* Steven Rostedt (rostedt@goodmis.org) wrote:
[...]
> I added Mathieu too. I know to you LTTng does not exist, but he can at
> least give ideas about something we may not have thought about and may
> want to do in the future.

If we come up with an ABI that fulfills my user needs, I'll switch to it and
deprecate LTTng. I've actually worked almost full time on common kernel tracer
infrastructures this year, leaving LTTng development almost to a standing halt.

How do you guys want to proceed ? Do you want to throw your requirements over
the wall or do you want me to propose a trace format we can discuss on ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


* Re: [RFC/Requirements/Design] h/w error reporting
  2010-11-10 15:09       ` Steven Rostedt
  2010-11-10 15:28         ` Mathieu Desnoyers
@ 2010-11-10 15:30         ` Peter Zijlstra
  2010-11-10 15:53           ` Steven Rostedt
                             ` (2 more replies)
  1 sibling, 3 replies; 50+ messages in thread
From: Peter Zijlstra @ 2010-11-10 15:30 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, Luck, Tony, linux-kernel, ying.huang, bp, tglx,
	akpm, mchehab, Frédéric Weisbecker,
	Arnaldo Carvalho de Melo, Arjan van de Ven, Mathieu Desnoyers

On Wed, 2010-11-10 at 10:09 -0500, Steven Rostedt wrote:
> 
> > sure the second way is the only way out, but you still need to come up
> > with a trace data format and a control ABI.
> 
> Great! Let's start with that then. Could you list some of the basic
> needs of perf? 

Needs for what? I've already got a full control ABI and I can already
redirect output to other buffers ;-)

I don't have enumeration of what all is redirected to what, I pretend
that people know wth they're doing.. so if you want session lists of
what tracepoints are active on which buffers and the like you'll have to
come up with something for that.

As for the buffer, I prefer a u64 aligned data stream, but the very
least I need is frame encapsulation. What I don't want _ever_ is stupid
sub-buffers. And no they're not needed, see the discussion about sync
markers a while back.

I also don't want to support the stupid concurrent read/write from tail.

What I do want is both mmap() and splice(), this means buffer size needs
to be specified at buffer creation.

I currently support overwrite (flight-recorder) and non-overwrite modes
depending on PROT_WRITE, I guess that can easily be pushed into the
buffer create call.

As to the mmap() part, it needs a control page to expose the head/tail
pointers and some data.

And as you know I need to write > PAGE_SIZE entries.
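
For readers following along, the requirements above (size fixed at
creation, head/tail positions exposed to the reader, overwrite versus
non-overwrite modes) can be modelled in a few lines. This Python sketch
is a toy model only, not how the perf or ftrace buffers are implemented;
the class name, the record-granular slots and the "lost" counter are
assumptions for illustration.

```python
class EventRing:
    """Fixed-size ring of event records with head/tail counters."""

    def __init__(self, size, overwrite):
        self.size = size            # fixed at creation time
        self.overwrite = overwrite  # True: flight recorder; False: drop when full
        self.slots = [None] * size
        self.head = 0               # total records written
        self.tail = 0               # total records consumed or overwritten
        self.lost = 0               # records dropped in non-overwrite mode

    def write(self, record):
        """Store a record; return False if it had to be dropped."""
        if self.head - self.tail == self.size:
            if not self.overwrite:
                self.lost += 1
                return False
            self.tail += 1          # discard the oldest record
        self.slots[self.head % self.size] = record
        self.head += 1
        return True

    def read(self):
        """Return the oldest unread record, or None if the ring is empty."""
        if self.tail == self.head:
            return None
        record = self.slots[self.tail % self.size]
        self.tail += 1
        return record
```

In overwrite mode the ring always holds the most recent "size" records;
in non-overwrite mode it keeps the oldest ones and counts what it drops.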


* Re: [RFC/Requirements/Design] h/w error reporting
  2010-11-10 15:30         ` Peter Zijlstra
@ 2010-11-10 15:53           ` Steven Rostedt
  2010-11-10 16:52           ` Steven Rostedt
  2010-11-10 17:48           ` Ingo Molnar
  2 siblings, 0 replies; 50+ messages in thread
From: Steven Rostedt @ 2010-11-10 15:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Luck, Tony, linux-kernel, ying.huang, bp, tglx,
	akpm, mchehab, Frédéric Weisbecker,
	Arnaldo Carvalho de Melo, Arjan van de Ven, Mathieu Desnoyers

On Wed, 2010-11-10 at 16:30 +0100, Peter Zijlstra wrote:


> I also don't want to support the stupid concurrent read/write from tail.

You are talking about what you said that ftrace ring buffer is totally
broken, that if the writer is writing to the tail, and the reader is
reading from the head, it is broken?

Let me get this straight.

We have a writer constantly writing to the tail of the buffer. On
another CPU we have a reader, that will start at that tail (where the
writer just wrote) and go backwards.

What happens if the writer continues writing? Do we stop the reader and
have it read what the writer just wrote? Or just consider that the reader goes
the opposite direction of the writer, and when it hits the writer, it
continues, since now it has the new data again.

Now the question comes, how do we show this data to the user? Does the
user need to sort the data? If the reader reads X amount of data, it
gets X from where the writer just wrote. Then the writer writes Y data.
The reader reads X amount again, but X > Y, do we read the Y where the
writer wrote, and then read the buffer part that is older than the
previous read? Thus the user now has the burden to sort the buffer?

I'm really confused as to how to use a buffer like this.

-- Steve




* Re: [RFC/Requirements/Design] h/w error reporting
  2010-11-10 15:30         ` Peter Zijlstra
  2010-11-10 15:53           ` Steven Rostedt
@ 2010-11-10 16:52           ` Steven Rostedt
  2010-11-10 17:05             ` Borislav Petkov
  2010-11-10 17:25             ` Frederic Weisbecker
  2010-11-10 17:48           ` Ingo Molnar
  2 siblings, 2 replies; 50+ messages in thread
From: Steven Rostedt @ 2010-11-10 16:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Luck, Tony, linux-kernel, ying.huang, bp, tglx,
	akpm, mchehab, Frédéric Weisbecker,
	Arnaldo Carvalho de Melo, Arjan van de Ven, Mathieu Desnoyers

On Wed, 2010-11-10 at 16:30 +0100, Peter Zijlstra wrote:

> As for the buffer, I prefer a u64 aligned data stream, but the very
> least I need is frame encapsulation. What I don't want _ever_ is stupid
> sub-buffers. And no they're not needed, see the discussion about sync
> markers a while back.

BTW, the sub-buffers are just an implementation detail. I suspect that
we'll have to end up with something that splits the buffer up. Whether
we have 'markers' or something else. They all break down the buffer into
a "sub-buffer".

> 
> I also don't want to support the stupid concurrent read/write from tail.

I was thinking about this more. I guess it can work if the reader always
goes the opposite direction of the writer. It's just that any user of
this will need to cope with it. I would personally like both methods
implemented: one as the "broken design" (as you put it), which removes
the burden of sorting from the user, and one as the "fast design", which
requires the consumer of the data to be able to sort the buffer.

> 
> What I do want is both mmap() and splice(), this means buffer size needs
> to be specified at buffer creation.

Sure.

> 
> I currently support overwrite (flight-recorder) and non-overwrite modes
> depending on PROT_WRITE, I guess that can easily be pushed into the
> buffer create call.

yep.

> 
> As to the mmap() part, it needs a control page to expose the head/tail
> pointers and some data.

But let's make this somehow generic, where it can be used by all sorts.


> And as you know I need to write > PAGE_SIZE entries.

Sure.

Also, let's not focus now on implementation. Let's try to concentrate on
what we want the tools to be able to do.

For example, I would like:

Very small entries, and pick and choose what I want in my entries.

A way to read it fast to a file or over the network (splice).

The read backwards seems like a cool idea, but I would not want to throw
away the read forwards part either.

How we implement this, we can work together on.

-- Steve




* Re: [RFC/Requirements/Design] h/w error reporting
  2010-11-10 16:52           ` Steven Rostedt
@ 2010-11-10 17:05             ` Borislav Petkov
  2010-11-10 17:41               ` Ingo Molnar
  2010-11-10 17:25             ` Frederic Weisbecker
  1 sibling, 1 reply; 50+ messages in thread
From: Borislav Petkov @ 2010-11-10 17:05 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Ingo Molnar, Luck, Tony, linux-kernel,
	ying.huang, tglx, akpm, mchehab, Frédéric Weisbecker,
	Arnaldo Carvalho de Melo, Arjan van de Ven, Mathieu Desnoyers

On Wed, Nov 10, 2010 at 11:52:40AM -0500, Steven Rostedt wrote:
> Also, let's not focus now on implementation. Let's try to concentrate on
> what we want the tools to be able to do.
> 
> For example, I would like:
> 
> Very small entries, and pick and choose what I want in my entries.
> 
> A way to read it fast to a file or over the network (splice).
> 
> The read backwards seems like a cool idea, but I would not want to throw
> away the read forwards part either.

It would also be cool to be able to allocate those buffers as early as
possible, even if before MCA is enabled, so that I won't have to copy
MCE data which got logged before the tracing subsystem got enabled to
the buffers proper.

-- 
Regards/Gruss,
Boris.


* Re: [RFC/Requirements/Design] h/w error reporting
  2010-11-10 16:52           ` Steven Rostedt
  2010-11-10 17:05             ` Borislav Petkov
@ 2010-11-10 17:25             ` Frederic Weisbecker
  1 sibling, 0 replies; 50+ messages in thread
From: Frederic Weisbecker @ 2010-11-10 17:25 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Ingo Molnar, Luck, Tony, linux-kernel,
	ying.huang, bp, tglx, akpm, mchehab, Arnaldo Carvalho de Melo,
	Arjan van de Ven, Mathieu Desnoyers

On Wed, Nov 10, 2010 at 11:52:40AM -0500, Steven Rostedt wrote:
> On Wed, 2010-11-10 at 16:30 +0100, Peter Zijlstra wrote:
> 
> > As for the buffer, I prefer a u64 aligned data stream, but the very
> > least I need is frame encapsulation. What I don't want _ever_ is stupid
> > sub-buffers. And no they're not needed, see the discussion about sync
> > markers a while back.
> 
> BTW, the sub-buffers are just an implementation detail. I suspect that
> we'll have to end up with something that splits the buffer up. Whether
> we have 'markers' or something else. They all break down the buffer into
> a "sub-buffer".


If the size of the sub-buffers is tunable (all the same size inside a
whole buffer, but that size is tunable), then someone who doesn't want
to use subbuffers can just use a single big subbuffer :)



* Re: [RFC/Requirements/Design] h/w error reporting
  2010-11-10 17:05             ` Borislav Petkov
@ 2010-11-10 17:41               ` Ingo Molnar
  2010-11-10 17:50                 ` Luck, Tony
  2010-11-10 18:09                 ` Steven Rostedt
  0 siblings, 2 replies; 50+ messages in thread
From: Ingo Molnar @ 2010-11-10 17:41 UTC (permalink / raw)
  To: Borislav Petkov, Steven Rostedt, Peter Zijlstra, Luck, Tony,
	linux-kernel, ying.huang, tglx, akpm, mchehab,
	Frédéric Weisbecker, Arnaldo Carvalho de Melo,
	Arjan van de Ven, Mathieu Desnoyers


* Borislav Petkov <bp@alien8.de> wrote:

> On Wed, Nov 10, 2010 at 11:52:40AM -0500, Steven Rostedt wrote:
> > Also, let's not focus now on implementation. Let's try to concentrate on
> > what we want the tools to be able to do.
> > 
> > For example, I would like:
> > 
> > Very small entries, and pick and choose what I want in my entries.
> > 
> > A way to read it fast to a file or over the network (splice).
> > 
> > The read backwards seems like a cool idea, but I would not want to throw
> > away the read forwards part either.
> 
> It would also be cool to be able to allocate those buffers as early as possible, 
> even if before MCA is enabled, so that I won't have to copy MCE data which got 
> logged before the tracing subsystem got enabled to the buffers proper.

We could even have some (small) statically enabled build-time buffer that could be 
enabled straight away before any allocators are enabled.

Thanks,

	Ingo


* Re: [RFC/Requirements/Design] h/w error reporting
  2010-11-10 15:30         ` Peter Zijlstra
  2010-11-10 15:53           ` Steven Rostedt
  2010-11-10 16:52           ` Steven Rostedt
@ 2010-11-10 17:48           ` Ingo Molnar
  2010-11-10 18:05             ` Steven Rostedt
  2 siblings, 1 reply; 50+ messages in thread
From: Ingo Molnar @ 2010-11-10 17:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, Luck, Tony, linux-kernel, ying.huang, bp, tglx,
	akpm, mchehab, Frédéric Weisbecker,
	Arnaldo Carvalho de Melo, Arjan van de Ven, Mathieu Desnoyers


* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Wed, 2010-11-10 at 10:09 -0500, Steven Rostedt wrote:
> > 
> > > sure the second way is the only way out, but you still need to come up with a 
> > > trace data format and a control ABI.
> > 
> > Great! Let's start with that then. Could you list some of the basic needs of 
> > perf?
> 
> Needs for what? I've already got a full control ABI and I can already redirect 
> output to other buffers ;-)

Yep. The obvious direction is to extend the event buffering ABI we already have, 
with whatever additions that are needed:

 - document that we already support flight recorder mode

 - a more compressed record format

 - NOP filler events up to page boundary, for better splice and for better flight 
   recorder

 - splice support

etc. That's how it evolved until now and it's all very extensible.
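
To make the NOP filler item above concrete, here is a minimal Python
sketch of the arithmetic a writer might do before placing a record: if
the record would straddle a page boundary, the rest of the current page
is covered by a filler event so that the record starts on the next page.
The function name and the 4096-byte page size are assumptions for
illustration, not existing perf behaviour.

```python
PAGE_SIZE = 4096  # assumed page size for the illustration


def pad_to_page(offset, record_len, page_size=PAGE_SIZE):
    """Return the number of filler bytes needed before a record.

    offset is the current write position in the buffer. If the record
    fits in what is left of the current page, no filler is needed.
    Otherwise a NOP filler covers the rest of the page. Records larger
    than a page cannot avoid crossing a boundary; they are only padded
    so that they start on one.
    """
    used = offset % page_size
    if used == 0:
        return 0
    room = page_size - used
    if record_len <= room:
        return 0
    return room
```

With page-aligned record starts, whole pages can be handed to splice()
without splitting a record, at the cost of some wasted space per page.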

Steve, could you please list the additions you have in mind, in order of priority?

Thanks,

	Ingo


* RE: [RFC/Requirements/Design] h/w error reporting
  2010-11-10 17:41               ` Ingo Molnar
@ 2010-11-10 17:50                 ` Luck, Tony
  2010-11-10 18:09                 ` Steven Rostedt
  1 sibling, 0 replies; 50+ messages in thread
From: Luck, Tony @ 2010-11-10 17:50 UTC (permalink / raw)
  To: Ingo Molnar, Borislav Petkov, Steven Rostedt, Peter Zijlstra,
	linux-kernel, Huang, Ying, tglx, akpm, mchehab,
	Frédéric Weisbecker, Arnaldo Carvalho de Melo,
	Arjan van de Ven, Mathieu Desnoyers

> We could even have some (small) statically enabled build-time
> buffer that could be enabled straight away before any allocators
> are enabled.

Agreed - in fact the error reporting paths will also want
some pre-allocated guaranteed space at all times.  Allocating
memory from within an NMI or Machine Check handler would
cause too many problems.

-Tony


* Re: [RFC/Requirements/Design] h/w error reporting
  2010-11-10 17:48           ` Ingo Molnar
@ 2010-11-10 18:05             ` Steven Rostedt
  2010-11-10 18:23               ` Luck, Tony
                                 ` (2 more replies)
  0 siblings, 3 replies; 50+ messages in thread
From: Steven Rostedt @ 2010-11-10 18:05 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Luck, Tony, linux-kernel, ying.huang, bp, tglx,
	akpm, mchehab, Frédéric Weisbecker,
	Arnaldo Carvalho de Melo, Arjan van de Ven, Mathieu Desnoyers

On Wed, 2010-11-10 at 18:48 +0100, Ingo Molnar wrote:

> Yep. The obvious direction is to extend the event buffering ABI we already have, 
> with whatever additions that are needed:
> 
>  - document that we already support flight recorder mode

I thought this was still broken?

> 
>  - a more compressed record format
> 
>  - NOP filler events up to page boundary, for better splice and for better flight 
>    recorder
> 
>  - splice support
> 
> etc. That's how it evolved until now and it's all very extensible.
> 
> Steve, could you please list the additions you have in mind, in order of priority?

A few things that pop up quickly are:

1) lockless

2) as-fast-as-possible

3) support all tasks / all CPUs and still have as-fast-as-possible

Peter said at LPC that the perf buffering system was not designed to
handle high speed tracing. But he also said he does not like the way the
ftrace buffering works.

I think if we take a step back, we can come up with a new buffering/ABI
system that can satisfy everyone. We will still support the current
method now, but I really don't think it is designed with everything we
had in mind. I do not envision that we can "evolve" to where we want to
be. We may have to bite the bullet, just like iptables did when they saw
the failures of ipchains, and redesign something new now that we
understand what the requirements are.

I do think we need to come up with something new but still support the
old methods. Thomas came up with this idea, and Peter, Frederic and
I agreed.

-- Steve




* Re: [RFC/Requirements/Design] h/w error reporting
  2010-11-10 17:41               ` Ingo Molnar
  2010-11-10 17:50                 ` Luck, Tony
@ 2010-11-10 18:09                 ` Steven Rostedt
  2010-11-10 18:52                   ` Ingo Molnar
  1 sibling, 1 reply; 50+ messages in thread
From: Steven Rostedt @ 2010-11-10 18:09 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Borislav Petkov, Peter Zijlstra, Luck, Tony, linux-kernel,
	ying.huang, tglx, akpm, mchehab, Frédéric Weisbecker,
	Arnaldo Carvalho de Melo, Arjan van de Ven, Mathieu Desnoyers

On Wed, 2010-11-10 at 18:41 +0100, Ingo Molnar wrote:

> > It would also be cool to be able to allocate those buffers as early as possible, 
> > even if before MCA is enabled, so that I won't have to copy MCE data which got 
> > logged before the tracing subsystem got enabled to the buffers proper.
> 
> We could even have some (small) statically enabled build-time buffer that could be 
> enabled straight away before any allocators are enabled.

I'm not sure it needs to be small. We can have a persistent buffer that
may be resized at any time. It can start off small, or with a kernel
command line, be as big as you want it.

Basically, what ftrace has now.

Also, with a way that root user can get a handle on this buffer, and
just trace global events with it.

-- Steve




* RE: [RFC/Requirements/Design] h/w error reporting
  2010-11-10 18:05             ` Steven Rostedt
@ 2010-11-10 18:23               ` Luck, Tony
  2010-11-10 18:31                 ` Peter Zijlstra
  2010-11-10 18:24               ` Peter Zijlstra
  2010-11-10 18:27               ` Ingo Molnar
  2 siblings, 1 reply; 50+ messages in thread
From: Luck, Tony @ 2010-11-10 18:23 UTC (permalink / raw)
  To: Steven Rostedt, Ingo Molnar
  Cc: Peter Zijlstra, linux-kernel, Huang, Ying, bp, tglx, akpm,
	mchehab, Frédéric Weisbecker, Arnaldo Carvalho de Melo,
	Arjan van de Ven, Mathieu Desnoyers

> A few things that pop up quickly are:
>
> 1) lockless

This is a clear requirement for use in h/w error
reporting too.  Taking locks in NMI or machine
check handler isn't an option.

-Tony


* Re: [RFC/Requirements/Design] h/w error reporting
  2010-11-10 18:05             ` Steven Rostedt
  2010-11-10 18:23               ` Luck, Tony
@ 2010-11-10 18:24               ` Peter Zijlstra
  2010-11-10 18:41                 ` Ingo Molnar
                                   ` (2 more replies)
  2010-11-10 18:27               ` Ingo Molnar
  2 siblings, 3 replies; 50+ messages in thread
From: Peter Zijlstra @ 2010-11-10 18:24 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, Luck, Tony, linux-kernel, ying.huang, bp, tglx,
	akpm, mchehab, Frédéric Weisbecker,
	Arnaldo Carvalho de Melo, Arjan van de Ven, Mathieu Desnoyers

On Wed, 2010-11-10 at 13:05 -0500, Steven Rostedt wrote:
> 
> Peter said at LPC that the perf buffering system was not designed to
> handle high speed tracing. But he also said he does not like the way the
> ftrace buffering works. 

You're not very good at listening, I said the perf infrastructure and
event handling mechanism isn't geared towards full throughput but
instead on sampling.

There is lots of code between getting the event and landing it in the
buffer. The buffer itself is perfectly suited for high speed low
overhead stuffs, the perf data format possibly not because it's not
bitfield happy.






* Re: [RFC/Requirements/Design] h/w error reporting
  2010-11-10 18:05             ` Steven Rostedt
  2010-11-10 18:23               ` Luck, Tony
  2010-11-10 18:24               ` Peter Zijlstra
@ 2010-11-10 18:27               ` Ingo Molnar
  2 siblings, 0 replies; 50+ messages in thread
From: Ingo Molnar @ 2010-11-10 18:27 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Luck, Tony, linux-kernel, ying.huang, bp, tglx,
	akpm, mchehab, Frédéric Weisbecker,
	Arnaldo Carvalho de Melo, Arjan van de Ven, Mathieu Desnoyers


* Steven Rostedt <rostedt@goodmis.org> wrote:

> On Wed, 2010-11-10 at 18:48 +0100, Ingo Molnar wrote:
> 
> > Yep. The obvious direction is to extend the event buffering ABI we already have, 
> > with whatever additions that are needed:
> > 
> >  - document that we already support flight recorder mode
> 
> I thought this was still broken?

Please document that as well.

> >  - a more compressed record format
> > 
> >  - NOP filler events up to page boundary, for better splice and for better flight 
> >    recorder
> > 
> >  - splice support
> > 
> > etc. That's how it evolved until now and it's all very extensible.
> > 
> > Steve, could you please list the additions you have in mind, in order of priority?
> 
> > A few things that pop up quickly are:
> 
> 1) lockless
> 
> > 2) as-fast-as-possible
> 
> 3) support all tasks / all CPUs and still have as-fast-as-possible

Yeah - that's a self-evident goal for just about any kernel code.

	Ingo


* RE: [RFC/Requirements/Design] h/w error reporting
  2010-11-10 18:23               ` Luck, Tony
@ 2010-11-10 18:31                 ` Peter Zijlstra
  2010-11-10 18:49                   ` Ingo Molnar
  0 siblings, 1 reply; 50+ messages in thread
From: Peter Zijlstra @ 2010-11-10 18:31 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Steven Rostedt, Ingo Molnar, linux-kernel, Huang, Ying, bp, tglx,
	akpm, mchehab, Frédéric Weisbecker,
	Arnaldo Carvalho de Melo, Arjan van de Ven, Mathieu Desnoyers

On Wed, 2010-11-10 at 10:23 -0800, Luck, Tony wrote:
> > A few things that pop up quickly are:
> >
> > 1) lockless
> 
> This is a clear requirement for use in h/w error
> reporting too.  Taking locks in NMI or machine
> check handler isn't an option.

Don't worry, lots of PMIs are NMIs, perf needs to be fully NMI safe
otherwise things simply don't work.


* Re: [RFC/Requirements/Design] h/w error reporting
  2010-11-10 18:24               ` Peter Zijlstra
@ 2010-11-10 18:41                 ` Ingo Molnar
  2010-11-10 19:00                   ` Steven Rostedt
  2010-11-10 19:16                 ` [RFC/Requirements/Design] h/w error reporting Steven Rostedt
  2010-11-10 19:38                 ` Steven Rostedt
  2 siblings, 1 reply; 50+ messages in thread
From: Ingo Molnar @ 2010-11-10 18:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, Luck, Tony, linux-kernel, ying.huang, bp, tglx,
	akpm, mchehab, Frédéric Weisbecker,
	Arnaldo Carvalho de Melo, Arjan van de Ven, Mathieu Desnoyers


* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Wed, 2010-11-10 at 13:05 -0500, Steven Rostedt wrote:
> 
> > Peter said at LPC that the perf buffering system was not designed to handle high 
> > speed tracing. But he also said he does not like the way the ftrace buffering 
> > works.
> 
> You're not very good at listening, I said the perf infrastructure and event 
> handling mechanism isn't geared towards full throughput but instead on sampling.

Note that even that is an implementation detail that can be changed: even with a 
sampling model the sampling bits are in a flag word, so common combinations can be 
checked for quickly and open-coded into flat fall-through code - if the sample 
decoding ever shows up as overhead. (It doesn't even need any ABI changes.)
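
For illustration only, the flag-word idea above can be sketched as follows. This is
a minimal model, not perf code: the flag names, their values, the field widths and
the choice of which combination counts as "common" are all assumptions made for
the example.

```python
import struct

# Hypothetical sample-field flags, for illustration only
# (assumed names and values, not the perf ABI).
SAMPLE_IP = 1 << 0
SAMPLE_TID = 1 << 1
SAMPLE_TIME = 1 << 2
SAMPLE_CPU = 1 << 3

# A combination assumed to be common enough to deserve a fast path.
COMMON = SAMPLE_IP | SAMPLE_TIME


def encode_generic(flags, ip, tid, time, cpu):
    """Slow path: test each flag bit in turn and append the matching field."""
    out = b""
    if flags & SAMPLE_IP:
        out += struct.pack("<Q", ip)
    if flags & SAMPLE_TID:
        out += struct.pack("<I", tid)
    if flags & SAMPLE_TIME:
        out += struct.pack("<Q", time)
    if flags & SAMPLE_CPU:
        out += struct.pack("<I", cpu)
    return out


def encode(flags, ip, tid, time, cpu):
    """One comparison on the whole flag word selects an open-coded fast path."""
    if flags == COMMON:
        # Flat code for the common case: no per-bit branching.
        return struct.pack("<QQ", ip, time)
    return encode_generic(flags, ip, tid, time, cpu)
```

Both paths produce the same bytes for the common combination; the fast path only
changes how the record is built, which is why the record format seen by tools would
not have to change.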

So it's a non-issue.

> There is lots of code between getting the event and landing it in the buffer. The 
> buffer itself is perfectly suited for high speed low overhead stuffs, the perf 
> data format possibly not because it's not bitfield happy.

Even that can be tweaked via allowing more compressed records. I doubt it will help 
as much, but it's still an incremental change that can be validated carefully.

Fact is that we have an ABI, happy users, happy tools and happy developers, so going 
incrementally is important and allows us to validate and measure every step while 
still having a full tool-space in place - and it will help everyone, in addition to 
the ftrace/lttng usecases.

We'll need to embark on this incremental path instead of a rewrite-the-world thing. 
As a maintainer my task is to say 'no' to rewrite-the-world approaches - and we can 
and will do better here.

Thanks,

	Ingo


* Re: [RFC/Requirements/Design] h/w error reporting
  2010-11-10 18:31                 ` Peter Zijlstra
@ 2010-11-10 18:49                   ` Ingo Molnar
  0 siblings, 0 replies; 50+ messages in thread
From: Ingo Molnar @ 2010-11-10 18:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Luck, Tony, Steven Rostedt, linux-kernel, Huang, Ying, bp, tglx,
	akpm, mchehab, Frédéric Weisbecker,
	Arnaldo Carvalho de Melo, Arjan van de Ven, Mathieu Desnoyers


* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Wed, 2010-11-10 at 10:23 -0800, Luck, Tony wrote:
> > > A few things that pop up quickly are:
> > >
> > > 1) lockless
> > 
> > This is a clear requirement for use in h/w error
> > reporting too.  Taking locks in NMI or machine
> > check handler isn't an option.
> 
> Don't worry, lots of PMIs are NMIs, perf needs to be fully NMI safe
> otherwise things simply don't work.

Yep, in fact perf was fully NMI safe earlier than the ftrace ring-buffer.

When perf code is NMI unsafe we notice it very quickly. I regularly record millions 
of events per second.

Thanks,

	Ingo


* Re: [RFC/Requirements/Design] h/w error reporting
  2010-11-10 18:09                 ` Steven Rostedt
@ 2010-11-10 18:52                   ` Ingo Molnar
  0 siblings, 0 replies; 50+ messages in thread
From: Ingo Molnar @ 2010-11-10 18:52 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Borislav Petkov, Peter Zijlstra, Luck, Tony, linux-kernel,
	ying.huang, tglx, akpm, mchehab, Frédéric Weisbecker,
	Arnaldo Carvalho de Melo, Arjan van de Ven, Mathieu Desnoyers


* Steven Rostedt <rostedt@goodmis.org> wrote:

> On Wed, 2010-11-10 at 18:41 +0100, Ingo Molnar wrote:
> 
> > > It would also be cool to be able to allocate those buffers as early as possible, 
> > > even if before MCA is enabled, so that I won't have to copy MCE data which got 
> > > logged before the tracing subsystem got enabled to the buffers proper.
> > 
> > We could even have some (small) statically enabled build-time buffer that could be 
> > enabled straight away before any allocators are enabled.
> 
> I'm not sure it needs to be small. We can have a persistent buffer that may be 
> resized at any time. It can start off small, or with a kernel command line, be as 
> big as you want it.

Sure.

> Basically, what ftrace has now.
> 
> Also, with a way that root user can get a handle on this buffer, and just trace 
> global events with it.

Yeah.

Thanks,

	Ingo


* Re: [RFC/Requirements/Design] h/w error reporting
  2010-11-10 18:41                 ` Ingo Molnar
@ 2010-11-10 19:00                   ` Steven Rostedt
  2010-11-10 19:11                     ` Ingo Molnar
  2010-11-10 19:11                     ` Frederic Weisbecker
  0 siblings, 2 replies; 50+ messages in thread
From: Steven Rostedt @ 2010-11-10 19:00 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Luck, Tony, linux-kernel, ying.huang, bp, tglx,
	akpm, mchehab, Frédéric Weisbecker,
	Arnaldo Carvalho de Melo, Arjan van de Ven, Mathieu Desnoyers

On Wed, 2010-11-10 at 19:41 +0100, Ingo Molnar wrote:

> We'll need to embark on this incremental path instead of a rewrite-the-world thing. 
> As a maintainer my task is to say 'no' to rewrite-the-world approaches - and we can 
> and will do better here.

Thus you are saying that we stick to the status quo, and also ignore the
fact that perf was a rewrite-the-world from ftrace to begin with.

-- Steve




* Re: [RFC/Requirements/Design] h/w error reporting
  2010-11-10 19:00                   ` Steven Rostedt
@ 2010-11-10 19:11                     ` Ingo Molnar
  2010-11-10 19:11                     ` Frederic Weisbecker
  1 sibling, 0 replies; 50+ messages in thread
From: Ingo Molnar @ 2010-11-10 19:11 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Luck, Tony, linux-kernel, ying.huang, bp, tglx,
	akpm, mchehab, Frédéric Weisbecker,
	Arnaldo Carvalho de Melo, Arjan van de Ven, Mathieu Desnoyers


* Steven Rostedt <rostedt@goodmis.org> wrote:

> On Wed, 2010-11-10 at 19:41 +0100, Ingo Molnar wrote:
> 
> > We'll need to embark on this incremental path instead of a rewrite-the-world 
> > thing. As a maintainer my task is to say 'no' to rewrite-the-world approaches - 
> > and we can and will do better here.
> 
> Thus you are saying that we stick to the status quo, [...]

No, I'm saying we don't do new things just for the sake of it being new, without 
exhausting existing facilities.

None of the examples/arguments offered so far seemed to necessitate throwing away 
existing stuff.

> [...] and also ignore the fact that perf was a rewrite-the-world from ftrace to 
> begin with.

No, the thing is that there were no tools and no ABI - perf was mostly about the ABI 
and about the user-space tooling - ftrace didn't really have that (and oprofile had 
deep problems).

Thanks,

	Ingo


* Re: [RFC/Requirements/Design] h/w error reporting
  2010-11-10 19:00                   ` Steven Rostedt
  2010-11-10 19:11                     ` Ingo Molnar
@ 2010-11-10 19:11                     ` Frederic Weisbecker
  2010-11-10 19:30                       ` Ingo Molnar
  2010-11-10 20:23                       ` Tracing Requirements (was: [RFC/Requirements/Design] h/w error reporting) Mathieu Desnoyers
  1 sibling, 2 replies; 50+ messages in thread
From: Frederic Weisbecker @ 2010-11-10 19:11 UTC (permalink / raw)
  To: Steven Rostedt, Mathieu Desnoyers
  Cc: Ingo Molnar, Peter Zijlstra, Luck, Tony, linux-kernel,
	ying.huang, bp, tglx, akpm, mchehab, Arnaldo Carvalho de Melo,
	Arjan van de Ven

On Wed, Nov 10, 2010 at 02:00:45PM -0500, Steven Rostedt wrote:
> On Wed, 2010-11-10 at 19:41 +0100, Ingo Molnar wrote:
> 
> > We'll need to embark on this incremental path instead of a rewrite-the-world thing. 
> > As a maintainer my task is to say 'no' to rewrite-the-world approaches - and we can 
> > and will do better here.
> 
> Thus you are saying that we stick to the status quo, and also ignore the
> fact that perf was a rewrite-the-world from ftrace to begin with.


Perhaps you and Mathieu can summarize your requirements here and then explain
why extending the current ABI wouldn't work. It's quite normal that people
try to find a solution fully backward compatible in the first place. If
it's not possible, fine, but then justify it.



* Re: [RFC/Requirements/Design] h/w error reporting
  2010-11-10 18:24               ` Peter Zijlstra
  2010-11-10 18:41                 ` Ingo Molnar
@ 2010-11-10 19:16                 ` Steven Rostedt
  2010-11-10 19:38                 ` Steven Rostedt
  2 siblings, 0 replies; 50+ messages in thread
From: Steven Rostedt @ 2010-11-10 19:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Luck, Tony, linux-kernel, ying.huang, bp, tglx,
	akpm, mchehab, Frédéric Weisbecker,
	Arnaldo Carvalho de Melo, Arjan van de Ven, Mathieu Desnoyers

On Wed, 2010-11-10 at 19:24 +0100, Peter Zijlstra wrote:
> On Wed, 2010-11-10 at 13:05 -0500, Steven Rostedt wrote:
> > 
> > Peter said at LPC that the perf buffering system was not designed to
> > handle high speed tracing. But he also said he does not like the way the
> > ftrace buffering works. 
> 
> You're not very good at listening,

Or I'm not good at rephrasing, but both are probably true ;-)


>  I said the perf infrastructure and
> event handling mechanism isn't geared towards full throughput but
> instead on sampling.

OK, not the buffering, but the infrastructure. That's not much of a
difference for this topic, which is about the ABI, and the ABI includes
everything from the user to the buffer.

> 
> There is lots of code between getting the event and landing it in the
> buffer. The buffer itself is perfectly suited for high speed low
> overhead stuffs, the perf data format possibly not because it's not
> bitfield happy.

Well, we need to separate out the buffer in perf regardless, since it is
very entwined in the code. Does it now support flight recorder mode?

-- Steve




* Re: [RFC/Requirements/Design] h/w error reporting
  2010-11-10 19:11                     ` Frederic Weisbecker
@ 2010-11-10 19:30                       ` Ingo Molnar
  2010-11-10 19:48                         ` Steven Rostedt
  2010-11-10 20:23                       ` Tracing Requirements (was: [RFC/Requirements/Design] h/w error reporting) Mathieu Desnoyers
  1 sibling, 1 reply; 50+ messages in thread
From: Ingo Molnar @ 2010-11-10 19:30 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Steven Rostedt, Mathieu Desnoyers, Peter Zijlstra, Luck, Tony,
	linux-kernel, ying.huang, bp, tglx, akpm, mchehab,
	Arnaldo Carvalho de Melo, Arjan van de Ven

* Frederic Weisbecker <fweisbec@gmail.com> wrote:

> On Wed, Nov 10, 2010 at 02:00:45PM -0500, Steven Rostedt wrote:
> > On Wed, 2010-11-10 at 19:41 +0100, Ingo Molnar wrote:
> > 
> > > We'll need to embark on this incremental path instead of a rewrite-the-world 
> > > thing. As a maintainer my task is to say 'no' to rewrite-the-world approaches 
> > > - and we can and will do better here.
> > 
> > Thus you are saying that we stick to the status quo, and also ignore the fact 
> > that perf was a rewrite-the-world from ftrace to begin with.
> 
> Perhaps you and Mathieu can summarize your requirements here and then explain why 
> extending the current ABI wouldn't work. It's quite normal that people try to find 
> a solution fully backward compatible in the first place. If it's not possible, 
> fine, but then justify it.

Yeah, that's pretty much the only reasonable approach really. It also makes every 
single step testable and verifiable, and often optional as well:

 - How much do we win from more compressed records? Do we win? Do we want _larger_,
   less encoded records on some CPUs because their construction overhead and cache
   behavior is better there?

 - How much does splice() help?

 - How much do the sampling fast-path approaches help? How many apps will make use
   of them?

Those are all issues that are virtually undecidable individually if the approach is 
an all-or-nothing flag-day thing.

Fact is that the perf events based tool space is vibrant and alive, and new uses are 
popping up every week. We'd be utter fools to not embark on an iterative approach 
here. It does not even limit us in any way technically.

The days of full tracing subsystem rewrites are over Steve, I'm afraid it is time 
for us to grow up ;-)

Thanks,

	Ingo


* Re: [RFC/Requirements/Design] h/w error reporting
  2010-11-10 18:24               ` Peter Zijlstra
  2010-11-10 18:41                 ` Ingo Molnar
  2010-11-10 19:16                 ` [RFC/Requirements/Design] h/w error reporting Steven Rostedt
@ 2010-11-10 19:38                 ` Steven Rostedt
  2 siblings, 0 replies; 50+ messages in thread
From: Steven Rostedt @ 2010-11-10 19:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Luck, Tony, linux-kernel, ying.huang, bp, tglx,
	akpm, mchehab, Frédéric Weisbecker,
	Arnaldo Carvalho de Melo, Arjan van de Ven, Mathieu Desnoyers

On Wed, 2010-11-10 at 19:24 +0100, Peter Zijlstra wrote:
> There is lots of code between getting the event and landing it in the
> buffer. The buffer itself is perfectly suited for high speed low
> overhead stuffs, the perf data format possibly not because it's not
> bitfield happy.

Question:

Can we make perf reduce what it records per event, thus speeding up
recording, without breaking the ABI?

Can we add a splice-based (non-mmap) flight recorder mode without
breaking the ABI?

If we can make perf as fast as ftrace in its recording, and maybe even
faster if we have the ability to select what is recorded and compress
the events, I'm all for it.

-- Steve





* Re: [RFC/Requirements/Design] h/w error reporting
  2010-11-10 19:30                       ` Ingo Molnar
@ 2010-11-10 19:48                         ` Steven Rostedt
  0 siblings, 0 replies; 50+ messages in thread
From: Steven Rostedt @ 2010-11-10 19:48 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Frederic Weisbecker, Mathieu Desnoyers, Peter Zijlstra, Luck,
	Tony, linux-kernel, ying.huang, bp, tglx, akpm, mchehab,
	Arnaldo Carvalho de Melo, Arjan van de Ven

On Wed, 2010-11-10 at 20:30 +0100, Ingo Molnar wrote:

>  - How much does splice() help?

Probably more if we eventually fix the bug that causes a copy of the
page when it is not needed.

-- Steve




* Re: Tracing Requirements (was: [RFC/Requirements/Design] h/w error reporting)
  2010-11-10 19:11                     ` Frederic Weisbecker
  2010-11-10 19:30                       ` Ingo Molnar
@ 2010-11-10 20:23                       ` Mathieu Desnoyers
  2010-11-10 20:54                         ` Luck, Tony
  2010-11-10 21:30                         ` Frederic Weisbecker
  1 sibling, 2 replies; 50+ messages in thread
From: Mathieu Desnoyers @ 2010-11-10 20:23 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Luck, Tony,
	linux-kernel, ying.huang, bp, tglx, akpm, mchehab,
	Arnaldo Carvalho de Melo, Arjan van de Ven

* Frederic Weisbecker (fweisbec@gmail.com) wrote:
> On Wed, Nov 10, 2010 at 02:00:45PM -0500, Steven Rostedt wrote:
> > On Wed, 2010-11-10 at 19:41 +0100, Ingo Molnar wrote:
> > 
> > > We'll need to embark on this incremental path instead of a rewrite-the-world thing. 
> > > As a maintainer my task is to say 'no' to rewrite-the-world approaches - and we can 
> > > and will do better here.
> > 
> > Thus you are saying that we stick to the status quo, and also ignore the
> > fact that perf was a rewrite-the-world from ftrace to begin with.
> 
> Perhaps you and Mathieu can summarize your requirements here and then explain
> why extending the current ABI wouldn't work. It's quite normal that people
> try to find a solution fully backward compatible in the first place. If
> it's not possible, fine, but then justify it.

Sure, here are the requirements my user-base has, followed by a listing of Perf
and Ftrace pain points, some of which are directly derived from their respective
ABIs, others partially caused by their implementation and partially caused by
their ABI.

- Low overhead is key
  - 150 ns per event (cache-hot)
  - Zero-copy (splice to disk/network, mmap for zero-copy in-place data
    analysis)
- Compactness of traces
  - e.g. 96 bits per event (including typical 64-bit payload), no PID saved per
    event.
- Scalability to multi-core and multi-processor
  - Per-CPU buffers, time-stamp reading both scalable to many cpus *and* accurate
- Production-grade tracer reliability
  - Trace clock accuracy within 100ns, ordering can be inferred based on
    lock/interrupt handler knowledge, ability to know when ordering might be
    wrong.
- Flight recorder mode
  - Support concurrent read while writer is overwriting buffer data
    (Thomas Gleixner named these "trace-shots")
- Support multiple trace sessions in parallel
  - Engineer + Operator + flight recorder for automated bug reports
- Availability of trace buffers for crash diagnosis
  - Save to disk, network, use kexec or persistent memory
- Heterogeneous environment support
  - Portability
  - Distinct host/target environment support
  - Management of multiple target kernel versions
  - No dependency on kernel image to analyze traces
    (traces contain complete information)
- Live view/analysis of trace streams via the network
  - Impact on buffer flushing, power saving, idle, ...
- Synchronized system-wide (hypervisor, kernel and user-space) traces
- Scalability of analysis tools to very large data sets (> 10GB)
- Standardization of trace format across analysis tools
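
As a rough illustration of the compactness requirement above, 96 bits per event
could be laid out as a 32-bit header plus the 64-bit payload. The 5-bit event id /
27-bit timestamp split below is an assumption chosen for this example, not a layout
stated in this thread, and a real tracer would need some additional way to recover
the timestamp bits that are dropped here.

```python
import struct

ID_BITS = 5                      # assumption: up to 32 event types
TS_BITS = 27                     # assumption: low bits of the trace clock
TS_MASK = (1 << TS_BITS) - 1


def pack_event(event_id, timestamp, payload):
    """Pack one event into 12 bytes: 32-bit header + 64-bit payload.

    No PID is stored in the record, matching the requirement above.
    """
    if not 0 <= event_id < (1 << ID_BITS):
        raise ValueError("event id does not fit in the header")
    header = (event_id << TS_BITS) | (timestamp & TS_MASK)
    # "<" selects little-endian with no alignment padding, so 4 + 8 bytes.
    return struct.pack("<IQ", header, payload)


def unpack_event(record):
    """Return (event_id, truncated_timestamp, payload) from a 12-byte record."""
    header, payload = struct.unpack("<IQ", record)
    return header >> TS_BITS, header & TS_MASK, payload
```

For example, `pack_event(3, 123456789, 0xdeadbeef)` yields a 12-byte record, i.e.
96 bits, and `unpack_event` returns the same id, timestamp and payload as long as
the timestamp fits in 27 bits.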


* Ring Buffer issues with Perf:

- Perf does not support flight recorder tracing (concurrent read/write)
  - Sub-buffers are needed to support concurrent read/writes in flight recorder
    mode. Peter still has to convince me otherwise (if he cares).
  - Implies adding padding when an event does not fit in the current sub-buffer
    (ABI change). Note for Frederic: creating a single-subbuffer as large as the
    buffer does not solve this problem, because perf allows writing an event
    across the end of the buffer and its beginning. In a scheme where
    sub-buffers can be discarded, it makes it quite unreliable to try to figure
    out where partially overwritten events end.
  - Calling the kernel when finishing reading a sub-buffer is needed for flight
    recorder mode tracing. It is not possible with the mmap-head-tail-counter
    ABI Perf currently uses for reader-writer synchronization.
- Perf is 5 times slower than Ftrace/Generic Ring Buffer Library/LTTng.
  - Partially due to implementation.
  - Partially due to large event size.
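
The padding point above can be sketched with a toy reservation function. This is
only a model of the idea, assuming a fixed sub-buffer size and a single writer; the
name `reserve` and the 16-byte size are made up for the example and do not
correspond to any existing kernel interface.

```python
SUBBUF_SIZE = 16   # hypothetical sub-buffer size in bytes


def reserve(write_pos, event_len):
    """Return (padding, start, new_write_pos) for an event of event_len bytes.

    write_pos is the number of bytes written so far. If the event would
    straddle a sub-buffer boundary, the rest of the current sub-buffer is
    filled with padding and the event starts at the next boundary.
    """
    if event_len > SUBBUF_SIZE:
        raise ValueError("event larger than a sub-buffer")
    room = SUBBUF_SIZE - (write_pos % SUBBUF_SIZE)
    padding = room if event_len > room else 0
    start = write_pos + padding
    return padding, start, start + event_len
```

With this scheme every sub-buffer begins on an event boundary, so a reader that is
handed a whole sub-buffer never has to guess where a partially overwritten event
ends. The cost is the wasted padding bytes and, as noted above, a change to what
the reader sees in the buffer.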

* Trace Format issues with Perf:

- Perf event headers are too large
- Handling of dynamically added instrumentation while trace is recorded is
  nonexistent.


* Ring Buffer issues with Ftrace:

- Ftrace needs an internal API cleanup.
  - "peek" is an unnecessary API duplication which complicates everything down
    to the buffer-level.
- Ftrace does not support cross-pages event writes
  - Limits event size to less than 4kB

* Trace Format issues with Ftrace:

- Ftrace timestamps are saved as delta from previous event
  - Only works for tracing where preemption can be disabled, unusable for
    user-space tracing.
  - Creates an artificial data dependency between events, leading to odd
    side-effects when dealing with nesting over tracer
    - 0 ns IRQ/SOFTIRQ handler duration side-effect
- Event size limited to one page
- Ftrace event headers are still too large
- Handling of dynamically added instrumentation while trace is recorded is
  nonexistent.
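
To make the delta-timestamp dependency above concrete, here is a small model of
delta encoding. It is a sketch only: the function names are invented, and the
"nested writer records a delta of 0" case is an assumed simplification used to show
how a zero-duration handler could appear, not a description of the actual Ftrace
code.

```python
def encode_deltas(timestamps):
    """Store each timestamp as the difference from the previous event."""
    deltas = []
    prev = 0
    for ts in timestamps:
        deltas.append(ts - prev)
        prev = ts
    return deltas


def decode_deltas(deltas):
    """Rebuild absolute timestamps; each one depends on every earlier record."""
    out = []
    now = 0
    for d in deltas:
        now += d
        out.append(now)
    return out


# Normal case: the round trip is exact.
normal = [1000, 1040, 1100, 1175]

# Assumed simplification: an interrupt handler's entry and exit events are
# written while an outer event is in progress and both get a delta of 0.
nested = decode_deltas([1000, 0, 0, 75])
irq_duration = nested[2] - nested[1]
```

In the nested case both handler events decode to the same absolute time, so the
handler appears to take 0 ns even though real time passed.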

So given that fixing these issues requires a large ABI rework of both Ftrace and
Perf, creating a new ABI rather than building on top of an ABI not initially
designed to meet these requirements seems to really make sense here.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


* RE: Tracing Requirements (was: [RFC/Requirements/Design] h/w error reporting)
  2010-11-10 20:23                       ` Tracing Requirements (was: [RFC/Requirements/Design] h/w error reporting) Mathieu Desnoyers
@ 2010-11-10 20:54                         ` Luck, Tony
  2010-11-10 21:06                           ` Steven Rostedt
  2010-11-10 22:51                           ` Mathieu Desnoyers
  2010-11-10 21:30                         ` Frederic Weisbecker
  1 sibling, 2 replies; 50+ messages in thread
From: Luck, Tony @ 2010-11-10 20:54 UTC (permalink / raw)
  To: Mathieu Desnoyers, Frederic Weisbecker
  Cc: Steven Rostedt, Ingo Molnar, Peter Zijlstra, linux-kernel, Huang,
	Ying, bp, tglx, akpm, mchehab, Arnaldo Carvalho de Melo,
	Arjan van de Ven

>- Perf does not support flight recorder tracing (concurrent read/write)
>  - Sub-buffers are needed to support concurrent read/writes

When I hear somebody say "flight recorder" - I think of "black boxes"
in airplanes that log data while the flight is running, and are only
looked at offline later.  So I'm confused by the "concurrent read/write"
requirement.

Perhaps you could explain the use cases of your "flight recorder",
because it seems that the name doesn't fit exactly, and this is
causing me (and maybe others) some confusion.

Thanks

-Tony


* RE: Tracing Requirements (was: [RFC/Requirements/Design] h/w error reporting)
  2010-11-10 20:54                         ` Luck, Tony
@ 2010-11-10 21:06                           ` Steven Rostedt
  2010-11-10 21:34                             ` Steven Rostedt
  2010-11-10 22:51                           ` Mathieu Desnoyers
  1 sibling, 1 reply; 50+ messages in thread
From: Steven Rostedt @ 2010-11-10 21:06 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Mathieu Desnoyers, Frederic Weisbecker, Ingo Molnar,
	Peter Zijlstra, linux-kernel, Huang, Ying, bp, tglx, akpm,
	mchehab, Arnaldo Carvalho de Melo, Arjan van de Ven

On Wed, 2010-11-10 at 12:54 -0800, Luck, Tony wrote:
> >- Perf does not support flight recorder tracing (concurrent read/write)
> >  - Sub-buffers are needed to support concurrent read/writes
> 
> When I hear somebody say "flight recorder" - I think of "black boxes"
> in airplanes that log data while the flight is running, and are only
> looked at offline later.  So I'm confused by the "concurrent read/write"
> requirement.
> 
> Perhaps you could explain the use cases of your "flight recorder",
> because it seems that the name doesn't fit exactly, and this is
> causing me (and maybe others) some confusion.

Hmm, I had this argument with Mathieu before, but I guess I mistakenly
let him win ;-)

I call "flight recorder" mode "overwrite" mode. Basically there are two
modes. They only have meaning when the ring buffer is full and a write
takes place.

1) producer/consumer mode - When the writer reaches the reader, all new
events are discarded. This means that you lose the latest events while
you keep older events around.

2) overwrite mode (flight recorder) - when the writer reaches the
reader, it pushes the reader forward, and writes the new events over the
old ones. This way, new events are always kept, whereas old events
are lost.

1 is much easier to implement than 2, especially when doing it in a
lockless way.

I guess I should have fought harder to keep the terminology of
"overwrite" mode. This is the third time in the last week I had to
explain what "flight recorder" mode was. Whereas overwrite mode was a
bit more obvious.
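
A toy model of the two modes, for anyone who prefers code to prose. This is a
single-threaded sketch that only shows which events survive when the buffer is
full; it says nothing about how to do this locklessly, and the class and attribute
names are invented for the example.

```python
from collections import deque


class RingBuffer:
    """Fixed-capacity event buffer with a selectable full-buffer policy."""

    def __init__(self, capacity, overwrite):
        self.capacity = capacity
        self.overwrite = overwrite
        self.events = deque()
        self.lost = 0

    def write(self, event):
        if len(self.events) == self.capacity:
            self.lost += 1
            if not self.overwrite:
                return False          # producer/consumer: drop the new event
            self.events.popleft()     # overwrite: push the reader forward
        self.events.append(event)
        return True

    def read(self):
        return self.events.popleft() if self.events else None


pc = RingBuffer(3, overwrite=False)
ow = RingBuffer(3, overwrite=True)
for e in range(1, 6):
    pc.write(e)
    ow.write(e)
```

After writing events 1 to 5 into a three-slot buffer, the producer/consumer buffer
holds 1, 2, 3 and the overwrite buffer holds 3, 4, 5. Both lost two events, just
different ones.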

-- Steve




* Re: Tracing Requirements (was: [RFC/Requirements/Design] h/w error reporting)
  2010-11-10 20:23                       ` Tracing Requirements (was: [RFC/Requirements/Design] h/w error reporting) Mathieu Desnoyers
  2010-11-10 20:54                         ` Luck, Tony
@ 2010-11-10 21:30                         ` Frederic Weisbecker
  2010-11-10 21:54                           ` Steven Rostedt
  2010-11-11  0:11                           ` Mathieu Desnoyers
  1 sibling, 2 replies; 50+ messages in thread
From: Frederic Weisbecker @ 2010-11-10 21:30 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Luck, Tony,
	linux-kernel, ying.huang, bp, tglx, akpm, mchehab,
	Arnaldo Carvalho de Melo, Arjan van de Ven

On Wed, Nov 10, 2010 at 03:23:16PM -0500, Mathieu Desnoyers wrote:
> * Frederic Weisbecker (fweisbec@gmail.com) wrote:
> > On Wed, Nov 10, 2010 at 02:00:45PM -0500, Steven Rostedt wrote:
> > > On Wed, 2010-11-10 at 19:41 +0100, Ingo Molnar wrote:
> > > 
> > > > We'll need to embark on this incremental path instead of a rewrite-the-world thing. 
> > > > As a maintainer my task is to say 'no' to rewrite-the-world approaches - and we can 
> > > > and will do better here.
> > > 
> > > Thus you are saying that we stick to the status quo, and also ignore the
> > > fact that perf was a rewrite-the-world from ftrace to begin with.
> > 
> > Perhaps you and Mathieu can summarize your requirements here and then explain
> > why extending the current ABI wouldn't work. It's quite normal that people
> > try to find a solution fully backward compatible in the first place. If
> > it's not possible, fine, but then justify it.
> 
> Sure, here are the requirements my user-base have, followed by a listing of Perf
> and Ftrace pain points, some of which are directly derived from their respective
> ABIs, others partially caused by their implementation and partially caused by
> their ABI.



Yeah, but the main point here is to explain why/how reaching those goals is not
efficiently possible through an extension of the current ABI, in practice.

I'm going to try for some of them. Note that when I talk about ABI breakage,
I actually mean: create a new ABI, keep supporting the old one, and schedule
its deprecation in the long term.

Here we go:


> 
> - Low overhead is key
>   - 150 ns per event (cache-hot)
>   - Zero-copy (splice to disk/network, mmap for zero-copy in-place data
>     analysis)



We could do splice in perf through an extension of the current ABI.
The rest seems more about kernel internals.

=> ABI breakage doesn't seem to be needed.



> - Compactness of traces
>   - e.g. 96 bits per event (including typical 64-bit payload), no PID saved per
>     event.


In perf we save the pid from two places:

- perf headers, see PERF_SAMPLE_TID
- from the common fields of the trace events

Ftrace saves it in the common fields too.

It's useful to keep PERF_SAMPLE_TID for low overhead events (like
perf little sampling). Otherwise we can certainly deduce the pid
from context switch trace events.

But the pid in the trace event headers remains. We probably should
get rid of that.
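
As a rough sketch of what "deduce the pid from context switch trace
events" could look like on the analysis side: replay the events in order
and remember, per CPU, which pid was last switched in. The event layout
and names below (struct event, EV_SCHED_SWITCH, deduce_pids, NR_CPUS) are
hypothetical and do not match any real perf or ftrace format.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS 4	/* hypothetical number of CPUs in the trace */

enum ev_type { EV_SCHED_SWITCH, EV_OTHER };

struct event {
	int cpu;
	enum ev_type type;
	int next_pid;	/* only meaningful for EV_SCHED_SWITCH */
};

/*
 * Fill pid_out[i] with the pid that was running when events[i] was
 * recorded, or -1 if no context switch has been seen on that CPU yet.
 */
static void deduce_pids(const struct event *events, size_t n, int *pid_out)
{
	int cur_pid[NR_CPUS];

	for (size_t c = 0; c < NR_CPUS; c++)
		cur_pid[c] = -1;

	for (size_t i = 0; i < n; i++) {
		const struct event *e = &events[i];

		assert(e->cpu >= 0 && e->cpu < NR_CPUS);
		/* the switch event itself still belongs to the outgoing task */
		pid_out[i] = cur_pid[e->cpu];
		if (e->type == EV_SCHED_SWITCH)
			cur_pid[e->cpu] = e->next_pid;
	}
}
```

The catch is visible in the -1 case: events recorded on a CPU before its
first context switch in the trace cannot be attributed this way.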

There are also the other common fields:

struct trace_entry {
	unsigned short		type;


Type is needed by perf. If we have one buffer per event, we could
retrieve which event we are dealing with. But if buffers are
multiplexed per cpu, we need this.


	unsigned char		flags;


Useful for ftrace, not for perf which will be able to save regs
soon.


	unsigned char		preempt_count;


Dunno. Should be optional.


	int			pid;


Kill!


	int			lock_depth;


Killed ;)


};



=> ABI breakage needed. It could be done through an ABI extension, but
that wouldn't scale in the long term.
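
To put a number on it, here is a small sketch comparing the common fields
as listed above with a trimmed variant that drops pid and lock_depth. The
struct names (trace_entry_full, trace_entry_trimmed) and the helper are
made up for illustration, and the sizes assume a typical Linux ABI with a
4-byte int and natural alignment.

```c
#include <assert.h>
#include <stddef.h>

/* Common fields as listed in this mail. */
struct trace_entry_full {
	unsigned short	type;
	unsigned char	flags;
	unsigned char	preempt_count;
	int		pid;
	int		lock_depth;
};

/* Hypothetical trimmed header: pid and lock_depth removed. */
struct trace_entry_trimmed {
	unsigned short	type;
	unsigned char	flags;
	unsigned char	preempt_count;
};

/* Bytes saved per event by dropping pid and lock_depth. */
static size_t common_header_savings(void)
{
	size_t full = sizeof(struct trace_entry_full);
	size_t trimmed = sizeof(struct trace_entry_trimmed);

	assert(trimmed <= full);
	return full - trimmed;
}
```

Under those assumptions the common fields shrink from 12 bytes to 4 bytes
per event.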




> - Scalability to multi-core and multi-processor
>   - Per-CPU buffers, time-stamp reading both scalable to many cpus *and* accurate



=> Kernel internals




> - Production-grade tracer reliability
>   - Trace clock accuracy within 100ns, ordering can be inferred based on
>     lock/interrupt handler knowledge, ability to know when ordering might be
>     wrong.



=> Seems to be kernel internals only. I may be missing your point though.




> - Flight recorder mode
>   - Support concurrent read while writer is overwriting buffer data
>     (Thomas Gleixner named these "trace-shots")



=> Abi extension (overwriting mode)?




> - Support multiple trace sessions in parallel
>   - Engineer + Operator + flight recorder for automated bug reports



=> Doesn't seem to need ABI breakage.



> - Availability of trace buffers for crash diagnosis
>   - Save to disk, network, use kexec or persistent memory



Use splice for save to disk or network. But I don't understand the kexec
thing.

=> ABI extension (see splice)



> - Heterogeneous environment support
>   - Portability



What is missing?



>   - Distinct host/target environment support



ditto.
This works well for perf and ftrace currently. Do you have
a specific problem in mind?



>   - Management of multiple target kernel versions


We all try to ensure backward compatibility. It only gets broken
because of unwanted regressions or scheduled deprecation in the
long term.


>   - No dependency on kernel image to analyze traces
>     (traces contain complete information)



Trace format.



> - Live view/analysis of trace streams via the network
>   - Impact on buffer flushing, power saving, idle, ...


kernel internals



> - Synchronized system-wide (hypervisor, kernel and user-space) traces



kernel internals?



> - Scalability of analysis tools to very large data sets (> 10GB)



=> Userspace internals



> - Standardization of trace format across analysis tools



Please detail.



> 
> * Ring Buffer issues with Perf:
> 
> - Perf does not support flight recorder tracing (concurrent read/write)



Abi extension.



>   - Sub-buffers are needed to support concurrent read/writes in flight recorder
>     mode. Peter still has to convince me otherwise (if he cares).



ABI breakage needed


>   - Imply adding padding when an event does not fit in the current sub-buffer
>     (ABI change). Note for Frederic: creating a single-subbuffer as large as the
>     buffer does not solve this problem, because perf allows writing an event
>     across the end of the buffer and its beginning. In a scheme where
>     sub-buffers can be discarded, it makes it quite unreliable to try to figure
>     out where partially overwritten events end.



Ok.




>   - Calling the kernel when finishing reading a sub-buffer is needed for flight
>     recorder mode tracing. It is not possible with the mmap-head-tail-counter
>     ABI Perf currently uses for reader-writer synchronization.



Why do you need to call the kernel for that?



> - Perf is 5 times slower than Ftrace/Generic Ring Buffer Library/LTTng.
>   - Partially due to implementation.


Kernel internals



>   - Partially due to large event size.



(See my previous comments about the pid and so on.)



> 
> * Trace Format issues with Perf:
> 
> - Perf event headers are too large



You can select them independently, except for trace events, which
I commented on above.



> - Handling of dynamically added instrumentation while trace is recorded is
>   inexistent.



???



> 
> 
> * Ring Buffer issues with Ftrace:
> 
> - Ftrace needs an internal API cleanup.
>   - "peek" is an unnecessary API duplication which complicates everything down
>     to the buffer-level.



kernel internals



> - Ftrace does not support cross-pages event writes
>   - Limits event size to less than 4kB


kernel internals?




> * Trace Format issues with Ftrace:
> 
> - Ftrace timestamps are saved as delta from previous event
>   - Only works for tracing where preemption can be disabled, unusable for
>     user-space tracing.



What is this userspace tracing? Is this userspace tracing made in kernel
space?

(tag me confused)



>   - Creates an artificial data dependency between events, leading to odd
>     side-effects when dealing with nesting over tracer



I won't comment on that; I'm not very experienced with the ring buffer.



>     - 0 ns IRQ/SOFTIRQ handler duration side-effect



ditto.

If we need/want to cure that, then we need an:

=> ABI breakage



> - Event size limited to one page



Perf too needs more (userspace stack dumps).



> - Ftrace event headers are still too large


(described in the beginning)



> - Handling of dynamically added instrumentation while trace is recorded is
>   inexistent.




I still don't understand this point


Now I'm too tired to sum up all the points that seem not to be
solved through an ABI extension :)



* RE: Tracing Requirements (was: [RFC/Requirements/Design] h/w error reporting)
  2010-11-10 21:06                           ` Steven Rostedt
@ 2010-11-10 21:34                             ` Steven Rostedt
  0 siblings, 0 replies; 50+ messages in thread
From: Steven Rostedt @ 2010-11-10 21:34 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Mathieu Desnoyers, Frederic Weisbecker, Ingo Molnar,
	Peter Zijlstra, linux-kernel, Huang, Ying, bp, tglx, akpm,
	mchehab, Arnaldo Carvalho de Melo, Arjan van de Ven

On Wed, 2010-11-10 at 16:06 -0500, Steven Rostedt wrote:

> > Perhaps you could explain the use cases of your "flight recorder",
> > because it seems that the name doesn't fit exactly, and this is
> > causing me (and maybe others) some confusion.

> 2) overwrite mode (flight recorder) - when the writer reaches the
> reader, it pushes the reader forward, and writes the new events over the
> old ones. This way, new events are always existent, where as old events
> are lost.

Ah, I forgot to mention use cases.

Even when recording a trace (lots of data, so it is saved to disk and
not all in kernel memory) I like to know that if something happens and
disables the trace, I have all the trace information up to the point of
failure.

With producer/consumer mode, you risk the reader being late and the
events in the trace that led up to failure (or any other anomaly) were
dropped.

-- Steve




* Re: Tracing Requirements (was: [RFC/Requirements/Design] h/w error reporting)
  2010-11-10 21:30                         ` Frederic Weisbecker
@ 2010-11-10 21:54                           ` Steven Rostedt
  2010-11-10 22:19                             ` Frederic Weisbecker
  2010-11-10 22:49                             ` Frederic Weisbecker
  2010-11-11  0:11                           ` Mathieu Desnoyers
  1 sibling, 2 replies; 50+ messages in thread
From: Steven Rostedt @ 2010-11-10 21:54 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra, Luck, Tony,
	linux-kernel, ying.huang, bp, tglx, akpm, mchehab,
	Arnaldo Carvalho de Melo, Arjan van de Ven

On Wed, 2010-11-10 at 22:30 +0100, Frederic Weisbecker wrote:

> > * Trace Format issues with Ftrace:
> > 
> > - Ftrace timestamps are saved as delta from previous event
> >   - Only works for tracing where preemption can be disabled, unusable for
> >     user-space tracing.
> 
> 
> 
> What is this userspace tracing? Is this userspace tracing made in kernel
> space?
> 
> (tag me confused)

tag team!

> 
> 
> 
> >   - Creates an artificial data dependency between events, leading to odd
> >     side-effects when dealing with nesting over tracer
> 
> 
> 
> I wouldn't comment that, I'm not very experienced with the ring buffer

Yeah, I discussed this with Mathieu. There's a pretty trivial fix for
this, but may require ABI breakage.

> 
> 
> 
> >     - 0 ns IRQ/SOFTIRQ handler duration side-effect
> 
> 
> 
> ditto.

If an interrupt (or softirq) preempts an event that is being recorded,
then the events recorded in that interrupt all get the same timestamp as
the event it preempted. This gives the impression that all of those
events happened at once.

Again, this is just a side effect and the fix is trivial, but it may
require ABI breakage.
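
A simplified model of that side effect, assuming only the outermost
writer stores a real delta and any event recorded while another write is
in progress stores a delta of 0. The names (struct delta_log, log_begin,
log_end, log_timestamp) are invented for this sketch; it is not the ring
buffer code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_EVENTS 8	/* hypothetical log capacity */

struct delta_log {
	uint64_t last_ts;		/* time of the last outermost event */
	uint64_t delta[MAX_EVENTS];	/* stored per-event time deltas */
	size_t nr;			/* number of events stored */
	int nesting;			/* > 0 while a write is in progress */
};

/* Start recording an event at time 'now'. */
static void log_begin(struct delta_log *log, uint64_t now)
{
	assert(log->nr < MAX_EVENTS);
	if (log->nesting++ == 0) {
		/* outermost writer: store the real delta */
		log->delta[log->nr++] = now - log->last_ts;
		log->last_ts = now;
	} else {
		/* nested writer (e.g. an interrupt): delta is forced to 0 */
		log->delta[log->nr++] = 0;
	}
}

/* Finish recording the event started by the matching log_begin(). */
static void log_end(struct delta_log *log)
{
	assert(log->nesting > 0);
	log->nesting--;
}

/* Reader side: rebuild the absolute time of event idx by summing deltas. */
static uint64_t log_timestamp(const struct delta_log *log, size_t idx)
{
	uint64_t ts = 0;

	assert(idx < log->nr);
	for (size_t i = 0; i <= idx; i++)
		ts += log->delta[i];
	return ts;
}
```

If an event starts at t=100 and is interrupted by a handler that logs its
entry at t=150 and its exit at t=250, the reader sees all three events at
t=100, so the handler appears to have taken 0 ns.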


> 
> If we need/want to cure that, then we need an:
> 
> => ABI breakage
> 
> 
> 
> > - Event size limited to one page
> 
> 
> 
> Perf too needs more (userspace stack dumps).

That was actually a decision made by Linus, but it is trivial to change:
nothing hard-coded in the design forces us to have page-sized
sub-buffers. I don't even think it would require an ABI breakage, except
that I think the tools I wrote (incorrectly) assumed it.


> 
> 
> 
> > - Ftrace event headers are still too large
> 
> 
> (described in the beginning)

Yep, they are large, but can be trimmed. This would require no ABI
breakage since these headers are also described in the event
formats. Thus the current tools should cope with the headers
changing. In fact they were designed to, since lock_depth was known
to be deprecated soon.

> 
> 
> 
> > - Handling of dynamically added instrumentation while trace is recorded is
> >   inexistent.
> 
> 
> 
> 
> I still don't understand this point

He's talking about tracing the tracepoints in a loaded module. We
currently have no way to add them while a trace is happening. The trace
formats do not exist and may not exist (if module is unloaded) when the
trace ends.

But who really loads and then unloads a module during tracing? Pretty
much all kernel developers cringe at the fact that modules get
unloaded ;-)


> 
> 
> Now I'm too tired to sum up all the points that seem not to be
> solved through an ABI extension :)

Me too, let's go shopping!

-- Steve




* Re: Tracing Requirements (was: [RFC/Requirements/Design] h/w error reporting)
  2010-11-10 21:54                           ` Steven Rostedt
@ 2010-11-10 22:19                             ` Frederic Weisbecker
  2010-11-10 22:49                             ` Frederic Weisbecker
  1 sibling, 0 replies; 50+ messages in thread
From: Frederic Weisbecker @ 2010-11-10 22:19 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra, Luck, Tony,
	linux-kernel, ying.huang, bp, tglx, akpm, mchehab,
	Arnaldo Carvalho de Melo, Arjan van de Ven

On Wed, Nov 10, 2010 at 04:54:19PM -0500, Steven Rostedt wrote:
> On Wed, 2010-11-10 at 22:30 +0100, Frederic Weisbecker wrote:
> > I wouldn't comment that, I'm not very experienced with the ring buffer
> 
> Yeah, I discussed this with Mathieu. There's a pretty trivial fix for
> this, but may require ABI breakage.



If you have this patch ready that fixes my ring buffer inexperience,
please send it right away. I might accommodate the self-ABI breakage,
depending on what it is...



* Re: Tracing Requirements (was: [RFC/Requirements/Design] h/w error reporting)
  2010-11-10 21:54                           ` Steven Rostedt
  2010-11-10 22:19                             ` Frederic Weisbecker
@ 2010-11-10 22:49                             ` Frederic Weisbecker
  1 sibling, 0 replies; 50+ messages in thread
From: Frederic Weisbecker @ 2010-11-10 22:49 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra, Luck, Tony,
	linux-kernel, ying.huang, bp, tglx, akpm, mchehab,
	Arnaldo Carvalho de Melo, Arjan van de Ven

On Wed, Nov 10, 2010 at 04:54:19PM -0500, Steven Rostedt wrote:
> On Wed, 2010-11-10 at 22:30 +0100, Frederic Weisbecker wrote:
> > >     - 0 ns IRQ/SOFTIRQ handler duration side-effect
> > 
> > 
> > 
> > ditto.
> 
> If an interrupt (or softirq) preempts the recorded trace, then events
> that are recorded in that interrupt all get the same time as the event
> it preempted. Giving us the assumption that all events happened at once.
> 
> Again, this is just a side effect and the fix is trivial. But may
> require ABI breakage to do so.



Right, now I recall I discussed that with Mathieu lately.


 
> 
> > 
> > If we need/want to cure that, then we need an:
> > 
> > => ABI breakage
> > 
> > 
> > 
> > > - Event size limited to one page
> > 
> > 
> > 
> > Perf too needs more (userspace stack dumps).
> 
> That was actually a decision made by Linus. But is trivial to change. As
> there's nothing hard coded about the design that forces us to have page
> size sub buffers. I don't even think that it would require an ABI
> breakage, except I think my tools I wrote (incorrectly) assumed it.


Ok.



> 
> > 
> > 
> > 
> > > - Ftrace event headers are still too large
> > 
> > 
> > (described in the beginning)
> 
> Yep, they are large, but can be trimmed. This would require no abi
> breakage since the these headers are also described in the event
> formats. Thus changing the current tools should cope with the headers
> changing. In fact they were designed too since the lock-depth was known
> to be deprecated soon.


Hmm, in practice this is an ABI breakage as we have scripts that rely on
the common_pid field for example. We can fix this, but older tools won't
work with new kernels.



> > > - Handling of dynamically added instrumentation while trace is recorded is
> > >   inexistent.
> > 
> > 
> > 
> > 
> > I still don't understand this point
> 
> He's talking about tracing the tracepoints in a loaded module. We
> currently have no way to add them while a trace is happening. The trace
> formats do not exist and may not exist (if module is unloaded) when the
> trace ends.
> 
> But who really loads and then unloads a module during tracing. As pretty
> much all kernel developers cringe at the fact that modules get
> unloaded ;-)



Hehe :)

Anyway this can be expressed through an ABI extension,
using a kind of lazy tracepoint registration or so.



* Re: Tracing Requirements (was: [RFC/Requirements/Design] h/w error reporting)
  2010-11-10 20:54                         ` Luck, Tony
  2010-11-10 21:06                           ` Steven Rostedt
@ 2010-11-10 22:51                           ` Mathieu Desnoyers
  2010-11-10 23:12                             ` Thomas Gleixner
  1 sibling, 1 reply; 50+ messages in thread
From: Mathieu Desnoyers @ 2010-11-10 22:51 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Frederic Weisbecker, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	linux-kernel, Huang, Ying, bp, tglx, akpm, mchehab,
	Arnaldo Carvalho de Melo, Arjan van de Ven

* Luck, Tony (tony.luck@intel.com) wrote:
> >- Perf does not support flight recorder tracing (concurrent read/write)
> >  - Sub-buffers are needed to support concurrent read/writes
> 
> When I hear somebody say "flight recorder" - I think of "black boxes"
> in airplanes that log data while the flight is running, and are only
> looked at offline later.  So I'm confused by the "concurrent read/write"
> requirement.
> 
> Perhaps you could explain the use cases of your "flight recorder",
> because it seems that the name doesn't fit exactly, and this is
> causing me (and maybe others) some confusion.

As Steven pointed out, the flight recorder buffers are set to overwrite the
oldest data when the buffer is filled. Therefore, the tracer can be used in
closed-circuit mode (without extracting the data out of the memory buffers) to
keep a trace of the recent events. The trace can be extracted when an
interesting condition (trigger) occurs.

A typical use-case is to let it run on an end-user machine to enhance
application crash diagnosis with tracing information, albeit using a very small
fraction of the system resources to do so.

The reason why "concurrent read/write" is required is that server-class machines
need to be able to gather trace data continuously to report/find/locate
problematic scenarios as they happen. This means we're not only interested in one
single failure, but rather in a whole set of erroneous/warning conditions that
need to be reported. Stopping tracing every time data is gathered is
inappropriate, because it would hide errors/warnings that happen
during data collection.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


* Re: Tracing Requirements (was: [RFC/Requirements/Design] h/w error reporting)
  2010-11-10 22:51                           ` Mathieu Desnoyers
@ 2010-11-10 23:12                             ` Thomas Gleixner
  2010-11-10 23:20                               ` Steven Rostedt
  2010-11-10 23:28                               ` Mathieu Desnoyers
  0 siblings, 2 replies; 50+ messages in thread
From: Thomas Gleixner @ 2010-11-10 23:12 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Luck, Tony, Frederic Weisbecker, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, linux-kernel, Huang, Ying, bp, akpm, mchehab,
	Arnaldo Carvalho de Melo, Arjan van de Ven

On Wed, 10 Nov 2010, Mathieu Desnoyers wrote:

> * Luck, Tony (tony.luck@intel.com) wrote:
> > >- Perf does not support flight recorder tracing (concurrent read/write)
> > >  - Sub-buffers are needed to support concurrent read/writes
> > 
> > When I hear somebody say "flight recorder" - I think of "black boxes"
> > in airplanes that log data while the flight is running, and are only
> > looked at offline later.  So I'm confused by the "concurrent read/write"
> > requirement.
> > 
> > Perhaps you could explain the use cases of your "flight recorder",
> > because it seems that the name doesn't fit exactly, and this is
> > causing me (and maybe others) some confusion.
> 
> As Steven pointed out, the flight recorder buffers are set to overwrite the
> oldest data when the buffer is filled. Therefore, the tracer can be used in
> close-circuit mode (without extracting the data out of the memory buffers) to
> keep a trace of the recent events. The trace can be extracted when an
> interesting condition (trigger) occurs.
> 
> A typical use-case is to let it run on an end-user machine to enhance
> application crash diagnosis with tracing information, albeit using a very small
> fraction of the system resources to do so.
> 
> The reason why "concurrent read/write" is required is for server-class machines
> which needs to continuously be able to gather trace data to report/find/locate
> problematic scenarios happening. This means we're not only interested in one
> single failure, but rather by a whole set of erroneous/warning conditions that
> need to be reported. Stopping tracing every time data is gathered is
> inappropriate, because it would hide errors/warnings that would be happening
> during data collection.

Aargh! Just because it can be done all in one with an insane amount of
complexity does not mean that it's an absolute requirement and a good
solution.

So if you want to have both the flight recorder crash documentation
and the ongoing monitoring then use two separate sessions with
separate modes and be done with it.

Cramming both into the same session is just insane.

The first rule is "Keep It Simple!".  Period.

Thanks,

	tglx


* Re: Tracing Requirements (was: [RFC/Requirements/Design] h/w error reporting)
  2010-11-10 23:12                             ` Thomas Gleixner
@ 2010-11-10 23:20                               ` Steven Rostedt
  2010-11-10 23:45                                 ` Thomas Gleixner
  2010-11-11 18:25                                 ` Ted Ts'o
  2010-11-10 23:28                               ` Mathieu Desnoyers
  1 sibling, 2 replies; 50+ messages in thread
From: Steven Rostedt @ 2010-11-10 23:20 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Mathieu Desnoyers, Luck, Tony, Frederic Weisbecker, Ingo Molnar,
	Peter Zijlstra, linux-kernel, Huang, Ying, bp, akpm, mchehab,
	Arnaldo Carvalho de Melo, Arjan van de Ven

On Thu, 2010-11-11 at 00:12 +0100, Thomas Gleixner wrote:

> Cramming both into the same session is just insane.

That just doubled the overhead of the tracer.

> 
> The first rule is "Keep It Simple!".  Period.

It's not that complex. Both Mathieu and I have implemented it. Really,
I've seen a lot more complex code. Just because it does not fit into the
CS101 course does not mean that it is totally complex.

-- Steve




* Re: Tracing Requirements (was: [RFC/Requirements/Design] h/w error reporting)
  2010-11-10 23:12                             ` Thomas Gleixner
  2010-11-10 23:20                               ` Steven Rostedt
@ 2010-11-10 23:28                               ` Mathieu Desnoyers
  2010-11-10 23:58                                 ` Thomas Gleixner
  1 sibling, 1 reply; 50+ messages in thread
From: Mathieu Desnoyers @ 2010-11-10 23:28 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Luck, Tony, Frederic Weisbecker, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, linux-kernel, Huang, Ying, bp, akpm, mchehab,
	Arnaldo Carvalho de Melo, Arjan van de Ven

* Thomas Gleixner (tglx@linutronix.de) wrote:
> On Wed, 10 Nov 2010, Mathieu Desnoyers wrote:
> 
> > * Luck, Tony (tony.luck@intel.com) wrote:
> > > >- Perf does not support flight recorder tracing (concurrent read/write)
> > > >  - Sub-buffers are needed to support concurrent read/writes
> > > 
> > > When I hear somebody say "flight recorder" - I think of "black boxes"
> > > in airplanes that log data while the flight is running, and are only
> > > looked at offline later.  So I'm confused by the "concurrent read/write"
> > > requirement.
> > > 
> > > Perhaps you could explain the use cases of your "flight recorder",
> > > because it seems that the name doesn't fit exactly, and this is
> > > causing me (and maybe others) some confusion.
> > 
> > As Steven pointed out, the flight recorder buffers are set to overwrite the
> > oldest data when the buffer is filled. Therefore, the tracer can be used in
> > close-circuit mode (without extracting the data out of the memory buffers) to
> > keep a trace of the recent events. The trace can be extracted when an
> > interesting condition (trigger) occurs.
> > 
> > A typical use-case is to let it run on an end-user machine to enhance
> > application crash diagnosis with tracing information, albeit using a very small
> > fraction of the system resources to do so.
> > 
> > The reason why "concurrent read/write" is required is for server-class machines
> > which needs to continuously be able to gather trace data to report/find/locate
> > problematic scenarios happening. This means we're not only interested in one
> > single failure, but rather by a whole set of erroneous/warning conditions that
> > need to be reported. Stopping tracing every time data is gathered is
> > inappropriate, because it would hide errors/warnings that would be happening
> > during data collection.
> 
> Aargh! Just because it can be done all in one with an insane amount of
> complexity does not mean that it's an absolute requirement and a good
> solution.
> 
> So if you want to have both the flight recorder crash documentation
> and the ongoing monitoring then use two separate sessions with
> separate modes and be done with it.
> 
> Cramming both into the same session is just insane.
> 

I'm afraid this is not what I proposed above. I'm open to using different tracing
sessions for different things. However, the server-class case needs to
continuously gather data so that "trace-shots" can be gathered when problems
occur. But if you hit two problems back to back, you don't want to lose the
trace leading to the second issue. Hence the motivation for supporting
concurrent reading while writing.

> The first rule is "Keep It Simple!".  Period.

I'd like to start with an implementation that skips some of these requirements
initially, but what I really think we need to figure out is how we organize our
ABIs to finally support these requirements.

Thanks,

Mathieu

> 
> Thanks,
> 
> 	tglx

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


* Re: Tracing Requirements (was: [RFC/Requirements/Design] h/w error reporting)
  2010-11-10 23:20                               ` Steven Rostedt
@ 2010-11-10 23:45                                 ` Thomas Gleixner
  2010-11-11 18:25                                 ` Ted Ts'o
  1 sibling, 0 replies; 50+ messages in thread
From: Thomas Gleixner @ 2010-11-10 23:45 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Mathieu Desnoyers, Luck, Tony, Frederic Weisbecker, Ingo Molnar,
	Peter Zijlstra, linux-kernel, Huang, Ying, bp, akpm, mchehab,
	Arnaldo Carvalho de Melo, Arjan van de Ven

On Wed, 10 Nov 2010, Steven Rostedt wrote:

> On Thu, 2010-11-11 at 00:12 +0100, Thomas Gleixner wrote:
> 
> > Cramming both into the same session is just insane.
> 
> That just doubled the overhead of the tracer.
> 
> > 
> > The first rule is "Keep It Simple!".  Period.
> 
> It's not that complex. Both Mathieu and I have implemented it. Really,
> I've seen a lot more complex code. Just because it does not fit into the
> CS101 course does not mean that it is totally complex.

I know you have implemented it and it's not about CS101, it's about
the insanity of what we have in the ftrace ringbuffer code. It does
not fulfil in any way the "Keep it Simple" requirement. Period.

And as I said earlier on IRC, you are trying to create the

  ZeroImpactFilteringMultiSessionFlightRecorderOverwriteFifoSplicePerfMmapTracer

which is a nice wet dream, but completely unrealistic to achieve in
one go.

When I yelled at you folks in Boston last week and suggested coming
up with a syscall for buffers and the corresponding configuration
interfaces along with a unified record format, I certainly did
not ask for the ZIFMSFROFSPMT thingy and a rewrite-the-world approach.

I told you back in 2008 that you need to think hard about the
interfaces and start with a reasonably simple implementation. Then
proceed from there.

The overall achievement so far is an ongoing ringbuffer pissing
contest, zero interfaces and lengthy explanations of which kind of tracer
madness is preferred by whom.

I don't call this progress. If you did not get the message last week,
then you have it in writing now to digest as long as it takes:

 Get your gear together and come up with sensible gradual approaches
 which bring us to a better progress ratio than 0/year.

Thanks,

	tglx


* Re: Tracing Requirements (was: [RFC/Requirements/Design] h/w error reporting)
  2010-11-10 23:28                               ` Mathieu Desnoyers
@ 2010-11-10 23:58                                 ` Thomas Gleixner
  2010-11-11  9:17                                   ` Ingo Molnar
  0 siblings, 1 reply; 50+ messages in thread
From: Thomas Gleixner @ 2010-11-10 23:58 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Luck, Tony, Frederic Weisbecker, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, linux-kernel, Huang, Ying, bp, akpm, mchehab,
	Arnaldo Carvalho de Melo, Arjan van de Ven

On Wed, 10 Nov 2010, Mathieu Desnoyers wrote:
> * Thomas Gleixner (tglx@linutronix.de) wrote:
> > > The reason why "concurrent read/write" is required is for server-class machines
> > > which needs to continuously be able to gather trace data to report/find/locate
> > > problematic scenarios happening. This means we're not only interested in one
> > > single failure, but rather by a whole set of erroneous/warning conditions that
> > > need to be reported. Stopping tracing every time data is gathered is
> > > inappropriate, because it would hide errors/warnings that would be happening
> > > during data collection.
> > 
> > Aargh! Just because it can be done all in one with an insane amount of
> > complexity does not mean that it's an absolute requirement and a good
> > solution.
> > 
> > So if you want to have both the flight recorder crash documentation
> > and the ongoing monitoring then use two separate sessions with
> > separate modes and be done with it.
> > 
> > Cramming both into the same session is just insane.
> > 
> 
> I'm afraid this is not what I proposed above. I'm open to use different tracing
> sessions for different things. However, the server-class case needs to
> continuously gather data so that "trace-shots" can be gathered when problems
> occur. But if you hit two problems back to back, you don't want to lose the
> trace leading to the second issue. Hence the motivation for supporting
> concurrent reading while writing.

Realistically, you are interested in the first one, simply because in
99.9% of the cases the second problem is caused by the first one. Do we
really need to care about the 0.1% which fall into the other category?

Not at all. Simply because the likelihood of those back to back events
_AND_ giving us the 0.1% case is approaching zero.

Of course you can argue with your academic hat on that I'm ignoring
that we might catch this rare "easter and xmas fall on the same day"
event, but I couldn't care less.

> I'd like to start with an implementation that skips some of these requirements
> initially, but what I really think we need to figure out is how we organize our
> ABIs to finally support these requirements.

I did not say that you should not think about this, but the progress
so far in more than TWO YEARS is exactly ZERO. And that's what I'm
concerned about.

Thanks,

	tglx


* Re: Tracing Requirements (was: [RFC/Requirements/Design] h/w error reporting)
  2010-11-10 21:30                         ` Frederic Weisbecker
  2010-11-10 21:54                           ` Steven Rostedt
@ 2010-11-11  0:11                           ` Mathieu Desnoyers
  2010-11-11 16:10                             ` Steven Rostedt
  1 sibling, 1 reply; 50+ messages in thread
From: Mathieu Desnoyers @ 2010-11-11  0:11 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Luck, Tony,
	linux-kernel, ying.huang, bp, tglx, akpm, mchehab,
	Arnaldo Carvalho de Melo, Arjan van de Ven

* Frederic Weisbecker (fweisbec@gmail.com) wrote:
> On Wed, Nov 10, 2010 at 03:23:16PM -0500, Mathieu Desnoyers wrote:
> > * Frederic Weisbecker (fweisbec@gmail.com) wrote:
> > > On Wed, Nov 10, 2010 at 02:00:45PM -0500, Steven Rostedt wrote:
> > > > On Wed, 2010-11-10 at 19:41 +0100, Ingo Molnar wrote:
> > > > 
> > > > > We'll need to embark on this incremental path instead of a rewrite-the-world thing. 
> > > > > As a maintainer my task is to say 'no' to rewrite-the-world approaches - and we can 
> > > > > and will do better here.
> > > > 
> > > > Thus you are saying that we stick to the status quo, and also ignore the
> > > > fact that perf was a rewrite-the-world from ftrace to begin with.
> > > 
> > > Perhaps you and Mathieu can summarize your requirements here and then explain
> > > why extending the current ABI wouldn't work. It's quite normal that people
> > > try to find a solution fully backward compatible in the first place. If
> > > it's not possible, fine, but then justify it.
> > 
> > Sure, here are the requirements my user-base have, followed by a listing of Perf
> > and Ftrace pain points, some of which are directly derived from their respective
> > ABIs, others partially caused by their implementation and partially caused by
> > their ABI.
> 
> Yeah, but the main point here is to explain why/how reaching those goals is not
> efficiently possible through an extension of the current ABI, in practice.
> 
> I'm going to try for some of them. Note that when I talk about ABI breakage,
> it actually means: create a new ABI and support the old one, schedule its
> deprecation in the long term.
> 
> Here we go:
> 
> 
> > 
> > - Low overhead is key
> >   - 150 ns per event (cache-hot)
> >   - Zero-copy (splice to disk/network, mmap for zero-copy in-place data
> >     analysis)
> 
> We could do splice in perf through an extension of the current ABI.
> The rest seems more about kernel internals.
> 
> => Abi breakage doesn't seem to be needed.
> 
> > - Compactness of traces
> >   - e.g. 96 bits per event (including typical 64-bit payload), no PID saved per
> >     event.
> 
> In perf we save the pid from two places:
> 
> - perf headers, see PERF_SAMPLE_TID
> - from the common fields of the trace events
> 
> Ftrace too for common fields.
> 
> It's useful to keep PERF_SAMPLE_TID for low overhead events (like
> perf little sampling). Otherwise we can certainly deduce the pid
> from context switch trace events.
> 
> But the pid in the trace event headers remains. We probably should
> get rid of that.
> 
> There are also the other common fields:
> 
> 	struct trace_entry {
> 		unsigned short		type;
> 
> 
> Type is needed by perf. If we have one buffer per event, we could
> retrieve which event we are dealing with. But if buffers are
> multiplexed per cpu, we need this.

Agreed, although 65536 type IDs is probably overkill for the common case.
I prefer to go for approaches with a header that contains a smaller number of
bits, and use an extended header for those rare cases that need it.

> 	unsigned char		flags;
> 
> Useful for ftrace, not for perf which will be able to save regs
> soon.

Also useless for lttng.

> 	unsigned char		preempt_count;
> 
> 
> Dunno. Should be optional.

Ditto.

> 	int			pid;
> 
> 
> Kill!

Yep :)

> 
> 
> 	int			lock_depth;
> 
> 
> Killed ;)

Finally ;)

> };
> 
> 
> 
> => Abi breakage needed. Can be made through an ABI extension though, but
> wouldn't scale in the long term.

Yep, you'd have to support the two formats side by side for a while anyway. So
we can definitely call it an ABI breakage rather than an extension.
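
To make the "small header plus rare extended header" idea concrete, here is a
minimal sketch. The layout is an assumption for illustration only (a 5-bit ID
and a 27-bit timestamp field in one 32-bit word, with ID 31 reserved as an
escape to a 16-byte extended header); it is not the Perf or Ftrace format, and
the names event_header_write()/event_header_read() are hypothetical. With a
64-bit payload, the compact case gives the 96 bits per event mentioned above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout, not an existing ABI: 5-bit event ID and 27-bit
 * timestamp field packed into one 32-bit word (host byte order). */
#define HDR_ID_BITS       5u
#define HDR_TS_BITS       27u
#define HDR_ID_ESCAPE     ((1u << HDR_ID_BITS) - 1u)  /* 31: extended header follows */
#define HDR_TS_MASK       ((1u << HDR_TS_BITS) - 1u)

#define HDR_COMPACT_SIZE  4u
#define HDR_EXTENDED_SIZE 16u  /* 4 (escape word) + 4 (full id) + 8 (full timestamp) */

/* Write a header into buf and return its size in bytes. The compact form is
 * used only when both the ID and the timestamp field fit. */
static size_t event_header_write(uint8_t *buf, uint32_t id, uint64_t ts)
{
	uint32_t word;

	if (id < HDR_ID_ESCAPE && ts <= HDR_TS_MASK) {
		word = (id << HDR_TS_BITS) | (uint32_t)ts;
		memcpy(buf, &word, sizeof(word));
		return HDR_COMPACT_SIZE;
	}

	/* Rare case: escape ID in the first word, full-width fields after it. */
	word = HDR_ID_ESCAPE << HDR_TS_BITS;
	memcpy(buf, &word, sizeof(word));
	memcpy(buf + 4, &id, sizeof(id));
	memcpy(buf + 8, &ts, sizeof(ts));
	return HDR_EXTENDED_SIZE;
}

/* Decode a header from buf and return the number of bytes consumed. */
static size_t event_header_read(const uint8_t *buf, uint32_t *id, uint64_t *ts)
{
	uint32_t word;

	memcpy(&word, buf, sizeof(word));
	if ((word >> HDR_TS_BITS) != HDR_ID_ESCAPE) {
		*id = word >> HDR_TS_BITS;
		*ts = word & HDR_TS_MASK;
		return HDR_COMPACT_SIZE;
	}
	memcpy(id, buf + 4, sizeof(*id));
	memcpy(ts, buf + 8, sizeof(*ts));
	return HDR_EXTENDED_SIZE;
}
```

What the timestamp field holds (low-order bits, a delta, ...) is left out of
the sketch on purpose; only the size trade-off is shown.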

> 
> > - Scalability to multi-core and multi-processor
> >   - Per-CPU buffers, time-stamp reading both scalable to many cpus *and* accurate
> 
> => Kernel internals

That's right. It's more in the trace-clock area. Let's keep this problem for
later, as we are focusing on the ABIs.

> > - Production-grade tracer reliability
> >   - Trace clock accuracy within 100ns, ordering can be inferred based on
> >     lock/interrupt handler knowledge, ability to know when ordering might be
> >     wrong.
> 
> => Seems to be kernel internals only. I may be missing your point though.

Yep, also trace-clock related. No effect on ABI.

> 
> 
> > - Flight recorder mode
> >   - Support concurrent read while writer is overwriting buffer data
> >     (Thomas Gleixner named these "trace-shots")
> 
> => Abi extension (overwriting mode)?

There were more details below on the impact of supporting flight recorder on the
trace format (using sub-buffers, etc). The ABI impact is more than just a flag,
although adding a flag is a good starting point. ;-)

> > - Support multiple trace sessions in parallel
> >   - Engineer + Operator + flight recorder for automated bug reports
> 
> => Doesn't seem to need ABI breakage.

This one could be done through ABI extension I guess.

> > - Availability of trace buffers for crash diagnosis
> >   - Save to disk, network, use kexec or persistent memory
> 
> Use splice for save to disk or network. But I don't understand the kexec
> thing.
> 
> => ABI extension (see splice)

This one is for when the kernel has crashed. So there is not much still available,
certainly not splice(). :) The idea is to keep the trace buffers around in the
system after an OOPS (or hard lockup) so that they can be gathered later on.

> > - Heterogeneous environment support
> >   - Portability
> 
> What is missing?

Portable bitfields come to mind. And no, it's not enough to just reverse the
byte order across endianness.

> >   - Distinct host/target environment support
> 
> ditto.
> This works well for perf and ftrace currently. Have you
> a specific problem in mind?

The setup is that the traces are gathered on telecom switches, and brought to a
host machine for viewing. The user has to deal with traces gathered from various
kernel versions.

I did push Steven to support cross-endianness and self-describing types in
Ftrace in the past, and I have to admit that a large part of this requirement is
met, which is good.

> >   - Management of multiple target kernel versions
> 
> We all try to ensure backward compatibility. It only gets broken
> because of unwanted regressions or scheduled deprecation in the
> long term.
> 
> >   - No dependency on kernel image to analyze traces
> >     (traces contain complete information)
> 
> Trace format.

Yep, this one requires that the trace metadata (currently exported through
debugfs) travel along with the trace stream. One way to do it would
be to have a small separate buffer to transport the metadata.

> 
> > - Live view/analysis of trace streams via the network
> >   - Impact on buffer flushing, power saving, idle, ...
> 
> kernel internals

Being able to set the periodic timer flush impacts the ABI (very slightly).

> 
> > - Synchronized system-wide (hypervisor, kernel and user-space) traces
> 
> kernel internals?

Yep, mainly. It has a big impact on the trace clock implementation.

> > - Scalability of analysis tools to very large data sets (> 10GB)
> 
> => Userspace internals

There are ways to lay out the trace data so that a userspace tool can dig through
it quickly. Therefore it impacts the trace format too.

> > - Standardization of trace format across analysis tools
> 
> Please detail.

I'm working for the Linux Foundation CELF group and Ericsson, with the
Multi-Core Association, to come up with a standardized trace format across
trace providers in the industry, so that we can use the same tools to analyze
traces taken from heterogeneous systems (hardware traces, OS traces, user-space
traces...).

Given the live analysis and low-overhead requirements, being able to generate
this trace format natively would be a great gain.

> > * Ring Buffer issues with Perf:
> > 
> > - Perf does not support flight recorder tracing (concurrent read/write)
> 
> Abi extension.

Nope, this one is an ABI breakage. The current mmap shared control head/tail
values used for synchronization between the kernel (writer) and user-space
(reader) do not allow concurrent read/write in flight recorder mode. We need,
at the very least, to call the kernel after we've finished reading a sub-buffer.

> >   - Sub-buffers are needed to support concurrent read/writes in flight recorder
> >     mode. Peter still has to convince me otherwise (if he cares).
> 
> ABI breakage needed

Yep.

> >   - Imply adding padding when an event does not fit in the current sub-buffer
> >     (ABI change). Note for Frederic: creating a single-subbuffer as large as the
> >     buffer does not solve this problem, because perf allows writing an event
> >     across the end of the buffer and its beginning. In a scheme where
> >     sub-buffers can be discarded, it makes it quite unreliable to try to figure
> >     out where partially overwritten events end.
> 
> Ok.
> 
> >   - Calling the kernel when finishing reading a sub-buffer is needed for flight
> >     recorder mode tracing. It is not possible with the mmap-head-tail-counter
> >     ABI Perf currently uses for reader-writer synchronization.
> 
> Why do you need to call the kernel for that?

Because we need to get exclusive access to the next sub-buffer (exchanging it
with the one we currently own). This operation is an atomic pointer CAS (or
exchange for ftrace), which should only be done by the kernel.
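
A minimal user-space model of that sub-buffer exchange, using C11 atomics, is
sketched below. It assumes the reader always owns one spare sub-buffer and
trades it for a slot in the ring with a single atomic exchange. The names
struct ring, ring_init() and ring_take_subbuf() are hypothetical, and the
handling of a writer that is still active in the chosen sub-buffer is
deliberately left out.

```c
#include <stdatomic.h>
#include <stddef.h>

#define NR_SUBBUF   4
#define SUBBUF_SIZE 4096

struct subbuf {
	size_t used;                      /* bytes of valid trace data */
	unsigned char data[SUBBUF_SIZE];
};

struct ring {
	_Atomic(struct subbuf *) slot[NR_SUBBUF]; /* sub-buffers visible to the writer */
	struct subbuf *reader_page;               /* spare, owned by the reader only */
};

/* pool must hold NR_SUBBUF + 1 sub-buffers: one per slot plus the spare. */
static void ring_init(struct ring *r, struct subbuf *pool)
{
	for (int i = 0; i < NR_SUBBUF; i++)
		atomic_init(&r->slot[i], &pool[i]);
	r->reader_page = &pool[NR_SUBBUF];
}

/* Trade the reader's spare for the sub-buffer in slot idx. After the exchange
 * the returned sub-buffer is no longer reachable through the ring, so the
 * reader can consume it while the writer keeps overwriting the other slots. */
static struct subbuf *ring_take_subbuf(struct ring *r, unsigned int idx)
{
	struct subbuf *spare = r->reader_page;
	struct subbuf *taken;

	spare->used = 0;
	taken = atomic_exchange(&r->slot[idx % NR_SUBBUF], spare);
	r->reader_page = taken;
	return taken;
}
```

In the scheme described above this exchange would be performed by the kernel
on behalf of the reader, which is why a shared head/tail counter pair alone is
not sufficient.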

> > - Perf is 5 times slower than Ftrace/Generic Ring Buffer Library/LTTng.
> >   - Partially due to implementation.
> 
> Kernel internals
> 
> >   - Partially due to large event size.
> 
> (See my previous comments about pid and so).
> 
> > 
> > * Trace Format issues with Perf:
> > 
> > - Perf event headers are too large
> 
> You can select them independently, except for trace events, for which
> I made comments before.
> 
> > - Handling of dynamically added instrumentation while trace is recorded is
> >   inexistent.
> 
> ???

This problem applies to both Ftrace and Perf. If you have the following
scenario:

1 - start tracing
2 - debugfs event descriptions are read
3 - load a module with tracepoints in it or add a dynamic kprobe
4 - hit the newly added events
5 - stop tracing

Then you end up being unable to parse the dynamically loaded information. That
is if the dynamically loaded instrumentation ends up being activated at all.

In a context where distributions load modules like KVM on demand, it does not
make sense to keep these events out of the trace just because they have been
dynamically loaded without the user's knowledge. The problem is twofold here:

1 - we need to be able to specify which tracepoints are to be activated
independently of their location (kernel/modules) and of whether or not they
currently exist.

2 - we need to be able to append to the event list (metadata) while the trace is
being gathered.

> > 
> > * Ring Buffer issues with Ftrace:
> > 
> > - Ftrace needs an internal API cleanup.
> >   - "peek" is an unnecessary API duplication which complicates everything down
> >     to the buffer-level.
> 
> kernel internals

Yep.

> > - Ftrace does not support cross-pages event writes
> >   - Limits event size to less than 4kB
> 
> kernel internals?

Well, it all depends on how much the ftrace tools expect the sub-buffer size to
be 4kB.

> > * Trace Format issues with Ftrace:
> > 
> > - Ftrace timestamps are saved as delta from previous event
> >   - Only works for tracing where preemption can be disabled, unusable for
> >     user-space tracing.
> 
> What is this userspace tracing? Is this userspace tracing made in kernel
> space?
> 
> (tag me confused)

Nope, this is userspace tracing performed all in userspace. However, if we want
to share the same trace format, then we need to come up with a trace format that
is not inherently tied to a scheme where preemption can be disabled.

> >   - Creates an artificial data dependency between events, leading to odd
> >     side-effects when dealing with nesting over tracer
> 
> I won't comment on that; I'm not very experienced with the ring buffer
> 
> >     - 0 ns IRQ/SOFTIRQ handler duration side-effect
> 
> ditto.
> 
> If we need/want to cure that, then we need an:
> 
> => ABI breakage

Yep.

> > - Event size limited to one page
> 
> Perf too needs more (userspace stack dumps).

Yep.

> > - Ftrace event headers are still too large
> 
> 
> (described in the beginning)
> 
> 
> 
> > - Handling of dynamically added instrumentation while trace is recorded is
> >   inexistent.
> 
> I still don't understand this point
> 

Explained above.

> Now I'm too tired to sum up all the points that seem not to be
> solved through an ABI extension :)

Thanks for the feedback!

Mathieu


-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


* Re: Tracing Requirements (was: [RFC/Requirements/Design] h/w error reporting)
  2010-11-10 23:58                                 ` Thomas Gleixner
@ 2010-11-11  9:17                                   ` Ingo Molnar
  2010-11-11 13:37                                     ` Mathieu Desnoyers
  0 siblings, 1 reply; 50+ messages in thread
From: Ingo Molnar @ 2010-11-11  9:17 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Mathieu Desnoyers, Luck, Tony, Frederic Weisbecker,
	Steven Rostedt, Peter Zijlstra, linux-kernel, Huang, Ying, bp,
	akpm, mchehab, Arnaldo Carvalho de Melo, Arjan van de Ven


* Thomas Gleixner <tglx@linutronix.de> wrote:

> On Wed, 10 Nov 2010, Mathieu Desnoyers wrote:

> > I'd like to start with an implementation that skips some of these requirements 
> > initially, but what I really think we need to figure out is how we organize our 
> > ABIs to finally support these requirements.

Note, there is an existing ABI in place, please use that. (It's highly extensible so 
it can support just about any ABI experiment that can even be turned into a smooth 
ABI replacement.)

I think Frederic just started iterating it - but if Mathieu and Steve help out it
will all happen faster.

> I did not say that you should not think about this, but the progress so far in
> more than TWO YEARS is exactly ZERO. And that's what I'm concerned about.

Yes, indeed that is the main problem i see as well.

Most of the problems listed in the various documents can be solved iteratively in 
the existing facilities. There is not a single requirement where Peter or me said 
'No, this cannot be done, go away!'. Each and every item was answered with: 'sure, 
we can do that' - or at worst with a 'do we really need it?'. Each and every item 
fits naturally into existing goals as well - so it's not like some different world 
view is being forced on anyone.

We only have one basic condition: please introduce these things step by step in the
existing ABI.

This is a must-have for tools, and there is another very important factor as well: a 
couple of items can have disadvantages beyond the claimed advantages, so we want to 
be able to evaluate the effects in isolation, test them and if needed, undo them. It 
will settle the 'do we really need this?' kind of sub-arguments for sub-features.

So being intelligent about it, being iterative is my only requirement to you guys:
you are free to change anything, go wild, but please make it iterative and don't try
to fork the tooling and developer community.

The time has come to not grow the list of requirements but to shrink it.

Thanks,

	Ingo


* Re: Tracing Requirements (was: [RFC/Requirements/Design] h/w error reporting)
  2010-11-11  9:17                                   ` Ingo Molnar
@ 2010-11-11 13:37                                     ` Mathieu Desnoyers
  0 siblings, 0 replies; 50+ messages in thread
From: Mathieu Desnoyers @ 2010-11-11 13:37 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thomas Gleixner, Luck, Tony, Frederic Weisbecker, Steven Rostedt,
	Peter Zijlstra, linux-kernel, Huang, Ying, bp, akpm, mchehab,
	Arnaldo Carvalho de Melo, Arjan van de Ven

Hi Ingo,

* Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> > On Wed, 10 Nov 2010, Mathieu Desnoyers wrote:
> 
> > > I'd like to start with an implementation that skips some of these requirements 
> > > initially, but what I really think we need to figure out is how we organize our 
> > > ABIs to finally support these requirements.
> 
> Note, there is an existing ABI in place, please use that. (It's highly extensible so 
> it can support just about any ABI experiment that can even be turned into a smooth 
> ABI replacement.)

Is there any way we could proceed without piling up work-arounds on top of Perf's ABI?
At this point, the only benefit of growing from the Perf ABI is comparable to
dragging a ball and chain all along. Yes, I agree that a smooth transition
should be the target, but I disagree on the means. I propose to come up with a
new ABI and eventually move the perf tools to this ABI, which is not a split in
the tracing developers community, but rather a unification.

Which do you prefer: a sequence of continuous ABI breakages or a single ABI
switch when the new ABI is ready? In terms of contributor and user pain, I
think the second option is much better. We'll have to keep the old Perf ABI
around for a while anyway to keep users and tool developers happy. As Linus
pointed out at KS, this ABI is now cast in stone. If we need to break it, the
only thing we can do is create a new one. He said that he will personally revert
any ABI-breaking tracing patch if he ever receives a single complaint from a
user. This is not a context in which we want to start playing games with the
existing ABIs.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


* Re: Tracing Requirements (was: [RFC/Requirements/Design] h/w error reporting)
  2010-11-11  0:11                           ` Mathieu Desnoyers
@ 2010-11-11 16:10                             ` Steven Rostedt
  2010-11-11 16:34                               ` Mathieu Desnoyers
  0 siblings, 1 reply; 50+ messages in thread
From: Steven Rostedt @ 2010-11-11 16:10 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Frederic Weisbecker, Ingo Molnar, Peter Zijlstra, Luck, Tony,
	linux-kernel, ying.huang, bp, tglx, akpm, mchehab,
	Arnaldo Carvalho de Melo, Arjan van de Ven

On Wed, 2010-11-10 at 19:11 -0500, Mathieu Desnoyers wrote:

> > There are also the other common fields:
> > 
> > 	struct trace_entry {
> > 		unsigned short		type;
> > 
> > 
> > Type is needed by perf. If we have one buffer per event, we could
> > retrieve which event we are dealing with. But if buffers are
> > multiplexed per cpu, we need this.
> 
> Agreed, although 65536 type IDs is probably overkill for the common case.
> I prefer to go for approaches with a header that contains a smaller number of
> bits, and use an extended header for those rare cases that need it.

Note, ftrace currently has over 600 event types. Unless we compact it
down into bits, using two bytes is fine.

-- Steve




* Re: Tracing Requirements (was: [RFC/Requirements/Design] h/w error reporting)
  2010-11-11 16:10                             ` Steven Rostedt
@ 2010-11-11 16:34                               ` Mathieu Desnoyers
  0 siblings, 0 replies; 50+ messages in thread
From: Mathieu Desnoyers @ 2010-11-11 16:34 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Frederic Weisbecker, Ingo Molnar, Peter Zijlstra, Luck, Tony,
	linux-kernel, ying.huang, bp, tglx, akpm, mchehab,
	Arnaldo Carvalho de Melo, Arjan van de Ven

* Steven Rostedt (rostedt@goodmis.org) wrote:
> On Wed, 2010-11-10 at 19:11 -0500, Mathieu Desnoyers wrote:
> 
> > > There are also the other common fields:
> > > 
> > > 	struct trace_entry {
> > > 		unsigned short		type;
> > > 
> > > 
> > > Type is needed by perf. If we have one buffer per event, we could
> > > retrieve which event we are dealing with. But if buffers are
> > > multiplexed per cpu, we need this.
> > 
> > Agreed, although 65536 type IDs is probably overkill for the common case.
> > I prefer to go for approaches with a header that contains a smaller number of
> > bits, and use an extended header for those rare cases that need it.
> 
> Note, ftrace currently has over 600 event types. Unless we compact it
> down into bits, using two bytes is fine.

I understand that overall the number of events overflows 256. However, this does
not mean they are typically all activated. Moreover, if we add multiple buffer
support, these events don't necessarily have to end up in the same buffer. This
numeric identifier is only there to distinguish between events ending in the
same buffer after all.

So the number of available events does not count as an argument for choosing the
typical number of bytes to represent the event IDs. The number of events
typically activated for a trace session does.
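
One way to picture this is a per-session table that hands out small IDs only
to the events actually enabled in that session. The sketch below is purely
illustrative: struct session_ids, session_enable_event() and the limits used
are assumptions, not part of Ftrace, Perf or LTTng.

```c
#include <stdint.h>
#include <string.h>

#define MAX_EVENT_TYPES   1024   /* hypothetical bound on global event types */
#define MAX_SESSION_IDS   255    /* session-local IDs 0..254 fit in one byte */
#define SESSION_ID_UNUSED 0xFF   /* marker: event not enabled in this session */

struct session_ids {
	uint8_t  local_of[MAX_EVENT_TYPES];   /* global type -> session-local ID */
	uint16_t global_of[MAX_SESSION_IDS];  /* session-local ID -> global type */
	unsigned int nr_enabled;
};

static void session_ids_init(struct session_ids *s)
{
	memset(s->local_of, SESSION_ID_UNUSED, sizeof(s->local_of));
	memset(s->global_of, 0, sizeof(s->global_of));
	s->nr_enabled = 0;
}

/* Enable a global event type in this session. Returns its session-local ID
 * (stable across repeated calls), or -1 if the type is out of range or the
 * session has run out of one-byte IDs. */
static int session_enable_event(struct session_ids *s, uint16_t type)
{
	if (type >= MAX_EVENT_TYPES)
		return -1;
	if (s->local_of[type] != SESSION_ID_UNUSED)
		return s->local_of[type];
	if (s->nr_enabled == MAX_SESSION_IDS)
		return -1;

	s->local_of[type] = (uint8_t)s->nr_enabled;
	s->global_of[s->nr_enabled] = type;
	return (int)s->nr_enabled++;
}
```

The reverse table (global_of) is the kind of information that would have to
travel with the trace metadata so that a reader can map the small IDs back to
event descriptions.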

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


* Re: Tracing Requirements (was: [RFC/Requirements/Design] h/w error reporting)
  2010-11-10 23:20                               ` Steven Rostedt
  2010-11-10 23:45                                 ` Thomas Gleixner
@ 2010-11-11 18:25                                 ` Ted Ts'o
  1 sibling, 0 replies; 50+ messages in thread
From: Ted Ts'o @ 2010-11-11 18:25 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Thomas Gleixner, Mathieu Desnoyers, Luck, Tony,
	Frederic Weisbecker, Ingo Molnar, Peter Zijlstra, linux-kernel,
	Huang, Ying, bp, akpm, mchehab, Arnaldo Carvalho de Melo,
	Arjan van de Ven

On Wed, Nov 10, 2010 at 06:20:27PM -0500, Steven Rostedt wrote:
> On Thu, 2010-11-11 at 00:12 +0100, Thomas Gleixner wrote:
> 
> > Cramming both into the same session is just insane.
> 
> That just doubled the overhead of the tracer.

At least when I've used ftrace for the "flight recorder" use case, I'm
not tracing as well.  What I do is enable a bunch of trace points,
maybe I've sprinkled in some "trace_printk()'s" into various kernel
code paths, and then I run the workload which locks up the kernel.
When it locks up, I've used sysrq-z to dump out the ftrace ring buffer,
and usually _exactly_ what I need to debug the lock up is waiting for
me in the ring buffer.

So this use case is incredibly useful, and I hope that whatever folks do
with the new-fangled API, "overwrite mode" is somehow supported.
Even if, for speed reasons, what you do is wait for the head to
overrun the tail, and then bump the tail up by 50% so that we lose half
the log (so that whatever expensive locking is necessary only happens
once in a while), I at least would find that quite acceptable.
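
As a rough illustration of that "drop half the log when the head catches up
with the tail" policy, here is a single-threaded sketch. The struct
flight_log type, the fixed-size integer entries and the tiny capacity are all
assumptions for the example; locking and variable-sized records are omitted.

```c
#include <stddef.h>
#include <stdint.h>

#define LOG_CAPACITY 8   /* entries; kept tiny for illustration */

struct flight_log {
	uint32_t entry[LOG_CAPACITY];
	size_t head;   /* total number of entries ever written */
	size_t tail;   /* index of the oldest entry still retained */
};

/* Append one entry. When the log is full, the tail is advanced by half the
 * capacity in one step, so the expensive path runs once per LOG_CAPACITY / 2
 * writes instead of on every write. */
static void flight_log_write(struct flight_log *l, uint32_t value)
{
	if (l->head - l->tail == LOG_CAPACITY)
		l->tail += LOG_CAPACITY / 2;
	l->entry[l->head % LOG_CAPACITY] = value;
	l->head++;
}

/* Number of entries currently retained. */
static size_t flight_log_count(const struct flight_log *l)
{
	return l->head - l->tail;
}

/* Return the i-th retained entry, counting from the oldest (i < count). */
static uint32_t flight_log_peek(const struct flight_log *l, size_t i)
{
	return l->entry[(l->tail + i) % LOG_CAPACITY];
}
```

The cost of this policy is that up to half of the most recent history is
thrown away at each overrun, which is the trade-off described above.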

The other feature/requirements request I would make is that there
should be a way that common kernel abstractions, such as converting a
dev_t to either a MAJOR/MINOR number pair, or to a device name, be
made available.  For now I've changed the tracepoints to translate
MAJOR/MINOR and drop integers into the ring buffer, and a generic
workaround in the future is to always drop strings into the ring
buffer instead of allowing the translation to be done in TP_printk
(which doesn't work for perf; it causes the userspace perf client to
fall over and die, without even skipping the problematic tracepoint
record --- boo, hiss).  But that can be relatively inefficient,
because we're now having to drop potentially fairly large text strings
into the ring buffer, because of limitations that perf has in its output
transformation step.
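
The integer-based workaround can be sketched as follows: the record stores the
major and minor numbers as plain integers at trace time, and formatting is left
to whoever reads the record. The helpers below are stand-alone stand-ins that
assume the 20-bit minor split of the kernel's internal dev_t; struct
blk_event_record and the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-ins for the kernel's MAJOR()/MINOR()/MKDEV() helpers, assuming the
 * in-kernel dev_t split of 12 bits major and 20 bits minor. */
#define DEV_MINORBITS 20u
#define DEV_MINORMASK ((1u << DEV_MINORBITS) - 1u)

static inline uint32_t dev_major(uint32_t dev) { return dev >> DEV_MINORBITS; }
static inline uint32_t dev_minor(uint32_t dev) { return dev & DEV_MINORMASK; }
static inline uint32_t dev_make(uint32_t ma, uint32_t mi)
{
	return (ma << DEV_MINORBITS) | mi;
}

/* Hypothetical fixed-size record: two integers instead of a device-name
 * string, so the reader needs no kernel helper to print it. */
struct blk_event_record {
	uint32_t major;
	uint32_t minor;
	uint64_t sector;
};

/* Done at trace time: translate the dev_t once and store plain integers. */
static void blk_event_fill(struct blk_event_record *rec, uint32_t dev,
			   uint64_t sector)
{
	rec->major = dev_major(dev);
	rec->minor = dev_minor(dev);
	rec->sector = sector;
}

/* Done at read time, in user space: only standard printf conversions. */
static int blk_event_format(const struct blk_event_record *rec, char *out,
			    size_t len)
{
	return snprintf(out, len, "dev %u:%u sector %llu",
			(unsigned int)rec->major, (unsigned int)rec->minor,
			(unsigned long long)rec->sector);
}
```

This keeps the record at 16 bytes regardless of how long the device name would
have been, at the price of the reader seeing numbers rather than names.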

I know that because perf is doing its output transformation in
userspace, there are fundamental limitations about what it can do.
But it would be nice if it could be expanded at least _somewhat_, and
either way, there needs to be some clear documentation about what it
can and cannot accept.  And if these limitations mean that I should
just simply continue using ftrace, and not use perf, it would be nice
if the tracepoints I create that work with ftrace don't cause perf to
just die horribly when it tries to parse them.

						- Ted

