* Re: [RFC] Common Trace Format Requirements (v1.3)
From: Mathieu Desnoyers @ 2010-09-01 22:17 UTC (permalink / raw)
  To: Spear, Aaron
  Cc: Frank Ch. Eigler, linux-kernel, ltt-dev, Dominique Toupin,
	Martinez Ed-R12594, George Stephen-RVGG20, Bruce Ashfield,
	Burton, Felix, Maisonneuve, Philippe, McDermott, Andrew,
	Frederic Weisbecker, Jan Kiszka, Michael Rubin, Michel Dagenais,
	Steven Rostedt, Tim Bird, Joe Green, Tasneem Brutch - SISA,
	Peter Zijlstra, Ingo Molnar

(sorry, I had to remove a few CCs to avoid triggering the LKML spam filter)

* Spear, Aaron (aaron_spear@mentor.com) wrote:
> > > > > If the metadata is not given in a standard form, then how do
> > > > > envision general trace analysis tools (those not hard-coded for
> > > > > some particular trace source) working?
> > >
> > > > We can have a metadata format selector at the beginning of the
> > > > metadata section, with reserved IDs for metadata formats. We can
> > > > think of a format generated natively by TRACE_EVENT(), a format
> > > > generated in some sort of XML. The trace analyzer would need the
> > > > metadata format parser in order to be able to read the trace.
> > >
> > > If one hopes that such tools should be able to consume data with
> > > drifting versions of TRACE_EVENT() or XML or whatnot, they
> themselves
> > > had better be fixed/standardized.  Otherwise, the data will not be
> > > self-describing, and all the tools will have to chase kernel/etc.
> > > versions.
> > 
> > The tool can bail out if it detects a type it does not know. So a tool
> > upgrade would be required.
> > 
> > Also the standard plans to have a major/minor trace version in there,
> > so we can explicitly break compatibility if it is required. It might
> > end up being much better than trying to support backward compatibility
> > forever. Then it would be up to the tool implementors to choose if they
> > want to provide backward compatibility for older trace versions or not.
> 
> What we have done in our own internal format (which I distributed for
> reference some time ago) is have a fixed set of fundamental types that
> can be inserted into the trace log (integers of various bit sizes,
> strings, Booleans, etc).  An event in the log is a collection of these
> fundamental types then,  (e.g. a semaphore post event might have a 16
> bit unsigned event id, a 32 bit unsigned semaphore handle, 32 bit
> unsigned current thread id, 64 bit timestamp, ...).  So at a base layer
> there can be a trace analyzer that simply knows how to display
> attributes of events but doesn't really know what they MEAN.

I agree that we have to describe types in the metadata, although I fear the
definitions of "basic types" differ widely depending on the execution context.
For instance, the Linux kernel just doesn't care about floating point values.
So I'd be tempted to treat all types in the same way, which means that all types
would be "optional". It's only if they are present in the trace that their
description becomes required.
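
To make this concrete, here is a minimal sketch, assuming a hypothetical
analyzer-side representation, of how a type could be described uniformly in the
metadata (size, alignment, signedness, byte order), with no type treated as
implicitly known:

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical analyzer-side view of one type description parsed from the
 * metadata; every property is explicit in the trace, so nothing is assumed
 * from the "basic types" of a given execution context. */
enum type_class { TYPE_INT, TYPE_FLOAT, TYPE_STRING, TYPE_ENUM };

struct type_desc {
        enum type_class tclass;
        uint16_t size_bits;     /* e.g. 1..64 for integers */
        uint16_t align_bits;    /* natural or packed alignment */
        bool is_signed;         /* meaningful for TYPE_INT */
        bool big_endian;        /* byte order recorded per type */
};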

>  On top of
> that you can layer plugin functionality in a trace analyzer that
> understands what a given type of event means and as such what sort of
> thing makes the most sense to do in a presentation.  Scheduling is a
> good example.  If there are scheduler events that indicate context
> switches, then if the trace analyzer knows this, it can use that
> information to help it draw a nice Gantt chart as opposed to a linear
> graph of simple event data.
> 
> I have been hoping that as a part of this standard that we will end up
> with common schemas for meta-data that are shared by various use cases,
> so that things that are common concepts to many OS'es for example could
> in fact be shared and thus you could have a trace analyzer that works
> for OS'es that have similar models and generate data in a compatible
> schema.  In practice this may be difficult, but it seems like a noble
> goal to aim for.  If we can get a fixed schema for the fundamental types
> that will be a step in the right direction. 

Those are very good ideas. Here is what I plan to add in the next RFC version.
Adding a per-event taxonomy to the metadata would allow automated creation of
a state machine in the trace analyzer that keeps track of state updates within
the taxonomy tree. So we can automate this tedious task rather than creating
custom plugins that each need deep knowledge about the events.

About the items below, I must remind everyone that this information is only
described _once per trace_ in the metadata section.

1.2) Extensions (optional capabilities)
  ...
  - Metadata
  ...

  - Optional per-event "current state tracking" information.
    Described in a file-system path-like taxonomy with an additional []
    operator that indicates a lookup by value, e.g.:

    * For events in the trace stream that update the current state based only on
      information known from the context (derived from either the per-section or
      the per-event context information):

    E.g., associated with a scheduling change event:

    "cpu[section/cpu]/thread = field/next_pid"
       Updates the current value of the current section's cpu "thread" attribute
       (e.g. currently running thread).

    E.g., associated with a system call:

    "thread[cpu[section/cpu]/thread]/syscall[field/syscall_id]/id
       = field/syscall_id"

       Updates the state value of the current thread "syscall" attribute.

    * For events in the trace stream targeting a path that depends on other
      fields of that same event (common for a full system state dump at
      trace start):

    E.g., associated with a thread listing event:
    "thread[field/pid]/pid = field/pid"

    E.g., associated with a thread memory maps listing event:
    "thread[field/pid]/mmap[field/address]/address = field/address"
    "thread[field/pid]/mmap[field/address]/end = field/end"
    "thread[field/pid]/mmap[field/address]/flags = field/flags"
    "thread[field/pid]/mmap[field/address]/pgoff = field/pgoff"
    "thread[field/pid]/mmap[field/address]/inode = field/inode"

    All per-event context information (e.g. repeating the current PID and CPU
    for each event) can be represented with this taxonomy, e.g., in the
    section description:

    "section/pid = field/pid"
    "section/cpu = field/cpu"


Thanks !

Mathieu

> 
> Best regards,
> Aaron Spear
> Tools Architect, Mentor Graphics

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


* Re: [ltt-dev] [RFC] Common Trace Format Requirements (v1.3)
From: Mathieu Desnoyers @ 2010-09-01 22:32 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: linux-kernel, ltt-dev, Dominique Toupin, Spear, Aaron,
	Martinez Ed-R12594, George Stephen-RVGG20, Bruce Ashfield,
	Burton, Felix, Maisonneuve, Philippe, McDermott, Andrew,
	Frederic Weisbecker, Jan Kiszka, Michael Rubin, Frank Ch. Eigler,
	Michel Dagenais, Steven Rostedt, Tim Bird, Joe Green,
	Tasneem Brutch - SISA, Peter Zijlstra, Ingo Molnar,
	Prerna Saxena

* Stefan Hajnoczi (stefanha@gmail.com) wrote:
> On Tue, Aug 31, 2010 at 3:50 PM, Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
> > Hi,
> >
> >  The goal of the present document is to propose a trace format that will suit
> > the needs of the embedded, telecom, high-performance and kernel communities. It
> > starts by doing an overview of the trace format, tracer and trace analyzer
> > requirements to consider for a Common Trace Format proposal.
> >
> > This is a follow-up on what I presented at LinuxCon 2010:
> > "Efficient Trace Format for System-Wide Tracing"
> >
> > http://www.efficios.com/linuxcon2010 (slides)
> >
> > Feedback is welcome!
> >
> > Thanks,
> >
> >     Mathieu
> >
> >
> > This document includes requirements from:
> >
> > Steven Rostedt <rostedt@goodmis.org>
> > Dominique Toupin <dominique.toupin@ericsson.com>
> > Aaron Spear <aaron_spear@mentor.com>
> > Philippe Maisonneuve <Philippe.Maisonneuve@windriver.com>
> > Felix Burton <Felix.Burton@windriver.com>
> > Andrew McDermott <Andrew.McDermott@windriver.com>
> > Multi-Core Association Tool Infrastructure Workgroup
> >   (http://www.multicore-association.org/workgroup/tiwg.php)
> >
> >
> > * Trace Format Requirements
> >
> >  These are requirements on the trace format itself. This section discusses the
> > layout of data in the trace, explaining the rationale behind the choices. The
> > rationale for the trace format choices may refer to the tracer and trace
> > analyzer requirements stated below. This section starts by presenting the common
> > trace model, and then specifies the requirements of an instance of this model
> > specifically tailored to efficient kernel- and user-space tracing requirements.
> >
> >
> > 1) Architecture
> >
> > This high-level model is meant to be an industry-wide, common model, fulfilling
> > the tracing requirements. It is meant to be application-, architecture-, and
> > language-agnostic.
> >
> > 1.1) Core model
> >
> > - Event
> >
> > An event is an information record contained within the trace.
> >
> >  - Events must be in physical order within a section
> 
> What does "physical order" mean?  Is this up to the tracer and will
> usually be chronological order?

Yep, the intent is that they appear in the section as a sequence, where the
notion of "order" (be it chronological, sequence-counter-based or simply
position-based) is strongly tied to the physical order of the events within the
section.

> 
> >  - Event type (numeric identifier: maps to metadata)
> >    - Unique ID assigned within a section.
> >  - Event payload
> >    - Variable event size
> >    - Size limitations: maximum event size should be configurable.
> >    - Size information available through metadata.
> >    - Support various data alignment for architectures, standards, and
> >      languages:
> >      - Natural alignment of data for architectures with slow non-aligned
> >        writes.
> >      - Packed layout of headers for architectures with efficient non-aligned
> >        writes.
> >
> > - Section
> >
> > A section within the trace can be thought of as a section in an ELF
> > binary. It contains a sequence of physically contiguous event records.
> >
> >  - Multi-level section identifier
> >    - e.g.: section name / CPU number
> >  - Contains a subset of event types
> 
> Is there support for interleaving sections?  When traces go to
> userspace or over the network it is possible to multiplex, but what
> about the layout of sections in the file format itself?
> 
> For example:
> 
> events/cpu0:
>  * sys_enter
> events/cpu1:
>  * sys_enter
>  * sys_exit
> events/cpu0: <--- cpu0 again, with cpu1 interleaved
>  * sys_exit
> 
> I think this should be possible so a tracer can write the file out
> directly and does not need to buffer sections until they are "done"
> before writing them out.
> 
> Perhaps the "trace stream" concept below in the Linux section covers
> this.  I'm not sure.

I used the parallel with ELF sections to conceptually demonstrate the idea of a
trace section, but the similarity stops there. A trace is peculiar in that we
have to continuously append to each section, and we ideally need no interaction
between sections (e.g. no locking on the same file). In terms of implementation,
I would strongly recommend keeping one file per section rather than putting all
sections within a single file. On a heavily loaded system, this would even allow
saving different sections to different storage devices in parallel.
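
A minimal sketch of that layout, assuming a hypothetical naming scheme with one
file per (section, CPU) pair so that each per-CPU writer appends without any
cross-CPU locking:

#include <fcntl.h>
#include <stdio.h>

/* Hypothetical helper: open the append-only file backing one section for one
 * CPU, e.g. "trace/events_cpu0". The path layout is illustrative only. */
static int open_section_file(const char *trace_dir, const char *section, int cpu)
{
        char path[4096];

        snprintf(path, sizeof(path), "%s/%s_cpu%d", trace_dir, section, cpu);
        return open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
}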

I'll update the next RFC taking your comments into account.

Thanks for the feedback!

Mathieu


-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


* [RFC] Common Trace Format Requirements (v1.3)
From: Mathieu Desnoyers @ 2010-09-01 11:29 UTC (permalink / raw)
  To: linux-kernel, ltt-dev

(I am re-sending this trace format requirements document to LKML, because it
looks like the overly long recipient list triggered their spam filter.)

Hi,

  The goal of the present document is to propose a trace format that will suit
the needs of the embedded, telecom, high-performance and kernel communities. It
starts by doing an overview of the trace format, tracer and trace analyzer
requirements to consider for a Common Trace Format proposal.

This is a follow-up on what I presented at LinuxCon 2010:
"Efficient Trace Format for System-Wide Tracing"

http://www.efficios.com/linuxcon2010 (slides)

Feedback is welcome!

Thanks,

     Mathieu


This document includes requirements from:

Steven Rostedt <rostedt@goodmis.org>
Dominique Toupin <dominique.toupin@ericsson.com>
Aaron Spear <aaron_spear@mentor.com>
Philippe Maisonneuve <Philippe.Maisonneuve@windriver.com>
Felix Burton <Felix.Burton@windriver.com>
Andrew McDermott <Andrew.McDermott@windriver.com>
Multi-Core Association Tool Infrastructure Workgroup
   (http://www.multicore-association.org/workgroup/tiwg.php)


* Trace Format Requirements

  These are requirements on the trace format itself. This section discusses the
layout of data in the trace, explaining the rationale behind the choices. The
rationale for the trace format choices may refer to the tracer and trace
analyzer requirements stated below. This section starts by presenting the common
trace model, and then specifies the requirements of an instance of this model
specifically tailored to efficient kernel- and user-space tracing requirements.


1) Architecture

This high-level model is meant to be an industry-wide, common model, fulfilling
the tracing requirements. It is meant to be application-, architecture-, and
language-agnostic.

1.1) Core model

- Event

An event is an information record contained within the trace.

  - Events must be in physical order within a section
  - Event type (numeric identifier: maps to metadata)
    - Unique ID assigned within a section.
  - Event payload
    - Variable event size
    - Size limitations: maximum event size should be configurable.
    - Size information available through metadata.
    - Support various data alignment for architectures, standards, and
      languages:
      - Natural alignment of data for architectures with slow non-aligned
        writes.
      - Packed layout of headers for architectures with efficient non-aligned
        writes.
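
  As an illustration of these two layouts (not a proposed wire format), a
  hypothetical event header could be declared either naturally aligned or
  packed:

  #include <stdint.h>

  struct event_header_aligned {        /* naturally aligned: typically 16 bytes */
          uint64_t timestamp;
          uint16_t event_id;
          uint16_t size;               /* variable event size, from metadata */
          /* trailing padding inserted by the compiler */
  };

  struct event_header_packed {         /* packed: 12 bytes, no padding */
          uint64_t timestamp;
          uint16_t event_id;
          uint16_t size;
  } __attribute__((packed));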

- Section

A section within the trace can be thought of as a section in an ELF binary. It
contains a sequence of physically contiguous event records.

  - Multi-level section identifier
    - e.g.: section name / CPU number
  - Contains a subset of event types

- Metadata

Metadata is the description of the environment in which the application runs.
It defines the basic types of the domains and the mapping between each event
and the types of its fields. The metadata scope (what it describes) is a whole
trace, which consists of one or more sections.

The metadata can either be contained in the trace (better usability for telecom
scenarios) or added alongside the trace data by a separate module (for DSP
scenarios). In the latter case, metadata checksumming and/or versioning can be
used to ensure consistency between sections and metadata.

  - Trace version (a version-check sketch follows this list)
    - Major number (increment breaks compatibility)
    - Minor number (increment keeps compatibility)
  - Describe the invariant properties of the environment where the trace was
    generated.
    - Contain unique domain identifier (kernel, process ID and timestamp,
      hypervisor)
    - Describes the runtime environment.
    - Report target bitness
    - Report target byte order
    - Data types (see section 1.2 Extensions below)
  - Architecture-agnostic (text-based)
  - Ought to be parsed with a regular grammar
  - Mapping to event types, e.g. (section, event) tuples, with:
      ( section identifier, event numerical identifier )
  - Description of event context fields (per section)
  - Can be streamed along with the trace as a trace section
  - Support dynamic addition of events while trace is active (module loading)
  - Metadata section should be efficient and reliable. Additional information
    could be kept in separate sections, outside of metadata.
  - Metadata description language not imposed by standard
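
  As referenced above, a minimal sketch of the intended major/minor
  compatibility rule, using hypothetical names for the parsed version fields:

  #include <stdbool.h>
  #include <stdint.h>

  #define SUPPORTED_MAJOR 1     /* illustrative value */

  /* A major number we do not know breaks compatibility; any minor number
   * within a known major stays readable. */
  static bool trace_version_supported(uint32_t major, uint32_t minor)
  {
          (void)minor;
          return major == SUPPORTED_MAJOR;
  }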


1.2) Extensions (optional capabilities)

- Event
  - Optional context (thread id, virtual cpu id, execution mode (irq/bh/thread),
                      CPU/board/node id, event ordering identifier, timestamp,
                      current hardware performance counter information, event
                      size)
    - Optional ordering capability across sections:
      - Ordering identifier required for trace containing many event streams
      - Either timestamp-based or based on unique sequence numbers
    - Optional time-flow capability: per-event timestamps

- Section
  - Optional context applying to all events contained in that section
    (thread id, virtual cpu id, execution mode (irq/bh/thread), CPU/board/node
     id)
  - Support piece-wise compression
  - Support checksumming

- Metadata
  - Execution environment information
    - Data types available: integers, strings, arrays, sequences, floats,
      structures, maps (aka enumerations), bitfields, ...
      - Describe type alignment.
      - Describe type size.
      - Describe type signedness.
      - Other type examples:
        - gcc "vector" type. (packed data)
          http://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html
        - gcc complex type (e.g. complex short, float, double...)
        - gcc _Fract and _Accum http://gcc.gnu.org/wiki/FixedPointArithmetic
          http://gcc.gnu.org/onlinedocs/gcc/Fixed_002dPoint.html
  - Describes trace capabilities, for instance:
    - Event ordering across sections
    - Time flow information
      - In event header
      - Or possibly payload of pre-specified sections and/or events
    - Ability to perform event ordering across traces


2) Linux-specific Model

   (Linux instance, specific to the reference implementation)

Instance of the model specifically tailored to the requirements of the Linux
kernel and of C programs/libraries. Allows either packed events or events
aligned following the ISO C standard.

- Event
  - Payload
    - Initially support ISO C naturally aligned and packed type layouts.

- Each section represented as a trace stream (typically 1 trace stream per cpu
  per section) to allow the tracer to easily append to these sections.
  Identifier: section name / CPU ID
  Each section has a CPU ID identifier in its context information.

- Trace stream
  - Should have no hard-coded limit on size of a file generated by saving the
    trace stream (64 bit file position is fine)
  - The lost-event count should be localized: it should apply to a limited time
    interval and to a tracefile, hence to a specific section, so the trace
    analyzer can provide basic information about what kinds of events were lost
    and where they were lost in the trace.
  - Should be optionally compressible piece-wise.
  - Optional checksum on the sub-buffer content (except sub-buffer header), with
    a selection of checksum algorithms.
  - Sub-buffer headers should contain a sequence number to help UDP streaming
    reassembly.
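
  As an illustration only (the standard does not impose this layout), a
  hypothetical sub-buffer header carrying the fields listed above could look
  like:

  #include <stdint.h>

  struct subbuf_header {
          uint64_t seq;            /* sequence number for UDP reassembly */
          uint32_t checksum;       /* optional checksum over the payload */
          uint32_t checksum_algo;  /* selected algorithm, 0 = none */
          uint32_t events_lost;    /* localized lost-event count */
          uint32_t payload_size;   /* bytes of event data that follow */
  };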

- Compact representation
  - Minimize the overhead in terms of disk/network/serial port/memory bandwidth.
  - A compact representation can keep more information in smaller buffers,
    thus needs less memory to keep the same amount of information around.
    Also useful to improve cache locality in flight recorder mode.

- Natural alignment of headers for architectures with slow non-aligned writes.

- Packed layout of headers for architectures with efficient non-aligned writes.

- Should have a 1 to 1 mapping between the memory buffers and the generated
  trace files: allows zero-copy with splice().
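
  A minimal sketch of the zero-copy path this enables, assuming the tracer
  exposes each memory buffer through a pipe file descriptor (the descriptor
  names are hypothetical):

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <unistd.h>

  /* Move 'len' bytes of sub-buffer data from the buffer pipe to the trace
   * file without copying them through user space. */
  static ssize_t flush_subbuffer(int buf_pipe_fd, int trace_file_fd, size_t len)
  {
          return splice(buf_pipe_fd, NULL, trace_file_fd, NULL, len,
                        SPLICE_F_MOVE);
  }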

- Use target endianness

- Portable across different target (tracer) / host (analyzer) architectures

- It should be possible to generate metadata from descriptions written in header
  files (extraction with C preprocessor macros is one solution).
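
  One purely illustrative way to do this with preprocessor macros (a sketch,
  not the kernel's TRACE_EVENT() mechanism): describe each event once, then
  expand the same description either into the C structure written by the
  tracer or into a metadata fragment emitted at trace start.

  #include <stdint.h>

  /* Single description of the fields of a hypothetical example event. */
  #define SAMPLE_EVENT_FIELDS(F)   \
          F(uint32_t, pid)         \
          F(uint64_t, timestamp)

  /* Expansion 1: the C layout used when writing the event. */
  #define AS_STRUCT_FIELD(type, name)   type name;
  struct sample_event { SAMPLE_EVENT_FIELDS(AS_STRUCT_FIELD) };

  /* Expansion 2: a text fragment describing the same fields for the metadata. */
  #define AS_METADATA(type, name)       #type " " #name "; "
  static const char sample_event_metadata[] = SAMPLE_EVENT_FIELDS(AS_METADATA);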


* Requirements on the Tracers

Higher-level tracer requirements that seem appropriate to support some of the
trace format requirements stated above.


*Fast*
- Low-overhead
- Handle large trace throughput (multi-GB per minute)
- Scalable to high number of cores
  - Per-cpu memory buffers
  - Scalability and performance-aware synchronization

*Compact*
- Environments without a filesystem
  - Need to buffer events in target RAM and send them in groups to a host for
    analysis
- Ability to tune the size of buffers and transmission medium to minimize the
  impact on the traced system.
- Streaming (live monitoring)
  - Through sockets (USB, network)
  - Through serial ports
  - There must be a related protocol for streaming this event data.

- Availability of flight recorder (synonym: overwrite) mode
  - Exclusive ownership of reader data.
  - Buffer size should be per group of events.

- Output trace to disk
- Trace buffers available in crash dump to allow post-mortem analysis
- Fine-grained timestamps

- Lockless (lock-free, ideally wait-free; aka starvation-free)
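
  For illustration, a minimal sketch of the kind of lock-free primitive this
  implies on the write path: space reservation in a per-CPU buffer with a
  single atomic fetch-add (names are hypothetical; sub-buffer boundaries,
  commit and overwrite handling are left out):

  #include <stdatomic.h>
  #include <stddef.h>
  #include <stdint.h>

  struct percpu_buf {
          _Atomic uint64_t write_pos;  /* monotonically increasing offset */
          size_t size;                 /* buffer size (power of two) */
          char *data;
  };

  /* Reserve 'len' bytes; the returned offset (modulo size) is where the event
   * may be written. No locks, no blocking. */
  static uint64_t buf_reserve(struct percpu_buf *buf, size_t len)
  {
          return atomic_fetch_add_explicit(&buf->write_pos, len,
                                           memory_order_relaxed);
  }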

- Buffer introspection: event written, read and lost counts.

- Ability to iteratively narrow the level of detail and the traced time window,
  following a high-level "state" overview provided by an initial trace that
  collects everything.

- Support kernel module instrumentation

- Standard way(s) for a host to upload/access trace log data from a
  target/JTAG device/simulator/etc.

- Conditional tracing in kernel space.

- Compatibility with the power management subsystem (trace collection shall not
  be a reason for waking up a device)

- Well defined and stable trace configuration and control API across kernel
  versions.

- Create and run more than one trace session in parallel, e.g.:
  - monitoring by system administrators
  - field engineers troubleshooting a specific problem


* Trace Analyzer Requirements

- Ability to cope with huge traces (> 10 GB)
- Should be possible to do a binary search on the file to find events at least
  by time, perhaps combined with smart indexing or summary data (a sketch
  follows this list).
- File format should be as dense as possible, but not at the expense of
  analysis performance (speed matters more than size, since disks are getting
  cheaper).
- Must not be required to scan through all events in order to start
  analyzing (by time anyway)
- Support live viewing of trace streams
- Standard description of a trace event context.
  (PERI-XML calls it "Dimensions")
- Manage system-wide event scoping with the following hierarchy:
  (address space identifier, section name, event name)
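
  The binary search mentioned above, sketched under the assumption that the
  analyzer has (or builds) an index of sub-buffer begin timestamps; the layout
  and names are hypothetical:

  #include <stddef.h>
  #include <stdint.h>

  struct subbuf_index {           /* one entry per sub-buffer in the file */
          uint64_t begin_ts;      /* timestamp of the first event */
          uint64_t file_offset;   /* where the sub-buffer starts */
  };

  /* Return the last entry beginning at or before 'ts', so decoding can start
   * there instead of scanning the whole file. Assumes nr > 0, entries sorted
   * by begin_ts, and idx[0].begin_ts <= ts. */
  static size_t find_subbuf_by_time(const struct subbuf_index *idx,
                                    size_t nr, uint64_t ts)
  {
          size_t lo = 0, hi = nr;

          while (lo + 1 < hi) {
                  size_t mid = lo + (hi - lo) / 2;

                  if (idx[mid].begin_ts <= ts)
                          lo = mid;
                  else
                          hi = mid;
          }
          return lo;
  }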

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

