All of lore.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2
@ 2011-04-22  8:47 Stephane Eranian
  2011-04-22  9:23 ` Ingo Molnar
  0 siblings, 1 reply; 46+ messages in thread
From: Stephane Eranian @ 2011-04-22  8:47 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arnaldo Carvalho de Melo, linux-kernel, Andi Kleen,
	Peter Zijlstra, Lin Ming, Arnaldo Carvalho de Melo,
	Thomas Gleixner, Peter Zijlstra, eranian, Arun Sharma

On Fri, Apr 22, 2011 at 10:06 AM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Ingo Molnar <mingo@elte.hu> wrote:
>
>> This needs to be a *lot* more user friendly. Users do not want to type in
>> stupid hexa magic numbers to get profiling. We have moved beyond the oprofile
>> era really.
>>
>> Unless there's proper generalized and human usable support i'm leaning
>> towards turning off the offcore user-space accessible raw bits for now, and
>> use them only kernel-internally, for the cache events.
>
Generic cache events are a myth. They are not usable. I keep getting questions
from users because nobody knows what they are actually counting, thus nobody
knows how to interpret the counts. You cannot really hide the micro-architecture
if you want to make any sensible measurements.

I agree with the poor usability of perf when you have to pass hex
values for events.
But that's why I have a user level library to map event strings to
event codes for perf.
Arun Sharma posted a patch a while ago to connect this library with perf, so far
it's been ignored, it seems:
    perf stat -e offcore_response_0:dmd_data_rd foo


> I'm about to push out the patch attached below - it lays out the arguments in
> detail. I don't think we have time to fix this properly for .39 - but memory
> profiling could be a nice feature for v2.6.40.
>
You will not be able to do any reasonable memory profiling using
offcore response
events. Dont' expect a profile to point to the missing loads. If
you're lucky it would
point to the use instruction.


> --------------------->
> From b52c55c6a25e4515b5e075a989ff346fc251ed09 Mon Sep 17 00:00:00 2001
> From: Ingo Molnar <mingo@elte.hu>
> Date: Fri, 22 Apr 2011 08:44:38 +0200
> Subject: [PATCH] x86, perf event: Turn off unstructured raw event access to offcore registers
>
> Andi Kleen pointed out that the Intel offcore support patches were merged
> without user-space tool support to the functionality:
>
>  |
>  | The offcore_msr perf kernel code was merged into 2.6.39-rc*, but the
>  | user space bits were not. This made it impossible to set the extra mask
>  | and actually do the OFFCORE profiling
>  |
>
> Andi submitted a preliminary patch for user-space support, as an
> extension to perf's raw event syntax:
>
>  |
>  | Some raw events -- like the Intel OFFCORE events -- support additional
>  | parameters. These can be appended after a ':'.
>  |
>  | For example on a multi socket Intel Nehalem:
>  |
>  |    perf stat -e r1b7:20ff -a sleep 1
>  |
>  | Profile the OFFCORE_RESPONSE.ANY_REQUEST with event mask REMOTE_DRAM_0
>  | that measures any access to DRAM on another socket.
>  |
>
> But this kind of usability is absolutely unacceptable - users should not
> be expected to type in magic, CPU and model specific incantations to get
> access to useful hardware functionality.
>
> The proper solution is to expose useful offcore functionality via
> generalized events - that way users do not have to care which specific
> CPU model they are using, they can use the conceptual event and not some
> model specific quirky hexa number.
>
> We already have such generalization in place for CPU cache events,
> and it's all very extensible.
>
> "Offcore" events measure general DRAM access patters along various
> parameters. They are particularly useful in NUMA systems.
>
> We want to support them via generalized DRAM events: either as the
> fourth level of cache (after the last-level cache), or as a separate
> generalization category.
>
> That way user-space support would be very obvious, memory access
> profiling could be done via self-explanatory commands like:
>
>  perf record -e dram ./myapp
>  perf record -e dram-remote ./myapp
>
> ... to measure DRAM accesses or more expensive cross-node NUMA DRAM
> accesses.
>
> These generalized events would work on all CPUs and architectures that
> have comparable PMU features.
>
> ( Note, these are just examples: actual implementation could have more
>  sophistication and more parameter - as long as they center around
>  similarly simple usecases. )
>
> Now we do not want to revert *all* of the current offcore bits, as they
> are still somewhat useful for generic last-level-cache events, implemented
> in this commit:
>
>  e994d7d23a0b: perf: Fix LLC-* events on Intel Nehalem/Westmere
>
> But we definitely do not yet want to expose the unstructured raw events
> to user-space, until better generalization and usability is implemented
> for these hardware event features.
>
> ( Note: after generalization has been implemented raw offcore events can be
>  supported as well: there can always be an odd event that is marginally
>  useful but not useful enough to generalize. DRAM profiling is definitely
>  *not* such a category so generalization must be done first. )
>
> Furthermore, PERF_TYPE_RAW access to these registers was not intended
> to go upstream without proper support - it was a side-effect of the above
> e994d7d23a0b commit, not mentioned in the changelog.
>
> As v2.6.39 is nearing release we go for the simplest approach: disable
> the PERF_TYPE_RAW offcore hack for now, before it escapes into a released
> kernel and becomes an ABI.
>
> Once proper structure is implemented for these hardware events and users
> are offered usable solutions we can revisit this issue.
>
> Reported-by: Andi Kleen <ak@linux.intel.com>
> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
> Cc: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Link: http://lkml.kernel.org/r/1302658203-4239-1-git-send-email-andi@firstfloor.org
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
> ---
>  arch/x86/kernel/cpu/perf_event.c |    6 +++++-
>  1 files changed, 5 insertions(+), 1 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
> index eed3673a..632e5dc 100644
> --- a/arch/x86/kernel/cpu/perf_event.c
> +++ b/arch/x86/kernel/cpu/perf_event.c
> @@ -586,8 +586,12 @@ static int x86_setup_perfctr(struct perf_event *event)
>                        return -EOPNOTSUPP;
>        }
>
> +       /*
> +        * Do not allow config1 (extended registers) to propagate,
> +        * there's no sane user-space generalization yet:
> +        */
>        if (attr->type == PERF_TYPE_RAW)
> -               return x86_pmu_extra_regs(event->attr.config, event);
> +               return 0;
>
>        if (attr->type == PERF_TYPE_HW_CACHE)
>                return set_ext_hw_attr(hwc, event);
>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2
  2011-04-22  8:47 [PATCH 1/1] perf tools: Add missing user space support for config1/config2 Stephane Eranian
@ 2011-04-22  9:23 ` Ingo Molnar
  2011-04-22  9:41   ` Stephane Eranian
  0 siblings, 1 reply; 46+ messages in thread
From: Ingo Molnar @ 2011-04-22  9:23 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Arnaldo Carvalho de Melo, linux-kernel, Andi Kleen,
	Peter Zijlstra, Lin Ming, Arnaldo Carvalho de Melo,
	Thomas Gleixner, Peter Zijlstra, eranian, Arun Sharma


* Stephane Eranian <eranian@google.com> wrote:

> On Fri, Apr 22, 2011 at 10:06 AM, Ingo Molnar <mingo@elte.hu> wrote:
> >
> > * Ingo Molnar <mingo@elte.hu> wrote:
> >
> >> This needs to be a *lot* more user friendly. Users do not want to type in
> >> stupid hexa magic numbers to get profiling. We have moved beyond the oprofile
> >> era really.
> >>
> >> Unless there's proper generalized and human usable support i'm leaning
> >> towards turning off the offcore user-space accessible raw bits for now, and
> >> use them only kernel-internally, for the cache events.
>
> Generic cache events are a myth. They are not usable. I keep getting 
> questions from users because nobody knows what they are actually counting, 
> thus nobody knows how to interpret the counts. You cannot really hide the 
> micro-architecture if you want to make any sensible measurements.

Well:

 aldebaran:~> perf stat --repeat 10 -e instructions -e L1-dcache-loads -e L1-dcache-load-misses -e LLC-misses ./hackbench 10
 Time: 0.125
 Time: 0.136
 Time: 0.180
 Time: 0.103
 Time: 0.097
 Time: 0.125
 Time: 0.104
 Time: 0.125
 Time: 0.114
 Time: 0.158

 Performance counter stats for './hackbench 10' (10 runs):

     2,102,556,398 instructions             #      0.000 IPC     ( +-   1.179% )
       843,957,634 L1-dcache-loads            ( +-   1.295% )
       130,007,361 L1-dcache-load-misses      ( +-   3.281% )
         6,328,938 LLC-misses                 ( +-   3.969% )

        0.146160287  seconds time elapsed   ( +-   5.851% )

It's certainly useful if you want to get ballpark figures about cache behavior 
of an app and want to do comparisons.

There are inconsistencies in our generic cache events - but that's not really a 
reason to obcure their usage behind nonsensical microarchitecture-specific 
details.

But i'm definitely in favor of making these generalized events more consistent 
across different CPU types. Can you list examples of inconsistencies that we 
should resolve? (and which you possibly consider impossible to resolve, right?)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2
  2011-04-22  9:23 ` Ingo Molnar
@ 2011-04-22  9:41   ` Stephane Eranian
  2011-04-22 10:52     ` [generalized cache events] " Ingo Molnar
  0 siblings, 1 reply; 46+ messages in thread
From: Stephane Eranian @ 2011-04-22  9:41 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arnaldo Carvalho de Melo, linux-kernel, Andi Kleen,
	Peter Zijlstra, Lin Ming, Arnaldo Carvalho de Melo,
	Thomas Gleixner, Peter Zijlstra, eranian, Arun Sharma

On Fri, Apr 22, 2011 at 11:23 AM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Stephane Eranian <eranian@google.com> wrote:
>
>> On Fri, Apr 22, 2011 at 10:06 AM, Ingo Molnar <mingo@elte.hu> wrote:
>> >
>> > * Ingo Molnar <mingo@elte.hu> wrote:
>> >
>> >> This needs to be a *lot* more user friendly. Users do not want to type in
>> >> stupid hexa magic numbers to get profiling. We have moved beyond the oprofile
>> >> era really.
>> >>
>> >> Unless there's proper generalized and human usable support i'm leaning
>> >> towards turning off the offcore user-space accessible raw bits for now, and
>> >> use them only kernel-internally, for the cache events.
>>
>> Generic cache events are a myth. They are not usable. I keep getting
>> questions from users because nobody knows what they are actually counting,
>> thus nobody knows how to interpret the counts. You cannot really hide the
>> micro-architecture if you want to make any sensible measurements.
>
> Well:
>
>  aldebaran:~> perf stat --repeat 10 -e instructions -e L1-dcache-loads -e L1-dcache-load-misses -e LLC-misses ./hackbench 10
>  Time: 0.125
>  Time: 0.136
>  Time: 0.180
>  Time: 0.103
>  Time: 0.097
>  Time: 0.125
>  Time: 0.104
>  Time: 0.125
>  Time: 0.114
>  Time: 0.158
>
>  Performance counter stats for './hackbench 10' (10 runs):
>
>     2,102,556,398 instructions             #      0.000 IPC     ( +-   1.179% )
>       843,957,634 L1-dcache-loads            ( +-   1.295% )
>       130,007,361 L1-dcache-load-misses      ( +-   3.281% )
>         6,328,938 LLC-misses                 ( +-   3.969% )
>
>        0.146160287  seconds time elapsed   ( +-   5.851% )
>
> It's certainly useful if you want to get ballpark figures about cache behavior
> of an app and want to do comparisons.
>
What can you conclude from the above counts?
Are they good or bad? If they are bad, how do you go about fixing the app?

> There are inconsistencies in our generic cache events - but that's not really a
> reason to obcure their usage behind nonsensical microarchitecture-specific
> details.
>
The actual events are a reflection of the micro-architecture. They indirectly
describe how it works. It is not clear to me that you can really improve your
app without some exposure to the micro-architecture.

So if you want to have generic events, I am fine with this, but you should not
block access to actual events pretending they are useless. Some people are
certainly interested in using them and learning about the micro-architecture
of their processor.


> But i'm definitely in favor of making these generalized events more consistent
> across different CPU types. Can you list examples of inconsistencies that we
> should resolve? (and which you possibly consider impossible to resolve, right?)
>
To make generic events more uniform across processors, one would have to have
precise definitions as to what they are supposed to count. Once you
have that, then
we may have a better chance at finding consistent mappings for each processor.
I have not yet seen such definitions.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2
  2011-04-22  9:41   ` Stephane Eranian
@ 2011-04-22 10:52     ` Ingo Molnar
  2011-04-22 12:04       ` Stephane Eranian
  2011-04-22 16:50       ` arun
  0 siblings, 2 replies; 46+ messages in thread
From: Ingo Molnar @ 2011-04-22 10:52 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Arnaldo Carvalho de Melo, linux-kernel, Andi Kleen,
	Peter Zijlstra, Lin Ming, Arnaldo Carvalho de Melo,
	Thomas Gleixner, Peter Zijlstra, eranian, Arun Sharma,
	Linus Torvalds, Andrew Morton


* Stephane Eranian <eranian@google.com> wrote:

> >> Generic cache events are a myth. They are not usable. [...]
> >
> > Well:
> >
> >  aldebaran:~> perf stat --repeat 10 -e instructions -e L1-dcache-loads -e L1-dcache-load-misses -e LLC-misses ./hackbench 10
> >  Time: 0.125
> >  Time: 0.136
> >  Time: 0.180
> >  Time: 0.103
> >  Time: 0.097
> >  Time: 0.125
> >  Time: 0.104
> >  Time: 0.125
> >  Time: 0.114
> >  Time: 0.158
> >
> >  Performance counter stats for './hackbench 10' (10 runs):
> >
> >     2,102,556,398 instructions             #      0.000 IPC     ( +-   1.179% )
> >       843,957,634 L1-dcache-loads            ( +-   1.295% )
> >       130,007,361 L1-dcache-load-misses      ( +-   3.281% )
> >         6,328,938 LLC-misses                 ( +-   3.969% )
> >
> >        0.146160287  seconds time elapsed   ( +-   5.851% )
> >
> > It's certainly useful if you want to get ballpark figures about cache behavior
> > of an app and want to do comparisons.
> >
> What can you conclude from the above counts?
> Are they good or bad? If they are bad, how do you go about fixing the app?

So let me give you a simplified example.

Say i'm a developer and i have an app with such code:

#define THOUSAND 1000

static char array[THOUSAND][THOUSAND];

int init_array(void)
{
	int i, j;

	for (i = 0; i < THOUSAND; i++) {
		for (j = 0; j < THOUSAND; j++) {
			array[j][i]++;
		}
	}

	return 0;
}

Pretty common stuff, right?

Using the generalized cache events i can run:

 $ perf stat --repeat 10 -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u ./array

 Performance counter stats for './array' (10 runs):

         6,719,130 cycles:u                   ( +-   0.662% )
         5,084,792 instructions:u           #      0.757 IPC     ( +-   0.000% )
         1,037,032 l1-dcache-loads:u          ( +-   0.009% )
         1,003,604 l1-dcache-load-misses:u    ( +-   0.003% )

        0.003802098  seconds time elapsed   ( +-  13.395% )

I consider that this is 'bad', because for almost every dcache-load there's a 
dcache-miss - a 99% L1 cache miss rate!

Then i think a bit, notice something, apply this performance optimization:

diff --git a/array.c b/array.c
index 4758d9a..d3f7037 100644
--- a/array.c
+++ b/array.c
@@ -9,7 +9,7 @@ int init_array(void)
 
 	for (i = 0; i < THOUSAND; i++) {
 		for (j = 0; j < THOUSAND; j++) {
-			array[j][i]++;
+			array[i][j]++;
 		}
 	}
 
I re-run perf-stat:

 $ perf stat --repeat 10 -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u ./array

 Performance counter stats for './array' (10 runs):

         2,395,407 cycles:u                   ( +-   0.365% )
         5,084,788 instructions:u           #      2.123 IPC     ( +-   0.000% )
         1,035,731 l1-dcache-loads:u          ( +-   0.006% )
             3,955 l1-dcache-load-misses:u    ( +-   4.872% )

        0.001806438  seconds time elapsed   ( +-   3.831% )

And i'm happy that indeed the l1-dcache misses are now super-low and that the 
app got much faster as well - the cycle count is a third of what it was before 
the optimization!

Note that:

 - I got absolute numbers in the right ballpark figure: i got a million loads as 
   expected (the array has 1 million elements), and 1 million cache-misses in 
   the 'bad' case.

 - I did not care which specific Intel CPU model this was running on

 - I did not care about *any* microarchitectural details - i only knew it's a 
   reasonably modern CPU with caching

 - I did not care how i could get access to L1 load and miss events. The events 
   were named obviously and it just worked.

So no, kernel driven generalization and sane tooling is not at all a 'myth' 
today, really.

So this is the general direction in which we want to move on. If you know about 
problems with existing generalization definitions then lets *fix* them, not 
pretend that generalizations and sane workflows are impossible ...

Thanks,

	Ingo

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2
  2011-04-22 10:52     ` [generalized cache events] " Ingo Molnar
@ 2011-04-22 12:04       ` Stephane Eranian
  2011-04-22 13:18         ` Ingo Molnar
  2011-04-22 16:51         ` Andi Kleen
  2011-04-22 16:50       ` arun
  1 sibling, 2 replies; 46+ messages in thread
From: Stephane Eranian @ 2011-04-22 12:04 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arnaldo Carvalho de Melo, linux-kernel, Andi Kleen,
	Peter Zijlstra, Lin Ming, Arnaldo Carvalho de Melo,
	Thomas Gleixner, Peter Zijlstra, eranian, Arun Sharma,
	Linus Torvalds, Andrew Morton

On Fri, Apr 22, 2011 at 12:52 PM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Stephane Eranian <eranian@google.com> wrote:
>
>> >> Generic cache events are a myth. They are not usable. [...]
>> >
>> > Well:
>> >
>> >  aldebaran:~> perf stat --repeat 10 -e instructions -e L1-dcache-loads -e L1-dcache-load-misses -e LLC-misses ./hackbench 10
>> >  Time: 0.125
>> >  Time: 0.136
>> >  Time: 0.180
>> >  Time: 0.103
>> >  Time: 0.097
>> >  Time: 0.125
>> >  Time: 0.104
>> >  Time: 0.125
>> >  Time: 0.114
>> >  Time: 0.158
>> >
>> >  Performance counter stats for './hackbench 10' (10 runs):
>> >
>> >     2,102,556,398 instructions             #      0.000 IPC     ( +-   1.179% )
>> >       843,957,634 L1-dcache-loads            ( +-   1.295% )
>> >       130,007,361 L1-dcache-load-misses      ( +-   3.281% )
>> >         6,328,938 LLC-misses                 ( +-   3.969% )
>> >
>> >        0.146160287  seconds time elapsed   ( +-   5.851% )
>> >
>> > It's certainly useful if you want to get ballpark figures about cache behavior
>> > of an app and want to do comparisons.
>> >
>> What can you conclude from the above counts?
>> Are they good or bad? If they are bad, how do you go about fixing the app?
>
> So let me give you a simplified example.
>
> Say i'm a developer and i have an app with such code:
>
> #define THOUSAND 1000
>
> static char array[THOUSAND][THOUSAND];
>
> int init_array(void)
> {
>        int i, j;
>
>        for (i = 0; i < THOUSAND; i++) {
>                for (j = 0; j < THOUSAND; j++) {
>                        array[j][i]++;
>                }
>        }
>
>        return 0;
> }
>
> Pretty common stuff, right?
>
> Using the generalized cache events i can run:
>
>  $ perf stat --repeat 10 -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u ./array
>
>  Performance counter stats for './array' (10 runs):
>
>         6,719,130 cycles:u                   ( +-   0.662% )
>         5,084,792 instructions:u           #      0.757 IPC     ( +-   0.000% )
>         1,037,032 l1-dcache-loads:u          ( +-   0.009% )
>         1,003,604 l1-dcache-load-misses:u    ( +-   0.003% )
>
>        0.003802098  seconds time elapsed   ( +-  13.395% )
>
> I consider that this is 'bad', because for almost every dcache-load there's a
> dcache-miss - a 99% L1 cache miss rate!
>
> Then i think a bit, notice something, apply this performance optimization:
>
I don't think this example is really representative of the kind of problems
people face, it is just too small and obvious. So I would not generalize on it.

If you are happy with generalized cache events then, as I said, I am fine with
it. But the API should ALWAYS allow users access to raw events when they
need finer grain analysis.

> diff --git a/array.c b/array.c
> index 4758d9a..d3f7037 100644
> --- a/array.c
> +++ b/array.c
> @@ -9,7 +9,7 @@ int init_array(void)
>
>        for (i = 0; i < THOUSAND; i++) {
>                for (j = 0; j < THOUSAND; j++) {
> -                       array[j][i]++;
> +                       array[i][j]++;
>                }
>        }
>
> I re-run perf-stat:
>
>  $ perf stat --repeat 10 -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u ./array
>
>  Performance counter stats for './array' (10 runs):
>
>         2,395,407 cycles:u                   ( +-   0.365% )
>         5,084,788 instructions:u           #      2.123 IPC     ( +-   0.000% )
>         1,035,731 l1-dcache-loads:u          ( +-   0.006% )
>             3,955 l1-dcache-load-misses:u    ( +-   4.872% )
>
>  - I got absolute numbers in the right ballpark figure: i got a million loads as
>   expected (the array has 1 million elements), and 1 million cache-misses in
>   the 'bad' case.
>
>  - I did not care which specific Intel CPU model this was running on
>
>  - I did not care about *any* microarchitectural details - i only knew it's a
>   reasonably modern CPU with caching
>
>  - I did not care how i could get access to L1 load and miss events. The events
>   were named obviously and it just worked.
>
> So no, kernel driven generalization and sane tooling is not at all a 'myth'
> today, really.
>
> So this is the general direction in which we want to move on. If you know about
> problems with existing generalization definitions then lets *fix* them, not
> pretend that generalizations and sane workflows are impossible ...
>
Again, to fix them, you need to give us definitions for what you expect those
events to count. Otherwise we cannot make forward progress.

Let me give just one simple example: cycles

What your definition for the generic cycle event?

There are various flavors:
  - count halted, unhalted cycles?
  - impacted by frequency scaling?

LLC-misses:
  - what considered the LLC?
  - does it include code, data or both?
  - does it include demand, hw prefetch?
  - it is to local or remote dram?

Once you have clear and precise definition, then we can look at the actual
events and figure out a mapping.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2
  2011-04-22 12:04       ` Stephane Eranian
@ 2011-04-22 13:18         ` Ingo Molnar
  2011-04-22 20:31           ` Stephane Eranian
  2011-04-22 16:51         ` Andi Kleen
  1 sibling, 1 reply; 46+ messages in thread
From: Ingo Molnar @ 2011-04-22 13:18 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Arnaldo Carvalho de Melo, linux-kernel, Andi Kleen,
	Peter Zijlstra, Lin Ming, Arnaldo Carvalho de Melo,
	Thomas Gleixner, Peter Zijlstra, eranian, Arun Sharma,
	Linus Torvalds, Andrew Morton


* Stephane Eranian <eranian@google.com> wrote:

> > Say i'm a developer and i have an app with such code:
> >
> > #define THOUSAND 1000
> >
> > static char array[THOUSAND][THOUSAND];
> >
> > int init_array(void)
> > {
> >        int i, j;
> >
> >        for (i = 0; i < THOUSAND; i++) {
> >                for (j = 0; j < THOUSAND; j++) {
> >                        array[j][i]++;
> >                }
> >        }
> >
> >        return 0;
> > }
> >
> > Pretty common stuff, right?
> >
> > Using the generalized cache events i can run:
> >
> >  $ perf stat --repeat 10 -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u ./array
> >
> >  Performance counter stats for './array' (10 runs):
> >
> >         6,719,130 cycles:u                   ( +-   0.662% )
> >         5,084,792 instructions:u           #      0.757 IPC     ( +-   0.000% )
> >         1,037,032 l1-dcache-loads:u          ( +-   0.009% )
> >         1,003,604 l1-dcache-load-misses:u    ( +-   0.003% )
> >
> >        0.003802098  seconds time elapsed   ( +-  13.395% )
> >
> > I consider that this is 'bad', because for almost every dcache-load there's a
> > dcache-miss - a 99% L1 cache miss rate!
> >
> > Then i think a bit, notice something, apply this performance optimization:
>
> I don't think this example is really representative of the kind of problems 
> people face, it is just too small and obvious. [...]

Well, the overwhelming majority of performance problems are 'small and obvious' 
- once a tool roughly pinpoints their existence and location!

And you have not offered a counter example either so you have not really 
demonstrated what you consider a 'real' example and why you consider 
generalized cache events inadequate.

> [...] So I would not generalize on it.

To the contrary, it demonstrates the most fundamental concept of cache 
profiling: looking at the hits/misses ratios and identifying hotspots.

That concept can be applied pretty nicely to all sorts of applications.

Interestly, the exact hardware event doesnt even *matter* for most problems, as 
long as it *correlates* with the conceptual entity we want to measure.

So what we need are hardware events that correlate with:

 - loads done
 - stores done
 - load misses suffered
 - store misses suffered
 - branches done
 - branches missed
 - instructions executed

It is the *ratio* that matters in most cases: before-change versus 
after-change, hits versus misses, etc.

Yes, there will be imprecisions, CPU quirks, limitations and speculation 
effects - but as long as we keep our eyes on the ball, generalizations are 
useful for solving practical problems.

> If you are happy with generalized cache events then, as I said, I am fine 
> with it. But the API should ALWAYS allow users access to raw events when they 
> need finer grain analysis.

Well, that's a pretty far cry from calling it a 'myth' :-)

So my point is (outlined in detail in the common changelog) that we need sane 
generalized remote DRAM events *first* - before we think about exposing the 
'rest' of te offcore-PMU as raw events.

> > diff --git a/array.c b/array.c
> > index 4758d9a..d3f7037 100644
> > --- a/array.c
> > +++ b/array.c
> > @@ -9,7 +9,7 @@ int init_array(void)
> >
> >        for (i = 0; i < THOUSAND; i++) {
> >                for (j = 0; j < THOUSAND; j++) {
> > -                       array[j][i]++;
> > +                       array[i][j]++;
> >                }
> >        }
> >
> > I re-run perf-stat:
> >
> >  $ perf stat --repeat 10 -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u ./array
> >
> >  Performance counter stats for './array' (10 runs):
> >
> >         2,395,407 cycles:u                   ( +-   0.365% )
> >         5,084,788 instructions:u           #      2.123 IPC     ( +-   0.000% )
> >         1,035,731 l1-dcache-loads:u          ( +-   0.006% )
> >             3,955 l1-dcache-load-misses:u    ( +-   4.872% )
> >
> >  - I got absolute numbers in the right ballpark figure: i got a million loads as
> >   expected (the array has 1 million elements), and 1 million cache-misses in
> >   the 'bad' case.
> >
> >  - I did not care which specific Intel CPU model this was running on
> >
> >  - I did not care about *any* microarchitectural details - i only knew it's a
> >   reasonably modern CPU with caching
> >
> >  - I did not care how i could get access to L1 load and miss events. The events
> >   were named obviously and it just worked.
> >
> > So no, kernel driven generalization and sane tooling is not at all a 'myth'
> > today, really.
> >
> > So this is the general direction in which we want to move on. If you know about
> > problems with existing generalization definitions then lets *fix* them, not
> > pretend that generalizations and sane workflows are impossible ...
>
> Again, to fix them, you need to give us definitions for what you expect those 
> events to count. Otherwise we cannot make forward progress.

No, we do not 'need' to give exact definitions. This whole topic is more 
analogous to physics than to mathematics. See my description above about how 
ratios and high level structure matters more than absolute values and 
definitions.

Yes, if we can then 'loads' and 'stores' should correspond to the number of 
loads a program flow does, which you get if you look at the assembly code. 
'Instructions' should correspond to the number of instructions executed.

If the CPU cannot do it it's not a huge deal in practice - we will cope and 
hopefully it will all be fixed in future CPU versions.

That having said, most CPUs i have access to get the fundamentals right, so 
it's not like we have huge problems in practice. Key CPU statistics are 
available.

> Let me give just one simple example: cycles
>
> What your definition for the generic cycle event?
> 
> There are various flavors:
>   - count halted, unhalted cycles?

Again i think you are getting lost in too much detail.

For typical developers halted versus unhalted is mostly an uninteresting 
distinction, as people tend to just type 'perf record ./myapp', which is per 
workload profiling so it excludes idle time. So it would give the same result 
to them regardless of whether it's halted or unhalted cycles.

( This simple example already shows the idiocy of the hardware names, calling 
  cycles events "CPU_CLK_UNHALTED.REF". In most cases the developer does *not*
  care about those distinctions so the defaults should not be complicated with
  them. )

>   - impacted by frequency scaling?

The best default for developers is a frequency scaling invariant result - i.e. 
one that is not against a reference clock but against the real CPU clock.

( Even that one will not be completely invariant due to the frequency-scaling
  dependent cost of misses and bus ops, etc. )

But profiling against a reference frequency makes sense as well, especially for 
system-wide profiling - this is the hardware equivalent of the cpu-clock / 
elapsed time metric. We could implement the cpu-clock using reference cycles 
events for example.

> LLC-misses:
>   - what considered the LLC?

The last level cache is whichever cache sits before DRAM.

>   - does it include code, data or both?

Both if possible as they tend to be unified caches anyway.

>   - does it include demand, hw prefetch?

Do you mean for the LLC-prefetch events? What would be your suggestion, which 
is the most useful metric? Prefetches are not directly done by program logic so 
this is borderline. We wanted to include them for completeness - and the metric 
should probably include 'all activities that program flow has not caused 
directly and which may be sucking up system resources' - i.e. including hw 
prefetch.

>   - it is to local or remote dram?

The current definitions should include both.

Measuring remote DRAM accesss is of course useful - that is the original point 
of this thread. It should be done as an additional layer, basically local RAM 
is yet another cache level - but we can take other generalized approach as 
well, if they make more sense.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2
  2011-04-22 10:52     ` [generalized cache events] " Ingo Molnar
  2011-04-22 12:04       ` Stephane Eranian
@ 2011-04-22 16:50       ` arun
  2011-04-22 17:00         ` Andi Kleen
  2011-04-22 20:30         ` Ingo Molnar
  1 sibling, 2 replies; 46+ messages in thread
From: arun @ 2011-04-22 16:50 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Stephane Eranian, Arnaldo Carvalho de Melo, linux-kernel,
	Andi Kleen, Peter Zijlstra, Lin Ming, Arnaldo Carvalho de Melo,
	Thomas Gleixner, Peter Zijlstra, eranian, Arun Sharma,
	Linus Torvalds, Andrew Morton

On Fri, Apr 22, 2011 at 12:52:11PM +0200, Ingo Molnar wrote:
> 
> Using the generalized cache events i can run:
> 
>  $ perf stat --repeat 10 -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u ./array
> 
>  Performance counter stats for './array' (10 runs):
> 
>          6,719,130 cycles:u                   ( +-   0.662% )
>          5,084,792 instructions:u           #      0.757 IPC     ( +-   0.000% )
>          1,037,032 l1-dcache-loads:u          ( +-   0.009% )
>          1,003,604 l1-dcache-load-misses:u    ( +-   0.003% )
> 
>         0.003802098  seconds time elapsed   ( +-  13.395% )
> 
> I consider that this is 'bad', because for almost every dcache-load there's a 
> dcache-miss - a 99% L1 cache miss rate!

One could argue that all you need is cycles and instructions. If there is an
expensive load, you'll see that the load instruction takes many cycles and
you can infer that it's a cache miss.

Questions app developers typically ask me:

* If I fix all my top 5 L3 misses how much faster will my app go?
* Am I bottlenecked on memory bandwidth?
* I have 4 L3 misses every 1000 instructions and 15 branch mispredicts per
  1000 instructions. Which one should I focus on?

It's hard to answer some of these without access to all events.
While your approach of having generic events for commonly used counters
might be useful for some use cases, I don't see why exposing all vendor
defined events is harmful.

A clear statement on the last point would be helpful.

 -Arun

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2
  2011-04-22 12:04       ` Stephane Eranian
  2011-04-22 13:18         ` Ingo Molnar
@ 2011-04-22 16:51         ` Andi Kleen
  2011-04-22 19:57           ` Ingo Molnar
  2011-04-26  9:25           ` Peter Zijlstra
  1 sibling, 2 replies; 46+ messages in thread
From: Andi Kleen @ 2011-04-22 16:51 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, linux-kernel,
	Peter Zijlstra, Lin Ming, Arnaldo Carvalho de Melo,
	Thomas Gleixner, Peter Zijlstra, eranian, Arun Sharma,
	Linus Torvalds, Andrew Morton

> Once you have clear and precise definition, then we can look at the actual
> events and figure out a mapping.

It's unclear this can be even done. Linux runs on a wide variety of micro
architectures, with all kinds of cache architectures.

Micro architectures are so different. I suspect a "generic" definition would
need to be so vague as to be useless.

This in general seems to be the problem of the current cache events.

Overall for any interesting analysis you need to go CPU specific.
Abstracted performance analysis is a contradiction in terms.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2
  2011-04-22 16:50       ` arun
@ 2011-04-22 17:00         ` Andi Kleen
  2011-04-22 20:30         ` Ingo Molnar
  1 sibling, 0 replies; 46+ messages in thread
From: Andi Kleen @ 2011-04-22 17:00 UTC (permalink / raw)
  To: arun
  Cc: Ingo Molnar, Stephane Eranian, Arnaldo Carvalho de Melo,
	linux-kernel, Peter Zijlstra, Lin Ming, Arnaldo Carvalho de Melo,
	Thomas Gleixner, Peter Zijlstra, eranian, Arun Sharma,
	Linus Torvalds, Andrew Morton

> One could argue that all you need is cycles and instructions. If there is an
> expensive load, you'll see that the load instruction takes many cycles and
> you can infer that it's a cache miss.

That is when you can actually recognize which of the last load instructions
caused the problem. Often skid on Out of order CPUs makes this very hard to 
identify which actual load caused the the long stall (or if they all stalled)

There's a way around it of course using advanced events, but not with
cycles.

> I don't see why exposing all vendor defined events is harmful

Simple: without vendor events you cannot answer a lot of questions.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2
  2011-04-22 16:51         ` Andi Kleen
@ 2011-04-22 19:57           ` Ingo Molnar
  2011-04-26  9:25           ` Peter Zijlstra
  1 sibling, 0 replies; 46+ messages in thread
From: Ingo Molnar @ 2011-04-22 19:57 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Stephane Eranian, Arnaldo Carvalho de Melo, linux-kernel,
	Peter Zijlstra, Lin Ming, Arnaldo Carvalho de Melo,
	Thomas Gleixner, Peter Zijlstra, eranian, Arun Sharma,
	Linus Torvalds, Andrew Morton


* Andi Kleen <ak@linux.intel.com> wrote:

> > Once you have clear and precise definition, then we can look at the actual
> > events and figure out a mapping.
> 
> It's unclear this can be even done. Linux runs on a wide variety of micro
> architectures, with all kinds of cache architectures.
> 
> Micro architectures are so different. I suspect a "generic" definition would
> need to be so vague as to be useless.

Not really. I gave a very specific example which solved a common and real 
problem, using L1-loads and L1-load-misses events.

> This in general seems to be the problem of the current cache events.
> 
> Overall for any interesting analysis you need to go CPU specific. Abstracted 
> performance analysis is a contradiction in terms.

Nothing of what i did in that example was CPU or microarchitecture specific.

Really, you are making this more complex than it really is. Just check the 
cache profiling example i gave, it works just fine today.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2
  2011-04-22 16:50       ` arun
  2011-04-22 17:00         ` Andi Kleen
@ 2011-04-22 20:30         ` Ingo Molnar
  2011-04-22 20:32           ` Ingo Molnar
  2011-04-23 20:14           ` [PATCH] perf events: Add stalled cycles generic event - PERF_COUNT_HW_STALLED_CYCLES Ingo Molnar
  1 sibling, 2 replies; 46+ messages in thread
From: Ingo Molnar @ 2011-04-22 20:30 UTC (permalink / raw)
  To: arun
  Cc: Stephane Eranian, Arnaldo Carvalho de Melo, linux-kernel,
	Andi Kleen, Peter Zijlstra, Lin Ming, Arnaldo Carvalho de Melo,
	Thomas Gleixner, Peter Zijlstra, eranian, Arun Sharma,
	Linus Torvalds, Andrew Morton


* arun@sharma-home.net <arun@sharma-home.net> wrote:

> On Fri, Apr 22, 2011 at 12:52:11PM +0200, Ingo Molnar wrote:
> > 
> > Using the generalized cache events i can run:
> > 
> >  $ perf stat --repeat 10 -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u ./array
> > 
> >  Performance counter stats for './array' (10 runs):
> > 
> >          6,719,130 cycles:u                   ( +-   0.662% )
> >          5,084,792 instructions:u           #      0.757 IPC     ( +-   0.000% )
> >          1,037,032 l1-dcache-loads:u          ( +-   0.009% )
> >          1,003,604 l1-dcache-load-misses:u    ( +-   0.003% )
> > 
> >         0.003802098  seconds time elapsed   ( +-  13.395% )
> > 
> > I consider that this is 'bad', because for almost every dcache-load there's a 
> > dcache-miss - a 99% L1 cache miss rate!
> 
> One could argue that all you need is cycles and instructions. [...]

Yes, and note that with instructions events we even have skid-less PEBS 
profiling so seeing the precise .

> [...] If there is an expensive load, you'll see that the load instruction 
> takes many cycles and you can infer that it's a cache miss.
> 
> Questions app developers typically ask me:
> 
> * If I fix all my top 5 L3 misses how much faster will my app go?

This has come up: we could add a 'stalled/idle-cycles' generic event - i.e. 
cycles spent without performing useful work in the pipelines. (Resource-stall 
events on Intel CPUs.)

Then you would profile L3 misses (there's a generic event for that), plus 
stalls, and the answer to your question would be the percentage of hits you get 
in the stalled-cycles profile, multiplied by the stalled-cycles/cycles ratio.

> * Am I bottlenecked on memory bandwidth?

This would be a variant of the measurement above: say the top 90% of L3 misses 
profile-correlated with stalled-cycles, relative to total-cycles. If you get 
'90% of L3 misses cause a 1% wall-time slowdown' then you are not memory 
bottlenecked. If the answer is '35% slowdown' then you are memory bottlenecked.

> * I have 4 L3 misses every 1000 instructions and 15 branch mispredicts per
>   1000 instructions. Which one should I focus on?

AFAICS this would be another variant of stalled-cycles measurements: you create 
a stalled-cycles profile and check whether the top hits are branches or memory 
loads.

> It's hard to answer some of these without access to all events.

I'm curious, how would you measure these properties - do you have some 
different events in mind?

> While your approach of having generic events for commonly used counters might 
> be useful for some use cases, I don't see why exposing all vendor defined 
> events is harmful.
> 
> A clear statement on the last point would be helpful.

Well, the thing is, i think users are helped most if we add useful, highlevel 
PMU features added and not just an opaque raw event pass-through engine. The 
problem with lowlevel raw ABIs is that the tool space fragments into a zillion 
small hacks and there's no good concentration of know-how. I'd like the art of 
performance measurement to be generalized out, as well as it can be.

We had this discussion in the big perf-counters flamewars 2+ years ago, where 
one side wanted raw events, while we wanted intelligent kernel-side 
abstractions and generalizations. I think the abstraction and generalization 
angle worked out very well in practice - but we are having this discussion 
again and again :-)

As i stated it in my prior mails, i'm not against raw events as a rare 
exception channel - that increases utility. I'm against what was attempted 
here: an extension to raw events as the *primary* channel for DRAM measurement 
features. That is just sloppy and *reduces* utility.

I'm very simple-minded: when i see reduced utility i become sad :)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2
  2011-04-22 13:18         ` Ingo Molnar
@ 2011-04-22 20:31           ` Stephane Eranian
  2011-04-22 20:47             ` Ingo Molnar
  2011-04-22 21:03             ` Ingo Molnar
  0 siblings, 2 replies; 46+ messages in thread
From: Stephane Eranian @ 2011-04-22 20:31 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arnaldo Carvalho de Melo, linux-kernel, Andi Kleen,
	Peter Zijlstra, Lin Ming, Arnaldo Carvalho de Melo,
	Thomas Gleixner, Peter Zijlstra, eranian, Arun Sharma,
	Linus Torvalds, Andrew Morton

On Fri, Apr 22, 2011 at 3:18 PM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Stephane Eranian <eranian@google.com> wrote:
>
> > > Say i'm a developer and i have an app with such code:
> > >
> > > #define THOUSAND 1000
> > >
> > > static char array[THOUSAND][THOUSAND];
> > >
> > > int init_array(void)
> > > {
> > >        int i, j;
> > >
> > >        for (i = 0; i < THOUSAND; i++) {
> > >                for (j = 0; j < THOUSAND; j++) {
> > >                        array[j][i]++;
> > >                }
> > >        }
> > >
> > >        return 0;
> > > }
> > >
> > > Pretty common stuff, right?
> > >
> > > Using the generalized cache events i can run:
> > >
> > >  $ perf stat --repeat 10 -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u ./array
> > >
> > >  Performance counter stats for './array' (10 runs):
> > >
> > >         6,719,130 cycles:u                   ( +-   0.662% )
> > >         5,084,792 instructions:u           #      0.757 IPC     ( +-   0.000% )
> > >         1,037,032 l1-dcache-loads:u          ( +-   0.009% )
> > >         1,003,604 l1-dcache-load-misses:u    ( +-   0.003% )
> > >
> > >        0.003802098  seconds time elapsed   ( +-  13.395% )
> > >
> > > I consider that this is 'bad', because for almost every dcache-load there's a
> > > dcache-miss - a 99% L1 cache miss rate!
> > >
> > > Then i think a bit, notice something, apply this performance optimization:
> >
> > I don't think this example is really representative of the kind of problems
> > people face, it is just too small and obvious. [...]
>
> Well, the overwhelming majority of performance problems are 'small and obvious'

Problems are not simple. Most serious applications these days are huge, hundreds
of MB of text, if not GB.

In your artificial example, you knew the answer before you started the
measurement.
Most of the time, applications are assembled out of hundreds of libraries, so no
single developers knows all the code. Thus, the performance analyst is
faced with a
black box most of the time.

Let's go back to your example.
Performance counter stats for './array' (10 runs):

         6,719,130 cycles:u                   ( +-   0.662% )
         5,084,792 instructions:u           #      0.757 IPC     ( +-   0.000% )
         1,037,032 l1-dcache-loads:u          ( +-   0.009% )
         1,003,604 l1-dcache-load-misses:u    ( +-   0.003% )

Looking at this I don't think you can pinpoint which function has a problem and
whether or not there is a real problem. You need to evaluate the penalty. Once
you know that you can estimate any potential gain from fixing the code. Arun
pointed that out rightfully in his answer. How do you know the penalty if you
don't decompose some more?

It's not because you see cache misses that they are harmful. They could be
generated by the HW prefetchers and be harmless to your program. Does
L1d-cache-load misses include HW prefetchers? Imagine it does on Intel Nehalem.
Do you guarantee it does on AMD Shanghai as well? If not then how do I compare?


As I said, if you are only interested in generic events, then that's
fine with me
as long as you don't prevent others from accessing raw events.

As people have pointed, I don't see why raw events would be harmful to
users. If you
don't want to learn about them, just stick with the generic events.


> - once a tool roughly pinpoints their existence and location!
>
> And you have not offered a counter example either so you have not really
> demonstrated what you consider a 'real' example and why you consider
> generalized cache events inadequate.
>
> > [...] So I would not generalize on it.
>
> To the contrary, it demonstrates the most fundamental concept of cache
> profiling: looking at the hits/misses ratios and identifying hotspots.
>
> That concept can be applied pretty nicely to all sorts of applications.
>
> Interestly, the exact hardware event doesnt even *matter* for most problems, as
> long as it *correlates* with the conceptual entity we want to measure.
>
> So what we need are hardware events that correlate with:
>
>  - loads done
>  - stores done
>  - load misses suffered
>  - store misses suffered
>  - branches done
>  - branches missed
>  - instructions executed
>
> It is the *ratio* that matters in most cases: before-change versus
> after-change, hits versus misses, etc.
>
> Yes, there will be imprecisions, CPU quirks, limitations and speculation
> effects - but as long as we keep our eyes on the ball, generalizations are
> useful for solving practical problems.
>
> > If you are happy with generalized cache events then, as I said, I am fine
> > with it. But the API should ALWAYS allow users access to raw events when they
> > need finer grain analysis.
>
> Well, that's a pretty far cry from calling it a 'myth' :-)
>
> So my point is (outlined in detail in the common changelog) that we need sane
> generalized remote DRAM events *first* - before we think about exposing the
> 'rest' of te offcore-PMU as raw events.
>
> > > diff --git a/array.c b/array.c
> > > index 4758d9a..d3f7037 100644
> > > --- a/array.c
> > > +++ b/array.c
> > > @@ -9,7 +9,7 @@ int init_array(void)
> > >
> > >        for (i = 0; i < THOUSAND; i++) {
> > >                for (j = 0; j < THOUSAND; j++) {
> > > -                       array[j][i]++;
> > > +                       array[i][j]++;
> > >                }
> > >        }
> > >
> > > I re-run perf-stat:
> > >
> > >  $ perf stat --repeat 10 -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u ./array
> > >
> > >  Performance counter stats for './array' (10 runs):
> > >
> > >         2,395,407 cycles:u                   ( +-   0.365% )
> > >         5,084,788 instructions:u           #      2.123 IPC     ( +-   0.000% )
> > >         1,035,731 l1-dcache-loads:u          ( +-   0.006% )
> > >             3,955 l1-dcache-load-misses:u    ( +-   4.872% )
> > >
> > >  - I got absolute numbers in the right ballpark figure: i got a million loads as
> > >   expected (the array has 1 million elements), and 1 million cache-misses in
> > >   the 'bad' case.
> > >
> > >  - I did not care which specific Intel CPU model this was running on
> > >
> > >  - I did not care about *any* microarchitectural details - i only knew it's a
> > >   reasonably modern CPU with caching
> > >
> > >  - I did not care how i could get access to L1 load and miss events. The events
> > >   were named obviously and it just worked.
> > >
> > > So no, kernel driven generalization and sane tooling is not at all a 'myth'
> > > today, really.
> > >
> > > So this is the general direction in which we want to move on. If you know about
> > > problems with existing generalization definitions then lets *fix* them, not
> > > pretend that generalizations and sane workflows are impossible ...
> >
> > Again, to fix them, you need to give us definitions for what you expect those
> > events to count. Otherwise we cannot make forward progress.
>
> No, we do not 'need' to give exact definitions. This whole topic is more
> analogous to physics than to mathematics. See my description above about how
> ratios and high level structure matters more than absolute values and
> definitions.
>
> Yes, if we can then 'loads' and 'stores' should correspond to the number of
> loads a program flow does, which you get if you look at the assembly code.
> 'Instructions' should correspond to the number of instructions executed.
>
> If the CPU cannot do it it's not a huge deal in practice - we will cope and
> hopefully it will all be fixed in future CPU versions.
>
> That having said, most CPUs i have access to get the fundamentals right, so
> it's not like we have huge problems in practice. Key CPU statistics are
> available.
>
> > Let me give just one simple example: cycles
> >
> > What your definition for the generic cycle event?
> >
> > There are various flavors:
> >   - count halted, unhalted cycles?
>
> Again i think you are getting lost in too much detail.
>
> For typical developers halted versus unhalted is mostly an uninteresting
> distinction, as people tend to just type 'perf record ./myapp', which is per
> workload profiling so it excludes idle time. So it would give the same result
> to them regardless of whether it's halted or unhalted cycles.
>
> ( This simple example already shows the idiocy of the hardware names, calling
>  cycles events "CPU_CLK_UNHALTED.REF". In most cases the developer does *not*
>  care about those distinctions so the defaults should not be complicated with
>  them. )
>
> >   - impacted by frequency scaling?
>
> The best default for developers is a frequency scaling invariant result - i.e.
> one that is not against a reference clock but against the real CPU clock.
>
> ( Even that one will not be completely invariant due to the frequency-scaling
>  dependent cost of misses and bus ops, etc. )
>
> But profiling against a reference frequency makes sense as well, especially for
> system-wide profiling - this is the hardware equivalent of the cpu-clock /
> elapsed time metric. We could implement the cpu-clock using reference cycles
> events for example.
>
> > LLC-misses:
> >   - what considered the LLC?
>
> The last level cache is whichever cache sits before DRAM.
>
> >   - does it include code, data or both?
>
> Both if possible as they tend to be unified caches anyway.
>
> >   - does it include demand, hw prefetch?
>
> Do you mean for the LLC-prefetch events? What would be your suggestion, which
> is the most useful metric? Prefetches are not directly done by program logic so
> this is borderline. We wanted to include them for completeness - and the metric
> should probably include 'all activities that program flow has not caused
> directly and which may be sucking up system resources' - i.e. including hw
> prefetch.
>
> >   - it is to local or remote dram?
>
> The current definitions should include both.
>
> Measuring remote DRAM accesss is of course useful - that is the original point
> of this thread. It should be done as an additional layer, basically local RAM
> is yet another cache level - but we can take other generalized approach as
> well, if they make more sense.
>
> Thanks,
>
>        Ingo

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2
  2011-04-22 20:30         ` Ingo Molnar
@ 2011-04-22 20:32           ` Ingo Molnar
  2011-04-23  0:03             ` Andi Kleen
  2011-04-23 20:14           ` [PATCH] perf events: Add stalled cycles generic event - PERF_COUNT_HW_STALLED_CYCLES Ingo Molnar
  1 sibling, 1 reply; 46+ messages in thread
From: Ingo Molnar @ 2011-04-22 20:32 UTC (permalink / raw)
  To: arun
  Cc: Stephane Eranian, Arnaldo Carvalho de Melo, linux-kernel,
	Andi Kleen, Peter Zijlstra, Lin Ming, Arnaldo Carvalho de Melo,
	Thomas Gleixner, Peter Zijlstra, eranian, Arun Sharma,
	Linus Torvalds, Andrew Morton


* Ingo Molnar <mingo@elte.hu> wrote:

> * arun@sharma-home.net <arun@sharma-home.net> wrote:
> 
> > On Fri, Apr 22, 2011 at 12:52:11PM +0200, Ingo Molnar wrote:
> > > 
> > > Using the generalized cache events i can run:
> > > 
> > >  $ perf stat --repeat 10 -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u ./array
> > > 
> > >  Performance counter stats for './array' (10 runs):
> > > 
> > >          6,719,130 cycles:u                   ( +-   0.662% )
> > >          5,084,792 instructions:u           #      0.757 IPC     ( +-   0.000% )
> > >          1,037,032 l1-dcache-loads:u          ( +-   0.009% )
> > >          1,003,604 l1-dcache-load-misses:u    ( +-   0.003% )
> > > 
> > >         0.003802098  seconds time elapsed   ( +-  13.395% )
> > > 
> > > I consider that this is 'bad', because for almost every dcache-load there's a 
> > > dcache-miss - a 99% L1 cache miss rate!
> > 
> > One could argue that all you need is cycles and instructions. [...]
> 
> Yes, and note that with instructions events we even have skid-less PEBS 
> profiling so seeing the precise .
                                  - location of instructions is possible.

[ An email gremlin ate that part of the sentence. ]

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2
  2011-04-22 20:31           ` Stephane Eranian
@ 2011-04-22 20:47             ` Ingo Molnar
  2011-04-23 12:13               ` Stephane Eranian
  2011-04-22 21:03             ` Ingo Molnar
  1 sibling, 1 reply; 46+ messages in thread
From: Ingo Molnar @ 2011-04-22 20:47 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Arnaldo Carvalho de Melo, linux-kernel, Andi Kleen,
	Peter Zijlstra, Lin Ming, Arnaldo Carvalho de Melo,
	Thomas Gleixner, Peter Zijlstra, eranian, Arun Sharma,
	Linus Torvalds, Andrew Morton


* Stephane Eranian <eranian@google.com> wrote:

> On Fri, Apr 22, 2011 at 3:18 PM, Ingo Molnar <mingo@elte.hu> wrote:
> >
> > * Stephane Eranian <eranian@google.com> wrote:
> >
> > > > Say i'm a developer and i have an app with such code:
> > > >
> > > > #define THOUSAND 1000
> > > >
> > > > static char array[THOUSAND][THOUSAND];
> > > >
> > > > int init_array(void)
> > > > {
> > > >        int i, j;
> > > >
> > > >        for (i = 0; i < THOUSAND; i++) {
> > > >                for (j = 0; j < THOUSAND; j++) {
> > > >                        array[j][i]++;
> > > >                }
> > > >        }
> > > >
> > > >        return 0;
> > > > }
> > > >
> > > > Pretty common stuff, right?
> > > >
> > > > Using the generalized cache events i can run:
> > > >
> > > >  $ perf stat --repeat 10 -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u ./array
> > > >
> > > >  Performance counter stats for './array' (10 runs):
> > > >
> > > >         6,719,130 cycles:u                   ( +-   0.662% )
> > > >         5,084,792 instructions:u           #      0.757 IPC     ( +-   0.000% )
> > > >         1,037,032 l1-dcache-loads:u          ( +-   0.009% )
> > > >         1,003,604 l1-dcache-load-misses:u    ( +-   0.003% )
> > > >
> > > >        0.003802098  seconds time elapsed   ( +-  13.395% )
> > > >
> > > > I consider that this is 'bad', because for almost every dcache-load there's a
> > > > dcache-miss - a 99% L1 cache miss rate!
> > > >
> > > > Then i think a bit, notice something, apply this performance optimization:
> > >
> > > I don't think this example is really representative of the kind of problems
> > > people face, it is just too small and obvious. [...]
> >
> > Well, the overwhelming majority of performance problems are 'small and obvious'
> 
> Problems are not simple. Most serious applications these days are huge, 
> hundreds of MB of text, if not GB.
> 
> In your artificial example, you knew the answer before you started the 
> measurement.
>
> Most of the time, applications are assembled out of hundreds of libraries, so 
> no single developers knows all the code. Thus, the performance analyst is 
> faced with a black box most of the time.

I isolated out an example and assumed that you'd agree that identifying hot 
spots is trivial with generic cache events.

My assumption was wrong so let me show you how trivial it really is.

Here's an example with *two* problematic functions (but it could have hundreds, 
it does not matter):

-------------------------------->
#define THOUSAND 1000

static char array1[THOUSAND][THOUSAND];

static char array2[THOUSAND][THOUSAND];

void func1(void)
{
	int i, j;

	for (i = 0; i < THOUSAND; i++)
		for (j = 0; j < THOUSAND; j++)
			array1[i][j]++;
}

void func2(void)
{
	int i, j;

	for (i = 0; i < THOUSAND; i++)
		for (j = 0; j < THOUSAND; j++)
			array2[j][i]++;
}

int main(void)
{
	for (;;) {
		func1();
		func2();
	}

	return 0;
}
<--------------------------------

We do not know which one has the cache-misses problem, func1() or func2(), it's 
all a black box, right?

Using generic cache events you simply type this:

 $ perf top -e l1-dcache-load-misses -e l1-dcache-loads

And you get such output:

   PerfTop:    1923 irqs/sec  kernel: 0.0%  exact:  0.0% [l1-dcache-load-misses:u/l1-dcache-loads:u],  (all, 16 CPUs)
-------------------------------------------------------------------------------------------------------

   weight    samples  pcnt funct DSO
   ______    _______ _____ _____ ______________________

      1.9       6184 98.8% func2 /home/mingo/opt/array2
      0.0         69  1.1% func1 /home/mingo/opt/array2

It has pinpointed the problem in func2 *very* precisely.

Obviously this can be used to analyze larger apps as well, with thousands of 
functions, to pinpoint cachemiss problems in specific functions.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2
  2011-04-22 20:31           ` Stephane Eranian
  2011-04-22 20:47             ` Ingo Molnar
@ 2011-04-22 21:03             ` Ingo Molnar
  2011-04-23 12:27               ` Stephane Eranian
  1 sibling, 1 reply; 46+ messages in thread
From: Ingo Molnar @ 2011-04-22 21:03 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Arnaldo Carvalho de Melo, linux-kernel, Andi Kleen,
	Peter Zijlstra, Lin Ming, Arnaldo Carvalho de Melo,
	Thomas Gleixner, Peter Zijlstra, eranian, Arun Sharma,
	Linus Torvalds, Andrew Morton


* Stephane Eranian <eranian@google.com> wrote:

> Let's go back to your example.
> Performance counter stats for './array' (10 runs):
> 
>          6,719,130 cycles:u                   ( +-   0.662% )
>          5,084,792 instructions:u           #      0.757 IPC     ( +-   0.000% )
>          1,037,032 l1-dcache-loads:u          ( +-   0.009% )
>          1,003,604 l1-dcache-load-misses:u    ( +-   0.003% )
> 
> Looking at this I don't think you can pinpoint which function has a problem 
> [...]

In my previous mail i showed how to pinpoint specific functions. You bring up 
an interesting question, cost/benefit analysis:

> [...] and whether or not there is a real problem. You need to evaluate the 
> penalty. Once you know that you can estimate any potential gain from fixing 
> the code. Arun pointed that out rightfully in his answer. How do you know the 
> penalty if you don't decompose some more?

We can measure that even with today's tooling - which doesnt do cost/benefit 
analysis out of box. In my previous example i showed the cachemisses profile:

   weight    samples  pcnt funct DSO
   ______    _______ _____ _____ ______________________

      1.9       6184 98.8% func2 /home/mingo/opt/array2
      0.0         69  1.1% func1 /home/mingo/opt/array2

and here's the cycles profile:

             samples  pcnt funct DSO
             _______ _____ _____ ______________________

             2555.00 67.4% func2 /home/mingo/opt/array2
             1220.00 32.2% func1 /home/mingo/opt/array2

So, given that there was no other big miss sources:

 $ perf stat -a -e branch-misses:u -e l1-dcache-load-misses:u -e l1-dcache-store-misses:u -e l1-icache-load-misses:u sleep 1

 Performance counter stats for 'sleep 1':

            70,674 branch-misses:u         
       347,992,027 l1-dcache-load-misses:u 
             1,779 l1-dcache-store-misses:u
             8,007 l1-icache-load-misses:u 

        1.000982021  seconds time elapsed

I can tell you that by fixing the cache-misses in that function, the code will 
be roughly 33% faster.

So i fixed the bug, and before it 100 iterations of func1+func2 took 300 msecs:

 $ perf stat -e cpu-clock --repeat 10 ./array2

 Performance counter stats for './array2' (10 runs):

        298.405074 cpu-clock                  ( +-   1.823% )

After the fix it took 190 msecs:

 $ perf stat -e cpu-clock --repeat 10 ./array2

 Performance counter stats for './array2' (10 runs):

        189.409569 cpu-clock                  ( +-   0.019% )

        0.190007596  seconds time elapsed   ( +-   0.025% )

Which is 63% of the original speed - 37% faster. And no, i first did the 
calculation, then did the measurement of the optimized code.

Now it would be nice to automate such analysis some more within perf - but i 
think i have established the principle well enough that we can use generic 
cache events for such measurements.

Also, we could certainly add more generic events - a stalled-cycles event would 
certainly be useful for example, to collect all (or at least most) 'harmful 
delays' the execution flow can experience. Want to take a stab at that patch?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2
  2011-04-22 20:32           ` Ingo Molnar
@ 2011-04-23  0:03             ` Andi Kleen
  2011-04-23  7:50               ` Peter Zijlstra
  2011-04-23  8:02               ` Ingo Molnar
  0 siblings, 2 replies; 46+ messages in thread
From: Andi Kleen @ 2011-04-23  0:03 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: arun, Stephane Eranian, Arnaldo Carvalho de Melo, linux-kernel,
	Peter Zijlstra, Lin Ming, Arnaldo Carvalho de Melo,
	Thomas Gleixner, Peter Zijlstra, eranian, Arun Sharma,
	Linus Torvalds, Andrew Morton

> > Yes, and note that with instructions events we even have skid-less PEBS 
> > profiling so seeing the precise .
>                                   - location of instructions is possible.

It was better when it was eaten. PEBS does not actually eliminated
skid unfortunately. The interrupt still occurs later, so the
instruction location is off.

PEBS merely gives you more information.

-Andi

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2
  2011-04-23  0:03             ` Andi Kleen
@ 2011-04-23  7:50               ` Peter Zijlstra
  2011-04-23 12:06                 ` Stephane Eranian
  2011-04-24  2:19                 ` Andi Kleen
  2011-04-23  8:02               ` Ingo Molnar
  1 sibling, 2 replies; 46+ messages in thread
From: Peter Zijlstra @ 2011-04-23  7:50 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ingo Molnar, arun, Stephane Eranian, Arnaldo Carvalho de Melo,
	linux-kernel, Lin Ming, Arnaldo Carvalho de Melo,
	Thomas Gleixner, eranian, Arun Sharma, Linus Torvalds,
	Andrew Morton

On Fri, 2011-04-22 at 17:03 -0700, Andi Kleen wrote:
> > > Yes, and note that with instructions events we even have skid-less PEBS 
> > > profiling so seeing the precise .
> >                                   - location of instructions is possible.
> 
> It was better when it was eaten. PEBS does not actually eliminated
> skid unfortunately. The interrupt still occurs later, so the
> instruction location is off.
> 
> PEBS merely gives you more information.

You're so skilled at not actually saying anything useful. Are you
perchance referring to the fact that the IP reported in the PEBS data is
exactly _one_ instruction off? Something that is demonstrated to be
fixable?

Or are you defining skid differently and not telling us your definition?

What are you saying?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2
  2011-04-23  0:03             ` Andi Kleen
  2011-04-23  7:50               ` Peter Zijlstra
@ 2011-04-23  8:02               ` Ingo Molnar
  1 sibling, 0 replies; 46+ messages in thread
From: Ingo Molnar @ 2011-04-23  8:02 UTC (permalink / raw)
  To: Andi Kleen
  Cc: arun, Stephane Eranian, Arnaldo Carvalho de Melo, linux-kernel,
	Peter Zijlstra, Lin Ming, Arnaldo Carvalho de Melo,
	Thomas Gleixner, Peter Zijlstra, eranian, Arun Sharma,
	Linus Torvalds, Andrew Morton


* Andi Kleen <ak@linux.intel.com> wrote:

> > > Yes, and note that with instructions events we even have skid-less PEBS 
> > > profiling so seeing the precise .
> >                                   - location of instructions is possible.
> 
> It was better when it was eaten. PEBS does not actually eliminated
> skid unfortunately. The interrupt still occurs later, so the
> instruction location is off.
> 
> PEBS merely gives you more information.

Have you actually tried perf's PEBS support feature? Try:

  perf record -e instructions:pp ./myapp

(the ':pp' postfix stands for 'precise' and activates PEBS+LBR tricks.)

Look at the perf report --tui annotated asssembly output (or check 'perf 
annotate' directly) and see how precise and skid-less the hits are. Works 
pretty well on Nehalem.

Here's a cache-bound loop with skid (profiled with '-e instructions'):

         :	0000000000400390 <main>:
    0.00 :	  400390:       31 c0                   xor    %eax,%eax
    0.00 :	  400392:       eb 22                   jmp    4003b6 <main+0x26>
   12.08 :	  400394:       fe 84 10 50 08 60 00    incb   0x600850(%rax,%rdx,1)
   87.92 :	  40039b:       48 81 c2 10 27 00 00    add    $0x2710,%rdx
    0.00 :	  4003a2:       48 81 fa 00 e1 f5 05    cmp    $0x5f5e100,%rdx
    0.00 :	  4003a9:       75 e9                   jne    400394 <main+0x4>
    0.00 :	  4003ab:       48 ff c0                inc    %rax
    0.00 :	  4003ae:       48 3d 10 27 00 00       cmp    $0x2710,%rax
    0.00 :	  4003b4:       74 04                   je     4003ba <main+0x2a>
    0.00 :	  4003b6:       31 d2                   xor    %edx,%edx
    0.00 :	  4003b8:       eb da                   jmp    400394 <main+0x4>
    0.00 :	  4003ba:       31 c0                   xor    %eax,%eax

Those 'ADD' instruction hits are bogus: 99% of the cost in this function is in 
the INCB, but the PMU NMI often skids to the next (few) instructions.

Profiled with "-e instructions:pp" we get:

         :	0000000000400390 <main>:
    0.00 :	  400390:       31 c0                   xor    %eax,%eax
    0.00 :	  400392:       eb 22                   jmp    4003b6 <main+0x26>
   85.33 :	  400394:       fe 84 10 50 08 60 00    incb   0x600850(%rax,%rdx,1)
    0.00 :	  40039b:       48 81 c2 10 27 00 00    add    $0x2710,%rdx
   14.67 :	  4003a2:       48 81 fa 00 e1 f5 05    cmp    $0x5f5e100,%rdx
    0.00 :	  4003a9:       75 e9                   jne    400394 <main+0x4>
    0.00 :	  4003ab:       48 ff c0                inc    %rax
    0.00 :	  4003ae:       48 3d 10 27 00 00       cmp    $0x2710,%rax
    0.00 :	  4003b4:       74 04                   je     4003ba <main+0x2a>
    0.00 :	  4003b6:       31 d2                   xor    %edx,%edx
    0.00 :	  4003b8:       eb da                   jmp    400394 <main+0x4>
    0.00 :	  4003ba:       31 c0                   xor    %eax,%eax

The INCB has the most hits as expected - but we also learn that there's 
something about the CMP.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2
  2011-04-23  7:50               ` Peter Zijlstra
@ 2011-04-23 12:06                 ` Stephane Eranian
  2011-04-23 12:36                   ` Ingo Molnar
                                     ` (2 more replies)
  2011-04-24  2:19                 ` Andi Kleen
  1 sibling, 3 replies; 46+ messages in thread
From: Stephane Eranian @ 2011-04-23 12:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andi Kleen, Ingo Molnar, arun, Arnaldo Carvalho de Melo,
	linux-kernel, Lin Ming, Arnaldo Carvalho de Melo,
	Thomas Gleixner, eranian, Arun Sharma, Linus Torvalds,
	Andrew Morton

On Sat, Apr 23, 2011 at 9:50 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, 2011-04-22 at 17:03 -0700, Andi Kleen wrote:
>> > > Yes, and note that with instructions events we even have skid-less PEBS
>> > > profiling so seeing the precise .
>> >                                   - location of instructions is possible.
>>
>> It was better when it was eaten. PEBS does not actually eliminated
>> skid unfortunately. The interrupt still occurs later, so the
>> instruction location is off.
>>
>> PEBS merely gives you more information.
>
> You're so skilled at not actually saying anything useful. Are you
> perchance referring to the fact that the IP reported in the PEBS data is
> exactly _one_ instruction off? Something that is demonstrated to be
> fixable?
>
> Or are you defining skid differently and not telling us your definition?
>

PEBS is guaranteed to return an IP that is just after AN instruction that
caused the event. However, that instruction is NOT the one at the end
of your period. Let's take an example with INST_RETIRED, period=100000.
Then, the IP you get is NOT after the 100,000th retired instruction. It's an
instruction that is N cycles after that one. There is internal skid due to the
way PEBS is implemented.

That is what Andi is referring to. The issue causes bias and thus impacts
the quality of the samples. On SNB, there is a new INST_RETIRED:PREC_DIST
event. PREC_DIST=precise distribution. It tries to correct for the skid
on this event on INST_RETIRED with PEBS (look at Vol3b).

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2
  2011-04-22 20:47             ` Ingo Molnar
@ 2011-04-23 12:13               ` Stephane Eranian
  2011-04-23 12:49                 ` Ingo Molnar
  0 siblings, 1 reply; 46+ messages in thread
From: Stephane Eranian @ 2011-04-23 12:13 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arnaldo Carvalho de Melo, linux-kernel, Andi Kleen,
	Peter Zijlstra, Lin Ming, Arnaldo Carvalho de Melo,
	Thomas Gleixner, Peter Zijlstra, eranian, Arun Sharma,
	Linus Torvalds, Andrew Morton

On Fri, Apr 22, 2011 at 10:47 PM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Stephane Eranian <eranian@google.com> wrote:
>
>> On Fri, Apr 22, 2011 at 3:18 PM, Ingo Molnar <mingo@elte.hu> wrote:
>> >
>> > * Stephane Eranian <eranian@google.com> wrote:
>> >
>> > > > Say i'm a developer and i have an app with such code:
>> > > >
>> > > > #define THOUSAND 1000
>> > > >
>> > > > static char array[THOUSAND][THOUSAND];
>> > > >
>> > > > int init_array(void)
>> > > > {
>> > > >        int i, j;
>> > > >
>> > > >        for (i = 0; i < THOUSAND; i++) {
>> > > >                for (j = 0; j < THOUSAND; j++) {
>> > > >                        array[j][i]++;
>> > > >                }
>> > > >        }
>> > > >
>> > > >        return 0;
>> > > > }
>> > > >
>> > > > Pretty common stuff, right?
>> > > >
>> > > > Using the generalized cache events i can run:
>> > > >
>> > > >  $ perf stat --repeat 10 -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u ./array
>> > > >
>> > > >  Performance counter stats for './array' (10 runs):
>> > > >
>> > > >         6,719,130 cycles:u                   ( +-   0.662% )
>> > > >         5,084,792 instructions:u           #      0.757 IPC     ( +-   0.000% )
>> > > >         1,037,032 l1-dcache-loads:u          ( +-   0.009% )
>> > > >         1,003,604 l1-dcache-load-misses:u    ( +-   0.003% )
>> > > >
>> > > >        0.003802098  seconds time elapsed   ( +-  13.395% )
>> > > >
>> > > > I consider that this is 'bad', because for almost every dcache-load there's a
>> > > > dcache-miss - a 99% L1 cache miss rate!
>> > > >
>> > > > Then i think a bit, notice something, apply this performance optimization:
>> > >
>> > > I don't think this example is really representative of the kind of problems
>> > > people face, it is just too small and obvious. [...]
>> >
>> > Well, the overwhelming majority of performance problems are 'small and obvious'
>>
>> Problems are not simple. Most serious applications these days are huge,
>> hundreds of MB of text, if not GB.
>>
>> In your artificial example, you knew the answer before you started the
>> measurement.
>>
>> Most of the time, applications are assembled out of hundreds of libraries, so
>> no single developers knows all the code. Thus, the performance analyst is
>> faced with a black box most of the time.
>
> I isolated out an example and assumed that you'd agree that identifying hot
> spots is trivial with generic cache events.
>
> My assumption was wrong so let me show you how trivial it really is.
>
> Here's an example with *two* problematic functions (but it could have hundreds,
> it does not matter):
>
> -------------------------------->
> #define THOUSAND 1000
>
> static char array1[THOUSAND][THOUSAND];
>
> static char array2[THOUSAND][THOUSAND];
>
> void func1(void)
> {
>        int i, j;
>
>        for (i = 0; i < THOUSAND; i++)
>                for (j = 0; j < THOUSAND; j++)
>                        array1[i][j]++;
> }
>
> void func2(void)
> {
>        int i, j;
>
>        for (i = 0; i < THOUSAND; i++)
>                for (j = 0; j < THOUSAND; j++)
>                        array2[j][i]++;
> }
>
> int main(void)
> {
>        for (;;) {
>                func1();
>                func2();
>        }
>
>        return 0;
> }
> <--------------------------------
>
> We do not know which one has the cache-misses problem, func1() or func2(), it's
> all a black box, right?
>
> Using generic cache events you simply type this:
>
>  $ perf top -e l1-dcache-load-misses -e l1-dcache-loads
>
> And you get such output:
>
>   PerfTop:    1923 irqs/sec  kernel: 0.0%  exact:  0.0% [l1-dcache-load-misses:u/l1-dcache-loads:u],  (all, 16 CPUs)
> -------------------------------------------------------------------------------------------------------
>
>   weight    samples  pcnt funct DSO
>   ______    _______ _____ _____ ______________________
>
>      1.9       6184 98.8% func2 /home/mingo/opt/array2
>      0.0         69  1.1% func1 /home/mingo/opt/array2
>
> It has pinpointed the problem in func2 *very* precisely.
>
> Obviously this can be used to analyze larger apps as well, with thousands of
> functions, to pinpoint cachemiss problems in specific functions.
>
No, it does not.

As I said before, your example is just to trivial to be representative. You
keep thinking that what you see in the profile pinpoints exactly the instruction
or even the function where the problem always occurs. This is not always
the case. There is skid, and it can be very big, the IP you get may not even
be in the same function where the load was issued.

You cannot generalized based on this example.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2
  2011-04-22 21:03             ` Ingo Molnar
@ 2011-04-23 12:27               ` Stephane Eranian
  0 siblings, 0 replies; 46+ messages in thread
From: Stephane Eranian @ 2011-04-23 12:27 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arnaldo Carvalho de Melo, linux-kernel, Andi Kleen,
	Peter Zijlstra, Lin Ming, Arnaldo Carvalho de Melo,
	Thomas Gleixner, Peter Zijlstra, eranian, Arun Sharma,
	Linus Torvalds, Andrew Morton

On Fri, Apr 22, 2011 at 11:03 PM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Stephane Eranian <eranian@google.com> wrote:
>
>> Let's go back to your example.
>> Performance counter stats for './array' (10 runs):
>>
>>          6,719,130 cycles:u                   ( +-   0.662% )
>>          5,084,792 instructions:u           #      0.757 IPC     ( +-   0.000% )
>>          1,037,032 l1-dcache-loads:u          ( +-   0.009% )
>>          1,003,604 l1-dcache-load-misses:u    ( +-   0.003% )
>>
>> Looking at this I don't think you can pinpoint which function has a problem
>> [...]
>
> In my previous mail i showed how to pinpoint specific functions. You bring up
> an interesting question, cost/benefit analysis:
>
>> [...] and whether or not there is a real problem. You need to evaluate the
>> penalty. Once you know that you can estimate any potential gain from fixing
>> the code. Arun pointed that out rightfully in his answer. How do you know the
>> penalty if you don't decompose some more?
>
> We can measure that even with today's tooling - which doesnt do cost/benefit
> analysis out of box. In my previous example i showed the cachemisses profile:
>
>   weight    samples  pcnt funct DSO
>   ______    _______ _____ _____ ______________________
>
>      1.9       6184 98.8% func2 /home/mingo/opt/array2
>      0.0         69  1.1% func1 /home/mingo/opt/array2
>
> and here's the cycles profile:
>
>             samples  pcnt funct DSO
>             _______ _____ _____ ______________________
>
>             2555.00 67.4% func2 /home/mingo/opt/array2
>             1220.00 32.2% func1 /home/mingo/opt/array2
>
> So, given that there was no other big miss sources:
>
>  $ perf stat -a -e branch-misses:u -e l1-dcache-load-misses:u -e l1-dcache-store-misses:u -e l1-icache-load-misses:u sleep 1
>
>  Performance counter stats for 'sleep 1':
>
>            70,674 branch-misses:u
>       347,992,027 l1-dcache-load-misses:u
>             1,779 l1-dcache-store-misses:u
>             8,007 l1-icache-load-misses:u
>
>        1.000982021  seconds time elapsed
>
> I can tell you that by fixing the cache-misses in that function, the code will
> be roughly 33% faster.
>
33% based on what? l1d-load-misses? The fact that in the same program you have a
problematic function, func1(), and its fixed counter-part func2() +
you know both do
the same thing? How often do you think this happens in real life?

Now, imagine you don't have func2(). Tell me how much of an impact (cycles)
you think func1() is having on the overall execution of a program especially
if is far more complex that your toy example above?

Your arguments would carry more weight if you were to derive them from real
life applications.


> So i fixed the bug, and before it 100 iterations of func1+func2 took 300 msecs:
>
>  $ perf stat -e cpu-clock --repeat 10 ./array2
>
>  Performance counter stats for './array2' (10 runs):
>
>        298.405074 cpu-clock                  ( +-   1.823% )
>
> After the fix it took 190 msecs:
>
>  $ perf stat -e cpu-clock --repeat 10 ./array2
>
>  Performance counter stats for './array2' (10 runs):
>
>        189.409569 cpu-clock                  ( +-   0.019% )
>
>        0.190007596  seconds time elapsed   ( +-   0.025% )
>
> Which is 63% of the original speed - 37% faster. And no, i first did the
> calculation, then did the measurement of the optimized code.
>
> Now it would be nice to automate such analysis some more within perf - but i
> think i have established the principle well enough that we can use generic
> cache events for such measurements.
>
> Also, we could certainly add more generic events - a stalled-cycles event would
> certainly be useful for example, to collect all (or at least most) 'harmful
> delays' the execution flow can experience. Want to take a stab at that patch?
>
> Thanks,
>
>        Ingo
>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2
  2011-04-23 12:06                 ` Stephane Eranian
@ 2011-04-23 12:36                   ` Ingo Molnar
  2011-04-23 13:16                   ` Peter Zijlstra
  2011-04-24  2:15                   ` Andi Kleen
  2 siblings, 0 replies; 46+ messages in thread
From: Ingo Molnar @ 2011-04-23 12:36 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Peter Zijlstra, Andi Kleen, arun, Arnaldo Carvalho de Melo,
	linux-kernel, Lin Ming, Arnaldo Carvalho de Melo,
	Thomas Gleixner, eranian, Arun Sharma, Linus Torvalds,
	Andrew Morton


* Stephane Eranian <eranian@google.com> wrote:

> On Sat, Apr 23, 2011 at 9:50 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Fri, 2011-04-22 at 17:03 -0700, Andi Kleen wrote:
> >> > > Yes, and note that with instructions events we even have skid-less PEBS
> >> > > profiling so seeing the precise .
> >> >                                   - location of instructions is possible.
> >>
> >> It was better when it was eaten. PEBS does not actually eliminated
> >> skid unfortunately. The interrupt still occurs later, so the
> >> instruction location is off.
> >>
> >> PEBS merely gives you more information.
> >
> > You're so skilled at not actually saying anything useful. Are you
> > perchance referring to the fact that the IP reported in the PEBS data is
> > exactly _one_ instruction off? Something that is demonstrated to be
> > fixable?
> >
> > Or are you defining skid differently and not telling us your definition?
> >
> 
> PEBS is guaranteed to return an IP that is just after AN instruction that 
> caused the event. However, that instruction is NOT the one at the end of your 
> period. Let's take an example with INST_RETIRED, period=100000. Then, the IP 
> you get is NOT after the 100,000th retired instruction. It's an instruction 
> that is N cycles after that one. There is internal skid due to the way PEBS 
> is implemented.

You are really misapplying the common-sense definition of 'skid'.

Skid refers to the instruction causing a profiler hit being mis-identified. 
Google 'x86 pmu skid' and read the third entry: your own prior posting ;-)

What you are referring to here is not really classic skid but a small, mostly 
constant skew in the period length with some very small amount of variability. 
It's thus mostly immaterial - at most a second or third order effect with 
typical frequencies of sampling.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2
  2011-04-23 12:13               ` Stephane Eranian
@ 2011-04-23 12:49                 ` Ingo Molnar
  0 siblings, 0 replies; 46+ messages in thread
From: Ingo Molnar @ 2011-04-23 12:49 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Arnaldo Carvalho de Melo, linux-kernel, Andi Kleen,
	Peter Zijlstra, Lin Ming, Arnaldo Carvalho de Melo,
	Thomas Gleixner, Peter Zijlstra, eranian, Arun Sharma,
	Linus Torvalds, Andrew Morton


* Stephane Eranian <eranian@google.com> wrote:

> On Fri, Apr 22, 2011 at 10:47 PM, Ingo Molnar <mingo@elte.hu> wrote:
> >
> > * Stephane Eranian <eranian@google.com> wrote:
> >
> >> On Fri, Apr 22, 2011 at 3:18 PM, Ingo Molnar <mingo@elte.hu> wrote:
> >> >
> >> > * Stephane Eranian <eranian@google.com> wrote:
> >> >
> >> > > > Say i'm a developer and i have an app with such code:
> >> > > >
> >> > > > #define THOUSAND 1000
> >> > > >
> >> > > > static char array[THOUSAND][THOUSAND];
> >> > > >
> >> > > > int init_array(void)
> >> > > > {
> >> > > >        int i, j;
> >> > > >
> >> > > >        for (i = 0; i < THOUSAND; i++) {
> >> > > >                for (j = 0; j < THOUSAND; j++) {
> >> > > >                        array[j][i]++;
> >> > > >                }
> >> > > >        }
> >> > > >
> >> > > >        return 0;
> >> > > > }
> >> > > >
> >> > > > Pretty common stuff, right?
> >> > > >
> >> > > > Using the generalized cache events i can run:
> >> > > >
> >> > > >  $ perf stat --repeat 10 -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u ./array
> >> > > >
> >> > > >  Performance counter stats for './array' (10 runs):
> >> > > >
> >> > > >         6,719,130 cycles:u                   ( +-   0.662% )
> >> > > >         5,084,792 instructions:u           #      0.757 IPC     ( +-   0.000% )
> >> > > >         1,037,032 l1-dcache-loads:u          ( +-   0.009% )
> >> > > >         1,003,604 l1-dcache-load-misses:u    ( +-   0.003% )
> >> > > >
> >> > > >        0.003802098  seconds time elapsed   ( +-  13.395% )
> >> > > >
> >> > > > I consider that this is 'bad', because for almost every dcache-load there's a
> >> > > > dcache-miss - a 99% L1 cache miss rate!
> >> > > >
> >> > > > Then i think a bit, notice something, apply this performance optimization:
> >> > >
> >> > > I don't think this example is really representative of the kind of problems
> >> > > people face, it is just too small and obvious. [...]
> >> >
> >> > Well, the overwhelming majority of performance problems are 'small and obvious'
> >>
> >> Problems are not simple. Most serious applications these days are huge,
> >> hundreds of MB of text, if not GB.
> >>
> >> In your artificial example, you knew the answer before you started the
> >> measurement.
> >>
> >> Most of the time, applications are assembled out of hundreds of libraries, so
> >> no single developers knows all the code. Thus, the performance analyst is
> >> faced with a black box most of the time.
> >
> > I isolated out an example and assumed that you'd agree that identifying hot
> > spots is trivial with generic cache events.
> >
> > My assumption was wrong so let me show you how trivial it really is.
> >
> > Here's an example with *two* problematic functions (but it could have hundreds,
> > it does not matter):
> >
> > -------------------------------->
> > #define THOUSAND 1000
> >
> > static char array1[THOUSAND][THOUSAND];
> >
> > static char array2[THOUSAND][THOUSAND];
> >
> > void func1(void)
> > {
> >        int i, j;
> >
> >        for (i = 0; i < THOUSAND; i++)
> >                for (j = 0; j < THOUSAND; j++)
> >                        array1[i][j]++;
> > }
> >
> > void func2(void)
> > {
> >        int i, j;
> >
> >        for (i = 0; i < THOUSAND; i++)
> >                for (j = 0; j < THOUSAND; j++)
> >                        array2[j][i]++;
> > }
> >
> > int main(void)
> > {
> >        for (;;) {
> >                func1();
> >                func2();
> >        }
> >
> >        return 0;
> > }
> > <--------------------------------
> >
> > We do not know which one has the cache-misses problem, func1() or func2(), it's
> > all a black box, right?
> >
> > Using generic cache events you simply type this:
> >
> >  $ perf top -e l1-dcache-load-misses -e l1-dcache-loads
> >
> > And you get such output:
> >
> >   PerfTop:    1923 irqs/sec  kernel: 0.0%  exact:  0.0% [l1-dcache-load-misses:u/l1-dcache-loads:u],  (all, 16 CPUs)
> > -------------------------------------------------------------------------------------------------------
> >
> >   weight    samples  pcnt funct DSO
> >   ______    _______ _____ _____ ______________________
> >
> >      1.9       6184 98.8% func2 /home/mingo/opt/array2
> >      0.0         69  1.1% func1 /home/mingo/opt/array2
> >
> > It has pinpointed the problem in func2 *very* precisely.
> >
> > Obviously this can be used to analyze larger apps as well, with thousands 
> > of functions, to pinpoint cachemiss problems in specific functions.
>
> No, it does not.

The thing is, you will need to come up with more convincing and concrete 
arguments than a blanket, unsupported "No, it does not" claim.

I *just showed* you an example which you claimed just two mails ago is 
impossible to analyze. I showed an example two functions and claimed that the 
same thing works with 3 or more functions as well: perf top will happily 
display the ones with the highest cachemiss ratio, regardless of how many there 
are.

> As I said before, your example is just to trivial to be representative. You 
> keep thinking that what you see in the profile pinpoints exactly the 
> instruction or even the function where the problem always occurs. This is not 
> always the case. There is skid, and it can be very big, the IP you get may 
> not even be in the same function where the load was issued.

So now you claim a narrow special case (most of the hot-spot overhead skidding 
out of a function) as a counter-proof?

Sometimes skid causes problems - in practice it rarely does, and i do a lot of 
profiling.

Also, i'd expect PEBS to be extended in the future to more and more events - 
including cachemiss events. That will solve this kind of skidding in a pretty 
natural way.

Also, lets analyze your narrow special case: if a function is indeed 
"invisible" to profiling because most overhead skids out of it then there's 
little you can do with raw events to begin with ...

You really need to specifically demonstrate how raw events help your example.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2
  2011-04-23 12:06                 ` Stephane Eranian
  2011-04-23 12:36                   ` Ingo Molnar
@ 2011-04-23 13:16                   ` Peter Zijlstra
  2011-04-25 18:48                     ` Stephane Eranian
  2011-04-25 19:40                     ` Andi Kleen
  2011-04-24  2:15                   ` Andi Kleen
  2 siblings, 2 replies; 46+ messages in thread
From: Peter Zijlstra @ 2011-04-23 13:16 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Andi Kleen, Ingo Molnar, arun, Arnaldo Carvalho de Melo,
	linux-kernel, Lin Ming, Arnaldo Carvalho de Melo,
	Thomas Gleixner, eranian, Arun Sharma, Linus Torvalds,
	Andrew Morton

On Sat, 2011-04-23 at 14:06 +0200, Stephane Eranian wrote:
> On Sat, Apr 23, 2011 at 9:50 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Fri, 2011-04-22 at 17:03 -0700, Andi Kleen wrote:
> >> > > Yes, and note that with instructions events we even have skid-less PEBS
> >> > > profiling so seeing the precise .
> >> >                                   - location of instructions is possible.
> >>
> >> It was better when it was eaten. PEBS does not actually eliminated
> >> skid unfortunately. The interrupt still occurs later, so the
> >> instruction location is off.
> >>
> >> PEBS merely gives you more information.
> >
> > You're so skilled at not actually saying anything useful. Are you
> > perchance referring to the fact that the IP reported in the PEBS data is
> > exactly _one_ instruction off? Something that is demonstrated to be
> > fixable?
> >
> > Or are you defining skid differently and not telling us your definition?
> >
> 
> PEBS is guaranteed to return an IP that is just after AN instruction that
> caused the event. However, that instruction is NOT the one at the end
> of your period. Let's take an example with INST_RETIRED, period=100000.
> Then, the IP you get is NOT after the 100,000th retired instruction. It's an
> instruction that is N cycles after that one. There is internal skid due to the
> way PEBS is implemented.
> 
> That is what Andi is referring to. The issue causes bias and thus impacts
> the quality of the samples. On SNB, there is a new INST_RETIRED:PREC_DIST
> event. PREC_DIST=precise distribution. It tries to correct for the skid
> on this event on INST_RETIRED with PEBS (look at Vol3b).

Sure, but who cares? So your period isn't exactly what you specified,
but the effective period will have an average and a fairly small stdev
(assuming the initial period is much larger than the relatively few
cycles it takes to arm the PEBS assist), therefore you still get a
fairly uniform spread.

I don't much get the obsession with precision here, its all a statistics
game anyway.

And while you keep saying the examples are too trivial and Andi keep
sprouting vague non-statements, neither of you actually provide anything
sensible to the discussion.

So stop f*cking whining and start talking sense or stop talking all
together.

I mean, you were in the room where Intel presented their research on
event correlations based on pathological micro-benches. That clearly
shows that exact event definitions simply don't matter.

Similarly all this precision wanking isn't _that_ important, the big
fish clearly stand out, its only when you start shaving off the last few
cycles that all that really comes in handy, before that its mostly: ooh
thinking is hard, lets go shopping.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH] perf events: Add stalled cycles generic event - PERF_COUNT_HW_STALLED_CYCLES
  2011-04-22 20:30         ` Ingo Molnar
  2011-04-22 20:32           ` Ingo Molnar
@ 2011-04-23 20:14           ` Ingo Molnar
  2011-04-24  6:16             ` Arun Sharma
  1 sibling, 1 reply; 46+ messages in thread
From: Ingo Molnar @ 2011-04-23 20:14 UTC (permalink / raw)
  To: arun
  Cc: Stephane Eranian, Arnaldo Carvalho de Melo, linux-kernel,
	Andi Kleen, Peter Zijlstra, Lin Ming, Arnaldo Carvalho de Melo,
	Thomas Gleixner, Peter Zijlstra, eranian, Arun Sharma,
	Linus Torvalds, Andrew Morton


* Ingo Molnar <mingo@elte.hu> wrote:

> > [...] If there is an expensive load, you'll see that the load instruction 
> > takes many cycles and you can infer that it's a cache miss.
> > 
> > Questions app developers typically ask me:
> > 
> > * If I fix all my top 5 L3 misses how much faster will my app go?
> 
> This has come up: we could add a 'stalled/idle-cycles' generic event - i.e. 
> cycles spent without performing useful work in the pipelines. (Resource-stall 
> events on Intel CPUs.)

How about something like the patch below?

	Ingo
---
Subject: perf events: Add stalled cycles generic event - PERF_COUNT_HW_STALLED_CYCLES
From: Ingo Molnar <mingo@elte.hu>

The new PERF_COUNT_HW_STALLED_CYCLES event tries to approximate
cycles the CPU does nothing useful, because it is stalled on a
cache-miss or some other condition.

Note: this is still a incomplete and will work on Intel Nehalem
      CPUs for now, the intel_perfmon_event_map[] needs to be
      properly split between the major models.

Also update 'perf stat' to print:

           611,527 cycles
           400,553 instructions             # ( 0.7 instructions per cycle )
            77,809 stalled-cycles           # ( 12.7% of all cycles )

        0.000610987  seconds time elapsed

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/kernel/cpu/perf_event_intel.c |    2 ++
 include/linux/perf_event.h             |    1 +
 tools/perf/builtin-stat.c              |   11 +++++++++--
 tools/perf/util/parse-events.c         |    1 +
 tools/perf/util/python.c               |    1 +
 5 files changed, 14 insertions(+), 2 deletions(-)

Index: linux/arch/x86/kernel/cpu/perf_event_intel.c
===================================================================
--- linux.orig/arch/x86/kernel/cpu/perf_event_intel.c
+++ linux/arch/x86/kernel/cpu/perf_event_intel.c
@@ -34,6 +34,8 @@ static const u64 intel_perfmon_event_map
   [PERF_COUNT_HW_BRANCH_INSTRUCTIONS]	= 0x00c4,
   [PERF_COUNT_HW_BRANCH_MISSES]		= 0x00c5,
   [PERF_COUNT_HW_BUS_CYCLES]		= 0x013c,
+  [PERF_COUNT_HW_STALLED_CYCLES]	= 0xffa2, /* 0xff: All reasons, 0xa2: Resource stalls */
+
 };
 
 static struct event_constraint intel_core_event_constraints[] =
Index: linux/include/linux/perf_event.h
===================================================================
--- linux.orig/include/linux/perf_event.h
+++ linux/include/linux/perf_event.h
@@ -52,6 +52,7 @@ enum perf_hw_id {
 	PERF_COUNT_HW_BRANCH_INSTRUCTIONS	= 4,
 	PERF_COUNT_HW_BRANCH_MISSES		= 5,
 	PERF_COUNT_HW_BUS_CYCLES		= 6,
+	PERF_COUNT_HW_STALLED_CYCLES		= 7,
 
 	PERF_COUNT_HW_MAX,			/* non-ABI */
 };
Index: linux/tools/perf/builtin-stat.c
===================================================================
--- linux.orig/tools/perf/builtin-stat.c
+++ linux/tools/perf/builtin-stat.c
@@ -442,7 +442,7 @@ static void abs_printout(int cpu, struct
 		if (total)
 			ratio = avg / total;
 
-		fprintf(stderr, " # %10.3f IPC  ", ratio);
+		fprintf(stderr, " # ( %3.1f instructions per cycle )", ratio);
 	} else if (perf_evsel__match(evsel, HARDWARE, HW_BRANCH_MISSES) &&
 			runtime_branches_stats[cpu].n != 0) {
 		total = avg_stats(&runtime_branches_stats[cpu]);
@@ -450,7 +450,7 @@ static void abs_printout(int cpu, struct
 		if (total)
 			ratio = avg * 100 / total;
 
-		fprintf(stderr, " # %10.3f %%    ", ratio);
+		fprintf(stderr, " # %10.3f %%", ratio);
 
 	} else if (runtime_nsecs_stats[cpu].n != 0) {
 		total = avg_stats(&runtime_nsecs_stats[cpu]);
@@ -459,6 +459,13 @@ static void abs_printout(int cpu, struct
 			ratio = 1000.0 * avg / total;
 
 		fprintf(stderr, " # %10.3f M/sec", ratio);
+	} else if (perf_evsel__match(evsel, HARDWARE, HW_STALLED_CYCLES)) {
+		total = avg_stats(&runtime_cycles_stats[cpu]);
+
+		if (total)
+			ratio = avg / total * 100.0;
+
+		fprintf(stderr, " # (%5.1f%% of all cycles )", ratio);
 	}
 }
 
Index: linux/tools/perf/util/parse-events.c
===================================================================
--- linux.orig/tools/perf/util/parse-events.c
+++ linux/tools/perf/util/parse-events.c
@@ -38,6 +38,7 @@ static struct event_symbol event_symbols
   { CHW(BRANCH_INSTRUCTIONS),	"branch-instructions",	"branches"	},
   { CHW(BRANCH_MISSES),		"branch-misses",	""		},
   { CHW(BUS_CYCLES),		"bus-cycles",		""		},
+  { CHW(STALLED_CYCLES),	"stalled-cycles",	""		},
 
   { CSW(CPU_CLOCK),		"cpu-clock",		""		},
   { CSW(TASK_CLOCK),		"task-clock",		""		},
Index: linux/tools/perf/util/python.c
===================================================================
--- linux.orig/tools/perf/util/python.c
+++ linux/tools/perf/util/python.c
@@ -798,6 +798,7 @@ static struct {
 	{ "COUNT_HW_BRANCH_INSTRUCTIONS", PERF_COUNT_HW_BRANCH_INSTRUCTIONS },
 	{ "COUNT_HW_BRANCH_MISSES",	  PERF_COUNT_HW_BRANCH_MISSES },
 	{ "COUNT_HW_BUS_CYCLES",	  PERF_COUNT_HW_BUS_CYCLES },
+	{ "COUNT_HW_STALLED_CYCLES",	  PERF_COUNT_HW_STALLED_CYCLES },
 	{ "COUNT_HW_CACHE_L1D",		  PERF_COUNT_HW_CACHE_L1D },
 	{ "COUNT_HW_CACHE_L1I",		  PERF_COUNT_HW_CACHE_L1I },
 	{ "COUNT_HW_CACHE_LL",	  	  PERF_COUNT_HW_CACHE_LL },

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2
  2011-04-23 12:06                 ` Stephane Eranian
  2011-04-23 12:36                   ` Ingo Molnar
  2011-04-23 13:16                   ` Peter Zijlstra
@ 2011-04-24  2:15                   ` Andi Kleen
  2 siblings, 0 replies; 46+ messages in thread
From: Andi Kleen @ 2011-04-24  2:15 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Peter Zijlstra, Ingo Molnar, arun, Arnaldo Carvalho de Melo,
	linux-kernel, Lin Ming, Arnaldo Carvalho de Melo,
	Thomas Gleixner, eranian, Arun Sharma, Linus Torvalds,
	Andrew Morton

> 
> That is what Andi is referring to. The issue causes bias and thus impacts
> the quality of the samples. On SNB, there is a new INST_RETIRED:PREC_DIST
> event. PREC_DIST=precise distribution. It tries to correct for the skid
> on this event on INST_RETIRED with PEBS (look at Vol3b).

Unfortunately even PDIST doesn't completely fix the problem, but
it only makes it somewhat better. Also it's only statistical,
so you won't get a guaranteed answer for every sample.

-Andi

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2
  2011-04-23  7:50               ` Peter Zijlstra
  2011-04-23 12:06                 ` Stephane Eranian
@ 2011-04-24  2:19                 ` Andi Kleen
  2011-04-25 17:41                   ` Ingo Molnar
  1 sibling, 1 reply; 46+ messages in thread
From: Andi Kleen @ 2011-04-24  2:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, arun, Stephane Eranian, Arnaldo Carvalho de Melo,
	linux-kernel, Lin Ming, Arnaldo Carvalho de Melo,
	Thomas Gleixner, eranian, Arun Sharma, Linus Torvalds,
	Andrew Morton

> You're so skilled at not actually saying anything useful. Are you
> perchance referring to the fact that the IP reported in the PEBS data is
> exactly _one_ instruction off? Something that is demonstrated to be
> fixable?

It's one instruction off the instruction that was retired when the PEBS
interrupt was ready, but not one instruction off the instruction
that caused the event. There's still skid in triggering the interrupt.

The main good thing about PEBS is that you can get some information
about the state of the instruction, just not the EIP.
For example with the memory latency event you can actually get
the address and memory cache state (as Lin Ming's patchkit implements) 

-Andi

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] perf events: Add stalled cycles generic event - PERF_COUNT_HW_STALLED_CYCLES
  2011-04-23 20:14           ` [PATCH] perf events: Add stalled cycles generic event - PERF_COUNT_HW_STALLED_CYCLES Ingo Molnar
@ 2011-04-24  6:16             ` Arun Sharma
  2011-04-25 17:37               ` Ingo Molnar
                                 ` (3 more replies)
  0 siblings, 4 replies; 46+ messages in thread
From: Arun Sharma @ 2011-04-24  6:16 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: arun, Stephane Eranian, Arnaldo Carvalho de Melo, linux-kernel,
	Andi Kleen, Peter Zijlstra, Lin Ming, Arnaldo Carvalho de Melo,
	Thomas Gleixner, Peter Zijlstra, eranian, Arun Sharma,
	Linus Torvalds, Andrew Morton

On Sat, Apr 23, 2011 at 10:14:09PM +0200, Ingo Molnar wrote:
> 
> The new PERF_COUNT_HW_STALLED_CYCLES event tries to approximate
> cycles the CPU does nothing useful, because it is stalled on a
> cache-miss or some other condition.

Conceptually looks fine. I'd prefer a more precise name such as:
PERF_COUNT_EXECUTION_STALLED_CYCLES (to differentiate from frontend or
retirement stalls).

In the example below:

==> foo.c <==

foo()
{
}


bar()
{
}

==> test.c <==
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define FNV_PRIME_32 16777619
#define FNV_OFFSET_32 2166136261U
uint32_t hash1(const char *s)
{
    uint32_t hash = FNV_OFFSET_32, i;
    for(i = 0; i < 4; i++)
    {
        hash = hash ^ (s[i]); // xor next byte into the bottom of the hash
        hash = hash * FNV_PRIME_32; // Multiply by prime number found to work well
    }
    return hash;
} 

#define FNV_PRIME_WEAK_32 100
#define FNV_OFFSET_WEAK_32 200

uint32_t hash2(const char *s)
{
    uint32_t hash = FNV_OFFSET_WEAK_32, i;
    for(i = 0; i < 4; i++)
    {
        hash = hash ^ (s[i]); // xor next byte into the bottom of the hash
        hash = hash * FNV_PRIME_WEAK_32; // Multiply by prime number found to work well
    }
    return hash;
}

int main()
{
	int r = random();

	while(1) {
		r++;
#ifdef HARD
		if (hash1((const char *) &r) & 0x500)
#else
		if (hash2((const char *) &r) & 0x500)
#endif
			foo();
		else
			bar();
	}
}
==> Makefile <==

all:
	gcc -O2 test.c foo.c -UHARD -o test.easy
	gcc -O2 test.c foo.c -DHARD -o test.hard


# perf stat -e cycles,instructions ./test.hard
^C
 Performance counter stats for './test.hard':

     3,742,855,848 cycles                  
     4,179,896,309 instructions             #      1.117 IPC  

        1.754804730  seconds time elapsed


# perf stat -e cycles,instructions ./test.easy
^C
 Performance counter stats for './test.easy':

     3,932,887,528 cycles                  
     8,994,416,316 instructions             #      2.287 IPC  

        1.843832827  seconds time elapsed

i.e. fixing the branch mispredicts could result in
a nearly 2x speedup for the program.

Looking at:

# perf stat -e  cycles,instructions,branch-misses,cache-misses,RESOURCE_STALLS:ANY ./test.hard
^C
 Performance counter stats for './test.hard':

     3,413,713,048 cycles                    (scaled from 69.93%)
     3,772,611,782 instructions             #      1.105 IPC    (scaled from 80.01%)
        51,113,340 branch-misses             (scaled from 80.01%)
            12,370 cache-misses              (scaled from 80.02%)
        26,656,983 RESOURCE_STALLS:ANY       (scaled from 69.99%)

        1.626595305  seconds time elapsed

it's hard to spot the opporunity. On the other hand:

# ./analyze.py
Percent idle: 27%
        Retirement Stalls: 82%
        Backend Stalls: 0%
        Frontend Stalls: 62%
        Instruction Starvation: 62%
        icache stalls: 0%

does give me a signal about where to look. The script below is
a quick and dirty hack. I haven't really validated it with 
many workloads. I'm posting it here anyway hoping that it'd
result in better kernel support for these types of analyses.

Even if we cover this with various generic PERF_COUNT_*STALL events,
we'll still have a need for other events:

* Things that give info about instruction mixes.

  Ratio of {loads, stores, floating point, branches, conditional branches}
  to total instructions.

* Activity related to micro architecture specific caches

  People using -funroll-loops may have a significant performance opportunity.
  But it's hard to spot bottlenecks in the instruction decoder.

* Monitoring traffic on Hypertransport/QPI links

Like you observe, most people will not look at these events, so
focusing on getting the common events right makes sense. But I
still like access to all events (either via a mapping file or
a library such as libpfm4). Hiding them in "perf list" sounds
like a reasonable way of keeping complexity out.

 -Arun

PS: branch-misses:pp was spot on for the example above.

#!/usr/bin/env python

from optparse import OptionParser
from itertools import izip, chain, repeat
from subprocess import Popen, PIPE
import re, struct

def grouper(n, iterable, padvalue=None):
    "grouper(3, 'abcdefg', 'x') --> ('a','b','c'), ('d','e','f'), ('g','x','x')"
    return izip(*[chain(iterable, repeat(padvalue, n-1))]*n)

counter_re = re.compile('\s+(?P<count>\d+)\s+(?P<event>\S+)')
def sample(events):
    cmd = 'perf stat --no-big-num -a'
    ncounters = 4
    groups = grouper(ncounters, events)
    for g in groups:
        # filter padding
	g = [ e for e in g if e ]
        cmd += ' -e ' + ','.join(g)
    cmd += ' -- sleep ' + str(options.time)
    process = Popen(cmd, shell=True, stdout=PIPE, stderr=PIPE)
    out, err = process.communicate()
    ret = process.poll()
    if ret: raise "Perf failed: " + err
    ret = {}
    for line in err.split('\n'):
        m = counter_re.match(line)
	if not m: continue
	ret[m.group('event')] = long(m.group('count'))
    return ret

def measure_cycles():
    # disable C-states
    f = open("/dev/cpu_dma_latency", "wb")
    f.write(struct.pack("i", 0))
    f.flush()
    saved = options.time
    options.time = 1 # one sec is sufficient to measure clock
    cycles = sample(["cycles"])['cycles']
    cycles /= options.time
    f.close()
    options.time = saved
    return cycles

if __name__ == '__main__':
    parser = OptionParser()
    parser.add_option("-t", "--time", dest="time", default=1,
                      help="How long to sample events")
    parser.add_option("-q", "--quiet",
                      action="store_false", dest="verbose", default=True,
                      help="don't print status messages to stdout")
    
    (options, args) = parser.parse_args()
    cycles_per_sec = measure_cycles()
    c = sample(["cycles", "instructions", "UOPS_ISSUED:ANY:c=1", "UOPS_ISSUED:ANY:c=1:t=1",
                "RESOURCE_STALLS:ANY", "UOPS_RETIRED:ANY:c=1:t=1",
		"UOPS_EXECUTED:PORT015:t=1", "UOPS_EXECUTED:PORT234_CORE",
		"UOPS_ISSUED:ANY:t=1", "UOPS_ISSUED:FUSED:t=1", "UOPS_RETIRED:ANY:t=1",
	        "L1I:CYCLES_STALLED"])
    cycles = c["cycles"] * 1.0
    cycles_no_uops_issued = cycles - c["UOPS_ISSUED:ANY:c=1:t=1"]
    cycles_no_uops_retired = cycles - c["UOPS_RETIRED:ANY:c=1:t=1"]
    backend_stall_cycles = c["RESOURCE_STALLS:ANY"]
    icache_stall_cycles = c["L1I:CYCLES_STALLED"]

    # Cycle stall accounting
    print "Percent idle: %d%%" % ((1 - cycles/(int(options.time) * cycles_per_sec)) * 100)
    print "\tRetirement Stalls: %d%%" % ((cycles_no_uops_retired / cycles) * 100)
    print "\tBackend Stalls: %d%%" % ((backend_stall_cycles / cycles) * 100)
    print "\tFrontend Stalls: %d%%" % ((cycles_no_uops_issued / cycles) * 100)
    print "\tInstruction Starvation: %d%%" % (((cycles_no_uops_issued - backend_stall_cycles)/cycles) * 100)
    print "\ticache stalls: %d%%" % ((icache_stall_cycles/cycles) * 100)

    # Wasted work
    uops_executed = c["UOPS_EXECUTED:PORT015:t=1"] + c["UOPS_EXECUTED:PORT234_CORE"]
    uops_retired = c["UOPS_RETIRED:ANY:t=1"]
    uops_issued = c["UOPS_ISSUED:ANY:t=1"] + c["UOPS_ISSUED:FUSED:t=1"]

    print "\tPercentage useless uops: %d%%" % ((uops_executed - uops_retired) * 100.0/uops_retired)
    print "\tPercentage useless uops issued: %d%%" % ((uops_issued - uops_retired) * 100.0/uops_retired)

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] perf events: Add stalled cycles generic event - PERF_COUNT_HW_STALLED_CYCLES
  2011-04-24  6:16             ` Arun Sharma
@ 2011-04-25 17:37               ` Ingo Molnar
  2011-04-26  9:25               ` Peter Zijlstra
                                 ` (2 subsequent siblings)
  3 siblings, 0 replies; 46+ messages in thread
From: Ingo Molnar @ 2011-04-25 17:37 UTC (permalink / raw)
  To: Arun Sharma
  Cc: arun, Stephane Eranian, Arnaldo Carvalho de Melo, linux-kernel,
	Andi Kleen, Peter Zijlstra, Lin Ming, Arnaldo Carvalho de Melo,
	Thomas Gleixner, Peter Zijlstra, eranian, Linus Torvalds,
	Andrew Morton


* Arun Sharma <asharma@fb.com> wrote:

> On Sat, Apr 23, 2011 at 10:14:09PM +0200, Ingo Molnar wrote:
> > 
> > The new PERF_COUNT_HW_STALLED_CYCLES event tries to approximate
> > cycles the CPU does nothing useful, because it is stalled on a
> > cache-miss or some other condition.
> 
> Conceptually looks fine. I'd prefer a more precise name such as: 
> PERF_COUNT_EXECUTION_STALLED_CYCLES (to differentiate from frontend or 
> retirement stalls).

Ok.

Your script:

> # ./analyze.py
> Percent idle: 27%
>         Retirement Stalls: 82%
>         Backend Stalls: 0%
>         Frontend Stalls: 62%
>         Instruction Starvation: 62%
>         icache stalls: 0%
> 
> does give me a signal about where to look. The script below is
> a quick and dirty hack. I haven't really validated it with 
> many workloads. I'm posting it here anyway hoping that it'd
> result in better kernel support for these types of analyses.

Is pretty useful IMO.

The frontend/backend characterisation is pretty generic - most modern CPUs 
share that and have similar events.

So we could try to generalize these and get most of the statistics your script 
outputs.

> Even if we cover this with various generic PERF_COUNT_*STALL events,
> we'll still have a need for other events:
> 
> * Things that give info about instruction mixes.
> 
>   Ratio of {loads, stores, floating point, branches, conditional branches}
>   to total instructions.

We have this at least partially covered, but yeah, we stopped short of covering 
all instruction types so complete ratios cannot be built yet.

> * Activity related to micro architecture specific caches
> 
>   People using -funroll-loops may have a significant performance opportunity.
>   But it's hard to spot bottlenecks in the instruction decoder.
> 
> * Monitoring traffic on Hypertransport/QPI links

Cross-node accesses ought to be covered by Peter's RFC patch. In terms of 
isolating cross-CPU cache accesses i suspect we could do that too if it really 
matters to analysis in practice.

Basically the way to go about it are the testcases you wrote - they demonstrate 
the utility of a given type of event - and that justifies generalization as 
well.

> Like you observe, most people will not look at these events, so
> focusing on getting the common events right makes sense. But I
> still like access to all events (either via a mapping file or
> a library such as libpfm4). Hiding them in "perf list" sounds
> like a reasonable way of keeping complexity out.

Yes. We have access to raw events for relatively obscure (or too CPU dependent) 
events - but what we do not want to do is to extend that space without adding 
*any* generic event in essence. If something like offcore or uncore PMU support 
is useful enough to be in the kernel, then it should also be useful enough to 
gain generic events.

> PS: branch-misses:pp was spot on for the example above.

heh :-)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2
  2011-04-24  2:19                 ` Andi Kleen
@ 2011-04-25 17:41                   ` Ingo Molnar
  2011-04-25 18:00                     ` Dehao Chen
       [not found]                     ` <BANLkTiks31-pMJe4zCKrppsrA1d6KanJFA@mail.gmail.com>
  0 siblings, 2 replies; 46+ messages in thread
From: Ingo Molnar @ 2011-04-25 17:41 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Peter Zijlstra, arun, Stephane Eranian, Arnaldo Carvalho de Melo,
	linux-kernel, Lin Ming, Arnaldo Carvalho de Melo,
	Thomas Gleixner, eranian, Arun Sharma, Linus Torvalds,
	Andrew Morton


* Andi Kleen <ak@linux.intel.com> wrote:

> > You're so skilled at not actually saying anything useful. Are you
> > perchance referring to the fact that the IP reported in the PEBS data is
> > exactly _one_ instruction off? Something that is demonstrated to be
> > fixable?
> 
> It's one instruction off the instruction that was retired when the PEBS 
> interrupt was ready, but not one instruction off the instruction that caused 
> the event. There's still skid in triggering the interrupt.

Peter answered this in the other mail:

 |
 | Sure, but who cares? So your period isn't exactly what you specified, but 
 | the effective period will have an average and a fairly small stdev (assuming 
 | the initial period is much larger than the relatively few cycles it takes to 
 | arm the PEBS assist), therefore you still get a fairly uniform spread.
 |

... and the resulting low level of noise in the average period length is what 
matters. The instruction itself will still be one of the hotspot instructions, 
statistically.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2
  2011-04-25 17:41                   ` Ingo Molnar
@ 2011-04-25 18:00                     ` Dehao Chen
       [not found]                     ` <BANLkTiks31-pMJe4zCKrppsrA1d6KanJFA@mail.gmail.com>
  1 sibling, 0 replies; 46+ messages in thread
From: Dehao Chen @ 2011-04-25 18:00 UTC (permalink / raw)
  To: linux-kernel

On Tue, Apr 26, 2011 at 1:41 AM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Andi Kleen <ak@linux.intel.com> wrote:
>
> > > You're so skilled at not actually saying anything useful. Are you
> > > perchance referring to the fact that the IP reported in the PEBS data is
> > > exactly _one_ instruction off? Something that is demonstrated to be
> > > fixable?
> >
> > It's one instruction off the instruction that was retired when the PEBS
> > interrupt was ready, but not one instruction off the instruction that caused
> > the event. There's still skid in triggering the interrupt.
>
> Peter answered this in the other mail:
>
>  |
>  | Sure, but who cares? So your period isn't exactly what you specified, but
>  | the effective period will have an average and a fairly small stdev (assuming
>  | the initial period is much larger than the relatively few cycles it takes to
>  | arm the PEBS assist), therefore you still get a fairly uniform spread.
>  |
>
> ... and the resulting low level of noise in the average period length is what
> matters. The instruction itself will still be one of the hotspot instructions,
> statistically.

Not true. This skid will lead to some aggregation and shadow effects
on some certain instructions. To make things worse, these effects are
deterministic and cannot be removed by either sampling for multiple
times or by averaging among instructions within a basic block. As a
result, some actual "hot spot" are not sampled at all. You can simply
try to collect a basic block level CPI, and you'll get a very
misleading profile.

Dehao

>
> Thanks,
>
>        Ingo
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2
       [not found]                     ` <BANLkTiks31-pMJe4zCKrppsrA1d6KanJFA@mail.gmail.com>
@ 2011-04-25 18:05                       ` Ingo Molnar
  2011-04-25 18:39                         ` Stephane Eranian
  0 siblings, 1 reply; 46+ messages in thread
From: Ingo Molnar @ 2011-04-25 18:05 UTC (permalink / raw)
  To: Dehao Chen
  Cc: Andi Kleen, Peter Zijlstra, arun, Stephane Eranian,
	Arnaldo Carvalho de Melo, linux-kernel, Lin Ming,
	Arnaldo Carvalho de Melo, Thomas Gleixner, eranian, Arun Sharma,
	Linus Torvalds, Andrew Morton


* Dehao Chen <danielcdh@gmail.com> wrote:

> > ... and the resulting low level of noise in the average period length is 
> > what matters. The instruction itself will still be one of the hotspot 
> > instructions, statistically.
> 
> Not true. This skid will lead to some aggregation and shadow effects on some 
> certain instructions. To make things worse, these effects are deterministic 
> and cannot be removed by either sampling for multiple times or by averaging 
> among instructions within a basic block. As a result, some actual "hot spot" 
> are not sampled at all. You can simply try to collect a basic block level 
> CPI, and you'll get a very misleading profile.

This certainly does not match the results i'm seeing on real applications, 
using "-e instructions:pp" PEBS+LBR profiling. How do you explain that? Also, 
can you demonstrate your claim with a real example?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2
  2011-04-25 18:05                       ` Ingo Molnar
@ 2011-04-25 18:39                         ` Stephane Eranian
  2011-04-25 19:45                           ` Ingo Molnar
  0 siblings, 1 reply; 46+ messages in thread
From: Stephane Eranian @ 2011-04-25 18:39 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Dehao Chen, Andi Kleen, Peter Zijlstra, arun,
	Arnaldo Carvalho de Melo, linux-kernel, Lin Ming,
	Arnaldo Carvalho de Melo, Thomas Gleixner, eranian, Arun Sharma,
	Linus Torvalds, Andrew Morton

On Mon, Apr 25, 2011 at 8:05 PM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Dehao Chen <danielcdh@gmail.com> wrote:
>
>> > ... and the resulting low level of noise in the average period length is
>> > what matters. The instruction itself will still be one of the hotspot
>> > instructions, statistically.
>>
>> Not true. This skid will lead to some aggregation and shadow effects on some
>> certain instructions. To make things worse, these effects are deterministic
>> and cannot be removed by either sampling for multiple times or by averaging
>> among instructions within a basic block. As a result, some actual "hot spot"
>> are not sampled at all. You can simply try to collect a basic block level
>> CPI, and you'll get a very misleading profile.
>
> This certainly does not match the results i'm seeing on real applications,
> using "-e instructions:pp" PEBS+LBR profiling. How do you explain that? Also,
> can you demonstrate your claim with a real example?
>

LBR removes the off-by-1 IP problem, it does not remove the shadow effect, i.e.,
that blind spot of N cycles caused by the PEBS arming mechanism.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2
  2011-04-23 13:16                   ` Peter Zijlstra
@ 2011-04-25 18:48                     ` Stephane Eranian
  2011-04-25 19:40                     ` Andi Kleen
  1 sibling, 0 replies; 46+ messages in thread
From: Stephane Eranian @ 2011-04-25 18:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andi Kleen, Ingo Molnar, arun, Arnaldo Carvalho de Melo,
	linux-kernel, Lin Ming, Arnaldo Carvalho de Melo,
	Thomas Gleixner, eranian, Arun Sharma, Linus Torvalds,
	Andrew Morton

On Sat, Apr 23, 2011 at 3:16 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Sat, 2011-04-23 at 14:06 +0200, Stephane Eranian wrote:
>> On Sat, Apr 23, 2011 at 9:50 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> > On Fri, 2011-04-22 at 17:03 -0700, Andi Kleen wrote:
>> >> > > Yes, and note that with instructions events we even have skid-less PEBS
>> >> > > profiling so seeing the precise .
>> >> >                                   - location of instructions is possible.
>> >>
>> >> It was better when it was eaten. PEBS does not actually eliminated
>> >> skid unfortunately. The interrupt still occurs later, so the
>> >> instruction location is off.
>> >>
>> >> PEBS merely gives you more information.
>> >
>> > You're so skilled at not actually saying anything useful. Are you
>> > perchance referring to the fact that the IP reported in the PEBS data is
>> > exactly _one_ instruction off? Something that is demonstrated to be
>> > fixable?
>> >
>> > Or are you defining skid differently and not telling us your definition?
>> >
>>
>> PEBS is guaranteed to return an IP that is just after AN instruction that
>> caused the event. However, that instruction is NOT the one at the end
>> of your period. Let's take an example with INST_RETIRED, period=100000.
>> Then, the IP you get is NOT after the 100,000th retired instruction. It's an
>> instruction that is N cycles after that one. There is internal skid due to the
>> way PEBS is implemented.
>>
>> That is what Andi is referring to. The issue causes bias and thus impacts
>> the quality of the samples. On SNB, there is a new INST_RETIRED:PREC_DIST
>> event. PREC_DIST=precise distribution. It tries to correct for the skid
>> on this event on INST_RETIRED with PEBS (look at Vol3b).
>
> Sure, but who cares? So your period isn't exactly what you specified,
> but the effective period will have an average and a fairly small stdev
> (assuming the initial period is much larger than the relatively few
> cycles it takes to arm the PEBS assist), therefore you still get a
> fairly uniform spread.
>
> I don't much get the obsession with precision here, its all a statistics
> game anyway.
>

The particular example I am thinking about came from compiler people
I work with who would like to use PEBS to do statistical basic block profiling.
They do care about correctness of the profile. Otherwise, it may cause wrong
attribution of "hotness" of basic blocks and mislead the compiler when it tries
to reorder blocks on the critical path. Compiler people can validate a
statistical
profile because they have a reference profile obtained via instrumentation of
each basic block.

> And while you keep saying the examples are too trivial and Andi keep
> sprouting vague non-statements, neither of you actually provide anything
> sensible to the discussion.
>
> So stop f*cking whining and start talking sense or stop talking all
> together.
>
> I mean, you were in the room where Intel presented their research on
> event correlations based on pathological micro-benches. That clearly
> shows that exact event definitions simply don't matter.
>
Yes, and I don't get the same reading of the presentation. He never mentioned
generic events. He never even used them, I mean the Intel generic events.
Instead he used very focused Atom-specific events.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2
  2011-04-23 13:16                   ` Peter Zijlstra
  2011-04-25 18:48                     ` Stephane Eranian
@ 2011-04-25 19:40                     ` Andi Kleen
  2011-04-25 19:55                       ` Ingo Molnar
  1 sibling, 1 reply; 46+ messages in thread
From: Andi Kleen @ 2011-04-25 19:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Stephane Eranian, Ingo Molnar, arun, Arnaldo Carvalho de Melo,
	linux-kernel, Lin Ming, Arnaldo Carvalho de Melo,
	Thomas Gleixner, eranian, Arun Sharma, Linus Torvalds,
	Andrew Morton

> Sure, but who cares? So your period isn't exactly what you specified,
> but the effective period will have an average and a fairly small stdev
> (assuming the initial period is much larger than the relatively few
> cycles it takes to arm the PEBS assist), therefore you still get a
> fairly uniform spread.

The skid is not uniform and not necessarily random unfortunately, 
and difficult to correct in a standard way.

> I don't much get the obsession with precision here, its all a statistics
> game anyway.

If you want to make your code faster it's often important to figure
out what exactly is slow.

One example of this we had recently in the kernel: 

function accesses three global objects. Scalability tanks when the test is 
run with more CPUs.  Now the hit is near the three memory accesses. Which one
is the one that is actually bouncing cache lines?

The CPU executes them all in parallel so it's hard to tell. It's
all in the out of order reordering window.

PEBS (e.g. the memory latency event) can give you some information about
which memory access is to blame with the right events, but it's not 
using the RIP.

The generic events won't help with that, because they're still RIP
based, which is not accurate.

> Similarly all this precision wanking isn't _that_ important, the big
> fish clearly stand out, its only when you start shaving off the last few
> cycles that all that really comes in handy, before that its mostly: ooh
> thinking is hard, lets go shopping.

I wish it was that easy.

In the example above it's about scaling or not scaling, which is
definitely not the last cycle, but more a life-and-death 
"is the workload feasible on this machine or not" question.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2
  2011-04-25 18:39                         ` Stephane Eranian
@ 2011-04-25 19:45                           ` Ingo Molnar
  0 siblings, 0 replies; 46+ messages in thread
From: Ingo Molnar @ 2011-04-25 19:45 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Dehao Chen, Andi Kleen, Peter Zijlstra, arun,
	Arnaldo Carvalho de Melo, linux-kernel, Lin Ming,
	Arnaldo Carvalho de Melo, Thomas Gleixner, eranian, Arun Sharma,
	Linus Torvalds, Andrew Morton


* Stephane Eranian <eranian@google.com> wrote:

> > This certainly does not match the results i'm seeing on real applications, 
> > using "-e instructions:pp" PEBS+LBR profiling. How do you explain that? 
> > Also, can you demonstrate your claim with a real example?
> 
> LBR removes the off-by-1 IP problem, it does not remove the shadow effect, 
> i.e., that blind spot of N cycles caused by the PEBS arming mechanism.

I really think you are grasping at straws here - unless you are able to 
demonstrate clear problems - which you have failed to do so far. The pure act
of profiling probably disturbs a typical workload statistically more than a
few cycles skew of the period.

I could imagine artifacts with realy short periods and artificially short and 
dominant hotpaths - but in those cases the skew does not matter much in 
practice: a short and dominant hotpath is pinpointed very easily ...

So i really think it's a non-issue in practice - but you can certainly prove me 
wrong by demonstrating whatever problems you suspect.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2
  2011-04-25 19:40                     ` Andi Kleen
@ 2011-04-25 19:55                       ` Ingo Molnar
  0 siblings, 0 replies; 46+ messages in thread
From: Ingo Molnar @ 2011-04-25 19:55 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Peter Zijlstra, Stephane Eranian, arun, Arnaldo Carvalho de Melo,
	linux-kernel, Lin Ming, Arnaldo Carvalho de Melo,
	Thomas Gleixner, eranian, Arun Sharma, Linus Torvalds,
	Andrew Morton


* Andi Kleen <ak@linux.intel.com> wrote:

> One example of this we had recently in the kernel: 
> 
> function accesses three global objects. Scalability tanks when the test is 
> run with more CPUs.  Now the hit is near the three memory accesses. Which one
> is the one that is actually bouncing cache lines?

that's not an example - you are still only giving vague, untestable, 
unverifiable references. You need to give us something specific and 
reproducible - preferably a testcase.

Peter and me are doing lots of scalability work in the core kernel and for most 
problems i've met it was enough if we knew the function name - the scalability 
problem is typically very obvious from that point on - and an annotated profile 
makes it even more obvious.

I've never met a situation what you describe, that it was not possible to 
disambiguate a real SMP bounce - and i've been fixing SMP bounces in the kernel 
for over ten years.

So you really will have to back up your point with an accurate, reproducible 
testcase - vague statements like the ones you are making i do not accept at 
face value, sorry.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] perf events: Add stalled cycles generic event - PERF_COUNT_HW_STALLED_CYCLES
  2011-04-24  6:16             ` Arun Sharma
  2011-04-25 17:37               ` Ingo Molnar
@ 2011-04-26  9:25               ` Peter Zijlstra
  2011-04-26 14:00               ` Ingo Molnar
  2011-04-27 11:11               ` Ingo Molnar
  3 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2011-04-26  9:25 UTC (permalink / raw)
  To: Arun Sharma
  Cc: Ingo Molnar, arun, Stephane Eranian, Arnaldo Carvalho de Melo,
	linux-kernel, Andi Kleen, Lin Ming, Arnaldo Carvalho de Melo,
	Thomas Gleixner, eranian, Linus Torvalds, Andrew Morton

On Sat, 2011-04-23 at 23:16 -0700, Arun Sharma wrote:
> Conceptually looks fine. I'd prefer a more precise name such as:
> PERF_COUNT_EXECUTION_STALLED_CYCLES (to differentiate from frontend or
> retirement stalls). 

Very nice example!! This is the stuff we want people to do, but instead
of focusing on the raw event aspect, put in a little more effort and see
what it takes to make it work across the board.

None of the things you mention are very specific to Intel, afaik those
concepts you listed: Retirement, Frontend (instruction
decode/uop-issue), Backend (uop execution), I-cache (instruction fetch)
map to pretty much all hardware I know (PMU coverage of these aspects
aside).

So in fact you propose these concepts, and that is the kind of feedback
perf wants and needs.

The thing that set off this whole discussion is that most people don't
seem to believe in concepts and stick to their very narrow HPC every
last cycle matters therefore we need absolute events mentality.

That too is a form of vendor lock, once you're so dependent on a
particular platform the cost of switching increases dramatically.
Furthermore very few people are actually interested in it.

That is not to say we should not enable those people, but the current
state of affairs seems to be that some people are only interested in
enabling that and simply don't care (and don't want to care) about cross
platform performance analysis and useful abstractions.

We'd very much like to make the cost of entry -- the cost of supporting
lowlevel capabilities -- the addition of high level concepts, means for
the greater public to use them.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2
  2011-04-22 16:51         ` Andi Kleen
  2011-04-22 19:57           ` Ingo Molnar
@ 2011-04-26  9:25           ` Peter Zijlstra
  1 sibling, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2011-04-26  9:25 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Stephane Eranian, Ingo Molnar, Arnaldo Carvalho de Melo,
	linux-kernel, Lin Ming, Arnaldo Carvalho de Melo,
	Thomas Gleixner, eranian, Arun Sharma, Linus Torvalds,
	Andrew Morton

On Fri, 2011-04-22 at 09:51 -0700, Andi Kleen wrote:
> 
> Micro architectures are so different. I suspect a "generic" definition would
> need to be so vague as to be useless.
> 
> This in general seems to be the problem of the current cache events.
> 
> Overall for any interesting analysis you need to go CPU specific.
> Abstracted performance analysis is a contradiction in terms.

It might help if you'd talk to your own research department before
making statements like that, they make you look silly.

Intel research has shown that you don't actually need exact definitions
as a side effect of applying machine learning principles in order to
provide machine aided optimizing (ie. clippy style guides for vtune).

They create simple micro-kernels (not our kind of kernels, but more like
the excellent example Arun provided) that trigger a pathological case
and a perfect counter-case and run it over _all_ possible events and do
correlation analysis.

The explicit example given was branch misses on an atom, and they found
(to nobody's great surprise) BR_INST_RETIRED.MISPRED to be the best
correlating event. But that's not the important part.

The important part is that all it needs is a strong correlation, and it
could even be a combination of events, it would just make the analysis a
bit more complex.

Anyway, given a sufficient large set of these pathological cases, you
can train a neural net for your target hardware and then reverse the
situation, run it over an unknown program and have it create suggestions
-> yay clippy!

So given a set of pathological cases and hardware with decent PMU
coverage you can train this thing and get useful results. Exact event
definitions be damned -- it doesn't care.

http://sites.google.com/site/fhpm2010/program/baugh_fhpm2010.pptx?attredirects=0

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] perf events: Add stalled cycles generic event - PERF_COUNT_HW_STALLED_CYCLES
  2011-04-24  6:16             ` Arun Sharma
  2011-04-25 17:37               ` Ingo Molnar
  2011-04-26  9:25               ` Peter Zijlstra
@ 2011-04-26 14:00               ` Ingo Molnar
  2011-04-27 11:11               ` Ingo Molnar
  3 siblings, 0 replies; 46+ messages in thread
From: Ingo Molnar @ 2011-04-26 14:00 UTC (permalink / raw)
  To: Arun Sharma
  Cc: arun, Stephane Eranian, Arnaldo Carvalho de Melo, linux-kernel,
	Andi Kleen, Peter Zijlstra, Lin Ming, Arnaldo Carvalho de Melo,
	Thomas Gleixner, Peter Zijlstra, eranian, Linus Torvalds,
	Andrew Morton


* Arun Sharma <asharma@fb.com> wrote:

> On Sat, Apr 23, 2011 at 10:14:09PM +0200, Ingo Molnar wrote:
> > 
> > The new PERF_COUNT_HW_STALLED_CYCLES event tries to approximate
> > cycles the CPU does nothing useful, because it is stalled on a
> > cache-miss or some other condition.
> 
> Conceptually looks fine. I'd prefer a more precise name such as:
> PERF_COUNT_EXECUTION_STALLED_CYCLES (to differentiate from frontend or
> retirement stalls).

How about this naming convention:

  PERF_COUNT_HW_STALLED_CYCLES                              # execution
  PERF_COUNT_HW_STALLED_CYCLES_FRONTEND                     # frontend
  PERF_COUNT_HW_STALLED_CYCLES_ICACHE_MISS                  # icache

So STALLED_CYCLES would be the most general metric, the one that shows the real 
impact to the application. The other events would then help disambiguate this 
metric some more.

Below is the updated patch - this version makes the backend stalls event 
properly per model. (with the Nehalem table filled in.)

What do you think?

Thanks,

	Ingo

--------------------->
Subject: perf events: Add stalled cycles generic event - PERF_COUNT_HW_STALLED_CYCLES
From: Ingo Molnar <mingo@elte.hu>
Date: Sun Apr 24 08:18:31 CEST 2011

The new PERF_COUNT_HW_STALLED_CYCLES event tries to approximate
cycles the CPU does nothing useful, because it is stalled on a
cache-miss or some other condition.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/kernel/cpu/perf_event_intel.c |   42 +++++++++++++++++++++++++--------
 include/linux/perf_event.h             |    1 
 tools/perf/util/parse-events.c         |    1 
 tools/perf/util/python.c               |    1 
 4 files changed, 36 insertions(+), 9 deletions(-)

Index: linux/arch/x86/kernel/cpu/perf_event_intel.c
===================================================================
--- linux.orig/arch/x86/kernel/cpu/perf_event_intel.c
+++ linux/arch/x86/kernel/cpu/perf_event_intel.c
@@ -36,6 +36,23 @@ static const u64 intel_perfmon_event_map
   [PERF_COUNT_HW_BUS_CYCLES]		= 0x013c,
 };
 
+/*
+ * Other generic events, Nehalem:
+ */
+static const u64 intel_nhm_event_map[] =
+{
+  /* Arch-perfmon events: */
+  [PERF_COUNT_HW_CPU_CYCLES]		= 0x003c,
+  [PERF_COUNT_HW_INSTRUCTIONS]		= 0x00c0,
+  [PERF_COUNT_HW_CACHE_REFERENCES]	= 0x4f2e,
+  [PERF_COUNT_HW_CACHE_MISSES]		= 0x412e,
+  [PERF_COUNT_HW_BRANCH_INSTRUCTIONS]	= 0x00c4,
+  [PERF_COUNT_HW_BRANCH_MISSES]		= 0x00c5,
+  [PERF_COUNT_HW_BUS_CYCLES]		= 0x013c,
+
+  [PERF_COUNT_HW_STALLED_CYCLES]	= 0xffa2, /* 0xff: All reasons, 0xa2: Resource stalls */
+};
+
 static struct event_constraint intel_core_event_constraints[] =
 {
 	INTEL_EVENT_CONSTRAINT(0x11, 0x2), /* FP_ASSIST */
@@ -150,6 +167,12 @@ static u64 intel_pmu_event_map(int hw_ev
 	return intel_perfmon_event_map[hw_event];
 }
 
+static u64 intel_pmu_nhm_event_map(int hw_event)
+{
+	return intel_nhm_event_map[hw_event];
+}
+
+
 static __initconst const u64 snb_hw_cache_event_ids
 				[PERF_COUNT_HW_CACHE_MAX]
 				[PERF_COUNT_HW_CACHE_OP_MAX]
@@ -1400,18 +1423,19 @@ static __init int intel_pmu_init(void)
 	case 26: /* 45 nm nehalem, "Bloomfield" */
 	case 30: /* 45 nm nehalem, "Lynnfield" */
 	case 46: /* 45 nm nehalem-ex, "Beckton" */
-		memcpy(hw_cache_event_ids, nehalem_hw_cache_event_ids,
-		       sizeof(hw_cache_event_ids));
-		memcpy(hw_cache_extra_regs, nehalem_hw_cache_extra_regs,
-		       sizeof(hw_cache_extra_regs));
+		memcpy(hw_cache_event_ids, nehalem_hw_cache_event_ids, sizeof(hw_cache_event_ids));
+		memcpy(hw_cache_extra_regs, nehalem_hw_cache_extra_regs, sizeof(hw_cache_extra_regs));
 
 		intel_pmu_lbr_init_nhm();
 
-		x86_pmu.event_constraints = intel_nehalem_event_constraints;
-		x86_pmu.pebs_constraints = intel_nehalem_pebs_event_constraints;
-		x86_pmu.percore_constraints = intel_nehalem_percore_constraints;
-		x86_pmu.enable_all = intel_pmu_nhm_enable_all;
-		x86_pmu.extra_regs = intel_nehalem_extra_regs;
+		x86_pmu.event_constraints	= intel_nehalem_event_constraints;
+		x86_pmu.pebs_constraints	= intel_nehalem_pebs_event_constraints;
+		x86_pmu.percore_constraints	= intel_nehalem_percore_constraints;
+		x86_pmu.enable_all		= intel_pmu_nhm_enable_all;
+		x86_pmu.extra_regs		= intel_nehalem_extra_regs;
+		x86_pmu.event_map		= intel_pmu_nhm_event_map;
+		x86_pmu.max_events		= ARRAY_SIZE(intel_perfmon_event_map),
+
 		pr_cont("Nehalem events, ");
 		break;
 
Index: linux/include/linux/perf_event.h
===================================================================
--- linux.orig/include/linux/perf_event.h
+++ linux/include/linux/perf_event.h
@@ -52,6 +52,7 @@ enum perf_hw_id {
 	PERF_COUNT_HW_BRANCH_INSTRUCTIONS	= 4,
 	PERF_COUNT_HW_BRANCH_MISSES		= 5,
 	PERF_COUNT_HW_BUS_CYCLES		= 6,
+	PERF_COUNT_HW_STALLED_CYCLES		= 7,
 
 	PERF_COUNT_HW_MAX,			/* non-ABI */
 };
Index: linux/tools/perf/util/parse-events.c
===================================================================
--- linux.orig/tools/perf/util/parse-events.c
+++ linux/tools/perf/util/parse-events.c
@@ -38,6 +38,7 @@ static struct event_symbol event_symbols
   { CHW(BRANCH_INSTRUCTIONS),	"branch-instructions",	"branches"	},
   { CHW(BRANCH_MISSES),		"branch-misses",	""		},
   { CHW(BUS_CYCLES),		"bus-cycles",		""		},
+  { CHW(STALLED_CYCLES),	"stalled-cycles",	""		},
 
   { CSW(CPU_CLOCK),		"cpu-clock",		""		},
   { CSW(TASK_CLOCK),		"task-clock",		""		},
Index: linux/tools/perf/util/python.c
===================================================================
--- linux.orig/tools/perf/util/python.c
+++ linux/tools/perf/util/python.c
@@ -798,6 +798,7 @@ static struct {
 	{ "COUNT_HW_BRANCH_INSTRUCTIONS", PERF_COUNT_HW_BRANCH_INSTRUCTIONS },
 	{ "COUNT_HW_BRANCH_MISSES",	  PERF_COUNT_HW_BRANCH_MISSES },
 	{ "COUNT_HW_BUS_CYCLES",	  PERF_COUNT_HW_BUS_CYCLES },
+	{ "COUNT_HW_STALLED_CYCLES",	  PERF_COUNT_HW_STALLED_CYCLES },
 	{ "COUNT_HW_CACHE_L1D",		  PERF_COUNT_HW_CACHE_L1D },
 	{ "COUNT_HW_CACHE_L1I",		  PERF_COUNT_HW_CACHE_L1I },
 	{ "COUNT_HW_CACHE_LL",	  	  PERF_COUNT_HW_CACHE_LL },

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] perf events: Add stalled cycles generic event - PERF_COUNT_HW_STALLED_CYCLES
  2011-04-24  6:16             ` Arun Sharma
                                 ` (2 preceding siblings ...)
  2011-04-26 14:00               ` Ingo Molnar
@ 2011-04-27 11:11               ` Ingo Molnar
  2011-04-27 14:47                 ` Arun Sharma
  3 siblings, 1 reply; 46+ messages in thread
From: Ingo Molnar @ 2011-04-27 11:11 UTC (permalink / raw)
  To: Arun Sharma
  Cc: arun, Stephane Eranian, Arnaldo Carvalho de Melo, linux-kernel,
	Andi Kleen, Peter Zijlstra, Lin Ming, Arnaldo Carvalho de Melo,
	Thomas Gleixner, Peter Zijlstra, eranian, Linus Torvalds,
	Andrew Morton


* Arun Sharma <asharma@fb.com> wrote:

>     cycles = c["cycles"] * 1.0
>     cycles_no_uops_issued = cycles - c["UOPS_ISSUED:ANY:c=1:t=1"]
>     cycles_no_uops_retired = cycles - c["UOPS_RETIRED:ANY:c=1:t=1"]
>     backend_stall_cycles = c["RESOURCE_STALLS:ANY"]
>     icache_stall_cycles = c["L1I:CYCLES_STALLED"]
> 
>     # Cycle stall accounting
>     print "Percent idle: %d%%" % ((1 - cycles/(int(options.time) * cycles_per_sec)) * 100)
>     print "\tRetirement Stalls: %d%%" % ((cycles_no_uops_retired / cycles) * 100)
>     print "\tBackend Stalls: %d%%" % ((backend_stall_cycles / cycles) * 100)
>     print "\tFrontend Stalls: %d%%" % ((cycles_no_uops_issued / cycles) * 100)
>     print "\tInstruction Starvation: %d%%" % (((cycles_no_uops_issued - backend_stall_cycles)/cycles) * 100)
>     print "\ticache stalls: %d%%" % ((icache_stall_cycles/cycles) * 100)
> 
>     # Wasted work
>     uops_executed = c["UOPS_EXECUTED:PORT015:t=1"] + c["UOPS_EXECUTED:PORT234_CORE"]
>     uops_retired = c["UOPS_RETIRED:ANY:t=1"]
>     uops_issued = c["UOPS_ISSUED:ANY:t=1"] + c["UOPS_ISSUED:FUSED:t=1"]
> 
>     print "\tPercentage useless uops: %d%%" % ((uops_executed - uops_retired) * 100.0/uops_retired)
>     print "\tPercentage useless uops issued: %d%%" % ((uops_issued - uops_retired) * 100.0/uops_retired)

Just an update: i started working on generalizing these events.

As a first step i'd like to introduce stall statistics in default 'perf stat' 
output, then as a second step offer more detailed modi of analysis (like your 
script).

As for the first, 'overview' step, i'd like to use one or two numbers only, to 
give people a general ballpark figure about how good the CPU is performing for 
a given workload.

Wouldnt UOPS_EXECUTED.CORE_ACTIVE_CYCLES,c=1,i=1 be in general a pretty good, 
primary "stall" indicator? This is similar to the "cycles-uops_executed" value 
in your script (UOPS_EXECUTED:PORT015:t=1 and UOPS_EXECUTED:PORT234_CORE 
based): it counts cycles when there's no execution at all - not even 
speculative one.

This would cover a wide variety of 'stall' reasons: external latency, or 
stalling on lack of paralellism in the incoming instruction stream and most 
other stall reasons. So it would measure everything that moves the CPU away 
from 100% utilization.

Secondly, the 'speculative waste' proportion is probably pretty well captured 
via branch-misprediction counts - those are the primary source of filling the 
pipeline with useless work.

So in the most high-level view we could already print useful information via 
the introduction of a single new generic event:

   PERF_COUNT_HW_CPU_CYCLES_BUSY

and 'idle cycles' are "cycles-busy_cycles".

I have implemented preliminary support for this, and this is how the new 
'overview' output looks like currently. Firstly here's the output from a "bad" 
testcase (lots of branch-misses):

 $ perf stat ./branches 20 1

 Performance counter stats for './branches 20 1' (10 runs):

          9.829903 task-clock               #    0.972 CPUs utilized            ( +-  0.07% )
                 0 context-switches         #    0.000 M/sec                    ( +-  0.00% )
                 0 CPU-migrations           #    0.000 M/sec                    ( +-  0.00% )
               111 page-faults              #    0.011 M/sec                    ( +-  0.09% )
        31,470,886 cycles                   #    3.202 GHz                      ( +-  0.06% )
         9,825,068 stalled-cycles           #   31.22% of all cycles are idle   ( +- 13.89% )
        27,868,090 instructions             #    0.89  insns per cycle        
                                            #    0.35  stalled cycles per insn  ( +-  0.02% )
         4,313,661 branches                 #  438.830 M/sec                    ( +-  0.02% )
         1,068,668 branch-misses            #   24.77% of all branches          ( +-  0.01% )

        0.010117520  seconds time elapsed  ( +-  0.14% )

The two important values are the "20.00% of all cycles are idle" and the 
"24.77% of all branches" missed values - both are high and indicative of 
trouble.

The fixed testcase shows:

 Performance counter stats for './branches 20 0' (100 runs):

          4.417987 task-clock               #    0.948 CPUs utilized            ( +-  0.10% )
                 0 context-switches         #    0.000 M/sec                    ( +-  0.00% )
                 0 CPU-migrations           #    0.000 M/sec                    ( +-  0.00% )
               111 page-faults              #    0.025 M/sec                    ( +-  0.02% )
        14,135,368 cycles                   #    3.200 GHz                      ( +-  0.10% )
         1,939,275 stalled-cycles           #   13.72% of all cycles are idle   ( +-  4.99% )
        27,846,610 instructions             #    1.97  insns per cycle        
                                            #    0.07  stalled cycles per insn  ( +-  0.00% )
         4,309,228 branches                 #  975.383 M/sec                    ( +-  0.00% )
             3,992 branch-misses            #    0.09% of all branches          ( +-  0.26% )

        0.004660164  seconds time elapsed  ( +-  0.15% )

Both stall values are much lower and the instructions per cycle value doubled.

Here's another testcase, one that fills the pipeline near-perfectly:

 $ perf stat ./fill_1b

 Performance counter stats for './fill_1b':

       1874.601174 task-clock               #    0.998 CPUs utilized          
                 1 context-switches         #    0.000 M/sec                  
                 0 CPU-migrations           #    0.000 M/sec                  
               107 page-faults              #    0.000 M/sec                  
     6,009,321,149 cycles                   #    3.206 GHz                    
       212,795,827 stalled-cycles           #    3.54% of all cycles are idle 
    18,007,646,574 instructions             #    3.00  insns per cycle        
                                            #    0.01  stalled cycles per insn
     1,001,527,311 branches                 #  534.262 M/sec                  
            16,988 branch-misses            #    0.00% of all branches        

        1.878558311  seconds time elapsed

Here too both counts are very low.

The next step is to provide the tools to further analyze why the CPU is not 
utilized perfectly. I have implemented some preliminary code for that too, 
using generic cache events:

 $ perf stat --repeat 10 --detailed ./array-bad 

 Performance counter stats for './array-bad' (10 runs):

         50.552646 task-clock               #    0.992 CPUs utilized            ( +-  0.04% )
                 0 context-switches         #    0.000 M/sec                    ( +-  0.00% )
                 0 CPU-migrations           #    0.000 M/sec                    ( +-  0.00% )
             1,877 page-faults              #    0.037 M/sec                    ( +-  0.01% )
       142,802,193 cycles                   #    2.825 GHz                      ( +-  0.18% )  (22.55%)
        88,622,411 stalled-cycles           #   62.06% of all cycles are idle   ( +-  0.22% )  (34.97%)
        45,381,755 instructions             #    0.32  insns per cycle        
                                            #    1.95  stalled cycles per insn  ( +-  0.11% )  (46.94%)
         7,725,207 branches                 #  152.815 M/sec                    ( +-  0.05% )  (58.44%)
            29,788 branch-misses            #    0.39% of all branches          ( +-  1.06% )  (69.46%)
         8,421,969 L1-dcache-loads          #  166.598 M/sec                    ( +-  0.37% )  (70.06%)
         7,868,389 L1-dcache-load-misses    #   93.43% of all L1-dcache hits    ( +-  0.13% )  (58.28%)
         4,553,490 LLC-loads                #   90.074 M/sec                    ( +-  0.31% )  (44.49%)
         1,764,277 LLC-load-misses          #   34.900 M/sec                    ( +-  0.21% )  (9.98%)

        0.050973462  seconds time elapsed  ( +-  0.05% )

The --detailed flag is what activates wider counting. The "93.43% of all 
L1-dcache hits" is a giveaway indicator that this particular workload is 
primarily data-access limited and that much of it escapes into RAM as well.

Is this the direction you'd like to see perf stat to move into? Any comments, 
suggestions?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] perf events: Add stalled cycles generic event - PERF_COUNT_HW_STALLED_CYCLES
  2011-04-27 11:11               ` Ingo Molnar
@ 2011-04-27 14:47                 ` Arun Sharma
  2011-04-27 15:48                   ` Ingo Molnar
  0 siblings, 1 reply; 46+ messages in thread
From: Arun Sharma @ 2011-04-27 14:47 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arun Sharma, Stephane Eranian, Arnaldo Carvalho de Melo,
	linux-kernel, Andi Kleen, Peter Zijlstra, Lin Ming,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Peter Zijlstra,
	eranian, Linus Torvalds, Andrew Morton

On Wed, Apr 27, 2011 at 4:11 AM, Ingo Molnar <mingo@elte.hu> wrote:
> As for the first, 'overview' step, i'd like to use one or two numbers only, to
> give people a general ballpark figure about how good the CPU is performing for
> a given workload.
>
> Wouldnt UOPS_EXECUTED.CORE_ACTIVE_CYCLES,c=1,i=1 be in general a pretty good,
> primary "stall" indicator? This is similar to the "cycles-uops_executed" value
> in your script (UOPS_EXECUTED:PORT015:t=1 and UOPS_EXECUTED:PORT234_CORE
> based): it counts cycles when there's no execution at all - not even
> speculative one.

If we're going to pick one stall indicator, why not pick cycles where
no uops are retiring?

cycles_no_uops_retired = cycles - c["UOPS_RETIRED:ANY:c=1:t=1"]

In the presence of C-states and some halted cycles, I found that I
couldn't measure it via UOPS_RETIRED:ANY:c=1:i=1 because it counts
halted cycles too and could be greater than (unhalted) cycles.

The other issue I had to deal with was UOPS_RETIRED > UOPS_EXECUTED
condition. I believe this is caused by what AMD calls sideband stack
optimizer and Intel calls dedicated stack manager (i.e. UOPS executed
outside the main pipeline). A recursive fibonacci(30) is a good test
case for reproducing this.

>
> Is this the direction you'd like to see perf stat to move into? Any comments,
> suggestions?
>

Looks like a step in the right direction. Thanks.

 -Arun

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] perf events: Add stalled cycles generic event - PERF_COUNT_HW_STALLED_CYCLES
  2011-04-27 14:47                 ` Arun Sharma
@ 2011-04-27 15:48                   ` Ingo Molnar
  2011-04-27 16:27                     ` Ingo Molnar
  2011-04-27 19:03                     ` Arun Sharma
  0 siblings, 2 replies; 46+ messages in thread
From: Ingo Molnar @ 2011-04-27 15:48 UTC (permalink / raw)
  To: Arun Sharma
  Cc: Arun Sharma, Stephane Eranian, Arnaldo Carvalho de Melo,
	linux-kernel, Andi Kleen, Peter Zijlstra, Lin Ming,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Peter Zijlstra,
	eranian, Linus Torvalds, Andrew Morton


* Arun Sharma <arun@sharma-home.net> wrote:

> On Wed, Apr 27, 2011 at 4:11 AM, Ingo Molnar <mingo@elte.hu> wrote:
> > As for the first, 'overview' step, i'd like to use one or two numbers only, to
> > give people a general ballpark figure about how good the CPU is performing for
> > a given workload.
> >
> > Wouldnt UOPS_EXECUTED.CORE_ACTIVE_CYCLES,c=1,i=1 be in general a pretty good,
> > primary "stall" indicator? This is similar to the "cycles-uops_executed" value
> > in your script (UOPS_EXECUTED:PORT015:t=1 and UOPS_EXECUTED:PORT234_CORE
> > based): it counts cycles when there's no execution at all - not even
> > speculative one.
> 
> If we're going to pick one stall indicator, [...]

Well, one stall indicator for the 'general overview' stage, plus branch misses.

Other stages can also have all sorts of details, including various subsets of 
stall reasons. (and stalls of different units of the CPU)

We'll see how far it can be pushed.

> [...] why not pick cycles where no uops are retiring?
> 
> cycles_no_uops_retired = cycles - c["UOPS_RETIRED:ANY:c=1:t=1"]
>
> In the presence of C-states and some halted cycles, I found that I couldn't 
> measure it via UOPS_RETIRED:ANY:c=1:i=1 because it counts halted cycles too 
> and could be greater than (unhalted) cycles.

Agreed, good point.

You are right that it is more robust to pick 'the CPU was busy on our behalf' 
metric instead of a 'CPU is idle' metric, because that way 'HLT' as a special 
type of idling around does not have to be identified.

HLT is not an issue for the default 'perf stat' behavior (because it only 
measures task execution, never the idle thread or other tasks not involved with 
the workload), but for per CPU and system-wide (--all) it matters.

I'll flip it around.

> The other issue I had to deal with was UOPS_RETIRED > UOPS_EXECUTED 
> condition. I believe this is caused by what AMD calls sideband stack 
> optimizer and Intel calls dedicated stack manager (i.e. UOPS executed outside 
> the main pipeline). A recursive fibonacci(30) is a good test case for 
> reproducing this.

So the PORT015+234 sum is not precise? The definition seems to be rather firm:

  Counts number of Uops executed that where issued on port 2, 3, or 4.
  Counts number of Uops executed that where issued on port 0, 1, or 5.

Wouldnt that include all uops?

> > Is this the direction you'd like to see perf stat to move into? Any 
> > comments, suggestions?
> 
> Looks like a step in the right direction. Thanks.

Ok, great - will keep you updated. I doubt the defaults can ever beat truly 
expert use of PMU events: there will always be fine details that a generic 
approach will miss. But i'd be happy if we got 70% the way ...

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] perf events: Add stalled cycles generic event - PERF_COUNT_HW_STALLED_CYCLES
  2011-04-27 15:48                   ` Ingo Molnar
@ 2011-04-27 16:27                     ` Ingo Molnar
  2011-04-27 19:05                       ` Arun Sharma
  2011-04-27 19:03                     ` Arun Sharma
  1 sibling, 1 reply; 46+ messages in thread
From: Ingo Molnar @ 2011-04-27 16:27 UTC (permalink / raw)
  To: Arun Sharma
  Cc: Arun Sharma, Stephane Eranian, Arnaldo Carvalho de Melo,
	linux-kernel, Andi Kleen, Peter Zijlstra, Lin Ming,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Peter Zijlstra,
	eranian, Linus Torvalds, Andrew Morton


* Ingo Molnar <mingo@elte.hu> wrote:

> > [...] why not pick cycles where no uops are retiring?
> > 
> > cycles_no_uops_retired = cycles - c["UOPS_RETIRED:ANY:c=1:t=1"]
> >
> > In the presence of C-states and some halted cycles, I found that I couldn't 
> > measure it via UOPS_RETIRED:ANY:c=1:i=1 because it counts halted cycles too 
> > and could be greater than (unhalted) cycles.
> 
> Agreed, good point.
> 
> You are right that it is more robust to pick 'the CPU was busy on our behalf' 
> metric instead of a 'CPU is idle' metric, because that way 'HLT' as a special 
> type of idling around does not have to be identified.

Sidenote, there's one advantage of the idle event: it's more meaningful to 
profile idle cycles - and it's easy to ignore the HLT loop in the profile 
output (we already do).

That way we get a 'hidden overhead' profile: a profile of frequently executed 
code which executes in the CPU in a suboptimal way.

So we should probably offer both events.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] perf events: Add stalled cycles generic event - PERF_COUNT_HW_STALLED_CYCLES
  2011-04-27 15:48                   ` Ingo Molnar
  2011-04-27 16:27                     ` Ingo Molnar
@ 2011-04-27 19:03                     ` Arun Sharma
  1 sibling, 0 replies; 46+ messages in thread
From: Arun Sharma @ 2011-04-27 19:03 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arun Sharma, Stephane Eranian, Arnaldo Carvalho de Melo,
	linux-kernel, Andi Kleen, Peter Zijlstra, Lin Ming,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Peter Zijlstra,
	eranian, Linus Torvalds, Andrew Morton

On Wed, Apr 27, 2011 at 8:48 AM, Ingo Molnar <mingo@elte.hu> wrote:

>
>> The other issue I had to deal with was UOPS_RETIRED > UOPS_EXECUTED
>> condition. I believe this is caused by what AMD calls sideband stack
>> optimizer and Intel calls dedicated stack manager (i.e. UOPS executed outside
>> the main pipeline). A recursive fibonacci(30) is a good test case for
>> reproducing this.
>
> So the PORT015+234 sum is not precise? The definition seems to be rather firm:
>
>  Counts number of Uops executed that where issued on port 2, 3, or 4.
>  Counts number of Uops executed that where issued on port 0, 1, or 5.
>

There is some work done outside of the main out of order engine for
power optimization reasons:

Described as dedicated stack engine here:
http://www.intel.com/technology/itj/2003/volume07issue02/art03_pentiumm/vol7iss2_art03.pdf

However, I can't seem to be able to reproduce this behavior using a
micro benchmark right now:

# cat foo.s
.text
        .global main
main:
1:
        push %rax
        push %rbx
        push %rcx
        push %rdx
        pop  %rax
        pop  %rbx
        pop  %rcx
        pop  %rdx
        jmp  1b

 Performance counter stats for './foo':

     7,755,881,073 UOPS_ISSUED:ANY:t=1       (scaled from 79.98%)
    10,569,957,988 UOPS_RETIRED:ANY:t=1      (scaled from 79.96%)
     9,155,400,383 UOPS_EXECUTED:PORT234_CORE  (scaled from 80.02%)
     2,594,206,312 UOPS_EXECUTED:PORT015:t=1  (scaled from 80.02%)

Perhaps I was thinking of UOPS_ISSUED < UOPS_RETIRED.

In general, UOPS_RETIRED (or instruction retirement in general) is the
"source of truth" in an otherwise crazy world and might be more
interesting as a generalized event that works on multiple
architectures.

 -Arun

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] perf events: Add stalled cycles generic event - PERF_COUNT_HW_STALLED_CYCLES
  2011-04-27 16:27                     ` Ingo Molnar
@ 2011-04-27 19:05                       ` Arun Sharma
  0 siblings, 0 replies; 46+ messages in thread
From: Arun Sharma @ 2011-04-27 19:05 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arun Sharma, Stephane Eranian, Arnaldo Carvalho de Melo,
	linux-kernel, Andi Kleen, Peter Zijlstra, Lin Ming,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Peter Zijlstra,
	eranian, Linus Torvalds, Andrew Morton

On Wed, Apr 27, 2011 at 9:27 AM, Ingo Molnar <mingo@elte.hu> wrote:

>
> That way we get a 'hidden overhead' profile: a profile of frequently executed
> code which executes in the CPU in a suboptimal way.
>
> So we should probably offer both events.
>

Yes - certainly.

 -Arun

^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2011-04-27 19:05 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-04-22  8:47 [PATCH 1/1] perf tools: Add missing user space support for config1/config2 Stephane Eranian
2011-04-22  9:23 ` Ingo Molnar
2011-04-22  9:41   ` Stephane Eranian
2011-04-22 10:52     ` [generalized cache events] " Ingo Molnar
2011-04-22 12:04       ` Stephane Eranian
2011-04-22 13:18         ` Ingo Molnar
2011-04-22 20:31           ` Stephane Eranian
2011-04-22 20:47             ` Ingo Molnar
2011-04-23 12:13               ` Stephane Eranian
2011-04-23 12:49                 ` Ingo Molnar
2011-04-22 21:03             ` Ingo Molnar
2011-04-23 12:27               ` Stephane Eranian
2011-04-22 16:51         ` Andi Kleen
2011-04-22 19:57           ` Ingo Molnar
2011-04-26  9:25           ` Peter Zijlstra
2011-04-22 16:50       ` arun
2011-04-22 17:00         ` Andi Kleen
2011-04-22 20:30         ` Ingo Molnar
2011-04-22 20:32           ` Ingo Molnar
2011-04-23  0:03             ` Andi Kleen
2011-04-23  7:50               ` Peter Zijlstra
2011-04-23 12:06                 ` Stephane Eranian
2011-04-23 12:36                   ` Ingo Molnar
2011-04-23 13:16                   ` Peter Zijlstra
2011-04-25 18:48                     ` Stephane Eranian
2011-04-25 19:40                     ` Andi Kleen
2011-04-25 19:55                       ` Ingo Molnar
2011-04-24  2:15                   ` Andi Kleen
2011-04-24  2:19                 ` Andi Kleen
2011-04-25 17:41                   ` Ingo Molnar
2011-04-25 18:00                     ` Dehao Chen
     [not found]                     ` <BANLkTiks31-pMJe4zCKrppsrA1d6KanJFA@mail.gmail.com>
2011-04-25 18:05                       ` Ingo Molnar
2011-04-25 18:39                         ` Stephane Eranian
2011-04-25 19:45                           ` Ingo Molnar
2011-04-23  8:02               ` Ingo Molnar
2011-04-23 20:14           ` [PATCH] perf events: Add stalled cycles generic event - PERF_COUNT_HW_STALLED_CYCLES Ingo Molnar
2011-04-24  6:16             ` Arun Sharma
2011-04-25 17:37               ` Ingo Molnar
2011-04-26  9:25               ` Peter Zijlstra
2011-04-26 14:00               ` Ingo Molnar
2011-04-27 11:11               ` Ingo Molnar
2011-04-27 14:47                 ` Arun Sharma
2011-04-27 15:48                   ` Ingo Molnar
2011-04-27 16:27                     ` Ingo Molnar
2011-04-27 19:05                       ` Arun Sharma
2011-04-27 19:03                     ` Arun Sharma

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.