* Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2
@ 2011-04-22  8:47 Stephane Eranian
  2011-04-22  9:23 ` Ingo Molnar
  0 siblings, 1 reply; 80+ messages in thread

From: Stephane Eranian @ 2011-04-22 8:47 UTC (permalink / raw)
To: Ingo Molnar
Cc: Arnaldo Carvalho de Melo, linux-kernel, Andi Kleen, Peter Zijlstra,
    Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner, Peter Zijlstra,
    eranian, Arun Sharma

On Fri, Apr 22, 2011 at 10:06 AM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Ingo Molnar <mingo@elte.hu> wrote:
>
>> This needs to be a *lot* more user friendly. Users do not want to type in
>> stupid hexa magic numbers to get profiling. We have moved beyond the oprofile
>> era really.
>>
>> Unless there's proper generalized and human usable support i'm leaning
>> towards turning off the offcore user-space accessible raw bits for now, and
>> use them only kernel-internally, for the cache events.
>

Generic cache events are a myth. They are not usable. I keep getting
questions from users because nobody knows what they are actually counting,
thus nobody knows how to interpret the counts. You cannot really hide the
micro-architecture if you want to make any sensible measurements.

I agree with the poor usability of perf when you have to pass hex values
for events. But that's why I have a user-level library to map event strings
to event codes for perf. Arun Sharma posted a patch a while ago to connect
this library with perf; so far it's been ignored, it seems:

   perf stat -e offcore_response_0:dmd_data_rd foo

> I'm about to push out the patch attached below - it lays out the arguments in
> detail. I don't think we have time to fix this properly for .39 - but memory
> profiling could be a nice feature for v2.6.40.
>

You will not be able to do any reasonable memory profiling using offcore
response events. Don't expect a profile to point to the missing loads. If
you're lucky, it will point to the use instruction.
> --------------------->
>
> From b52c55c6a25e4515b5e075a989ff346fc251ed09 Mon Sep 17 00:00:00 2001
> From: Ingo Molnar <mingo@elte.hu>
> Date: Fri, 22 Apr 2011 08:44:38 +0200
> Subject: [PATCH] x86, perf event: Turn off unstructured raw event access to offcore registers
>
> Andi Kleen pointed out that the Intel offcore support patches were merged
> without user-space tool support to the functionality:
>
> |
> | The offcore_msr perf kernel code was merged into 2.6.39-rc*, but the
> | user space bits were not. This made it impossible to set the extra mask
> | and actually do the OFFCORE profiling
> |
>
> Andi submitted a preliminary patch for user-space support, as an
> extension to perf's raw event syntax:
>
> |
> | Some raw events -- like the Intel OFFCORE events -- support additional
> | parameters. These can be appended after a ':'.
> |
> | For example on a multi socket Intel Nehalem:
> |
> |	perf stat -e r1b7:20ff -a sleep 1
> |
> | Profile the OFFCORE_RESPONSE.ANY_REQUEST with event mask REMOTE_DRAM_0
> | that measures any access to DRAM on another socket.
> |
>
> But this kind of usability is absolutely unacceptable - users should not
> be expected to type in magic, CPU and model specific incantations to get
> access to useful hardware functionality.
>
> The proper solution is to expose useful offcore functionality via
> generalized events - that way users do not have to care which specific
> CPU model they are using, they can use the conceptual event and not some
> model specific quirky hexa number.
>
> We already have such generalization in place for CPU cache events,
> and it's all very extensible.
>
> "Offcore" events measure general DRAM access patterns along various
> parameters. They are particularly useful in NUMA systems.
>
> We want to support them via generalized DRAM events: either as the
> fourth level of cache (after the last-level cache), or as a separate
> generalization category.
>
> That way user-space support would be very obvious, memory access
> profiling could be done via self-explanatory commands like:
>
>	perf record -e dram ./myapp
>	perf record -e dram-remote ./myapp
>
> ... to measure DRAM accesses or more expensive cross-node NUMA DRAM
> accesses.
>
> These generalized events would work on all CPUs and architectures that
> have comparable PMU features.
>
> ( Note, these are just examples: actual implementation could have more
>   sophistication and more parameters - as long as they center around
>   similarly simple usecases. )
>
> Now we do not want to revert *all* of the current offcore bits, as they
> are still somewhat useful for generic last-level-cache events, implemented
> in this commit:
>
>    e994d7d23a0b: perf: Fix LLC-* events on Intel Nehalem/Westmere
>
> But we definitely do not yet want to expose the unstructured raw events
> to user-space, until better generalization and usability is implemented
> for these hardware event features.
>
> ( Note: after generalization has been implemented raw offcore events can be
>   supported as well: there can always be an odd event that is marginally
>   useful but not useful enough to generalize. DRAM profiling is definitely
>   *not* such a category so generalization must be done first. )
>
> Furthermore, PERF_TYPE_RAW access to these registers was not intended
> to go upstream without proper support - it was a side-effect of the above
> e994d7d23a0b commit, not mentioned in the changelog.
>
> As v2.6.39 is nearing release we go for the simplest approach: disable
> the PERF_TYPE_RAW offcore hack for now, before it escapes into a released
> kernel and becomes an ABI.
>
> Once proper structure is implemented for these hardware events and users
> are offered usable solutions we can revisit this issue.
>
> Reported-by: Andi Kleen <ak@linux.intel.com>
> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
> Cc: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Link: http://lkml.kernel.org/r/1302658203-4239-1-git-send-email-andi@firstfloor.org
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
> ---
>  arch/x86/kernel/cpu/perf_event.c |    6 +++++-
>  1 files changed, 5 insertions(+), 1 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
> index eed3673a..632e5dc 100644
> --- a/arch/x86/kernel/cpu/perf_event.c
> +++ b/arch/x86/kernel/cpu/perf_event.c
> @@ -586,8 +586,12 @@ static int x86_setup_perfctr(struct perf_event *event)
> 			return -EOPNOTSUPP;
> 	}
>
> +	/*
> +	 * Do not allow config1 (extended registers) to propagate,
> +	 * there's no sane user-space generalization yet:
> +	 */
> 	if (attr->type == PERF_TYPE_RAW)
> -		return x86_pmu_extra_regs(event->attr.config, event);
> +		return 0;
>
> 	if (attr->type == PERF_TYPE_HW_CACHE)
> 		return set_ext_hw_attr(hwc, event);

^ permalink raw reply	[flat|nested] 80+ messages in thread
* Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2
  2011-04-22  8:47 [PATCH 1/1] perf tools: Add missing user space support for config1/config2 Stephane Eranian
@ 2011-04-22  9:23 ` Ingo Molnar
  2011-04-22  9:41   ` Stephane Eranian
  0 siblings, 1 reply; 80+ messages in thread

From: Ingo Molnar @ 2011-04-22 9:23 UTC (permalink / raw)
To: Stephane Eranian
Cc: Arnaldo Carvalho de Melo, linux-kernel, Andi Kleen, Peter Zijlstra,
    Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner, Peter Zijlstra,
    eranian, Arun Sharma

* Stephane Eranian <eranian@google.com> wrote:

> On Fri, Apr 22, 2011 at 10:06 AM, Ingo Molnar <mingo@elte.hu> wrote:
> >
> > * Ingo Molnar <mingo@elte.hu> wrote:
> >
> >> This needs to be a *lot* more user friendly. Users do not want to type in
> >> stupid hexa magic numbers to get profiling. We have moved beyond the oprofile
> >> era really.
> >>
> >> Unless there's proper generalized and human usable support i'm leaning
> >> towards turning off the offcore user-space accessible raw bits for now, and
> >> use them only kernel-internally, for the cache events.
>
> Generic cache events are a myth. They are not usable. I keep getting
> questions from users because nobody knows what they are actually counting,
> thus nobody knows how to interpret the counts. You cannot really hide the
> micro-architecture if you want to make any sensible measurements.
Well:

 aldebaran:~> perf stat --repeat 10 -e instructions -e L1-dcache-loads -e L1-dcache-load-misses -e LLC-misses ./hackbench 10
 Time: 0.125
 Time: 0.136
 Time: 0.180
 Time: 0.103
 Time: 0.097
 Time: 0.125
 Time: 0.104
 Time: 0.125
 Time: 0.114
 Time: 0.158

 Performance counter stats for './hackbench 10' (10 runs):

        2,102,556,398 instructions             #      0.000 IPC     ( +-   1.179% )
          843,957,634 L1-dcache-loads          ( +-   1.295% )
          130,007,361 L1-dcache-load-misses    ( +-   3.281% )
            6,328,938 LLC-misses               ( +-   3.969% )

        0.146160287  seconds time elapsed   ( +-   5.851% )

It's certainly useful if you want to get ballpark figures about cache behavior
of an app and want to do comparisons.

There are inconsistencies in our generic cache events - but that's not really a
reason to obscure their usage behind nonsensical microarchitecture-specific
details.

But i'm definitely in favor of making these generalized events more consistent
across different CPU types. Can you list examples of inconsistencies that we
should resolve? (and which you possibly consider impossible to resolve, right?)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 80+ messages in thread
* Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2
  2011-04-22  9:23 ` Ingo Molnar
@ 2011-04-22  9:41   ` Stephane Eranian
  2011-04-22 10:52     ` [generalized cache events] " Ingo Molnar
  0 siblings, 1 reply; 80+ messages in thread

From: Stephane Eranian @ 2011-04-22 9:41 UTC (permalink / raw)
To: Ingo Molnar
Cc: Arnaldo Carvalho de Melo, linux-kernel, Andi Kleen, Peter Zijlstra,
    Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner, Peter Zijlstra,
    eranian, Arun Sharma

On Fri, Apr 22, 2011 at 11:23 AM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Stephane Eranian <eranian@google.com> wrote:
>
>> On Fri, Apr 22, 2011 at 10:06 AM, Ingo Molnar <mingo@elte.hu> wrote:
>> >
>> > * Ingo Molnar <mingo@elte.hu> wrote:
>> >
>> >> This needs to be a *lot* more user friendly. Users do not want to type in
>> >> stupid hexa magic numbers to get profiling. We have moved beyond the oprofile
>> >> era really.
>> >>
>> >> Unless there's proper generalized and human usable support i'm leaning
>> >> towards turning off the offcore user-space accessible raw bits for now, and
>> >> use them only kernel-internally, for the cache events.
>>
>> Generic cache events are a myth. They are not usable. I keep getting
>> questions from users because nobody knows what they are actually counting,
>> thus nobody knows how to interpret the counts. You cannot really hide the
>> micro-architecture if you want to make any sensible measurements.
>
> Well:
>
>  aldebaran:~> perf stat --repeat 10 -e instructions -e L1-dcache-loads -e L1-dcache-load-misses -e LLC-misses ./hackbench 10
>  Time: 0.125
>  Time: 0.136
>  Time: 0.180
>  Time: 0.103
>  Time: 0.097
>  Time: 0.125
>  Time: 0.104
>  Time: 0.125
>  Time: 0.114
>  Time: 0.158
>
>  Performance counter stats for './hackbench 10' (10 runs):
>
>        2,102,556,398 instructions             #      0.000 IPC     ( +-   1.179% )
>          843,957,634 L1-dcache-loads          ( +-   1.295% )
>          130,007,361 L1-dcache-load-misses    ( +-   3.281% )
>            6,328,938 LLC-misses               ( +-   3.969% )
>
>        0.146160287  seconds time elapsed   ( +-   5.851% )
>
> It's certainly useful if you want to get ballpark figures about cache behavior
> of an app and want to do comparisons.
>

What can you conclude from the above counts? Are they good or bad? If they
are bad, how do you go about fixing the app?

> There are inconsistencies in our generic cache events - but that's not really a
> reason to obscure their usage behind nonsensical microarchitecture-specific
> details.
>

The actual events are a reflection of the micro-architecture. They indirectly
describe how it works. It is not clear to me that you can really improve your
app without some exposure to the micro-architecture. So if you want to have
generic events, I am fine with this, but you should not block access to actual
events pretending they are useless. Some people are certainly interested in
using them and learning about the micro-architecture of their processor.

> But i'm definitely in favor of making these generalized events more consistent
> across different CPU types. Can you list examples of inconsistencies that we
> should resolve? (and which you possibly consider impossible to resolve, right?)
>

To make generic events more uniform across processors, one would have to have
precise definitions as to what they are supposed to count. Once you have that,
then we may have a better chance at finding consistent mappings for each
processor. I have not yet seen such definitions.
^ permalink raw reply [flat|nested] 80+ messages in thread
* [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2
  2011-04-22  9:41 ` Stephane Eranian
@ 2011-04-22 10:52   ` Ingo Molnar
  2011-04-22 12:04     ` Stephane Eranian
  2011-04-22 16:50     ` arun
  0 siblings, 2 replies; 80+ messages in thread

From: Ingo Molnar @ 2011-04-22 10:52 UTC (permalink / raw)
To: Stephane Eranian
Cc: Arnaldo Carvalho de Melo, linux-kernel, Andi Kleen, Peter Zijlstra,
    Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner, Peter Zijlstra,
    eranian, Arun Sharma, Linus Torvalds, Andrew Morton

* Stephane Eranian <eranian@google.com> wrote:

> >> Generic cache events are a myth. They are not usable. [...]
> >
> > Well:
> >
> >  aldebaran:~> perf stat --repeat 10 -e instructions -e L1-dcache-loads -e L1-dcache-load-misses -e LLC-misses ./hackbench 10
> > Time: 0.125
> > Time: 0.136
> > Time: 0.180
> > Time: 0.103
> > Time: 0.097
> > Time: 0.125
> > Time: 0.104
> > Time: 0.125
> > Time: 0.114
> > Time: 0.158
> >
> > Performance counter stats for './hackbench 10' (10 runs):
> >
> >        2,102,556,398 instructions             #      0.000 IPC     ( +-   1.179% )
> >          843,957,634 L1-dcache-loads          ( +-   1.295% )
> >          130,007,361 L1-dcache-load-misses    ( +-   3.281% )
> >            6,328,938 LLC-misses               ( +-   3.969% )
> >
> >        0.146160287  seconds time elapsed   ( +-   5.851% )
> >
> > It's certainly useful if you want to get ballpark figures about cache behavior
> > of an app and want to do comparisons.
> >
> What can you conclude from the above counts?
> Are they good or bad? If they are bad, how do you go about fixing the app?

So let me give you a simplified example.

Say i'm a developer and i have an app with such code:

#define THOUSAND 1000

static char array[THOUSAND][THOUSAND];

int init_array(void)
{
	int i, j;

	for (i = 0; i < THOUSAND; i++) {
		for (j = 0; j < THOUSAND; j++) {
			array[j][i]++;
		}
	}

	return 0;
}

Pretty common stuff, right?
Using the generalized cache events i can run:

 $ perf stat --repeat 10 -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u ./array

 Performance counter stats for './array' (10 runs):

         6,719,130 cycles:u                    ( +-   0.662% )
         5,084,792 instructions:u           #      0.757 IPC     ( +-   0.000% )
         1,037,032 l1-dcache-loads:u           ( +-   0.009% )
         1,003,604 l1-dcache-load-misses:u     ( +-   0.003% )

        0.003802098  seconds time elapsed   ( +-  13.395% )

I consider that this is 'bad', because for almost every dcache-load there's a
dcache-miss - a 99% L1 cache miss rate!

Then i think a bit, notice something, apply this performance optimization:

diff --git a/array.c b/array.c
index 4758d9a..d3f7037 100644
--- a/array.c
+++ b/array.c
@@ -9,7 +9,7 @@ int init_array(void)

 	for (i = 0; i < THOUSAND; i++) {
 		for (j = 0; j < THOUSAND; j++) {
-			array[j][i]++;
+			array[i][j]++;
 		}
 	}

I re-run perf-stat:

 $ perf stat --repeat 10 -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u ./array

 Performance counter stats for './array' (10 runs):

         2,395,407 cycles:u                    ( +-   0.365% )
         5,084,788 instructions:u           #      2.123 IPC     ( +-   0.000% )
         1,035,731 l1-dcache-loads:u           ( +-   0.006% )
             3,955 l1-dcache-load-misses:u     ( +-   4.872% )

        0.001806438  seconds time elapsed   ( +-   3.831% )

And i'm happy that indeed the l1-dcache misses are now super-low and that the
app got much faster as well - the cycle count is a third of what it was before
the optimization!

Note that:

 - I got absolute numbers in the right ballpark figure: i got a million loads as
   expected (the array has 1 million elements), and 1 million cache-misses in
   the 'bad' case.

 - I did not care which specific Intel CPU model this was running on

 - I did not care about *any* microarchitectural details - i only knew it's a
   reasonably modern CPU with caching

 - I did not care how i could get access to L1 load and miss events. The events
   were named obviously and it just worked.

So no, kernel driven generalization and sane tooling is not at all a 'myth'
today, really.
So this is the general direction in which we want to move on. If you know about
problems with existing generalization definitions then let's *fix* them, not
pretend that generalizations and sane workflows are impossible ...

Thanks,

	Ingo

^ permalink raw reply related	[flat|nested] 80+ messages in thread
* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2
  2011-04-22 10:52 ` [generalized cache events] " Ingo Molnar
@ 2011-04-22 12:04   ` Stephane Eranian
  2011-04-22 13:18     ` Ingo Molnar
  2011-04-22 16:51     ` Andi Kleen
  0 siblings, 2 replies; 80+ messages in thread

From: Stephane Eranian @ 2011-04-22 12:04 UTC (permalink / raw)
To: Ingo Molnar
Cc: Arnaldo Carvalho de Melo, linux-kernel, Andi Kleen, Peter Zijlstra,
    Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner, Peter Zijlstra,
    eranian, Arun Sharma, Linus Torvalds, Andrew Morton

On Fri, Apr 22, 2011 at 12:52 PM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Stephane Eranian <eranian@google.com> wrote:
>
>> >> Generic cache events are a myth. They are not usable. [...]
>> >
>> > Well:
>> >
>> > aldebaran:~> perf stat --repeat 10 -e instructions -e L1-dcache-loads -e L1-dcache-load-misses -e LLC-misses ./hackbench 10
>> > Time: 0.125
>> > Time: 0.136
>> > Time: 0.180
>> > Time: 0.103
>> > Time: 0.097
>> > Time: 0.125
>> > Time: 0.104
>> > Time: 0.125
>> > Time: 0.114
>> > Time: 0.158
>> >
>> > Performance counter stats for './hackbench 10' (10 runs):
>> >
>> >        2,102,556,398 instructions             #      0.000 IPC     ( +-   1.179% )
>> >          843,957,634 L1-dcache-loads          ( +-   1.295% )
>> >          130,007,361 L1-dcache-load-misses    ( +-   3.281% )
>> >            6,328,938 LLC-misses               ( +-   3.969% )
>> >
>> >        0.146160287 seconds time elapsed ( +- 5.851% )
>> >
>> > It's certainly useful if you want to get ballpark figures about cache behavior
>> > of an app and want to do comparisons.
>> >
>> What can you conclude from the above counts?
>> Are they good or bad? If they are bad, how do you go about fixing the app?
>
> So let me give you a simplified example.
>
> Say i'm a developer and i have an app with such code:
>
> #define THOUSAND 1000
>
> static char array[THOUSAND][THOUSAND];
>
> int init_array(void)
> {
> 	int i, j;
>
> 	for (i = 0; i < THOUSAND; i++) {
> 		for (j = 0; j < THOUSAND; j++) {
> 			array[j][i]++;
> 		}
> 	}
>
> 	return 0;
> }
>
> Pretty common stuff, right?
>
> Using the generalized cache events i can run:
>
> $ perf stat --repeat 10 -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u ./array
>
> Performance counter stats for './array' (10 runs):
>
>         6,719,130 cycles:u                    ( +-   0.662% )
>         5,084,792 instructions:u           #      0.757 IPC     ( +-   0.000% )
>         1,037,032 l1-dcache-loads:u           ( +-   0.009% )
>         1,003,604 l1-dcache-load-misses:u     ( +-   0.003% )
>
>        0.003802098  seconds time elapsed   ( +-  13.395% )
>
> I consider that this is 'bad', because for almost every dcache-load there's a
> dcache-miss - a 99% L1 cache miss rate!
>
> Then i think a bit, notice something, apply this performance optimization:
>

I don't think this example is really representative of the kind of problems
people face; it is just too small and obvious. So I would not generalize on it.

If you are happy with generalized cache events then, as I said, I am fine with
it. But the API should ALWAYS allow users access to raw events when they need
finer-grained analysis.
> diff --git a/array.c b/array.c
> index 4758d9a..d3f7037 100644
> --- a/array.c
> +++ b/array.c
> @@ -9,7 +9,7 @@ int init_array(void)
>
>  	for (i = 0; i < THOUSAND; i++) {
>  		for (j = 0; j < THOUSAND; j++) {
> -			array[j][i]++;
> +			array[i][j]++;
>  		}
>  	}
>
> I re-run perf-stat:
>
> $ perf stat --repeat 10 -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u ./array
>
> Performance counter stats for './array' (10 runs):
>
>         2,395,407 cycles:u                    ( +-   0.365% )
>         5,084,788 instructions:u           #      2.123 IPC     ( +-   0.000% )
>         1,035,731 l1-dcache-loads:u           ( +-   0.006% )
>             3,955 l1-dcache-load-misses:u     ( +-   4.872% )
>
> - I got absolute numbers in the right ballpark figure: i got a million loads as
>   expected (the array has 1 million elements), and 1 million cache-misses in
>   the 'bad' case.
>
> - I did not care which specific Intel CPU model this was running on
>
> - I did not care about *any* microarchitectural details - i only knew it's a
>   reasonably modern CPU with caching
>
> - I did not care how i could get access to L1 load and miss events. The events
>   were named obviously and it just worked.
>
> So no, kernel driven generalization and sane tooling is not at all a 'myth'
> today, really.
>
> So this is the general direction in which we want to move on. If you know about
> problems with existing generalization definitions then let's *fix* them, not
> pretend that generalizations and sane workflows are impossible ...
>

Again, to fix them, you need to give us definitions for what you expect those
events to count. Otherwise we cannot make forward progress.

Let me give just one simple example: cycles

What is your definition of the generic cycle event?

There are various flavors:
  - count halted, unhalted cycles?
  - impacted by frequency scaling?

LLC-misses:
  - what is considered the LLC?
  - does it include code, data or both?
  - does it include demand, hw prefetch?
  - is it local or remote DRAM?
Once you have clear and precise definitions, then we can look at the actual
events and figure out a mapping.

^ permalink raw reply	[flat|nested] 80+ messages in thread
* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2
  2011-04-22 12:04 ` Stephane Eranian
@ 2011-04-22 13:18   ` Ingo Molnar
  2011-04-22 20:31     ` Stephane Eranian
  2011-04-22 16:51     ` Andi Kleen
  1 sibling, 1 reply; 80+ messages in thread

From: Ingo Molnar @ 2011-04-22 13:18 UTC (permalink / raw)
To: Stephane Eranian
Cc: Arnaldo Carvalho de Melo, linux-kernel, Andi Kleen, Peter Zijlstra,
    Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner, Peter Zijlstra,
    eranian, Arun Sharma, Linus Torvalds, Andrew Morton

* Stephane Eranian <eranian@google.com> wrote:

> > Say i'm a developer and i have an app with such code:
> >
> > #define THOUSAND 1000
> >
> > static char array[THOUSAND][THOUSAND];
> >
> > int init_array(void)
> > {
> > 	int i, j;
> >
> > 	for (i = 0; i < THOUSAND; i++) {
> > 		for (j = 0; j < THOUSAND; j++) {
> > 			array[j][i]++;
> > 		}
> > 	}
> >
> > 	return 0;
> > }
> >
> > Pretty common stuff, right?
> >
> > Using the generalized cache events i can run:
> >
> > $ perf stat --repeat 10 -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u ./array
> >
> > Performance counter stats for './array' (10 runs):
> >
> >         6,719,130 cycles:u                    ( +-   0.662% )
> >         5,084,792 instructions:u           #      0.757 IPC     ( +-   0.000% )
> >         1,037,032 l1-dcache-loads:u           ( +-   0.009% )
> >         1,003,604 l1-dcache-load-misses:u     ( +-   0.003% )
> >
> >        0.003802098  seconds time elapsed   ( +-  13.395% )
> >
> > I consider that this is 'bad', because for almost every dcache-load there's a
> > dcache-miss - a 99% L1 cache miss rate!
> >
> > Then i think a bit, notice something, apply this performance optimization:
>
> I don't think this example is really representative of the kind of problems
> people face, it is just too small and obvious. [...]

Well, the overwhelming majority of performance problems are 'small and obvious'
- once a tool roughly pinpoints their existence and location!
And you have not offered a counterexample either, so you have not really
demonstrated what you consider a 'real' example and why you consider
generalized cache events inadequate.

> [...] So I would not generalize on it.

To the contrary, it demonstrates the most fundamental concept of cache
profiling: looking at the hits/misses ratios and identifying hotspots.

That concept can be applied pretty nicely to all sorts of applications.

Interestingly, the exact hardware event doesn't even *matter* for most
problems, as long as it *correlates* with the conceptual entity we want to
measure.

So what we need are hardware events that correlate with:

 - loads done
 - stores done
 - load misses suffered
 - store misses suffered
 - branches done
 - branches missed
 - instructions executed

It is the *ratio* that matters in most cases: before-change versus
after-change, hits versus misses, etc.

Yes, there will be imprecisions, CPU quirks, limitations and speculation
effects - but as long as we keep our eyes on the ball, generalizations are
useful for solving practical problems.

> If you are happy with generalized cache events then, as I said, I am fine
> with it. But the API should ALWAYS allow users access to raw events when they
> need finer-grained analysis.

Well, that's a pretty far cry from calling it a 'myth' :-)

So my point is (outlined in detail in the common changelog) that we need sane
generalized remote DRAM events *first* - before we think about exposing the
'rest' of the offcore-PMU as raw events.
> > diff --git a/array.c b/array.c
> > index 4758d9a..d3f7037 100644
> > --- a/array.c
> > +++ b/array.c
> > @@ -9,7 +9,7 @@ int init_array(void)
> >
> >  	for (i = 0; i < THOUSAND; i++) {
> >  		for (j = 0; j < THOUSAND; j++) {
> > -			array[j][i]++;
> > +			array[i][j]++;
> >  		}
> >  	}
> >
> > I re-run perf-stat:
> >
> > $ perf stat --repeat 10 -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u ./array
> >
> > Performance counter stats for './array' (10 runs):
> >
> >         2,395,407 cycles:u                    ( +-   0.365% )
> >         5,084,788 instructions:u           #      2.123 IPC     ( +-   0.000% )
> >         1,035,731 l1-dcache-loads:u           ( +-   0.006% )
> >             3,955 l1-dcache-load-misses:u     ( +-   4.872% )
> >
> > - I got absolute numbers in the right ballpark figure: i got a million loads as
> >   expected (the array has 1 million elements), and 1 million cache-misses in
> >   the 'bad' case.
> >
> > - I did not care which specific Intel CPU model this was running on
> >
> > - I did not care about *any* microarchitectural details - i only knew it's a
> >   reasonably modern CPU with caching
> >
> > - I did not care how i could get access to L1 load and miss events. The events
> >   were named obviously and it just worked.
> >
> > So no, kernel driven generalization and sane tooling is not at all a 'myth'
> > today, really.
> >
> > So this is the general direction in which we want to move on. If you know about
> > problems with existing generalization definitions then let's *fix* them, not
> > pretend that generalizations and sane workflows are impossible ...
>
> Again, to fix them, you need to give us definitions for what you expect those
> events to count. Otherwise we cannot make forward progress.

No, we do not 'need' to give exact definitions. This whole topic is more
analogous to physics than to mathematics. See my description above about how
ratios and high-level structure matter more than absolute values and
definitions.
Yes, if we can, then 'loads' and 'stores' should correspond to the number of
loads a program flow does, which you get if you look at the assembly code.
'Instructions' should correspond to the number of instructions executed.

If the CPU cannot do it, it's not a huge deal in practice - we will cope and
hopefully it will all be fixed in future CPU versions.

That said, most CPUs i have access to get the fundamentals right, so it's not
like we have huge problems in practice. Key CPU statistics are available.

> Let me give just one simple example: cycles
>
> What is your definition of the generic cycle event?
>
> There are various flavors:
>   - count halted, unhalted cycles?

Again i think you are getting lost in too much detail. For typical developers
halted versus unhalted is mostly an uninteresting distinction, as people tend
to just type 'perf record ./myapp', which is per workload profiling so it
excludes idle time. So it would give the same result to them regardless of
whether it's halted or unhalted cycles.

( This simple example already shows the idiocy of the hardware names, calling
  cycles events "CPU_CLK_UNHALTED.REF". In most cases the developer does *not*
  care about those distinctions so the defaults should not be complicated with
  them. )

>   - impacted by frequency scaling?

The best default for developers is a frequency scaling invariant result - i.e.
one that is not against a reference clock but against the real CPU clock.

( Even that one will not be completely invariant due to the frequency-scaling
  dependent cost of misses and bus ops, etc. )

But profiling against a reference frequency makes sense as well, especially
for system-wide profiling - this is the hardware equivalent of the cpu-clock /
elapsed time metric. We could implement the cpu-clock using reference cycles
events for example.

> LLC-misses:
>   - what is considered the LLC?

The last level cache is whichever cache sits before DRAM.

>   - does it include code, data or both?
Both if possible as they tend to be unified caches anyway.

>   - does it include demand, hw prefetch?

Do you mean for the LLC-prefetch events? What would be your suggestion, which
is the most useful metric? Prefetches are not directly done by program logic
so this is borderline. We wanted to include them for completeness - and the
metric should probably include 'all activities that program flow has not
caused directly and which may be sucking up system resources' - i.e. including
hw prefetch.

>   - is it local or remote DRAM?

The current definitions should include both. Measuring remote DRAM accesses is
of course useful - that is the original point of this thread. It should be
done as an additional layer, basically local RAM is yet another cache level -
but we can take other generalized approaches as well, if they make more sense.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 80+ messages in thread
* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2
  2011-04-22 13:18 ` Ingo Molnar
@ 2011-04-22 20:31   ` Stephane Eranian
  2011-04-22 20:47     ` Ingo Molnar
  2011-04-22 21:03     ` Ingo Molnar
  0 siblings, 2 replies; 80+ messages in thread

From: Stephane Eranian @ 2011-04-22 20:31 UTC (permalink / raw)
To: Ingo Molnar
Cc: Arnaldo Carvalho de Melo, linux-kernel, Andi Kleen, Peter Zijlstra,
    Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner, Peter Zijlstra,
    eranian, Arun Sharma, Linus Torvalds, Andrew Morton

On Fri, Apr 22, 2011 at 3:18 PM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Stephane Eranian <eranian@google.com> wrote:
>
> > > Say i'm a developer and i have an app with such code:
> > >
> > > #define THOUSAND 1000
> > >
> > > static char array[THOUSAND][THOUSAND];
> > >
> > > int init_array(void)
> > > {
> > > 	int i, j;
> > >
> > > 	for (i = 0; i < THOUSAND; i++) {
> > > 		for (j = 0; j < THOUSAND; j++) {
> > > 			array[j][i]++;
> > > 		}
> > > 	}
> > >
> > > 	return 0;
> > > }
> > >
> > > Pretty common stuff, right?
> > >
> > > Using the generalized cache events i can run:
> > >
> > > $ perf stat --repeat 10 -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u ./array
> > >
> > > Performance counter stats for './array' (10 runs):
> > >
> > >         6,719,130 cycles:u                    ( +-   0.662% )
> > >         5,084,792 instructions:u           #      0.757 IPC     ( +-   0.000% )
> > >         1,037,032 l1-dcache-loads:u           ( +-   0.009% )
> > >         1,003,604 l1-dcache-load-misses:u     ( +-   0.003% )
> > >
> > >        0.003802098  seconds time elapsed   ( +-  13.395% )
> > >
> > > I consider that this is 'bad', because for almost every dcache-load there's a
> > > dcache-miss - a 99% L1 cache miss rate!
> > >
> > > Then i think a bit, notice something, apply this performance optimization:
> >
> > I don't think this example is really representative of the kind of problems
> > people face, it is just too small and obvious. [...]
> > Well, the overwhelming majority of performance problems are 'small and obvious'

Problems are not simple. Most serious applications these days are huge, hundreds of MB of text, if not GB.

In your artificial example, you knew the answer before you started the measurement.

Most of the time, applications are assembled out of hundreds of libraries, so no single developer knows all the code. Thus, the performance analyst is faced with a black box most of the time.

Let's go back to your example.

Performance counter stats for './array' (10 runs):

        6,719,130 cycles:u                   ( +- 0.662% )
        5,084,792 instructions:u           # 0.757 IPC     ( +- 0.000% )
        1,037,032 l1-dcache-loads:u          ( +- 0.009% )
        1,003,604 l1-dcache-load-misses:u    ( +- 0.003% )

Looking at this I don't think you can pinpoint which function has a problem and whether or not there is a real problem. You need to evaluate the penalty. Once you know that, you can estimate any potential gain from fixing the code. Arun rightfully pointed that out in his answer. How do you know the penalty if you don't decompose some more?

Seeing cache misses does not by itself mean they are harmful. They could be generated by the HW prefetchers and be harmless to your program. Does l1-dcache-load-misses include HW prefetchers? Imagine it does on Intel Nehalem. Do you guarantee it does on AMD Shanghai as well? If not, then how do I compare?

As I said, if you are only interested in generic events, then that's fine with me as long as you don't prevent others from accessing raw events. As people have pointed out, I don't see why raw events would be harmful to users. If you don't want to learn about them, just stick with the generic events.

> - once a tool roughly pinpoints their existence and location!
>
> And you have not offered a counter example either so you have not really
> demonstrated what you consider a 'real' example and why you consider
> generalized cache events inadequate.
>
> > [...] So I would not generalize on it.
> To the contrary, it demonstrates the most fundamental concept of cache
> profiling: looking at the hits/misses ratios and identifying hotspots.
>
> That concept can be applied pretty nicely to all sorts of applications.
>
> Interestingly, the exact hardware event doesn't even *matter* for most problems, as
> long as it *correlates* with the conceptual entity we want to measure.
>
> So what we need are hardware events that correlate with:
>
>  - loads done
>  - stores done
>  - load misses suffered
>  - store misses suffered
>  - branches done
>  - branches missed
>  - instructions executed
>
> It is the *ratio* that matters in most cases: before-change versus
> after-change, hits versus misses, etc.
>
> Yes, there will be imprecisions, CPU quirks, limitations and speculation
> effects - but as long as we keep our eyes on the ball, generalizations are
> useful for solving practical problems.
>
> > If you are happy with generalized cache events then, as I said, I am fine
> > with it. But the API should ALWAYS allow users access to raw events when they
> > need finer grain analysis.
>
> Well, that's a pretty far cry from calling it a 'myth' :-)
>
> So my point is (outlined in detail in the common changelog) that we need sane
> generalized remote DRAM events *first* - before we think about exposing the
> 'rest' of the offcore-PMU as raw events.
> > > > diff --git a/array.c b/array.c > > > index 4758d9a..d3f7037 100644 > > > --- a/array.c > > > +++ b/array.c > > > @@ -9,7 +9,7 @@ int init_array(void) > > > > > > for (i = 0; i < THOUSAND; i++) { > > > for (j = 0; j < THOUSAND; j++) { > > > - array[j][i]++; > > > + array[i][j]++; > > > } > > > } > > > > > > I re-run perf-stat: > > > > > > $ perf stat --repeat 10 -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u ./array > > > > > > Performance counter stats for './array' (10 runs): > > > > > > 2,395,407 cycles:u ( +- 0.365% ) > > > 5,084,788 instructions:u # 2.123 IPC ( +- 0.000% ) > > > 1,035,731 l1-dcache-loads:u ( +- 0.006% ) > > > 3,955 l1-dcache-load-misses:u ( +- 4.872% ) > > > > > > - I got absolute numbers in the right ballpark figure: i got a million loads as > > > expected (the array has 1 million elements), and 1 million cache-misses in > > > the 'bad' case. > > > > > > - I did not care which specific Intel CPU model this was running on > > > > > > - I did not care about *any* microarchitectural details - i only knew it's a > > > reasonably modern CPU with caching > > > > > > - I did not care how i could get access to L1 load and miss events. The events > > > were named obviously and it just worked. > > > > > > So no, kernel driven generalization and sane tooling is not at all a 'myth' > > > today, really. > > > > > > So this is the general direction in which we want to move on. If you know about > > > problems with existing generalization definitions then lets *fix* them, not > > > pretend that generalizations and sane workflows are impossible ... > > > > Again, to fix them, you need to give us definitions for what you expect those > > events to count. Otherwise we cannot make forward progress. > > No, we do not 'need' to give exact definitions. This whole topic is more > analogous to physics than to mathematics. 
See my description above about how > ratios and high level structure matters more than absolute values and > definitions. > > Yes, if we can then 'loads' and 'stores' should correspond to the number of > loads a program flow does, which you get if you look at the assembly code. > 'Instructions' should correspond to the number of instructions executed. > > If the CPU cannot do it it's not a huge deal in practice - we will cope and > hopefully it will all be fixed in future CPU versions. > > That having said, most CPUs i have access to get the fundamentals right, so > it's not like we have huge problems in practice. Key CPU statistics are > available. > > > Let me give just one simple example: cycles > > > > What your definition for the generic cycle event? > > > > There are various flavors: > > - count halted, unhalted cycles? > > Again i think you are getting lost in too much detail. > > For typical developers halted versus unhalted is mostly an uninteresting > distinction, as people tend to just type 'perf record ./myapp', which is per > workload profiling so it excludes idle time. So it would give the same result > to them regardless of whether it's halted or unhalted cycles. > > ( This simple example already shows the idiocy of the hardware names, calling > cycles events "CPU_CLK_UNHALTED.REF". In most cases the developer does *not* > care about those distinctions so the defaults should not be complicated with > them. ) > > > - impacted by frequency scaling? > > The best default for developers is a frequency scaling invariant result - i.e. > one that is not against a reference clock but against the real CPU clock. > > ( Even that one will not be completely invariant due to the frequency-scaling > dependent cost of misses and bus ops, etc. ) > > But profiling against a reference frequency makes sense as well, especially for > system-wide profiling - this is the hardware equivalent of the cpu-clock / > elapsed time metric. 
We could implement the cpu-clock using reference cycles > events for example. > > > LLC-misses: > > - what considered the LLC? > > The last level cache is whichever cache sits before DRAM. > > > - does it include code, data or both? > > Both if possible as they tend to be unified caches anyway. > > > - does it include demand, hw prefetch? > > Do you mean for the LLC-prefetch events? What would be your suggestion, which > is the most useful metric? Prefetches are not directly done by program logic so > this is borderline. We wanted to include them for completeness - and the metric > should probably include 'all activities that program flow has not caused > directly and which may be sucking up system resources' - i.e. including hw > prefetch. > > > - it is to local or remote dram? > > The current definitions should include both. > > Measuring remote DRAM accesss is of course useful - that is the original point > of this thread. It should be done as an additional layer, basically local RAM > is yet another cache level - but we can take other generalized approach as > well, if they make more sense. > > Thanks, > > Ingo ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2 2011-04-22 20:31 ` Stephane Eranian @ 2011-04-22 20:47 ` Ingo Molnar 2011-04-23 12:13 ` Stephane Eranian 2011-04-22 21:03 ` Ingo Molnar 1 sibling, 1 reply; 80+ messages in thread From: Ingo Molnar @ 2011-04-22 20:47 UTC (permalink / raw) To: Stephane Eranian Cc: Arnaldo Carvalho de Melo, linux-kernel, Andi Kleen, Peter Zijlstra, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner, Peter Zijlstra, eranian, Arun Sharma, Linus Torvalds, Andrew Morton * Stephane Eranian <eranian@google.com> wrote: > On Fri, Apr 22, 2011 at 3:18 PM, Ingo Molnar <mingo@elte.hu> wrote: > > > > * Stephane Eranian <eranian@google.com> wrote: > > > > > > Say i'm a developer and i have an app with such code: > > > > > > > > #define THOUSAND 1000 > > > > > > > > static char array[THOUSAND][THOUSAND]; > > > > > > > > int init_array(void) > > > > { > > > > int i, j; > > > > > > > > for (i = 0; i < THOUSAND; i++) { > > > > for (j = 0; j < THOUSAND; j++) { > > > > array[j][i]++; > > > > } > > > > } > > > > > > > > return 0; > > > > } > > > > > > > > Pretty common stuff, right? > > > > > > > > Using the generalized cache events i can run: > > > > > > > > $ perf stat --repeat 10 -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u ./array > > > > > > > > Performance counter stats for './array' (10 runs): > > > > > > > > 6,719,130 cycles:u ( +- 0.662% ) > > > > 5,084,792 instructions:u # 0.757 IPC ( +- 0.000% ) > > > > 1,037,032 l1-dcache-loads:u ( +- 0.009% ) > > > > 1,003,604 l1-dcache-load-misses:u ( +- 0.003% ) > > > > > > > > 0.003802098 seconds time elapsed ( +- 13.395% ) > > > > > > > > I consider that this is 'bad', because for almost every dcache-load there's a > > > > dcache-miss - a 99% L1 cache miss rate! 
> > > > Then i think a bit, notice something, apply this performance optimization:
> > >
> > > I don't think this example is really representative of the kind of problems
> > > people face, it is just too small and obvious. [...]
> >
> > Well, the overwhelming majority of performance problems are 'small and obvious'
>
> Problems are not simple. Most serious applications these days are huge,
> hundreds of MB of text, if not GB.
>
> In your artificial example, you knew the answer before you started the
> measurement.
>
> Most of the time, applications are assembled out of hundreds of libraries, so
> no single developers knows all the code. Thus, the performance analyst is
> faced with a black box most of the time.

I isolated out an example and assumed that you'd agree that identifying hot spots is trivial with generic cache events.

My assumption was wrong so let me show you how trivial it really is.

Here's an example with *two* problematic functions (but it could have hundreds, it does not matter):

-------------------------------->
#define THOUSAND 1000

static char array1[THOUSAND][THOUSAND];

static char array2[THOUSAND][THOUSAND];

void func1(void)
{
	int i, j;

	for (i = 0; i < THOUSAND; i++)
		for (j = 0; j < THOUSAND; j++)
			array1[i][j]++;
}

void func2(void)
{
	int i, j;

	for (i = 0; i < THOUSAND; i++)
		for (j = 0; j < THOUSAND; j++)
			array2[j][i]++;
}

int main(void)
{
	for (;;) {
		func1();
		func2();
	}

	return 0;
}
<--------------------------------

We do not know which one has the cache-misses problem, func1() or func2(), it's all a black box, right?
Using generic cache events you simply type this:

  $ perf top -e l1-dcache-load-misses -e l1-dcache-loads

And you get such output:

   PerfTop:    1923 irqs/sec  kernel: 0.0%  exact:  0.0% [l1-dcache-load-misses:u/l1-dcache-loads:u],  (all, 16 CPUs)
-------------------------------------------------------------------------------------------------------

   weight    samples  pcnt funct DSO
   ______    _______ _____ _____ ______________________

      1.9       6184 98.8% func2 /home/mingo/opt/array2
      0.0         69  1.1% func1 /home/mingo/opt/array2

It has pinpointed the problem in func2 *very* precisely.

Obviously this can be used to analyze larger apps as well, with thousands of functions, to pinpoint cachemiss problems in specific functions.

Thanks,

	Ingo

^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2 2011-04-22 20:47 ` Ingo Molnar @ 2011-04-23 12:13 ` Stephane Eranian 2011-04-23 12:49 ` Ingo Molnar 0 siblings, 1 reply; 80+ messages in thread From: Stephane Eranian @ 2011-04-23 12:13 UTC (permalink / raw) To: Ingo Molnar Cc: Arnaldo Carvalho de Melo, linux-kernel, Andi Kleen, Peter Zijlstra, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner, Peter Zijlstra, eranian, Arun Sharma, Linus Torvalds, Andrew Morton On Fri, Apr 22, 2011 at 10:47 PM, Ingo Molnar <mingo@elte.hu> wrote: > > * Stephane Eranian <eranian@google.com> wrote: > >> On Fri, Apr 22, 2011 at 3:18 PM, Ingo Molnar <mingo@elte.hu> wrote: >> > >> > * Stephane Eranian <eranian@google.com> wrote: >> > >> > > > Say i'm a developer and i have an app with such code: >> > > > >> > > > #define THOUSAND 1000 >> > > > >> > > > static char array[THOUSAND][THOUSAND]; >> > > > >> > > > int init_array(void) >> > > > { >> > > > int i, j; >> > > > >> > > > for (i = 0; i < THOUSAND; i++) { >> > > > for (j = 0; j < THOUSAND; j++) { >> > > > array[j][i]++; >> > > > } >> > > > } >> > > > >> > > > return 0; >> > > > } >> > > > >> > > > Pretty common stuff, right? >> > > > >> > > > Using the generalized cache events i can run: >> > > > >> > > > $ perf stat --repeat 10 -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u ./array >> > > > >> > > > Performance counter stats for './array' (10 runs): >> > > > >> > > > 6,719,130 cycles:u ( +- 0.662% ) >> > > > 5,084,792 instructions:u # 0.757 IPC ( +- 0.000% ) >> > > > 1,037,032 l1-dcache-loads:u ( +- 0.009% ) >> > > > 1,003,604 l1-dcache-load-misses:u ( +- 0.003% ) >> > > > >> > > > 0.003802098 seconds time elapsed ( +- 13.395% ) >> > > > >> > > > I consider that this is 'bad', because for almost every dcache-load there's a >> > > > dcache-miss - a 99% L1 cache miss rate! 
>> > > > >> > > > Then i think a bit, notice something, apply this performance optimization: >> > > >> > > I don't think this example is really representative of the kind of problems >> > > people face, it is just too small and obvious. [...] >> > >> > Well, the overwhelming majority of performance problems are 'small and obvious' >> >> Problems are not simple. Most serious applications these days are huge, >> hundreds of MB of text, if not GB. >> >> In your artificial example, you knew the answer before you started the >> measurement. >> >> Most of the time, applications are assembled out of hundreds of libraries, so >> no single developers knows all the code. Thus, the performance analyst is >> faced with a black box most of the time. > > I isolated out an example and assumed that you'd agree that identifying hot > spots is trivial with generic cache events. > > My assumption was wrong so let me show you how trivial it really is. > > Here's an example with *two* problematic functions (but it could have hundreds, > it does not matter): > > --------------------------------> > #define THOUSAND 1000 > > static char array1[THOUSAND][THOUSAND]; > > static char array2[THOUSAND][THOUSAND]; > > void func1(void) > { > int i, j; > > for (i = 0; i < THOUSAND; i++) > for (j = 0; j < THOUSAND; j++) > array1[i][j]++; > } > > void func2(void) > { > int i, j; > > for (i = 0; i < THOUSAND; i++) > for (j = 0; j < THOUSAND; j++) > array2[j][i]++; > } > > int main(void) > { > for (;;) { > func1(); > func2(); > } > > return 0; > } > <-------------------------------- > > We do not know which one has the cache-misses problem, func1() or func2(), it's > all a black box, right? 
> Using generic cache events you simply type this:
>
>  $ perf top -e l1-dcache-load-misses -e l1-dcache-loads
>
> And you get such output:
>
>    PerfTop:    1923 irqs/sec  kernel: 0.0%  exact:  0.0% [l1-dcache-load-misses:u/l1-dcache-loads:u],  (all, 16 CPUs)
> -------------------------------------------------------------------------------------------------------
>
>   weight    samples  pcnt funct DSO
>   ______    _______ _____ _____ ______________________
>
>      1.9       6184 98.8% func2 /home/mingo/opt/array2
>      0.0         69  1.1% func1 /home/mingo/opt/array2
>
> It has pinpointed the problem in func2 *very* precisely.
>
> Obviously this can be used to analyze larger apps as well, with thousands of
> functions, to pinpoint cachemiss problems in specific functions.

No, it does not. As I said before, your example is just too trivial to be representative. You keep thinking that what you see in the profile pinpoints exactly the instruction or even the function where the problem always occurs. This is not always the case. There is skid, and it can be very big; the IP you get may not even be in the same function where the load was issued. You cannot generalize based on this example.

^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2 2011-04-23 12:13 ` Stephane Eranian @ 2011-04-23 12:49 ` Ingo Molnar 0 siblings, 0 replies; 80+ messages in thread From: Ingo Molnar @ 2011-04-23 12:49 UTC (permalink / raw) To: Stephane Eranian Cc: Arnaldo Carvalho de Melo, linux-kernel, Andi Kleen, Peter Zijlstra, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner, Peter Zijlstra, eranian, Arun Sharma, Linus Torvalds, Andrew Morton * Stephane Eranian <eranian@google.com> wrote: > On Fri, Apr 22, 2011 at 10:47 PM, Ingo Molnar <mingo@elte.hu> wrote: > > > > * Stephane Eranian <eranian@google.com> wrote: > > > >> On Fri, Apr 22, 2011 at 3:18 PM, Ingo Molnar <mingo@elte.hu> wrote: > >> > > >> > * Stephane Eranian <eranian@google.com> wrote: > >> > > >> > > > Say i'm a developer and i have an app with such code: > >> > > > > >> > > > #define THOUSAND 1000 > >> > > > > >> > > > static char array[THOUSAND][THOUSAND]; > >> > > > > >> > > > int init_array(void) > >> > > > { > >> > > > int i, j; > >> > > > > >> > > > for (i = 0; i < THOUSAND; i++) { > >> > > > for (j = 0; j < THOUSAND; j++) { > >> > > > array[j][i]++; > >> > > > } > >> > > > } > >> > > > > >> > > > return 0; > >> > > > } > >> > > > > >> > > > Pretty common stuff, right? 
> >> > > > > >> > > > Using the generalized cache events i can run: > >> > > > > >> > > > $ perf stat --repeat 10 -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u ./array > >> > > > > >> > > > Performance counter stats for './array' (10 runs): > >> > > > > >> > > > 6,719,130 cycles:u ( +- 0.662% ) > >> > > > 5,084,792 instructions:u # 0.757 IPC ( +- 0.000% ) > >> > > > 1,037,032 l1-dcache-loads:u ( +- 0.009% ) > >> > > > 1,003,604 l1-dcache-load-misses:u ( +- 0.003% ) > >> > > > > >> > > > 0.003802098 seconds time elapsed ( +- 13.395% ) > >> > > > > >> > > > I consider that this is 'bad', because for almost every dcache-load there's a > >> > > > dcache-miss - a 99% L1 cache miss rate! > >> > > > > >> > > > Then i think a bit, notice something, apply this performance optimization: > >> > > > >> > > I don't think this example is really representative of the kind of problems > >> > > people face, it is just too small and obvious. [...] > >> > > >> > Well, the overwhelming majority of performance problems are 'small and obvious' > >> > >> Problems are not simple. Most serious applications these days are huge, > >> hundreds of MB of text, if not GB. > >> > >> In your artificial example, you knew the answer before you started the > >> measurement. > >> > >> Most of the time, applications are assembled out of hundreds of libraries, so > >> no single developers knows all the code. Thus, the performance analyst is > >> faced with a black box most of the time. > > > > I isolated out an example and assumed that you'd agree that identifying hot > > spots is trivial with generic cache events. > > > > My assumption was wrong so let me show you how trivial it really is. 
> > > > Here's an example with *two* problematic functions (but it could have hundreds, > > it does not matter): > > > > --------------------------------> > > #define THOUSAND 1000 > > > > static char array1[THOUSAND][THOUSAND]; > > > > static char array2[THOUSAND][THOUSAND]; > > > > void func1(void) > > { > > int i, j; > > > > for (i = 0; i < THOUSAND; i++) > > for (j = 0; j < THOUSAND; j++) > > array1[i][j]++; > > } > > > > void func2(void) > > { > > int i, j; > > > > for (i = 0; i < THOUSAND; i++) > > for (j = 0; j < THOUSAND; j++) > > array2[j][i]++; > > } > > > > int main(void) > > { > > for (;;) { > > func1(); > > func2(); > > } > > > > return 0; > > } > > <-------------------------------- > > > > We do not know which one has the cache-misses problem, func1() or func2(), it's > > all a black box, right? > > > > Using generic cache events you simply type this: > > > > $ perf top -e l1-dcache-load-misses -e l1-dcache-loads > > > > And you get such output: > > > > PerfTop: 1923 irqs/sec kernel: 0.0% exact: 0.0% [l1-dcache-load-misses:u/l1-dcache-loads:u], (all, 16 CPUs) > > ------------------------------------------------------------------------------------------------------- > > > > weight samples pcnt funct DSO > > ______ _______ _____ _____ ______________________ > > > > 1.9 6184 98.8% func2 /home/mingo/opt/array2 > > 0.0 69 1.1% func1 /home/mingo/opt/array2 > > > > It has pinpointed the problem in func2 *very* precisely. > > > > Obviously this can be used to analyze larger apps as well, with thousands > > of functions, to pinpoint cachemiss problems in specific functions. > > No, it does not. The thing is, you will need to come up with more convincing and concrete arguments than a blanket, unsupported "No, it does not" claim. I *just showed* you an example which you claimed just two mails ago is impossible to analyze. 
I showed an example with two functions and claimed that the same thing works with 3 or more functions as well: perf top will happily display the ones with the highest cachemiss ratio, regardless of how many there are.

> As I said before, your example is just to trivial to be representative. You
> keep thinking that what you see in the profile pinpoints exactly the
> instruction or even the function where the problem always occurs. This is not
> always the case. There is skid, and it can be very big, the IP you get may
> not even be in the same function where the load was issued.

So now you claim a narrow special case (most of the hot-spot overhead skidding out of a function) as a counter-proof?

Sometimes skid causes problems - in practice it rarely does, and i do a lot of profiling.

Also, i'd expect PEBS to be extended in the future to more and more events - including cachemiss events. That will solve this kind of skidding in a pretty natural way.

Also, let's analyze your narrow special case: if a function is indeed "invisible" to profiling because most overhead skids out of it then there's little you can do with raw events to begin with ...

You really need to specifically demonstrate how raw events help your example.

Thanks,

	Ingo

^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2 2011-04-22 20:31 ` Stephane Eranian 2011-04-22 20:47 ` Ingo Molnar @ 2011-04-22 21:03 ` Ingo Molnar 2011-04-23 12:27 ` Stephane Eranian 1 sibling, 1 reply; 80+ messages in thread From: Ingo Molnar @ 2011-04-22 21:03 UTC (permalink / raw) To: Stephane Eranian Cc: Arnaldo Carvalho de Melo, linux-kernel, Andi Kleen, Peter Zijlstra, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner, Peter Zijlstra, eranian, Arun Sharma, Linus Torvalds, Andrew Morton

* Stephane Eranian <eranian@google.com> wrote:

> Let's go back to your example.
> Performance counter stats for './array' (10 runs):
>
>         6,719,130 cycles:u                   ( +- 0.662% )
>         5,084,792 instructions:u           # 0.757 IPC     ( +- 0.000% )
>         1,037,032 l1-dcache-loads:u          ( +- 0.009% )
>         1,003,604 l1-dcache-load-misses:u    ( +- 0.003% )
>
> Looking at this I don't think you can pinpoint which function has a problem
> [...]

In my previous mail i showed how to pinpoint specific functions.

You bring up an interesting question, cost/benefit analysis:

> [...] and whether or not there is a real problem. You need to evaluate the
> penalty. Once you know that you can estimate any potential gain from fixing
> the code. Arun pointed that out rightfully in his answer. How do you know the
> penalty if you don't decompose some more?

We can measure that even with today's tooling - which doesn't do cost/benefit analysis out of the box.
In my previous example i showed the cachemisses profile:

  weight    samples  pcnt funct DSO
  ______    _______ _____ _____ ______________________

     1.9       6184 98.8% func2 /home/mingo/opt/array2
     0.0         69  1.1% func1 /home/mingo/opt/array2

and here's the cycles profile:

  samples  pcnt funct DSO
  _______ _____ _____ ______________________

  2555.00 67.4% func2 /home/mingo/opt/array2
  1220.00 32.2% func1 /home/mingo/opt/array2

So, given that there were no other big miss sources:

  $ perf stat -a -e branch-misses:u -e l1-dcache-load-misses:u -e l1-dcache-store-misses:u -e l1-icache-load-misses:u sleep 1

  Performance counter stats for 'sleep 1':

          70,674 branch-misses:u
     347,992,027 l1-dcache-load-misses:u
           1,779 l1-dcache-store-misses:u
           8,007 l1-icache-load-misses:u

     1.000982021 seconds time elapsed

I can tell you that by fixing the cache-misses in that function, the code will be roughly 33% faster.

So i fixed the bug, and before it 100 iterations of func1+func2 took 300 msecs:

  $ perf stat -e cpu-clock --repeat 10 ./array2

  Performance counter stats for './array2' (10 runs):

     298.405074 cpu-clock ( +- 1.823% )

After the fix it took 190 msecs:

  $ perf stat -e cpu-clock --repeat 10 ./array2

  Performance counter stats for './array2' (10 runs):

     189.409569 cpu-clock ( +- 0.019% )

     0.190007596 seconds time elapsed ( +- 0.025% )

Which is 63% of the original speed - 37% faster. And no, i first did the calculation, then did the measurement of the optimized code.

Now it would be nice to automate such analysis some more within perf - but i think i have established the principle well enough that we can use generic cache events for such measurements.

Also, we could certainly add more generic events - a stalled-cycles event would certainly be useful for example, to collect all (or at least most) 'harmful delays' the execution flow can experience. Want to take a stab at that patch?

Thanks,

	Ingo

^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2 2011-04-22 21:03 ` Ingo Molnar @ 2011-04-23 12:27 ` Stephane Eranian 0 siblings, 0 replies; 80+ messages in thread From: Stephane Eranian @ 2011-04-23 12:27 UTC (permalink / raw) To: Ingo Molnar Cc: Arnaldo Carvalho de Melo, linux-kernel, Andi Kleen, Peter Zijlstra, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner, Peter Zijlstra, eranian, Arun Sharma, Linus Torvalds, Andrew Morton On Fri, Apr 22, 2011 at 11:03 PM, Ingo Molnar <mingo@elte.hu> wrote: > > * Stephane Eranian <eranian@google.com> wrote: > >> Let's go back to your example. >> Performance counter stats for './array' (10 runs): >> >> 6,719,130 cycles:u ( +- 0.662% ) >> 5,084,792 instructions:u # 0.757 IPC ( +- 0.000% ) >> 1,037,032 l1-dcache-loads:u ( +- 0.009% ) >> 1,003,604 l1-dcache-load-misses:u ( +- 0.003% ) >> >> Looking at this I don't think you can pinpoint which function has a problem >> [...] > > In my previous mail i showed how to pinpoint specific functions. You bring up > an interesting question, cost/benefit analysis: > >> [...] and whether or not there is a real problem. You need to evaluate the >> penalty. Once you know that you can estimate any potential gain from fixing >> the code. Arun pointed that out rightfully in his answer. How do you know the >> penalty if you don't decompose some more? > > We can measure that even with today's tooling - which doesnt do cost/benefit > analysis out of box. 
> In my previous example i showed the cachemisses profile:
>
>   weight    samples  pcnt funct DSO
>   ______    _______ _____ _____ ______________________
>
>      1.9       6184 98.8% func2 /home/mingo/opt/array2
>      0.0         69  1.1% func1 /home/mingo/opt/array2
>
> and here's the cycles profile:
>
>   samples  pcnt funct DSO
>   _______ _____ _____ ______________________
>
>   2555.00 67.4% func2 /home/mingo/opt/array2
>   1220.00 32.2% func1 /home/mingo/opt/array2
>
> So, given that there was no other big miss sources:
>
>  $ perf stat -a -e branch-misses:u -e l1-dcache-load-misses:u -e l1-dcache-store-misses:u -e l1-icache-load-misses:u sleep 1
>
>  Performance counter stats for 'sleep 1':
>
>          70,674 branch-misses:u
>     347,992,027 l1-dcache-load-misses:u
>           1,779 l1-dcache-store-misses:u
>           8,007 l1-icache-load-misses:u
>
>     1.000982021 seconds time elapsed
>
> I can tell you that by fixing the cache-misses in that function, the code will
> be roughly 33% faster.

33% based on what? l1d-load-misses? The fact that in the same program you have a problematic function, func2(), and its fixed counterpart func1(), plus you know both do the same thing? How often do you think this happens in real life?

Now, imagine you don't have func1(). Tell me how much of an impact (cycles) you think func2() is having on the overall execution of a program, especially if it is far more complex than your toy example above?

Your arguments would carry more weight if you were to derive them from real life applications.

> So i fixed the bug, and before it 100 iterations of func1+func2 took 300 msecs:
>
>  $ perf stat -e cpu-clock --repeat 10 ./array2
>
>  Performance counter stats for './array2' (10 runs):
>
>     298.405074 cpu-clock ( +- 1.823% )
>
> After the fix it took 190 msecs:
>
>  $ perf stat -e cpu-clock --repeat 10 ./array2
>
>  Performance counter stats for './array2' (10 runs):
>
>     189.409569 cpu-clock ( +- 0.019% )
>
>     0.190007596 seconds time elapsed ( +- 0.025% )
>
> Which is 63% of the original speed - 37% faster.
And no, i first did the > calculation, then did the measurement of the optimized code. > > Now it would be nice to automate such analysis some more within perf - but i > think i have established the principle well enough that we can use generic > cache events for such measurements. > > Also, we could certainly add more generic events - a stalled-cycles event would > certainly be useful for example, to collect all (or at least most) 'harmful > delays' the execution flow can experience. Want to take a stab at that patch? > > Thanks, > > Ingo > ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2 2011-04-22 12:04 ` Stephane Eranian 2011-04-22 13:18 ` Ingo Molnar @ 2011-04-22 16:51 ` Andi Kleen 2011-04-22 19:57 ` Ingo Molnar 2011-04-26 9:25 ` Peter Zijlstra 1 sibling, 2 replies; 80+ messages in thread From: Andi Kleen @ 2011-04-22 16:51 UTC (permalink / raw) To: Stephane Eranian Cc: Ingo Molnar, Arnaldo Carvalho de Melo, linux-kernel, Peter Zijlstra, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner, Peter Zijlstra, eranian, Arun Sharma, Linus Torvalds, Andrew Morton

> Once you have clear and precise definition, then we can look at the actual
> events and figure out a mapping.

It's unclear this can even be done. Linux runs on a wide variety of micro architectures, with all kinds of cache architectures.

Micro architectures are so different. I suspect a "generic" definition would need to be so vague as to be useless.

This in general seems to be the problem of the current cache events.

Overall for any interesting analysis you need to go CPU specific. Abstracted performance analysis is a contradiction in terms.

-Andi

-- ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2 2011-04-22 16:51 ` Andi Kleen @ 2011-04-22 19:57 ` Ingo Molnar 2011-04-26 9:25 ` Peter Zijlstra 1 sibling, 0 replies; 80+ messages in thread From: Ingo Molnar @ 2011-04-22 19:57 UTC (permalink / raw) To: Andi Kleen Cc: Stephane Eranian, Arnaldo Carvalho de Melo, linux-kernel, Peter Zijlstra, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner, Peter Zijlstra, eranian, Arun Sharma, Linus Torvalds, Andrew Morton * Andi Kleen <ak@linux.intel.com> wrote: > > Once you have clear and precise definition, then we can look at the actual > > events and figure out a mapping. > > It's unclear this can be even done. Linux runs on a wide variety of micro > architectures, with all kinds of cache architectures. > > Micro architectures are so different. I suspect a "generic" definition would > need to be so vague as to be useless. Not really. I gave a very specific example which solved a common and real problem, using L1-loads and L1-load-misses events. > This in general seems to be the problem of the current cache events. > > Overall for any interesting analysis you need to go CPU specific. Abstracted > performance analysis is a contradiction in terms. Nothing of what i did in that example was CPU or microarchitecture specific. Really, you are making this more complex than it really is. Just check the cache profiling example i gave, it works just fine today. Thanks, Ingo ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2 2011-04-22 16:51 ` Andi Kleen 2011-04-22 19:57 ` Ingo Molnar @ 2011-04-26 9:25 ` Peter Zijlstra 1 sibling, 0 replies; 80+ messages in thread From: Peter Zijlstra @ 2011-04-26 9:25 UTC (permalink / raw) To: Andi Kleen Cc: Stephane Eranian, Ingo Molnar, Arnaldo Carvalho de Melo, linux-kernel, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner, eranian, Arun Sharma, Linus Torvalds, Andrew Morton On Fri, 2011-04-22 at 09:51 -0700, Andi Kleen wrote: > > Micro architectures are so different. I suspect a "generic" definition would > need to be so vague as to be useless. > > This in general seems to be the problem of the current cache events. > > Overall for any interesting analysis you need to go CPU specific. > Abstracted performance analysis is a contradiction in terms. It might help if you'd talk to your own research department before making statements like that; they make you look silly. Intel research has shown - as a side effect of applying machine-learning principles to provide machine-aided optimizing (i.e. clippy-style guides for vtune) - that you don't actually need exact event definitions. They create simple micro-kernels (not our kind of kernels, but more like the excellent example Arun provided) that trigger a pathological case and a perfect counter-case, run them over _all_ possible events, and do correlation analysis. The explicit example given was branch misses on an Atom, and they found (to nobody's great surprise) BR_INST_RETIRED.MISPRED to be the best correlating event. But that's not the important part. The important part is that all it needs is a strong correlation, and it could even be a combination of events; that would just make the analysis a bit more complex. 
Anyway, given a sufficient large set of these pathological cases, you can train a neural net for your target hardware and then reverse the situation, run it over an unknown program and have it create suggestions -> yay clippy! So given a set of pathological cases and hardware with decent PMU coverage you can train this thing and get useful results. Exact event definitions be damned -- it doesn't care. http://sites.google.com/site/fhpm2010/program/baugh_fhpm2010.pptx?attredirects=0 ^ permalink raw reply [flat|nested] 80+ messages in thread
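[Editorial note: the correlation screen described above can be sketched in a few lines. The event names and counts below are synthetic stand-ins, purely to illustrate the method - the real work ran micro-kernels over all events of an actual PMU:]

```python
# Synthetic per-run event counts for a pathological micro-kernel (label 1)
# and its well-behaved counter-case (label 0). Event names are made up.
runs = [
    (1, {"EV_A": 9800, "EV_B": 120, "EV_C": 5000}),
    (1, {"EV_A": 9700, "EV_B": 140, "EV_C": 4100}),
    (1, {"EV_A": 9900, "EV_B": 100, "EV_C": 6100}),
    (0, {"EV_A": 200,  "EV_B": 130, "EV_C": 5200}),
    (0, {"EV_A": 150,  "EV_B": 110, "EV_C": 4800}),
    (0, {"EV_A": 180,  "EV_B": 150, "EV_C": 5600}),
]

def pearson(xs, ys):
    """Plain Pearson correlation coefficient, no external libraries."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

labels = [label for label, _ in runs]
events = runs[0][1].keys()

# Rank events by |correlation| with the known pathological/clean labelling.
scores = {ev: abs(pearson(labels, [counts[ev] for _, counts in runs]))
          for ev in events}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

Here "EV_A" tracks the pathology almost perfectly while the others are noise, so the screen picks it out - which is the whole point: no exact event definition was needed, only a strong correlation.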
* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2 2011-04-22 10:52 ` [generalized cache events] " Ingo Molnar 2011-04-22 12:04 ` Stephane Eranian @ 2011-04-22 16:50 ` arun 2011-04-22 17:00 ` Andi Kleen 2011-04-22 20:30 ` Ingo Molnar 1 sibling, 2 replies; 80+ messages in thread From: arun @ 2011-04-22 16:50 UTC (permalink / raw) To: Ingo Molnar Cc: Stephane Eranian, Arnaldo Carvalho de Melo, linux-kernel, Andi Kleen, Peter Zijlstra, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner, Peter Zijlstra, eranian, Arun Sharma, Linus Torvalds, Andrew Morton On Fri, Apr 22, 2011 at 12:52:11PM +0200, Ingo Molnar wrote: > > Using the generalized cache events i can run: > > $ perf stat --repeat 10 -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u ./array > > Performance counter stats for './array' (10 runs): > > 6,719,130 cycles:u ( +- 0.662% ) > 5,084,792 instructions:u # 0.757 IPC ( +- 0.000% ) > 1,037,032 l1-dcache-loads:u ( +- 0.009% ) > 1,003,604 l1-dcache-load-misses:u ( +- 0.003% ) > > 0.003802098 seconds time elapsed ( +- 13.395% ) > > I consider that this is 'bad', because for almost every dcache-load there's a > dcache-miss - a 99% L1 cache miss rate! One could argue that all you need is cycles and instructions. If there is an expensive load, you'll see that the load instruction takes many cycles and you can infer that it's a cache miss. Questions app developers typically ask me: * If I fix all my top 5 L3 misses how much faster will my app go? * Am I bottlenecked on memory bandwidth? * I have 4 L3 misses every 1000 instructions and 15 branch mispredicts per 1000 instructions. Which one should I focus on? It's hard to answer some of these without access to all events. While your approach of having generic events for commonly used counters might be useful for some use cases, I don't see why exposing all vendor defined events is harmful. 
A clear statement on the last point would be helpful. -Arun ^ permalink raw reply [flat|nested] 80+ messages in thread
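[Editorial note: for the third of the questions above, one crude first cut - before doing any profiling at all - is to weight each event rate by an assumed average penalty. The penalty figures below are illustrative placeholders, not measured values for any particular CPU:]

```python
# Per-1000-instruction event rates, from the question above.
l3_misses_per_ki = 4
mispredicts_per_ki = 15

# Assumed average penalties in cycles -- placeholders; machine-specific
# in reality and best measured, not guessed.
L3_MISS_PENALTY = 200      # assumed memory-latency-ish cost
MISPREDICT_PENALTY = 20    # assumed pipeline-flush cost

l3_cost = l3_misses_per_ki * L3_MISS_PENALTY            # cycles / 1000 insns
branch_cost = mispredicts_per_ki * MISPREDICT_PENALTY   # cycles / 1000 insns

focus = "L3 misses" if l3_cost > branch_cost else "branch mispredicts"
print(f"cost per 1000 insns: L3={l3_cost}, branches={branch_cost}; focus on {focus}")
```

Since an out-of-order core overlaps much of this latency with useful work, these products are only upper bounds - which is why the thread's later stalled-cycles discussion argues for measuring actual stall attribution instead.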
* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2 2011-04-22 16:50 ` arun @ 2011-04-22 17:00 ` Andi Kleen 2011-04-22 20:30 ` Ingo Molnar 1 sibling, 0 replies; 80+ messages in thread From: Andi Kleen @ 2011-04-22 17:00 UTC (permalink / raw) To: arun Cc: Ingo Molnar, Stephane Eranian, Arnaldo Carvalho de Melo, linux-kernel, Peter Zijlstra, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner, Peter Zijlstra, eranian, Arun Sharma, Linus Torvalds, Andrew Morton > One could argue that all you need is cycles and instructions. If there is an > expensive load, you'll see that the load instruction takes many cycles and > you can infer that it's a cache miss. That only works when you can actually recognize which of the last load instructions caused the problem. Often, skid on out-of-order CPUs makes it very hard to identify which actual load caused the long stall (or whether they all stalled). There's a way around it, of course, using advanced events - but not with cycles. > I don't see why exposing all vendor defined events is harmful Simple: without vendor events you cannot answer a lot of questions. -Andi -- ak@linux.intel.com -- Speaking for myself only ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2 2011-04-22 16:50 ` arun 2011-04-22 17:00 ` Andi Kleen @ 2011-04-22 20:30 ` Ingo Molnar 2011-04-22 20:32 ` Ingo Molnar 2011-04-23 20:14 ` [PATCH] perf events: Add stalled cycles generic event - PERF_COUNT_HW_STALLED_CYCLES Ingo Molnar 1 sibling, 2 replies; 80+ messages in thread From: Ingo Molnar @ 2011-04-22 20:30 UTC (permalink / raw) To: arun Cc: Stephane Eranian, Arnaldo Carvalho de Melo, linux-kernel, Andi Kleen, Peter Zijlstra, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner, Peter Zijlstra, eranian, Arun Sharma, Linus Torvalds, Andrew Morton * arun@sharma-home.net <arun@sharma-home.net> wrote: > On Fri, Apr 22, 2011 at 12:52:11PM +0200, Ingo Molnar wrote: > > > > Using the generalized cache events i can run: > > > > $ perf stat --repeat 10 -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u ./array > > > > Performance counter stats for './array' (10 runs): > > > > 6,719,130 cycles:u ( +- 0.662% ) > > 5,084,792 instructions:u # 0.757 IPC ( +- 0.000% ) > > 1,037,032 l1-dcache-loads:u ( +- 0.009% ) > > 1,003,604 l1-dcache-load-misses:u ( +- 0.003% ) > > > > 0.003802098 seconds time elapsed ( +- 13.395% ) > > > > I consider that this is 'bad', because for almost every dcache-load there's a > > dcache-miss - a 99% L1 cache miss rate! > > One could argue that all you need is cycles and instructions. [...] Yes, and note that with instructions events we even have skid-less PEBS profiling so seeing the precise . > [...] If there is an expensive load, you'll see that the load instruction > takes many cycles and you can infer that it's a cache miss. > > Questions app developers typically ask me: > > * If I fix all my top 5 L3 misses how much faster will my app go? This has come up: we could add a 'stalled/idle-cycles' generic event - i.e. cycles spent without performing useful work in the pipelines. 
(Resource-stall events on Intel CPUs.) Then you would profile L3 misses (there's a generic event for that), plus stalls, and the answer to your question would be the percentage of hits you get in the stalled-cycles profile, multiplied by the stalled-cycles/cycles ratio. > * Am I bottlenecked on memory bandwidth? This would be a variant of the measurement above: say the top 90% of L3 misses profile-correlated with stalled-cycles, relative to total-cycles. If you get '90% of L3 misses cause a 1% wall-time slowdown' then you are not memory bottlenecked. If the answer is '35% slowdown' then you are memory bottlenecked. > * I have 4 L3 misses every 1000 instructions and 15 branch mispredicts per > 1000 instructions. Which one should I focus on? AFAICS this would be another variant of stalled-cycles measurements: you create a stalled-cycles profile and check whether the top hits are branches or memory loads. > It's hard to answer some of these without access to all events. I'm curious, how would you measure these properties - do you have some different events in mind? > While your approach of having generic events for commonly used counters might > be useful for some use cases, I don't see why exposing all vendor defined > events is harmful. > > A clear statement on the last point would be helpful. Well, the thing is, i think users are helped most if we add useful, highlevel PMU features added and not just an opaque raw event pass-through engine. The problem with lowlevel raw ABIs is that the tool space fragments into a zillion small hacks and there's no good concentration of know-how. I'd like the art of performance measurement to be generalized out, as well as it can be. We had this discussion in the big perf-counters flamewars 2+ years ago, where one side wanted raw events, while we wanted intelligent kernel-side abstractions and generalizations. 
I think the abstraction and generalization angle worked out very well in practice - but we are having this discussion again and again :-) As i stated it in my prior mails, i'm not against raw events as a rare exception channel - that increases utility. I'm against what was attempted here: an extension to raw events as the *primary* channel for DRAM measurement features. That is just sloppy and *reduces* utility. I'm very simple-minded: when i see reduced utility i become sad :) Thanks, Ingo ^ permalink raw reply [flat|nested] 80+ messages in thread
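[Editorial note: the estimate sketched earlier in this message - "the percentage of hits you get in the stalled-cycles profile, multiplied by the stalled-cycles/cycles ratio" - is simple enough to write down. All numbers here are invented, to show the shape of the calculation only:]

```python
# Fraction of stalled-cycles profile samples landing on the top-5 L3-miss
# sites (hypothetical, standing in for real perf report output).
stall_hits_on_l3_sites = 0.60

# Stalled cycles vs. total cycles for the whole run (also hypothetical).
stalled_cycles = 3_000_000_000
total_cycles = 10_000_000_000
stall_ratio = stalled_cycles / total_cycles

# Upper bound on the runtime fraction those misses can account for.
potential_saving = stall_hits_on_l3_sites * stall_ratio
print(f"fixing those misses could save up to ~{potential_saving:.0%} of runtime")
```

With these made-up inputs the answer to "how much faster will my app go?" is bounded at 18% - a number an app developer can act on without knowing any model-specific event encodings.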
* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2 2011-04-22 20:30 ` Ingo Molnar @ 2011-04-22 20:32 ` Ingo Molnar 2011-04-23 0:03 ` Andi Kleen 2011-04-23 20:14 ` [PATCH] perf events: Add stalled cycles generic event - PERF_COUNT_HW_STALLED_CYCLES Ingo Molnar 1 sibling, 1 reply; 80+ messages in thread From: Ingo Molnar @ 2011-04-22 20:32 UTC (permalink / raw) To: arun Cc: Stephane Eranian, Arnaldo Carvalho de Melo, linux-kernel, Andi Kleen, Peter Zijlstra, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner, Peter Zijlstra, eranian, Arun Sharma, Linus Torvalds, Andrew Morton * Ingo Molnar <mingo@elte.hu> wrote: > * arun@sharma-home.net <arun@sharma-home.net> wrote: > > > On Fri, Apr 22, 2011 at 12:52:11PM +0200, Ingo Molnar wrote: > > > > > > Using the generalized cache events i can run: > > > > > > $ perf stat --repeat 10 -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u ./array > > > > > > Performance counter stats for './array' (10 runs): > > > > > > 6,719,130 cycles:u ( +- 0.662% ) > > > 5,084,792 instructions:u # 0.757 IPC ( +- 0.000% ) > > > 1,037,032 l1-dcache-loads:u ( +- 0.009% ) > > > 1,003,604 l1-dcache-load-misses:u ( +- 0.003% ) > > > > > > 0.003802098 seconds time elapsed ( +- 13.395% ) > > > > > > I consider that this is 'bad', because for almost every dcache-load there's a > > > dcache-miss - a 99% L1 cache miss rate! > > > > One could argue that all you need is cycles and instructions. [...] > > Yes, and note that with instructions events we even have skid-less PEBS > profiling so seeing the precise . - location of instructions is possible. [ An email gremlin ate that part of the sentence. ] Thanks, Ingo ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2 2011-04-22 20:32 ` Ingo Molnar @ 2011-04-23 0:03 ` Andi Kleen 2011-04-23 7:50 ` Peter Zijlstra 2011-04-23 8:02 ` Ingo Molnar 0 siblings, 2 replies; 80+ messages in thread From: Andi Kleen @ 2011-04-23 0:03 UTC (permalink / raw) To: Ingo Molnar Cc: arun, Stephane Eranian, Arnaldo Carvalho de Melo, linux-kernel, Peter Zijlstra, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner, Peter Zijlstra, eranian, Arun Sharma, Linus Torvalds, Andrew Morton > > Yes, and note that with instructions events we even have skid-less PEBS > > profiling so seeing the precise . > - location of instructions is possible. It was better when it was eaten. PEBS does not actually eliminate skid, unfortunately. The interrupt still occurs later, so the instruction location is off. PEBS merely gives you more information. -Andi ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2 2011-04-23 0:03 ` Andi Kleen @ 2011-04-23 7:50 ` Peter Zijlstra 2011-04-23 12:06 ` Stephane Eranian 2011-04-24 2:19 ` Andi Kleen 2011-04-23 8:02 ` Ingo Molnar 1 sibling, 2 replies; 80+ messages in thread From: Peter Zijlstra @ 2011-04-23 7:50 UTC (permalink / raw) To: Andi Kleen Cc: Ingo Molnar, arun, Stephane Eranian, Arnaldo Carvalho de Melo, linux-kernel, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner, eranian, Arun Sharma, Linus Torvalds, Andrew Morton On Fri, 2011-04-22 at 17:03 -0700, Andi Kleen wrote: > > > Yes, and note that with instructions events we even have skid-less PEBS > > > profiling so seeing the precise . > > - location of instructions is possible. > > It was better when it was eaten. PEBS does not actually eliminated > skid unfortunately. The interrupt still occurs later, so the > instruction location is off. > > PEBS merely gives you more information. You're so skilled at not actually saying anything useful. Are you perchance referring to the fact that the IP reported in the PEBS data is exactly _one_ instruction off? Something that is demonstrated to be fixable? Or are you defining skid differently and not telling us your definition? What are you saying? ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2 2011-04-23 7:50 ` Peter Zijlstra @ 2011-04-23 12:06 ` Stephane Eranian 2011-04-23 12:36 ` Ingo Molnar ` (2 more replies) 2011-04-24 2:19 ` Andi Kleen 1 sibling, 3 replies; 80+ messages in thread From: Stephane Eranian @ 2011-04-23 12:06 UTC (permalink / raw) To: Peter Zijlstra Cc: Andi Kleen, Ingo Molnar, arun, Arnaldo Carvalho de Melo, linux-kernel, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner, eranian, Arun Sharma, Linus Torvalds, Andrew Morton On Sat, Apr 23, 2011 at 9:50 AM, Peter Zijlstra <peterz@infradead.org> wrote: > On Fri, 2011-04-22 at 17:03 -0700, Andi Kleen wrote: >> > > Yes, and note that with instructions events we even have skid-less PEBS >> > > profiling so seeing the precise . >> > - location of instructions is possible. >> >> It was better when it was eaten. PEBS does not actually eliminated >> skid unfortunately. The interrupt still occurs later, so the >> instruction location is off. >> >> PEBS merely gives you more information. > > You're so skilled at not actually saying anything useful. Are you > perchance referring to the fact that the IP reported in the PEBS data is > exactly _one_ instruction off? Something that is demonstrated to be > fixable? > > Or are you defining skid differently and not telling us your definition? > PEBS is guaranteed to return an IP that is just after AN instruction that caused the event. However, that instruction is NOT the one at the end of your period. Let's take an example with INST_RETIRED, period=100000. Then, the IP you get is NOT after the 100,000th retired instruction. It's an instruction that is N cycles after that one. There is internal skid due to the way PEBS is implemented. That is what Andi is referring to. The issue causes bias and thus impacts the quality of the samples. On SNB, there is a new INST_RETIRED:PREC_DIST event. PREC_DIST=precise distribution. 
It tries to correct for the skid on this event on INST_RETIRED with PEBS (look at Vol3b). ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2 2011-04-23 12:06 ` Stephane Eranian @ 2011-04-23 12:36 ` Ingo Molnar 2011-04-23 13:16 ` Peter Zijlstra 2011-04-24 2:15 ` Andi Kleen 2 siblings, 0 replies; 80+ messages in thread From: Ingo Molnar @ 2011-04-23 12:36 UTC (permalink / raw) To: Stephane Eranian Cc: Peter Zijlstra, Andi Kleen, arun, Arnaldo Carvalho de Melo, linux-kernel, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner, eranian, Arun Sharma, Linus Torvalds, Andrew Morton * Stephane Eranian <eranian@google.com> wrote: > On Sat, Apr 23, 2011 at 9:50 AM, Peter Zijlstra <peterz@infradead.org> wrote: > > On Fri, 2011-04-22 at 17:03 -0700, Andi Kleen wrote: > >> > > Yes, and note that with instructions events we even have skid-less PEBS > >> > > profiling so seeing the precise . > >> > - location of instructions is possible. > >> > >> It was better when it was eaten. PEBS does not actually eliminated > >> skid unfortunately. The interrupt still occurs later, so the > >> instruction location is off. > >> > >> PEBS merely gives you more information. > > > > You're so skilled at not actually saying anything useful. Are you > > perchance referring to the fact that the IP reported in the PEBS data is > > exactly _one_ instruction off? Something that is demonstrated to be > > fixable? > > > > Or are you defining skid differently and not telling us your definition? > > > > PEBS is guaranteed to return an IP that is just after AN instruction that > caused the event. However, that instruction is NOT the one at the end of your > period. Let's take an example with INST_RETIRED, period=100000. Then, the IP > you get is NOT after the 100,000th retired instruction. It's an instruction > that is N cycles after that one. There is internal skid due to the way PEBS > is implemented. You are really misapplying the common-sense definition of 'skid'. 
Skid refers to the instruction causing a profiler hit being mis-identified. Google 'x86 pmu skid' and read the third entry: your own prior posting ;-) What you are referring to here is not really classic skid but a small, mostly constant skew in the period length with some very small amount of variability. It's thus mostly immaterial - at most a second or third order effect with typical frequencies of sampling. Thanks, Ingo ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2 2011-04-23 12:06 ` Stephane Eranian 2011-04-23 12:36 ` Ingo Molnar @ 2011-04-23 13:16 ` Peter Zijlstra 2011-04-25 18:48 ` Stephane Eranian 2011-04-25 19:40 ` Andi Kleen 2011-04-24 2:15 ` Andi Kleen 2 siblings, 2 replies; 80+ messages in thread From: Peter Zijlstra @ 2011-04-23 13:16 UTC (permalink / raw) To: Stephane Eranian Cc: Andi Kleen, Ingo Molnar, arun, Arnaldo Carvalho de Melo, linux-kernel, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner, eranian, Arun Sharma, Linus Torvalds, Andrew Morton On Sat, 2011-04-23 at 14:06 +0200, Stephane Eranian wrote: > On Sat, Apr 23, 2011 at 9:50 AM, Peter Zijlstra <peterz@infradead.org> wrote: > > On Fri, 2011-04-22 at 17:03 -0700, Andi Kleen wrote: > >> > > Yes, and note that with instructions events we even have skid-less PEBS > >> > > profiling so seeing the precise . > >> > - location of instructions is possible. > >> > >> It was better when it was eaten. PEBS does not actually eliminated > >> skid unfortunately. The interrupt still occurs later, so the > >> instruction location is off. > >> > >> PEBS merely gives you more information. > > > > You're so skilled at not actually saying anything useful. Are you > > perchance referring to the fact that the IP reported in the PEBS data is > > exactly _one_ instruction off? Something that is demonstrated to be > > fixable? > > > > Or are you defining skid differently and not telling us your definition? > > > > PEBS is guaranteed to return an IP that is just after AN instruction that > caused the event. However, that instruction is NOT the one at the end > of your period. Let's take an example with INST_RETIRED, period=100000. > Then, the IP you get is NOT after the 100,000th retired instruction. It's an > instruction that is N cycles after that one. There is internal skid due to the > way PEBS is implemented. > > That is what Andi is referring to. 
The issue causes bias and thus impacts > the quality of the samples. On SNB, there is a new INST_RETIRED:PREC_DIST > event. PREC_DIST=precise distribution. It tries to correct for the skid > on this event on INST_RETIRED with PEBS (look at Vol3b). Sure, but who cares? So your period isn't exactly what you specified, but the effective period will have an average and a fairly small stdev (assuming the initial period is much larger than the relatively few cycles it takes to arm the PEBS assist), therefore you still get a fairly uniform spread. I don't much get the obsession with precision here, it's all a statistics game anyway. And while you keep saying the examples are too trivial and Andi keeps spouting vague non-statements, neither of you actually provides anything sensible to the discussion. So stop f*cking whining and start talking sense, or stop talking altogether. I mean, you were in the room where Intel presented their research on event correlations based on pathological micro-benches. That clearly shows that exact event definitions simply don't matter. Similarly, all this precision wanking isn't _that_ important; the big fish clearly stand out. It's only when you start shaving off the last few cycles that all that really comes in handy - before that it's mostly: ooh, thinking is hard, let's go shopping. ^ permalink raw reply [flat|nested] 80+ messages in thread
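[Editorial note: the statistical argument above - that the arming delay only smears the effective period slightly - can be illustrated with a toy simulation. The latency distribution below is invented; the point is only the ratio of the spread to the period:]

```python
import random

random.seed(42)  # deterministic toy run

NOMINAL_PERIOD = 100_000   # instructions between samples, as in the example
RUNS = 10_000

# Assume the PEBS assist arms within a few hundred extra instructions
# after the counter overflows (made-up bound).
effective = [NOMINAL_PERIOD + random.randint(0, 500) for _ in range(RUNS)]

mean = sum(effective) / RUNS
stdev = (sum((p - mean) ** 2 for p in effective) / RUNS) ** 0.5

# The relative spread is tiny compared to the period itself.
print(f"mean={mean:.0f}, stdev={stdev:.1f}, stdev/period={stdev / NOMINAL_PERIOD:.4%}")
```

Even with a generous 500-instruction arming window, the effective period's spread is a fraction of a percent of the period - the samples still land essentially uniformly, which is the "fairly uniform spread" claim in statistical form. (It does not model Andi's separate point that the skid may be biased rather than random.)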
* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2 2011-04-23 13:16 ` Peter Zijlstra @ 2011-04-25 18:48 ` Stephane Eranian 2011-04-25 19:40 ` Andi Kleen 1 sibling, 0 replies; 80+ messages in thread From: Stephane Eranian @ 2011-04-25 18:48 UTC (permalink / raw) To: Peter Zijlstra Cc: Andi Kleen, Ingo Molnar, arun, Arnaldo Carvalho de Melo, linux-kernel, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner, eranian, Arun Sharma, Linus Torvalds, Andrew Morton On Sat, Apr 23, 2011 at 3:16 PM, Peter Zijlstra <peterz@infradead.org> wrote: > On Sat, 2011-04-23 at 14:06 +0200, Stephane Eranian wrote: >> On Sat, Apr 23, 2011 at 9:50 AM, Peter Zijlstra <peterz@infradead.org> wrote: >> > On Fri, 2011-04-22 at 17:03 -0700, Andi Kleen wrote: >> >> > > Yes, and note that with instructions events we even have skid-less PEBS >> >> > > profiling so seeing the precise . >> >> > - location of instructions is possible. >> >> >> >> It was better when it was eaten. PEBS does not actually eliminated >> >> skid unfortunately. The interrupt still occurs later, so the >> >> instruction location is off. >> >> >> >> PEBS merely gives you more information. >> > >> > You're so skilled at not actually saying anything useful. Are you >> > perchance referring to the fact that the IP reported in the PEBS data is >> > exactly _one_ instruction off? Something that is demonstrated to be >> > fixable? >> > >> > Or are you defining skid differently and not telling us your definition? >> > >> >> PEBS is guaranteed to return an IP that is just after AN instruction that >> caused the event. However, that instruction is NOT the one at the end >> of your period. Let's take an example with INST_RETIRED, period=100000. >> Then, the IP you get is NOT after the 100,000th retired instruction. It's an >> instruction that is N cycles after that one. There is internal skid due to the >> way PEBS is implemented. >> >> That is what Andi is referring to. 
The issue causes bias and thus impacts >> the quality of the samples. On SNB, there is a new INST_RETIRED:PREC_DIST >> event. PREC_DIST=precise distribution. It tries to correct for the skid >> on this event on INST_RETIRED with PEBS (look at Vol3b). > > Sure, but who cares? So your period isn't exactly what you specified, > but the effective period will have an average and a fairly small stdev > (assuming the initial period is much larger than the relatively few > cycles it takes to arm the PEBS assist), therefore you still get a > fairly uniform spread. > > I don't much get the obsession with precision here, its all a statistics > game anyway. > The particular example I am thinking about came from compiler people I work with who would like to use PEBS to do statistical basic block profiling. They do care about correctness of the profile. Otherwise, it may cause wrong attribution of "hotness" of basic blocks and mislead the compiler when it tries to reorder blocks on the critical path. Compiler people can validate a statistical profile because they have a reference profile obtained via instrumentation of each basic block. > And while you keep saying the examples are too trivial and Andi keep > sprouting vague non-statements, neither of you actually provide anything > sensible to the discussion. > > So stop f*cking whining and start talking sense or stop talking all > together. > > I mean, you were in the room where Intel presented their research on > event correlations based on pathological micro-benches. That clearly > shows that exact event definitions simply don't matter. > Yes, and I don't get the same reading of the presentation. He never mentioned generic events. He never even used them, I mean the Intel generic events. Instead he used very focused Atom-specific events. ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2 2011-04-23 13:16 ` Peter Zijlstra 2011-04-25 18:48 ` Stephane Eranian @ 2011-04-25 19:40 ` Andi Kleen 2011-04-25 19:55 ` Ingo Molnar 1 sibling, 1 reply; 80+ messages in thread From: Andi Kleen @ 2011-04-25 19:40 UTC (permalink / raw) To: Peter Zijlstra Cc: Stephane Eranian, Ingo Molnar, arun, Arnaldo Carvalho de Melo, linux-kernel, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner, eranian, Arun Sharma, Linus Torvalds, Andrew Morton > Sure, but who cares? So your period isn't exactly what you specified, > but the effective period will have an average and a fairly small stdev > (assuming the initial period is much larger than the relatively few > cycles it takes to arm the PEBS assist), therefore you still get a > fairly uniform spread. The skid is not uniform and not necessarily random, unfortunately, and is difficult to correct in a standard way. > I don't much get the obsession with precision here, its all a statistics > game anyway. If you want to make your code faster, it's often important to figure out what exactly is slow. One example of this we had recently in the kernel: a function accesses three global objects. Scalability tanks when the test is run with more CPUs. Now the hit is near the three memory accesses. Which one is the one that is actually bouncing cache lines? The CPU executes them all in parallel so it's hard to tell. It's all in the out-of-order reordering window. With the right events, PEBS (e.g. the memory latency event) can give you some information about which memory access is to blame, but it's not using the RIP. The generic events won't help with that, because they're still RIP-based, which is not accurate. 
> Similarly all this precision wanking isn't _that_ important, the big > fish clearly stand out, its only when you start shaving off the last few > cycles that all that really comes in handy, before that its mostly: ooh > thinking is hard, lets go shopping. I wish it was that easy. In the example above it's about scaling or not scaling, which is definitely not the last cycle, but more a life-and-death "is the workload feasible on this machine or not" question. -Andi -- ak@linux.intel.com -- Speaking for myself only ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2 2011-04-25 19:40 ` Andi Kleen @ 2011-04-25 19:55 ` Ingo Molnar 0 siblings, 0 replies; 80+ messages in thread From: Ingo Molnar @ 2011-04-25 19:55 UTC (permalink / raw) To: Andi Kleen Cc: Peter Zijlstra, Stephane Eranian, arun, Arnaldo Carvalho de Melo, linux-kernel, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner, eranian, Arun Sharma, Linus Torvalds, Andrew Morton * Andi Kleen <ak@linux.intel.com> wrote: > One example of this we had recently in the kernel: > > function accesses three global objects. Scalability tanks when the test is > run with more CPUs. Now the hit is near the three memory accesses. Which one > is the one that is actually bouncing cache lines? that's not an example - you are still only giving vague, untestable, unverifiable references. You need to give us something specific and reproducible - preferably a testcase. Peter and me are doing lots of scalability work in the core kernel and for most problems i've met it was enough if we knew the function name - the scalability problem is typically very obvious from that point on - and an annotated profile makes it even more obvious. I've never met a situation what you describe, that it was not possible to disambiguate a real SMP bounce - and i've been fixing SMP bounces in the kernel for over ten years. So you really will have to back up your point with an accurate, reproducible testcase - vague statements like the ones you are making i do not accept at face value, sorry. Thanks, Ingo ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2 2011-04-23 12:06 ` Stephane Eranian 2011-04-23 12:36 ` Ingo Molnar 2011-04-23 13:16 ` Peter Zijlstra @ 2011-04-24 2:15 ` Andi Kleen 2 siblings, 0 replies; 80+ messages in thread From: Andi Kleen @ 2011-04-24 2:15 UTC (permalink / raw) To: Stephane Eranian Cc: Peter Zijlstra, Ingo Molnar, arun, Arnaldo Carvalho de Melo, linux-kernel, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner, eranian, Arun Sharma, Linus Torvalds, Andrew Morton > > That is what Andi is referring to. The issue causes bias and thus impacts > the quality of the samples. On SNB, there is a new INST_RETIRED:PREC_DIST > event. PREC_DIST=precise distribution. It tries to correct for the skid > on this event on INST_RETIRED with PEBS (look at Vol3b). Unfortunately even PDIST doesn't completely fix the problem; it only makes it somewhat better. Also, it's only statistical, so you won't get a guaranteed answer for every sample. -Andi ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2 2011-04-23 7:50 ` Peter Zijlstra 2011-04-23 12:06 ` Stephane Eranian @ 2011-04-24 2:19 ` Andi Kleen 2011-04-25 17:41 ` Ingo Molnar 1 sibling, 1 reply; 80+ messages in thread From: Andi Kleen @ 2011-04-24 2:19 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, arun, Stephane Eranian, Arnaldo Carvalho de Melo, linux-kernel, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner, eranian, Arun Sharma, Linus Torvalds, Andrew Morton > You're so skilled at not actually saying anything useful. Are you > perchance referring to the fact that the IP reported in the PEBS data is > exactly _one_ instruction off? Something that is demonstrated to be > fixable? It's one instruction off the instruction that was retired when the PEBS interrupt was ready, but not one instruction off the instruction that caused the event. There's still skid in triggering the interrupt. The main good thing about PEBS is that you can get some information about the state of the instruction, just not the EIP. For example with the memory latency event you can actually get the address and memory cache state (as Lin Ming's patchkit implements) -Andi ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2 2011-04-24 2:19 ` Andi Kleen @ 2011-04-25 17:41 ` Ingo Molnar 2011-04-25 18:00 ` Dehao Chen [not found] ` <BANLkTiks31-pMJe4zCKrppsrA1d6KanJFA@mail.gmail.com> 0 siblings, 2 replies; 80+ messages in thread From: Ingo Molnar @ 2011-04-25 17:41 UTC (permalink / raw) To: Andi Kleen Cc: Peter Zijlstra, arun, Stephane Eranian, Arnaldo Carvalho de Melo, linux-kernel, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner, eranian, Arun Sharma, Linus Torvalds, Andrew Morton * Andi Kleen <ak@linux.intel.com> wrote: > > You're so skilled at not actually saying anything useful. Are you > > perchance referring to the fact that the IP reported in the PEBS data is > > exactly _one_ instruction off? Something that is demonstrated to be > > fixable? > > It's one instruction off the instruction that was retired when the PEBS > interrupt was ready, but not one instruction off the instruction that caused > the event. There's still skid in triggering the interrupt. Peter answered this in the other mail: | | Sure, but who cares? So your period isn't exactly what you specified, but | the effective period will have an average and a fairly small stdev (assuming | the initial period is much larger than the relatively few cycles it takes to | arm the PEBS assist), therefore you still get a fairly uniform spread. | ... and the resulting low level of noise in the average period length is what matters. The instruction itself will still be one of the hotspot instructions, statistically. Thanks, Ingo ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2 2011-04-25 17:41 ` Ingo Molnar @ 2011-04-25 18:00 ` Dehao Chen [not found] ` <BANLkTiks31-pMJe4zCKrppsrA1d6KanJFA@mail.gmail.com> 1 sibling, 0 replies; 80+ messages in thread From: Dehao Chen @ 2011-04-25 18:00 UTC (permalink / raw) To: linux-kernel On Tue, Apr 26, 2011 at 1:41 AM, Ingo Molnar <mingo@elte.hu> wrote: > > * Andi Kleen <ak@linux.intel.com> wrote: > > > > You're so skilled at not actually saying anything useful. Are you > > > perchance referring to the fact that the IP reported in the PEBS data is > > > exactly _one_ instruction off? Something that is demonstrated to be > > > fixable? > > > > It's one instruction off the instruction that was retired when the PEBS > > interrupt was ready, but not one instruction off the instruction that caused > > the event. There's still skid in triggering the interrupt. > > Peter answered this in the other mail: > > | > | Sure, but who cares? So your period isn't exactly what you specified, but > | the effective period will have an average and a fairly small stdev (assuming > | the initial period is much larger than the relatively few cycles it takes to > | arm the PEBS assist), therefore you still get a fairly uniform spread. > | > > ... and the resulting low level of noise in the average period length is what > matters. The instruction itself will still be one of the hotspot instructions, > statistically. Not true. This skid will lead to aggregation and shadow effects on certain instructions. To make things worse, these effects are deterministic and cannot be removed either by sampling multiple times or by averaging among instructions within a basic block. As a result, some actual hot spots are not sampled at all. Simply collect a basic-block-level CPI and you'll get a very misleading profile. 
Dehao > > Thanks, > > Ingo > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 80+ messages in thread
[parent not found: <BANLkTiks31-pMJe4zCKrppsrA1d6KanJFA@mail.gmail.com>]
* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2 [not found] ` <BANLkTiks31-pMJe4zCKrppsrA1d6KanJFA@mail.gmail.com> @ 2011-04-25 18:05 ` Ingo Molnar 2011-04-25 18:39 ` Stephane Eranian 0 siblings, 1 reply; 80+ messages in thread From: Ingo Molnar @ 2011-04-25 18:05 UTC (permalink / raw) To: Dehao Chen Cc: Andi Kleen, Peter Zijlstra, arun, Stephane Eranian, Arnaldo Carvalho de Melo, linux-kernel, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner, eranian, Arun Sharma, Linus Torvalds, Andrew Morton * Dehao Chen <danielcdh@gmail.com> wrote: > > ... and the resulting low level of noise in the average period length is > > what matters. The instruction itself will still be one of the hotspot > > instructions, statistically. > > Not true. This skid will lead to some aggregation and shadow effects on some > certain instructions. To make things worse, these effects are deterministic > and cannot be removed by either sampling for multiple times or by averaging > among instructions within a basic block. As a result, some actual "hot spot" > are not sampled at all. You can simply try to collect a basic block level > CPI, and you'll get a very misleading profile. This certainly does not match the results i'm seeing on real applications, using "-e instructions:pp" PEBS+LBR profiling. How do you explain that? Also, can you demonstrate your claim with a real example? Thanks, Ingo ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2 2011-04-25 18:05 ` Ingo Molnar @ 2011-04-25 18:39 ` Stephane Eranian 2011-04-25 19:45 ` Ingo Molnar 0 siblings, 1 reply; 80+ messages in thread From: Stephane Eranian @ 2011-04-25 18:39 UTC (permalink / raw) To: Ingo Molnar Cc: Dehao Chen, Andi Kleen, Peter Zijlstra, arun, Arnaldo Carvalho de Melo, linux-kernel, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner, eranian, Arun Sharma, Linus Torvalds, Andrew Morton On Mon, Apr 25, 2011 at 8:05 PM, Ingo Molnar <mingo@elte.hu> wrote: > > * Dehao Chen <danielcdh@gmail.com> wrote: > >> > ... and the resulting low level of noise in the average period length is >> > what matters. The instruction itself will still be one of the hotspot >> > instructions, statistically. >> >> Not true. This skid will lead to some aggregation and shadow effects on some >> certain instructions. To make things worse, these effects are deterministic >> and cannot be removed by either sampling for multiple times or by averaging >> among instructions within a basic block. As a result, some actual "hot spot" >> are not sampled at all. You can simply try to collect a basic block level >> CPI, and you'll get a very misleading profile. > > This certainly does not match the results i'm seeing on real applications, > using "-e instructions:pp" PEBS+LBR profiling. How do you explain that? Also, > can you demonstrate your claim with a real example? > LBR removes the off-by-1 IP problem, it does not remove the shadow effect, i.e., that blind spot of N cycles caused by the PEBS arming mechanism. ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2 2011-04-25 18:39 ` Stephane Eranian @ 2011-04-25 19:45 ` Ingo Molnar 0 siblings, 0 replies; 80+ messages in thread From: Ingo Molnar @ 2011-04-25 19:45 UTC (permalink / raw) To: Stephane Eranian Cc: Dehao Chen, Andi Kleen, Peter Zijlstra, arun, Arnaldo Carvalho de Melo, linux-kernel, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner, eranian, Arun Sharma, Linus Torvalds, Andrew Morton * Stephane Eranian <eranian@google.com> wrote: > > This certainly does not match the results i'm seeing on real applications, > > using "-e instructions:pp" PEBS+LBR profiling. How do you explain that? > > Also, can you demonstrate your claim with a real example? > > LBR removes the off-by-1 IP problem, it does not remove the shadow effect, > i.e., that blind spot of N cycles caused by the PEBS arming mechanism. I really think you are grasping at straws here - unless you are able to demonstrate clear problems - which you have failed to do so far. The pure act of profiling probably disturbs a typical workload statistically more than a few cycles' skew of the period. I could imagine artifacts with really short periods and artificially short and dominant hotpaths - but in those cases the skew does not matter much in practice: a short and dominant hotpath is pinpointed very easily ... So i really think it's a non-issue in practice - but you can certainly prove me wrong by demonstrating whatever problems you suspect. Thanks, Ingo ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2 2011-04-23 0:03 ` Andi Kleen 2011-04-23 7:50 ` Peter Zijlstra @ 2011-04-23 8:02 ` Ingo Molnar 1 sibling, 0 replies; 80+ messages in thread From: Ingo Molnar @ 2011-04-23 8:02 UTC (permalink / raw) To: Andi Kleen Cc: arun, Stephane Eranian, Arnaldo Carvalho de Melo, linux-kernel, Peter Zijlstra, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner, Peter Zijlstra, eranian, Arun Sharma, Linus Torvalds, Andrew Morton * Andi Kleen <ak@linux.intel.com> wrote: > > > Yes, and note that with instructions events we even have skid-less PEBS > > > profiling so seeing the precise . > - location of instructions is possible. > > It was better when it was eaten. PEBS does not actually eliminate skid, unfortunately. The interrupt still occurs later, so the instruction location is off. > > PEBS merely gives you more information. Have you actually tried perf's PEBS support feature? Try: perf record -e instructions:pp ./myapp (the ':pp' postfix stands for 'precise' and activates PEBS+LBR tricks.) Look at the perf report --tui annotated assembly output (or check 'perf annotate' directly) and see how precise and skid-less the hits are. Works pretty well on Nehalem. 
Here's a cache-bound loop with skid (profiled with '-e instructions'):

         :  0000000000400390 <main>:
    0.00 :    400390:  31 c0                    xor    %eax,%eax
    0.00 :    400392:  eb 22                    jmp    4003b6 <main+0x26>
   12.08 :    400394:  fe 84 10 50 08 60 00     incb   0x600850(%rax,%rdx,1)
   87.92 :    40039b:  48 81 c2 10 27 00 00     add    $0x2710,%rdx
    0.00 :    4003a2:  48 81 fa 00 e1 f5 05     cmp    $0x5f5e100,%rdx
    0.00 :    4003a9:  75 e9                    jne    400394 <main+0x4>
    0.00 :    4003ab:  48 ff c0                 inc    %rax
    0.00 :    4003ae:  48 3d 10 27 00 00        cmp    $0x2710,%rax
    0.00 :    4003b4:  74 04                    je     4003ba <main+0x2a>
    0.00 :    4003b6:  31 d2                    xor    %edx,%edx
    0.00 :    4003b8:  eb da                    jmp    400394 <main+0x4>
    0.00 :    4003ba:  31 c0                    xor    %eax,%eax

Those 'ADD' instruction hits are bogus: 99% of the cost in this function is in the INCB, but the PMU NMI often skids to the next (few) instructions.

Profiled with "-e instructions:pp" we get:

         :  0000000000400390 <main>:
    0.00 :    400390:  31 c0                    xor    %eax,%eax
    0.00 :    400392:  eb 22                    jmp    4003b6 <main+0x26>
   85.33 :    400394:  fe 84 10 50 08 60 00     incb   0x600850(%rax,%rdx,1)
    0.00 :    40039b:  48 81 c2 10 27 00 00     add    $0x2710,%rdx
   14.67 :    4003a2:  48 81 fa 00 e1 f5 05     cmp    $0x5f5e100,%rdx
    0.00 :    4003a9:  75 e9                    jne    400394 <main+0x4>
    0.00 :    4003ab:  48 ff c0                 inc    %rax
    0.00 :    4003ae:  48 3d 10 27 00 00        cmp    $0x2710,%rax
    0.00 :    4003b4:  74 04                    je     4003ba <main+0x2a>
    0.00 :    4003b6:  31 d2                    xor    %edx,%edx
    0.00 :    4003b8:  eb da                    jmp    400394 <main+0x4>
    0.00 :    4003ba:  31 c0                    xor    %eax,%eax

The INCB has the most hits as expected - but we also learn that there's something about the CMP.

Thanks,

	Ingo

^ permalink raw reply [flat|nested] 80+ messages in thread
* [PATCH] perf events: Add stalled cycles generic event - PERF_COUNT_HW_STALLED_CYCLES 2011-04-22 20:30 ` Ingo Molnar 2011-04-22 20:32 ` Ingo Molnar @ 2011-04-23 20:14 ` Ingo Molnar 2011-04-24 6:16 ` Arun Sharma 1 sibling, 1 reply; 80+ messages in thread From: Ingo Molnar @ 2011-04-23 20:14 UTC (permalink / raw) To: arun Cc: Stephane Eranian, Arnaldo Carvalho de Melo, linux-kernel, Andi Kleen, Peter Zijlstra, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner, Peter Zijlstra, eranian, Arun Sharma, Linus Torvalds, Andrew Morton * Ingo Molnar <mingo@elte.hu> wrote: > > [...] If there is an expensive load, you'll see that the load instruction > > takes many cycles and you can infer that it's a cache miss. > > > > Questions app developers typically ask me: > > > > * If I fix all my top 5 L3 misses how much faster will my app go? > > This has come up: we could add a 'stalled/idle-cycles' generic event - i.e. > cycles spent without performing useful work in the pipelines. (Resource-stall > events on Intel CPUs.) How about something like the patch below? Ingo --- Subject: perf events: Add stalled cycles generic event - PERF_COUNT_HW_STALLED_CYCLES From: Ingo Molnar <mingo@elte.hu> The new PERF_COUNT_HW_STALLED_CYCLES event tries to approximate cycles in which the CPU does nothing useful because it is stalled on a cache-miss or some other condition. Note: this is still incomplete and will only work on Intel Nehalem CPUs for now; the intel_perfmon_event_map[] needs to be properly split between the major models. 
Also update 'perf stat' to print: 611,527 cycles 400,553 instructions # ( 0.7 instructions per cycle ) 77,809 stalled-cycles # ( 12.7% of all cycles ) 0.000610987 seconds time elapsed Signed-off-by: Ingo Molnar <mingo@elte.hu> --- arch/x86/kernel/cpu/perf_event_intel.c | 2 ++ include/linux/perf_event.h | 1 + tools/perf/builtin-stat.c | 11 +++++++++-- tools/perf/util/parse-events.c | 1 + tools/perf/util/python.c | 1 + 5 files changed, 14 insertions(+), 2 deletions(-) Index: linux/arch/x86/kernel/cpu/perf_event_intel.c =================================================================== --- linux.orig/arch/x86/kernel/cpu/perf_event_intel.c +++ linux/arch/x86/kernel/cpu/perf_event_intel.c @@ -34,6 +34,8 @@ static const u64 intel_perfmon_event_map [PERF_COUNT_HW_BRANCH_INSTRUCTIONS] = 0x00c4, [PERF_COUNT_HW_BRANCH_MISSES] = 0x00c5, [PERF_COUNT_HW_BUS_CYCLES] = 0x013c, + [PERF_COUNT_HW_STALLED_CYCLES] = 0xffa2, /* 0xff: All reasons, 0xa2: Resource stalls */ + }; static struct event_constraint intel_core_event_constraints[] = Index: linux/include/linux/perf_event.h =================================================================== --- linux.orig/include/linux/perf_event.h +++ linux/include/linux/perf_event.h @@ -52,6 +52,7 @@ enum perf_hw_id { PERF_COUNT_HW_BRANCH_INSTRUCTIONS = 4, PERF_COUNT_HW_BRANCH_MISSES = 5, PERF_COUNT_HW_BUS_CYCLES = 6, + PERF_COUNT_HW_STALLED_CYCLES = 7, PERF_COUNT_HW_MAX, /* non-ABI */ }; Index: linux/tools/perf/builtin-stat.c =================================================================== --- linux.orig/tools/perf/builtin-stat.c +++ linux/tools/perf/builtin-stat.c @@ -442,7 +442,7 @@ static void abs_printout(int cpu, struct if (total) ratio = avg / total; - fprintf(stderr, " # %10.3f IPC ", ratio); + fprintf(stderr, " # ( %3.1f instructions per cycle )", ratio); } else if (perf_evsel__match(evsel, HARDWARE, HW_BRANCH_MISSES) && runtime_branches_stats[cpu].n != 0) { total = avg_stats(&runtime_branches_stats[cpu]); @@ -450,7 +450,7 @@ static 
void abs_printout(int cpu, struct if (total) ratio = avg * 100 / total; - fprintf(stderr, " # %10.3f %% ", ratio); + fprintf(stderr, " # %10.3f %%", ratio); } else if (runtime_nsecs_stats[cpu].n != 0) { total = avg_stats(&runtime_nsecs_stats[cpu]); @@ -459,6 +459,13 @@ static void abs_printout(int cpu, struct ratio = 1000.0 * avg / total; fprintf(stderr, " # %10.3f M/sec", ratio); + } else if (perf_evsel__match(evsel, HARDWARE, HW_STALLED_CYCLES)) { + total = avg_stats(&runtime_cycles_stats[cpu]); + + if (total) + ratio = avg / total * 100.0; + + fprintf(stderr, " # (%5.1f%% of all cycles )", ratio); } } Index: linux/tools/perf/util/parse-events.c =================================================================== --- linux.orig/tools/perf/util/parse-events.c +++ linux/tools/perf/util/parse-events.c @@ -38,6 +38,7 @@ static struct event_symbol event_symbols { CHW(BRANCH_INSTRUCTIONS), "branch-instructions", "branches" }, { CHW(BRANCH_MISSES), "branch-misses", "" }, { CHW(BUS_CYCLES), "bus-cycles", "" }, + { CHW(STALLED_CYCLES), "stalled-cycles", "" }, { CSW(CPU_CLOCK), "cpu-clock", "" }, { CSW(TASK_CLOCK), "task-clock", "" }, Index: linux/tools/perf/util/python.c =================================================================== --- linux.orig/tools/perf/util/python.c +++ linux/tools/perf/util/python.c @@ -798,6 +798,7 @@ static struct { { "COUNT_HW_BRANCH_INSTRUCTIONS", PERF_COUNT_HW_BRANCH_INSTRUCTIONS }, { "COUNT_HW_BRANCH_MISSES", PERF_COUNT_HW_BRANCH_MISSES }, { "COUNT_HW_BUS_CYCLES", PERF_COUNT_HW_BUS_CYCLES }, + { "COUNT_HW_STALLED_CYCLES", PERF_COUNT_HW_STALLED_CYCLES }, { "COUNT_HW_CACHE_L1D", PERF_COUNT_HW_CACHE_L1D }, { "COUNT_HW_CACHE_L1I", PERF_COUNT_HW_CACHE_L1I }, { "COUNT_HW_CACHE_LL", PERF_COUNT_HW_CACHE_LL }, ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [PATCH] perf events: Add stalled cycles generic event - PERF_COUNT_HW_STALLED_CYCLES 2011-04-23 20:14 ` [PATCH] perf events: Add stalled cycles generic event - PERF_COUNT_HW_STALLED_CYCLES Ingo Molnar @ 2011-04-24 6:16 ` Arun Sharma 2011-04-25 17:37 ` Ingo Molnar ` (3 more replies) 0 siblings, 4 replies; 80+ messages in thread From: Arun Sharma @ 2011-04-24 6:16 UTC (permalink / raw) To: Ingo Molnar Cc: arun, Stephane Eranian, Arnaldo Carvalho de Melo, linux-kernel, Andi Kleen, Peter Zijlstra, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner, Peter Zijlstra, eranian, Arun Sharma, Linus Torvalds, Andrew Morton On Sat, Apr 23, 2011 at 10:14:09PM +0200, Ingo Molnar wrote: > > The new PERF_COUNT_HW_STALLED_CYCLES event tries to approximate > cycles the CPU does nothing useful, because it is stalled on a > cache-miss or some other condition. Conceptually looks fine. I'd prefer a more precise name such as: PERF_COUNT_EXECUTION_STALLED_CYCLES (to differentiate from frontend or retirement stalls). 
In the example below:

==> foo.c <==
foo() { }
bar() { }

==> test.c <==
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define FNV_PRIME_32 16777619
#define FNV_OFFSET_32 2166136261U

uint32_t hash1(const char *s)
{
    uint32_t hash = FNV_OFFSET_32, i;
    for(i = 0; i < 4; i++) {
        hash = hash ^ (s[i]);        // xor next byte into the bottom of the hash
        hash = hash * FNV_PRIME_32;  // Multiply by prime number found to work well
    }
    return hash;
}

#define FNV_PRIME_WEAK_32 100
#define FNV_OFFSET_WEAK_32 200

uint32_t hash2(const char *s)
{
    uint32_t hash = FNV_OFFSET_WEAK_32, i;
    for(i = 0; i < 4; i++) {
        hash = hash ^ (s[i]);             // xor next byte into the bottom of the hash
        hash = hash * FNV_PRIME_WEAK_32;  // Multiply by prime number found to work well
    }
    return hash;
}

int main()
{
    int r = random();
    while(1) {
        r++;
#ifdef HARD
        if (hash1((const char *) &r) & 0x500)
#else
        if (hash2((const char *) &r) & 0x500)
#endif
            foo();
        else
            bar();
    }
}

==> Makefile <==
all:
	gcc -O2 test.c foo.c -UHARD -o test.easy
	gcc -O2 test.c foo.c -DHARD -o test.hard

# perf stat -e cycles,instructions ./test.hard
^C
 Performance counter stats for './test.hard':

     3,742,855,848 cycles
     4,179,896,309 instructions      # 1.117 IPC

        1.754804730 seconds time elapsed

# perf stat -e cycles,instructions ./test.easy
^C
 Performance counter stats for './test.easy':

     3,932,887,528 cycles
     8,994,416,316 instructions      # 2.287 IPC

        1.843832827 seconds time elapsed

i.e. fixing the branch mispredicts could result in a nearly 2x speedup for the program.

Looking at:

# perf stat -e cycles,instructions,branch-misses,cache-misses,RESOURCE_STALLS:ANY ./test.hard
^C
 Performance counter stats for './test.hard':

     3,413,713,048 cycles                 (scaled from 69.93%)
     3,772,611,782 instructions      # 1.105 IPC  (scaled from 80.01%)
        51,113,340 branch-misses          (scaled from 80.01%)
            12,370 cache-misses           (scaled from 80.02%)
        26,656,983 RESOURCE_STALLS:ANY    (scaled from 69.99%)

        1.626595305 seconds time elapsed

it's hard to spot the opportunity. 
On the other hand:

# ./analyze.py
Percent idle: 27%
	Retirement Stalls: 82%
	Backend Stalls: 0%
	Frontend Stalls: 62%
	Instruction Starvation: 62%
	icache stalls: 0%

does give me a signal about where to look. The script below is a quick and dirty hack. I haven't really validated it with many workloads. I'm posting it here anyway hoping that it'd result in better kernel support for these types of analyses.

Even if we cover this with various generic PERF_COUNT_*STALL events, we'll still have a need for other events:

* Things that give info about instruction mixes.

  Ratio of {loads, stores, floating point, branches, conditional branches}
  to total instructions.

* Activity related to micro architecture specific caches

  People using -funroll-loops may have a significant performance opportunity.
  But it's hard to spot bottlenecks in the instruction decoder.

* Monitoring traffic on Hypertransport/QPI links

Like you observe, most people will not look at these events, so focusing on getting the common events right makes sense. But I still like access to all events (either via a mapping file or a library such as libpfm4). Hiding them in "perf list" sounds like a reasonable way of keeping complexity out.

-Arun

PS: branch-misses:pp was spot on for the example above. 
#!/usr/bin/env python
from optparse import OptionParser
from itertools import izip, chain, repeat
from subprocess import Popen, PIPE
import re, struct

def grouper(n, iterable, padvalue=None):
    "grouper(3, 'abcdefg', 'x') --> ('a','b','c'), ('d','e','f'), ('g','x','x')"
    return izip(*[chain(iterable, repeat(padvalue, n-1))]*n)

counter_re = re.compile('\s+(?P<count>\d+)\s+(?P<event>\S+)')

def sample(events):
    cmd = 'perf stat --no-big-num -a'
    ncounters = 4
    groups = grouper(ncounters, events)
    for g in groups:
        # filter padding
        g = [ e for e in g if e ]
        cmd += ' -e ' + ','.join(g)
    cmd += ' -- sleep ' + str(options.time)
    process = Popen(cmd, shell=True, stdout=PIPE, stderr=PIPE)
    out, err = process.communicate()
    ret = process.poll()
    if ret:
        raise RuntimeError("Perf failed: " + err)
    ret = {}
    for line in err.split('\n'):
        m = counter_re.match(line)
        if not m:
            continue
        ret[m.group('event')] = long(m.group('count'))
    return ret

def measure_cycles():
    # disable C-states
    f = open("/dev/cpu_dma_latency", "wb")
    f.write(struct.pack("i", 0))
    f.flush()
    saved = options.time
    options.time = 1  # one sec is sufficient to measure clock
    cycles = sample(["cycles"])['cycles']
    cycles /= options.time
    f.close()
    options.time = saved
    return cycles

if __name__ == '__main__':
    parser = OptionParser()
    parser.add_option("-t", "--time", dest="time", default=1,
                      help="How long to sample events")
    parser.add_option("-q", "--quiet",
                      action="store_false", dest="verbose", default=True,
                      help="don't print status messages to stdout")
    (options, args) = parser.parse_args()

    cycles_per_sec = measure_cycles()

    c = sample(["cycles", "instructions",
                "UOPS_ISSUED:ANY:c=1", "UOPS_ISSUED:ANY:c=1:t=1",
                "RESOURCE_STALLS:ANY", "UOPS_RETIRED:ANY:c=1:t=1",
                "UOPS_EXECUTED:PORT015:t=1", "UOPS_EXECUTED:PORT234_CORE",
                "UOPS_ISSUED:ANY:t=1", "UOPS_ISSUED:FUSED:t=1",
                "UOPS_RETIRED:ANY:t=1", "L1I:CYCLES_STALLED"])

    cycles = c["cycles"] * 1.0
    cycles_no_uops_issued = cycles - c["UOPS_ISSUED:ANY:c=1:t=1"]
    cycles_no_uops_retired = cycles - c["UOPS_RETIRED:ANY:c=1:t=1"]
    backend_stall_cycles = c["RESOURCE_STALLS:ANY"]
    icache_stall_cycles = c["L1I:CYCLES_STALLED"]

    # Cycle stall accounting
    print "Percent idle: %d%%" % ((1 - cycles/(int(options.time) * cycles_per_sec)) * 100)
    print "\tRetirement Stalls: %d%%" % ((cycles_no_uops_retired / cycles) * 100)
    print "\tBackend Stalls: %d%%" % ((backend_stall_cycles / cycles) * 100)
    print "\tFrontend Stalls: %d%%" % ((cycles_no_uops_issued / cycles) * 100)
    print "\tInstruction Starvation: %d%%" % (((cycles_no_uops_issued - backend_stall_cycles)/cycles) * 100)
    print "\ticache stalls: %d%%" % ((icache_stall_cycles/cycles) * 100)

    # Wasted work
    uops_executed = c["UOPS_EXECUTED:PORT015:t=1"] + c["UOPS_EXECUTED:PORT234_CORE"]
    uops_retired = c["UOPS_RETIRED:ANY:t=1"]
    uops_issued = c["UOPS_ISSUED:ANY:t=1"] + c["UOPS_ISSUED:FUSED:t=1"]

    print "\tPercentage useless uops: %d%%" % ((uops_executed - uops_retired) * 100.0/uops_retired)
    print "\tPercentage useless uops issued: %d%%" % ((uops_issued - uops_retired) * 100.0/uops_retired)

^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [PATCH] perf events: Add stalled cycles generic event - PERF_COUNT_HW_STALLED_CYCLES 2011-04-24 6:16 ` Arun Sharma @ 2011-04-25 17:37 ` Ingo Molnar 2011-04-26 9:25 ` Peter Zijlstra ` (2 subsequent siblings) 3 siblings, 0 replies; 80+ messages in thread From: Ingo Molnar @ 2011-04-25 17:37 UTC (permalink / raw) To: Arun Sharma Cc: arun, Stephane Eranian, Arnaldo Carvalho de Melo, linux-kernel, Andi Kleen, Peter Zijlstra, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner, Peter Zijlstra, eranian, Linus Torvalds, Andrew Morton * Arun Sharma <asharma@fb.com> wrote: > On Sat, Apr 23, 2011 at 10:14:09PM +0200, Ingo Molnar wrote: > > > > The new PERF_COUNT_HW_STALLED_CYCLES event tries to approximate > > cycles the CPU does nothing useful, because it is stalled on a > > cache-miss or some other condition. > > Conceptually looks fine. I'd prefer a more precise name such as: > PERF_COUNT_EXECUTION_STALLED_CYCLES (to differentiate from frontend or > retirement stalls). Ok. Your script: > # ./analyze.py > Percent idle: 27% > Retirement Stalls: 82% > Backend Stalls: 0% > Frontend Stalls: 62% > Instruction Starvation: 62% > icache stalls: 0% > > does give me a signal about where to look. The script below is > a quick and dirty hack. I haven't really validated it with > many workloads. I'm posting it here anyway hoping that it'd > result in better kernel support for these types of analyses. Is pretty useful IMO. The frontend/backend characterisation is pretty generic - most modern CPUs share that and have similar events. So we could try to generalize these and get most of the statistics your script outputs. > Even if we cover this with various generic PERF_COUNT_*STALL events, > we'll still have a need for other events: > > * Things that give info about instruction mixes. > > Ratio of {loads, stores, floating point, branches, conditional branches} > to total instructions. 
We have this at least partially covered, but yeah, we stopped short of covering all instruction types so complete ratios cannot be built yet. > * Activity related to micro architecture specific caches > > People using -funroll-loops may have a significant performance opportunity. > But it's hard to spot bottlenecks in the instruction decoder. > > * Monitoring traffic on Hypertransport/QPI links Cross-node accesses ought to be covered by Peter's RFC patch. In terms of isolating cross-CPU cache accesses i suspect we could do that too if it really matters to analysis in practice. Basically the way to go about it is the testcases you wrote - they demonstrate the utility of a given type of event - and that justifies generalization as well. > Like you observe, most people will not look at these events, so > focusing on getting the common events right makes sense. But I > still like access to all events (either via a mapping file or > a library such as libpfm4). Hiding them in "perf list" sounds > like a reasonable way of keeping complexity out. Yes. We have access to raw events for relatively obscure (or too CPU dependent) events - but what we do not want to do is to extend that space without adding *any* generic event in essence. If something like offcore or uncore PMU support is useful enough to be in the kernel, then it should also be useful enough to gain generic events. > PS: branch-misses:pp was spot on for the example above. heh :-) Thanks, Ingo ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [PATCH] perf events: Add stalled cycles generic event - PERF_COUNT_HW_STALLED_CYCLES 2011-04-24 6:16 ` Arun Sharma 2011-04-25 17:37 ` Ingo Molnar @ 2011-04-26 9:25 ` Peter Zijlstra 2011-04-26 14:00 ` Ingo Molnar 2011-04-27 11:11 ` Ingo Molnar 3 siblings, 0 replies; 80+ messages in thread From: Peter Zijlstra @ 2011-04-26 9:25 UTC (permalink / raw) To: Arun Sharma Cc: Ingo Molnar, arun, Stephane Eranian, Arnaldo Carvalho de Melo, linux-kernel, Andi Kleen, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner, eranian, Linus Torvalds, Andrew Morton On Sat, 2011-04-23 at 23:16 -0700, Arun Sharma wrote: > Conceptually looks fine. I'd prefer a more precise name such as: > PERF_COUNT_EXECUTION_STALLED_CYCLES (to differentiate from frontend or > retirement stalls). Very nice example!! This is the stuff we want people to do, but instead of focusing on the raw event aspect, put in a little more effort and see what it takes to make it work across the board. None of the things you mention are very specific to Intel; afaik the concepts you listed - Retirement, Frontend (instruction decode/uop-issue), Backend (uop execution), I-cache (instruction fetch) - map to pretty much all hardware I know (PMU coverage of these aspects aside). So in fact you propose these concepts, and that is the kind of feedback perf wants and needs. The thing that set off this whole discussion is that most people don't seem to believe in concepts and stick to their very narrow 'HPC, every-last-cycle-matters, therefore-we-need-absolute-events' mentality. That too is a form of vendor lock-in: once you're so dependent on a particular platform, the cost of switching increases dramatically. Furthermore, very few people are actually interested in it. That is not to say we should not enable those people, but the current state of affairs seems to be that some people are only interested in enabling that and simply don't care (and don't want to care) about cross-platform performance analysis and useful abstractions. 
We'd very much like to make the cost of entry for supporting lowlevel capabilities the addition of high-level concepts - a means for the greater public to use them. ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [PATCH] perf events: Add stalled cycles generic event - PERF_COUNT_HW_STALLED_CYCLES 2011-04-24 6:16 ` Arun Sharma 2011-04-25 17:37 ` Ingo Molnar 2011-04-26 9:25 ` Peter Zijlstra @ 2011-04-26 14:00 ` Ingo Molnar 2011-04-27 11:11 ` Ingo Molnar 3 siblings, 0 replies; 80+ messages in thread From: Ingo Molnar @ 2011-04-26 14:00 UTC (permalink / raw) To: Arun Sharma Cc: arun, Stephane Eranian, Arnaldo Carvalho de Melo, linux-kernel, Andi Kleen, Peter Zijlstra, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner, Peter Zijlstra, eranian, Linus Torvalds, Andrew Morton * Arun Sharma <asharma@fb.com> wrote: > On Sat, Apr 23, 2011 at 10:14:09PM +0200, Ingo Molnar wrote: > > > > The new PERF_COUNT_HW_STALLED_CYCLES event tries to approximate > > cycles the CPU does nothing useful, because it is stalled on a > > cache-miss or some other condition. > > Conceptually looks fine. I'd prefer a more precise name such as: > PERF_COUNT_EXECUTION_STALLED_CYCLES (to differentiate from frontend or > retirement stalls). How about this naming convention: PERF_COUNT_HW_STALLED_CYCLES # execution PERF_COUNT_HW_STALLED_CYCLES_FRONTEND # frontend PERF_COUNT_HW_STALLED_CYCLES_ICACHE_MISS # icache So STALLED_CYCLES would be the most general metric, the one that shows the real impact to the application. The other events would then help disambiguate this metric some more. Below is the updated patch - this version makes the backend stalls event properly per model. (with the Nehalem table filled in.) What do you think? Thanks, Ingo ---------------------> Subject: perf events: Add stalled cycles generic event - PERF_COUNT_HW_STALLED_CYCLES From: Ingo Molnar <mingo@elte.hu> Date: Sun Apr 24 08:18:31 CEST 2011 The new PERF_COUNT_HW_STALLED_CYCLES event tries to approximate cycles the CPU does nothing useful, because it is stalled on a cache-miss or some other condition. 
Signed-off-by: Ingo Molnar <mingo@elte.hu> --- arch/x86/kernel/cpu/perf_event_intel.c | 42 +++++++++++++++++++++++++-------- include/linux/perf_event.h | 1 tools/perf/util/parse-events.c | 1 tools/perf/util/python.c | 1 4 files changed, 36 insertions(+), 9 deletions(-) Index: linux/arch/x86/kernel/cpu/perf_event_intel.c =================================================================== --- linux.orig/arch/x86/kernel/cpu/perf_event_intel.c +++ linux/arch/x86/kernel/cpu/perf_event_intel.c @@ -36,6 +36,23 @@ static const u64 intel_perfmon_event_map [PERF_COUNT_HW_BUS_CYCLES] = 0x013c, }; +/* + * Other generic events, Nehalem: + */ +static const u64 intel_nhm_event_map[] = +{ + /* Arch-perfmon events: */ + [PERF_COUNT_HW_CPU_CYCLES] = 0x003c, + [PERF_COUNT_HW_INSTRUCTIONS] = 0x00c0, + [PERF_COUNT_HW_CACHE_REFERENCES] = 0x4f2e, + [PERF_COUNT_HW_CACHE_MISSES] = 0x412e, + [PERF_COUNT_HW_BRANCH_INSTRUCTIONS] = 0x00c4, + [PERF_COUNT_HW_BRANCH_MISSES] = 0x00c5, + [PERF_COUNT_HW_BUS_CYCLES] = 0x013c, + + [PERF_COUNT_HW_STALLED_CYCLES] = 0xffa2, /* 0xff: All reasons, 0xa2: Resource stalls */ +}; + static struct event_constraint intel_core_event_constraints[] = { INTEL_EVENT_CONSTRAINT(0x11, 0x2), /* FP_ASSIST */ @@ -150,6 +167,12 @@ static u64 intel_pmu_event_map(int hw_ev return intel_perfmon_event_map[hw_event]; } +static u64 intel_pmu_nhm_event_map(int hw_event) +{ + return intel_nhm_event_map[hw_event]; +} + + static __initconst const u64 snb_hw_cache_event_ids [PERF_COUNT_HW_CACHE_MAX] [PERF_COUNT_HW_CACHE_OP_MAX] @@ -1400,18 +1423,19 @@ static __init int intel_pmu_init(void) case 26: /* 45 nm nehalem, "Bloomfield" */ case 30: /* 45 nm nehalem, "Lynnfield" */ case 46: /* 45 nm nehalem-ex, "Beckton" */ - memcpy(hw_cache_event_ids, nehalem_hw_cache_event_ids, - sizeof(hw_cache_event_ids)); - memcpy(hw_cache_extra_regs, nehalem_hw_cache_extra_regs, - sizeof(hw_cache_extra_regs)); + memcpy(hw_cache_event_ids, nehalem_hw_cache_event_ids, sizeof(hw_cache_event_ids)); + 
memcpy(hw_cache_extra_regs, nehalem_hw_cache_extra_regs, sizeof(hw_cache_extra_regs)); intel_pmu_lbr_init_nhm(); - x86_pmu.event_constraints = intel_nehalem_event_constraints; - x86_pmu.pebs_constraints = intel_nehalem_pebs_event_constraints; - x86_pmu.percore_constraints = intel_nehalem_percore_constraints; - x86_pmu.enable_all = intel_pmu_nhm_enable_all; - x86_pmu.extra_regs = intel_nehalem_extra_regs; + x86_pmu.event_constraints = intel_nehalem_event_constraints; + x86_pmu.pebs_constraints = intel_nehalem_pebs_event_constraints; + x86_pmu.percore_constraints = intel_nehalem_percore_constraints; + x86_pmu.enable_all = intel_pmu_nhm_enable_all; + x86_pmu.extra_regs = intel_nehalem_extra_regs; + x86_pmu.event_map = intel_pmu_nhm_event_map; + x86_pmu.max_events = ARRAY_SIZE(intel_perfmon_event_map), + pr_cont("Nehalem events, "); break; Index: linux/include/linux/perf_event.h =================================================================== --- linux.orig/include/linux/perf_event.h +++ linux/include/linux/perf_event.h @@ -52,6 +52,7 @@ enum perf_hw_id { PERF_COUNT_HW_BRANCH_INSTRUCTIONS = 4, PERF_COUNT_HW_BRANCH_MISSES = 5, PERF_COUNT_HW_BUS_CYCLES = 6, + PERF_COUNT_HW_STALLED_CYCLES = 7, PERF_COUNT_HW_MAX, /* non-ABI */ }; Index: linux/tools/perf/util/parse-events.c =================================================================== --- linux.orig/tools/perf/util/parse-events.c +++ linux/tools/perf/util/parse-events.c @@ -38,6 +38,7 @@ static struct event_symbol event_symbols { CHW(BRANCH_INSTRUCTIONS), "branch-instructions", "branches" }, { CHW(BRANCH_MISSES), "branch-misses", "" }, { CHW(BUS_CYCLES), "bus-cycles", "" }, + { CHW(STALLED_CYCLES), "stalled-cycles", "" }, { CSW(CPU_CLOCK), "cpu-clock", "" }, { CSW(TASK_CLOCK), "task-clock", "" }, Index: linux/tools/perf/util/python.c =================================================================== --- linux.orig/tools/perf/util/python.c +++ linux/tools/perf/util/python.c @@ -798,6 +798,7 @@ static struct { { 
"COUNT_HW_BRANCH_INSTRUCTIONS", PERF_COUNT_HW_BRANCH_INSTRUCTIONS }, { "COUNT_HW_BRANCH_MISSES", PERF_COUNT_HW_BRANCH_MISSES }, { "COUNT_HW_BUS_CYCLES", PERF_COUNT_HW_BUS_CYCLES }, + { "COUNT_HW_STALLED_CYCLES", PERF_COUNT_HW_STALLED_CYCLES }, { "COUNT_HW_CACHE_L1D", PERF_COUNT_HW_CACHE_L1D }, { "COUNT_HW_CACHE_L1I", PERF_COUNT_HW_CACHE_L1I }, { "COUNT_HW_CACHE_LL", PERF_COUNT_HW_CACHE_LL }, ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [PATCH] perf events: Add stalled cycles generic event - PERF_COUNT_HW_STALLED_CYCLES
  2011-04-24  6:16 ` Arun Sharma
  ` (2 preceding siblings ...)
  2011-04-26 14:00 ` Ingo Molnar
@ 2011-04-27 11:11 ` Ingo Molnar
  2011-04-27 14:47 ` Arun Sharma
  3 siblings, 1 reply; 80+ messages in thread
From: Ingo Molnar @ 2011-04-27 11:11 UTC (permalink / raw)
To: Arun Sharma
Cc: arun, Stephane Eranian, Arnaldo Carvalho de Melo, linux-kernel,
    Andi Kleen, Peter Zijlstra, Lin Ming, Arnaldo Carvalho de Melo,
    Thomas Gleixner, Peter Zijlstra, eranian, Linus Torvalds, Andrew Morton

* Arun Sharma <asharma@fb.com> wrote:

> cycles = c["cycles"] * 1.0
> cycles_no_uops_issued = cycles - c["UOPS_ISSUED:ANY:c=1:t=1"]
> cycles_no_uops_retired = cycles - c["UOPS_RETIRED:ANY:c=1:t=1"]
> backend_stall_cycles = c["RESOURCE_STALLS:ANY"]
> icache_stall_cycles = c["L1I:CYCLES_STALLED"]
>
> # Cycle stall accounting
> print "Percent idle: %d%%" % ((1 - cycles/(int(options.time) * cycles_per_sec)) * 100)
> print "\tRetirement Stalls: %d%%" % ((cycles_no_uops_retired / cycles) * 100)
> print "\tBackend Stalls: %d%%" % ((backend_stall_cycles / cycles) * 100)
> print "\tFrontend Stalls: %d%%" % ((cycles_no_uops_issued / cycles) * 100)
> print "\tInstruction Starvation: %d%%" % (((cycles_no_uops_issued - backend_stall_cycles)/cycles) * 100)
> print "\ticache stalls: %d%%" % ((icache_stall_cycles/cycles) * 100)
>
> # Wasted work
> uops_executed = c["UOPS_EXECUTED:PORT015:t=1"] + c["UOPS_EXECUTED:PORT234_CORE"]
> uops_retired = c["UOPS_RETIRED:ANY:t=1"]
> uops_issued = c["UOPS_ISSUED:ANY:t=1"] + c["UOPS_ISSUED:FUSED:t=1"]
>
> print "\tPercentage useless uops: %d%%" % ((uops_executed - uops_retired) * 100.0/uops_retired)
> print "\tPercentage useless uops issued: %d%%" % ((uops_issued - uops_retired) * 100.0/uops_retired)

Just an update: i started working on generalizing these events.
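The accounting in the quoted script can be exercised stand-alone. Below is a hedged, self-contained sketch of the same arithmetic; the counter values are purely hypothetical (illustrative numbers, not measurements), and the event-name keys just mirror the strings used in the script above:

```python
# Hypothetical Nehalem-style counter readings (illustrative values only).
counts = {
    "cycles":                   100_000_000,
    "UOPS_ISSUED:ANY:c=1:t=1":   70_000_000,
    "UOPS_RETIRED:ANY:c=1:t=1":  60_000_000,
    "RESOURCE_STALLS:ANY":       20_000_000,
    "L1I:CYCLES_STALLED":         5_000_000,
}

cycles = float(counts["cycles"])
cycles_no_uops_issued  = cycles - counts["UOPS_ISSUED:ANY:c=1:t=1"]
cycles_no_uops_retired = cycles - counts["UOPS_RETIRED:ANY:c=1:t=1"]
backend_stall_cycles   = counts["RESOURCE_STALLS:ANY"]

# Cycle stall accounting, expressed as percentages of unhalted cycles.
retirement_stalls      = 100.0 * cycles_no_uops_retired / cycles   # no uops retired
backend_stalls         = 100.0 * backend_stall_cycles / cycles     # resource stalls
frontend_stalls        = 100.0 * cycles_no_uops_issued / cycles    # no uops issued
# Frontend stall cycles not explained by backend back-pressure:
instruction_starvation = 100.0 * (cycles_no_uops_issued
                                  - backend_stall_cycles) / cycles
icache_stalls          = 100.0 * counts["L1I:CYCLES_STALLED"] / cycles
```

With the sample numbers this yields 40% retirement stalls, 30% frontend stalls, 20% backend stalls and 10% instruction starvation, which shows the intended decomposition: starvation is the frontend-stall fraction left over after backend stalls are subtracted.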
As a first step i'd like to introduce stall statistics in the default 'perf stat' output, then as a second step offer more detailed modes of analysis (like your script).

As for the first, 'overview' step, i'd like to use one or two numbers only, to give people a general ballpark figure about how good the CPU is performing for a given workload.

Wouldn't UOPS_EXECUTED.CORE_ACTIVE_CYCLES,c=1,i=1 be in general a pretty good, primary "stall" indicator? This is similar to the "cycles-uops_executed" value in your script (UOPS_EXECUTED:PORT015:t=1 and UOPS_EXECUTED:PORT234_CORE based): it counts cycles when there's no execution at all - not even speculative execution.

This would cover a wide variety of 'stall' reasons: external latency, stalling on lack of parallelism in the incoming instruction stream, and most other stall reasons. So it would measure everything that moves the CPU away from 100% utilization.

Secondly, the 'speculative waste' proportion is probably pretty well captured via branch-misprediction counts - those are the primary source of filling the pipeline with useless work.

So in the most high-level view we could already print useful information via the introduction of a single new generic event:

  PERF_COUNT_HW_CPU_CYCLES_BUSY

and 'idle cycles' are "cycles - busy_cycles".

I have implemented preliminary support for this, and this is how the new 'overview' output currently looks.
Firstly here's the output from a "bad" testcase (lots of branch-misses):

$ perf stat ./branches 20 1

 Performance counter stats for './branches 20 1' (10 runs):

         9.829903 task-clock               #    0.972 CPUs utilized            ( +-  0.07% )
                0 context-switches         #    0.000 M/sec                    ( +-  0.00% )
                0 CPU-migrations           #    0.000 M/sec                    ( +-  0.00% )
              111 page-faults              #    0.011 M/sec                    ( +-  0.09% )
       31,470,886 cycles                   #    3.202 GHz                      ( +-  0.06% )
        9,825,068 stalled-cycles           #   31.22% of all cycles are idle   ( +- 13.89% )
       27,868,090 instructions             #    0.89  insns per cycle
                                           #    0.35  stalled cycles per insn  ( +-  0.02% )
        4,313,661 branches                 #  438.830 M/sec                    ( +-  0.02% )
        1,068,668 branch-misses            #   24.77% of all branches          ( +-  0.01% )

       0.010117520  seconds time elapsed   ( +-  0.14% )

The two important values are the "31.22% of all cycles are idle" and the "24.77% of all branches" missed values - both are high and indicative of trouble. The fixed testcase shows:

 Performance counter stats for './branches 20 0' (100 runs):

         4.417987 task-clock               #    0.948 CPUs utilized            ( +-  0.10% )
                0 context-switches         #    0.000 M/sec                    ( +-  0.00% )
                0 CPU-migrations           #    0.000 M/sec                    ( +-  0.00% )
              111 page-faults              #    0.025 M/sec                    ( +-  0.02% )
       14,135,368 cycles                   #    3.200 GHz                      ( +-  0.10% )
        1,939,275 stalled-cycles           #   13.72% of all cycles are idle   ( +-  4.99% )
       27,846,610 instructions             #    1.97  insns per cycle
                                           #    0.07  stalled cycles per insn  ( +-  0.00% )
        4,309,228 branches                 #  975.383 M/sec                    ( +-  0.00% )
            3,992 branch-misses            #    0.09% of all branches          ( +-  0.26% )

       0.004660164  seconds time elapsed   ( +-  0.15% )

Both stall values are much lower and the instructions per cycle value doubled.
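The annotated columns in these reports are simple ratios of the raw counts. As a sanity check, a short sketch recomputing the derived values from the counts shown in the two runs above (the function name is ours, not perf's):

```python
def perf_stat_derived(cycles, stalled, instructions, branches, branch_misses):
    """Recompute the derived columns 'perf stat' prints from raw counts."""
    return {
        "idle_pct":         round(100.0 * stalled / cycles, 2),
        "insns_per_cycle":  round(instructions / cycles, 2),
        "stalled_per_insn": round(stalled / instructions, 2),
        "branch_miss_pct":  round(100.0 * branch_misses / branches, 2),
    }

# Counts from the "bad" run (./branches 20 1):
bad   = perf_stat_derived(31_470_886, 9_825_068, 27_868_090, 4_313_661, 1_068_668)
# Counts from the fixed run (./branches 20 0):
fixed = perf_stat_derived(14_135_368, 1_939_275, 27_846_610, 4_309_228, 3_992)
# bad   -> 31.22% idle, 0.89 insns/cycle, 0.35 stalled/insn, 24.77% branch misses
# fixed -> 13.72% idle, 1.97 insns/cycle, 0.07 stalled/insn,  0.09% branch misses
```

Reproducing the printed percentages from the raw columns confirms the two numbers worth watching in the overview: the idle fraction and the branch-miss rate.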
Here's another testcase, one that fills the pipeline near-perfectly:

$ perf stat ./fill_1b

 Performance counter stats for './fill_1b':

      1874.601174 task-clock               #    0.998 CPUs utilized
                1 context-switches         #    0.000 M/sec
                0 CPU-migrations           #    0.000 M/sec
              107 page-faults              #    0.000 M/sec
    6,009,321,149 cycles                   #    3.206 GHz
      212,795,827 stalled-cycles           #    3.54% of all cycles are idle
   18,007,646,574 instructions             #    3.00  insns per cycle
                                           #    0.01  stalled cycles per insn
    1,001,527,311 branches                 #  534.262 M/sec
           16,988 branch-misses            #    0.00% of all branches

      1.878558311  seconds time elapsed

Here too both counts are very low.

The next step is to provide the tools to further analyze why the CPU is not utilized perfectly. I have implemented some preliminary code for that too, using generic cache events:

$ perf stat --repeat 10 --detailed ./array-bad

 Performance counter stats for './array-bad' (10 runs):

        50.552646 task-clock               #    0.992 CPUs utilized            ( +-  0.04% )
                0 context-switches         #    0.000 M/sec                    ( +-  0.00% )
                0 CPU-migrations           #    0.000 M/sec                    ( +-  0.00% )
            1,877 page-faults              #    0.037 M/sec                    ( +-  0.01% )
      142,802,193 cycles                   #    2.825 GHz                      ( +-  0.18% )  (22.55%)
       88,622,411 stalled-cycles           #   62.06% of all cycles are idle   ( +-  0.22% )  (34.97%)
       45,381,755 instructions             #    0.32  insns per cycle
                                           #    1.95  stalled cycles per insn  ( +-  0.11% )  (46.94%)
        7,725,207 branches                 #  152.815 M/sec                    ( +-  0.05% )  (58.44%)
           29,788 branch-misses            #    0.39% of all branches          ( +-  1.06% )  (69.46%)
        8,421,969 L1-dcache-loads          #  166.598 M/sec                    ( +-  0.37% )  (70.06%)
        7,868,389 L1-dcache-load-misses    #   93.43% of all L1-dcache hits    ( +-  0.13% )  (58.28%)
        4,553,490 LLC-loads                #   90.074 M/sec                    ( +-  0.31% )  (44.49%)
        1,764,277 LLC-load-misses          #   34.900 M/sec                    ( +-  0.21% )  (9.98%)

      0.050973462  seconds time elapsed   ( +-  0.05% )

The --detailed flag is what activates wider counting. The "93.43% of all L1-dcache hits" value is a giveaway indicator that this particular workload is primarily data-access limited and that much of it escapes into RAM as well.
Is this the direction you'd like to see perf stat move into? Any comments, suggestions?

Thanks,

	Ingo
* Re: [PATCH] perf events: Add stalled cycles generic event - PERF_COUNT_HW_STALLED_CYCLES 2011-04-27 11:11 ` Ingo Molnar @ 2011-04-27 14:47 ` Arun Sharma 2011-04-27 15:48 ` Ingo Molnar 0 siblings, 1 reply; 80+ messages in thread From: Arun Sharma @ 2011-04-27 14:47 UTC (permalink / raw) To: Ingo Molnar Cc: Arun Sharma, Stephane Eranian, Arnaldo Carvalho de Melo, linux-kernel, Andi Kleen, Peter Zijlstra, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner, Peter Zijlstra, eranian, Linus Torvalds, Andrew Morton On Wed, Apr 27, 2011 at 4:11 AM, Ingo Molnar <mingo@elte.hu> wrote: > As for the first, 'overview' step, i'd like to use one or two numbers only, to > give people a general ballpark figure about how good the CPU is performing for > a given workload. > > Wouldnt UOPS_EXECUTED.CORE_ACTIVE_CYCLES,c=1,i=1 be in general a pretty good, > primary "stall" indicator? This is similar to the "cycles-uops_executed" value > in your script (UOPS_EXECUTED:PORT015:t=1 and UOPS_EXECUTED:PORT234_CORE > based): it counts cycles when there's no execution at all - not even > speculative one. If we're going to pick one stall indicator, why not pick cycles where no uops are retiring? cycles_no_uops_retired = cycles - c["UOPS_RETIRED:ANY:c=1:t=1"] In the presence of C-states and some halted cycles, I found that I couldn't measure it via UOPS_RETIRED:ANY:c=1:i=1 because it counts halted cycles too and could be greater than (unhalted) cycles. The other issue I had to deal with was UOPS_RETIRED > UOPS_EXECUTED condition. I believe this is caused by what AMD calls sideband stack optimizer and Intel calls dedicated stack manager (i.e. UOPS executed outside the main pipeline). A recursive fibonacci(30) is a good test case for reproducing this. > > Is this the direction you'd like to see perf stat to move into? Any comments, > suggestions? > Looks like a step in the right direction. Thanks. -Arun ^ permalink raw reply [flat|nested] 80+ messages in thread
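The :c=N and :i=1 modifiers in event strings like UOPS_RETIRED:ANY:c=1:i=1 correspond to the counter-mask (cmask) and invert bits of the per-counter IA32_PERFEVTSELx MSR: with cmask=1 and invert set, the counter increments on cycles where fewer than one uop retired, i.e. "no-retire" cycles. A sketch of that encoding follows, assuming the standard Intel PERFEVTSEL bit layout; the 0xC2/0x01 event and umask codes for UOPS_RETIRED.ANY are taken from Nehalem-era documentation and are illustrative:

```python
def perfevtsel(event, umask, cmask=0, inv=False, usr=False, os=False):
    """Encode a raw config per the Intel IA32_PERFEVTSELx bit layout:
    bits 0-7 event select, 8-15 unit mask, 16 USR, 17 OS,
    bit 23 invert, bits 24-31 counter mask (cmask)."""
    return (((cmask & 0xff) << 24) | (int(inv) << 23) |
            (int(os) << 17) | (int(usr) << 16) |
            ((umask & 0xff) << 8) | (event & 0xff))

# UOPS_RETIRED.ANY with :c=1:i=1 -- "cycles in which no uops retired":
config = perfevtsel(0xC2, 0x01, cmask=1, inv=True)  # -> 0x018001c2
```

Note this is exactly the kind of magic number the thread argues users should not have to construct by hand; the generalized events are meant to hide it.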
* Re: [PATCH] perf events: Add stalled cycles generic event - PERF_COUNT_HW_STALLED_CYCLES 2011-04-27 14:47 ` Arun Sharma @ 2011-04-27 15:48 ` Ingo Molnar 2011-04-27 16:27 ` Ingo Molnar 2011-04-27 19:03 ` Arun Sharma 0 siblings, 2 replies; 80+ messages in thread From: Ingo Molnar @ 2011-04-27 15:48 UTC (permalink / raw) To: Arun Sharma Cc: Arun Sharma, Stephane Eranian, Arnaldo Carvalho de Melo, linux-kernel, Andi Kleen, Peter Zijlstra, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner, Peter Zijlstra, eranian, Linus Torvalds, Andrew Morton * Arun Sharma <arun@sharma-home.net> wrote: > On Wed, Apr 27, 2011 at 4:11 AM, Ingo Molnar <mingo@elte.hu> wrote: > > As for the first, 'overview' step, i'd like to use one or two numbers only, to > > give people a general ballpark figure about how good the CPU is performing for > > a given workload. > > > > Wouldnt UOPS_EXECUTED.CORE_ACTIVE_CYCLES,c=1,i=1 be in general a pretty good, > > primary "stall" indicator? This is similar to the "cycles-uops_executed" value > > in your script (UOPS_EXECUTED:PORT015:t=1 and UOPS_EXECUTED:PORT234_CORE > > based): it counts cycles when there's no execution at all - not even > > speculative one. > > If we're going to pick one stall indicator, [...] Well, one stall indicator for the 'general overview' stage, plus branch misses. Other stages can also have all sorts of details, including various subsets of stall reasons. (and stalls of different units of the CPU) We'll see how far it can be pushed. > [...] why not pick cycles where no uops are retiring? > > cycles_no_uops_retired = cycles - c["UOPS_RETIRED:ANY:c=1:t=1"] > > In the presence of C-states and some halted cycles, I found that I couldn't > measure it via UOPS_RETIRED:ANY:c=1:i=1 because it counts halted cycles too > and could be greater than (unhalted) cycles. Agreed, good point. 
You are right that it is more robust to pick a 'the CPU was busy on our behalf' metric instead of a 'CPU is idle' metric, because that way 'HLT' as a special type of idling around does not have to be identified.

HLT is not an issue for the default 'perf stat' behavior (because it only measures task execution, never the idle thread or other tasks not involved with the workload), but for per-CPU and system-wide (--all) measurements it matters. I'll flip it around.

> The other issue I had to deal with was UOPS_RETIRED > UOPS_EXECUTED
> condition. I believe this is caused by what AMD calls sideband stack
> optimizer and Intel calls dedicated stack manager (i.e. UOPS executed outside
> the main pipeline). A recursive fibonacci(30) is a good test case for
> reproducing this.

So the PORT015+234 sum is not precise? The definition seems to be rather firm:

  Counts number of Uops executed that were issued on port 2, 3, or 4.
  Counts number of Uops executed that were issued on port 0, 1, or 5.

Wouldn't that include all uops?

> > Is this the direction you'd like to see perf stat to move into? Any
> > comments, suggestions?
>
> Looks like a step in the right direction. Thanks.

Ok, great - will keep you updated.

I doubt the defaults can ever beat truly expert use of PMU events: there will always be fine details that a generic approach will miss. But i'd be happy if we got 70% of the way ...

Thanks,

	Ingo

^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [PATCH] perf events: Add stalled cycles generic event - PERF_COUNT_HW_STALLED_CYCLES 2011-04-27 15:48 ` Ingo Molnar @ 2011-04-27 16:27 ` Ingo Molnar 2011-04-27 19:05 ` Arun Sharma 2011-04-27 19:03 ` Arun Sharma 1 sibling, 1 reply; 80+ messages in thread From: Ingo Molnar @ 2011-04-27 16:27 UTC (permalink / raw) To: Arun Sharma Cc: Arun Sharma, Stephane Eranian, Arnaldo Carvalho de Melo, linux-kernel, Andi Kleen, Peter Zijlstra, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner, Peter Zijlstra, eranian, Linus Torvalds, Andrew Morton * Ingo Molnar <mingo@elte.hu> wrote: > > [...] why not pick cycles where no uops are retiring? > > > > cycles_no_uops_retired = cycles - c["UOPS_RETIRED:ANY:c=1:t=1"] > > > > In the presence of C-states and some halted cycles, I found that I couldn't > > measure it via UOPS_RETIRED:ANY:c=1:i=1 because it counts halted cycles too > > and could be greater than (unhalted) cycles. > > Agreed, good point. > > You are right that it is more robust to pick 'the CPU was busy on our behalf' > metric instead of a 'CPU is idle' metric, because that way 'HLT' as a special > type of idling around does not have to be identified. Sidenote, there's one advantage of the idle event: it's more meaningful to profile idle cycles - and it's easy to ignore the HLT loop in the profile output (we already do). That way we get a 'hidden overhead' profile: a profile of frequently executed code which executes in the CPU in a suboptimal way. So we should probably offer both events. Thanks, Ingo ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [PATCH] perf events: Add stalled cycles generic event - PERF_COUNT_HW_STALLED_CYCLES 2011-04-27 16:27 ` Ingo Molnar @ 2011-04-27 19:05 ` Arun Sharma 0 siblings, 0 replies; 80+ messages in thread From: Arun Sharma @ 2011-04-27 19:05 UTC (permalink / raw) To: Ingo Molnar Cc: Arun Sharma, Stephane Eranian, Arnaldo Carvalho de Melo, linux-kernel, Andi Kleen, Peter Zijlstra, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner, Peter Zijlstra, eranian, Linus Torvalds, Andrew Morton On Wed, Apr 27, 2011 at 9:27 AM, Ingo Molnar <mingo@elte.hu> wrote: > > That way we get a 'hidden overhead' profile: a profile of frequently executed > code which executes in the CPU in a suboptimal way. > > So we should probably offer both events. > Yes - certainly. -Arun ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [PATCH] perf events: Add stalled cycles generic event - PERF_COUNT_HW_STALLED_CYCLES
  2011-04-27 15:48 ` Ingo Molnar
  2011-04-27 16:27 ` Ingo Molnar
@ 2011-04-27 19:03 ` Arun Sharma
  1 sibling, 0 replies; 80+ messages in thread
From: Arun Sharma @ 2011-04-27 19:03 UTC (permalink / raw)
To: Ingo Molnar
Cc: Arun Sharma, Stephane Eranian, Arnaldo Carvalho de Melo, linux-kernel,
    Andi Kleen, Peter Zijlstra, Lin Ming, Arnaldo Carvalho de Melo,
    Thomas Gleixner, Peter Zijlstra, eranian, Linus Torvalds, Andrew Morton

On Wed, Apr 27, 2011 at 8:48 AM, Ingo Molnar <mingo@elte.hu> wrote:
>
>> The other issue I had to deal with was UOPS_RETIRED > UOPS_EXECUTED
>> condition. I believe this is caused by what AMD calls sideband stack
>> optimizer and Intel calls dedicated stack manager (i.e. UOPS executed outside
>> the main pipeline). A recursive fibonacci(30) is a good test case for
>> reproducing this.
>
> So the PORT015+234 sum is not precise? The definition seems to be rather firm:
>
>   Counts number of Uops executed that where issued on port 2, 3, or 4.
>   Counts number of Uops executed that where issued on port 0, 1, or 5.
>

There is some work done outside of the main out-of-order engine for power optimization reasons, described as a dedicated stack engine here:

http://www.intel.com/technology/itj/2003/volume07issue02/art03_pentiumm/vol7iss2_art03.pdf

However, I can't seem to reproduce this behavior using a micro benchmark right now:

# cat foo.s
.text
.global main
main:
1:
	push %rax
	push %rbx
	push %rcx
	push %rdx
	pop %rax
	pop %rbx
	pop %rcx
	pop %rdx
	jmp 1b

 Performance counter stats for './foo':

     7,755,881,073 UOPS_ISSUED:ANY:t=1         (scaled from 79.98%)
    10,569,957,988 UOPS_RETIRED:ANY:t=1        (scaled from 79.96%)
     9,155,400,383 UOPS_EXECUTED:PORT234_CORE  (scaled from 80.02%)
     2,594,206,312 UOPS_EXECUTED:PORT015:t=1   (scaled from 80.02%)

Perhaps I was thinking of UOPS_ISSUED < UOPS_RETIRED.
In general, UOPS_RETIRED (or instruction retirement in general) is the "source of truth" in an otherwise crazy world and might be more interesting as a generalized event that works on multiple architectures. -Arun ^ permalink raw reply [flat|nested] 80+ messages in thread
* [GIT PULL 0/1] perf/urgent Fix missing support for config1/config2 @ 2011-04-21 17:41 Arnaldo Carvalho de Melo 2011-04-21 17:41 ` [PATCH 1/1] perf tools: Add missing user space " Arnaldo Carvalho de Melo 0 siblings, 1 reply; 80+ messages in thread From: Arnaldo Carvalho de Melo @ 2011-04-21 17:41 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, Arnaldo Carvalho de Melo, Andi Kleen, Lin Ming, Peter Zijlstra, Stephane Eranian, Arnaldo Carvalho de Melo Hi Ingo, Please consider pulling from: git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux-2.6 perf/urgent Regards, - Arnaldo Andi Kleen (1): perf tools: Add missing user space support for config1/config2 tools/perf/Documentation/perf-list.txt | 11 +++++++++++ tools/perf/util/parse-events.c | 18 +++++++++++++++++- 2 files changed, 28 insertions(+), 1 deletions(-) ^ permalink raw reply [flat|nested] 80+ messages in thread
* [PATCH 1/1] perf tools: Add missing user space support for config1/config2 2011-04-21 17:41 [GIT PULL 0/1] perf/urgent Fix missing support for config1/config2 Arnaldo Carvalho de Melo @ 2011-04-21 17:41 ` Arnaldo Carvalho de Melo 2011-04-22 6:34 ` Ingo Molnar 0 siblings, 1 reply; 80+ messages in thread From: Arnaldo Carvalho de Melo @ 2011-04-21 17:41 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, Andi Kleen, Ingo Molnar, Peter Zijlstra, Stephane Eranian, Lin Ming, Arnaldo Carvalho de Melo From: Andi Kleen <ak@linux.intel.com> The offcore_msr perf kernel code was merged into 2.6.39-rc*, but the user space bits were not. This made it impossible to set the extra mask and actually do the OFFCORE profiling This patch fixes this. It adds a new syntax ':' to raw events to specify additional event masks. I also added support for setting config2, even though that is not needed currently. [Note: the original version back in time used , -- but that actually conflicted with event lists, so now it's :] Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@gmail.com> Cc: Lin Ming <ming.m.lin@intel.com> Link: http://lkml.kernel.org/r/1302658203-4239-1-git-send-email-andi@firstfloor.org Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> --- tools/perf/Documentation/perf-list.txt | 11 +++++++++++ tools/perf/util/parse-events.c | 18 +++++++++++++++++- 2 files changed, 28 insertions(+), 1 deletions(-) diff --git a/tools/perf/Documentation/perf-list.txt b/tools/perf/Documentation/perf-list.txt index 7a527f7..f19f1e5 100644 --- a/tools/perf/Documentation/perf-list.txt +++ b/tools/perf/Documentation/perf-list.txt @@ -61,6 +61,17 @@ raw encoding of 0x1A8 can be used: You should refer to the processor specific documentation for getting these details. Some of them are referenced in the SEE ALSO section below. 
+Some raw events -- like the Intel OFFCORE events -- support additional +parameters. These can be appended after a ':'. + +For example on a multi socket Intel Nehalem: + + perf stat -e r1b7:20ff -a sleep 1 + +Profile the OFFCORE_RESPONSE.ANY_REQUEST with event mask REMOTE_DRAM_0 +that measures any access to DRAM on another socket. Upto two parameters can +be specified with additional ':' + OPTIONS ------- diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c index 952b4ae..fe9d079 100644 --- a/tools/perf/util/parse-events.c +++ b/tools/perf/util/parse-events.c @@ -688,9 +688,25 @@ parse_raw_event(const char **strp, struct perf_event_attr *attr) return EVT_FAILED; n = hex2u64(str + 1, &config); if (n > 0) { - *strp = str + n + 1; + str += n + 1; attr->type = PERF_TYPE_RAW; attr->config = config; + if (*str == ':') { + str++; + n = hex2u64(str, &config); + if (n == 0) + return EVT_FAILED; + attr->config1 = config; + str += n; + if (*str == ':') { + str++; + n = hex2u64(str + 1, &config); + if (n == 0) + return EVT_FAILED; + attr->config2 = config; + } + } + *strp = str; return EVT_HANDLED; } return EVT_FAILED; -- 1.6.2.5 ^ permalink raw reply related [flat|nested] 80+ messages in thread
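The extended syntax the patch implements ("rEE[:C1[:C2]]") is easy to state language-independently. One detail worth noting: in the config2 branch of the diff above, hex2u64() appears to be called with str + 1 even though str has already been advanced past the ':', which would skip the first hex digit of the second parameter. Below is a hedged Python sketch of the intended parsing (not perf's implementation), treating all three fields uniformly:

```python
def parse_raw_event(spec):
    """Parse 'rEE[:C1[:C2]]' (hex fields) into (config, config1, config2)."""
    if not spec.startswith("r"):
        raise ValueError("not a raw event: %r" % spec)
    fields = spec[1:].split(":")
    if not 1 <= len(fields) <= 3 or any(not f for f in fields):
        raise ValueError("malformed raw event: %r" % spec)
    # Missing parameters default to 0, matching unset config1/config2.
    values = [int(f, 16) for f in fields] + [0, 0]
    return tuple(values[:3])

# The documentation's OFFCORE example: event 0x1b7, extra mask 0x20ff.
print(parse_raw_event("r1b7:20ff"))  # -> (439, 8447, 0)
```

With this reading, "perf stat -e r1b7:20ff" sets attr.config = 0x1b7 and attr.config1 = 0x20ff, which is the extra-MSR mask the offcore kernel code consumes.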
* Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2 2011-04-21 17:41 ` [PATCH 1/1] perf tools: Add missing user space " Arnaldo Carvalho de Melo @ 2011-04-22 6:34 ` Ingo Molnar 2011-04-22 8:06 ` Ingo Molnar 2011-04-22 16:22 ` Andi Kleen 0 siblings, 2 replies; 80+ messages in thread From: Ingo Molnar @ 2011-04-22 6:34 UTC (permalink / raw) To: Arnaldo Carvalho de Melo Cc: linux-kernel, Andi Kleen, Peter Zijlstra, Stephane Eranian, Lin Ming, Arnaldo Carvalho de Melo * Arnaldo Carvalho de Melo <acme@infradead.org> wrote: > From: Andi Kleen <ak@linux.intel.com> > > The offcore_msr perf kernel code was merged into 2.6.39-rc*, but the > user space bits were not. This made it impossible to set the extra mask > and actually do the OFFCORE profiling > > This patch fixes this. It adds a new syntax ':' to raw events to specify > additional event masks. I also added support for setting config2, even > though that is not needed currently. > > [Note: the original version back in time used , -- but that actually > conflicted with event lists, so now it's :] > > Acked-by: Peter Zijlstra <peterz@infradead.org> > Cc: Ingo Molnar <mingo@elte.hu> > Cc: Peter Zijlstra <peterz@infradead.org> > Cc: Stephane Eranian <eranian@gmail.com> > Cc: Lin Ming <ming.m.lin@intel.com> > Link: http://lkml.kernel.org/r/1302658203-4239-1-git-send-email-andi@firstfloor.org > Signed-off-by: Andi Kleen <ak@linux.intel.com> > Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> > --- > tools/perf/Documentation/perf-list.txt | 11 +++++++++++ > tools/perf/util/parse-events.c | 18 +++++++++++++++++- > 2 files changed, 28 insertions(+), 1 deletions(-) > > diff --git a/tools/perf/Documentation/perf-list.txt b/tools/perf/Documentation/perf-list.txt > index 7a527f7..f19f1e5 100644 > --- a/tools/perf/Documentation/perf-list.txt > +++ b/tools/perf/Documentation/perf-list.txt > @@ -61,6 +61,17 @@ raw encoding of 0x1A8 can be used: > You should refer to the processor specific 
documentation for getting these > details. Some of them are referenced in the SEE ALSO section below. > > +Some raw events -- like the Intel OFFCORE events -- support additional > +parameters. These can be appended after a ':'. > + > +For example on a multi socket Intel Nehalem: > + > + perf stat -e r1b7:20ff -a sleep 1 > + > +Profile the OFFCORE_RESPONSE.ANY_REQUEST with event mask REMOTE_DRAM_0 > +that measures any access to DRAM on another socket. Upto two parameters can > +be specified with additional ':' This needs to be a *lot* more user friendly. Users do not want to type in stupid hexa magic numbers to get profiling. We have moved beyond the oprofile era really. Unless there's proper generalized and human usable support i'm leaning towards turning off the offcore user-space accessible raw bits for now, and use them only kernel-internally, for the cache events. Thanks, Ingo ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2 2011-04-22 6:34 ` Ingo Molnar @ 2011-04-22 8:06 ` Ingo Molnar 2011-04-22 21:37 ` Peter Zijlstra 2011-04-25 17:12 ` Vince Weaver 2011-04-22 16:22 ` Andi Kleen 1 sibling, 2 replies; 80+ messages in thread From: Ingo Molnar @ 2011-04-22 8:06 UTC (permalink / raw) To: Arnaldo Carvalho de Melo Cc: linux-kernel, Andi Kleen, Peter Zijlstra, Stephane Eranian, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner, Peter Zijlstra * Ingo Molnar <mingo@elte.hu> wrote: > This needs to be a *lot* more user friendly. Users do not want to type in > stupid hexa magic numbers to get profiling. We have moved beyond the oprofile > era really. > > Unless there's proper generalized and human usable support i'm leaning > towards turning off the offcore user-space accessible raw bits for now, and > use them only kernel-internally, for the cache events. I'm about to push out the patch attached below - it lays out the arguments in detail. I don't think we have time to fix this properly for .39 - but memory profiling could be a nice feature for v2.6.40. Thanks, Ingo ---------------------> >From b52c55c6a25e4515b5e075a989ff346fc251ed09 Mon Sep 17 00:00:00 2001 From: Ingo Molnar <mingo@elte.hu> Date: Fri, 22 Apr 2011 08:44:38 +0200 Subject: [PATCH] x86, perf event: Turn off unstructured raw event access to offcore registers Andi Kleen pointed out that the Intel offcore support patches were merged without user-space tool support to the functionality: | | The offcore_msr perf kernel code was merged into 2.6.39-rc*, but the | user space bits were not. This made it impossible to set the extra mask | and actually do the OFFCORE profiling | Andi submitted a preliminary patch for user-space support, as an extension to perf's raw event syntax: | | Some raw events -- like the Intel OFFCORE events -- support additional | parameters. These can be appended after a ':'. 
| | For example on a multi socket Intel Nehalem: | | perf stat -e r1b7:20ff -a sleep 1 | | Profile the OFFCORE_RESPONSE.ANY_REQUEST with event mask REMOTE_DRAM_0 | that measures any access to DRAM on another socket. | But this kind of usability is absolutely unacceptable - users should not be expected to type in magic, CPU and model specific incantations to get access to useful hardware functionality. The proper solution is to expose useful offcore functionality via generalized events - that way users do not have to care which specific CPU model they are using, they can use the conceptual event and not some model specific quirky hexa number. We already have such generalization in place for CPU cache events, and it's all very extensible. "Offcore" events measure general DRAM access patters along various parameters. They are particularly useful in NUMA systems. We want to support them via generalized DRAM events: either as the fourth level of cache (after the last-level cache), or as a separate generalization category. That way user-space support would be very obvious, memory access profiling could be done via self-explanatory commands like: perf record -e dram ./myapp perf record -e dram-remote ./myapp ... to measure DRAM accesses or more expensive cross-node NUMA DRAM accesses. These generalized events would work on all CPUs and architectures that have comparable PMU features. ( Note, these are just examples: actual implementation could have more sophistication and more parameter - as long as they center around similarly simple usecases. ) Now we do not want to revert *all* of the current offcore bits, as they are still somewhat useful for generic last-level-cache events, implemented in this commit: e994d7d23a0b: perf: Fix LLC-* events on Intel Nehalem/Westmere But we definitely do not yet want to expose the unstructured raw events to user-space, until better generalization and usability is implemented for these hardware event features. 
( Note: after generalization has been implemented raw offcore events can be supported as well: there can always be an odd event that is marginally useful but not useful enough to generalize. DRAM profiling is definitely *not* such a category so generalization must be done first. ) Furthermore, PERF_TYPE_RAW access to these registers was not intended to go upstream without proper support - it was a side-effect of the above e994d7d23a0b commit, not mentioned in the changelog. As v2.6.39 is nearing release we go for the simplest approach: disable the PERF_TYPE_RAW offcore hack for now, before it escapes into a released kernel and becomes an ABI. Once proper structure is implemented for these hardware events and users are offered usable solutions we can revisit this issue. Reported-by: Andi Kleen <ak@linux.intel.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1302658203-4239-1-git-send-email-andi@firstfloor.org Signed-off-by: Ingo Molnar <mingo@elte.hu> --- arch/x86/kernel/cpu/perf_event.c | 6 +++++- 1 files changed, 5 insertions(+), 1 deletions(-) diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c index eed3673a..632e5dc 100644 --- a/arch/x86/kernel/cpu/perf_event.c +++ b/arch/x86/kernel/cpu/perf_event.c @@ -586,8 +586,12 @@ static int x86_setup_perfctr(struct perf_event *event) return -EOPNOTSUPP; } + /* + * Do not allow config1 (extended registers) to propagate, + * there's no sane user-space generalization yet: + */ if (attr->type == PERF_TYPE_RAW) - return x86_pmu_extra_regs(event->attr.config, event); + return 0; if (attr->type == PERF_TYPE_HW_CACHE) return set_ext_hw_attr(hwc, event); ^ permalink raw reply related [flat|nested] 80+ messages in thread
* Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2 2011-04-22 8:06 ` Ingo Molnar @ 2011-04-22 21:37 ` Peter Zijlstra 2011-04-22 21:54 ` Peter Zijlstra 2011-04-23 8:13 ` Ingo Molnar 2011-04-25 17:12 ` Vince Weaver 1 sibling, 2 replies; 80+ messages in thread From: Peter Zijlstra @ 2011-04-22 21:37 UTC (permalink / raw) To: Ingo Molnar Cc: Arnaldo Carvalho de Melo, linux-kernel, Andi Kleen, Stephane Eranian, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner On Fri, 2011-04-22 at 10:06 +0200, Ingo Molnar wrote: > > I'm about to push out the patch attached below - it lays out the arguments in > detail. I don't think we have time to fix this properly for .39 - but memory > profiling could be a nice feature for v2.6.40. Does something like the below provide enough generic infrastructure to allow the raw offcore bits again? The below needs filling out for !x86 (which I filled out with unsupported events) and x86 needs the offcore bits fixed to auto select between the two offcore events. 
Not-signed-off-by: /me --- arch/arm/kernel/perf_event_v6.c | 28 ++++++ arch/arm/kernel/perf_event_v7.c | 28 ++++++ arch/arm/kernel/perf_event_xscale.c | 14 +++ arch/mips/kernel/perf_event_mipsxx.c | 28 ++++++ arch/powerpc/kernel/e500-pmu.c | 5 + arch/powerpc/kernel/mpc7450-pmu.c | 5 + arch/powerpc/kernel/power4-pmu.c | 5 + arch/powerpc/kernel/power5+-pmu.c | 5 + arch/powerpc/kernel/power5-pmu.c | 5 + arch/powerpc/kernel/power6-pmu.c | 5 + arch/powerpc/kernel/power7-pmu.c | 5 + arch/powerpc/kernel/ppc970-pmu.c | 5 + arch/sh/kernel/cpu/sh4/perf_event.c | 15 +++ arch/sh/kernel/cpu/sh4a/perf_event.c | 15 +++ arch/sparc/kernel/perf_event.c | 42 ++++++++ arch/x86/kernel/cpu/perf_event_amd.c | 14 +++ arch/x86/kernel/cpu/perf_event_intel.c | 167 +++++++++++++++++++++++++------- arch/x86/kernel/cpu/perf_event_p4.c | 14 +++ include/linux/perf_event.h | 3 +- 19 files changed, 373 insertions(+), 35 deletions(-) diff --git a/arch/arm/kernel/perf_event_v6.c b/arch/arm/kernel/perf_event_v6.c index f1e8dd9..02178da 100644 --- a/arch/arm/kernel/perf_event_v6.c +++ b/arch/arm/kernel/perf_event_v6.c @@ -173,6 +173,20 @@ static const unsigned armv6_perf_cache_map[PERF_COUNT_HW_CACHE_MAX] [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED, }, }, + [C(NODE)] = { + [C(OP_READ)] = { + [C(RESULT_ACCESS)] = CACHE_OP_UNSUPPORTED, + [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED, + }, + [C(OP_WRITE)] = { + [C(RESULT_ACCESS)] = CACHE_OP_UNSUPPORTED, + [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED, + }, + [C(OP_PREFETCH)] = { + [C(RESULT_ACCESS)] = CACHE_OP_UNSUPPORTED, + [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED, + }, + }, }; enum armv6mpcore_perf_types { @@ -310,6 +324,20 @@ static const unsigned armv6mpcore_perf_cache_map[PERF_COUNT_HW_CACHE_MAX] [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED, }, }, + [C(NODE)] = { + [C(OP_READ)] = { + [C(RESULT_ACCESS)] = CACHE_OP_UNSUPPORTED, + [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED, + }, + [C(OP_WRITE)] = { + [C(RESULT_ACCESS)] = CACHE_OP_UNSUPPORTED, + [C(RESULT_MISS)] = 
CACHE_OP_UNSUPPORTED, + }, + [C(OP_PREFETCH)] = { + [C(RESULT_ACCESS)] = CACHE_OP_UNSUPPORTED, + [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED, + }, + }, }; static inline unsigned long diff --git a/arch/arm/kernel/perf_event_v7.c b/arch/arm/kernel/perf_event_v7.c index 4960686..79ffc83 100644 --- a/arch/arm/kernel/perf_event_v7.c +++ b/arch/arm/kernel/perf_event_v7.c @@ -255,6 +255,20 @@ static const unsigned armv7_a8_perf_cache_map[PERF_COUNT_HW_CACHE_MAX] [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED, }, }, + [C(NODE)] = { + [C(OP_READ)] = { + [C(RESULT_ACCESS)] = CACHE_OP_UNSUPPORTED, + [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED, + }, + [C(OP_WRITE)] = { + [C(RESULT_ACCESS)] = CACHE_OP_UNSUPPORTED, + [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED, + }, + [C(OP_PREFETCH)] = { + [C(RESULT_ACCESS)] = CACHE_OP_UNSUPPORTED, + [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED, + }, + }, }; /* @@ -371,6 +385,20 @@ static const unsigned armv7_a9_perf_cache_map[PERF_COUNT_HW_CACHE_MAX] [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED, }, }, + [C(NODE)] = { + [C(OP_READ)] = { + [C(RESULT_ACCESS)] = CACHE_OP_UNSUPPORTED, + [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED, + }, + [C(OP_WRITE)] = { + [C(RESULT_ACCESS)] = CACHE_OP_UNSUPPORTED, + [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED, + }, + [C(OP_PREFETCH)] = { + [C(RESULT_ACCESS)] = CACHE_OP_UNSUPPORTED, + [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED, + }, + }, }; /* diff --git a/arch/arm/kernel/perf_event_xscale.c b/arch/arm/kernel/perf_event_xscale.c index 39affbe..7ed1a55 100644 --- a/arch/arm/kernel/perf_event_xscale.c +++ b/arch/arm/kernel/perf_event_xscale.c @@ -144,6 +144,20 @@ static const unsigned xscale_perf_cache_map[PERF_COUNT_HW_CACHE_MAX] [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED, }, }, + [C(NODE)] = { + [C(OP_READ)] = { + [C(RESULT_ACCESS)] = CACHE_OP_UNSUPPORTED, + [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED, + }, + [C(OP_WRITE)] = { + [C(RESULT_ACCESS)] = CACHE_OP_UNSUPPORTED, + [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED, + }, + [C(OP_PREFETCH)] = { + 
[C(RESULT_ACCESS)] = CACHE_OP_UNSUPPORTED, + [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED, + }, + }, }; #define XSCALE_PMU_ENABLE 0x001 diff --git a/arch/mips/kernel/perf_event_mipsxx.c b/arch/mips/kernel/perf_event_mipsxx.c index 75266ff..e5ad09a 100644 --- a/arch/mips/kernel/perf_event_mipsxx.c +++ b/arch/mips/kernel/perf_event_mipsxx.c @@ -377,6 +377,20 @@ static const struct mips_perf_event mipsxxcore_cache_map [C(RESULT_MISS)] = { UNSUPPORTED_PERF_EVENT_ID }, }, }, +[C(NODE)] = { + [C(OP_READ)] = { + [C(RESULT_ACCESS)] = { UNSUPPORTED_PERF_EVENT_ID }, + [C(RESULT_MISS)] = { UNSUPPORTED_PERF_EVENT_ID }, + }, + [C(OP_WRITE)] = { + [C(RESULT_ACCESS)] = { UNSUPPORTED_PERF_EVENT_ID }, + [C(RESULT_MISS)] = { UNSUPPORTED_PERF_EVENT_ID }, + }, + [C(OP_PREFETCH)] = { + [C(RESULT_ACCESS)] = { UNSUPPORTED_PERF_EVENT_ID }, + [C(RESULT_MISS)] = { UNSUPPORTED_PERF_EVENT_ID }, + }, +}, }; /* 74K core has completely different cache event map. */ @@ -480,6 +494,20 @@ static const struct mips_perf_event mipsxx74Kcore_cache_map [C(RESULT_MISS)] = { UNSUPPORTED_PERF_EVENT_ID }, }, }, +[C(NODE)] = { + [C(OP_READ)] = { + [C(RESULT_ACCESS)] = { UNSUPPORTED_PERF_EVENT_ID }, + [C(RESULT_MISS)] = { UNSUPPORTED_PERF_EVENT_ID }, + }, + [C(OP_WRITE)] = { + [C(RESULT_ACCESS)] = { UNSUPPORTED_PERF_EVENT_ID }, + [C(RESULT_MISS)] = { UNSUPPORTED_PERF_EVENT_ID }, + }, + [C(OP_PREFETCH)] = { + [C(RESULT_ACCESS)] = { UNSUPPORTED_PERF_EVENT_ID }, + [C(RESULT_MISS)] = { UNSUPPORTED_PERF_EVENT_ID }, + }, +}, }; #ifdef CONFIG_MIPS_MT_SMP diff --git a/arch/powerpc/kernel/e500-pmu.c b/arch/powerpc/kernel/e500-pmu.c index b150b51..cb2e294 100644 --- a/arch/powerpc/kernel/e500-pmu.c +++ b/arch/powerpc/kernel/e500-pmu.c @@ -75,6 +75,11 @@ static int e500_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = { [C(OP_WRITE)] = { -1, -1 }, [C(OP_PREFETCH)] = { -1, -1 }, }, + [C(NODE)] = { /* RESULT_ACCESS RESULT_MISS */ + [C(OP_READ)] = { -1, -1 }, + [C(OP_WRITE)] = { -1, -1 }, + [C(OP_PREFETCH)] = { -1, -1 }, + }, 
}; static int num_events = 128; diff --git a/arch/powerpc/kernel/mpc7450-pmu.c b/arch/powerpc/kernel/mpc7450-pmu.c index 2cc5e03..845a584 100644 --- a/arch/powerpc/kernel/mpc7450-pmu.c +++ b/arch/powerpc/kernel/mpc7450-pmu.c @@ -388,6 +388,11 @@ static int mpc7450_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = { [C(OP_WRITE)] = { -1, -1 }, [C(OP_PREFETCH)] = { -1, -1 }, }, + [C(NODE)] = { /* RESULT_ACCESS RESULT_MISS */ + [C(OP_READ)] = { -1, -1 }, + [C(OP_WRITE)] = { -1, -1 }, + [C(OP_PREFETCH)] = { -1, -1 }, + }, }; struct power_pmu mpc7450_pmu = { diff --git a/arch/powerpc/kernel/power4-pmu.c b/arch/powerpc/kernel/power4-pmu.c index ead8b3c..e9dbc2d 100644 --- a/arch/powerpc/kernel/power4-pmu.c +++ b/arch/powerpc/kernel/power4-pmu.c @@ -587,6 +587,11 @@ static int power4_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = { [C(OP_WRITE)] = { -1, -1 }, [C(OP_PREFETCH)] = { -1, -1 }, }, + [C(NODE)] = { /* RESULT_ACCESS RESULT_MISS */ + [C(OP_READ)] = { -1, -1 }, + [C(OP_WRITE)] = { -1, -1 }, + [C(OP_PREFETCH)] = { -1, -1 }, + }, }; static struct power_pmu power4_pmu = { diff --git a/arch/powerpc/kernel/power5+-pmu.c b/arch/powerpc/kernel/power5+-pmu.c index eca0ac5..f58a2bd 100644 --- a/arch/powerpc/kernel/power5+-pmu.c +++ b/arch/powerpc/kernel/power5+-pmu.c @@ -653,6 +653,11 @@ static int power5p_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = { [C(OP_WRITE)] = { -1, -1 }, [C(OP_PREFETCH)] = { -1, -1 }, }, + [C(NODE)] = { /* RESULT_ACCESS RESULT_MISS */ + [C(OP_READ)] = { -1, -1 }, + [C(OP_WRITE)] = { -1, -1 }, + [C(OP_PREFETCH)] = { -1, -1 }, + }, }; static struct power_pmu power5p_pmu = { diff --git a/arch/powerpc/kernel/power5-pmu.c b/arch/powerpc/kernel/power5-pmu.c index d5ff0f6..b1acab6 100644 --- a/arch/powerpc/kernel/power5-pmu.c +++ b/arch/powerpc/kernel/power5-pmu.c @@ -595,6 +595,11 @@ static int power5_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = { [C(OP_WRITE)] = { -1, -1 }, [C(OP_PREFETCH)] = { -1, -1 }, }, + [C(NODE)] = { /* RESULT_ACCESS 
RESULT_MISS */ + [C(OP_READ)] = { -1, -1 }, + [C(OP_WRITE)] = { -1, -1 }, + [C(OP_PREFETCH)] = { -1, -1 }, + }, }; static struct power_pmu power5_pmu = { diff --git a/arch/powerpc/kernel/power6-pmu.c b/arch/powerpc/kernel/power6-pmu.c index 3160392..b24a3a2 100644 --- a/arch/powerpc/kernel/power6-pmu.c +++ b/arch/powerpc/kernel/power6-pmu.c @@ -516,6 +516,11 @@ static int power6_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = { [C(OP_WRITE)] = { -1, -1 }, [C(OP_PREFETCH)] = { -1, -1 }, }, + [C(NODE)] = { /* RESULT_ACCESS RESULT_MISS */ + [C(OP_READ)] = { -1, -1 }, + [C(OP_WRITE)] = { -1, -1 }, + [C(OP_PREFETCH)] = { -1, -1 }, + }, }; static struct power_pmu power6_pmu = { diff --git a/arch/powerpc/kernel/power7-pmu.c b/arch/powerpc/kernel/power7-pmu.c index 593740f..6d9dccb 100644 --- a/arch/powerpc/kernel/power7-pmu.c +++ b/arch/powerpc/kernel/power7-pmu.c @@ -342,6 +342,11 @@ static int power7_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = { [C(OP_WRITE)] = { -1, -1 }, [C(OP_PREFETCH)] = { -1, -1 }, }, + [C(NODE)] = { /* RESULT_ACCESS RESULT_MISS */ + [C(OP_READ)] = { -1, -1 }, + [C(OP_WRITE)] = { -1, -1 }, + [C(OP_PREFETCH)] = { -1, -1 }, + }, }; static struct power_pmu power7_pmu = { diff --git a/arch/powerpc/kernel/ppc970-pmu.c b/arch/powerpc/kernel/ppc970-pmu.c index 9a6e093..b121de9 100644 --- a/arch/powerpc/kernel/ppc970-pmu.c +++ b/arch/powerpc/kernel/ppc970-pmu.c @@ -467,6 +467,11 @@ static int ppc970_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = { [C(OP_WRITE)] = { -1, -1 }, [C(OP_PREFETCH)] = { -1, -1 }, }, + [C(NODE)] = { /* RESULT_ACCESS RESULT_MISS */ + [C(OP_READ)] = { -1, -1 }, + [C(OP_WRITE)] = { -1, -1 }, + [C(OP_PREFETCH)] = { -1, -1 }, + }, }; static struct power_pmu ppc970_pmu = { diff --git a/arch/sh/kernel/cpu/sh4/perf_event.c b/arch/sh/kernel/cpu/sh4/perf_event.c index 748955d..fa4f724 100644 --- a/arch/sh/kernel/cpu/sh4/perf_event.c +++ b/arch/sh/kernel/cpu/sh4/perf_event.c @@ -180,6 +180,21 @@ static const int sh7750_cache_events [ 
C(RESULT_MISS) ] = -1, }, }, + + [ C(NODE) ] = { + [ C(OP_READ) ] = { + [ C(RESULT_ACCESS) ] = -1, + [ C(RESULT_MISS) ] = -1, + }, + [ C(OP_WRITE) ] = { + [ C(RESULT_ACCESS) ] = -1, + [ C(RESULT_MISS) ] = -1, + }, + [ C(OP_PREFETCH) ] = { + [ C(RESULT_ACCESS) ] = -1, + [ C(RESULT_MISS) ] = -1, + }, + }, }; static int sh7750_event_map(int event) diff --git a/arch/sh/kernel/cpu/sh4a/perf_event.c b/arch/sh/kernel/cpu/sh4a/perf_event.c index 17e6beb..84a2c39 100644 --- a/arch/sh/kernel/cpu/sh4a/perf_event.c +++ b/arch/sh/kernel/cpu/sh4a/perf_event.c @@ -205,6 +205,21 @@ static const int sh4a_cache_events [ C(RESULT_MISS) ] = -1, }, }, + + [ C(NODE) ] = { + [ C(OP_READ) ] = { + [ C(RESULT_ACCESS) ] = -1, + [ C(RESULT_MISS) ] = -1, + }, + [ C(OP_WRITE) ] = { + [ C(RESULT_ACCESS) ] = -1, + [ C(RESULT_MISS) ] = -1, + }, + [ C(OP_PREFETCH) ] = { + [ C(RESULT_ACCESS) ] = -1, + [ C(RESULT_MISS) ] = -1, + }, + }, }; static int sh4a_event_map(int event) diff --git a/arch/sparc/kernel/perf_event.c b/arch/sparc/kernel/perf_event.c index ee8426e..d890e0f 100644 --- a/arch/sparc/kernel/perf_event.c +++ b/arch/sparc/kernel/perf_event.c @@ -245,6 +245,20 @@ static const cache_map_t ultra3_cache_map = { [ C(RESULT_MISS) ] = { CACHE_OP_UNSUPPORTED }, }, }, +[C(NODE)] = { + [C(OP_READ)] = { + [C(RESULT_ACCESS)] = { CACHE_OP_UNSUPPORTED }, + [C(RESULT_MISS) ] = { CACHE_OP_UNSUPPORTED }, + }, + [ C(OP_WRITE) ] = { + [ C(RESULT_ACCESS) ] = { CACHE_OP_UNSUPPORTED }, + [ C(RESULT_MISS) ] = { CACHE_OP_UNSUPPORTED }, + }, + [ C(OP_PREFETCH) ] = { + [ C(RESULT_ACCESS) ] = { CACHE_OP_UNSUPPORTED }, + [ C(RESULT_MISS) ] = { CACHE_OP_UNSUPPORTED }, + }, +}, }; static const struct sparc_pmu ultra3_pmu = { @@ -360,6 +374,20 @@ static const cache_map_t niagara1_cache_map = { [ C(RESULT_MISS) ] = { CACHE_OP_UNSUPPORTED }, }, }, +[C(NODE)] = { + [C(OP_READ)] = { + [C(RESULT_ACCESS)] = { CACHE_OP_UNSUPPORTED }, + [C(RESULT_MISS) ] = { CACHE_OP_UNSUPPORTED }, + }, + [ C(OP_WRITE) ] = { + [ 
C(RESULT_ACCESS) ] = { CACHE_OP_UNSUPPORTED }, + [ C(RESULT_MISS) ] = { CACHE_OP_UNSUPPORTED }, + }, + [ C(OP_PREFETCH) ] = { + [ C(RESULT_ACCESS) ] = { CACHE_OP_UNSUPPORTED }, + [ C(RESULT_MISS) ] = { CACHE_OP_UNSUPPORTED }, + }, +}, }; static const struct sparc_pmu niagara1_pmu = { @@ -472,6 +500,20 @@ static const cache_map_t niagara2_cache_map = { [ C(RESULT_MISS) ] = { CACHE_OP_UNSUPPORTED }, }, }, +[C(NODE)] = { + [C(OP_READ)] = { + [C(RESULT_ACCESS)] = { CACHE_OP_UNSUPPORTED }, + [C(RESULT_MISS) ] = { CACHE_OP_UNSUPPORTED }, + }, + [ C(OP_WRITE) ] = { + [ C(RESULT_ACCESS) ] = { CACHE_OP_UNSUPPORTED }, + [ C(RESULT_MISS) ] = { CACHE_OP_UNSUPPORTED }, + }, + [ C(OP_PREFETCH) ] = { + [ C(RESULT_ACCESS) ] = { CACHE_OP_UNSUPPORTED }, + [ C(RESULT_MISS) ] = { CACHE_OP_UNSUPPORTED }, + }, +}, }; static const struct sparc_pmu niagara2_pmu = { diff --git a/arch/x86/kernel/cpu/perf_event_amd.c b/arch/x86/kernel/cpu/perf_event_amd.c index cf4e369..01c7dd3 100644 --- a/arch/x86/kernel/cpu/perf_event_amd.c +++ b/arch/x86/kernel/cpu/perf_event_amd.c @@ -89,6 +89,20 @@ static __initconst const u64 amd_hw_cache_event_ids [ C(RESULT_MISS) ] = -1, }, }, + [ C(NODE) ] = { + [ C(OP_READ) ] = { + [ C(RESULT_ACCESS) ] = 0xb8e9, /* CPU Request to Memory, l+r */ + [ C(RESULT_MISS) ] = 0x98e9, /* CPU Request to Memory, r */ + }, + [ C(OP_WRITE) ] = { + [ C(RESULT_ACCESS) ] = -1, + [ C(RESULT_MISS) ] = -1, + }, + [ C(OP_PREFETCH) ] = { + [ C(RESULT_ACCESS) ] = -1, + [ C(RESULT_MISS) ] = -1, + }, + }, }; /* diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c index 43fa20b..225efa0 100644 --- a/arch/x86/kernel/cpu/perf_event_intel.c +++ b/arch/x86/kernel/cpu/perf_event_intel.c @@ -184,26 +184,23 @@ static __initconst const u64 snb_hw_cache_event_ids }, }, [ C(LL ) ] = { - /* - * TBD: Need Off-core Response Performance Monitoring support - */ [ C(OP_READ) ] = { - /* OFFCORE_RESPONSE_0.ANY_DATA.LOCAL_CACHE */ + /* 
OFFCORE_RESPONSE.ANY_DATA.LOCAL_CACHE */ [ C(RESULT_ACCESS) ] = 0x01b7, - /* OFFCORE_RESPONSE_1.ANY_DATA.ANY_LLC_MISS */ - [ C(RESULT_MISS) ] = 0x01bb, + /* OFFCORE_RESPONSE.ANY_DATA.ANY_LLC_MISS */ + [ C(RESULT_MISS) ] = 0x01b7, }, [ C(OP_WRITE) ] = { - /* OFFCORE_RESPONSE_0.ANY_RFO.LOCAL_CACHE */ + /* OFFCORE_RESPONSE.ANY_RFO.LOCAL_CACHE */ [ C(RESULT_ACCESS) ] = 0x01b7, - /* OFFCORE_RESPONSE_1.ANY_RFO.ANY_LLC_MISS */ - [ C(RESULT_MISS) ] = 0x01bb, + /* OFFCORE_RESPONSE.ANY_RFO.ANY_LLC_MISS */ + [ C(RESULT_MISS) ] = 0x01b7, }, [ C(OP_PREFETCH) ] = { - /* OFFCORE_RESPONSE_0.PREFETCH.LOCAL_CACHE */ + /* OFFCORE_RESPONSE.PREFETCH.LOCAL_CACHE */ [ C(RESULT_ACCESS) ] = 0x01b7, - /* OFFCORE_RESPONSE_1.PREFETCH.ANY_LLC_MISS */ - [ C(RESULT_MISS) ] = 0x01bb, + /* OFFCORE_RESPONSE.PREFETCH.ANY_LLC_MISS */ + [ C(RESULT_MISS) ] = 0x01b7, }, }, [ C(DTLB) ] = { @@ -248,6 +245,20 @@ static __initconst const u64 snb_hw_cache_event_ids [ C(RESULT_MISS) ] = -1, }, }, + [ C(NODE) ] = { + [ C(OP_READ) ] = { + [ C(RESULT_ACCESS) ] = 0x01b7, /* OFFCORE_RESP */ + [ C(RESULT_MISS) ] = 0x01b7, + }, + [ C(OP_WRITE) ] = { + [ C(RESULT_ACCESS) ] = 0x01b7, + [ C(RESULT_MISS) ] = 0x01b7, + }, + [ C(OP_PREFETCH) ] = { + [ C(RESULT_ACCESS) ] = 0x01b7, + [ C(RESULT_MISS) ] = 0x01b7, + }, + }, }; static __initconst const u64 westmere_hw_cache_event_ids @@ -285,26 +296,26 @@ static __initconst const u64 westmere_hw_cache_event_ids }, [ C(LL ) ] = { [ C(OP_READ) ] = { - /* OFFCORE_RESPONSE_0.ANY_DATA.LOCAL_CACHE */ + /* OFFCORE_RESPONSE.ANY_DATA.LOCAL_CACHE */ [ C(RESULT_ACCESS) ] = 0x01b7, - /* OFFCORE_RESPONSE_1.ANY_DATA.ANY_LLC_MISS */ - [ C(RESULT_MISS) ] = 0x01bb, + /* OFFCORE_RESPONSE.ANY_DATA.ANY_LLC_MISS */ + [ C(RESULT_MISS) ] = 0x01b7, }, /* * Use RFO, not WRITEBACK, because a write miss would typically occur * on RFO. 
*/ [ C(OP_WRITE) ] = { - /* OFFCORE_RESPONSE_1.ANY_RFO.LOCAL_CACHE */ - [ C(RESULT_ACCESS) ] = 0x01bb, - /* OFFCORE_RESPONSE_0.ANY_RFO.ANY_LLC_MISS */ + /* OFFCORE_RESPONSE.ANY_RFO.LOCAL_CACHE */ + [ C(RESULT_ACCESS) ] = 0x01b7, + /* OFFCORE_RESPONSE.ANY_RFO.ANY_LLC_MISS */ [ C(RESULT_MISS) ] = 0x01b7, }, [ C(OP_PREFETCH) ] = { - /* OFFCORE_RESPONSE_0.PREFETCH.LOCAL_CACHE */ + /* OFFCORE_RESPONSE.PREFETCH.LOCAL_CACHE */ [ C(RESULT_ACCESS) ] = 0x01b7, - /* OFFCORE_RESPONSE_1.PREFETCH.ANY_LLC_MISS */ - [ C(RESULT_MISS) ] = 0x01bb, + /* OFFCORE_RESPONSE.PREFETCH.ANY_LLC_MISS */ + [ C(RESULT_MISS) ] = 0x01b7, }, }, [ C(DTLB) ] = { @@ -349,19 +360,51 @@ static __initconst const u64 westmere_hw_cache_event_ids [ C(RESULT_MISS) ] = -1, }, }, + [ C(NODE) ] = { + [ C(OP_READ) ] = { + [ C(RESULT_ACCESS) ] = 0x01b7, /* OFFCORE_RESP */ + [ C(RESULT_MISS) ] = 0x01b7, + }, + [ C(OP_WRITE) ] = { + [ C(RESULT_ACCESS) ] = 0x01b7, + [ C(RESULT_MISS) ] = 0x01b7, + }, + [ C(OP_PREFETCH) ] = { + [ C(RESULT_ACCESS) ] = 0x01b7, + [ C(RESULT_MISS) ] = 0x01b7, + }, + }, }; /* * OFFCORE_RESPONSE MSR bits (subset), See IA32 SDM Vol 3 30.6.1.3 */ -#define DMND_DATA_RD (1 << 0) -#define DMND_RFO (1 << 1) -#define DMND_WB (1 << 3) -#define PF_DATA_RD (1 << 4) -#define PF_DATA_RFO (1 << 5) -#define RESP_UNCORE_HIT (1 << 8) -#define RESP_MISS (0xf600) /* non uncore hit */ +#define DMND_DATA_RD (1 << 0) +#define DMND_RFO (1 << 1) +#define DMND_IFETCH (1 << 2) +#define DMND_WB (1 << 3) +#define PF_DATA_RD (1 << 4) +#define PF_DATA_RFO (1 << 5) +#define PF_IFETCH (1 << 6) +#define OFFCORE_OTHER (1 << 7) +#define UNCORE_HIT (1 << 8) +#define OTHER_CORE_HIT_SNP (1 << 9) +#define OTHER_CORE_HITM (1 << 10) + /* reserved */ +#define REMOTE_CACHE_FWD (1 << 12) +#define REMOTE_DRAM (1 << 13) +#define LOCAL_DRAM (1 << 14) +#define NON_DRAM (1 << 15) + +#define ALL_DRAM (REMOTE_DRAM|LOCAL_DRAM) + +#define DMND_READ (DMND_DATA_RD) +#define DMND_WRITE (DMND_RFO|DMND_WB) +#define DMND_PREFETCH 
(PF_DATA_RD|PF_DATA_RFO) + +#define L3_HIT (UNCORE_HIT|OTHER_CORE_HIT_SNP|OTHER_CORE_HITM) +#define L3_MISS (NON_DRAM|ALL_DRAM|REMOTE_CACHE_FWD) static __initconst const u64 nehalem_hw_cache_extra_regs [PERF_COUNT_HW_CACHE_MAX] @@ -370,18 +413,32 @@ static __initconst const u64 nehalem_hw_cache_extra_regs { [ C(LL ) ] = { [ C(OP_READ) ] = { - [ C(RESULT_ACCESS) ] = DMND_DATA_RD|RESP_UNCORE_HIT, - [ C(RESULT_MISS) ] = DMND_DATA_RD|RESP_MISS, + [ C(RESULT_ACCESS) ] = DMND_READ|L3_HIT, + [ C(RESULT_MISS) ] = DMND_READ|L3_MISS, }, [ C(OP_WRITE) ] = { - [ C(RESULT_ACCESS) ] = DMND_RFO|DMND_WB|RESP_UNCORE_HIT, - [ C(RESULT_MISS) ] = DMND_RFO|DMND_WB|RESP_MISS, + [ C(RESULT_ACCESS) ] = DMND_WRITE|L3_HIT, + [ C(RESULT_MISS) ] = DMND_WRITE|L3_MISS, }, [ C(OP_PREFETCH) ] = { - [ C(RESULT_ACCESS) ] = PF_DATA_RD|PF_DATA_RFO|RESP_UNCORE_HIT, - [ C(RESULT_MISS) ] = PF_DATA_RD|PF_DATA_RFO|RESP_MISS, + [ C(RESULT_ACCESS) ] = DMND_PREFETCH|L3_HIT, + [ C(RESULT_MISS) ] = DMND_PREFETCH|L3_MISS, }, } + [ C(NODE) ] = { + [ C(OP_READ) ] = { + [ C(RESULT_ACCESS) ] = DMND_READ|ALL_DRAM, + [ C(RESULT_MISS) ] = DMND_READ|REMOTE_DRAM, + }, + [ C(OP_WRITE) ] = { + [ C(RESULT_ACCESS) ] = DMND_WRITE|ALL_DRAM, + [ C(RESULT_MISS) ] = DMND_WRITE|REMOTE_DRAM, + }, + [ C(OP_PREFETCH) ] = { + [ C(RESULT_ACCESS) ] = DMND_PREFETCH|ALL_DRAM, + [ C(RESULT_MISS) ] = DMND_PREFETCH|REMOTE_DRAM, + }, + }, }; static __initconst const u64 nehalem_hw_cache_event_ids @@ -483,6 +540,20 @@ static __initconst const u64 nehalem_hw_cache_event_ids [ C(RESULT_MISS) ] = -1, }, }, + [ C(NODE) ] = { + [ C(OP_READ) ] = { + [ C(RESULT_ACCESS) ] = 0x01b7, /* OFFCORE_RESP */ + [ C(RESULT_MISS) ] = 0x01b7, + }, + [ C(OP_WRITE) ] = { + [ C(RESULT_ACCESS) ] = 0x01b7, + [ C(RESULT_MISS) ] = 0x01b7, + }, + [ C(OP_PREFETCH) ] = { + [ C(RESULT_ACCESS) ] = 0x01b7, + [ C(RESULT_MISS) ] = 0x01b7, + }, + }, }; static __initconst const u64 core2_hw_cache_event_ids @@ -574,6 +645,20 @@ static __initconst const u64
core2_hw_cache_event_ids [ C(RESULT_MISS) ] = -1, }, }, + [ C(NODE) ] = { + [ C(OP_READ) ] = { + [ C(RESULT_ACCESS) ] = -1, + [ C(RESULT_MISS) ] = -1, + }, + [ C(OP_WRITE) ] = { + [ C(RESULT_ACCESS) ] = -1, + [ C(RESULT_MISS) ] = -1, + }, + [ C(OP_PREFETCH) ] = { + [ C(RESULT_ACCESS) ] = -1, + [ C(RESULT_MISS) ] = -1, + }, + }, }; static __initconst const u64 atom_hw_cache_event_ids @@ -665,6 +750,20 @@ static __initconst const u64 atom_hw_cache_event_ids [ C(RESULT_MISS) ] = -1, }, }, + [ C(NODE) ] = { + [ C(OP_READ) ] = { + [ C(RESULT_ACCESS) ] = -1, + [ C(RESULT_MISS) ] = -1, + }, + [ C(OP_WRITE) ] = { + [ C(RESULT_ACCESS) ] = -1, + [ C(RESULT_MISS) ] = -1, + }, + [ C(OP_PREFETCH) ] = { + [ C(RESULT_ACCESS) ] = -1, + [ C(RESULT_MISS) ] = -1, + }, + }, }; static void intel_pmu_disable_all(void) diff --git a/arch/x86/kernel/cpu/perf_event_p4.c b/arch/x86/kernel/cpu/perf_event_p4.c index 74507c1..e802c7e 100644 --- a/arch/x86/kernel/cpu/perf_event_p4.c +++ b/arch/x86/kernel/cpu/perf_event_p4.c @@ -554,6 +554,20 @@ static __initconst const u64 p4_hw_cache_event_ids [ C(RESULT_MISS) ] = -1, }, }, + [ C(NODE) ] = { + [ C(OP_READ) ] = { + [ C(RESULT_ACCESS) ] = -1, + [ C(RESULT_MISS) ] = -1, + }, + [ C(OP_WRITE) ] = { + [ C(RESULT_ACCESS) ] = -1, + [ C(RESULT_MISS) ] = -1, + }, + [ C(OP_PREFETCH) ] = { + [ C(RESULT_ACCESS) ] = -1, + [ C(RESULT_MISS) ] = -1, + }, + }, }; static u64 p4_general_events[PERF_COUNT_HW_MAX] = { diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index ee9f1e7..df4a841 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -59,7 +59,7 @@ enum perf_hw_id { /* * Generalized hardware cache events: * - * { L1-D, L1-I, LLC, ITLB, DTLB, BPU } x + * { L1-D, L1-I, LLC, ITLB, DTLB, BPU, NODE } x * { read, write, prefetch } x * { accesses, misses } */ @@ -70,6 +70,7 @@ enum perf_hw_cache_id { PERF_COUNT_HW_CACHE_DTLB = 3, PERF_COUNT_HW_CACHE_ITLB = 4, PERF_COUNT_HW_CACHE_BPU = 5, + PERF_COUNT_HW_CACHE_NODE = 6, 
PERF_COUNT_HW_CACHE_MAX, /* non-ABI */ }; ^ permalink raw reply related [flat|nested] 80+ messages in thread
* Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2 2011-04-22 21:37 ` Peter Zijlstra @ 2011-04-22 21:54 ` Peter Zijlstra 2011-04-22 22:19 ` Peter Zijlstra 2011-04-22 22:57 ` Peter Zijlstra 2011-04-23 8:13 ` Ingo Molnar 1 sibling, 2 replies; 80+ messages in thread From: Peter Zijlstra @ 2011-04-22 21:54 UTC (permalink / raw) To: Ingo Molnar Cc: Arnaldo Carvalho de Melo, linux-kernel, Andi Kleen, Stephane Eranian, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner On Fri, 2011-04-22 at 23:37 +0200, Peter Zijlstra wrote: > The below needs filling out for !x86 (which I filled out with > unsupported events) and x86 needs the offcore bits fixed to auto select > between the two offcore events. Urgh, so SNB has different MSR_OFFCORE_RESPONSE bits and needs another table. ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2 2011-04-22 21:54 ` Peter Zijlstra @ 2011-04-22 22:19 ` Peter Zijlstra 2011-04-22 23:54 ` Andi Kleen 2011-04-22 22:57 ` Peter Zijlstra 1 sibling, 1 reply; 80+ messages in thread From: Peter Zijlstra @ 2011-04-22 22:19 UTC (permalink / raw) To: Ingo Molnar Cc: Arnaldo Carvalho de Melo, linux-kernel, Andi Kleen, Stephane Eranian, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner On Fri, 2011-04-22 at 23:54 +0200, Peter Zijlstra wrote: > On Fri, 2011-04-22 at 23:37 +0200, Peter Zijlstra wrote: > > The below needs filling out for !x86 (which I filled out with > > unsupported events) and x86 needs the offcore bits fixed to auto select > > between the two offcore events. > > Urgh, so SNB has different MSR_OFFCORE_RESPONSE bits and needs another table. /* * Sandy Bridge MSR_OFFCORE_RESPONSE bits; * See IA32 SDM Vol 3B 30.8.5 */ #define SNB_DMND_DATA_RD (1 << 0) #define SNB_DMND_RFO (1 << 1) #define SNB_DMND_IFETCH (1 << 2) #define SNB_DMND_WB (1 << 3) #define SNB_PF_DATA_RD (1 << 4) #define SNB_PF_DATA_RFO (1 << 5) #define SNB_PF_IFETCH (1 << 6) #define SNB_PF_LLC_DATA_RD (1 << 7) #define SNB_PF_LLC_RFO (1 << 8) #define SNB_PF_LLC_IFETCH (1 << 9) #define SNB_BUS_LOCKS (1 << 10) #define SNB_STRM_ST (1 << 11) /* hole */ #define SNB_OFFCORE_OTHER (1 << 15) #define SNB_COMMON (1 << 16) #define SNB_NO_SUPP (1 << 17) #define SNB_LLC_HITM (1 << 18) #define SNB_LLC_HITE (1 << 19) #define SNB_LLC_HITS (1 << 20) #define SNB_LLC_HITF (1 << 21) /* hole */ #define SNB_SNP_NONE (1 << 31) #define SNB_SNP_NOT_NEEDED (1 << 32) #define SNB_SNP_MISS (1 << 33) #define SNB_SNP_NO_FWD (1 << 34) #define SNB_SNP_FWD (1 << 35) #define SNB_HITM (1 << 36) #define SNB_NON_DRAM (1 << 37) #define SNB_DMND_READ (SNB_DMND_DATA_RD) #define SNB_DMND_WRITE (SNB_DMND_RFO|SNB_DMND_WB|SNB_STRM_ST) #define SNB_DMND_PREFETCH (SNB_PF_DATA_RD|SNB_PF_DATA_RFO) Is what I came up with, but I'm stumped on how to construct: #define 
SNB_L3_HIT #define SNB_L3_MISS #define SNB_ALL_DRAM #define SNB_REMOTE_DRAM Anybody got clue? ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2 2011-04-22 22:19 ` Peter Zijlstra @ 2011-04-22 23:54 ` Andi Kleen 2011-04-23 7:49 ` Peter Zijlstra 0 siblings, 1 reply; 80+ messages in thread From: Andi Kleen @ 2011-04-22 23:54 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Arnaldo Carvalho de Melo, linux-kernel, Stephane Eranian, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner > > #define SNB_PF_LLC_DATA_RD (1 << 7) > #define SNB_PF_LLC_RFO (1 << 8) > #define SNB_PF_LLC_IFETCH (1 << 9) > #define SNB_BUS_LOCKS (1 << 10) > #define SNB_STRM_ST (1 << 11) > /* hole */ > #define SNB_OFFCORE_OTHER (1 << 15) > #define SNB_COMMON (1 << 16) > #define SNB_NO_SUPP (1 << 17) > #define SNB_LLC_HITM (1 << 18) > #define SNB_LLC_HITE (1 << 19) > #define SNB_LLC_HITS (1 << 20) > #define SNB_LLC_HITF (1 << 21) > /* hole */ > #define SNB_SNP_NONE (1 << 31) > #define SNB_SNP_NOT_NEEDED (1 << 32) > #define SNB_SNP_MISS (1 << 33) > #define SNB_SNP_NO_FWD (1 << 34) > #define SNB_SNP_FWD (1 << 35) > #define SNB_HITM (1 << 36) > #define SNB_NON_DRAM (1 << 37) > > #define SNB_DMND_READ (SNB_DMND_DATA_RD) > #define SNB_DMND_WRITE (SNB_DMND_RFO|SNB_DMND_WB|SNB_STRM_ST) > #define SNB_DMND_PREFETCH (SNB_PF_DATA_RD|SNB_PF_DATA_RFO) > > Is what I came up with, but I'm stumped on how to construct: > > #define SNB_L3_HIT All the LLC hits together. Or it can be done with the PEBS memory latency event (like Lin-Ming's patch) or with mem_load_uops_retired (but then only for loads) > #define SNB_L3_MISS Don't set any of the LLC bits > > #define SNB_ALL_DRAM Just don't set NON_DRAM > #define SNB_REMOTE_DRAM The current client SNBs for which those tables are don't have remote DRAM. -Andi -- ak@linux.intel.com -- Speaking for myself only ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2 2011-04-22 23:54 ` Andi Kleen @ 2011-04-23 7:49 ` Peter Zijlstra 0 siblings, 0 replies; 80+ messages in thread From: Peter Zijlstra @ 2011-04-23 7:49 UTC (permalink / raw) To: Andi Kleen Cc: Ingo Molnar, Arnaldo Carvalho de Melo, linux-kernel, Stephane Eranian, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner On Fri, 2011-04-22 at 16:54 -0700, Andi Kleen wrote: > > > > #define SNB_PF_LLC_DATA_RD (1 << 7) > > #define SNB_PF_LLC_RFO (1 << 8) > > #define SNB_PF_LLC_IFETCH (1 << 9) > > #define SNB_BUS_LOCKS (1 << 10) > > #define SNB_STRM_ST (1 << 11) > > /* hole */ > > #define SNB_OFFCORE_OTHER (1 << 15) > > #define SNB_COMMON (1 << 16) > > #define SNB_NO_SUPP (1 << 17) > > #define SNB_LLC_HITM (1 << 18) > > #define SNB_LLC_HITE (1 << 19) > > #define SNB_LLC_HITS (1 << 20) > > #define SNB_LLC_HITF (1 << 21) > > /* hole */ > > #define SNB_SNP_NONE (1 << 31) > > #define SNB_SNP_NOT_NEEDED (1 << 32) > > #define SNB_SNP_MISS (1 << 33) > > #define SNB_SNP_NO_FWD (1 << 34) > > #define SNB_SNP_FWD (1 << 35) > > #define SNB_HITM (1 << 36) > > #define SNB_NON_DRAM (1 << 37) > > > > #define SNB_DMND_READ (SNB_DMND_DATA_RD) > > #define SNB_DMND_WRITE (SNB_DMND_RFO|SNB_DMND_WB|SNB_STRM_ST) > > #define SNB_DMND_PREFETCH (SNB_PF_DATA_RD|SNB_PF_DATA_RFO) > > > > Is what I came up with, but I'm stumped on how to construct: > > > > #define SNB_L3_HIT > > All the LLC hits together. Bits 18-21 ? > Or it can be done with the PEBS memory latency event (like Lin-Ming's patch) or > with mem_load_uops_retired (but then only for loads) > > > #define SNB_L3_MISS > > Don't set any of the LLC bits So a 0 for the response type field? That's not valid. You have to set some bit between 16-37. > > > > > #define SNB_ALL_DRAM > > Just don't set NON_DRAM So bits 17-21|31-36 for the response type field? That seems wrong as that would include what we previously defined to be L3_HIT, which never makes it to DRAM. 
> > #define SNB_REMOTE_DRAM > > The current client SNBs for which those tables are don't have remote > DRAM. So what you're telling us is that simply because Intel hasn't shipped a multi-socket SNB system yet they either: 1) omitted a few bits from that table, 2) have a completely different offcore response msr just for kicks? Feh! ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2 2011-04-22 21:54 ` Peter Zijlstra 2011-04-22 22:19 ` Peter Zijlstra @ 2011-04-22 22:57 ` Peter Zijlstra 2011-04-23 0:00 ` Andi Kleen 1 sibling, 1 reply; 80+ messages in thread From: Peter Zijlstra @ 2011-04-22 22:57 UTC (permalink / raw) To: Ingo Molnar Cc: Arnaldo Carvalho de Melo, linux-kernel, Andi Kleen, Stephane Eranian, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner On Fri, 2011-04-22 at 23:54 +0200, Peter Zijlstra wrote: > On Fri, 2011-04-22 at 23:37 +0200, Peter Zijlstra wrote: > > The below needs filling out for !x86 (which I filled out with > > unsupported events) and x86 needs the offcore bits fixed to auto select > > between the two offcore events. > > Urgh, so SNB has different MSR_OFFCORE_RESPONSE bits and needs another table. Also, NHM offcore bits were wrong... it implemented _ACCESS as _HIT and counted OTHER_CORE_HIT* as MISS even though it's clearly documented as an L3 hit. Current scribblings below:
--- arch/arm/kernel/perf_event_v6.c | 28 ++++ arch/arm/kernel/perf_event_v7.c | 28 ++++ arch/arm/kernel/perf_event_xscale.c | 14 ++ arch/mips/kernel/perf_event_mipsxx.c | 28 ++++ arch/powerpc/kernel/e500-pmu.c | 5 + arch/powerpc/kernel/mpc7450-pmu.c | 5 + arch/powerpc/kernel/power4-pmu.c | 5 + arch/powerpc/kernel/power5+-pmu.c | 5 + arch/powerpc/kernel/power5-pmu.c | 5 + arch/powerpc/kernel/power6-pmu.c | 5 + arch/powerpc/kernel/power7-pmu.c | 5 + arch/powerpc/kernel/ppc970-pmu.c | 5 + arch/sh/kernel/cpu/sh4/perf_event.c | 15 ++ arch/sh/kernel/cpu/sh4a/perf_event.c | 15 ++ arch/sparc/kernel/perf_event.c | 42 ++++++ arch/x86/kernel/cpu/perf_event_amd.c | 14 ++ arch/x86/kernel/cpu/perf_event_intel.c | 253 +++++++++++++++++++++++++++----- arch/x86/kernel/cpu/perf_event_p4.c | 14 ++ include/linux/perf_event.h | 3 +- 19 files changed, 458 insertions(+), 36 deletions(-) diff --git a/arch/arm/kernel/perf_event_v6.c b/arch/arm/kernel/perf_event_v6.c index f1e8dd9..02178da 100644 --- a/arch/arm/kernel/perf_event_v6.c +++ b/arch/arm/kernel/perf_event_v6.c @@ -173,6 +173,20 @@ static const unsigned armv6_perf_cache_map[PERF_COUNT_HW_CACHE_MAX] [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED, }, }, + [C(NODE)] = { + [C(OP_READ)] = { + [C(RESULT_ACCESS)] = CACHE_OP_UNSUPPORTED, + [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED, + }, + [C(OP_WRITE)] = { + [C(RESULT_ACCESS)] = CACHE_OP_UNSUPPORTED, + [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED, + }, + [C(OP_PREFETCH)] = { + [C(RESULT_ACCESS)] = CACHE_OP_UNSUPPORTED, + [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED, + }, + }, }; enum armv6mpcore_perf_types { @@ -310,6 +324,20 @@ static const unsigned armv6mpcore_perf_cache_map[PERF_COUNT_HW_CACHE_MAX] [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED, }, }, + [C(NODE)] = { + [C(OP_READ)] = { + [C(RESULT_ACCESS)] = CACHE_OP_UNSUPPORTED, + [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED, + }, + [C(OP_WRITE)] = { + [C(RESULT_ACCESS)] = CACHE_OP_UNSUPPORTED, + [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED, + }, + [C(OP_PREFETCH)] = { 
+ [C(RESULT_ACCESS)] = CACHE_OP_UNSUPPORTED, + [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED, + }, + }, }; static inline unsigned long diff --git a/arch/arm/kernel/perf_event_v7.c b/arch/arm/kernel/perf_event_v7.c index 4960686..79ffc83 100644 --- a/arch/arm/kernel/perf_event_v7.c +++ b/arch/arm/kernel/perf_event_v7.c @@ -255,6 +255,20 @@ static const unsigned armv7_a8_perf_cache_map[PERF_COUNT_HW_CACHE_MAX] [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED, }, }, + [C(NODE)] = { + [C(OP_READ)] = { + [C(RESULT_ACCESS)] = CACHE_OP_UNSUPPORTED, + [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED, + }, + [C(OP_WRITE)] = { + [C(RESULT_ACCESS)] = CACHE_OP_UNSUPPORTED, + [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED, + }, + [C(OP_PREFETCH)] = { + [C(RESULT_ACCESS)] = CACHE_OP_UNSUPPORTED, + [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED, + }, + }, }; /* @@ -371,6 +385,20 @@ static const unsigned armv7_a9_perf_cache_map[PERF_COUNT_HW_CACHE_MAX] [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED, }, }, + [C(NODE)] = { + [C(OP_READ)] = { + [C(RESULT_ACCESS)] = CACHE_OP_UNSUPPORTED, + [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED, + }, + [C(OP_WRITE)] = { + [C(RESULT_ACCESS)] = CACHE_OP_UNSUPPORTED, + [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED, + }, + [C(OP_PREFETCH)] = { + [C(RESULT_ACCESS)] = CACHE_OP_UNSUPPORTED, + [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED, + }, + }, }; /* diff --git a/arch/arm/kernel/perf_event_xscale.c b/arch/arm/kernel/perf_event_xscale.c index 39affbe..7ed1a55 100644 --- a/arch/arm/kernel/perf_event_xscale.c +++ b/arch/arm/kernel/perf_event_xscale.c @@ -144,6 +144,20 @@ static const unsigned xscale_perf_cache_map[PERF_COUNT_HW_CACHE_MAX] [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED, }, }, + [C(NODE)] = { + [C(OP_READ)] = { + [C(RESULT_ACCESS)] = CACHE_OP_UNSUPPORTED, + [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED, + }, + [C(OP_WRITE)] = { + [C(RESULT_ACCESS)] = CACHE_OP_UNSUPPORTED, + [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED, + }, + [C(OP_PREFETCH)] = { + [C(RESULT_ACCESS)] = CACHE_OP_UNSUPPORTED, + [C(RESULT_MISS)] = 
CACHE_OP_UNSUPPORTED, + }, + }, }; #define XSCALE_PMU_ENABLE 0x001 diff --git a/arch/mips/kernel/perf_event_mipsxx.c b/arch/mips/kernel/perf_event_mipsxx.c index 75266ff..e5ad09a 100644 --- a/arch/mips/kernel/perf_event_mipsxx.c +++ b/arch/mips/kernel/perf_event_mipsxx.c @@ -377,6 +377,20 @@ static const struct mips_perf_event mipsxxcore_cache_map [C(RESULT_MISS)] = { UNSUPPORTED_PERF_EVENT_ID }, }, }, +[C(NODE)] = { + [C(OP_READ)] = { + [C(RESULT_ACCESS)] = { UNSUPPORTED_PERF_EVENT_ID }, + [C(RESULT_MISS)] = { UNSUPPORTED_PERF_EVENT_ID }, + }, + [C(OP_WRITE)] = { + [C(RESULT_ACCESS)] = { UNSUPPORTED_PERF_EVENT_ID }, + [C(RESULT_MISS)] = { UNSUPPORTED_PERF_EVENT_ID }, + }, + [C(OP_PREFETCH)] = { + [C(RESULT_ACCESS)] = { UNSUPPORTED_PERF_EVENT_ID }, + [C(RESULT_MISS)] = { UNSUPPORTED_PERF_EVENT_ID }, + }, +}, }; /* 74K core has completely different cache event map. */ @@ -480,6 +494,20 @@ static const struct mips_perf_event mipsxx74Kcore_cache_map [C(RESULT_MISS)] = { UNSUPPORTED_PERF_EVENT_ID }, }, }, +[C(NODE)] = { + [C(OP_READ)] = { + [C(RESULT_ACCESS)] = { UNSUPPORTED_PERF_EVENT_ID }, + [C(RESULT_MISS)] = { UNSUPPORTED_PERF_EVENT_ID }, + }, + [C(OP_WRITE)] = { + [C(RESULT_ACCESS)] = { UNSUPPORTED_PERF_EVENT_ID }, + [C(RESULT_MISS)] = { UNSUPPORTED_PERF_EVENT_ID }, + }, + [C(OP_PREFETCH)] = { + [C(RESULT_ACCESS)] = { UNSUPPORTED_PERF_EVENT_ID }, + [C(RESULT_MISS)] = { UNSUPPORTED_PERF_EVENT_ID }, + }, +}, }; #ifdef CONFIG_MIPS_MT_SMP diff --git a/arch/powerpc/kernel/e500-pmu.c b/arch/powerpc/kernel/e500-pmu.c index b150b51..cb2e294 100644 --- a/arch/powerpc/kernel/e500-pmu.c +++ b/arch/powerpc/kernel/e500-pmu.c @@ -75,6 +75,11 @@ static int e500_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = { [C(OP_WRITE)] = { -1, -1 }, [C(OP_PREFETCH)] = { -1, -1 }, }, + [C(NODE)] = { /* RESULT_ACCESS RESULT_MISS */ + [C(OP_READ)] = { -1, -1 }, + [C(OP_WRITE)] = { -1, -1 }, + [C(OP_PREFETCH)] = { -1, -1 }, + }, }; static int num_events = 128; diff --git 
a/arch/powerpc/kernel/mpc7450-pmu.c b/arch/powerpc/kernel/mpc7450-pmu.c index 2cc5e03..845a584 100644 --- a/arch/powerpc/kernel/mpc7450-pmu.c +++ b/arch/powerpc/kernel/mpc7450-pmu.c @@ -388,6 +388,11 @@ static int mpc7450_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = { [C(OP_WRITE)] = { -1, -1 }, [C(OP_PREFETCH)] = { -1, -1 }, }, + [C(NODE)] = { /* RESULT_ACCESS RESULT_MISS */ + [C(OP_READ)] = { -1, -1 }, + [C(OP_WRITE)] = { -1, -1 }, + [C(OP_PREFETCH)] = { -1, -1 }, + }, }; struct power_pmu mpc7450_pmu = { diff --git a/arch/powerpc/kernel/power4-pmu.c b/arch/powerpc/kernel/power4-pmu.c index ead8b3c..e9dbc2d 100644 --- a/arch/powerpc/kernel/power4-pmu.c +++ b/arch/powerpc/kernel/power4-pmu.c @@ -587,6 +587,11 @@ static int power4_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = { [C(OP_WRITE)] = { -1, -1 }, [C(OP_PREFETCH)] = { -1, -1 }, }, + [C(NODE)] = { /* RESULT_ACCESS RESULT_MISS */ + [C(OP_READ)] = { -1, -1 }, + [C(OP_WRITE)] = { -1, -1 }, + [C(OP_PREFETCH)] = { -1, -1 }, + }, }; static struct power_pmu power4_pmu = { diff --git a/arch/powerpc/kernel/power5+-pmu.c b/arch/powerpc/kernel/power5+-pmu.c index eca0ac5..f58a2bd 100644 --- a/arch/powerpc/kernel/power5+-pmu.c +++ b/arch/powerpc/kernel/power5+-pmu.c @@ -653,6 +653,11 @@ static int power5p_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = { [C(OP_WRITE)] = { -1, -1 }, [C(OP_PREFETCH)] = { -1, -1 }, }, + [C(NODE)] = { /* RESULT_ACCESS RESULT_MISS */ + [C(OP_READ)] = { -1, -1 }, + [C(OP_WRITE)] = { -1, -1 }, + [C(OP_PREFETCH)] = { -1, -1 }, + }, }; static struct power_pmu power5p_pmu = { diff --git a/arch/powerpc/kernel/power5-pmu.c b/arch/powerpc/kernel/power5-pmu.c index d5ff0f6..b1acab6 100644 --- a/arch/powerpc/kernel/power5-pmu.c +++ b/arch/powerpc/kernel/power5-pmu.c @@ -595,6 +595,11 @@ static int power5_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = { [C(OP_WRITE)] = { -1, -1 }, [C(OP_PREFETCH)] = { -1, -1 }, }, + [C(NODE)] = { /* RESULT_ACCESS RESULT_MISS */ + [C(OP_READ)] = { -1, -1 }, + 
[C(OP_WRITE)] = { -1, -1 }, + [C(OP_PREFETCH)] = { -1, -1 }, + }, }; static struct power_pmu power5_pmu = { diff --git a/arch/powerpc/kernel/power6-pmu.c b/arch/powerpc/kernel/power6-pmu.c index 3160392..b24a3a2 100644 --- a/arch/powerpc/kernel/power6-pmu.c +++ b/arch/powerpc/kernel/power6-pmu.c @@ -516,6 +516,11 @@ static int power6_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = { [C(OP_WRITE)] = { -1, -1 }, [C(OP_PREFETCH)] = { -1, -1 }, }, + [C(NODE)] = { /* RESULT_ACCESS RESULT_MISS */ + [C(OP_READ)] = { -1, -1 }, + [C(OP_WRITE)] = { -1, -1 }, + [C(OP_PREFETCH)] = { -1, -1 }, + }, }; static struct power_pmu power6_pmu = { diff --git a/arch/powerpc/kernel/power7-pmu.c b/arch/powerpc/kernel/power7-pmu.c index 593740f..6d9dccb 100644 --- a/arch/powerpc/kernel/power7-pmu.c +++ b/arch/powerpc/kernel/power7-pmu.c @@ -342,6 +342,11 @@ static int power7_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = { [C(OP_WRITE)] = { -1, -1 }, [C(OP_PREFETCH)] = { -1, -1 }, }, + [C(NODE)] = { /* RESULT_ACCESS RESULT_MISS */ + [C(OP_READ)] = { -1, -1 }, + [C(OP_WRITE)] = { -1, -1 }, + [C(OP_PREFETCH)] = { -1, -1 }, + }, }; static struct power_pmu power7_pmu = { diff --git a/arch/powerpc/kernel/ppc970-pmu.c b/arch/powerpc/kernel/ppc970-pmu.c index 9a6e093..b121de9 100644 --- a/arch/powerpc/kernel/ppc970-pmu.c +++ b/arch/powerpc/kernel/ppc970-pmu.c @@ -467,6 +467,11 @@ static int ppc970_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = { [C(OP_WRITE)] = { -1, -1 }, [C(OP_PREFETCH)] = { -1, -1 }, }, + [C(NODE)] = { /* RESULT_ACCESS RESULT_MISS */ + [C(OP_READ)] = { -1, -1 }, + [C(OP_WRITE)] = { -1, -1 }, + [C(OP_PREFETCH)] = { -1, -1 }, + }, }; static struct power_pmu ppc970_pmu = { diff --git a/arch/sh/kernel/cpu/sh4/perf_event.c b/arch/sh/kernel/cpu/sh4/perf_event.c index 748955d..fa4f724 100644 --- a/arch/sh/kernel/cpu/sh4/perf_event.c +++ b/arch/sh/kernel/cpu/sh4/perf_event.c @@ -180,6 +180,21 @@ static const int sh7750_cache_events [ C(RESULT_MISS) ] = -1, }, }, + + [ C(NODE) ] = 
{ + [ C(OP_READ) ] = { + [ C(RESULT_ACCESS) ] = -1, + [ C(RESULT_MISS) ] = -1, + }, + [ C(OP_WRITE) ] = { + [ C(RESULT_ACCESS) ] = -1, + [ C(RESULT_MISS) ] = -1, + }, + [ C(OP_PREFETCH) ] = { + [ C(RESULT_ACCESS) ] = -1, + [ C(RESULT_MISS) ] = -1, + }, + }, }; static int sh7750_event_map(int event) diff --git a/arch/sh/kernel/cpu/sh4a/perf_event.c b/arch/sh/kernel/cpu/sh4a/perf_event.c index 17e6beb..84a2c39 100644 --- a/arch/sh/kernel/cpu/sh4a/perf_event.c +++ b/arch/sh/kernel/cpu/sh4a/perf_event.c @@ -205,6 +205,21 @@ static const int sh4a_cache_events [ C(RESULT_MISS) ] = -1, }, }, + + [ C(NODE) ] = { + [ C(OP_READ) ] = { + [ C(RESULT_ACCESS) ] = -1, + [ C(RESULT_MISS) ] = -1, + }, + [ C(OP_WRITE) ] = { + [ C(RESULT_ACCESS) ] = -1, + [ C(RESULT_MISS) ] = -1, + }, + [ C(OP_PREFETCH) ] = { + [ C(RESULT_ACCESS) ] = -1, + [ C(RESULT_MISS) ] = -1, + }, + }, }; static int sh4a_event_map(int event) diff --git a/arch/sparc/kernel/perf_event.c b/arch/sparc/kernel/perf_event.c index ee8426e..d890e0f 100644 --- a/arch/sparc/kernel/perf_event.c +++ b/arch/sparc/kernel/perf_event.c @@ -245,6 +245,20 @@ static const cache_map_t ultra3_cache_map = { [ C(RESULT_MISS) ] = { CACHE_OP_UNSUPPORTED }, }, }, +[C(NODE)] = { + [C(OP_READ)] = { + [C(RESULT_ACCESS)] = { CACHE_OP_UNSUPPORTED }, + [C(RESULT_MISS) ] = { CACHE_OP_UNSUPPORTED }, + }, + [ C(OP_WRITE) ] = { + [ C(RESULT_ACCESS) ] = { CACHE_OP_UNSUPPORTED }, + [ C(RESULT_MISS) ] = { CACHE_OP_UNSUPPORTED }, + }, + [ C(OP_PREFETCH) ] = { + [ C(RESULT_ACCESS) ] = { CACHE_OP_UNSUPPORTED }, + [ C(RESULT_MISS) ] = { CACHE_OP_UNSUPPORTED }, + }, +}, }; static const struct sparc_pmu ultra3_pmu = { @@ -360,6 +374,20 @@ static const cache_map_t niagara1_cache_map = { [ C(RESULT_MISS) ] = { CACHE_OP_UNSUPPORTED }, }, }, +[C(NODE)] = { + [C(OP_READ)] = { + [C(RESULT_ACCESS)] = { CACHE_OP_UNSUPPORTED }, + [C(RESULT_MISS) ] = { CACHE_OP_UNSUPPORTED }, + }, + [ C(OP_WRITE) ] = { + [ C(RESULT_ACCESS) ] = { CACHE_OP_UNSUPPORTED }, + [ 
C(RESULT_MISS) ] = { CACHE_OP_UNSUPPORTED }, + }, + [ C(OP_PREFETCH) ] = { + [ C(RESULT_ACCESS) ] = { CACHE_OP_UNSUPPORTED }, + [ C(RESULT_MISS) ] = { CACHE_OP_UNSUPPORTED }, + }, +}, }; static const struct sparc_pmu niagara1_pmu = { @@ -472,6 +500,20 @@ static const cache_map_t niagara2_cache_map = { [ C(RESULT_MISS) ] = { CACHE_OP_UNSUPPORTED }, }, }, +[C(NODE)] = { + [C(OP_READ)] = { + [C(RESULT_ACCESS)] = { CACHE_OP_UNSUPPORTED }, + [C(RESULT_MISS) ] = { CACHE_OP_UNSUPPORTED }, + }, + [ C(OP_WRITE) ] = { + [ C(RESULT_ACCESS) ] = { CACHE_OP_UNSUPPORTED }, + [ C(RESULT_MISS) ] = { CACHE_OP_UNSUPPORTED }, + }, + [ C(OP_PREFETCH) ] = { + [ C(RESULT_ACCESS) ] = { CACHE_OP_UNSUPPORTED }, + [ C(RESULT_MISS) ] = { CACHE_OP_UNSUPPORTED }, + }, +}, }; static const struct sparc_pmu niagara2_pmu = { diff --git a/arch/x86/kernel/cpu/perf_event_amd.c b/arch/x86/kernel/cpu/perf_event_amd.c index cf4e369..01c7dd3 100644 --- a/arch/x86/kernel/cpu/perf_event_amd.c +++ b/arch/x86/kernel/cpu/perf_event_amd.c @@ -89,6 +89,20 @@ static __initconst const u64 amd_hw_cache_event_ids [ C(RESULT_MISS) ] = -1, }, }, + [ C(NODE) ] = { + [ C(OP_READ) ] = { + [ C(RESULT_ACCESS) ] = 0xb8e9, /* CPU Request to Memory, l+r */ + [ C(RESULT_MISS) ] = 0x98e9, /* CPU Request to Memory, r */ + }, + [ C(OP_WRITE) ] = { + [ C(RESULT_ACCESS) ] = -1, + [ C(RESULT_MISS) ] = -1, + }, + [ C(OP_PREFETCH) ] = { + [ C(RESULT_ACCESS) ] = -1, + [ C(RESULT_MISS) ] = -1, + }, + }, }; /* diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c index 43fa20b..fe4e8b1 100644 --- a/arch/x86/kernel/cpu/perf_event_intel.c +++ b/arch/x86/kernel/cpu/perf_event_intel.c @@ -150,6 +150,86 @@ static u64 intel_pmu_event_map(int hw_event) return intel_perfmon_event_map[hw_event]; } +/* + * Sandy Bridge MSR_OFFCORE_RESPONSE bits; + * See IA32 SDM Vol 3B 30.8.5 + */ + +#define SNB_DMND_DATA_RD (1 << 0) +#define SNB_DMND_RFO (1 << 1) +#define SNB_DMND_IFETCH (1 << 2) +#define SNB_DMND_WB (1 << 
3) +#define SNB_PF_DATA_RD (1 << 4) +#define SNB_PF_DATA_RFO (1 << 5) +#define SNB_PF_IFETCH (1 << 6) +#define SNB_PF_LLC_DATA_RD (1 << 7) +#define SNB_PF_LLC_RFO (1 << 8) +#define SNB_PF_LLC_IFETCH (1 << 9) +#define SNB_BUS_LOCKS (1 << 10) +#define SNB_STRM_ST (1 << 11) + /* hole */ +#define SNB_OFFCORE_OTHER (1 << 15) +#define SNB_COMMON (1 << 16) +#define SNB_NO_SUPP (1 << 17) +#define SNB_LLC_HITM (1 << 18) +#define SNB_LLC_HITE (1 << 19) +#define SNB_LLC_HITS (1 << 20) +#define SNB_LLC_HITF (1 << 21) + /* hole */ +#define SNB_SNP_NONE (1 << 31) +#define SNB_SNP_NOT_NEEDED (1 << 32) +#define SNB_SNP_MISS (1 << 33) +#define SNB_SNP_NO_FWD (1 << 34) +#define SNB_SNP_FWD (1 << 35) +#define SNB_HITM (1 << 36) +#define SNB_NON_DRAM (1 << 37) + +#define SNB_DMND_READ (SNB_DMND_DATA_RD) +#define SNB_DMND_WRITE (SNB_DMND_RFO|SNB_DMND_WB|SNB_STRM_ST) +#define SNB_DMND_PREFETCH (SNB_PF_DATA_RD|SNB_PF_DATA_RFO) + +#define SNB_L3_HIT () +#define SNB_L3_MISS () +#define SNB_L3_ACCESS (SNB_L3_HIT|SNB_L3_MISS) + +#define SNB_ALL_DRAM () +#define SNB_REMOTE_DRAM () + +static __initconst const u64 snb_hw_cache_extra_regs + [PERF_COUNT_HW_CACHE_MAX] + [PERF_COUNT_HW_CACHE_OP_MAX] + [PERF_COUNT_HW_CACHE_RESULT_MAX] = +{ + [ C(LL ) ] = { + [ C(OP_READ) ] = { + [ C(RESULT_ACCESS) ] = SNB_DMND_READ|SNB_L3_ACCESS, + [ C(RESULT_MISS) ] = SNB_DMND_READ|SNB_L3_MISS, + }, + [ C(OP_WRITE) ] = { + [ C(RESULT_ACCESS) ] = SNB_DMND_WRITE|SNB_L3_ACCESS, + [ C(RESULT_MISS) ] = SNB_DMND_WRITE|SNB_L3_MISS, + }, + [ C(OP_PREFETCH) ] = { + [ C(RESULT_ACCESS) ] = SNB_DMND_PREFETCH|SNB_L3_ACCESS, + [ C(RESULT_MISS) ] = SNB_DMND_PREFETCH|SNB_L3_MISS, + }, + } + [ C(NODE) ] = { + [ C(OP_READ) ] = { + [ C(RESULT_ACCESS) ] = SNB_DMND_READ|SNB_ALL_DRAM, + [ C(RESULT_MISS) ] = SNB_DMND_READ|SNB_REMOTE_DRAM, + }, + [ C(OP_WRITE) ] = { + [ C(RESULT_ACCESS) ] = SNB_DMND_WRITE|SNB_ALL_DRAM, + [ C(RESULT_MISS) ] = SNB_DMND_WRITE|SNB_REMOTE_DRAM, + }, + [ C(OP_PREFETCH) ] = { + [ C(RESULT_ACCESS) ] = 
SNB_DMND_PREFETCH|SNB_ALL_DRAM, + [ C(RESULT_MISS) ] = SNB_DMND_PREFETCH|SNB_REMOTE_DRAM, + }, + }, +}; + static __initconst const u64 snb_hw_cache_event_ids [PERF_COUNT_HW_CACHE_MAX] [PERF_COUNT_HW_CACHE_OP_MAX] @@ -184,26 +264,23 @@ static __initconst const u64 snb_hw_cache_event_ids }, }, [ C(LL ) ] = { - /* - * TBD: Need Off-core Response Performance Monitoring support - */ [ C(OP_READ) ] = { - /* OFFCORE_RESPONSE_0.ANY_DATA.LOCAL_CACHE */ + /* OFFCORE_RESPONSE.ANY_DATA.LOCAL_CACHE */ [ C(RESULT_ACCESS) ] = 0x01b7, - /* OFFCORE_RESPONSE_1.ANY_DATA.ANY_LLC_MISS */ - [ C(RESULT_MISS) ] = 0x01bb, + /* OFFCORE_RESPONSE.ANY_DATA.ANY_LLC_MISS */ + [ C(RESULT_MISS) ] = 0x01b7, }, [ C(OP_WRITE) ] = { - /* OFFCORE_RESPONSE_0.ANY_RFO.LOCAL_CACHE */ + /* OFFCORE_RESPONSE.ANY_RFO.LOCAL_CACHE */ [ C(RESULT_ACCESS) ] = 0x01b7, - /* OFFCORE_RESPONSE_1.ANY_RFO.ANY_LLC_MISS */ - [ C(RESULT_MISS) ] = 0x01bb, + /* OFFCORE_RESPONSE.ANY_RFO.ANY_LLC_MISS */ + [ C(RESULT_MISS) ] = 0x01b7, }, [ C(OP_PREFETCH) ] = { - /* OFFCORE_RESPONSE_0.PREFETCH.LOCAL_CACHE */ + /* OFFCORE_RESPONSE.PREFETCH.LOCAL_CACHE */ [ C(RESULT_ACCESS) ] = 0x01b7, - /* OFFCORE_RESPONSE_1.PREFETCH.ANY_LLC_MISS */ - [ C(RESULT_MISS) ] = 0x01bb, + /* OFFCORE_RESPONSE.PREFETCH.ANY_LLC_MISS */ + [ C(RESULT_MISS) ] = 0x01b7, }, }, [ C(DTLB) ] = { @@ -248,6 +325,20 @@ static __initconst const u64 snb_hw_cache_event_ids [ C(RESULT_MISS) ] = -1, }, }, + [ C(NODE) ] = { + [ C(OP_READ) ] = { + [ C(RESULT_ACCESS) ] = 0x01b7, /* OFFCORE_RESP */ + [ C(RESULT_MISS) ] = 0x01b7, + }, + [ C(OP_WRITE) ] = { + [ C(RESULT_ACCESS) ] = 0x01b7, + [ C(RESULT_MISS) ] = 0x01b7, + }, + [ C(OP_PREFETCH) ] = { + [ C(RESULT_ACCESS) ] = 0x01b7, + [ C(RESULT_MISS) ] = 0x01b7, + }, + }, }; static __initconst const u64 westmere_hw_cache_event_ids @@ -285,26 +376,26 @@ static __initconst const u64 westmere_hw_cache_event_ids }, [ C(LL ) ] = { [ C(OP_READ) ] = { - /* OFFCORE_RESPONSE_0.ANY_DATA.LOCAL_CACHE */ + /* 
OFFCORE_RESPONSE.ANY_DATA.LOCAL_CACHE */ [ C(RESULT_ACCESS) ] = 0x01b7, - /* OFFCORE_RESPONSE_1.ANY_DATA.ANY_LLC_MISS */ - [ C(RESULT_MISS) ] = 0x01bb, + /* OFFCORE_RESPONSE.ANY_DATA.ANY_LLC_MISS */ + [ C(RESULT_MISS) ] = 0x01b7, }, /* * Use RFO, not WRITEBACK, because a write miss would typically occur * on RFO. */ [ C(OP_WRITE) ] = { - /* OFFCORE_RESPONSE_1.ANY_RFO.LOCAL_CACHE */ - [ C(RESULT_ACCESS) ] = 0x01bb, - /* OFFCORE_RESPONSE_0.ANY_RFO.ANY_LLC_MISS */ + /* OFFCORE_RESPONSE.ANY_RFO.LOCAL_CACHE */ + [ C(RESULT_ACCESS) ] = 0x01b7, + /* OFFCORE_RESPONSE.ANY_RFO.ANY_LLC_MISS */ [ C(RESULT_MISS) ] = 0x01b7, }, [ C(OP_PREFETCH) ] = { - /* OFFCORE_RESPONSE_0.PREFETCH.LOCAL_CACHE */ + /* OFFCORE_RESPONSE.PREFETCH.LOCAL_CACHE */ [ C(RESULT_ACCESS) ] = 0x01b7, - /* OFFCORE_RESPONSE_1.PREFETCH.ANY_LLC_MISS */ - [ C(RESULT_MISS) ] = 0x01bb, + /* OFFCORE_RESPONSE.PREFETCH.ANY_LLC_MISS */ + [ C(RESULT_MISS) ] = 0x01b7, }, }, [ C(DTLB) ] = { @@ -349,19 +440,53 @@ static __initconst const u64 westmere_hw_cache_event_ids [ C(RESULT_MISS) ] = -1, }, }, + [ C(NODE) ] = { + [ C(OP_READ) ] = { + [ C(RESULT_ACCESS) ] = 0x01b7, /* OFFCORE_RESP */ + [ C(RESULT_MISS) ] = 0x01b7, + }, + [ C(OP_WRITE) ] = { + [ C(RESULT_ACCESS) ] = 0x01b7, + [ C(RESULT_MISS) ] = 0x01b7, + }, + [ C(OP_PREFETCH) ] = { + [ C(RESULT_ACCESS) ] = 0x01b7, + [ C(RESULT_MISS) ] = 0x01b7, + }, + }, }; /* - * OFFCORE_RESPONSE MSR bits (subset), See IA32 SDM Vol 3 30.6.1.3 + * Nehalem/Westmere MSR_OFFCORE_RESPONSE bits; + * See IA32 SDM Vol 3B 30.6.1.3 */ -#define DMND_DATA_RD (1 << 0) -#define DMND_RFO (1 << 1) -#define DMND_WB (1 << 3) -#define PF_DATA_RD (1 << 4) -#define PF_DATA_RFO (1 << 5) -#define RESP_UNCORE_HIT (1 << 8) -#define RESP_MISS (0xf600) /* non uncore hit */ +#define NHM_DMND_DATA_RD (1 << 0) +#define NHM_DMND_RFO (1 << 1) +#define NHM_DMND_IFETCH (1 << 2) +#define NHM_DMND_WB (1 << 3) +#define NHM_PF_DATA_RD (1 << 4) +#define NHM_PF_DATA_RFO (1 << 5) +#define NHM_PF_IFETCH (1 << 6) +#define 
NHM_OFFCORE_OTHER (1 << 7) +#define NHM_UNCORE_HIT (1 << 8) +#define NHM_OTHER_CORE_HIT_SNP (1 << 9) +#define NHM_OTHER_CORE_HITM (1 << 10) + /* reserved */ +#define NHM_REMOTE_CACHE_FWD (1 << 12) +#define NHM_REMOTE_DRAM (1 << 13) +#define NHM_LOCAL_DRAM (1 << 14) +#define NHM_NON_DRAM (1 << 15) + +#define NHM_ALL_DRAM (NHM_REMOTE_DRAM|NHM_LOCAL_DRAM) + +#define NHM_DMND_READ (NHM_DMND_DATA_RD) +#define NHM_DMND_WRITE (NHM_DMND_RFO|NHM_DMND_WB) +#define NHM_DMND_PREFETCH (NHM_PF_DATA_RD|NHM_PF_DATA_RFO) + +#define NHM_L3_HIT (NHM_UNCORE_HIT|NHM_OTHER_CORE_HIT_SNP|NHM_OTHER_CORE_HITM) +#define NHM_L3_MISS (NHM_NON_DRAM|NHM_ALL_DRAM|NHM_REMOTE_CACHE_FWD) +#define NHM_L3_ACCESS (NHM_L3_HIT|NHM_L3_MISS) static __initconst const u64 nehalem_hw_cache_extra_regs [PERF_COUNT_HW_CACHE_MAX] @@ -370,18 +495,32 @@ static __initconst const u64 nehalem_hw_cache_extra_regs { [ C(LL ) ] = { [ C(OP_READ) ] = { - [ C(RESULT_ACCESS) ] = DMND_DATA_RD|RESP_UNCORE_HIT, - [ C(RESULT_MISS) ] = DMND_DATA_RD|RESP_MISS, + [ C(RESULT_ACCESS) ] = NHM_DMND_READ|NHM_L3_ACCESS, + [ C(RESULT_MISS) ] = NHM_DMND_READ|NHM_L3_MISS, }, [ C(OP_WRITE) ] = { - [ C(RESULT_ACCESS) ] = DMND_RFO|DMND_WB|RESP_UNCORE_HIT, - [ C(RESULT_MISS) ] = DMND_RFO|DMND_WB|RESP_MISS, + [ C(RESULT_ACCESS) ] = NHM_DMND_WRITE|NHM_L3_ACCESS, + [ C(RESULT_MISS) ] = NHM_DMND_WRITE|NHM_L3_MISS, }, [ C(OP_PREFETCH) ] = { - [ C(RESULT_ACCESS) ] = PF_DATA_RD|PF_DATA_RFO|RESP_UNCORE_HIT, - [ C(RESULT_MISS) ] = PF_DATA_RD|PF_DATA_RFO|RESP_MISS, + [ C(RESULT_ACCESS) ] = NHM_DMND_PREFETCH|NHM_L3_ACCESS, + [ C(RESULT_MISS) ] = NHM_DMND_PREFETCH|NHM_L3_MISS, }, } + [ C(NODE) ] = { + [ C(OP_READ) ] = { + [ C(RESULT_ACCESS) ] = NHM_DMND_READ|NHM_ALL_DRAM, + [ C(RESULT_MISS) ] = NHM_DMND_READ|NHM_REMOTE_DRAM, + }, + [ C(OP_WRITE) ] = { + [ C(RESULT_ACCESS) ] = NHM_DMND_WRITE|NHM_ALL_DRAM, + [ C(RESULT_MISS) ] = NHM_DMND_WRITE|NHM_REMOTE_DRAM, + }, + [ C(OP_PREFETCH) ] = { + [ C(RESULT_ACCESS) ] = NHM_DMND_PREFETCH|NHM_ALL_DRAM, + [ 
C(RESULT_MISS) ] = NHM_DMND_PREFETCH|NHM_REMOTE_DRAM, + }, + }, }; static __initconst const u64 nehalem_hw_cache_event_ids @@ -483,6 +622,20 @@ static __initconst const u64 nehalem_hw_cache_event_ids [ C(RESULT_MISS) ] = -1, }, }, + [ C(NODE) ] = { + [ C(OP_READ) ] = { + [ C(RESULT_ACCESS) ] = 0x01b7, /* OFFCORE_RESP */ + [ C(RESULT_MISS) ] = 0x01b7, + }, + [ C(OP_WRITE) ] = { + [ C(RESULT_ACCESS) ] = 0x01b7, + [ C(RESULT_MISS) ] = 0x01b7, + }, + [ C(OP_PREFETCH) ] = { + [ C(RESULT_ACCESS) ] = 0x01b7, + [ C(RESULT_MISS) ] = 0x01b7, + }, + }, }; static __initconst const u64 core2_hw_cache_event_ids @@ -574,6 +727,20 @@ static __initconst const u64 core2_hw_cache_event_ids [ C(RESULT_MISS) ] = -1, }, }, + [ C(NODE) ] = { + [ C(OP_READ) ] = { + [ C(RESULT_ACCESS) ] = -1, + [ C(RESULT_MISS) ] = -1, + }, + [ C(OP_WRITE) ] = { + [ C(RESULT_ACCESS) ] = -1, + [ C(RESULT_MISS) ] = -1, + }, + [ C(OP_PREFETCH) ] = { + [ C(RESULT_ACCESS) ] = -1, + [ C(RESULT_MISS) ] = -1, + }, + }, }; static __initconst const u64 atom_hw_cache_event_ids @@ -665,6 +832,20 @@ static __initconst const u64 atom_hw_cache_event_ids [ C(RESULT_MISS) ] = -1, }, }, + [ C(NODE) ] = { + [ C(OP_READ) ] = { + [ C(RESULT_ACCESS) ] = -1, + [ C(RESULT_MISS) ] = -1, + }, + [ C(OP_WRITE) ] = { + [ C(RESULT_ACCESS) ] = -1, + [ C(RESULT_MISS) ] = -1, + }, + [ C(OP_PREFETCH) ] = { + [ C(RESULT_ACCESS) ] = -1, + [ C(RESULT_MISS) ] = -1, + }, + }, }; static void intel_pmu_disable_all(void) @@ -1444,6 +1625,8 @@ static __init int intel_pmu_init(void) case 42: /* SandyBridge */ memcpy(hw_cache_event_ids, snb_hw_cache_event_ids, sizeof(hw_cache_event_ids)); + memcpy(hw_cache_extra_regs, snb_hw_cache_extra_regs, + sizeof(hw_cache_extra_regs)); intel_pmu_lbr_init_nhm(); diff --git a/arch/x86/kernel/cpu/perf_event_p4.c b/arch/x86/kernel/cpu/perf_event_p4.c index 74507c1..e802c7e 100644 --- a/arch/x86/kernel/cpu/perf_event_p4.c +++ b/arch/x86/kernel/cpu/perf_event_p4.c @@ -554,6 +554,20 @@ static __initconst const u64 
p4_hw_cache_event_ids [ C(RESULT_MISS) ] = -1, }, }, + [ C(NODE) ] = { + [ C(OP_READ) ] = { + [ C(RESULT_ACCESS) ] = -1, + [ C(RESULT_MISS) ] = -1, + }, + [ C(OP_WRITE) ] = { + [ C(RESULT_ACCESS) ] = -1, + [ C(RESULT_MISS) ] = -1, + }, + [ C(OP_PREFETCH) ] = { + [ C(RESULT_ACCESS) ] = -1, + [ C(RESULT_MISS) ] = -1, + }, + }, }; static u64 p4_general_events[PERF_COUNT_HW_MAX] = { diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index ee9f1e7..df4a841 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -59,7 +59,7 @@ enum perf_hw_id { /* * Generalized hardware cache events: * - * { L1-D, L1-I, LLC, ITLB, DTLB, BPU } x + * { L1-D, L1-I, LLC, ITLB, DTLB, BPU, NODE } x * { read, write, prefetch } x * { accesses, misses } */ @@ -70,6 +70,7 @@ enum perf_hw_cache_id { PERF_COUNT_HW_CACHE_DTLB = 3, PERF_COUNT_HW_CACHE_ITLB = 4, PERF_COUNT_HW_CACHE_BPU = 5, + PERF_COUNT_HW_CACHE_NODE = 6, PERF_COUNT_HW_CACHE_MAX, /* non-ABI */ }; ^ permalink raw reply related [flat|nested] 80+ messages in thread
* Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2 2011-04-22 22:57 ` Peter Zijlstra @ 2011-04-23 0:00 ` Andi Kleen 2011-04-23 7:50 ` Peter Zijlstra 0 siblings, 1 reply; 80+ messages in thread From: Andi Kleen @ 2011-04-23 0:00 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Arnaldo Carvalho de Melo, linux-kernel, Stephane Eranian, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner On Sat, Apr 23, 2011 at 12:57:42AM +0200, Peter Zijlstra wrote: > On Fri, 2011-04-22 at 23:54 +0200, Peter Zijlstra wrote: > > On Fri, 2011-04-22 at 23:37 +0200, Peter Zijlstra wrote: > > > The below needs filling out for !x86 (which I filled out with > > > unsupported events) and x86 needs the offcore bits fixed to auto select > > > between the two offcore events. > > > > Urgh, so SNB has different MSR_OFFCORE_RESPONSE bits and needs another table. > > Also, NHM offcore bits were wrong... it implemented _ACCESS as _HIT and What is ACCESS if not a HIT? > counted OTHER_CORE_HIT* as MISS even though its clearly documented as an > L3 hit. When the other core owns the cache line it has to be fetched from there. That's not a LLC hit. -Andi ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2 2011-04-23 0:00 ` Andi Kleen @ 2011-04-23 7:50 ` Peter Zijlstra 0 siblings, 0 replies; 80+ messages in thread From: Peter Zijlstra @ 2011-04-23 7:50 UTC (permalink / raw) To: Andi Kleen Cc: Ingo Molnar, Arnaldo Carvalho de Melo, linux-kernel, Stephane Eranian, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner On Fri, 2011-04-22 at 17:00 -0700, Andi Kleen wrote: > On Sat, Apr 23, 2011 at 12:57:42AM +0200, Peter Zijlstra wrote: > > On Fri, 2011-04-22 at 23:54 +0200, Peter Zijlstra wrote: > > > On Fri, 2011-04-22 at 23:37 +0200, Peter Zijlstra wrote: > > > > The below needs filling out for !x86 (which I filled out with > > > > unsupported events) and x86 needs the offcore bits fixed to auto select > > > > between the two offcore events. > > > > > > Urgh, so SNB has different MSR_OFFCORE_RESPONSE bits and needs another table. > > > > Also, NHM offcore bits were wrong... it implemented _ACCESS as _HIT and > > What is ACCESS if not a HIT? An ACCESS is all requests for data that comes in, after which you either HIT or MISS in which case you have to ask someone else down the line. > > counted OTHER_CORE_HIT* as MISS even though its clearly documented as an > > L3 hit. > > When the other core owns the cache line it has to be fetched from there. > That's not a LLC hit. Then _why_ are they described in 30.6.1.3, table 30-15, as: OTHER_CORE_HIT_SNP 9 (R/W). L3 Hit: .... OTHER_CORE_HITM 10 (R/W). L3 Hit: ... ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2 2011-04-22 21:37 ` Peter Zijlstra 2011-04-22 21:54 ` Peter Zijlstra @ 2011-04-23 8:13 ` Ingo Molnar 1 sibling, 0 replies; 80+ messages in thread From: Ingo Molnar @ 2011-04-23 8:13 UTC (permalink / raw) To: Peter Zijlstra Cc: Arnaldo Carvalho de Melo, linux-kernel, Andi Kleen, Stephane Eranian, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner * Peter Zijlstra <peterz@infradead.org> wrote: > On Fri, 2011-04-22 at 10:06 +0200, Ingo Molnar wrote: > > > > I'm about to push out the patch attached below - it lays out the arguments in > > detail. I don't think we have time to fix this properly for .39 - but memory > > profiling could be a nice feature for v2.6.40. > > Does something like the below provide enough generic infrastructure to > allow the raw offcore bits again? Yeah, this looks like a pretty good start - this is roughly the approach i outlined to Stephane and Andi, generic cache events extended with one more 'node' level. Andi, Stephane, if you'd like to see the Intel offcore bits supported in 2.6.40 (or 2.6.41) please help out Peter with review, testing, tools/perf/ integration, etc. Thanks, Ingo ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2 2011-04-22 8:06 ` Ingo Molnar 2011-04-22 21:37 ` Peter Zijlstra @ 2011-04-25 17:12 ` Vince Weaver 2011-04-25 17:54 ` Ingo Molnar 2011-04-26 9:25 ` Peter Zijlstra 1 sibling, 2 replies; 80+ messages in thread From: Vince Weaver @ 2011-04-25 17:12 UTC (permalink / raw) To: Ingo Molnar Cc: Arnaldo Carvalho de Melo, linux-kernel, Andi Kleen, Peter Zijlstra, Stephane Eranian, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner, Peter Zijlstra sorry for the late reply on this thread, it happened inconveniently over the long weekend. On Fri, 22 Apr 2011, Ingo Molnar wrote: > But this kind of usability is absolutely unacceptable - users should not > be expected to type in magic, CPU and model specific incantations to get > access to useful hardware functionality. That's why people use libpfm4. or PAPI. And they do. Current PAPI snapshots support offcore response on recent git kernels. With full names, no hex values, thanks to libpfm4. All the world is not perf. > The proper solution is to expose useful offcore functionality via > generalized events - that way users do not have to care which specific > CPU model they are using, they can use the conceptual event and not some > model specific quirky hexa number. No no no no. Blocking access to raw events is the wrong idea. If anything, the whole "generic events" thing in the kernel should be ditched. Wrong events are used at times (see AMD branch events a few releases back, now Nehalem cache events). This all belongs in userspace, as was pointed out at the start. The kernel has no business telling users which perf events are interesting, or limiting them! What is this, windows? If you do block access to any raw events, we're going to have to start recommending people ditch perf_events and start patching the kernel with perfctr again. We already do for P4/netburst users, as Pentium 4 support is currently hosed due to NMI event conflicts. 
Also with perfctr it's much easier to get low-latency access to the counters. See: http://web.eecs.utk.edu/~vweaver1/projects/papi-cost/ Vince ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2 2011-04-25 17:12 ` Vince Weaver @ 2011-04-25 17:54 ` Ingo Molnar 2011-04-25 21:46 ` Vince Weaver 2011-04-26 9:25 ` Peter Zijlstra 1 sibling, 1 reply; 80+ messages in thread From: Ingo Molnar @ 2011-04-25 17:54 UTC (permalink / raw) To: Vince Weaver Cc: Arnaldo Carvalho de Melo, linux-kernel, Andi Kleen, Peter Zijlstra, Stephane Eranian, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner, Peter Zijlstra * Vince Weaver <vweaver1@eecs.utk.edu> wrote: > [...] The kernel has no business telling users which perf events are > interesting, or limiting them! [...] The policy is very simple and common-sense: if a given piece of PMU functionality is useful enough to be exposed via a raw interface, then it must be useful enough to be generalized as well. > [...] What is this, windows? FYI, this is how the Linux kernel has operated from day 1 on: we support hardware features to abstract useful highlevel functionality out of it. I would not expect this to change anytime soon. Thanks, Ingo ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2 2011-04-25 17:54 ` Ingo Molnar @ 2011-04-25 21:46 ` Vince Weaver 2011-04-25 22:12 ` Andi Kleen ` (2 more replies) 0 siblings, 3 replies; 80+ messages in thread From: Vince Weaver @ 2011-04-25 21:46 UTC (permalink / raw) To: Ingo Molnar Cc: Arnaldo Carvalho de Melo, linux-kernel, Andi Kleen, Peter Zijlstra, Stephane Eranian, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner, Peter Zijlstra On Mon, 25 Apr 2011, Ingo Molnar wrote: > > * Vince Weaver <vweaver1@eecs.utk.edu> wrote: > > > [...] The kernel has no business telling users which perf events are > > interesting, or limiting them! [...] > > The policy is very simple and common-sense: if a given piece of PMU > functionality is useful enough to be exposed via a raw interface, then > it must be useful enough to be generalized as well. what does that even mean? How do you "generalize" a functionality like writing a value to an auxiliary MSR register? The PAPI tool was using the perf_events interface in the 2.6.39-git kernels to collect offcore response results by properly setting the config1 register on Nehalem and Westmere machines. Now it has been disabled for unclear reasons. Could you at least have some sort of relevant errno value set in this case? It's a real pain in userspace code to try to sort out the perf_event return values to find out if a feature is supported, unsupported (lack of hardware), unsupported (not implemented yet), unsupported (disabled due to whim of kernel developer), unsupported (because you have some sort of configuration conflict). > > [...] What is this, windows? > > FYI, this is how the Linux kernel has operated from day 1 on: we support > hardware features to abstract useful highlevel functionality out of it. > I would not expect this to change anytime soon. I started using Linux because it actually let me use my hardware without interfering with what I was trying to do. 
Not because it disabled access to the hardware due to some perceived lack of generalization in an extra unnecessary software translation layer. Vince vweaver1@eecs.utk.edu ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2 2011-04-25 21:46 ` Vince Weaver @ 2011-04-25 22:12 ` Andi Kleen 2011-04-26 7:23 ` Ingo Molnar 2011-04-26 7:38 ` Ingo Molnar 2011-04-26 9:49 ` Peter Zijlstra 2 siblings, 1 reply; 80+ messages in thread From: Andi Kleen @ 2011-04-25 22:12 UTC (permalink / raw) To: Vince Weaver Cc: Ingo Molnar, Arnaldo Carvalho de Melo, linux-kernel, Peter Zijlstra, Stephane Eranian, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner, Peter Zijlstra, torvalds > The PAPI tool was using the perf_events interface in the 2.6.39-git > kernels to collect offcore response results by properly setting the > config1 register on Nehalem and Westmere machines. I already had some users for this functionality too. Offcore events are quite useful for various analysis: basically every time you have a memory performance problem -- especially a NUMA problem -- they can help you a lot tracking it down. They answer questions like "who accesses memory on another node" As far as I'm concerned b52c55c6a25e4515b5e075a989ff346fc251ed09 is a bad feature regression. > > Now it has been disabled for unclear reasons. Also unfortunately only partial. Previously you could at least write the MSR from user space through /dev/cpu/*/msr, but now the kernel randomly rewrites it if anyone else uses cache events. Right now I have some frontend scripts which are doing this, but it's really quite nasty. It's very sad we have to go through this. -Andi ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2 2011-04-25 22:12 ` Andi Kleen @ 2011-04-26 7:23 ` Ingo Molnar 0 siblings, 0 replies; 80+ messages in thread From: Ingo Molnar @ 2011-04-26 7:23 UTC (permalink / raw) To: Andi Kleen Cc: Vince Weaver, Arnaldo Carvalho de Melo, linux-kernel, Peter Zijlstra, Stephane Eranian, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner, Peter Zijlstra, torvalds * Andi Kleen <ak@linux.intel.com> wrote: > > Now it has been disabled for unclear reasons. > > Also unfortunately only partial. Previously you could at least write the MSR > from user space through /dev/cpu/*/msr, but now the kernel randomly rewrites > it if anyone else uses cache events. Ugh, that's an unbelievable hack - if you hack an active PMU via writing to it via /dev/cpu/*/msr and it breaks you really get to keep the pieces. There's a reason why those devices are root only - it's as if you wrote to a filesystem that is already mounted! If your user-space twiddling scripts go bad who knows what state the CPU gets into and you might be reporting bogus bugs. I think writing to those msrs directly should probably taint the kernel: i'll prepare a patch for that. > It's very sad we have to go through this. Not really, it took Peter 10 minutes to come up with an RFC patch to extend the cache events in a meaningful way - and that was actually more useful to users than all prior offcore patches combined. So the kernel already won from this episode. We are not at all interested in hiding PMU functionality and keeping it unstructured, and just passing through some opaque raw ABI to user-space. Thanks, Ingo ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2 2011-04-25 21:46 ` Vince Weaver 2011-04-25 22:12 ` Andi Kleen @ 2011-04-26 7:38 ` Ingo Molnar 2011-04-26 20:51 ` Vince Weaver 2011-04-26 9:49 ` Peter Zijlstra 2 siblings, 1 reply; 80+ messages in thread From: Ingo Molnar @ 2011-04-26 7:38 UTC (permalink / raw) To: Vince Weaver Cc: Arnaldo Carvalho de Melo, linux-kernel, Andi Kleen, Peter Zijlstra, Stephane Eranian, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner, Peter Zijlstra * Vince Weaver <vweaver1@eecs.utk.edu> wrote: > On Mon, 25 Apr 2011, Ingo Molnar wrote: > > > > > * Vince Weaver <vweaver1@eecs.utk.edu> wrote: > > > > > [...] The kernel has no business telling users which perf events are > > > interesting, or limiting them! [...] > > > > The policy is very simple and common-sense: if a given piece of PMU > > functionality is useful enough to be exposed via a raw interface, then > > it must be useful enough to be generalized as well. > > what does that even mean? How do you "generalize" a functionality like > writing a value to an auxiliary MSR register? Here are a few examples: - the pure act of task switching sometimes involves writing to MSRs. How is it generalized? The concept of 'processes/threads' is offered to user-space and thus this functionality is generalized - the raw MSRs are not just passed through to user-space. - a wide range of VMX (virtualization) functionality on Intel CPUs operates via writing special values to specific MSR registers. How is it 'generalized'? A meaningful, structured ABI is provided to user-space in form of the KVM device and associated semantics. The raw MSRs are not just passed through to user-space. - the ability of CPUs to change frequency is offered via writing special values to special MSRs. How is this generalized? The cpufreq subsystem offers a frequency/cpu API and associated abstractions - the raw MSRs are not just passed through to user-space. 
- in the context of perf events we generalize the concept of an 'event' and we abstract out common, CPU model neutral CPU hardware concepts like 'cycles', 'instructions', 'branches' and a simplified cache hierarchy - and offer those events as generic events to user-space. We do not just pass the raw MSRs through to user-space. - [ etc. - a lot of useful CPU functionality is MSR driven, the PMU is nothing special there. ] The kernel development process is in essence an abstraction engine, and if you expect something else you'll probably be facing a lot of frustrating episodes in the future as well where others try to abstract out meaningful generalizations. Thanks, Ingo ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2 2011-04-26 7:38 ` Ingo Molnar @ 2011-04-26 20:51 ` Vince Weaver 2011-04-27 6:52 ` Ingo Molnar 0 siblings, 1 reply; 80+ messages in thread From: Vince Weaver @ 2011-04-26 20:51 UTC (permalink / raw) To: Ingo Molnar Cc: Arnaldo Carvalho de Melo, linux-kernel, Andi Kleen, Peter Zijlstra, Stephane Eranian, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner, Peter Zijlstra On Tue, 26 Apr 2011, Ingo Molnar wrote: > The kernel development process is in essence an abstraction engine, and if you > expect something else you'll probably be facing a lot of frustrating episodes > in the future as well where others try to abstract out meaningful > generalizations. yes, but you are taking abstraction to the extreme. A filesystem abstracts out the access to raw disk... but under Linux we still allow raw access to /dev/sda TCP/IP abstracts out the access to the network... but under Linux we still allow creating raw packets. It is fine to have some sort of high-level abstraction of perf events for those who don't have PhDs in computer architecture. Fine. But don't get in the way of people who know what they are doing. Vince ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2 2011-04-26 20:51 ` Vince Weaver @ 2011-04-27 6:52 ` Ingo Molnar 2011-04-28 22:16 ` Vince Weaver 0 siblings, 1 reply; 80+ messages in thread From: Ingo Molnar @ 2011-04-27 6:52 UTC (permalink / raw) To: Vince Weaver Cc: Arnaldo Carvalho de Melo, linux-kernel, Andi Kleen, Peter Zijlstra, Stephane Eranian, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner, Peter Zijlstra * Vince Weaver <vweaver1@eecs.utk.edu> wrote: > On Tue, 26 Apr 2011, Ingo Molnar wrote: > > > The kernel development process is in essence an abstraction engine, and if > > you expect something else you'll probably be facing a lot of frustrating > > episodes in the future as well where others try to abstract out meaningful > > generalizations. > > yes, but you are taking abstraction to the extreme. Firstly, that claim is a far cry from your original claim: ' How do you "generalize" a functionality like writing a value to an auxiliary MSR register? ' ... so i guess you conceded the point at least partially, without actually openly and honestly conceding the point? Secondly, you are still quite wrong even with your revised opinion. Being able to type '-e cycles' and '-e instructions' in perf and get ... cycles and instructions counts/events, and the kernel helping that kind of approach is not 'abstraction to the extreme', it's called 'common sense'. The fact that perfmon and oprofile works via magic vendor-specific event string incantations is one of the many design failures of those projects - not a virtue. Thanks, Ingo ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2 2011-04-27 6:52 ` Ingo Molnar @ 2011-04-28 22:16 ` Vince Weaver 2011-04-28 23:30 ` Thomas Gleixner ` (2 more replies) 0 siblings, 3 replies; 80+ messages in thread From: Vince Weaver @ 2011-04-28 22:16 UTC (permalink / raw) To: Ingo Molnar Cc: Arnaldo Carvalho de Melo, linux-kernel, Andi Kleen, Peter Zijlstra, Stephane Eranian, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner, Peter Zijlstra On Wed, 27 Apr 2011, Ingo Molnar wrote: > Secondly, you are still quite wrong even with your revised opinion. Being able > to type '-e cycles' and '-e instructions' in perf and get ... cycles and > instructions counts/events, and the kernel helping that kind of approach is not > 'abstraction to the extreme', it's called 'common sense'. by your logic I should be able to delete a file by saying echo "delete /tmp/tempfile" > /dev/sdc1 because using unlink() is too low of an abstraction and confusing to the user. > The fact that perfmon and oprofile works via magic vendor-specific event string > incantations is one of the many design failures of those projects - not a > virtue. Well, we disagree. I think one of perf_events' biggest failings (among many) is that these generalized event definitions are shoved into the kernel. At best it bloats the kernel in an option commonly turned on by vendors. At worst it gives users a false sense of security in thinking these counters are A). Portable across architectures and B). Actually measuring what they say they do. I know it is fun to reinvent the wheel, but you ignored decades of experience in dealing with perf-counters when you ran off and invented perf_events. It will bite you eventually. Vince ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2 2011-04-28 22:16 ` Vince Weaver @ 2011-04-28 23:30 ` Thomas Gleixner 2011-04-29 2:28 ` Andi Kleen 2011-04-29 19:32 ` Ingo Molnar 2 siblings, 0 replies; 80+ messages in thread From: Thomas Gleixner @ 2011-04-28 23:30 UTC (permalink / raw) To: Vince Weaver Cc: Ingo Molnar, Arnaldo Carvalho de Melo, linux-kernel, Andi Kleen, Peter Zijlstra, Stephane Eranian, Lin Ming, Arnaldo Carvalho de Melo, Peter Zijlstra Vince, On Thu, 28 Apr 2011, Vince Weaver wrote: > On Wed, 27 Apr 2011, Ingo Molnar wrote: > > > Secondly, you are still quite wrong even with your revised opinion. Being able > > to type '-e cycles' and '-e instructions' in perf and get ... cycles and > > instructions counts/events, and the kernel helping that kind of approach is not > > 'abstraction to the extreme', it's called 'common sense'. > > by your logic I should be able to delete a file by saying > echo "delete /tmp/tempfile" > /dev/sdc1 > because using unlink() is too low of an abstraction and confusing to the > user. Your definition of 'common sense' seems to be rather backwards. > > The fact that perfmon and oprofile works via magic vendor-specific event string > > incantations is one of the many design failures of those projects - not a > > virtue. > > Well we disagree. I think one of perf_events biggest failings (among > > many) is that these generalized event definitions are shoved into the Put the failings on the table if you think there are any real ones. The generalized event definitions are debatable, but Ingo's argument that they fulfil the common sense level is definitely a strong enough one to keep them. The problem at hand which ignited this flame war is definitely borderline, and I don't agree with Ingo that it should not be made available right now in raw form. That's a hardware enablement feature which can be useful even if tools/perf has no support for it and we have no generalized event for it. 
That's two different stories. perf has always allowed the use of raw events, and I don't see a reason why we should not do that in this case if it enables a subset of the perf userbase to make use of it. > kernel. At least it bloats the kernel in an option commonly turned on by Well, compared to the perfmon kernel bloat proposed back then, that's really nothing you should whine about. > vendors. At worst it gives users a full sense of security in thinking > these counters are A). Portable across architectures and B). Actually > measure what they say they do. Again, in the common sense approach they actually do what they say. For real experts like you there are still the raw events to get the real thing, which is meaningful for those who understand what 'cycles' and 'instructions' really mean. Cough, cough.... > I know it is fun to reinvent the wheel, but you ignored decades of > experience in dealing with perf-counters when you ran off and invented > perf_events. It will bite you eventually. Stop this whining already. I thoroughly reviewed the outcome of "decades of experience" and I still shudder when I get reminded of that exercise. Yes, we invented perf_events because the proposed perfmon kernel patches were an outright horror full of a cobbled-together experience dump along with a nice bunch of unfixable security holes, locking issues and permission problems, plus a completely nonobvious userspace interface. In short, a complete design failure. So perf_events were not the reinvention of the wheel. It was a sane design decision to make performance counters available _AND_ useful for a broad audience and a broad range of use cases. If the only substantial complaint about perf you can bring up is the detail of generalized events, then we can agree that we disagree and stop wasting electrons right now. Thanks, tglx ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2 2011-04-28 22:16 ` Vince Weaver 2011-04-28 23:30 ` Thomas Gleixner @ 2011-04-29 2:28 ` Andi Kleen 2011-04-29 19:32 ` Ingo Molnar 2 siblings, 0 replies; 80+ messages in thread From: Andi Kleen @ 2011-04-29 2:28 UTC (permalink / raw) To: Vince Weaver Cc: Ingo Molnar, Arnaldo Carvalho de Melo, linux-kernel, Peter Zijlstra, Stephane Eranian, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner, Peter Zijlstra > I know it is fun to reinvent the wheel, but you ignored decades of > experience in dealing with perf-counters when you ran off and invented > perf_events. It will bite you eventually. s/eventually// A good example of that is that the generalized LLC cache events counted completely bogus values for several releases (before the offcore patches went in). And BTW they now completely changed again with Peter's changes, counting something quite different. -Andi ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2 2011-04-28 22:16 ` Vince Weaver 2011-04-28 23:30 ` Thomas Gleixner 2011-04-29 2:28 ` Andi Kleen @ 2011-04-29 19:32 ` Ingo Molnar 2 siblings, 0 replies; 80+ messages in thread From: Ingo Molnar @ 2011-04-29 19:32 UTC (permalink / raw) To: Vince Weaver Cc: Arnaldo Carvalho de Melo, linux-kernel, Andi Kleen, Peter Zijlstra, Stephane Eranian, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner, Peter Zijlstra * Vince Weaver <vweaver1@eecs.utk.edu> wrote: > On Wed, 27 Apr 2011, Ingo Molnar wrote: > > > Secondly, you are still quite wrong even with your revised opinion. Being able > > to type '-e cycles' and '-e instructions' in perf and get ... cycles and > > instructions counts/events, and the kernel helping that kind of approach is not > > 'abstraction to the extreme', it's called 'common sense'. > > by your logic I should be able to delete a file by saying > > echo "delete /tmp/tempfile" > /dev/sdc1 > because using unlink() is too low of an abstraction and confusing to the > user. Erm, unlink() does not pass magic hexa constants to the disk controller. unlink() is a high level interface that works across a vast range of disk controllers, disks, network mounted filesystems, in-RAM filesystems, in-ROM filesystems, clustered filesystems and other mediums. Just like that we can tell perf to count 'cycles', 'branches' or 'branch-misses' - all of these are relatively high level concepts (in the scope of CPUs) that work across a vast range of CPU types and models. Similarly, for offcore we want to introduce the concept of 'node local' versus 'remote' memory - perhaps with some events for inter-CPU traffic as well - because that probably covers most of the NUMA related memory profiling needs. Raw events are to perf what ioctls are to the VFS: small details nobody felt worth generalizing. My point in this discussion is that we do not offer new filesystems that support *only* ioctl calls ... 
Is this simple concept so hard to understand? Thanks, Ingo ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2 2011-04-25 21:46 ` Vince Weaver 2011-04-25 22:12 ` Andi Kleen 2011-04-26 7:38 ` Ingo Molnar @ 2011-04-26 9:49 ` Peter Zijlstra 2 siblings, 0 replies; 80+ messages in thread From: Peter Zijlstra @ 2011-04-26 9:49 UTC (permalink / raw) To: Vince Weaver Cc: Ingo Molnar, Arnaldo Carvalho de Melo, linux-kernel, Andi Kleen, Stephane Eranian, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner On Mon, 2011-04-25 at 17:46 -0400, Vince Weaver wrote: > > The policy is very simple and common-sense: if a given piece of PMU > > functionality is useful enough to be exposed via a raw interface, then > > it must be useful enough to be generalized as well. > > what does that even mean? How do you "generalize" a functionality like > writing a value to an auxiliary MSR register? Come on, Vince, I know you're smarter than that! The external register is simply an extension of the configuration space; instead of the normal evsel MSR you get evsel:offcore pairs. After that it's simply scheduling them right. It simply adds more events to the PMU (in a rather sad way; it would have been so much nicer if Intel had simply extended the evsel MSR for every PMC, they could have also used that for the load-latency thing etc.) Now, these extra events offered are L3 and NUMA events; the 'common' interesting set is mostly covered by Andi's LLC mods and my NODE extension, after that there's mostly details left in offcore. So the writing of an extra MSR is totally irrelevant, it's the extra events that are. ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2 2011-04-25 17:12 ` Vince Weaver 2011-04-25 17:54 ` Ingo Molnar @ 2011-04-26 9:25 ` Peter Zijlstra 2011-04-26 20:33 ` Vince Weaver 1 sibling, 1 reply; 80+ messages in thread From: Peter Zijlstra @ 2011-04-26 9:25 UTC (permalink / raw) To: Vince Weaver Cc: Ingo Molnar, Arnaldo Carvalho de Melo, linux-kernel, Andi Kleen, Stephane Eranian, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner On Mon, 2011-04-25 at 13:12 -0400, Vince Weaver wrote: > On Fri, 22 Apr 2011, Ingo Molnar wrote: > > > But this kind of usability is absolutely unacceptable - users should not > > be expected to type in magic, CPU and model specific incantations to get > > access to useful hardware functionality. > > That's why people use libpfm4. or PAPI. And they do. And how is typing in hex numbers different from typing in model-specific event names? All the same to me; you still need to understand your micro-architecture very thoroughly and read the SDMs. PAPI actually has 'generalized' events, but I guess you're going to tell me nobody uses those since they're not useful. > Current PAPI snapshots support offcore response on recent git kernels. > With full names, no hex values, thanks to libpfm4. > > All the world is not perf. I know, all the world is interested in investing tons of time learning about their one architecture and extracting the last few percent of performance. And that is fine for those few people who can afford it, but generally optimizing for a single specific platform isn't cost effective. It looks like you're all so stuck in your HPC/lowlevel way of things you're not even realizing there's much more to be gained by providing easy and useful tools to the general public, stuff that works similarly across architectures. 
> > The proper solution is to expose useful offcore functionality via > > generalized events - that way users do not have to care which specific > > CPU model they are using, they can use the conceptual event and not some > > model specific quirky hexa number. > > No no no no. > > Blocking access to raw events is the wrong idea. If anything, the whole > "generic events" thing in the kernel should be ditched. Wrong events are > used at times (see AMD branch events a few releases back, now Nehalem > cache events). This all belongs in userspace, as was pointed out at the > start. The kernel has no business telling users which perf events are > interesting, or limiting them! What is this, windows? The kernel has no place scheduling pmcs either I expect, or scheduling tasks for that matter. We all know you don't believe in upgrading kernels or in kernels very much at all. > If you do block access to any raw events, we're going to have to start > recommending people ditch perf_events and start patching the kernel with > perfctr again. We already do for P4/netburst users, as Pentium 4 support > is currently hosed due to NMI event conflicts. Very constructive attitude, instead of helping you simply subvert and route around, thanks man! You could of course a) simply disable the NMI watchdog, or b) improve the space-heater (aka. P4) PMU implementation to use alternative encodings -- from what I understood the problem with P4 is that there's multiple ways to encode the same event and currently if you take one it doesn't try others. > Also with perfctr it's much easier to get low-latency access to the > counters. See: > http://web.eecs.utk.edu/~vweaver1/projects/papi-cost/ And why is that? is that the lack of userspace rdpmc? That should be possible with perf, powerpc actually does that already. Various people mentioned wanting to make this work on x86 but I've yet to see a patch. ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2 2011-04-26 9:25 ` Peter Zijlstra @ 2011-04-26 20:33 ` Vince Weaver 2011-04-26 21:19 ` Cyrill Gorcunov 2011-04-27 6:43 ` Ingo Molnar 0 siblings, 2 replies; 80+ messages in thread From: Vince Weaver @ 2011-04-26 20:33 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Arnaldo Carvalho de Melo, linux-kernel, Andi Kleen, Stephane Eranian, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner On Tue, 26 Apr 2011, Peter Zijlstra wrote: > > That's why people use libpfm4. or PAPI. > > And how is typing in hex numbers different from typing in model specific > event names? Really? Quick, tell me what event 0x53cf28 corresponds to on a core2. Now if I said L2_IFETCH:BOTH_CORES you know several things about what it is. Plus, you can do a quick search in the Intel Arch manual and find more info. With the hex value you have to do some shifting and masking by hand before looking up. An even worse problem: Quick... tell me what actual hardware event L1-dcache-loads corresponds to on an L2. Can you tell without digging through kernel source code? Does that event include prefetches? Does it include speculative events? Does it count page walks? Does it overcount by an amount equal to the number of hardware interrupts? If I use the equivalent event on an AMD64, will all the same hold? > PAPI actually has 'generalized' events, but I guess you're going to tell > me nobody uses those since they're not useful. Of course people use them. But we don't _force_ people to use them. We don't disable access to raw events. Although alarmingly it seems like the kernel is going to start to, possibly meaning even our users can't use our 'generalized' events if for example they incorporate OFFCORE_RESPONSE. Another issue: if a problem is found with one of the PAPI events, they can update and recompile and run out of their own account at will. 
If there's a problem with a kernel's generalized events, you have to reinstall a kernel. Something many users can't do. For example, your Nehalem cache fixes will be in 2.6.39. How long until that appears in a stock distro? How long until that appears in an RHEL release? > > All the world is not perf. > > I know, all the world is interested in investing tons of time learning > about their one architecture and extract the last few percent of > performance. There are people out there who have been using perf counters on UNIX/Linux machines for decades. They know what events they want to measure. They are not interested in having the kernel tell them they can't do it. > It looks like you're all so stuck in your HPC/lowlevel way of things > you're not even realizing there's much more to be gained by providing > easy and useful tools to the general public, stuff that works similarly > across architectures. We're not saying people can't use perf. Who knows, maybe PAPI will go away because perf is so good. It's just silly to block out access to RAW events on the argument that "it's too hard". Again, are we Microsoft here? > Very constructive attitude, instead of helping you simply subvert and > route around, thanks man! I spent a lot of time trying to fix P4 support back in the 2.6.35 days. I only have so much time to spend on this stuff. When people complain about P4 support, I direct them to Cyrill et al. I can't force them to become kernel developers. Usually they want immediate results, which they can get with perfctr. People want offcore response. People want uncore access. People want raw event access. I can tell them "send a patch to the kernel, it'll languish in obscurity for years and maybe in 2.6.4x you'll see it". Or they can have support today with an outside patch. Which do you think they choose? > And why is that? is that the lack of userspace rdpmc? That should be > possible with perf, powerpc actually does that already. 
Various people > mentioned wanting to make this work on x86 but I've yet to see a patch. We at the PAPI project welcome any patches you'd care to contribute to our project too, to make things better. It goes both ways, you know. Vince ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2 2011-04-26 20:33 ` Vince Weaver @ 2011-04-26 21:19 ` Cyrill Gorcunov 2011-04-26 21:25 ` Don Zickus 2011-04-27 6:43 ` Ingo Molnar 1 sibling, 1 reply; 80+ messages in thread From: Cyrill Gorcunov @ 2011-04-26 21:19 UTC (permalink / raw) To: Vince Weaver Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, linux-kernel, Andi Kleen, Stephane Eranian, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner, Don Zickus On 04/27/2011 12:33 AM, Vince Weaver wrote: ... > > I spent a lot of time trying to fix P4 support back in the 2.6.35 days. > I only have so much time to spend on this stuff. > > When people complain about p4 support, I direct them to Cyrill et al. I > can't force them to become kernel developers. Usually they want immediate > results, which they can get with perfctr. > Vince I've not read the whole thread so no idea what is all about, but if you have some p4 machines and have some will to help -- mind to test the patch below, it should fix nmi-watchdog and cycles conflict. It's utter raw RFC (and i know there is a nit i should update) but still might be interesting to see the results. Untested. -- perf, x86: P4 PMU -- Introduce alternate events v3 Alternate events are used to increase perf subsystem counter usage. In general the idea is to find an "alternate" event (if there is one) which do count the same magnitude as former event but use different counter allowing them to run simultaneously with original event. 
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> --- arch/x86/include/asm/perf_event_p4.h | 6 ++ arch/x86/kernel/cpu/perf_event_p4.c | 74 ++++++++++++++++++++++++++++++++++- 2 files changed, 78 insertions(+), 2 deletions(-) Index: linux-2.6.git/arch/x86/include/asm/perf_event_p4.h ===================================================================== --- linux-2.6.git.orig/arch/x86/include/asm/perf_event_p4.h +++ linux-2.6.git/arch/x86/include/asm/perf_event_p4.h @@ -36,6 +36,10 @@ #define P4_ESCR_T1_OS 0x00000002U #define P4_ESCR_T1_USR 0x00000001U +#define P4_ESCR_USR_MASK \ + (P4_ESCR_T0_OS | P4_ESCR_T0_USR | \ + P4_ESCR_T1_OS | P4_ESCR_T1_USR) + #define P4_ESCR_EVENT(v) ((v) << P4_ESCR_EVENT_SHIFT) #define P4_ESCR_EMASK(v) ((v) << P4_ESCR_EVENTMASK_SHIFT) #define P4_ESCR_TAG(v) ((v) << P4_ESCR_TAG_SHIFT) @@ -839,5 +843,7 @@ enum P4_PEBS_METRIC { * 31: reserved (HT thread) */ +#define P4_INVALID_CONFIG (u64)~0 + #endif /* PERF_EVENT_P4_H */ Index: linux-2.6.git/arch/x86/kernel/cpu/perf_event_p4.c ===================================================================== --- linux-2.6.git.orig/arch/x86/kernel/cpu/perf_event_p4.c +++ linux-2.6.git/arch/x86/kernel/cpu/perf_event_p4.c @@ -609,6 +609,31 @@ static u64 p4_general_events[PERF_COUNT_ p4_config_pack_cccr(P4_CCCR_EDGE | P4_CCCR_COMPARE), }; +/* + * Alternate events allow us to find substitution for an event if + * it's already borrowed, so they may be considered as event aliases. 
+ */ +struct p4_alt_event { + unsigned int event; + u64 config; +} p4_alternate_events[]= { + { + .event = P4_EVENT_GLOBAL_POWER_EVENTS, + .config = + p4_config_pack_escr(P4_ESCR_EVENT(P4_EVENT_EXECUTION_EVENT) | + P4_ESCR_EMASK_BIT(P4_EVENT_EXECUTION_EVENT, NBOGUS0) | + P4_ESCR_EMASK_BIT(P4_EVENT_EXECUTION_EVENT, NBOGUS1) | + P4_ESCR_EMASK_BIT(P4_EVENT_EXECUTION_EVENT, NBOGUS2) | + P4_ESCR_EMASK_BIT(P4_EVENT_EXECUTION_EVENT, NBOGUS3) | + P4_ESCR_EMASK_BIT(P4_EVENT_EXECUTION_EVENT, BOGUS0) | + P4_ESCR_EMASK_BIT(P4_EVENT_EXECUTION_EVENT, BOGUS1) | + P4_ESCR_EMASK_BIT(P4_EVENT_EXECUTION_EVENT, BOGUS2) | + P4_ESCR_EMASK_BIT(P4_EVENT_EXECUTION_EVENT, BOGUS3)) | + p4_config_pack_cccr(P4_CCCR_THRESHOLD(15) | P4_CCCR_COMPLEMENT | + P4_CCCR_COMPARE), + }, +}; + static struct p4_event_bind *p4_config_get_bind(u64 config) { unsigned int evnt = p4_config_unpack_event(config); @@ -620,6 +645,18 @@ static struct p4_event_bind *p4_config_g return bind; } +static u64 p4_find_alternate_config(unsigned int evnt) +{ + unsigned int i; + + for (i = 0; i < ARRAY_SIZE(p4_alternate_events); i++) { + if (evnt == p4_alternate_events[i].event) + return p4_alternate_events[i].config; + } + + return P4_INVALID_CONFIG; +} + static u64 p4_pmu_event_map(int hw_event) { struct p4_event_bind *bind; @@ -1133,8 +1170,41 @@ static int p4_pmu_schedule_events(struct } cntr_idx = p4_next_cntr(thread, used_mask, bind); - if (cntr_idx == -1 || test_bit(escr_idx, escr_mask)) - goto done; + if (cntr_idx == -1 || test_bit(escr_idx, escr_mask)) { + + /* + * So the former event already accepted to run + * and the only way to success here is to use + * an alternate event. 
+ */ + const u64 usr_mask = p4_config_pack_escr(P4_ESCR_USR_MASK); + u64 alt_config; + unsigned int event; + + event = p4_config_unpack_event(hwc->config); + alt_config = p4_find_alternate_config(event); + + if (alt_config == P4_INVALID_CONFIG) + goto done; + + bind = p4_config_get_bind(alt_config); + escr_idx = p4_get_escr_idx(bind->escr_msr[thread]); + if (unlikely(escr_idx == -1)) + goto done; + + cntr_idx = p4_next_cntr(thread, used_mask, bind); + if (cntr_idx == -1 || test_bit(escr_idx, escr_mask)) + goto done; + + /* + * This is a destructive operation we're going + * to make. We substitute the former config with + * the alternate one to continue tracking it afterwards. + * Be careful and don't kill the custom bits + * in the former config. + */ + hwc->config = (hwc->config & usr_mask) | alt_config; + } p4_pmu_swap_config_ts(hwc, cpu); if (assign) ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2 2011-04-26 21:19 ` Cyrill Gorcunov @ 2011-04-26 21:25 ` Don Zickus 2011-04-26 21:33 ` Cyrill Gorcunov 0 siblings, 1 reply; 80+ messages in thread From: Don Zickus @ 2011-04-26 21:25 UTC (permalink / raw) To: Cyrill Gorcunov Cc: Vince Weaver, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, linux-kernel, Andi Kleen, Stephane Eranian, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner On Wed, Apr 27, 2011 at 01:19:07AM +0400, Cyrill Gorcunov wrote: > Vince I've not read the whole thread so no idea what is all about, but if you > have some p4 machines and have some will to help -- mind to test the patch below, > it should fix nmi-watchdog and cycles conflict. It's utter raw RFC (and i know there > is a nit i should update) but still might be interesting to see the results. > Untested. > -- > perf, x86: P4 PMU -- Introduce alternate events v3 Unfortunately it just panic'd for me when I ran perf record grep -r don / Thoughts? Cheers, Don redfish.lab.bos.redhat.com login: BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 IP: [<ffffffff8101ff60>] p4_pmu_schedule_events+0xb0/0x4c0 PGD 2c603067 PUD 2d617067 PMD 0 Oops: 0000 [#1] SMP last sysfs file: /sys/devices/system/cpu/online CPU 2 Modules linked in: autofs4 sunrpc ipv6 dm_mirror dm_region_hash dm_log uinput ppdev e1000 parport_pc parport sg dcdbas pcspkr snd_intel8x0 snd_ac97_codec ac97_bus snd_seq snd_seq_device snd_pcm sn] Pid: 1734, comm: grep Not tainted 2.6.39-rc3usb3-latest+ #339 Dell Inc. 
Precision WorkStation 470 /0P7996 RIP: 0010:[<ffffffff8101ff60>] [<ffffffff8101ff60>] p4_pmu_schedule_events+0xb0/0x4c0 RSP: 0018:ffff88003fb03b18 EFLAGS: 00010016 RAX: 000000000000003c RBX: 0000000000000000 RCX: 0000000000000000 RDX: ffff88003c30de00 RSI: 0000000000000004 RDI: 000000000000000f RBP: ffff88003fb03bb8 R08: 0000000000000001 R09: 0000000000000001 R10: 000000000000006d R11: ffff88003acb4ae8 R12: ffff88002d490c00 R13: ffff88003fb03b78 R14: 0000000000000001 R15: 0000000000000001 FS: 0000000000000000(0000) GS:ffff88003fb00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000008 CR3: 000000002d728000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process grep (pid: 1734, threadinfo ffff88002d648000, task ffff88003acb4240) Stack: ffff880000000014 ffff88003acb4b10 b00002030803c000 0000000000000003 0000000200000001 ffff88003fb03bc8 0000000100000002 ffff88003fb03bcc 0000000181a24ee0 ffff88003fb0cd48 0000000000000008 0000000000000000 Call Trace: <IRQ> [<ffffffff8101b9e1>] ? x86_pmu_add+0xb1/0x170 [<ffffffff8101b8bf>] x86_pmu_commit_txn+0x5f/0xb0 [<ffffffff810ff0c4>] ? perf_event_update_userpage+0xa4/0xe0 [<ffffffff810ff020>] ? perf_output_end+0x60/0x60 [<ffffffff81100dca>] group_sched_in+0x8a/0x160 [<ffffffff8110100b>] ctx_sched_in+0x16b/0x1d0 [<ffffffff811017ce>] perf_event_task_tick+0x1de/0x260 [<ffffffff8104fc1e>] scheduler_tick+0xde/0x2b0 [<ffffffff81096e20>] ? tick_nohz_handler+0x100/0x100 ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2 2011-04-26 21:25 ` Don Zickus @ 2011-04-26 21:33 ` Cyrill Gorcunov 0 siblings, 0 replies; 80+ messages in thread From: Cyrill Gorcunov @ 2011-04-26 21:33 UTC (permalink / raw) To: Don Zickus Cc: Vince Weaver, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, linux-kernel, Andi Kleen, Stephane Eranian, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner On 04/27/2011 01:25 AM, Don Zickus wrote: > On Wed, Apr 27, 2011 at 01:19:07AM +0400, Cyrill Gorcunov wrote: >> Vince I've not read the whole thread so no idea what is all about, but if you >> have some p4 machines and have some will to help -- mind to test the patch below, >> it should fix nmi-watchdog and cycles conflict. It's utter raw RFC (and i know there >> is a nit i should update) but still might be interesting to see the results. >> Untested. >> -- >> perf, x86: P4 PMU -- Introduce alternate events v3 > > Unfortunately it just panic'd for me when I ran > > perf record grep -r don / > > Thoughts? > > Cheers, > Don > > redfish.lab.bos.redhat.com login: BUG: unable to handle kernel NULL > pointer dereference at 0000000000000008 > IP: [<ffffffff8101ff60>] p4_pmu_schedule_events+0xb0/0x4c0 > PGD 2c603067 PUD 2d617067 PMD 0 > Oops: 0000 [#1] SMP > last sysfs file: /sys/devices/system/cpu/online > CPU 2 > Modules linked in: autofs4 sunrpc ipv6 dm_mirror dm_region_hash dm_log > uinput ppdev e1000 parport_pc parport sg dcdbas pcspkr snd_intel8x0 > snd_ac97_codec ac97_bus snd_seq snd_seq_device snd_pcm sn] > > Pid: 1734, comm: grep Not tainted 2.6.39-rc3usb3-latest+ #339 Dell Inc. 
> Precision WorkStation 470 /0P7996 > RIP: 0010:[<ffffffff8101ff60>] [<ffffffff8101ff60>] > p4_pmu_schedule_events+0xb0/0x4c0 > RSP: 0018:ffff88003fb03b18 EFLAGS: 00010016 > RAX: 000000000000003c RBX: 0000000000000000 RCX: 0000000000000000 > RDX: ffff88003c30de00 RSI: 0000000000000004 RDI: 000000000000000f > RBP: ffff88003fb03bb8 R08: 0000000000000001 R09: 0000000000000001 > R10: 000000000000006d R11: ffff88003acb4ae8 R12: ffff88002d490c00 > R13: ffff88003fb03b78 R14: 0000000000000001 R15: 0000000000000001 > FS: 0000000000000000(0000) GS:ffff88003fb00000(0000) > knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 0000000000000008 CR3: 000000002d728000 CR4: 00000000000006e0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > Process grep (pid: 1734, threadinfo ffff88002d648000, task > ffff88003acb4240) > Stack: > ffff880000000014 ffff88003acb4b10 b00002030803c000 0000000000000003 > 0000000200000001 ffff88003fb03bc8 0000000100000002 ffff88003fb03bcc > 0000000181a24ee0 ffff88003fb0cd48 0000000000000008 0000000000000000 > Call Trace: > <IRQ> > [<ffffffff8101b9e1>] ? x86_pmu_add+0xb1/0x170 > [<ffffffff8101b8bf>] x86_pmu_commit_txn+0x5f/0xb0 > [<ffffffff810ff0c4>] ? perf_event_update_userpage+0xa4/0xe0 > [<ffffffff810ff020>] ? perf_output_end+0x60/0x60 > [<ffffffff81100dca>] group_sched_in+0x8a/0x160 > [<ffffffff8110100b>] ctx_sched_in+0x16b/0x1d0 > [<ffffffff811017ce>] perf_event_task_tick+0x1de/0x260 > [<ffffffff8104fc1e>] scheduler_tick+0xde/0x2b0 > [<ffffffff81096e20>] ? tick_nohz_handler+0x100/0x100 > Ouch, I bet p4_config_get_bind returned NULL and here we are. Weird, seems I've missed something. Don, I'll continue tomorrow, ok? (kinda sleep already). -- Cyrill ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2
  2011-04-26 20:33 ` Vince Weaver
  2011-04-26 21:19 ` Cyrill Gorcunov
@ 2011-04-27 6:43 ` Ingo Molnar
  2011-04-28 22:10 ` Vince Weaver
  1 sibling, 1 reply; 80+ messages in thread
From: Ingo Molnar @ 2011-04-27 6:43 UTC (permalink / raw)
To: Vince Weaver
Cc: Peter Zijlstra, Arnaldo Carvalho de Melo, linux-kernel, Andi Kleen, Stephane Eranian, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner

* Vince Weaver <vweaver1@eecs.utk.edu> wrote:

> On Tue, 26 Apr 2011, Peter Zijlstra wrote:
>
> > > That's why people use libpfm4. or PAPI. And they do.
> >
> > And how is typing in hex numbers different from typing in model specific
> > event names?
>
> Really... quick, tell me what event 0x53cf28 corresponds to on a core2.
>
> Now if I said L2_IFETCH:BOTH_CORES you know several things about what it is.

Erm, that assumes you already know that magic incantation. Most of the users who want to do measurements and profiling do not know that. So there's little difference between:

 - someone shows them the 0x53cf28 magic code
 - someone shows them the L2_IFETCH:BOTH_CORES magic symbol

So while hexa values have like 10% utility, the stupid, vendor-specific event names you are pushing here have like 15% utility. In perf we are aiming for 100% utility, where someone who knows something about CPUs and can type 'cycles', 'instructions' or 'branches' will get the obvious result.

This is not a difficult usability concept really.

Thanks,

	Ingo

^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2 2011-04-27 6:43 ` Ingo Molnar @ 2011-04-28 22:10 ` Vince Weaver 0 siblings, 0 replies; 80+ messages in thread From: Vince Weaver @ 2011-04-28 22:10 UTC (permalink / raw) To: Ingo Molnar Cc: Peter Zijlstra, Arnaldo Carvalho de Melo, linux-kernel, Andi Kleen, Stephane Eranian, Lin Ming, Arnaldo Carvalho de Melo, Thomas Gleixner On Wed, 27 Apr 2011, Ingo Molnar wrote: > > Erm, that assumes you already know that magic incantation. Most of the users > who want to do measurements and profiling do not know that. So there's little > difference between: > > - someone shows them the 0x53cf28 magic code > - someone shows them the L2_IFETCH:BOTH_CORES magic symbol > > So you while hexa values have like 10% utility, the stupid, vendor-specific > event names you are pushing here have like 15% utility. > > In perf we are aiming for 100% utility, where if someone knows something about > CPUs and can type 'cycles', 'instructions' or 'branches', will get the obvious > result. > > This is not a difficult usability concept really. yes, and this functionality belongs in the perf tool itself (or some other user tool, like libpfm4, or PAPI). Not in the kernel. How much larger are you willing to make the kernel to hold your generalized events? PAPI has at least 128 that people have found useful enough to add over the years. There are probably more. I notice the kernel doesn't have any FP or SSE/Vector counts yet. Or uops. Or hw-interrupt counts. Fused multiply-add? How about GPU counters (PAPI is starting to support these)? Network counters? Infiniband? You're being lazy and pushing "perf" functionality into the kernel. It belongs in userspace. It's not the kernel's job to make things easy for users. Its job is to make things possible, and get out of the way. It's already bad enough that your generalized events can change from kernel version to kernel version without warning. 
By being in the kernel, aren't they a stable ABI that can't be changed? Vince ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2
  2011-04-22 6:34 ` Ingo Molnar
  2011-04-22 8:06 ` Ingo Molnar
@ 2011-04-22 16:22 ` Andi Kleen
  2011-04-22 19:54 ` Ingo Molnar
  1 sibling, 1 reply; 80+ messages in thread
From: Andi Kleen @ 2011-04-22 16:22 UTC (permalink / raw)
To: Ingo Molnar
Cc: Arnaldo Carvalho de Melo, linux-kernel, Peter Zijlstra, Stephane Eranian, Lin Ming, Arnaldo Carvalho de Melo

On Fri, Apr 22, 2011 at 08:34:29AM +0200, Ingo Molnar wrote:
> This needs to be a *lot* more user friendly. Users do not want to type in
> stupid hexa magic numbers to get profiling. We have moved beyond the oprofile
> era really.

I agree that the raw events are quite user unfriendly. Unfortunately they are the way of life in perf -- unlike oprofile -- currently if you want any CPU specific events like this. Really, to make sense of all this you need full per-CPU event lists.

I have my own wrapper to make it more user friendly, but its functionality should arguably migrate into perf. I did a patch to add a mapping file some time ago, but it likely needs some improvements before it can be merged (aka not .39), like auto-selecting a suitable mapping file and back-translating raw mappings on output.

BTW the new perf lat code needs the raw events config1 specification internally, so this is needed in some form anyways.

Short of that, the extended raw event syntax is the best we can get short term, I think. So I would prefer to have it for .39 to make this feature usable at all.

I attached the old mapping file patch for your reference. I also put up a few mapping files for Intel CPUs at ftp://ftp.kernel.org/pub/linux/kernel/people/ak/pmu/* e.g.
to use it with Nehalem offcore events and this patch you would use today wget ftp://ftp.kernel.org/pub/linux/kernel/people/ak/pmu/nhm-ep.map perf --map-file nhm-ep.map top -e offcore_response_0.any_data.local_cache_dram -Andi commit 37323c19ceb57101cc2160059c567ee14055b7c8 Author: Andi Kleen <ak@linux.intel.com> Date: Mon Nov 8 04:52:18 2010 +0100 mapping file support diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt index a91f9f9..63bdbbb 100644 --- a/tools/perf/Documentation/perf-record.txt +++ b/tools/perf/Documentation/perf-record.txt @@ -120,6 +120,9 @@ Do not update the builid cache. This saves some overhead in situations where the information in the perf.data file (which includes buildids) is sufficient. +--map-events=file +Use file as event mapping file. + SEE ALSO -------- linkperf:perf-stat[1], linkperf:perf-list[1] diff --git a/tools/perf/Documentation/perf-stat.txt b/tools/perf/Documentation/perf-stat.txt index 4b3a2d4..4f20af3 100644 --- a/tools/perf/Documentation/perf-stat.txt +++ b/tools/perf/Documentation/perf-stat.txt @@ -53,6 +53,9 @@ comma-sperated list with no space: 0,1. Ranges of CPUs are specified with -: 0-2 In per-thread mode, this option is ignored. The -a option is still necessary to activate system-wide monitoring. Default is to count on all CPUs. +--map-events=file +Use file as event mapping file. + EXAMPLES -------- diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c index 93bd2ff..6fdf892 100644 --- a/tools/perf/builtin-record.c +++ b/tools/perf/builtin-record.c @@ -794,6 +794,9 @@ const struct option record_options[] = { OPT_CALLBACK('e', "event", NULL, "event", "event selector. 
use 'perf list' to list available events", parse_events), + OPT_CALLBACK(0, "map-events", NULL, "map-events", + "specify mapping file for events", + map_events), OPT_CALLBACK(0, "filter", NULL, "filter", "event filter", parse_filter), OPT_INTEGER('p', "pid", &target_pid, diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c index a6b4d44..f21f307 100644 --- a/tools/perf/builtin-stat.c +++ b/tools/perf/builtin-stat.c @@ -525,6 +525,9 @@ static const struct option options[] = { OPT_CALLBACK('e', "event", NULL, "event", "event selector. use 'perf list' to list available events", parse_events), + OPT_CALLBACK(0, "map-events", NULL, "map-events", + "specify mapping file for events", + map_events), OPT_BOOLEAN('i', "no-inherit", &no_inherit, "child tasks do not inherit counters"), OPT_INTEGER('p', "pid", &target_pid, diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c index 4af5bd5..2cc7b3d 100644 --- a/tools/perf/util/parse-events.c +++ b/tools/perf/util/parse-events.c @@ -83,6 +83,14 @@ static const char *sw_event_names[] = { "emulation-faults", }; +struct mapping { + const char *str; + const char *res; +}; + +static int mapping_max; +static struct mapping *mappings; + #define MAX_ALIASES 8 static const char *hw_cache[][MAX_ALIASES] = { @@ -731,12 +739,28 @@ parse_event_modifier(const char **strp, struct perf_event_attr *attr) return 0; } +static int cmp_mapping(const void *a, const void *b) +{ + const struct mapping *am = a; + const struct mapping *bm = b; + return strcmp(am->str, bm->str); +} + +static const char * +get_event_mapping(const char *str) +{ + struct mapping key = { .str = str }; + struct mapping *r = bsearch(&key, mappings, mapping_max, + sizeof(struct mapping), cmp_mapping); + return r ? r->res : NULL; +} + /* * Each event can have multiple symbolic names. * Symbolic names are (almost) exactly matched. 
*/ static enum event_result -parse_event_symbols(const char **str, struct perf_event_attr *attr) +do_parse_event_symbols(const char **str, struct perf_event_attr *attr) { enum event_result ret; @@ -774,6 +798,15 @@ modifier: return ret; } +static enum event_result +parse_event_symbols(const char **str, struct perf_event_attr *attr) +{ + const char *map = get_event_mapping(*str); + if (map) + *str = map; + return do_parse_event_symbols(str, attr); +} + static int store_event_type(const char *orgname) { char filename[PATH_MAX], *c; @@ -963,3 +996,54 @@ void print_events(void) exit(129); } + +int map_events(const struct option *opt __used, const char *str, + int unset __used) +{ + FILE *f; + char *line = NULL; + size_t linelen = 0; + char *p; + int lineno = 0; + static int mapping_size; + struct mapping *map; + + f = fopen(str, "r"); + if (!f) { + pr_err("Cannot open event map file"); + return -1; + } + while (getline(&line, &linelen, f) > 0) { + lineno++; + p = strpbrk(line, "\n#"); + if (p) + *p = 0; + p = line + strspn(line, " \t"); + if (*p == 0) + continue; + if (mapping_max >= mapping_size) { + if (!mapping_size) + mapping_size = 2048; + mapping_size *= 2; + mappings = realloc(mappings, + mapping_size * sizeof(struct mapping)); + if (!mappings) { + pr_err("Out of memory\n"); + exit(ENOMEM); + } + } + map = &mappings[mapping_max++]; + map->str = strsep(&p, " \t"); + map->res = strsep(&p, " \t"); + if (!map->str || !map->res) { + fprintf(stderr, "%s:%d: Invalid line in map file\n", + str, lineno); + } + line = NULL; + linelen = 0; + } + fclose(f); + qsort(mappings, mapping_max, sizeof(struct mapping), + cmp_mapping); + return 0; +} diff --git a/tools/perf/util/parse-events.h b/tools/perf/util/parse-events.h index fc4ab3f..1d6df9c 100644 --- a/tools/perf/util/parse-events.h +++ b/tools/perf/util/parse-events.h @@ -33,5 +33,6 @@ extern void print_events(void); extern char debugfs_path[]; extern int valid_debugfs_mount(const char *debugfs); +extern int 
map_events(const struct option *opt, const char *str, int unset); #endif /* __PERF_PARSE_EVENTS_H */ ^ permalink raw reply related [flat|nested] 80+ messages in thread
* Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2
  2011-04-22 16:22 ` Andi Kleen
@ 2011-04-22 19:54 ` Ingo Molnar
  0 siblings, 0 replies; 80+ messages in thread
From: Ingo Molnar @ 2011-04-22 19:54 UTC (permalink / raw)
To: Andi Kleen
Cc: Arnaldo Carvalho de Melo, linux-kernel, Peter Zijlstra, Stephane Eranian, Lin Ming, Arnaldo Carvalho de Melo

* Andi Kleen <ak@linux.intel.com> wrote:

> On Fri, Apr 22, 2011 at 08:34:29AM +0200, Ingo Molnar wrote:
> > This needs to be a *lot* more user friendly. Users do not want to type in
> > stupid hexa magic numbers to get profiling. We have moved beyond the oprofile
> > era really.
>
> I agree that the raw events are quite user unfriendly.
>
> Unfortunately they are the way of life in perf -- unlike oprofile --
> currently if you want any CPU specific events like this.

Not sure where you got that blanket statement from, but no, raw events are not really the 'way of life' - judging by the user feedback we get, they come up pretty rarely.

The thing is, most people just use the default 'perf record' and that's it - they do not even care about a *single* event - they just want to profile their code somehow.
Then the second most popular event category is the generalized events, the ones you can see in perf list output:

  cpu-cycles OR cycles                       [Hardware event]
  instructions                               [Hardware event]
  cache-references                           [Hardware event]
  cache-misses                               [Hardware event]
  branch-instructions OR branches            [Hardware event]
  branch-misses                              [Hardware event]
  bus-cycles                                 [Hardware event]
  cpu-clock                                  [Software event]
  task-clock                                 [Software event]
  page-faults OR faults                      [Software event]
  minor-faults                               [Software event]
  major-faults                               [Software event]
  context-switches OR cs                     [Software event]
  cpu-migrations OR migrations               [Software event]
  alignment-faults                           [Software event]
  emulation-faults                           [Software event]
  L1-dcache-loads                            [Hardware cache event]
  L1-dcache-load-misses                      [Hardware cache event]
  L1-dcache-stores                           [Hardware cache event]
  L1-dcache-store-misses                     [Hardware cache event]
  L1-dcache-prefetches                       [Hardware cache event]
  L1-dcache-prefetch-misses                  [Hardware cache event]
  L1-icache-loads                            [Hardware cache event]
  L1-icache-load-misses                      [Hardware cache event]
  L1-icache-prefetches                       [Hardware cache event]
  L1-icache-prefetch-misses                  [Hardware cache event]
  LLC-loads                                  [Hardware cache event]
  LLC-load-misses                            [Hardware cache event]
  LLC-stores                                 [Hardware cache event]
  LLC-store-misses                           [Hardware cache event]
  LLC-prefetches                             [Hardware cache event]
  LLC-prefetch-misses                        [Hardware cache event]
  dTLB-loads                                 [Hardware cache event]
  dTLB-load-misses                           [Hardware cache event]
  dTLB-stores                                [Hardware cache event]
  dTLB-store-misses                          [Hardware cache event]
  dTLB-prefetches                            [Hardware cache event]
  dTLB-prefetch-misses                       [Hardware cache event]
  iTLB-loads                                 [Hardware cache event]
  iTLB-load-misses                           [Hardware cache event]
  branch-loads                               [Hardware cache event]
  branch-load-misses                         [Hardware cache event]

These are useful but are used less frequently. Then come tracepoint based events - and as a distant last come raw events.

Yes, raw events are useful occasionally, just like modifying applications via a hexa editor is useful occasionally. If done often, we'd better abstract it out.
> Really, to make sense of all this you need full per-CPU event lists.

To make sense of what? You are making very sweeping yet vague statements.

> I have my own wrapper to make it more user friendly, but its functionality
> should arguably migrate into perf.

Uhm, no - your patch seems to reintroduce oprofile's horrible events files. We really learned from that mistake and do not want to step back ...

Please see the detailed mails I wrote in this thread: what we want is to extend and improve the existing generalizations of events. The useful bits of the offcore PMU fit nicely into that scheme.

Thanks,

	Ingo

^ permalink raw reply [flat|nested] 80+ messages in thread
end of thread, other threads:[~2011-04-29 19:32 UTC | newest] Thread overview: 80+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2011-04-22 8:47 [PATCH 1/1] perf tools: Add missing user space support for config1/config2 Stephane Eranian 2011-04-22 9:23 ` Ingo Molnar 2011-04-22 9:41 ` Stephane Eranian 2011-04-22 10:52 ` [generalized cache events] " Ingo Molnar 2011-04-22 12:04 ` Stephane Eranian 2011-04-22 13:18 ` Ingo Molnar 2011-04-22 20:31 ` Stephane Eranian 2011-04-22 20:47 ` Ingo Molnar 2011-04-23 12:13 ` Stephane Eranian 2011-04-23 12:49 ` Ingo Molnar 2011-04-22 21:03 ` Ingo Molnar 2011-04-23 12:27 ` Stephane Eranian 2011-04-22 16:51 ` Andi Kleen 2011-04-22 19:57 ` Ingo Molnar 2011-04-26 9:25 ` Peter Zijlstra 2011-04-22 16:50 ` arun 2011-04-22 17:00 ` Andi Kleen 2011-04-22 20:30 ` Ingo Molnar 2011-04-22 20:32 ` Ingo Molnar 2011-04-23 0:03 ` Andi Kleen 2011-04-23 7:50 ` Peter Zijlstra 2011-04-23 12:06 ` Stephane Eranian 2011-04-23 12:36 ` Ingo Molnar 2011-04-23 13:16 ` Peter Zijlstra 2011-04-25 18:48 ` Stephane Eranian 2011-04-25 19:40 ` Andi Kleen 2011-04-25 19:55 ` Ingo Molnar 2011-04-24 2:15 ` Andi Kleen 2011-04-24 2:19 ` Andi Kleen 2011-04-25 17:41 ` Ingo Molnar 2011-04-25 18:00 ` Dehao Chen [not found] ` <BANLkTiks31-pMJe4zCKrppsrA1d6KanJFA@mail.gmail.com> 2011-04-25 18:05 ` Ingo Molnar 2011-04-25 18:39 ` Stephane Eranian 2011-04-25 19:45 ` Ingo Molnar 2011-04-23 8:02 ` Ingo Molnar 2011-04-23 20:14 ` [PATCH] perf events: Add stalled cycles generic event - PERF_COUNT_HW_STALLED_CYCLES Ingo Molnar 2011-04-24 6:16 ` Arun Sharma 2011-04-25 17:37 ` Ingo Molnar 2011-04-26 9:25 ` Peter Zijlstra 2011-04-26 14:00 ` Ingo Molnar 2011-04-27 11:11 ` Ingo Molnar 2011-04-27 14:47 ` Arun Sharma 2011-04-27 15:48 ` Ingo Molnar 2011-04-27 16:27 ` Ingo Molnar 2011-04-27 19:05 ` Arun Sharma 2011-04-27 19:03 ` Arun Sharma -- strict thread matches above, loose matches on Subject: below -- 2011-04-21 17:41 [GIT PULL 0/1] 
perf/urgent Fix missing support for config1/config2 Arnaldo Carvalho de Melo 2011-04-21 17:41 ` [PATCH 1/1] perf tools: Add missing user space " Arnaldo Carvalho de Melo 2011-04-22 6:34 ` Ingo Molnar 2011-04-22 8:06 ` Ingo Molnar 2011-04-22 21:37 ` Peter Zijlstra 2011-04-22 21:54 ` Peter Zijlstra 2011-04-22 22:19 ` Peter Zijlstra 2011-04-22 23:54 ` Andi Kleen 2011-04-23 7:49 ` Peter Zijlstra 2011-04-22 22:57 ` Peter Zijlstra 2011-04-23 0:00 ` Andi Kleen 2011-04-23 7:50 ` Peter Zijlstra 2011-04-23 8:13 ` Ingo Molnar 2011-04-25 17:12 ` Vince Weaver 2011-04-25 17:54 ` Ingo Molnar 2011-04-25 21:46 ` Vince Weaver 2011-04-25 22:12 ` Andi Kleen 2011-04-26 7:23 ` Ingo Molnar 2011-04-26 7:38 ` Ingo Molnar 2011-04-26 20:51 ` Vince Weaver 2011-04-27 6:52 ` Ingo Molnar 2011-04-28 22:16 ` Vince Weaver 2011-04-28 23:30 ` Thomas Gleixner 2011-04-29 2:28 ` Andi Kleen 2011-04-29 19:32 ` Ingo Molnar 2011-04-26 9:49 ` Peter Zijlstra 2011-04-26 9:25 ` Peter Zijlstra 2011-04-26 20:33 ` Vince Weaver 2011-04-26 21:19 ` Cyrill Gorcunov 2011-04-26 21:25 ` Don Zickus 2011-04-26 21:33 ` Cyrill Gorcunov 2011-04-27 6:43 ` Ingo Molnar 2011-04-28 22:10 ` Vince Weaver 2011-04-22 16:22 ` Andi Kleen 2011-04-22 19:54 ` Ingo Molnar