* [patch] perf: ARMv7 wrong "branches" generalized instruction
From: Vince Weaver @ 2011-08-10 17:40 UTC
To: linux-kernel
Cc: will.deacon, sam wang, Ingo Molnar, Peter Zijlstra, Paul Mackerras,
    Arnaldo Carvalho de Melo, Stephane Eranian

Hello

Sam Wang reported to me that my perf_event validation tests were failing
with branches on ARM Cortex A9.

It turns out the branches event used (ARMV7_PERFCTR_PC_WRITE) only seems
to count taken branches.

ARMV7_PERFCTR_PC_IMM_BRANCH seems to do a better job of counting both
taken and not-taken.  So I've attached a patch to change the definition
for Cortex A9.

This might be needed for Cortex A8 but I don't have a machine to test on
(yet).

I'm assuming this is a proper fix.  The "generalized" events aren't
defined very well so there's always some wiggle room about what they mean.

Patch tested on a Pandaboard.

The test code looks like this.  There should be 500,000*3 branches.
But the second branch (the not taken "bge test_jmp2") is not counted
with PC_WRITE.
	asm(	"\teor	r3,r3,r3\n"
		"\tldr	r3,=500000\n"
		"test_loop:\n"
		"\tB	test_jmp\n"
		"\tnop\n"
		"test_jmp:\n"
		"\teor	r2,r2,r2\n"
		"\tcmp	r2,#1\n"
		"\tbge	test_jmp2\n"
		"\tnop\n"
		"\tadd	r2,r2,#1\n"
		"test_jmp2:\n"
		"\tsub	r3,r3,#1\n"
		"\tcmp	r3,#1\n"
		"\tbgt	test_loop\n"
		: /* no output registers */
		: /* no inputs */
		: "cc", "r2", "r3" /* clobbered */
	);

Vince
vweaver1@eecs.utk.edu

Signed-off-by: Vince Weaver <vweaver1@eecs.utk.edu>

diff --git a/arch/arm/kernel/perf_event_v7.c b/arch/arm/kernel/perf_event_v7.c
index 4c85183..4d11bd5 100644
--- a/arch/arm/kernel/perf_event_v7.c
+++ b/arch/arm/kernel/perf_event_v7.c
@@ -323,7 +323,7 @@ static const unsigned armv7_a9_perf_map[PERF_COUNT_HW_MAX] = {
 					  ARMV7_PERFCTR_INST_OUT_OF_RENAME_STAGE,
 	[PERF_COUNT_HW_CACHE_REFERENCES]    = ARMV7_PERFCTR_COHERENT_LINE_HIT,
 	[PERF_COUNT_HW_CACHE_MISSES]	    = ARMV7_PERFCTR_COHERENT_LINE_MISS,
-	[PERF_COUNT_HW_BRANCH_INSTRUCTIONS] = ARMV7_PERFCTR_PC_WRITE,
+	[PERF_COUNT_HW_BRANCH_INSTRUCTIONS] = ARMV7_PERFCTR_PC_IMM_BRANCH,
 	[PERF_COUNT_HW_BRANCH_MISSES]	    = ARMV7_PERFCTR_PC_BRANCH_MIS_PRED,
 	[PERF_COUNT_HW_BUS_CYCLES]	    = ARMV7_PERFCTR_CLOCK_CYCLES,
 };
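
[Editorial sketch, not part of the original message.] To make the
expected counts concrete, the control flow of the inline-asm test can be
mirrored in portable C and the branch outcomes tallied by hand. The
function and struct names below are hypothetical, and the model assumes
every branch instruction is either taken or not taken exactly once per
execution. Under that model the loop body runs 499,999 times (the exit
test is "cmp r3,#1; bgt"), so the totals come out slightly under
500,000*3: an event that only sees taken branches would report about one
million, while an event that also sees not-taken branches would report
about one and a half million.

```c
#include <assert.h>

/* Tally of branch outcomes for one run of the simulated loop. */
struct branch_tally {
	long taken;
	long not_taken;
};

/*
 * Hypothetical helper: replays the control flow of the asm test in C.
 * r2 and r3 stand in for the ARM registers of the same name.
 */
static struct branch_tally simulate_branch_test(long start)
{
	struct branch_tally t = { 0, 0 };
	long r3 = start;		/* ldr r3,=500000 */

	assert(start > 1);		/* the asm test uses 500000 */

	for (;;) {
		long r2;

		t.taken++;		/* B test_jmp: always taken */
		r2 = 0;			/* eor r2,r2,r2 */
		if (r2 >= 1) {		/* cmp r2,#1 ; bge test_jmp2 */
			t.taken++;
		} else {
			t.not_taken++;	/* fall through to the add */
			r2 += 1;	/* add r2,r2,#1 */
		}
		(void)r2;
		r3 -= 1;		/* sub r3,r3,#1 */
		if (r3 > 1) {		/* cmp r3,#1 ; bgt test_loop */
			t.taken++;
		} else {
			t.not_taken++;	/* loop exit */
			break;
		}
	}
	return t;
}
```

Calling simulate_branch_test(500000) gives 999,997 taken and 500,000
not-taken branches in this model, which is consistent with the report
that a taken-only event misses roughly a third of the branches in the
test.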
* Re: [patch] perf: ARMv7 wrong "branches" generalized instruction
From: Will Deacon @ 2011-08-10 18:33 UTC
To: Vince Weaver
Cc: linux-kernel, sam wang, Ingo Molnar, Peter Zijlstra, Paul Mackerras,
    Arnaldo Carvalho de Melo, Stephane Eranian

On Wed, Aug 10, 2011 at 06:40:31PM +0100, Vince Weaver wrote:
> Hello

Hi Vince,

> Sam Wang reported to me that my perf_event validation tests were failing
> with branches on ARM Cortex A9.
>
> It turns out the branches event used (ARMV7_PERFCTR_PC_WRITE) only seems
> to count taken branches.

It also counts exceptions and instructions that write to the PC.

> ARMV7_PERFCTR_PC_IMM_BRANCH seems to do a better job of counting both
> taken and not-taken.  So I've attached a patch to change the definition
> for Cortex A9.

Well, it also only considers immediate branches so whilst it might
satisfy your test, I think that overall it's a less meaningful number.

> This might be needed for Cortex A8 but I don't have a machine to test on
> (yet).

We use the same event encoding for HW_BRANCH_INSTRUCTIONS on the A8.

> I'm assuming this is a proper fix.  The "generalized" events aren't
> defined very well so there's always some wiggle room about what they mean.

I'm really not a big fan of the generalised events. I appreciate that they
make perf easier to use but *only* if you can actually provide a sensible
definition of the event which can (ideally) be compared between two
different CPU implementations for the same architecture.

So, my take on this is that we should either:

(a) leave it like it is since taken branches is probably a more useful
    metric than number of immediate branches executed.

(b) start replacing our generalised events with HW_OP_UNSUPPORTED and force
    the user to use raw events.
    I agree this isn't very friendly, but it's better than giving them
    crazy results [for example, we currently report more cache misses
    than cache references on A9 iirc].

Personally, I'm in favour of (b) and getting userspace to provide the user
with a CPU-specific event listing and then translate this to raw events
using something like libpfm.

As an aside, I also think this is part of a bigger problem. For example,
the software event PERF_COUNT_SW_EMULATION_FAULTS would be much more useful
if we could describe different types of emulation faults. These would
probably be architecture-specific and we would need a way for userspace to
communicate the event subclass to the kernel rather than having separate
ABI events for them. So not only would we want raw events, we'd also want a
way to specify the PMU to handle them (given that a global event namespace
across PMUs is unrealistic).

Will
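
[Editorial sketch, not part of the original message.] Option (b) above
amounts to marking a generic event as unmapped in the per-CPU table and
refusing it at lookup time. The fragment below is a minimal illustration
of that idea only; it is not the kernel's code. All names are
hypothetical, the sentinel value and the error codes are assumptions,
and the single real number used, 0x68, is the Cortex-A9 "instructions"
event quoted later in this thread.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical generic event ids, standing in for PERF_COUNT_HW_*. */
enum sketch_generic_event {
	SKETCH_HW_INSTRUCTIONS,
	SKETCH_HW_CACHE_REFERENCES,
	SKETCH_HW_BRANCH_INSTRUCTIONS,
	SKETCH_HW_MAX
};

/* Assumed sentinel meaning "no sensible hardware event for this id". */
#define SKETCH_OP_UNSUPPORTED 0xFFFFu

/* Illustrative per-CPU map: only one generic event is kept. */
static const unsigned sketch_a9_map[SKETCH_HW_MAX] = {
	[SKETCH_HW_INSTRUCTIONS]        = 0x68,
	[SKETCH_HW_CACHE_REFERENCES]    = SKETCH_OP_UNSUPPORTED,
	[SKETCH_HW_BRANCH_INSTRUCTIONS] = SKETCH_OP_UNSUPPORTED,
};

/*
 * Return the raw event number for a generic event, or a negative errno
 * value so the caller falls back to raw events instead of getting a
 * misleading count.
 */
static int sketch_map_event(unsigned generic)
{
	unsigned raw;

	if (generic >= SKETCH_HW_MAX)
		return -EINVAL;

	raw = sketch_a9_map[generic];
	if (raw == SKETCH_OP_UNSUPPORTED)
		return -ENOENT;

	assert(raw <= 0xFF);	/* this sketch keeps raw numbers to 8 bits */
	return (int)raw;
}
```

With this table, a request for generic cache references or branches
fails outright, which is the trade-off debated in the rest of the
thread: less convenient, but no nonsensical ratios.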
* Re: [patch] perf: ARMv7 wrong "branches" generalized instruction
From: Vince Weaver @ 2011-08-10 19:01 UTC
To: Will Deacon
Cc: linux-kernel, sam wang, Ingo Molnar, Peter Zijlstra, Paul Mackerras,
    Arnaldo Carvalho de Melo, Stephane Eranian

On Wed, 10 Aug 2011, Will Deacon wrote:

> > It turns out the branches event used (ARMV7_PERFCTR_PC_WRITE) only seems
> > to count taken branches.
>
> It also counts exceptions and instructions that write to the PC.

are those more common than not-taken branches?  I'd think branch predictor
statistics will be a bit off if only taken instructions are measured.

> > ARMV7_PERFCTR_PC_IMM_BRANCH seems to do a better job of counting both
> > taken and not-taken.  So I've attached a patch to change the definition
> > for Cortex A9.
>
> Well, it also only considers immediate branches so whilst it might
> satisfy your test, I think that overall it's a less meaningful number.

I guess there isn't more info available about which branches exactly are
counted by all the events?  I've gone through the trouble of writing such
tests to find out experimentally what various counters count for x86, it
would be sad to have to do it again for ARM.

> (b) start replacing our generalised events with HW_OP_UNSUPPORTED and force
>     the user to use raw events. I agree this isn't very friendly, but it's
>     better than giving them crazy results [for example, we currently report
>     more cache misses than cache references on A9 iirc].
>
> Personally, I'm in favour of (b) and getting userspace to provide the user
> with a CPU-specific event listing and then translate this to raw events
> using something like libpfm.

I agree 100%, but it's an unpopular opinion on linux-kernel.  (Note that
I'm the one who contributed ARM Cortex A8/A9 support to both libpfm4 and
PAPI).
Since the generalized events are there and ABI though, people are going to
use them.  That's why I've been writing tests that check them to see
exactly what they are measuring.

It's still an important issue to know what "branches" measures, just it
probably shouldn't be a kernel issue like it's become.

Vince
vweaver1@eecs.utk.edu
* Re: [patch] perf: ARMv7 wrong "branches" generalized instruction
From: Måns Rullgård @ 2011-08-10 19:16 UTC
To: linux-kernel

Vince Weaver <vweaver1@eecs.utk.edu> writes:

> On Wed, 10 Aug 2011, Will Deacon wrote:
>
>> > It turns out the branches event used (ARMV7_PERFCTR_PC_WRITE) only seems
>> > to count taken branches.
>>
>> It also counts exceptions and instructions that write to the PC.
>
> are those more common than not-taken branches?

Every function return and every PLT call would fall in this category.

> I'd think branch predictor statistics will be a bit off if only taken
> instructions are measured.

Clearly.

--
Måns Rullgård
mans@mansr.com
* Re: [patch] perf: ARMv7 wrong "branches" generalized instruction
From: Will Deacon @ 2011-08-10 22:07 UTC
To: Vince Weaver
Cc: linux-kernel, sam wang, Ingo Molnar, Peter Zijlstra, Paul Mackerras,
    Arnaldo Carvalho de Melo, Stephane Eranian

On Wed, Aug 10, 2011 at 08:01:20PM +0100, Vince Weaver wrote:
> On Wed, 10 Aug 2011, Will Deacon wrote:
>
> > > It turns out the branches event used (ARMV7_PERFCTR_PC_WRITE) only seems
> > > to count taken branches.
> >
> > It also counts exceptions and instructions that write to the PC.
>
> are those more common than not-taken branches?  I'd think branch predictor
> statistics will be a bit off if only taken instructions are measured.

They're almost certainly not as common in normal code. However, as I've
mentioned below, ARMV7_PERFCTR_PC_IMM_BRANCH only counts immediate branches
so I don't think this is so useful for general consumption.

> > > ARMV7_PERFCTR_PC_IMM_BRANCH seems to do a better job of counting both
> > > taken and not-taken.  So I've attached a patch to change the definition
> > > for Cortex A9.
> >
> > Well, it also only considers immediate branches so whilst it might
> > satisfy your test, I think that overall it's a less meaningful number.
>
> I guess there isn't more info available about which branches exactly are
> counted by all the events?  I've gone through the trouble of writing such
> tests to find out experimentally what various counters count for x86, it
> would be sad to have to do it again for ARM.

The problem is, it's largely CPU specific. This has improved slightly with
newer cores and there is a PMUv2 document which describes common
architectural events and their reserved numbers, but it is still optional
for the CPU to implement these (notably, Cortex-A9 doesn't implement the
architected instruction counter).
Whilst your tests sound useful, to get any meaningful results out of ARM
you will need to either skip difficult tests or make them CPU specific and
use the raw encodings.

> > (b) start replacing our generalised events with HW_OP_UNSUPPORTED and force
> >     the user to use raw events. I agree this isn't very friendly, but it's
> >     better than giving them crazy results [for example, we currently report
> >     more cache misses than cache references on A9 iirc].
> >
> > Personally, I'm in favour of (b) and getting userspace to provide the user
> > with a CPU-specific event listing and then translate this to raw events
> > using something like libpfm.
>
> I agree 100%, but it's an unpopular opinion on linux-kernel.  (Note that
> I'm the one who contributed ARM Cortex A8/A9 support to both libpfm4 and
> PAPI).

I can see why it's an unpopular idea if it's not necessary on your
architecture but for ARM it's really the only way forward without
continuing to introduce a mess of sparsely populated event tables every
time a new CPU crops up.

> Since the generalized events are there and ABI though, people are going to
> use them.  That's why I've been writing tests that check them to see
> exactly what they are measuring.

Right, but as I say, `instructions' on one core might not be
`instructions' on another core. Just removing the ABI types from ARM will
at least stop people using them. From what I've seen of perf users on ARM,
they start with the ABI events, get some nonsensical results and then
switch exclusively to raw events from then on.

> It's still an important issue to know what "branches" measures, just it
> probably shouldn't be a kernel issue like it's become.

The TRM for the A9 will describe various events for counting
branch-related events. These may be specific to the pipeline and
micro-architecture and therefore you can't really tar them all with the
same brush.

Will
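
[Editorial sketch, not part of the original message.] "Use the raw
encodings" in practice means handing the perf tool a hardware event
number instead of a generic name. The perf command line accepts raw
events written as the letter r followed by the event number in hex (for
example "perf stat -e r1a8"). The helper below, whose name is
hypothetical, only formats such a string; 0x68 is the Cortex-A9 event
number Will quotes later in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: write the perf raw-event spelling ("r" + hex
 * event number) into buf. Returns 0 on success, -1 if buf is too small.
 */
static int format_raw_event(char *buf, size_t len, unsigned event)
{
	int n;

	assert(buf != NULL);

	n = snprintf(buf, len, "r%x", event);
	if (n < 0 || (size_t)n >= len)
		return -1;	/* output error or truncated */
	return 0;
}
```

For event 0x68 this yields "r68", which would be passed as
"perf stat -e r68 <command>". What that number counts is entirely
CPU-specific, which is the point being argued above.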
* Re: [patch] perf: ARMv7 wrong "branches" generalized instruction
From: Ingo Molnar @ 2011-08-11 8:15 UTC
To: Will Deacon
Cc: Vince Weaver, linux-kernel, sam wang, Peter Zijlstra, Paul Mackerras,
    Arnaldo Carvalho de Melo, Stephane Eranian, Linus Torvalds,
    David S. Miller, Thomas Gleixner

* Will Deacon <will.deacon@arm.com> wrote:

> [...] From what I've seen of perf users on ARM, they start with the
> ABI events, get some nonsensical results and then switch
> exclusively to raw events from then on.

Could you give a specific example of such nonsensical output on ARM?
Bugs should be fixed and yes i can see that if ARM produces
nonsensical output then people won't use that nonsensical output
(duh). Please fix or improve the nonsensical output.

Btw., i have a pretty different experience from you: people will use
most of the (default) generic events pretty happily because most
developers have an adequate notion of 'cycles, branches,
instructions' and they will *STOP* at the boundary of having to go
into CPU microarchitecture specific details ...

People just use the tool defaults in most cases, only a select few
will bother with model specific events. Life is short and learning
CPU microarchitecture specific details is a long and difficult
process that is not justified for most users/developers - not in
small part because the juicy bits of how specific CPUs really work
(and what raw events correspond to those details) are behind an NDA
protected curtain, only accessible to a few privileged people ...

That is not what Linux interfaces are about in my opinion.
So what you and Vince are suggesting, to dumb down the kernel parts
of perf and force users into raw or microarchitecture specific events
actually *reduces* the user-base very significantly - while in
practice even just cycles, instructions and branches level analysis
handles 99% of the everyday performance analysis needs ...

We saw how the "push CPU specific events to users and tooling"
concept didn't work with oprofile - why do we have to re-discuss this
part of failed Linux history again and again?

The approach Vince and you are suggesting is literally sacrificing
99% of utility for 1% of the users - a not very smart approach. I
don't mind accommodating the needs of 1% of power-users (at all), but:

   *NOT AT THE EXPENSE OF THE COMMON CASE*.

doh.

> > I agree 100%, but it's an unpopular opinion on linux-kernel.
> > (Note that I'm the one who contributed ARM Cortex A8/A9 support
> > to both libpfm4 and PAPI).
>
> I can see why it's an unpopular idea if it's not necessary on your
> architecture but for ARM it's really the only way forward without
> continuing to introduce a mess of sparsely populated event tables
> every time a new CPU crops up.

Generic events are not about lkml popularity ... it's about
usability.

And why would it have to be implemented in a messy way? We have a
number of CPU specific tables (and quirks) on x86 as well - that's
the job of pretty much any kernel driver, to abstract things away in
a per CPU, often per device (and sometimes even per card variant
type) manner.

We literally have more than 7 million lines of drivers/* code that
provides generic abstractions - not just a few thousand lines of raw
PCI operations space where user-space can write magic values to ...

Similarly, for perf events we don't do a raw binary ABI mess for
really good reasons: tools and users do not think in CPU and model
specific hexa numbers, they operate in higher level concepts. That is
a basic quality of implementation property.
It's the *job* of the kernel to abstract things away, we don't shy
away from that ...

> > Since the generalized events are there and ABI though, people are
> > going to use them.  That's why I've been writing tests that check
> > them to see exactly what they are measuring.
>
> Right, but as I say, `instructions' on one core might not be
> `instructions' on another core. Just removing the ABI types from
> ARM will at least stop people using them. [...]

What are you talking about? Sure ARM Cortex A9 will execute
instructions of a user-space application just as much as do other ARM
CPUs. Sure as it executes that app it will execute instructions, you
can single-step through it and thus you can count how many
instructions it has executed, right?

> > It's still an important issue to know what "branches" measures,
> > just it probably shouldn't be a kernel issue like it's become.
>
> The TRM for the A9 will describe various events for counting
> branch-related events. These may be specific to the pipeline and
> micro-architecture and therefore you can't really tar them all with
> the same brush.

The best generic event is the one that the coder/user of a user-space
app sees the CPU executing instructions/branches/etc. If the PMU
cannot give that then the (statistically) next best approximation
should be provided.

If you think about it that is a pretty unambiguous definition: each
ARM core will execute user-space applications and the same
(compatible) assembly routine results in the same end result, in the
same number of visible assembly instructions, right?

In practice most people will use the default event: cycles for perf
stat/top and the default 'perf stat' output.

We've also had numerous cases where kernel developers went way beyond
those metrics and appreciated that tooling would provide good
approximations for all those events regardless of what CPU type the
workload was running on (and sometimes even documented this in the
changelog).
So having generic events is not some fancy, unused property, but a
pretty important measurement aspect of perf.

Thanks,

	Ingo
* Re: [patch] perf: ARMv7 wrong "branches" generalized instruction
From: Will Deacon @ 2011-08-11 9:16 UTC
To: Ingo Molnar
Cc: Vince Weaver, linux-kernel, sam wang, Peter Zijlstra, Paul Mackerras,
    Arnaldo Carvalho de Melo, Stephane Eranian, Linus Torvalds,
    David S. Miller, Thomas Gleixner

Hi Ingo,

Thanks for your input on this.

On Thu, Aug 11, 2011 at 09:15:25AM +0100, Ingo Molnar wrote:
>
> * Will Deacon <will.deacon@arm.com> wrote:
>
> > [...] From what I've seen of perf users on ARM, they start with the
> > ABI events, get some nonsensical results and then switch
> > exclusively to raw events from then on.
>
> Could you give a specific example of such nonsensical output on ARM?
> Bugs should be fixed and yes i can see that if ARM produces
> nonsensical output then people won't use that nonsensical output
> (duh). Please fix or improve the nonsensical output.

Sure. On Cortex-A9 I see this:

 Performance counter stats for 'ls':

              2862 cache-references
             20658 cache-misses              #  721.803 % of all cache refs

       0.019123136 seconds time elapsed

This is because we're actually reporting cache hits for cache-references in
an attempt to provide something remotely similar. I agree that this is
broken, which is why I'm leaning towards a more liberal use of
HW_OP_UNSUPPORTED.

> Btw., i have a pretty different experience from you: people will use
> most of the (default) generic events pretty happily because most
> developers have an adequate notion of 'cycles, branches,
> instructions' and they will *STOP* at the boundary of having to go
> into CPU microarchitecture specific details ...

Ok, perhaps my experience comes from my sheltered life in the company of
micro-architecture nerds :) Although, I think that if the generic events
were more applicable to ARM I would be seeing what you see.
> People just use the tool defaults in most cases, only a select few
> will bother with model specific events. Life is short and learning
> CPU microarchitecture specific details is a long and difficult
> process that is not justified for most users/developers - not in
> small part because the juicy bits of how specific CPUs really work
> (and what raw events correspond to those details) are behind an NDA
> protected curtain, only accessible to a few privileged people ...
>
> That is not what Linux interfaces are about in my opinion.

I completely agree with you on avoiding these interfaces in general.
However, the ARM event numbers aren't under NDA and even if we could put
them in the kernel, there's no way of communicating that to the user
because the events don't match up well with what the ABI expects.

For example, an event that may be useful on A15 is:

0x6d: Exclusive instruction speculatively executed - STREX pass

(this could be used for investigating lock contention)

yet users are currently forced to use a raw event for this anyway. This is
fine for the more esoteric events like

0x40: Counts the number of Java bytecodes being decoded, including
      speculative ones.

where only a select few will care about it.

> So what you and Vince are suggesting, to dumb down the kernel parts
> of perf and force users into raw or microarchitecture specific events
> actually *reduces* the user-base very significantly - while in
> practice even just cycles, instructions and branches level analysis
> handles 99% of the everyday performance analysis needs ...

No. I don't think that the kernel part should be dumbed down, nor do I
think that the user should have to play with hex numbers. I just think that
we should allow a way to communicate named CPU-specific events to the user.
We have userspace libraries that do this, but if you want to avoid the
OProfile mess then we could look at putting this into the kernel (although
I worry that these tables will become large).
> We saw how the "push CPU specific events to users and tooling"
> concept didn't work with oprofile - why do we have to re-discuss this
> part of failed Linux history again and again?
>
> The approach Vince and you are suggesting is literally sacrificing
> 99% of utility for 1% of the users - a not very smart approach. I
> don't mind accommodating the needs of 1% of power-users (at all), but:
>
>    *NOT AT THE EXPENSE OF THE COMMON CASE*.
>
> doh.

So let's leave the common-case as a `best effort' attempt to match the ABI
events to whatever we have on the running CPU and come up with a way to
augment the set of named events provided by perf.

> > Right, but as I say, `instructions' on one core might not be
> > `instructions' on another core. Just removing the ABI types from
> > ARM will at least stop people using them. [...]
>
> What are you talking about? Sure ARM Cortex A9 will execute
> instructions of a user-space application just as much as do other ARM
> CPUs. Sure as it executes that app it will execute instructions, you
> can single-step through it and thus you can count how many
> instructions it has executed, right?

On A9:

instructions (0x68):
	Instructions coming out of the core renaming stage

	Counts the number of instructions going through the Register Renaming
	stage. This number is an approximate number of the total number of
	instructions speculatively executed, and even more approximate of
	the total number of instructions architecturally executed. The
	approximation depends mainly on the branch misprediction rate.

On A8:

instructions (0x08):
	Instruction architecturally executed

The problem being that the A9 PMU event really doesn't tie back to the
programmer's model. It's an approximation though, so it's alright provided
you don't try to compare it between CPUs.
> If you think about it that is a pretty unambiguous definition: each
> ARM core will execute user-space applications and the same
> (compatible) assembly routine results in the same end result, in the
> same number of visible assembly instructions, right?

Yes, from the programmer's model it's the same, but the event counts might
not correlate so well with that. Sometimes you may need to have two event
counters and sum the total, for example (the earlier cache-references
should be hits + misses).

> In practice most people will use the default event: cycles for perf
> stat/top and the default 'perf stat' output.

We have a dedicated cycle counter, so no issues there.

> We've also had numerous cases where kernel developers went way beyond
> those metrics and appreciated that tooling would provide good
> approximations for all those events regardless of what CPU type the
> workload was running on (and sometimes even documented this in the
> changelog).
>
> So having generic events is not some fancy, unused property, but a
> pretty important measurement aspect of perf.

Ok, but how can we expose the rest of the CPU events without using raw
events?

Cheers,

Will
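
[Editorial sketch, not part of the original message.] The arithmetic
behind the odd "721.803 %" figure, and behind the "hits + misses" remark
above, can be written out in a few lines of C. The function names are
hypothetical. The inputs are the two counts from the Cortex-A9 'perf
stat' output quoted in this thread, where the number labelled
cache-references is in fact a hit count.

```c
#include <assert.h>

/* Percentage of part relative to whole; returns -1.0 if whole is 0. */
static double percent_of(unsigned long part, unsigned long whole)
{
	if (whole == 0)
		return -1.0;
	return 100.0 * (double)part / (double)whole;
}

/*
 * When the hardware only offers separate hit and miss events, the
 * number of references is their sum.
 */
static unsigned long references_from(unsigned long hits,
				     unsigned long misses)
{
	unsigned long refs = hits + misses;

	assert(refs >= hits);	/* guard against wrap-around */
	return refs;
}
```

Dividing 20658 misses by the 2862 reported "references" reproduces the
721.8 % shown by perf. Treating 2862 as hits instead gives 23520
references and a miss ratio of roughly 87.8 %, which is at least a
meaningful percentage.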
* Re: [patch] perf: ARMv7 wrong "branches" generalized instruction
From: Ingo Molnar @ 2011-08-12 10:34 UTC
To: Will Deacon
Cc: Vince Weaver, linux-kernel, sam wang, Peter Zijlstra, Paul Mackerras,
    Arnaldo Carvalho de Melo, Stephane Eranian, Linus Torvalds,
    David S. Miller, Thomas Gleixner

* Will Deacon <will.deacon@arm.com> wrote:

> Hi Ingo,
>
> Thanks for your input on this.
>
> On Thu, Aug 11, 2011 at 09:15:25AM +0100, Ingo Molnar wrote:
> >
> > * Will Deacon <will.deacon@arm.com> wrote:
> >
> > > [...] From what I've seen of perf users on ARM, they start with the
> > > ABI events, get some nonsensical results and then switch
> > > exclusively to raw events from then on.
> >
> > Could you give a specific example of such nonsensical output on ARM?
> > Bugs should be fixed and yes i can see that if ARM produces
> > nonsensical output then people won't use that nonsensical output
> > (duh). Please fix or improve the nonsensical output.
>
> Sure. On Cortex-A9 I see this:
>
>
>  Performance counter stats for 'ls':
>
>               2862 cache-references
>              20658 cache-misses              #  721.803 % of all cache refs

Well, you said 'instructions' in your mail:

 >> Right, but as I say, `instructions' on one core might not be
 >> `instructions' on another core. Just removing the ABI types from
 >> ARM will at least stop people using them. [...]

So can we agree that cycles, instructions and branches are fine on
ARM?

Even discounting hits/misses/references restrictions that you are
running into, cache events are approximate on x86 too - most PMUs
have random restrictions on what can be measured and what not - the
cache access critical path gate count is not something you want to
lengthen with much PMU complexity ...
>        0.019123136 seconds time elapsed
>
> This is because we're actually reporting cache hits for
> cache-references in an attempt to provide something remotely
> similar. I agree that this is broken, which is why I'm leaning
> towards a more liberal use of HW_OP_UNSUPPORTED.

If there's no 'references' event on that CPU then there are several
solutions we could try.

Firstly, we could extend:

enum perf_hw_cache_op_result_id {
	PERF_COUNT_HW_CACHE_RESULT_ACCESS	= 0,
	PERF_COUNT_HW_CACHE_RESULT_MISS		= 1,

	PERF_COUNT_HW_CACHE_RESULT_MAX,		/* non-ABI */
};

with a third, RESULT_HIT variant, and the architecture could fill in
whichever events it can count. User-space could then request all
three and do the trivial arithmetic when one of them is missing as
'not counted'.

Secondly, we could let the kernel do the arithmetic: when 'accesses'
and 'misses' are requested, the kernel could start a 'hits' and
'misses' event and do the addition internally. This couples the
events though, in a way not visible to user-space, which might
complicate things.

A third variant would be a variation of the second solution: to
create a standalone 'compound' event by running two hw events (hits
and misses), when user-space requests 'references'.

> > Btw., i have a pretty different experience from you: people will
> > use most of the (default) generic events pretty happily because
> > most developers have an adequate notion of 'cycles, branches,
> > instructions' and they will *STOP* at the boundary of having to
> > go into CPU microarchitecture specific details ...
>
> Ok, perhaps my experience comes from my sheltered life in the company of
> micro-architecture nerds :) [...]
That's an excusable sin, happens to most folks who specialize in PMU
fun - they just don't get the point of "dumbing down" all those
nifty, totally exciting microarchitectural details ;-)

Many times 'as many details as possible' is my preference as well - i
like 'perf stat -ddd' output a lot (after first getting a simplified
overview run). So successive runs of:

	perf stat
	perf stat -d
	perf stat -dd
	perf stat -ddd

... tell the same fundamental story with increasing 'resolution' and
detail of analysis.

That does not mean that my admittedly odd and occasionally extreme
preferences as an expert are what should dictate the design though.

> [...] Although, I think that if the generic events were more
> applicable to ARM I would be seeing what you see.
>
> > People just use the tool defaults in most cases, only a select
> > few will bother with model specific events. Life is short and
> > learning CPU microarchitecture specific details is a long and
> > difficult process that is not justified for most users/developers
> > - not in small part because the juicy bits of how specific CPUs
> > really work (and what raw events correspond to those details) are
> > behind an NDA protected curtain, only accessible to a few
> > privileged people ...
> >
> > That is not what Linux interfaces are about in my opinion.
>
> I completely agree with you on avoiding these interfaces in
> general. However, the ARM event numbers aren't under NDA and even
> if we could put them in the kernel, there's no way of communicating
> that to the user because the events don't match up well with what
> the ABI expects.

Well, can you see other problems beyond the hits/misses/references
problem? I think we can solve that one.

> For example, an event that may be useful on A15 is:
>
> 0x6d: Exclusive instruction speculatively executed - STREX pass
>
> (this could be used for investigating lock contention)
>
> yet users are currently forced to use a raw event for this anyway.
> This is fine for the more esoteric events like
>
> 0x40: Counts the number of Java bytecodes being decoded, including
>       speculative ones.
>
> where only a select few will care about it.

We could certainly extend the number of generic events. What are
'exclusive instructions' on ARM - ones that do atomic operations?

With any generalization, there will be a somewhat fuzzy boundary
between events that are best kept raw and events that are worth
generalizing. So the fact that you can find esoteric sounding but
useful events that probably only apply to ARM does not invalidate the
general idea of abstracting out cross-CPU concepts.

I personally would rather err on the side of generalizing too many
than too few events:

 - If a given event cannot be expressed on a CPU model then that's
   not a big problem: it literally does not exist on that CPU and
   nothing we can do will create it out of thin air. It will remain
   obscure and we can live with that.

 - But if a useful event is only accessible via the raw ABI, and it
   turns out to be present on other CPUs as well and tools would like
   to make use of it, then it would be actively harmful if tools used
   the raw ABI. If generalized it can be used more widely.

> > So what you and Vince are suggesting, to dumb down the kernel
> > parts of perf and force users into raw or microarchitecture
> > specific events actually *reduces* the user-base very
> > significantly - while in practice even just cycles, instructions
> > and branches level analysis handles 99% of the everyday
> > performance analysis needs ...
>
> No. I don't think that the kernel part should be dumbed down, nor
> do I think that the user should have to play with hex numbers. I
> just think that we should allow a way to communicate named
> CPU-specific events to the user. We have userspace libraries that
> do this, but if you want to avoid the OProfile mess then we could
> look at putting this into the kernel (although I worry that these
> tables will become large).
Size is not an issue.

> > We saw how the "push CPU specific events to users and tooling"
> > concept didn't work with oprofile - why do we have to re-discuss
> > this part of failed Linux history again and again?
> >
> > The approach Vince and you are suggesting is literally
> > sacrificing 99% of utility for 1% of the users - a not very smart
> > approach. I don't mind accommodating the needs of 1% of
> > power-users (at all), but:
> >
> >     *NOT AT THE EXPENSE OF THE COMMON CASE*.
> >
> > doh.
>
> So let's leave the common-case as a `best effort' attempt to match
> the ABI events to whatever we have on the running CPU and come up
> with a way to augment the set of named events provided by perf.

Correct - as long as 'best effort' is still statistically equivalent
to the real, 'ideal' event.

For the specific cache hits/misses/references example you cited i
think we need to do better than what we have currently: clearly we
don't want 'references' to be a smaller integer value than 'misses'.

> > > Right, but as I say, `instructions' on one core might not be
> > > `instructions' on another core. Just removing the ABI types
> > > from ARM will at least stop people using them. [...]
> >
> > What are you talking about? Sure ARM Cortex 9 will execute
> > instructions of a user-space application just as much as do other
> > ARM CPUs. Sure as it executes that app it will execute
> > instructions, you can single-step through it and thus you can
> > count how many instructions it has executed, right?
>
> On A9:
>
> instructions (0x68):
>   Instructions coming out of the core renaming stage
>
> Counts the number of instructions going through the Register Renaming
> stage. This number is an approximate number of the total number of
> instructions speculatively executed, and even more approximate of
> the total number of instructions architecturally executed. The
> approximation depends mainly on the branch misprediction rate.
>
> On A8:
>
> instructions (0x08):
>   Instruction architecturally executed
>
> The problem being that the A9 PMU event really doesn't tie back to
> the programmer's model. It's an approximation though, so it's
> alright provided you don't try to compare it between CPUs.

ok - i think this is an example where the definition is statistically
equivalent - i.e. 'good enough'.

Cross-CPU comparisons are never obvious in any case: compilers
generate different code on different CPUs and different systems tend
to have different user-space. 99% of the comparisons are done in the
same system, just with different versions of the software running on
it.

> > If you think about it that is a pretty unambiguous definition:
> > each ARM core will execute user-space applications and the same
> > (compatible) assembly routine results in the same end result, in
> > the same number of visible assembly instructions, right?
>
> Yes, from the programmer's model it's the same, but the event
> counts might not correlate so well with that. [...]

Right.

> [...] Sometimes you may need to have two event counters and sum the
> total, for example (the earlier cache-references should be hits +
> misses).

Yes - and i think the cache event artifacts are well beyond the
'statistically equivalent' noise and we need to fix that imprecision.

> > In practice most people will use the default event: cycles for
> > perf stat/top and the default 'perf stat' output.
>
> We have a dedicated cycle counter, so no issues there.

Good - this makes 90% of the users happy already ;-)

> > We've also had numerous cases where kernel developers went way
> > beyond those metrics and appreciated that tooling would provide
> > good approximations for all those events regardless of what CPU
> > type the workload was running on (and sometimes even documented
> > this in the changelog).
> >
> > So having generic events is not some fancy, unused property, but
> > a pretty important measurement aspect of perf.
>
> Ok, but how can we expose the rest of the CPU events without using
> raw events?

I think Corey sent a patch some time ago (a year ago?) that allowed
CPU specific events to be defined by the kernel. I think it would be
useful - i think we've generalized most of the core stuff that's
worth generalizing so we can start populating the more esoteric
tables as well.

These events could be used via some self-explanatory syntax, such as:

    -e cpu::instr_strex

or so - and would map to 0x6d on A9. Hm?

Thanks,

	Ingo
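As a rough illustration of the "-e cpu::instr_strex" idea above, the
sketch below splits a "pmu::event" string and looks it up in a small
table to get a raw event number. Everything here is an assumption made
for illustration: the struct, the function name and the table are
hypothetical and are not taken from Corey's patch or from any existing
perf code, and the single table entry just reuses the name and number
quoted in the mail.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical table entry; the layout is illustrative only. */
struct named_event {
    const char *pmu;    /* qualifier before "::" */
    const char *name;   /* event name after "::" */
    unsigned int raw;   /* raw event number */
};

/* Example data: name and number as quoted in the mail above. */
static const struct named_event example_events[] = {
    { "cpu", "instr_strex", 0x6d },
};

/*
 * Resolve a "pmu::event" string to a raw event number.
 * Returns 0 and stores the number in *raw on success, or -1 if the
 * string is malformed or the event is not in the table.
 */
static int lookup_named_event(const char *spec, unsigned int *raw)
{
    const char *sep = strstr(spec, "::");
    size_t pmu_len, i;

    /* Need a non-empty PMU part and a non-empty event part. */
    if (!sep || sep == spec || sep[2] == '\0')
        return -1;
    pmu_len = (size_t)(sep - spec);

    for (i = 0; i < sizeof(example_events) / sizeof(example_events[0]); i++) {
        const struct named_event *e = &example_events[i];

        if (strlen(e->pmu) == pmu_len &&
            strncmp(e->pmu, spec, pmu_len) == 0 &&
            strcmp(e->name, sep + 2) == 0) {
            *raw = e->raw;
            return 0;
        }
    }
    return -1;
}
```

An unknown name returns an error here instead of falling back to a
nearby event, so the caller can tell the user that the event does not
exist on this CPU.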
* Re: [patch] perf: ARMv7 wrong "branches" generalized instruction
  2011-08-12 10:34 ` Ingo Molnar
@ 2011-08-15 11:18 ` Will Deacon
From: Will Deacon @ 2011-08-15 11:18 UTC
To: Ingo Molnar
Cc: Vince Weaver, linux-kernel, sam wang, Peter Zijlstra,
    Paul Mackerras, Arnaldo Carvalho de Melo, Stephane Eranian,
    Linus Torvalds, David S. Miller, Thomas Gleixner

Hi Ingo,

Sorry for the delayed response, I was away this weekend.

On Fri, Aug 12, 2011 at 11:34:26AM +0100, Ingo Molnar wrote:
> So can we agree that cycles, instructions and branches are fine on
> ARM?

Cycles are easy and should work everywhere.
Instructions aren't portable between CPUs, but we've established that's ok.
Branches are a bit more tricky since most of the time we can only count
taken branches.

The set of branch events we have on ARM present the same problem as the
cache events in that you really need to combine them to get something
meaningful back. For example, A15 can count:

    0x10 Mispredicted or not predicted branch speculatively executed
    0x12 Predictable branch speculatively executed
    0x76 Branch speculatively executed - Immediate branch
    0x78 Branch speculatively executed - Procedure return
    0x79 Branch speculatively executed - Indirect branch

So you can use 0x10/0x12 to get a handle on the misprediction rate. The
other events may be useful for establishing the distribution of branch
types [and you could add them all up to get a rough figure on the number
of branches].

A9 can do:

    0x10 Mispredicted or not predicted branch speculatively executed
    0x12 Predictable branch speculatively executed
    0x0C Instruction architecturally executed, condition code check
         pass, software change of the PC
    0x0D Instruction architecturally executed, immediate branch
    0x6E Predictable function returns

Note that we can't count indirect branch instructions.

> If there's no 'references' event on that CPU then there's several
> solutions we could do.
>
> Firstly, we could extend:
>
>   enum perf_hw_cache_op_result_id {
>         PERF_COUNT_HW_CACHE_RESULT_ACCESS       = 0,
>         PERF_COUNT_HW_CACHE_RESULT_MISS         = 1,
>
>         PERF_COUNT_HW_CACHE_RESULT_MAX,         /* non-ABI */
>   };
>
> with a third, RESULT_HIT variant, and the architecture could fill in
> whichever events it can count. User-space could then request all
> three and do the trivial arithmetics when one of them is missing as
> 'not counted'.

If you're not opposed to extending the ABI events with (arguably
redundant) additional events, then I'm more than happy with this approach.

> Secondly, we could let the kernel do the arithmetics: when 'accesses'
> and 'misses' are requested, the kernel could start a 'hits' and
> 'misses' event and do the addition internally. This couples the
> events though, in a way not visible to user-space, which might
> complicate things.
>
> A third variant would be a variation of the second solution: to
> create a standalone 'compound' event by running two hw events (hits
> and misses), when user-space requests 'references'.

The problem with these two solutions is that the compound event may not
always be as simple as a single addition. You may need a number of events
to plug into an arbitrary expression in order to achieve something that
relates back to the programmer's model.

> > > That is not what Linux interfaces are about in my opinion.
> >
> > I completely agree with you on avoiding these interfaces in
> > general. However, the ARM event numbers aren't under NDA and even
> > if we could put them in the kernel, there's no way of communicating
> > that to the user because the events don't match up well with what
> > the ABI expects.
>
> Well, can you see other problems beyond the hits/misses/references
> problem? I think we can solve that one.

There's the branches issue I've highlighted above.
We also can't normally distinguish between read and write misses for
caches and TLBs so we report the combined total for each, meaning that
they're always the same. Finally, our L2 cache may be off-chip and so we
have to plug it in as a separate PMU rather than include it in the CPU
cache map (this leads back to the entirely separate discussion about how
to interface the perf tool with multiple PMUs).

> > For example, an event that may be useful on A15 is:
> >
> >   0x6d: Exclusive instruction speculatively executed - STREX pass
> >
> > (this could be used for investigating lock contention)
> >
> > yet users are currently forced to use a raw event for this anyway.
> > This is fine for the more esoteric events like
> >
> >   0x40: Counts the number of Java bytecodes being decoded, including
> >   speculative ones.
> >
> > where only a select few will care about it.
>
> We could certainly extend the number of generic events. What are
> 'exclusive instructions' on ARM - ones that do atomic operations?

Yes, they're used for atomic sections of code where you don't want
another CPU to modify a variable on which you're operating and tend to be
used for the cmpxchg part of spinlocks. Multi-core CPUs have events to
report back the STREX_PASS and STREX_FAIL (somebody stomped on my
variable so I have to repeat the `transaction') so you can get an
indication of lock contention.

> With any generalization, there will be a somewhat fuzzy boundary
> between events that are best kept raw and events that are worth
> generalizing. So the fact that you can find esoteric sounding but
> useful events that probably only apply to ARM does not invalidate the
> general idea of abstracting out cross-CPU concepts.

Ok, but I think that for some events on some CPUs it may be better to say
OP_UNSUPPORTED rather than mislead the user if the approximation is too
poor (c.f. cache references from earlier on).
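A minimal sketch of the user-space arithmetic that the RESULT_HIT
proposal quoted earlier in this message implies, combined with the
"say unsupported rather than mislead" point: given three cache counts
where some may be missing, fill in one missing value from the relation
accesses = hits + misses, and report failure otherwise. The struct, the
COUNT_MISSING sentinel and the function name are assumptions made for
this example and do not exist in perf.

```c
#include <stdint.h>

/* Hypothetical sentinel: the event could not be counted on this CPU. */
#define COUNT_MISSING UINT64_MAX

struct cache_counts {
    uint64_t accesses;
    uint64_t hits;
    uint64_t misses;
};

/*
 * Fill in at most one missing field using accesses = hits + misses.
 * Returns 0 if all three fields are populated afterwards, or -1 if
 * more than one field is missing or a subtraction would go negative
 * (for example, 'misses' larger than 'accesses').
 */
static int cache_counts_complete(struct cache_counts *c)
{
    int missing = (c->accesses == COUNT_MISSING) +
                  (c->hits == COUNT_MISSING) +
                  (c->misses == COUNT_MISSING);

    if (missing > 1)
        return -1;

    if (c->accesses == COUNT_MISSING) {
        c->accesses = c->hits + c->misses;
    } else if (c->hits == COUNT_MISSING) {
        if (c->misses > c->accesses)
            return -1;
        c->hits = c->accesses - c->misses;
    } else if (c->misses == COUNT_MISSING) {
        if (c->hits > c->accesses)
            return -1;
        c->misses = c->accesses - c->hits;
    }
    return 0;
}
```

The error return plays the role of "unsupported": a tool using this
would print "not counted" instead of a number such as a 'references'
total smaller than 'misses'.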
> I personally would rather err on the side of generalizing too many
> than too few events:
>
>  - If a given event cannot be expressed on a CPU model then that's not
>    a big problem: it literally does not exist on that CPU and nothing
>    we can do will create it out of thin air. It will remain obscure
>    and we can live with that.

Agreed.

>  - But if a useful event is only accessible via the raw ABI, and it
>    turns out to be present on other CPUs as well and tools would like
>    to make use of it, then it would be actively harmful if tools used
>    the raw ABI. If generalized it can be used more widely.

Sure and that way it becomes a named event which gets rid of the horrible
hex.

> > > So what you and Vince are suggesting, to dumb down the kernel
> > > parts of perf and force users into raw or microarchitecture
> > > specific events actually *reduces* the user-base very
> > > significantly - while in practice even just cycles, instructions
> > > and branches level analysis handles 99% of the everyday
> > > performance analysis needs ...
> >
> > No. I don't think that the kernel part should be dumbed down, nor
> > do I think that the user should have to play with hex numbers. I
> > just think that we should allow a way to communicate named
> > CPU-specific events to the user. We have userspace libraries that
> > do this, but if you want to avoid the OProfile mess then we could
> > look at putting this into the kernel (although I worry that these
> > tables will become large).
>
> Size is not an issue.

Ok, I just don't want this to get viewed in the same light as the OMAP
clock data that Linus objected to.

> > So let's leave the common-case as a `best effort' attempt to match
> > the ABI events to whatever we have on the running CPU and come up
> > with a way to augment the set of named events provided by perf.
>
> Correct - as long as 'best effort' is still statistically equivalent
> to the real, 'ideal' event.
>
> For the specific cache hits/misses/references example you cited i
> think we need to do better than what we have currently: clearly we
> don't want 'references' to be a smaller integer value than 'misses'.

If you're happy to add the new ABI event, I'll update the ARM backend.

> > > We've also had numerous cases where kernel developers went way
> > > beyond those metrics and appreciated that tooling would provide
> > > good approximations for all those events regardless of what CPU
> > > type the workload was running on (and sometimes even documented
> > > this in the changelog).
> > >
> > > So having generic events is not some fancy, unused property, but
> > > a pretty important measurement aspect of perf.
> >
> > Ok, but how can we expose the rest of the CPU events without using
> > raw events?
>
> I think Corey sent a patch some time ago (a year ago?) that allowed
> CPU specific events to be defined by the kernel. I think it would be
> useful - i think we've generalized most of the core stuff that's
> worth generalizing so we can start populating the more esoteric
> tables as well.
>
> These events could be used via some self-explanatory syntax, such as:
>
>    -e cpu::instr_strex
>
> or so - and would map to 0x6d on A9. Hm?

    cpu::instr_strex_pass => 0x63
    cpu::instr_strex_fail => 0x64

I would *really* like to see this in perf as I think it opens up a whole
set of useful events that are currently not being used as much as they
could be. Furthermore, the cpu:: qualification can tie back to a PMU
instance, so our L2 problems can be fixed with:

    l2cc::evictions

for example (actually, I already have some hacks in place for this but
it's all at the hex level so that would look like rn:1 - event 0x1 on
PMU 0xn).
I would also like to see this sort of syntax for software events, where
you can drill down into something like PERF_COUNT_SW_EMULATION_FAULTS to
see actually which groups of instructions are being emulated:

    emulation_faults::fp

to count floating point emulation, for example.

Cheers,

Will
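One way the STREX_PASS and STREX_FAIL counts mentioned earlier in this
message could be turned into a contention indicator is the fraction of
exclusive-store attempts that failed. This is only a sketch: the
function name is hypothetical, and treating fail / (pass + fail) as the
indicator is an assumption, since the message does not prescribe a
formula.

```c
#include <stdint.h>

/*
 * Fraction of STREX attempts that failed, in the range 0.0 to 1.0.
 * Returns -1.0 when neither event fired, since no ratio can be formed.
 */
static double strex_fail_fraction(uint64_t pass, uint64_t fail)
{
    uint64_t total = pass + fail;

    if (total == 0)
        return -1.0;
    return (double)fail / (double)total;
}
```

With named events of the form suggested above (cpu::instr_strex_pass
and cpu::instr_strex_fail), a tool could read both counters over the
same interval and feed the two values to a helper like this one.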
* Re: [patch] perf: ARMv7 wrong "branches" generalized instruction
  2011-08-11  8:15 ` Ingo Molnar
  2011-08-11  9:16 ` Will Deacon
@ 2011-08-12  4:35 ` Vince Weaver
From: Vince Weaver @ 2011-08-12 4:35 UTC
To: Ingo Molnar
Cc: Will Deacon, linux-kernel, sam wang, Peter Zijlstra,
    Paul Mackerras, Arnaldo Carvalho de Melo, Stephane Eranian,
    Linus Torvalds, David S. Miller, Thomas Gleixner

On Thu, 11 Aug 2011, Ingo Molnar wrote:

>
> * Will Deacon <will.deacon@arm.com> wrote:

> So what you and Vince are suggesting, to dumb down the kernel parts
> of perf and force users into raw or microarchitecture specific events
> actually *reduces* the user-base very significantly - while in
> practice even just cycles, instructions and branches level analysis
> handles 99% of the everyday performance analysis needs ...

No, what I want are the generalized events to accurately describe what
is being measured. To do this properly you need a lot more granularity.
PAPI for example has 100+ generalized events, some of which require
multiple hardware events to be combined to get the desired outcome.

Having a "branch" event like that on ARM that ignores not-taken events
is going to drive you nuts when you are trying to sample in your code to
find out why the branch miss rate is so high and you want to find the
loop exits that are predicted poorly (but can't find them because loop
exits tend to be not-taken branches).

Having a "L1-dcache-load" event that includes stores (like on current
AMD) will drive you nuts if you are debugging code and it shows that
somehow these loads are triggering cache-coherency invalidates when you
know that usually only stores can do so, and why would a load-only event
count stores?

Having a tool that gives misleading names to things would be like if I
gave some poor user a copy of gdb that silently set breakpoints a random
offset from where the actual breakpoint should be.
Sure it probably correlates to where you want the code to stop, but it's
not what the tool says it is doing.

So either come up with finer-grained generalized events, or else do a
better job of picking them. The fact that my _extremely_ simple
validation tests keep turning up problems like this indicates that no
one bothered checking these events before they shoved them in the
kernel. Once in, these bad events linger for years and it's not even
possible to tell from userspace what raw event a generalized event maps
to. So anyone who cares is going to use a tool that uses raw events
anyway. (Note I am not saying they'll calculate the raw hex value
themselves. They'll use a pre-existing tool written by your maligned 1%
power users that will pick the proper raw event for them.)

> We saw how the "push CPU specific events to users and tooling"
> concept didn't work with oprofile - why do we have to re-discuss this
> part of failed Linux history again and again?

No one is arguing for oprofile. libpfm4/PAPI or a tool like pfmon, sure.

> The approach Vince and you are suggesting is literally sacrificing
> 99% of utility for 1% of the users - a not very smart approach. I
> don't mind accommodating the needs of 1% of power-users (at all), but:
>
>     *NOT AT THE EXPENSE OF THE COMMON CASE*.
>
> [...]
>
> We literally have more than 7 million lines of drivers/* code that
> provides generic abstractions - not just a few thousand lines of raw
> PCI operations space where user-space can write magic values to ...
>
> [...]
>
> It's the *job* of the kernel to abstract things away, we don't shy
> away from that ...

I get the impression if you were the graphics maintainer you'd specify
all drivers should use a 1024x768x16bpp generic abstraction and dither
or scale all devices to match this. This would be a nice abstraction
that would make graphics programming oh so much easier for the casual
programmer and it provides 99% of what most users want. The 1% of power
users are unimportant.
Also one could only use an officially blessed list of colors appearing
in some obscure file under /sys. Also access to 3D functionality would
be blocked until the people wanting 3D had properly developed a generic
concept of the color "mauve" that could be applied across the board,
even on black+white only hardware.

> So having generic events is not some fancy, unused property, but a
> pretty important measurement aspect of perf.

perf the userspace tool or perf_event the kernel ABI?

Vince
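The kind of "extremely simple" validation check described in this
message can be sketched as a comparison of a measured count against the
count a known test loop should produce. For the loop in the opening
mail of this thread the expected value is 500,000*3 branches, and a
counter that skips the not-taken "bge" would report roughly 1,000,000
instead. The helper name and the percentage-tolerance scheme below are
assumptions for illustration, not the actual test code.

```c
#include <stdint.h>

/*
 * Returns 1 if 'measured' is within 'tolerance_pct' percent of
 * 'expected', otherwise 0. Integer arithmetic only: the check is
 * |measured - expected| * 100 <= expected * tolerance_pct.
 * The multiplications can wrap for counts near 2^64 / 100, which is
 * far above the values used here.
 */
static int count_within_tolerance(uint64_t measured, uint64_t expected,
                                  unsigned int tolerance_pct)
{
    uint64_t diff = measured > expected ? measured - expected
                                        : expected - measured;

    return diff * 100 <= expected * (uint64_t)tolerance_pct;
}
```

With a 1% tolerance, a reading of 1,000,000 against an expected
1,500,000 is rejected, while small overcounts from loop setup and
teardown are accepted.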