[patch] perf: ARMv7 wrong "branches" generalized instruction

All of lore.kernel.org
 help / color / mirror / Atom feed

* [patch] perf: ARMv7 wrong "branches" generalized instruction
@ 2011-08-10 17:40 Vince Weaver
  2011-08-10 18:33 ` Will Deacon
  0 siblings, 1 reply; 10+ messages in thread
From: Vince Weaver @ 2011-08-10 17:40 UTC (permalink / raw)
  To: linux-kernel
  Cc: will.deacon, sam wang, Ingo Molnar, Peter Zijlstra,
	Paul Mackerras, Arnaldo Carvalho de Melo, Stephane Eranian

Hello

Sam Wang reported to me that my perf_event validation tests were failing 
with branches on ARM Cortex A9.

It turns out the branches event used (ARMV7_PERFCTR_PC_WRITE) only seems
to count taken branches.

ARMV7_PERFCTR_PC_IMM_BRANCH seems to do a better job of counting both 
taken and not-taken.  So I've attached a patch to change the definition
for Cotex A9.

This might be needed for Cortex A8 but I don't have a machine to test on 
(yet).

I'm assuming this is a proper fix.  The "generalized" events aren't 
defined very well so there's always some wiggle room about what they mean.

Patch tested on a Pandaboard.

The test code looks like this.  There should be 500,000*3 branches.  But
the second branch (the not taken "bge test_jmp2") is not counted with 
PC_WRITE.

        asm(    "\teor r3,r3,r3\n"
                "\tldr r3,=500000\n"
                "test_loop:\n"
                "\tB test_jmp\n"
                "\tnop\n"
                "test_jmp:\n"
                "\teor r2,r2,r2\n"
                "\tcmp r2,#1\n"
                "\tbge test_jmp2\n"     
                "\tnop\n"
                "\tadd r2,r2,#1\n"
                "test_jmp2:\n"
                "\tsub r3,r3,#1\n"
                "\tcmp r3,#1\n"
                "\tbgt test_loop\n"
                : /* no output registers */
                : /* no inputs           */
                : "cc", "r2", "r3" /* clobbered */
        );


Vince
vweaver1@eecs.utk.edu

Signed-off-by: Vince Weaver <vweaver1@eecs.utk.edu>

diff --git a/arch/arm/kernel/perf_event_v7.c b/arch/arm/kernel/perf_event_v7.c
index 4c85183..4d11bd5 100644
--- a/arch/arm/kernel/perf_event_v7.c
+++ b/arch/arm/kernel/perf_event_v7.c
@@ -323,7 +323,7 @@ static const unsigned armv7_a9_perf_map[PERF_COUNT_HW_MAX] = {
 					ARMV7_PERFCTR_INST_OUT_OF_RENAME_STAGE,
 	[PERF_COUNT_HW_CACHE_REFERENCES]    = ARMV7_PERFCTR_COHERENT_LINE_HIT,
 	[PERF_COUNT_HW_CACHE_MISSES]	    = ARMV7_PERFCTR_COHERENT_LINE_MISS,
-	[PERF_COUNT_HW_BRANCH_INSTRUCTIONS] = ARMV7_PERFCTR_PC_WRITE,
+	[PERF_COUNT_HW_BRANCH_INSTRUCTIONS] = ARMV7_PERFCTR_PC_IMM_BRANCH,
 	[PERF_COUNT_HW_BRANCH_MISSES]	    = ARMV7_PERFCTR_PC_BRANCH_MIS_PRED,
 	[PERF_COUNT_HW_BUS_CYCLES]	    = ARMV7_PERFCTR_CLOCK_CYCLES,
 };


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [patch] perf: ARMv7 wrong "branches" generalized instruction
  2011-08-10 17:40 [patch] perf: ARMv7 wrong "branches" generalized instruction Vince Weaver
@ 2011-08-10 18:33 ` Will Deacon
  2011-08-10 19:01   ` Vince Weaver
  0 siblings, 1 reply; 10+ messages in thread
From: Will Deacon @ 2011-08-10 18:33 UTC (permalink / raw)
  To: Vince Weaver
  Cc: linux-kernel, sam wang, Ingo Molnar, Peter Zijlstra,
	Paul Mackerras, Arnaldo Carvalho de Melo, Stephane Eranian

On Wed, Aug 10, 2011 at 06:40:31PM +0100, Vince Weaver wrote:
> Hello

Hi Vince,

> Sam Wang reported to me that my perf_event validation tests were failing 
> with branches on ARM Cortex A9.
> 
> It turns out the branches event used (ARMV7_PERFCTR_PC_WRITE) only seems
> to count taken branches.

It also counts exceptions and instructions that write to the PC.

> ARMV7_PERFCTR_PC_IMM_BRANCH seems to do a better job of counting both 
> taken and not-taken.  So I've attached a patch to change the definition
> for Cotex A9.

Well, it also only considers immediate branches so whilst it might satisy
your test, I think that overall it's a less meaningful number.

> This might be needed for Cortex A8 but I don't have a machine to test on 
> (yet).

We use the same event encoding for HW_BRANCH_INSTRUCTIONS on the A8.

> I'm assuming this is a proper fix.  The "generalized" events aren't 
> defined very well so there's always some wiggle room about what they mean.

I'm really not a big fan of the generalised events. I appreciate that they
make perf easier to use but *only* if you can actually provide a sensible
definition of the event which can (ideally) be compared between two
different CPU implementations for the same architecture.

So, my take on this is that we should either:

(a) leave it like it is since taken branches is probably a more useful
    metric than number of immediate branches executed.

(b) start replacing our generalised events with HW_OP_UNSUPPORTED and force
    the user to use raw events. I agree this isn't very friendly, but it's
    better than giving them crazy results [for example, we currently report
    more cache misses than cache references on A9 iirc].

Personally, I'm favour of (b) and getting userspace to provide the user with
a CPU-specific event listing and then translate this to raw events using
something like libpfm.

As an aside, I also think this is part of a bigger problem. For example, the
software event PERF_COUNT_SW_EMULATION_FAULTS would be much more useful if
we could describe different types of emulation faults. These would probably
be architecture-specific and we would need a way for userspace to communicate
the event subclass to the kernel rather than having separate ABI events for
them. So not only would we want raw events, we'd also want a way to specify
the PMU to handle them (given that a global event namespace across PMUs is
unrealistic).

Will

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [patch] perf: ARMv7 wrong "branches" generalized instruction
  2011-08-10 18:33 ` Will Deacon
@ 2011-08-10 19:01   ` Vince Weaver
  2011-08-10 19:16     ` Måns Rullgård
  2011-08-10 22:07     ` Will Deacon
  0 siblings, 2 replies; 10+ messages in thread
From: Vince Weaver @ 2011-08-10 19:01 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-kernel, sam wang, Ingo Molnar, Peter Zijlstra,
	Paul Mackerras, Arnaldo Carvalho de Melo, Stephane Eranian

On Wed, 10 Aug 2011, Will Deacon wrote:

> > It turns out the branches event used (ARMV7_PERFCTR_PC_WRITE) only seems
> > to count taken branches.
>
> It also counts exceptions and instructions that write to the PC.

are those more common than not-taken branches?  I'd think branch predictor 
statistics will be a bit off if only taken instructions are measured.

> > ARMV7_PERFCTR_PC_IMM_BRANCH seems to do a better job of counting both 
> > taken and not-taken.  So I've attached a patch to change the definition
> > for Cotex A9.
>
> Well, it also only considers immediate branches so whilst it might 
> satisy your test, I think that overall it's a less meaningful number.

I guess there isn't more info available about which branches exactly are 
counted by all the events?  I've gone through the trouble of writing such 
tests to find out experimentally what various counters count for x86, it 
would be sad to have to do it again for ARM.

> (b) start replacing our generalised events with HW_OP_UNSUPPORTED and force
>     the user to use raw events. I agree this isn't very friendly, but it's
>     better than giving them crazy results [for example, we currently report
>     more cache misses than cache references on A9 iirc].
> 
> Personally, I'm favour of (b) and getting userspace to provide the user with
> a CPU-specific event listing and then translate this to raw events using
> something like libpfm.

I agree 100%, but it's an unpopular opinion on linux-kernel.  (Note that 
I'm the one who contributed ARM Cortex A8/A9 support to both libpfm4 and 
PAPI).

Since the generalized events are there and ABI though, people are going to 
use them.  That's why I've been writing tests that check them to see 
exactly what they are measuring.

It's still an important issue to know what "branches" measures, just it 
probably shouldn't be a kernel issue like it's become.

Vince
vweaver1@eecs.utk.edu

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [patch] perf: ARMv7 wrong "branches" generalized instruction
  2011-08-10 19:01   ` Vince Weaver
@ 2011-08-10 19:16     ` Måns Rullgård
  2011-08-10 22:07     ` Will Deacon
  1 sibling, 0 replies; 10+ messages in thread
From: Måns Rullgård @ 2011-08-10 19:16 UTC (permalink / raw)
  To: linux-kernel

Vince Weaver <vweaver1@eecs.utk.edu> writes:

> On Wed, 10 Aug 2011, Will Deacon wrote:
>
>> > It turns out the branches event used (ARMV7_PERFCTR_PC_WRITE) only seems
>> > to count taken branches.
>>
>> It also counts exceptions and instructions that write to the PC.
>
> are those more common than not-taken branches? 

Every function return and every PLT call would fall in this category.

> I'd think branch predictor statistics will be a bit off if only taken
> instructions are measured.

Clearly.

-- 
Måns Rullgård
mans@mansr.com


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [patch] perf: ARMv7 wrong "branches" generalized instruction
  2011-08-10 19:01   ` Vince Weaver
  2011-08-10 19:16     ` Måns Rullgård
@ 2011-08-10 22:07     ` Will Deacon
  2011-08-11  8:15       ` Ingo Molnar
  1 sibling, 1 reply; 10+ messages in thread
From: Will Deacon @ 2011-08-10 22:07 UTC (permalink / raw)
  To: Vince Weaver
  Cc: linux-kernel, sam wang, Ingo Molnar, Peter Zijlstra,
	Paul Mackerras, Arnaldo Carvalho de Melo, Stephane Eranian

On Wed, Aug 10, 2011 at 08:01:20PM +0100, Vince Weaver wrote:
> On Wed, 10 Aug 2011, Will Deacon wrote:
> 
> > > It turns out the branches event used (ARMV7_PERFCTR_PC_WRITE) only seems
> > > to count taken branches.
> >
> > It also counts exceptions and instructions that write to the PC.
> 
> are those more common than not-taken branches?  I'd think branch predictor 
> statistics will be a bit off if only taken instructions are measured.

They're almost certainly not as common in normal code. However, as I've
mentioned below, ARMV7_PERFCTR_PC_IMM_BRANCH only counts immediate branches
so I don't think this is so useful for general consumption.

> > > ARMV7_PERFCTR_PC_IMM_BRANCH seems to do a better job of counting both 
> > > taken and not-taken.  So I've attached a patch to change the definition
> > > for Cotex A9.
> >
> > Well, it also only considers immediate branches so whilst it might 
> > satisy your test, I think that overall it's a less meaningful number.
> 
> I guess there isn't more info available about which branches exactly are 
> counted by all the events?  I've gone through the trouble of writing such 
> tests to find out experimentally what various counters count for x86, it 
> would be sad to have to do it again for ARM.

The problem is, it's largely CPU specific. This has improved slightly with
newer cores and there is a PMUv2 document which describes common
architectural events and their reserved numbers, but it is still optional
for the CPU to implement these (notably, Cortex-A9 doesn't implement the
architected instruction counter).

Whilst your tests sound useful, to get any meaningful results out of ARM you
will need to either skip difficult tests or make them CPU specific and use the
raw encodings.

> > (b) start replacing our generalised events with HW_OP_UNSUPPORTED and force
> >     the user to use raw events. I agree this isn't very friendly, but it's
> >     better than giving them crazy results [for example, we currently report
> >     more cache misses than cache references on A9 iirc].
> > 
> > Personally, I'm favour of (b) and getting userspace to provide the user with
> > a CPU-specific event listing and then translate this to raw events using
> > something like libpfm.
> 
> I agree 100%, but it's an unpopular opinion on linux-kernel.  (Note that 
> I'm the one who contributed ARM Cortex A8/A9 support to both libpfm4 and 
> PAPI).

I can see why it's an unpopular idea if it's not necessary on your
architecture but for ARM it's really the only way forward without continuing
to introduce a mess of sparsely populated event tables every time a new CPU
crops up.

> Since the generalized events are there and ABI though, people are going to 
> use them.  That's why I've been writing tests that check them to see 
> exactly what they are measuring.

Right, but as I say, `instructions' on one core might not be `instructions'
on another core. Just removing the ABI types from ARM will at least stop
people using them. From what I've seen of perf users on ARM, they start with
the ABI events, get some nonsensical results and then switch exclusively to
raw events from then on.

> It's still an important issue to know what "branches" measures, just it 
> probably shouldn't be a kernel issue like it's become.

The TRM for the A9 will describe various events for counting branch-related
events. These may be specific to the pipeline and micro-architecture and
therefore you can't really tar them all with the same brush.

Will

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [patch] perf: ARMv7 wrong "branches" generalized instruction
  2011-08-10 22:07     ` Will Deacon
@ 2011-08-11  8:15       ` Ingo Molnar
  2011-08-11  9:16         ` Will Deacon
  2011-08-12  4:35         ` Vince Weaver
  0 siblings, 2 replies; 10+ messages in thread
From: Ingo Molnar @ 2011-08-11  8:15 UTC (permalink / raw)
  To: Will Deacon
  Cc: Vince Weaver, linux-kernel, sam wang, Peter Zijlstra,
	Paul Mackerras, Arnaldo Carvalho de Melo, Stephane Eranian,
	Linus Torvalds, David S. Miller, Thomas Gleixner

* Will Deacon <will.deacon@arm.com> wrote:

> [...] From what I've seen of perf users on ARM, they start with the 
> ABI events, get some nonsensical results and then switch 
> exclusively to raw events from then on.

Could you give a specific example of such nonsensical output on ARM? 
Bugs should be fixed and yes i can that see if ARM produces 
nonsencial output then people won't use that nonsensical output 
(duh). Please fix or improve the nonsensical output.

Btw., i have a pretty different experience from you: people will use 
most of the (default) generic events pretty happily because most 
developers have an adequate notion of 'cycles, branches, 
instructions' and they will *STOP* at the boundary of having to go 
into CPU microarchitecture specific details ...

People just use the tool defaults in most cases, only a select few 
will bother with model specific events. Life is short and learning 
CPU microarchitecture specific details is a long and difficult 
process that is not justified for most users/developers - not in 
small part because the juicy bits of how specific CPUs really work 
(and what raw events correspond to those details) are behind an NDA 
protected curtain, only accessible to a few privileged people ...

That is not what Linux interfaces are about in my opinion.

So what you and Vince are suggesting, to dumb down the kernel parts 
of perf and force users into raw or microarchitecture specific events 
actually *reduces* the user-base very significantly - while in 
practice even just cycles, instructions and branches level analysis 
handles 99% of the everyday performance analysis needs ...

We saw how the "push CPU specific events to users and tooling" 
concept didn't work with oprofile - why do we have to re-discuss this 
part of failed Linux history again and again?

The approach Vince and you are suggesting is literally sacrificing 
99% of utility for 1% of the users - a not very smart approach. I 
don't mind accomodating the needs of 1% of power-users (at all), but:

   *NOT AT THE EXPENSE OF THE COMMON CASE*.

doh.

> > I agree 100%, but it's an unpopular opinion on linux-kernel.  
> > (Note that I'm the one who contributed ARM Cortex A8/A9 support 
> > to both libpfm4 and PAPI).
>
> I can see why it's an unpopular idea if it's not necessary on your 
> architecture but for ARM it's really the only way forward without 
> continuing to introduce a mess of sparsely populated event tables 
> every time a new CPU crops up.

Generic events are not about lkml popularity ... it's about 
usability.

And why would it have to be implemented in a messy way? We have a 
number of CPU specific tables (and quirks) on x86 as well - that's 
the job of pretty much any kernel driver, to abstract things away in 
a per CPU, often per device (and sometimes even per card variant 
type) manner.

We literally have more than 7 million lines of drivers/* code that 
provides generic abstractions - not just a few thousand lines of raw 
PCI operations space where user-space can write magic values to ...

Similarly, for perf events we don't do a raw binary ABI mess for 
really good reasons: tools and users do not think in CPU and model 
specific hexa numbers, they operate in higher level concepts.

That is a basic quality of implementation property.

It's the *job* of the kernel to abstract things away, we don't shy 
away from that ...

> > Since the generalized events are there and ABI though, people are 
> > going to use them.  That's why I've been writing tests that check 
> > them to see exactly what they are measuring.
> 
> Right, but as I say, `instructions' on one core might not be 
> `instructions' on another core. Just removing the ABI types from 
> ARM will at least stop people using them. [...]

What are you talking about? Sure ARM Cortex 9 will execute 
instructions of a user-space application just as much as do other ARM 
CPUs. Sure as it executes that app it will execute instructions, you 
can single-step through it and thus you can count how many 
instructions it has executed, right?

> > It's still an important issue to know what "branches" measures, 
> > just it probably shouldn't be a kernel issue like it's become.
> 
> The TRM for the A9 will describe various events for counting 
> branch-related events. These may be specific to the pipeline and 
> micro-architecture and therefore you can't really tar them all with 
> the same brush.

The best generic event is the one that the coder/user of a user-space 
app sees the CPU executing instructions/branches/etc.

If the PMU cannot give that then the (statistically) next best 
approximation should be provided.

If you think about it that is a pretty unambiguous definition: each 
ARM core will execute user-space applications and the same 
(compatible) assembly routine results in the same end result, in the 
same number of visible assembly instructions, right?

In practice most people will use the default event: cycles for perf 
stat/top and the default 'perf stat' output.

We've also had numerous cases where kernel developers went way beyond 
those metrics and apprecitated that tooling would provide good 
approximations for all those events regardless of what CPU type the 
workload was running on (and sometimes even documented this in the 
changelog).

So having generic events is not some fancy, unused property, but a 
pretty important measurement aspect of perf.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [patch] perf: ARMv7 wrong "branches" generalized instruction
  2011-08-11  8:15       ` Ingo Molnar
@ 2011-08-11  9:16         ` Will Deacon
  2011-08-12 10:34           ` Ingo Molnar
  2011-08-12  4:35         ` Vince Weaver
  1 sibling, 1 reply; 10+ messages in thread
From: Will Deacon @ 2011-08-11  9:16 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Vince Weaver, linux-kernel, sam wang, Peter Zijlstra,
	Paul Mackerras, Arnaldo Carvalho de Melo, Stephane Eranian,
	Linus Torvalds, David S. Miller, Thomas Gleixner

Hi Ingo,

Thanks for your input on this.

On Thu, Aug 11, 2011 at 09:15:25AM +0100, Ingo Molnar wrote:
> 
> * Will Deacon <will.deacon@arm.com> wrote:
> 
> > [...] From what I've seen of perf users on ARM, they start with the 
> > ABI events, get some nonsensical results and then switch 
> > exclusively to raw events from then on.
> 
> Could you give a specific example of such nonsensical output on ARM? 
> Bugs should be fixed and yes i can that see if ARM produces 
> nonsencial output then people won't use that nonsensical output 
> (duh). Please fix or improve the nonsensical output.

Sure. On Cortex-A9 I see this:

Performance counter stats for 'ls':

              2862 cache-references
             20658 cache-misses              #  721.803 % of all cache refs

       0.019123136 seconds time elapsed

This is because we're actually reporting cache hits for cache-references
in an attempt to provide something remotely similar. I agree that this is
broken, which is why I'm leaning towards a more liberal use of
HW_OP_UNSUPPORTED.

> Btw., i have a pretty different experience from you: people will use 
> most of the (default) generic events pretty happily because most 
> developers have an adequate notion of 'cycles, branches, 
> instructions' and they will *STOP* at the boundary of having to go 
> into CPU microarchitecture specific details ...

Ok, perhaps my experience comes my sheltered life in the company of
micro-architecture nerds :) Although, I think that if the generic events
were more applicable to ARM I would be seeing what you see.

> People just use the tool defaults in most cases, only a select few 
> will bother with model specific events. Life is short and learning 
> CPU microarchitecture specific details is a long and difficult 
> process that is not justified for most users/developers - not in 
> small part because the juicy bits of how specific CPUs really work 
> (and what raw events correspond to those details) are behind an NDA 
> protected curtain, only accessible to a few privileged people ...
> 
> That is not what Linux interfaces are about in my opinion.

I completely agree with you on avoiding these interfaces in general.
However, the ARM event numbers aren't under NDA and even if we could put
them in the kernel, there's no way of communicating that to the user because
the events don't match up well with what the ABI expects.

For example, an event that may be useful on A15 is:

  0x6d: Exclusive instruction speculatively executed - STREX pass

  (this could be used for investigating lock contention)

yet users are currently forced to use a raw event for this anyway.
This is fine for the more esoteric events like

  0x40: Counts the number of Java bytecodes being decoded, including
        speculative ones.

where only a select few will care about it.

> So what you and Vince are suggesting, to dumb down the kernel parts 
> of perf and force users into raw or microarchitecture specific events 
> actually *reduces* the user-base very significantly - while in 
> practice even just cycles, instructions and branches level analysis 
> handles 99% of the everyday performance analysis needs ...

No. I don't think that the kernel part should be dumbed down, nor do I think
that the user should have to play with hex numbers. I just think that we
should allow a way to communicate named CPU-specific events to the user. We
have userspace libraries that do this, but if you want to avoid the OProfile
mess then we could look at putting this into the kernel (although I worry
that these tables will become large).

> We saw how the "push CPU specific events to users and tooling" 
> concept didn't work with oprofile - why do we have to re-discuss this 
> part of failed Linux history again and again?
> 
> The approach Vince and you are suggesting is literally sacrificing 
> 99% of utility for 1% of the users - a not very smart approach. I 
> don't mind accomodating the needs of 1% of power-users (at all), but:
> 
>    *NOT AT THE EXPENSE OF THE COMMON CASE*.
> 
> doh.

So let's leave the common-case as a `best effort' attempt to match the ABI
events to whatever we have on the running CPU and come up with a way to
augment the set of named events provided by perf.

> > 
> > Right, but as I say, `instructions' on one core might not be 
> > `instructions' on another core. Just removing the ABI types from 
> > ARM will at least stop people using them. [...]
> 
> What are you talking about? Sure ARM Cortex 9 will execute 
> instructions of a user-space application just as much as do other ARM 
> CPUs. Sure as it executes that app it will execute instructions, you 
> can single-step through it and thus you can count how many 
> instructions it has executed, right?

On A9:

instructions (0x68):
	Instructions coming out of the core renaming stage

	Counts the number of instructions going through the Register Renaming
	stage. This number is an approximate number of the total number of
	instructions speculatively executed, and even more approximate of
	the total number of instructions architecturally executed. The
	approximation depends mainly on the branch misprediction rate.

On A8:

instructions (0x08):
	Instruction architecturally executed

The problem being that the A9 PMU event really doesn't tie back to the
programmer's model. It's an approximation though, so it's alright provided
you don't try to compare it between CPUs.

> If you think about it that is a pretty unambiguous definition: each 
> ARM core will execute user-space applications and the same 
> (compatible) assembly routine results in the same end result, in the 
> same number of visible assembly instructions, right?

Yes, from the programmer's model it's the same, but the event counts might
not correlate so well with that. Sometimes you may need to have two event
counters and sum the total, for example (the earlier cache-references should
be hits + misses).

> In practice most people will use the default event: cycles for perf 
> stat/top and the default 'perf stat' output.

We have a dedicated cycle counter, so no issues there.

> We've also had numerous cases where kernel developers went way beyond 
> those metrics and apprecitated that tooling would provide good 
> approximations for all those events regardless of what CPU type the 
> workload was running on (and sometimes even documented this in the 
> changelog).
> 
> So having generic events is not some fancy, unused property, but a 
> pretty important measurement aspect of perf.

Ok, but how can we expose the rest of the CPU events without using raw
events?

Cheers,

Will

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [patch] perf: ARMv7 wrong "branches" generalized instruction
  2011-08-11  8:15       ` Ingo Molnar
  2011-08-11  9:16         ` Will Deacon
@ 2011-08-12  4:35         ` Vince Weaver
  1 sibling, 0 replies; 10+ messages in thread
From: Vince Weaver @ 2011-08-12  4:35 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Will Deacon, linux-kernel, sam wang, Peter Zijlstra,
	Paul Mackerras, Arnaldo Carvalho de Melo, Stephane Eranian,
	Linus Torvalds, David S. Miller, Thomas Gleixner

On Thu, 11 Aug 2011, Ingo Molnar wrote:

> 
> * Will Deacon <will.deacon@arm.com> wrote:
> So what you and Vince are suggesting, to dumb down the kernel parts 
> of perf and force users into raw or microarchitecture specific events 
> actually *reduces* the user-base very significantly - while in 
> practice even just cycles, instructions and branches level analysis 
> handles 99% of the everyday performance analysis needs ...

No, what I want are the generalized events to accurately describe what is 
being measured.

To do this properly you need a lot more granularity.  PAPI for example has 
100+ generalized events, some of which require multiple hardware events 
to be combined to get the desired outcome.

Having a "branch" event like that on ARM that ignores not-taken events is 
going to drive you nuts when you are trying to sample in your code to 
find out why the branch miss rate is so high and you want to find the loop 
exits that are predicted poorly (but can't find them because loop exits 
tend to be not-taken branches).

Having a "L1-dcache-load" event that includes stores (like on current AMD) 
will drive you nuts if you are debugging code and it shows that somehow
these loads are triggering cache-coherency invalidates when you know that 
usually only stores can do so, and why would a load only event count 
stores.

Having a tool that gives misleading names to things would be like if I 
gave some poor user a copy of gdb that silently set breakpoints a random
offset from where the actueal breakpoint.  Sure it probably correlates to 
where you want the code to stop, but it's not what the tool says it is 
doing..

So either come up with finer-grained generalized events, or else do a 
better job of picking them.  The fact that my _extremely_ simple 
validation tests keep turning up problems like this indicate that no one 
bothered checking these events before they shoved them in the kernel.

Once in, these bad events linger for years and it's not even possible
to tell from userspace what raw even a generalized event maps to.
So anyone who cares is going to use a tool that uses raw events anyway.

(Note I am not saying they'll calculate the raw hex vaue themselves.
 They'll use a pre-existing tool written by your maligned 1% power
 users that will pick the proper raw event for them.)

> We saw how the "push CPU specific events to users and tooling" 
> concept didn't work with oprofile - why do we have to re-discuss this 
> part of failed Linux history again and again?

No one is arguing for oprofile.  libpfm4/PAPI or a tool like pfmon, sure.

> The approach Vince and you are suggesting is literally sacrificing 
> 99% of utility for 1% of the users - a not very smart approach. I 
> don't mind accomodating the needs of 1% of power-users (at all), but:
> 
>    *NOT AT THE EXPENSE OF THE COMMON CASE*.
>
> [...]
> 
> We literally have more than 7 million lines of drivers/* code that 
> provides generic abstractions - not just a few thousand lines of raw 
> PCI operations space where user-space can write magic values to ...
>
> [...]
>
> It's the *job* of the kernel to abstract things away, we don't shy 
> away from that ...

I get the impression if you were the graphics maintainer you'd specify all 
drivers should use a 1024x768x16bpp generic abstraction and dither or 
scale all devices to match this.  This would be a nice abstraction that 
would make graphics programming oh so much easier for the casual 
programmer and it provides 99% of what most users want.  The 1% of power
users are unimportant.

Also one could only use officialy blessed list of colors appearing in 
some obscure file under /sys.

Also access to 3D functionality would be blocked until the people wanting 
3D had properly developed a generic concept of the color "mauve" that 
could be applied across the board, even on black+white only hardware.

> So having generic events is not some fancy, unused property, but a 
> pretty important measurement aspect of perf.

perf the userspace tool or perf_event the kernel ABI?

Vince

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [patch] perf: ARMv7 wrong "branches" generalized instruction
  2011-08-11  9:16         ` Will Deacon
@ 2011-08-12 10:34           ` Ingo Molnar
  2011-08-15 11:18             ` Will Deacon
  0 siblings, 1 reply; 10+ messages in thread
From: Ingo Molnar @ 2011-08-12 10:34 UTC (permalink / raw)
  To: Will Deacon
  Cc: Vince Weaver, linux-kernel, sam wang, Peter Zijlstra,
	Paul Mackerras, Arnaldo Carvalho de Melo, Stephane Eranian,
	Linus Torvalds, David S. Miller, Thomas Gleixner


* Will Deacon <will.deacon@arm.com> wrote:

> Hi Ingo,
> 
> Thanks for your input on this.
> 
> On Thu, Aug 11, 2011 at 09:15:25AM +0100, Ingo Molnar wrote:
> > 
> > * Will Deacon <will.deacon@arm.com> wrote:
> > 
> > > [...] From what I've seen of perf users on ARM, they start with the 
> > > ABI events, get some nonsensical results and then switch 
> > > exclusively to raw events from then on.
> > 
> > Could you give a specific example of such nonsensical output on ARM? 
> > Bugs should be fixed and yes i can that see if ARM produces 
> > nonsencial output then people won't use that nonsensical output 
> > (duh). Please fix or improve the nonsensical output.
> 
> Sure. On Cortex-A9 I see this:
> 
> 
> Performance counter stats for 'ls':
> 
>               2862 cache-references
>              20658 cache-misses              #  721.803 % of all cache refs

Well, you said 'instructions' in your mail:

>> Right, but as I say, `instructions' on one core might not be 
>> `instructions' on another core. Just removing the ABI types from 
>> ARM will at least stop people using them. [...]

So can we agree that cycles, instructions and branches are fine on 
ARM?

Even discounting hits/misses/references restrictions that you are 
running into, cache events are approximate on x86 too - most PMUs 
have random restrictions on what can be measured and what not - the 
cache access critical path gate count is not something you want to 
lengthen with much PMU complexity ...

>        0.019123136 seconds time elapsed
> 
> This is because we're actually reporting cache hits for 
> cache-references in an attempt to provide something remotely 
> similar. I agree that this is broken, which is why I'm leaning 
> towards a more liberal use of HW_OP_UNSUPPORTED.

If there's no 'references' event on that CPU then there's several 
solutions would could do.

Firstly, we could extend:

enum perf_hw_cache_op_result_id {
        PERF_COUNT_HW_CACHE_RESULT_ACCESS       = 0,
        PERF_COUNT_HW_CACHE_RESULT_MISS         = 1,

        PERF_COUNT_HW_CACHE_RESULT_MAX,         /* non-ABI */
};

with a third, RESULT_HIT variant, and the architecture could fill in 
whichever events it can count. User-space could then request all 
three and do the trivial arithmetics when one of them is missing as 
'not counted'.

Secondly, we could let the kernel do the arithmetics: when 'accesses' 
and 'misses' are requested, the kernel could start a 'hits' and 
'misses' event and do the addition internally. This couples the 
events though, in a way not visible to user-space, which might 
complicate things.

A third variant would be a variation of the second solution: to 
create a standalone 'compound' event by running two hw events (hits 
and misses), when user-space requests 'references'.

> > Btw., i have a pretty different experience from you: people will 
> > use most of the (default) generic events pretty happily because 
> > most developers have an adequate notion of 'cycles, branches, 
> > instructions' and they will *STOP* at the boundary of having to 
> > go into CPU microarchitecture specific details ...
> 
> Ok, perhaps my experience comes my sheltered life in the company of 
> micro-architecture nerds :) [...]

That's an excusable sin, happens to most folks who specialize in PMU 
fun - they just don't get the point of "dumbing down" all those 
nifty, totally exciting microarchitectural details ;-)

Many times 'as many details as possible' is my preference as well - i 
like 'perf stat -ddd' output a lot (after first getting a simplified 
overview run). So successive runs of:

  perf stat
  perf stat -d
  perf stat -dd
  perf stat -ddd

... tell the same fundamental story with increasing 'resolution' and 
detail of analysis.

That does not mean that my admittedly odd and occasionally extreme 
preferences as an expert are what should dictate the design though.

> [...] Although, I think that if the generic events were more 
> applicable to ARM I would be seeing what you see.
> 
> > People just use the tool defaults in most cases, only a select 
> > few will bother with model specific events. Life is short and 
> > learning CPU microarchitecture specific details is a long and 
> > difficult process that is not justified for most users/developers 
> > - not in small part because the juicy bits of how specific CPUs 
> > really work (and what raw events correspond to those details) are 
> > behind an NDA protected curtain, only accessible to a few 
> > privileged people ...
> > 
> > That is not what Linux interfaces are about in my opinion.
> 
> I completely agree with you on avoiding these interfaces in 
> general. However, the ARM event numbers aren't under NDA and even 
> if we could put them in the kernel, there's no way of communicating 
> that to the user because the events don't match up well with what 
> the ABI expects.

Well, can you see other problems beyond the hits/misses/references 
problem? I think we can solve that one.

> For example, an event that may be useful on A15 is:
> 
>   0x6d: Exclusive instruction speculatively executed - STREX pass
> 
>   (this could be used for investigating lock contention)
> 
> yet users are currently forced to use a raw event for this anyway.
> This is fine for the more esoteric events like
> 
>   0x40: Counts the number of Java bytecodes being decoded, including
>         speculative ones.
> 
> where only a select few will care about it.

We could certainly extend the number of generic events. What are 
'exclusive instructions' on ARM - ones that do atomic operations?

With any generalization, there will be a somewhat fuzzy boundary 
between events that are best kept raw and events that are worth 
generalizing. So the fact that you can find esoteric sounding but 
useful events that probably only apply to ARM does not invalidate the 
general idea of abstracting out cross-CPU concepts.

I personally would rather err on the side of generalizing too many 
than too few events:

- If a given event cannot be expressed on a CPU model then that's not 
  a big problem: it literally does not exist on that CPU and nothing
  we can do will create it out of thin air. It will remain obscure 
  and we can live with that.

- But if a useful event is only accessible via the raw ABI, and it 
  turns out to be present on other CPUs as well and tools would like 
  to make use of it, then it would be actively harmful if tools used 
  the raw ABI. If generalized it can be used more widely.


> > So what you and Vince are suggesting, to dumb down the kernel 
> > parts of perf and force users into raw or microarchitecture 
> > specific events actually *reduces* the user-base very 
> > significantly - while in practice even just cycles, instructions 
> > and branches level analysis handles 99% of the everyday 
> > performance analysis needs ...
> 
> No. I don't think that the kernel part should be dumbed down, nor 
> do I think that the user should have to play with hex numbers. I 
> just think that we should allow a way to communicate named 
> CPU-specific events to the user. We have userspace libraries that 
> do this, but if you want to avoid the OProfile mess then we could 
> look at putting this into the kernel (although I worry that these 
> tables will become large).

Size is not an issue.


> > We saw how the "push CPU specific events to users and tooling" 
> > concept didn't work with oprofile - why do we have to re-discuss 
> > this part of failed Linux history again and again?
> > 
> > The approach Vince and you are suggesting is literally 
> > sacrificing 99% of utility for 1% of the users - a not very smart 
> > approach. I don't mind accomodating the needs of 1% of 
> > power-users (at all), but:
> > 
> >    *NOT AT THE EXPENSE OF THE COMMON CASE*.
> > 
> > doh.
> 
> So let's leave the common-case as a `best effort' attempt to match 
> the ABI events to whatever we have on the running CPU and come up 
> with a way to augment the set of named events provided by perf.

Correct - as long as 'best effort' is still statistically equivalent 
to the real, 'ideal' event.

For the specific cache hits/misses/references example you cited i 
think we need to do better than what we have currently: clearly we 
don't want 'references' to be a smaller integer value than 'misses'.

> > > Right, but as I say, `instructions' on one core might not be 
> > > `instructions' on another core. Just removing the ABI types 
> > > from ARM will at least stop people using them. [...]
> > 
> > What are you talking about? Sure ARM Cortex 9 will execute 
> > instructions of a user-space application just as much as do other 
> > ARM CPUs. Sure as it executes that app it will execute 
> > instructions, you can single-step through it and thus you can 
> > count how many instructions it has executed, right?
> 
> On A9:
> 
> instructions (0x68):
> 	Instructions coming out of the core renaming stage
> 
> 	Counts the number of instructions going through the Register Renaming
> 	stage. This number is an approximate number of the total number of
> 	instructions speculatively executed, and even more approximate of
> 	the total number of instructions architecturally executed. The
> 	approximation depends mainly on the branch misprediction rate.
> 
> On A8:
> 
> instructions (0x08):
> 	Instruction architecturally executed
> 
> The problem being that the A9 PMU event really doesn't tie back to 
> the programmer's model. It's an approximation though, so it's 
> alright provided you don't try to compare it between CPUs.

ok - i think this is an example where the definition is statistically 
equivalent - i.e. 'good enough'.

Cross-CPU comparisons are never obvious in any case: compilers 
generate different code on different CPUs and different systems tend 
to have different user-space.

99% of the comparisons are done in the same system, just with 
different versions of the software running on it.


> > If you think about it that is a pretty unambiguous definition: 
> > each ARM core will execute user-space applications and the same 
> > (compatible) assembly routine results in the same end result, in 
> > the same number of visible assembly instructions, right?
> 
> Yes, from the programmer's model it's the same, but the event 
> counts might not correlate so well with that. [...]

Right.

> [...] Sometimes you may need to have two event counters and sum the 
> total, for example (the earlier cache-references should be hits + 
> misses).

Yes - and i think the cache event artifacts are well beyond the 
'statistically equivalent' noise and we need to fix that imprecision.

> > In practice most people will use the default event: cycles for 
> > perf stat/top and the default 'perf stat' output.
> 
> We have a dedicated cycle counter, so no issues there.

Good - this makes 90% of the users happy already ;-)

> > We've also had numerous cases where kernel developers went way 
> > beyond those metrics and apprecitated that tooling would provide 
> > good approximations for all those events regardless of what CPU 
> > type the workload was running on (and sometimes even documented 
> > this in the changelog).
> > 
> > So having generic events is not some fancy, unused property, but 
> > a pretty important measurement aspect of perf.
> 
> Ok, but how can we expose the rest of the CPU events without using 
> raw events?

I think Corey sent a patch some time ago (a year ago?) that allowed 
CPU specific events to be defined by the kernel. I think it would be 
useful - i think we've generalized most of the core stuff that's 
worth generalizing so we can start populating the more esoteric 
tables as well.

These events could be used via some self-explanatory syntax, such as:

	-e cpu::instr_strex

or so - and would map to 0x6d on A9. Hm?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [patch] perf: ARMv7 wrong "branches" generalized instruction
  2011-08-12 10:34           ` Ingo Molnar
@ 2011-08-15 11:18             ` Will Deacon
  0 siblings, 0 replies; 10+ messages in thread
From: Will Deacon @ 2011-08-15 11:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Vince Weaver, linux-kernel, sam wang, Peter Zijlstra,
	Paul Mackerras, Arnaldo Carvalho de Melo, Stephane Eranian,
	Linus Torvalds, David S. Miller, Thomas Gleixner

Hi Ingo,

Sorry the delayed response, I was away this weekend.

On Fri, Aug 12, 2011 at 11:34:26AM +0100, Ingo Molnar wrote:
> So can we agree that cycles, instructions and branches are fine on
> ARM?

Cycles are easy and should work everywhere. Instructions aren't portable
between CPUs, but we've established that's ok.

Branches are a bit more tricky since most of the time we can only count
taken branches. The set of branch events we have on ARM present the same
problem as the cache events in that you really need to combine them to get
something meaningful back. For example, A15 can count:

0x10 Mispredicted or not predicted branch speculatively executed
0x12 Predictable branch speculatively executed

0x76 Branch speculatively executed - Immediate branch
0x78 Branch speculatively executed - Procedure return
0x79 Branch speculatively executed - Indirect branch

So you can use 0x10/0x12 to get a handle on the misprediction rate.
The other events may be useful for establishing the distribution of branch
types [and you could add them all up to get a rough figure on the number of
branches].

A9 can do:

0x10 Mispredicted or not predicted branch speculatively executed
0x12 Predictable branch speculatively executed

0x0C Instruction architecturally executed, condition code check pass,
     software change of the PC
0x0D Instruction architecturally executed, immediate branch

0x6E Predictable function returns

Note that we can't count indirect branch instructions.

> If there's no 'references' event on that CPU then there's several
> solutions would could do.
> 
> Firstly, we could extend:
> 
> enum perf_hw_cache_op_result_id {
>         PERF_COUNT_HW_CACHE_RESULT_ACCESS       = 0,
>         PERF_COUNT_HW_CACHE_RESULT_MISS         = 1,
> 
>         PERF_COUNT_HW_CACHE_RESULT_MAX,         /* non-ABI */
> };
> 
> with a third, RESULT_HIT variant, and the architecture could fill in
> whichever events it can count. User-space could then request all
> three and do the trivial arithmetics when one of them is missing as
> 'not counted'.

If you're not opposed to extending the ABI events with (arguably redundant)
additional events, then I'm more than happy with this approach.

> Secondly, we could let the kernel do the arithmetics: when 'accesses'
> and 'misses' are requested, the kernel could start a 'hits' and
> 'misses' event and do the addition internally. This couples the
> events though, in a way not visible to user-space, which might
> complicate things.
> 
> A third variant would be a variation of the second solution: to
> create a standalone 'compound' event by running two hw events (hits
> and misses), when user-space requests 'references'.

The problem with these two solutions it that the compound event may not always
be as simple as a single addition. You may need a number of events to plug
into an arbitrary expression in order to achieve something that relates back
to the programmer's model.

> > > That is not what Linux interfaces are about in my opinion.
> >
> > I completely agree with you on avoiding these interfaces in
> > general. However, the ARM event numbers aren't under NDA and even
> > if we could put them in the kernel, there's no way of communicating
> > that to the user because the events don't match up well with what
> > the ABI expects.
> 
> Well, can you see other problems beyond the hits/misses/references
> problem? I think we can solve that one.

There's the branches issue I've highlighted above. We also can't normally
distinguish between read and write misses for caches and TLBs so we report
the combined total for each, meaning that they're always the same. Finally,
our L2 cache may be off-chip and so we have to plug it in as a separate PMU
rather than include it in the CPU cache map (this leads back to the entirely
separate discussion about how to interface the perf tool with multiple PMUs).

> > For example, an event that may be useful on A15 is:
> >
> >   0x6d: Exclusive instruction speculatively executed - STREX pass
> >
> >   (this could be used for investigating lock contention)
> >
> > yet users are currently forced to use a raw event for this anyway.
> > This is fine for the more esoteric events like
> >
> >   0x40: Counts the number of Java bytecodes being decoded, including
> >         speculative ones.
> >
> > where only a select few will care about it.
> 
> We could certainly extend the number of generic events. What are
> 'exclusive instructions' on ARM - ones that do atomic operations?

Yes, they're used for atomic sections of code where you don't want another
CPU to modify a variable on which you're operating and tend to be used for
the cmpxchg part of spinlocks. Multi-core CPUs have events to report back
the STREX_PASS and STREX_FAIL (somebody stomped on my variable so I have to
repeat the `transaction') so you can get an indication of lock contention.

> With any generalization, there will be a somewhat fuzzy boundary
> between events that are best kept raw and events that are worth
> generalizing. So the fact that you can find esoteric sounding but
> useful events that probably only apply to ARM does not invalidate the
> general idea of abstracting out cross-CPU concepts.

Ok, but I think that for some events on some CPUs it may be better to say
OP_UNSUPPORTED rather than mislead the user if the approximation is too poor
(c.f. cache references from earlier on).

> I personally would rather err on the side of generalizing too many
> than too few events:
> 
> - If a given event cannot be expressed on a CPU model then that's not
>   a big problem: it literally does not exist on that CPU and nothing
>   we can do will create it out of thin air. It will remain obscure
>   and we can live with that.

Agreed.

> - But if a useful event is only accessible via the raw ABI, and it
>   turns out to be present on other CPUs as well and tools would like
>   to make use of it, then it would be actively harmful if tools used
>   the raw ABI. If generalized it can be used more widely.

Sure and that way it becomes a named event which gets rid of the horrible
hex.

> > > So what you and Vince are suggesting, to dumb down the kernel
> > > parts of perf and force users into raw or microarchitecture
> > > specific events actually *reduces* the user-base very
> > > significantly - while in practice even just cycles, instructions
> > > and branches level analysis handles 99% of the everyday
> > > performance analysis needs ...
> >
> > No. I don't think that the kernel part should be dumbed down, nor
> > do I think that the user should have to play with hex numbers. I
> > just think that we should allow a way to communicate named
> > CPU-specific events to the user. We have userspace libraries that
> > do this, but if you want to avoid the OProfile mess then we could
> > look at putting this into the kernel (although I worry that these
> > tables will become large).
> 
> Size is not an issue.

Ok, I just don't want this to get viewed in the same light as the OMAP clock
data that Linus objected to.

> > So let's leave the common-case as a `best effort' attempt to match
> > the ABI events to whatever we have on the running CPU and come up
> > with a way to augment the set of named events provided by perf.
> 
> Correct - as long as 'best effort' is still statistically equivalent
> to the real, 'ideal' event.
> 
> For the specific cache hits/misses/references example you cited i
> think we need to do better than what we have currently: clearly we
> don't want 'references' to be a smaller integer value than 'misses'.

If you're happy to add the new ABI event, I'll update the ARM backend.

> > > We've also had numerous cases where kernel developers went way
> > > beyond those metrics and apprecitated that tooling would provide
> > > good approximations for all those events regardless of what CPU
> > > type the workload was running on (and sometimes even documented
> > > this in the changelog).
> > >
> > > So having generic events is not some fancy, unused property, but
> > > a pretty important measurement aspect of perf.
> >
> > Ok, but how can we expose the rest of the CPU events without using
> > raw events?
> 
> I think Corey sent a patch some time ago (a year ago?) that allowed
> CPU specific events to be defined by the kernel. I think it would be
> useful - i think we've generalized most of the core stuff that's
> worth generalizing so we can start populating the more esoteric
> tables as well.
> 
> These events could be used via some self-explanatory syntax, such as:
> 
>         -e cpu::instr_strex
> 
> or so - and would map to 0x6d on A9. Hm?

cpu::instr_strex_pass => 0x63
cpu::instr_strex_fail => 0x64

I would *really* like to see this in perf as I think it opens up a whole set
of useful events that are currently not being used as much as they could be.
Furthermore, the cpu:: qualification can tie back to a PMU instance, so our
L2 problems can be fixed with:

l2cc::evictions

for example (actually, I already have some hacks in place for this but it's
all at the hex level so that would look like rn:1 - event 0x1 on PMU 0xn).

I would also like to see this sort of syntax for software events, where you
can drill down into something like PERF_COUNT_SW_EMULATION_FAULTS to see
actually which groups of instructions are being emulated:

emulation_faults::fp

to count floating point emulation, for example.

Cheers,

Will

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2011-08-15 11:19 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-08-10 17:40 [patch] perf: ARMv7 wrong "branches" generalized instruction Vince Weaver
2011-08-10 18:33 ` Will Deacon
2011-08-10 19:01   ` Vince Weaver
2011-08-10 19:16     ` Måns Rullgård
2011-08-10 22:07     ` Will Deacon
2011-08-11  8:15       ` Ingo Molnar
2011-08-11  9:16         ` Will Deacon
2011-08-12 10:34           ` Ingo Molnar
2011-08-15 11:18             ` Will Deacon
2011-08-12  4:35         ` Vince Weaver

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.