* Problem with perf hardware counters grouping
From: Mike Hommey @ 2011-08-31  8:57 UTC
  To: linux-kernel

Hi,

I'm having two different problems with perf hardware counters with a
group leader:
- perf_event_open()ing more than 3 makes all of them always return a
  value of 0;
- perf_event_open()ing more than 4 fails with ENOSPC.

This doesn't happen with software counters.

The source at the end of this message exhibits both problems:
$ gcc -o test test.c
$ strace -eperf_event_open ./test
perf_event_open(0x7fff7aeb3e00, 0, 0xffffffff, 0xffffffff, 0) = 3
perf_event_open(0x7fff7aeb3e00, 0, 0xffffffff, 0x3, 0) = 4
perf_event_open(0x7fff7aeb3e00, 0, 0xffffffff, 0x3, 0) = 5
Count: 10857
$ gcc -o test test.c -DN=4
$ strace -eperf_event_open ./test
perf_event_open(0x7fff13a16bd0, 0, 0xffffffff, 0xffffffff, 0) = 3
perf_event_open(0x7fff13a16bd0, 0, 0xffffffff, 0x3, 0) = 4
perf_event_open(0x7fff13a16bd0, 0, 0xffffffff, 0x3, 0) = 5
perf_event_open(0x7fff13a16bd0, 0, 0xffffffff, 0x3, 0) = 6
Count: 0
$ gcc -o test test.c -DN=5
$ strace -eperf_event_open ./test
perf_event_open(0x7fff64700c60, 0, 0xffffffff, 0xffffffff, 0) = 3
perf_event_open(0x7fff64700c60, 0, 0xffffffff, 0x3, 0) = 4
perf_event_open(0x7fff64700c60, 0, 0xffffffff, 0x3, 0) = 5
perf_event_open(0x7fff64700c60, 0, 0xffffffff, 0x3, 0) = 6
perf_event_open(0x7fff64700c60, 0, 0xffffffff, 0x3, 0) = -1 ENOSPC (No space left on device)
Count: 0

I think the latter is due to the hard limit of 4 for the number of slots
for hardware breakpoints. No idea about the former.

Mike

----------------8<-------------------
#define _GNU_SOURCE 1

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

int perf_event_open(struct perf_event_attr *hw_event_uptr,
		    pid_t pid, int cpu, int group_fd, unsigned long flags) {
  return syscall(__NR_perf_event_open, hw_event_uptr, pid, cpu, group_fd, flags);
}

int group_leader = -1;

int _perf_event_open(uint64_t config) {
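  /* The first successful open becomes the (initially disabled) group
     leader; every later event is opened into its group. */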
  struct perf_event_attr pe;
  int fd;

  memset(&pe, 0, sizeof(struct perf_event_attr));
  pe.type = PERF_TYPE_HARDWARE;
  pe.size = sizeof(struct perf_event_attr);
  pe.config = config;
  if (group_leader == -1)
    pe.disabled = 1;
  pe.mmap = 1;
  pe.comm = 1;

  fd = perf_event_open(&pe, 0, -1, group_leader, 0);
  if (fd >= 0 && group_leader == -1)
    group_leader = fd;

  return fd;
}

int main(int argc, char** argv) {
   uint64_t c;
   int fd = _perf_event_open(PERF_COUNT_HW_CPU_CYCLES);
   _perf_event_open(PERF_COUNT_HW_INSTRUCTIONS);
   _perf_event_open(PERF_COUNT_HW_CACHE_REFERENCES);
#if N > 3
   _perf_event_open(PERF_COUNT_HW_CACHE_MISSES);
#endif
#if N > 4
   _perf_event_open(PERF_COUNT_HW_BRANCH_INSTRUCTIONS);
#endif
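   /* Enabling the leader starts the whole group; the siblings were
      opened enabled and only count while the group is scheduled. */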
   ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
   ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

   read(fd, &c, sizeof(c));

   printf("Count: %" PRIu64 "\n", c);

   return 0;
}


* Re: Problem with perf hardware counters grouping
From: Peter Zijlstra @ 2011-09-01 11:53 UTC
  To: Mike Hommey; +Cc: linux-kernel

On Wed, 2011-08-31 at 10:57 +0200, Mike Hommey wrote:
> I'm having two different problems with perf hardware counters with a
> group leader:
> - perf_event_open()ing more than 3 makes all of them always return a
>   value of 0;
> - perf_event_open()ing more than 4 fails with ENOSPC. 


I'm guessing you're running on something x86, either AMD-Fam10-12 or
Intel-NHM+.

Both those have 4 generic hardware counters, but x86 defaults to
enabling the NMI watchdog which takes one, leaving you with 3 (try: echo
0 > /proc/sys/kernel/nmi_watchdog). If you had looked at your dmesg
output you'd have found lines like:

  NMI watchdog enabled, takes one hw-pmu counter.

The code can only check if the group as a whole could possibly fit on a
PMU, which is where your failure on >4 comes from.

What happens with your >3 case is that while the group is valid and
could fit on the PMU, it won't fit at runtime because the NMI watchdog
is taking one and won't budge (cpu-pinned counters have precedence over
any other kind), effectively starving your group of PMU runtime.

Also, we should fix that return to say -EINVAL or so.
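
A minimal sketch of how a caller could detect that starvation at read
time, assuming the event is opened with read_format set to
PERF_FORMAT_TOTAL_TIME_ENABLED | PERF_FORMAT_TOTAL_TIME_RUNNING (the
helper below is illustrative, not something from the test program):

#include <stdint.h>
#include <unistd.h>

struct read_value {
	uint64_t value;         /* raw count */
	uint64_t time_enabled;  /* ns the event was enabled */
	uint64_t time_running;  /* ns it actually ran on the PMU */
};

/* Returns 1 if the counter actually got PMU time, 0 otherwise
 * (e.g. starved by the NMI watchdog), -1 on read error. */
int counter_was_scheduled(int fd)
{
	struct read_value v;

	if (read(fd, &v, sizeof(v)) != sizeof(v))
		return -1;
	return v.time_enabled > 0 && v.time_running > 0;
}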


* Re: Problem with perf hardware counters grouping
From: Mike Hommey @ 2011-09-01 11:59 UTC
  To: Peter Zijlstra; +Cc: linux-kernel

On Thu, Sep 01, 2011 at 01:53:32PM +0200, Peter Zijlstra wrote:
> On Wed, 2011-08-31 at 10:57 +0200, Mike Hommey wrote:
> > I'm having two different problems with perf hardware counters with a
> > group leader:
> > - perf_event_open()ing more than 3 makes all of them always return a
> >   value of 0;
> > - perf_event_open()ing more than 4 fails with ENOSPC. 
> 
> 
> I'm guessing you're running on something x86, either AMD-Fam10-12 or
> Intel-NHM+.

Core2Duo

> Both those have 4 generic hardware counters, but x86 defaults to
> enabling the NMI watchdog which takes one, leaving you with 3 (try: echo
> 0 > /proc/sys/kernel/nmi_watchdog). If you had looked at your dmesg
> output you'd have found lines like:
> 
>   NMI watchdog enabled, takes one hw-pmu counter.

Indeed, that shows up.

> The code can only check if the group as a whole could possibly fit on a
> PMU, which is where your failure on >4 comes from.
> 
> What happens with your >3 case is that while the group is valid and
> could fit on the PMU, it won't fit at runtime because the NMI watchdog
> is taking one and won't budge (cpu-pinned counters have precedence over
> any other kind), effectively starving your group of PMU runtime.

That makes sense. But how exactly is not using groups different, then?
perf, for instance, doesn't use groups, and can get all the hardware
counters.

Mike


* Re: Problem with perf hardware counters grouping
From: Peter Zijlstra @ 2011-09-01 12:40 UTC
  To: Mike Hommey; +Cc: linux-kernel

On Thu, 2011-09-01 at 13:59 +0200, Mike Hommey wrote:

> > I'm guessing you're running on something x86, either AMD-Fam10-12 or
> > Intel-NHM+.
> 
> Core2Duo

Ah, ok, then you're also using the fixed purpose thingies.

> > What happens with your >3 case is that while the group is valid and
> > could fit on the PMU, it won't fit at runtime because the NMI watchdog
> > is taking one and won't budge (cpu-pinned counters have precedence over
> > any other kind), effectively starving your group of PMU runtime.
> 
> That makes sense. But how exactly is not using groups different, then?
> perf, for instance, doesn't use groups, and can get all the hardware
> counters.

The purpose of groups is to co-schedule events on the PMU; that is, we
mandate that all members of the group are configured at the same time.
Note that this does not imply the group is scheduled at all times
(although you could request that by setting perf_event_attr::pinned on
the leader).

By not using groups but individual counters we do not have this
restriction and perf will schedule them individually.
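
As a minimal sketch (reusing the perf_event_open() wrapper from the
test program earlier in the thread), the ungrouped variant just passes
group_fd = -1 for every event:

int open_ungrouped(uint64_t config)
{
	struct perf_event_attr pe;

	memset(&pe, 0, sizeof(pe));
	pe.type = PERF_TYPE_HARDWARE;
	pe.size = sizeof(pe);
	pe.config = config;
	pe.disabled = 1;
	/* group_fd = -1: each event is its own leader, so the
	 * scheduler is free to rotate them independently. */
	return perf_event_open(&pe, 0, -1, -1, 0);
}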

Now perf will rotate events when there are more than can physically fit
on the PMU at any one time, including groups. This can create the
appearance that all 4 are in fact working.

# perf stat -e instructions  ~/loop_ld

 Performance counter stats for '/root/loop_ld':

       400,765,771 instructions              #    0.00  insns per cycle        

       0.085995705 seconds time elapsed

# perf stat -e instructions -e instructions -e instructions -e instructions -e instructions -e instructions ~/loop_1b_ld

 Performance counter stats for '/root/loop_1b_ld':

       398,136,503 instructions              #    0.00  insns per cycle         [83.45%]
       400,387,443 instructions              #    0.00  insns per cycle         [83.62%]
       400,076,744 instructions              #    0.00  insns per cycle         [83.60%]
       400,221,739 instructions              #    0.00  insns per cycle         [83.62%]
       400,038,563 instructions              #    0.00  insns per cycle         [83.60%]
       402,085,668 instructions              #    0.00  insns per cycle         [82.94%]

       0.085712325 seconds time elapsed


This is on a wsm (Westmere: 4 general-purpose counters plus a fixed-purpose
counter capable of counting instructions) with the NMI watchdog disabled.

Note the [83%] thing; that indicates these things got overcommitted and
we had to rotate the counters. In particular it is the ratio between
PERF_FORMAT_TOTAL_TIME_ENABLED and PERF_FORMAT_TOTAL_TIME_RUNNING, and we
use that to scale up the count.
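
As a sketch, with both time fields in read_format the scaling is:

/* Scale a rotated counter back up the way perf stat does. */
uint64_t scaled_count(uint64_t raw, uint64_t enabled, uint64_t running)
{
	if (running == 0)
		return 0;  /* never scheduled: no estimate possible */
	return (uint64_t)((double)raw * enabled / running);
}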




* Re: Problem with perf hardware counters grouping
From: Vince Weaver @ 2011-09-01 15:21 UTC
  To: Peter Zijlstra; +Cc: Mike Hommey, linux-kernel

On Thu, 1 Sep 2011, Peter Zijlstra wrote:

> What happens with your >3 case is that while the group is valid and
> could fit on the PMU, it won't fit at runtime because the NMI watchdog
> is taking one and won't budge (cpu-pinned counters have precedence over
> any other kind), effectively starving your group of PMU runtime.
> 
> Also, we should fix that return to say -EINVAL or so.

UGH!  I just noticed this problem yesterday and was meaning to track it 
down.

This obviously causes PAPI to fail if you try to use the maximum number of 
counters.  Instead of getting EINVAL at open time or even at start time, 
you just silently read all zeros at read time, and by then it's too late 
to do anything useful about the problem because you just missed measuring 
what you were trying to.

Is there any good workaround, or do we have to fall back to trying to 
start/read/stop every proposed event set to make sure it's valid?

This is going to seriously impact performance, and perf_event performance 
is pretty bad to begin with.  The whole reason I was writing the tests to 
trigger this is because PAPI users are complaining that perf_event 
overhead is roughly twice that of perfctr or perfmon2, which I've verified 
experimentally.

Vince



* Re: Problem with perf hardware counters grouping
From: Peter Zijlstra @ 2011-09-01 16:41 UTC
  To: Vince Weaver; +Cc: Mike Hommey, linux-kernel

On Thu, 2011-09-01 at 11:21 -0400, Vince Weaver wrote:
> On Thu, 1 Sep 2011, Peter Zijlstra wrote:
> 
> > What happens with your >3 case is that while the group is valid and
> > could fit on the PMU, it won't fit at runtime because the NMI watchdog
> > is taking one and won't budge (cpu-pinned counters have precedence over
> > any other kind), effectively starving your group of PMU runtime.

> UGH!  I just noticed this problem yesterday and was meaning to track it 
> down.
> 
> This obviously causes PAPI to fail if you try to use the maximum number of 
> counters.  Instead of getting EINVAL at open time or even at start time, 
> you just silently read all zeros at read time, and by then it's too late 
> to do anything useful about the problem because you just missed measuring 
> what you were trying to.
> 
> Is there any good workaround, or do we have to fall back to trying to 
> start/read/stop every proposed event set to make sure it's valid?

I guess my first question is going to be, how do you know what the
maximum number of counters is in the first place?


> This is going to seriously impact performance, and perf_event performance 
> is pretty bad to begin with.  The whole reason I was writing the tests to 
> trigger this is because PAPI users are complaining that perf_event 
> overhead is roughly twice that of perfctr or perfmon2, which I've verified 
> experimentally.

Yeah, you keep saying this, where does it come from? Only the lack of
userspace rdpmc?


* Re: Problem with perf hardware counters grouping
From: Vince Weaver @ 2011-09-01 17:16 UTC
  To: Peter Zijlstra; +Cc: Mike Hommey, linux-kernel

On Thu, 1 Sep 2011, Peter Zijlstra wrote:
> > Is there any good workaround, or do we have to fall back to trying to 
> > start/read/stop every proposed event set to make sure it's valid?
> 
> I guess my first question is going to be, how do you know what the
> maximum number of counters is in the first place?

The use case where this comes up most easily is adding events to an
eventset one at a time until failure.  Then you assume failure - 1 is
the number available.  So this boils down to doing that many
perf_event_open()/close() calls.  This obviously fails in the current
watchdog-timer case.
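
A minimal sketch of that probing loop (illustrative only, using the
perf_event_open() wrapper from the test program earlier in the thread;
this is exactly the step the watchdog breaks, since the fourth open
still succeeds):

/* Open identical hardware events into one group until open fails,
 * then close everything and report how many were accepted. */
int probe_group_size(void)
{
	int fds[32], n = 0, leader = -1, count;

	while (n < 32) {
		struct perf_event_attr pe;
		int fd;

		memset(&pe, 0, sizeof(pe));
		pe.type = PERF_TYPE_HARDWARE;
		pe.size = sizeof(pe);
		pe.config = PERF_COUNT_HW_INSTRUCTIONS;
		pe.disabled = (leader == -1);
		fd = perf_event_open(&pe, 0, -1, leader, 0);
		if (fd < 0)
			break;
		if (leader == -1)
			leader = fd;
		fds[n++] = fd;
	}
	count = n;
	while (n > 0)
		close(fds[--n]);
	return count;
}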

The other way to know is to query libpfm4, which "knows" the number of
counters available on each CPU.  PAPI uses this as a heuristic, not as a
hard limit, but it can also lead to the problem occurring if the test
tries the limit, the perf_event_open() succeeds... and then fails upon
read.

Does the perf tool work around this in some way?  

> > This is going to seriously impact performance, and perf_event performance 
> > is pretty bad to begin with.  The whole reason I was writing the tests to 
> > trigger this is because PAPI users are complaining that perf_event 
> > overhead is roughly twice that of perfctr or perfmon2, which I've verified 
> > experimentally.
> 
> Yeah, you keep saying this, where does it come from? Only the lack of
> userspace rdpmc?

That's part of it.  I've been working on isolating this, but for a fair
comparison it involves writing low-level code that accesses perf_event,
perfctr, and perfmon2 directly at the syscall level, and as you can
imagine that's not easy or fun.  It's also tricky to profile the
perf_event code using perf_events.

Vince
vweaver1@eecs.utk.edu



* Re: Problem with perf hardware counters grouping
From: Vince Weaver @ 2011-09-01 17:24 UTC
  To: Peter Zijlstra; +Cc: Mike Hommey, linux-kernel

On Thu, 1 Sep 2011, Vince Weaver wrote:

> 
> Does the perf tool work around this in some way?  

oh I see, the perf tool, at least when using perf stat -e with multiple
events, opens each event individually.  PAPI uses group leaders, which is
why we see the issue.

Vince
vweaver1@eecs.utk.edu



* Re: Problem with perf hardware counters grouping
From: Vince Weaver @ 2011-09-06 19:43 UTC
  To: Peter Zijlstra; +Cc: Mike Hommey, linux-kernel, dzickus

On Thu, 1 Sep 2011, Peter Zijlstra wrote:
> 
> Both those have 4 generic hardware counters, but x86 defaults to
> enabling the NMI watchdog which takes one, leaving you with 3 (try: echo
> 0 > /proc/sys/kernel/nmi_watchdog). If you had looked at your dmesg
> output you'd have found lines like:
> 
>   NMI watchdog enabled, takes one hw-pmu counter.
> 
> The code can only check if the group as a whole could possibly fit on a
> PMU, which is where your failure on >4 comes from.
> 
> What happens with your >3 case is that while the group is valid and
> could fit on the PMU, it won't fit at runtime because the NMI watchdog
> is taking one and won't budge (cpu-pinned counters have precedence over
> any other kind), effectively starving your group of PMU runtime.
> 
> Also, we should fix that return to say -EINVAL or so.

So any hope of a fix on this?  

As mentioned this is a serious problem for PAPI and I am trying to find a 
good way to enable a workaround in a way that doesn't punish people who 
have the watchdog disabled.

Is there a "stable" API method of determining if the nmi_watchdog is 
present and stealing a perf-counter?  

If I find a "1" in /proc/sys/kernel/nmi_watchdog can I assume a counter is 
being stolen?
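
A minimal sketch of such a check, assuming the proc file simply reads
back "0" or "1":

#include <stdio.h>

/* Returns 1 if the NMI watchdog is on (and, per this thread,
 * presumably holding one hardware counter), 0 if it is off,
 * -1 if the file cannot be read. */
int nmi_watchdog_active(void)
{
	FILE *f = fopen("/proc/sys/kernel/nmi_watchdog", "r");
	int v = -1;

	if (f) {
		if (fscanf(f, "%d", &v) != 1)
			v = -1;
		fclose(f);
	}
	return v;
}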

Vince



* Re: Problem with perf hardware counters grouping
From: Don Zickus @ 2011-09-06 20:22 UTC
  To: Vince Weaver; +Cc: Peter Zijlstra, Mike Hommey, linux-kernel

On Tue, Sep 06, 2011 at 03:43:09PM -0400, Vince Weaver wrote:
> On Thu, 1 Sep 2011, Peter Zijlstra wrote:
> > 
> > Both those have 4 generic hardware counters, but x86 defaults to
> > enabling the NMI watchdog which takes one, leaving you with 3 (try: echo
> > 0 > /proc/sys/kernel/nmi_watchdog). If you had looked at your dmesg
> > output you'd have found lines like:
> > 
> >   NMI watchdog enabled, takes one hw-pmu counter.
> > 
> > The code can only check if the group as a whole could possibly fit on a
> > PMU, which is where your failure on >4 comes from.
> > 
> > What happens with your >3 case is that while the group is valid and
> > could fit on the PMU, it won't fit at runtime because the NMI watchdog
> > is taking one and won't budge (cpu-pinned counters have precedence over
> > any other kind), effectively starving your group of PMU runtime.
> > 
> > Also, we should fix that return to say -EINVAL or so.
> 
> So any hope of a fix on this?  
> 
> As mentioned this is a serious problem for PAPI and I am trying to find a 
> good way to enable a workaround in a way that doesn't punish people who 
> have the watchdog disabled.
> 
> Is there a "stable" API method of determining if the nmi_watchdog is 
> present and stealing a perf-counter?  
> 
> If I find a "1" in /proc/sys/kernel/nmi_watchdog can I assume a counter is 
> being stolen?

Short answer: yes

Long answer: all that means is the nmi_watchdog is running on some
counter, hopefully the hardware-based performance counters.  In some rare
cases where the hardware isn't available it may fall back to software-based
counters.  But if the hardware isn't available, I think PAPI will have
bigger issues than the nmi_watchdog. :-)

Cheers,
Don

