All of lore.kernel.org
 help / color / mirror / Atom feed
* v2 of comments on Performance Counters for Linux (PCL)
@ 2009-06-16 17:42 stephane eranian
  2009-06-22 11:48 ` Ingo Molnar
                   ` (19 more replies)
  0 siblings, 20 replies; 68+ messages in thread
From: stephane eranian @ 2009-06-16 17:42 UTC (permalink / raw)
  To: LKML
  Cc: Andrew Morton, Thomas Gleixner, Ingo Molnar, Robert Richter,
	Peter Zijlstra, Paul Mackerras, Andi Kleen, Maynard Johnson,
	Carl Love, Corey J Ashford, Philip Mucci, Dan Terpstra,
	perfmon2-devel

Hi,

Here is an updated version of my comments on PCL. Compared to the
previous version,
I have removed all the issues that were fixed or clarified. I have
kept all the issues and
open questions which I think are not solved yet and I added a few more.


I/ General API comments

 1/ System calls

     * ioctl()

       You have defined 5 ioctls() so far to operate on an existing event.
       I was under the impression that ioctl() should not be used except for
       drivers.

       How do you justify your usage of ioctl() in this context?

 2/ Grouping

       By design, an event can only be part of one group at a time. Events in
       a group are guaranteed to be active on the PMU at the same time. That
       means a group cannot have more events than there are available counters
       on the PMU. Tools may want to know the number of counters available in
       order to group their events accordingly, such that reliable ratios
       could be computed. It seems the only way to know this is by trial and
       error. This is not practical.

 3/ Multiplexing and system-wide

       Multiplexing is time-based and it is hooked into the timer tick. At
       every tick, the kernel tries to schedule another set of event groups.

       In tickless kernels if a CPU is idle, no timer tick is generated,
       therefore no multiplexing occurs. This is incorrect. It's not because
       the CPU is idle, that there aren't any interesting PMU events to measure.
       Parts of the CPU may still be active, e.g., caches and buses. And thus,
       it is expected that multiplexing still happens.

       You need to hook up the timer source for multiplexing to something else
       which is not affected by tickless. You cannot simply disable tickless
       during a measurement because you would not be measuring the system as
       it actually behaves.

  4/ Controlling group multiplexing

       Although multiplexing is exposed to users via the timing information,
       events may not necessarily be grouped at random by tools. Groups may
       not be ordered at random either.

       I know of tools which craft the sequence of groups carefully such that
       related events are in neighboring groups such that they measure similar
       parts of the execution. This way, you can mitigate the fluctuations
       introduced by multiplexing. In other words, some tools may want to
       control the order in which groups are scheduled on the PMU.

       You mentioned that groups are multiplexed in creation order. But which
       creation order? As far as I know, multiple distinct tools may be
       attaching to the same thread at the same time and their groups may be
       interleaved in the list. Therefore, I believe 'creation order' refers
       to the global group creation order which is only visible to the kernel.
       Each tool may see a different order. Let's take an example.

       Tool A creates group G1, G2, G3 and attaches them to thread T0. At the
       same time tool B creates group G4, G5. The actual global order may
       be: G1, G4, G2, G5, G3. This is what the kernel is going to multiplex.
       Each group will be multiplexed in the right order from the point of view
       of each tool. But there will be gaps. It would be nice to have a way
       to ensure that the sequence is either: G1, G2, G3, G4, G5 or G4, G5,
       G1, G2, G3. In other words, avoid the interleaving.

  5/ Mmaped count

       It is possible to read counts directly from user space for
self-monitoring
       threads. This leverages a HW capability present on some processors. On
       X86, this is possible via RDPMC.

       The full 64-bit count is constructed by combining the hardware value
       extracted with an assembly instruction and a base value made available
       thru the mmap. There is an atomic generation count available to deal
       with the race condition.

       I believe there is a problem with this approach given that the PMU
       is shared and that events can be multiplexed. That means that even
       though you are self-monitoring, events get replaced on the PMU. The
       assembly instruction is unaware of that, it reads a register
not an event.

       On x86, assume event A is hosted in counter 0, thus you need RDPMC(0)
       to extract the count. But then, the event is replaced by another one
       which reuses counter 0. At the user level, you will still use RDPMC(0)
       but it will read the HW value from a different event and combine it
       with a base count from another one.

       To avoid this, you need to pin the event so it stays in the PMU at
       all times. Now, here is something unclear to me. Pinning does not
       mean stay in the SAME register, it means the event stays on the PMU
       but it can possibly change register. To prevent that, I believe you need
       to also set exclusive so that no other group can be scheduled, and thus
       possibly use the same counter.

       Looks like this is the only way you can make this actually work.
       Not setting pinned+exclusive, is another pitfall in which many people
       will fall into.

  6/ Group scheduling

       Looking at the existing code, it seems to me there is a risk of
       starvation for groups, i.e., groups never scheduled on the PMU.

       My understanding of the scheduling algorithm is:

               - first try to  schedule pinned groups. If a pinned group
                 fails, put it in error mode. read() will fail until the
                 group gets another chance at being scheduled.

               - then try to schedule the remaining groups. If a group fails
                 just skip it.

       If the group list does not change, then certain groups may always fail.
       However, the ordering of the list changes because at every tick, it is
       rotated. The head becomes the tail. Therefore, each group eventually gets
       the first position and therefore gets the full PMU to assign its events.

       This works as long as there is a guarantee the list will ALWAYS
rotate. If
       a thread does not run long enough for a tick, it may never rotate.

  7/ Group validity checking

       At the user level, an application is only concerned with events
and grouping
       of those events. The assignment logic is performed by the kernel.

       For a group to be scheduled, all its events must be compatible
with each other,
       otherwise the group will never be scheduled. It is not clear to
me when that
       sanity check will be performed if I create the group such that
it is stopped.

       If the group goes all the way to scheduling, it will never be
scheduled. Counts
       will be zero and the users will have no idea why. If the group
is put in error
       state, read will not be possible. But again, how will the user know why?


  8/ Generalized cache events

      In recent days, you have added support for what you call
'generalized cache events'.

      The log defines:
               new event type: PERF_TYPE_HW_CACHE

               This is a 3-dimensional space:
               { L1-D, L1-I, L2, ITLB, DTLB, BPU } x
               { load, store, prefetch } x
               { accesses, misses }

      Those generic events are then mapped by the kernel onto actual
PMU events if possible.

      I don't see any justification for adding this and especially in
the kernel.

      What's the motivation and goal of this?

      If you define generic events, you need to provide a clear
definition of what they are
      actually measuring. This is especially true for caches because
there are many cache
      events and many different behaviors.

      If the goal is to make comparisons easier. I believe this is
doomed to fail. Because
      different caches behave differently, events capture different
subtle things, e.g, HW
      prefetch vs. sw prefetch. If to actually understand what the
generic event is counting
      I need to know the mapping, then this whole feature is useless.

  9/ Group reading

      It is possible to start/stop an event group simply via ioctl()
on the group
      leader. However, it is not possible to read all the counts with a single
      with a single read() system call. That seems odd. Furhermore, I
believe you
      want reads to be as atomic as possible.

  10/ Event buffer minimal useful size

      As it stands, the buffer header occupies the first page, even though the
      buffer header struct is 32-byte long. That's a lot of precious
RLIMIT_MEMLOCK
      memory wasted.

      The actual buffer (data) starts at the next page (from builtin-top.c):

       static void mmap_read_counter(struct mmap_data *md)
       {
               unsigned int head = mmap_read_head(md);
               unsigned int old = md->prev;
               unsigned char *data = md->base + page_size;


       Given that the buffer "full" notification are sent on page
crossing boundaries,
       if the actual buffer payload size is 1 page, you are guaranteed
to have your
       samples overwritten.

       This leads me to believe that the minimal buffer size to get
useful data is 3 pages.
       This is per event group per thread. That puts a lot of pressure
on RLIMIT_MEMLOCK
       which is ususally set fairly low by distros.

   11/ Missing definitions for generic hardware events

       As soon as you define generic events, you need to provide a
clear and precise definition
       at to what they measure. This is crucial to make them useful. I
have not seen such a
       definition yet.

II/ X86 comments

  1/ Fixed counters on Intel

       You cannot simply fall back to generic counters if you cannot find
       a fixed counter. There are model-specific bugs, for instance
       UNHALTED_REFERENCE_CYCLES (0x013c), does not measure the same thing on
       Nehalem when it is used in fixed counter 2 or a generic counter. The
       same is true on Core.

       You cannot simply look at the event field code to determine whether
       this is an event supported by a fixed counters. You must look at the
       other fields such as edge, invert, cnt-mask. If those are present then
       you have to fall back to using a generic counter as fixed counters only
       support priv level filtering. As indicated above, though, programming
       UNHALTED_REFERENCE_CYCLES on a generic counter does not count the same
       thing, therefore you need to fail if filters other than priv levels are
       present on this event.

  2/ Event knowledge missing

       There are constraints on events in Intel processors. Different
constraints
       do exist on AMD64 processors, especially with uncore-releated events.

       In your model, those need to be taken care of by the kernel. Should the
       kernel make the wrong decision, there would be no work-around for user
       tools. Take the example I outlined just above with Intel fixed counters.

       The current code-base does not have any constrained event
support, therefore
       bogus counts may be returned depending on the event measured.

III/ Requests

  1/ Sampling period randomization

       It is our experience (on Itanium, for instance), that for certain
       sampling measurements, it is beneficial to randomize the sampling
       period a bit. This is in particular the case when sampling on an
       event that happens very frequently and which is not related to
       timing, e.g., branch_instructions_retired. Randomization helps mitigate
       the bias. You do not need something sophisticated. But when you are using
       a kernel-level sampling buffer, you need to have the kernel randomize.
       Randomization needs to be supported per event.

IV/ Open questions

  1/ Support for model-specific uncore PMU monitoring capabilities

       Recent processors have multiple PMUs. Typically one per core and but
       also one at the socket level, e.g., Intel Nehalem. It is expected that
       this API will provide access to these PMU as well.

       It seems like with the current API, raw events for those PMU would need
       a new architecture-specific type as the event encoding by itself may
       not be enough to disambiguate between a core and uncore PMU event.

       How are those events going to be supported?

  2/ Features impacting all counters

       On some PMU models, e.g., Itanium, they are certain features which have
       an influence on all counters that are active. For instance, there is a
       way to restrict monitoring to a range of continuous code or data
       addresses using both some PMU registers and the debug registers.

       Given that the API exposes events (counters) as independent of each
       other, I wonder how range restriction could be implemented.

       Similarly, on Itanium, there are global behaviors. For instance, on
       counter overflow the entire PMU freezes all at once. That seems to be
       contradictory with the design of the API which creates the illusion of
       independence.

       What solutions do you propose?

  3/ AMD IBS

       How is AMD IBS going to be implemented?

       IBS has two separate sets of registers. One to capture fetch related
       data and another one to capture instruction execution data. For each,
       there is one config register but multiple data registers. In each mode,
       there is a specific sampling period and IBS can interrupt.

       It looks like you could define two pseudo events or event types and then
       define a new record_format and read_format.  That formats would only be
       valid for an IBS event.

       Is that how you intend to support IBS?

  4/ Intel PEBS

       Since Netburst-based processors, Intel PMUs support a hardware sampling
       buffer mechanism called PEBS.

       PEBS really became useful with Nehalem.

       Not all events support PEBS. Up until Nehalem, only one counter supported
       PEBS (PMC0). The format of the hardware buffer has changed between Core
       and Nehalem. It is not yet architected, thus it can still evolve with
       future PMU models.

       On Nehalem, there is a new PEBS-based feature called Load Latency
       Filtering which captures where data cache misses occur
       (similar to Itanium D-EAR). Activating this feature requires setting a
       latency threshold hosted in a separate PMU MSR.

       On Nehalem, given that all 4 generic counters support PEBS, the
       sampling buffer may contain samples generated by any of the 4 counters.
       The buffer includes a bitmask of registers to determine the source
       of the samples. Multiple bits may be set in the bitmask.


       How PEBS will be supported for this new API?

  5/ Intel Last Branch Record (LBR)

       Intel processors since Netburst have a cyclic buffer hosted in
       registers which can record taken branches. Each taken branch is stored
       into a pair of LBR registers (source, destination). Up until Nehalem,
       there was not filtering capabilities for LBR. LBR is not an architected
       PMU feature.

       There is no counter associated with LBR. Nehalem has a LBR_SELECT MSR.
       However there are some constraints on it given it is shared by threads.

       LBR is only useful when sampling and therefore must be combined with a
       counter. LBR must also be configured to freeze on PMU interrupt.

       How is LBR going to be supported?

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: v2 of comments on Performance Counters for Linux (PCL)
  2009-06-16 17:42 v2 of comments on Performance Counters for Linux (PCL) stephane eranian
@ 2009-06-22 11:48 ` Ingo Molnar
  2009-06-22 11:49 ` I.1 - System calls - ioctl Ingo Molnar
                   ` (18 subsequent siblings)
  19 siblings, 0 replies; 68+ messages in thread
From: Ingo Molnar @ 2009-06-22 11:48 UTC (permalink / raw)
  To: eranian
  Cc: LKML, Andrew Morton, Thomas Gleixner, Robert Richter,
	Peter Zijlstra, Paul Mackerras, Andi Kleen, Maynard Johnson,
	Carl Love, Corey J Ashford, Philip Mucci, Dan Terpstra,
	perfmon2-devel


hi Stephane,

Thanks for the extensive feedback! Your numerous comments cover 
twenty sub-topics, so we've tabulated the summary of Peter's and my 
replies, to give a quick overview:

-------------------------------------------------------------------------
     Topic/question you raised               # Our (current) reply
-------------------------------------------------------------------------
  I/ General API comments
......
  1/ System calls - ioctl                    # agree, willfix
  2/ Grouping                                # agree, questions asked
  3/ Multiplexing and system-wide            # agree, disagree on severity
  4/ Controlling group multiplexing          # agree, disagree on relevance
  5/ Mmaped count                            # disagree
  6/ Group scheduling                        # agree, disagree on severity
  7/ Group validity checking                 # questions answered
  8/ Generalized cache events                # disagree
  9/ Group reading                           # already possible - extend on need
 10/ Event buffer minimal useful size        # questions answered
 11/ Missing definitions for generic events  # reply question asked

 II/ X86 comments
......
  1/ Fixed counters on Intel                 # agree, willfix
  2/ Event knowledge missing                 # disagree

III/ Requests
......
  1/ Sampling period randomization           # support possible, outlined

 IV/ Open questions
......
  1/ Support for model-specific uncore PMU   # support possible, outlined
  2/ Features impacting all counters         # support possible, outlined
  3/ AMD IBS                                 # support possible, outlined
  4/ Intel PEBS                              # support possible, outlined
  5/ Intel Last Branch Record (LBR)          # support possible, outlined
-------------------------------------------------------------------------

( We'll send the individual replies in separate mails to keep mail
  size manageable. )

At the outset, it appears that there's two major themes to your 
questions (with a few other topics as well which i dont mention 
here):

 - Expressing/handling constraints with certain PMU features
 - Supporting various non-counter PMU features

In general we'd like to observe that our primary design goal is to 
offer a coherent, comprehensive and still simple to use performance 
measurement and profiling framework for Linux. We also provide an 
integrated tool in tools/perf/ to give a complete solution.

Many of the perfcounters features use no PMU at all, and in fact 
it's entirely possible to use most 'perf' features and get a 
meaningful analysis/profile of the system without any PMU.

In other words, we consider the PMU to be the means as to enhance 
perfcounters, not the end goal. A PMU will make a good difference to 
quality and precision, but we do not let the sanity of our 
interfaces be dictated by PMU feature quirkiness.

We cherry-pick the PMU features that look most useful, and expose 
them via rich performance measurement abstractions. It does not mean 
that we strive for immediately exposing _all_ PMU features and drown 
users in lots of conflicting hw facilities (different on each 
platform) as quickly as possible.

Having said that, we do think that pretty much any sane PMU feature
_can_ be supported via perfcounters (and most insane ones as well)
and that the limit is 'do we want to', not 'can we'.

Even heavily constrained PMUs can be supported: PowerPC, which is a
CPU family with heavily constrained PMU variants is widely supported
by perfcounters here and today.

	Ingo, Peter

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: I.1 - System calls - ioctl
  2009-06-16 17:42 v2 of comments on Performance Counters for Linux (PCL) stephane eranian
  2009-06-22 11:48 ` Ingo Molnar
@ 2009-06-22 11:49 ` Ingo Molnar
  2009-06-22 12:58   ` Christoph Hellwig
  2009-06-22 11:50 ` I.2 - Grouping Ingo Molnar
                   ` (17 subsequent siblings)
  19 siblings, 1 reply; 68+ messages in thread
From: Ingo Molnar @ 2009-06-22 11:49 UTC (permalink / raw)
  To: eranian
  Cc: LKML, Andrew Morton, Thomas Gleixner, Robert Richter,
	Peter Zijlstra, Paul Mackerras, Andi Kleen, Maynard Johnson,
	Carl Love, Corey J Ashford, Philip Mucci, Dan Terpstra,
	perfmon2-devel


> I/ General API comments
>
> 1/ System calls
>
>   * ioctl()
>
>    You have defined 5 ioctls() so far to operate on an existing
>    event. I was under the impression that ioctl() should not be
>    used except for drivers.
>
> How do you justify your usage of ioctl() in this context?

We can certainly do a separate sys_perf_counter_ctrl() syscall - and
we will do that if people think the extra syscall slot is worth it
in this case.

The (mild) counter-argument so far was that the current ioctls are
very simple over "IO" attributes of counters:

 - enable
 - disable
 - reset
 - refresh
 - set-period

So they could be considered 'IO controls' in the classic sense and
act as a (mild) exception to the 'dont use ioctls' rule.

They are not some weird tacked-on syscall functionality - they
modify the IO properties of counters: on/off, value and rate. If
they go beyond that we'll put it all into a separate syscall and
deprecate the ioctl (which will have a relatively short half-time
due to the tools being hosted in the kernel repo).

This could happen right now in fact, if people think it's worth it.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: I.2 - Grouping
  2009-06-16 17:42 v2 of comments on Performance Counters for Linux (PCL) stephane eranian
  2009-06-22 11:48 ` Ingo Molnar
  2009-06-22 11:49 ` I.1 - System calls - ioctl Ingo Molnar
@ 2009-06-22 11:50 ` Ingo Molnar
  2009-06-22 19:45   ` stephane eranian
                     ` (2 more replies)
  2009-06-22 11:51 ` I.3 - Multiplexing and system-wide Ingo Molnar
                   ` (16 subsequent siblings)
  19 siblings, 3 replies; 68+ messages in thread
From: Ingo Molnar @ 2009-06-22 11:50 UTC (permalink / raw)
  To: eranian
  Cc: LKML, Andrew Morton, Thomas Gleixner, Robert Richter,
	Peter Zijlstra, Paul Mackerras, Andi Kleen, Maynard Johnson,
	Carl Love, Corey J Ashford, Philip Mucci, Dan Terpstra,
	perfmon2-devel

> 2/ Grouping
>
> By design, an event can only be part of one group at a time.
> Events in a group are guaranteed to be active on the PMU at the
> same time. That means a group cannot have more events than there
> are available counters on the PMU. Tools may want to know the
> number of counters available in order to group their events
> accordingly, such that reliable ratios could be computed. It seems
> the only way to know this is by trial and error. This is not
> practical.

Groups are there to support heavily constrained PMUs, and for them
this is the only way, as there is no simple linear expression for
how many counters one can load on the PMU.

The ideal model to tooling is relatively independent PMU registers
(counters) with little constraints - most modern CPUs meet that
model.

All the existing tooling (tools/perf/) operates on that model and
this leads to easy programmability and flexible results. This model
needs no grouping of counters.

Could you please cite specific examples in terms of tools/perf/?
What feature do you think needs to know more about constraints? What
is the specific win in precision we could achieve via that?

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: I.3 - Multiplexing and system-wide
  2009-06-16 17:42 v2 of comments on Performance Counters for Linux (PCL) stephane eranian
                   ` (2 preceding siblings ...)
  2009-06-22 11:50 ` I.2 - Grouping Ingo Molnar
@ 2009-06-22 11:51 ` Ingo Molnar
  2009-06-22 11:51 ` I.4 - Controlling group multiplexing Ingo Molnar
                   ` (15 subsequent siblings)
  19 siblings, 0 replies; 68+ messages in thread
From: Ingo Molnar @ 2009-06-22 11:51 UTC (permalink / raw)
  To: eranian
  Cc: LKML, Andrew Morton, Thomas Gleixner, Robert Richter,
	Peter Zijlstra, Paul Mackerras, Andi Kleen, Maynard Johnson,
	Carl Love, Corey J Ashford, Philip Mucci, Dan Terpstra,
	perfmon2-devel

> 3/ Multiplexing and system-wide
>
> Multiplexing is time-based and it is hooked into the timer tick.
> At every tick, the kernel tries to schedule another set of event
> groups.
>
> In tickless kernels if a CPU is idle, no timer tick is generated,
> therefore no multiplexing occurs. This is incorrect. It's not
> because the CPU is idle, that there aren't any interesting PMU
> events to measure. Parts of the CPU may still be active, e.g.,
> caches and buses. And thus, it is expected that multiplexing still
> happens.
>
> You need to hook up the timer source for multiplexing to something
> else which is not affected by tickless. You cannot simply disable
> tickless during a measurement because you would not be measuring
> the system as it actually behaves.

There is only a very small difference between inhibiting nohz and
waking up the machine using timers on the same interval.

Using the scheduler tick to round-robin counters is a first-level
approximation we use: it in essence corresponds to multiplexing
counters along a cycle cost metric. If there's sufficient interest
we can drive the round-robining by the PMU itself: this makes it HZ
and dynticks invariant.

The simplified multiplexing we offer now certainly works well enough
in practice - if there's interest we can expand it.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: I.4 - Controlling group multiplexing
  2009-06-16 17:42 v2 of comments on Performance Counters for Linux (PCL) stephane eranian
                   ` (3 preceding siblings ...)
  2009-06-22 11:51 ` I.3 - Multiplexing and system-wide Ingo Molnar
@ 2009-06-22 11:51 ` Ingo Molnar
  2009-06-22 11:52 ` I.5 - Mmaped count Ingo Molnar
                   ` (14 subsequent siblings)
  19 siblings, 0 replies; 68+ messages in thread
From: Ingo Molnar @ 2009-06-22 11:51 UTC (permalink / raw)
  To: eranian
  Cc: LKML, Andrew Morton, Thomas Gleixner, Robert Richter,
	Peter Zijlstra, Paul Mackerras, Andi Kleen, Maynard Johnson,
	Carl Love, Corey J Ashford, Philip Mucci, Dan Terpstra,
	perfmon2-devel

> 4/ Controlling group multiplexing
>
> Although multiplexing is exposed to users via the timing
> information, events may not necessarily be grouped at random by
> tools. Groups may not be ordered at random either.
>
> I know of tools which craft the sequence of groups carefully such
> that related events are in neighboring groups such that they
> measure similar parts of the execution. This way, you can mitigate
> the fluctuations introduced by multiplexing. In other words, some
> tools may want to control the order in which groups are scheduled
> on the PMU.
>
> You mentioned that groups are multiplexed in creation order. But
> which creation order? As far as I know, multiple distinct tools
> may be attaching to the same thread at the same time and their
> groups may be interleaved in the list. Therefore, I believe
> 'creation order' refers to the global group creation order which
> is only visible to the kernel. Each tool may see a different
> order. Let's take an example.
>
> Tool A creates group G1, G2, G3 and attaches them to thread T0. At
> the same time tool B creates group G4, G5. The actual global order
> may be: G1, G4, G2, G5, G3. This is what the kernel is going to
> multiplex. Each group will be multiplexed in the right order from
> the point of view of each tool. But there will be gaps. It would
> be nice to have a way to ensure that the sequence is either: G1,
> G2, G3, G4, G5 or G4, G5, G1, G2, G3. In other words, avoid the
> interleaving.

Since its all sampling, what is the statistical relevance of
scheduling groups back to back when interleaved with other groups?

I appreciate people might be wanting this, but is that wanting
justified?

That is, A1 A2 B1 B2 vs A1 B1 A2 B2, what statistically relevant
feature does one have over the other, esp. since the RR interval is
outside of programm control.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: I.5 - Mmaped count
  2009-06-16 17:42 v2 of comments on Performance Counters for Linux (PCL) stephane eranian
                   ` (4 preceding siblings ...)
  2009-06-22 11:51 ` I.4 - Controlling group multiplexing Ingo Molnar
@ 2009-06-22 11:52 ` Ingo Molnar
  2009-06-22 12:25   ` stephane eranian
  2009-06-22 11:53 ` I.6 - Group scheduling Ingo Molnar
                   ` (13 subsequent siblings)
  19 siblings, 1 reply; 68+ messages in thread
From: Ingo Molnar @ 2009-06-22 11:52 UTC (permalink / raw)
  To: eranian
  Cc: LKML, Andrew Morton, Thomas Gleixner, Robert Richter,
	Peter Zijlstra, Paul Mackerras, Andi Kleen, Maynard Johnson,
	Carl Love, Corey J Ashford, Philip Mucci, Dan Terpstra,
	perfmon2-devel

> 5/ Mmaped count
>
> It is possible to read counts directly from user space for
> self-monitoring threads. This leverages a HW capability present on
> some processors. On X86, this is possible via RDPMC.
>
> The full 64-bit count is constructed by combining the hardware
> value extracted with an assembly instruction and a base value made
> available thru the mmap. There is an atomic generation count
> available to deal with the race condition.
>
> I believe there is a problem with this approach given that the PMU
> is shared and that events can be multiplexed. That means that even
> though you are self-monitoring, events get replaced on the PMU.
> The assembly instruction is unaware of that, it reads a register
> not an event.
>
> On x86, assume event A is hosted in counter 0, thus you need
> RDPMC(0) to extract the count. But then, the event is replaced by
> another one which reuses counter 0. At the user level, you will
> still use RDPMC(0) but it will read the HW value from a different
> event and combine it with a base count from another one.
>
> To avoid this, you need to pin the event so it stays in the PMU at
> all times. Now, here is something unclear to me. Pinning does not
> mean stay in the SAME register, it means the event stays on the
> PMU but it can possibly change register. To prevent that, I
> believe you need to also set exclusive so that no other group can
> be scheduled, and thus possibly use the same counter.
>
> Looks like this is the only way you can make this actually work.
> Not setting pinned+exclusive, is another pitfall in which many
> people will fall into.

   do {
     seq = pc->lock;

     barrier()
     if (pc->index) {
       count = pmc_read(pc->index - 1);
       count += pc->offset;
     } else
       goto regular_read;

     barrier();
   } while (pc->lock != seq);

We don't see the hole you are referring to. The sequence lock
ensures you get a consistent view.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: I.6 - Group scheduling
  2009-06-16 17:42 v2 of comments on Performance Counters for Linux (PCL) stephane eranian
                   ` (5 preceding siblings ...)
  2009-06-22 11:52 ` I.5 - Mmaped count Ingo Molnar
@ 2009-06-22 11:53 ` Ingo Molnar
  2009-06-22 11:54 ` I.7 - Group validity checking Ingo Molnar
                   ` (12 subsequent siblings)
  19 siblings, 0 replies; 68+ messages in thread
From: Ingo Molnar @ 2009-06-22 11:53 UTC (permalink / raw)
  To: eranian
  Cc: LKML, Andrew Morton, Thomas Gleixner, Robert Richter,
	Peter Zijlstra, Paul Mackerras, Andi Kleen, Maynard Johnson,
	Carl Love, Corey J Ashford, Philip Mucci, Dan Terpstra,
	perfmon2-devel

> 6/ Group scheduling
>
> Looking at the existing code, it seems to me there is a risk of
> starvation for groups, i.e., groups never scheduled on the PMU.
>
> My understanding of the scheduling algorithm is:
>
> - first try to �schedule pinned groups. If a pinned group
>   fails, put it in error mode. read() will fail until the
>   group gets another chance at being scheduled.
>
> - then try to schedule the remaining groups. If a group fails
>   just skip it.
>
> If the group list does not change, then certain groups may always
> fail. However, the ordering of the list changes because at every
> tick, it is rotated. The head becomes the tail. Therefore, each
> group eventually gets the first position and therefore gets the
> full PMU to assign its events.
>
> This works as long as there is a guarantee the list will ALWAYS
> rotate. If a thread does not run long enough for a tick, it may
> never rotate.

You need to ensure the task never runs during the tick, this is a
statistical propery that is bound to untrue for any 'normal'
application.

This is similar to the old cputime hiding tricks. We don't think
this will be a problem, you really need to actively avoid the tick
to aggregate a significantly lower tick rate.

In any case this can also be excluded via a natural extension
mentioned above: scheduling based not on the scheduler tick but
based on one of the counters.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: I.7 - Group validity checking
  2009-06-16 17:42 v2 of comments on Performance Counters for Linux (PCL) stephane eranian
                   ` (6 preceding siblings ...)
  2009-06-22 11:53 ` I.6 - Group scheduling Ingo Molnar
@ 2009-06-22 11:54 ` Ingo Molnar
  2009-06-22 11:54 ` I.8 - Generalized cache events Ingo Molnar
                   ` (11 subsequent siblings)
  19 siblings, 0 replies; 68+ messages in thread
From: Ingo Molnar @ 2009-06-22 11:54 UTC (permalink / raw)
  To: eranian
  Cc: LKML, Andrew Morton, Thomas Gleixner, Robert Richter,
	Peter Zijlstra, Paul Mackerras, Andi Kleen, Maynard Johnson,
	Carl Love, Corey J Ashford, Philip Mucci, Dan Terpstra,
	perfmon2-devel

> 7/ Group validity checking
>
> At the user level, an application is only concerned with events
> and grouping of those events. The assignment logic is performed by
> the kernel.
>
> For a group to be scheduled, all its events must be compatible
> with each other, otherwise the group will never be scheduled. It
> is not clear to me when that sanity check will be performed if I
> create the group such that it is stopped.

The hardware implementation should verify that the group fits on the
PMU for every new group member.

Like said before, we think this and I.2/ are in conflict.

> If the group goes all the way to scheduling, it will never be
> scheduled. Counts will be zero and the users will have no idea
> why. If the group is put in error state, read will not be
> possible. But again, how will the user know why?

It should not be possible to create such a situation. Can you see it 
being possible?

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: I.8 - Generalized cache events
  2009-06-16 17:42 v2 of comments on Performance Counters for Linux (PCL) stephane eranian
                   ` (7 preceding siblings ...)
  2009-06-22 11:54 ` I.7 - Group validity checking Ingo Molnar
@ 2009-06-22 11:54 ` Ingo Molnar
  2009-06-22 11:55 ` I.9 - Group reading Ingo Molnar
                   ` (10 subsequent siblings)
  19 siblings, 0 replies; 68+ messages in thread
From: Ingo Molnar @ 2009-06-22 11:54 UTC (permalink / raw)
  To: eranian
  Cc: LKML, Andrew Morton, Thomas Gleixner, Robert Richter,
	Peter Zijlstra, Paul Mackerras, Andi Kleen, Maynard Johnson,
	Carl Love, Corey J Ashford, Philip Mucci, Dan Terpstra,
	perfmon2-devel

> 8/ Generalized cache events
>
>  In recent days, you have added support for what you call
> 'generalized cache events'.
>
>  The log defines:
> new event type: PERF_TYPE_HW_CACHE
>
>  This is a 3-dimensional space:
>  { L1-D, L1-I, L2, ITLB, DTLB, BPU } x
>  { load, store, prefetch } x
>  { accesses, misses }
>
> Those generic events are then mapped by the kernel onto actual PMU
> events if possible.
>
> I don't see any justification for adding this and especially in
> the kernel.
>
> What's the motivation and goal of this?
>
> If you define generic events, you need to provide a clear
> definition of what they are actually measuring. This is especially
> true for caches because there are many cache events and many
> different behaviors.
>
> If the goal is to make comparisons easier. I believe this is
> doomed to fail. Because different caches behave differently,
> events capture different subtle things, e.g, HW prefetch vs. sw
> prefetch. If to actually understand what the generic event is
> counting I need to know the mapping, then this whole feature is
> useless.

[ we since renamed L2 to LL ]

I beg to differ, subtle differences don't matter for big picture
performance indications.

If you measure significant last level cache misses by any metric,
you know you're hosed. Any work done to reduce this metric will
improve your workload.

Why do I need to care for the exact details of the event to make
valid use of this?

Sure, some people might be interested, and yes there are certainly
cases where you do want all detail available, but this interface
does not prohibit that usage - you can still use the full raw
spectrum of events, thousands of them if you so wish, easily
accessed both via the syscall and via the tools.

This categorization we added merely enables a somewhat simpler high
level view (that tries to be CPU model invariant) that suffices for
a lot of things.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: I.9 - Group reading
  2009-06-16 17:42 v2 of comments on Performance Counters for Linux (PCL) stephane eranian
                   ` (8 preceding siblings ...)
  2009-06-22 11:54 ` I.8 - Generalized cache events Ingo Molnar
@ 2009-06-22 11:55 ` Ingo Molnar
  2009-06-22 11:55 ` I.10 - Event buffer minimal useful size Ingo Molnar
                   ` (9 subsequent siblings)
  19 siblings, 0 replies; 68+ messages in thread
From: Ingo Molnar @ 2009-06-22 11:55 UTC (permalink / raw)
  To: eranian
  Cc: LKML, Andrew Morton, Thomas Gleixner, Robert Richter,
	Peter Zijlstra, Paul Mackerras, Andi Kleen, Maynard Johnson,
	Carl Love, Corey J Ashford, Philip Mucci, Dan Terpstra,
	perfmon2-devel

> 9/ Group reading
>
> It is possible to start/stop an event group simply via ioctl() on
> the group leader. However, it is not possible to read all the
> counts with a single with a single read() system call. That seems
> odd. Furhermore, I believe you want reads to be as atomic as
> possible.

If you want an atomic snapshot you can do it: disable the group,
read out the counts, enable the group.

But, as your other comment under I/5 indicates, there are ways to
read out the PMU directly, via RDPMC instructions. Those are not
atomic either if used for multiple counters. Is your argument that
they are thus useless?

But if there is a strong use-case we can add PERF_FORMAT_GROUP.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: I.10 - Event buffer minimal useful size
  2009-06-16 17:42 v2 of comments on Performance Counters for Linux (PCL) stephane eranian
                   ` (9 preceding siblings ...)
  2009-06-22 11:55 ` I.9 - Group reading Ingo Molnar
@ 2009-06-22 11:55 ` Ingo Molnar
  2009-06-22 11:56 ` I.11 - Missing definitions for generic events Ingo Molnar
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 68+ messages in thread
From: Ingo Molnar @ 2009-06-22 11:55 UTC (permalink / raw)
  To: eranian
  Cc: LKML, Andrew Morton, Thomas Gleixner, Robert Richter,
	Peter Zijlstra, Paul Mackerras, Andi Kleen, Maynard Johnson,
	Carl Love, Corey J Ashford, Philip Mucci, Dan Terpstra,
	perfmon2-devel

> 10/ Event buffer minimal useful size
>
> As it stands, the buffer header occupies the first page, even
> though the buffer header struct is 32-byte long. That's a lot of
> precious RLIMIT_MEMLOCK memory wasted.
>
>  The actual buffer (data) starts at the next page (from builtin-top.c):
>
> static void mmap_read_counter(struct mmap_data *md)
> {
>	unsigned int head = mmap_read_head(md);
>	unsigned int old = md->prev;
>	unsigned char *data = md->base + page_size;
>
> Given that the buffer "full" notification are sent on page
> crossing boundaries, if the actual buffer payload size is 1 page,
> you are guaranteed to have your samples overwritten.
>
> This leads me to believe that the minimal buffer size to get
> useful data is 3 pages. This is per event group per thread. That
> puts a lot of pressure on RLIMIT_MEMLOCK which is ususally set
> fairly low by distros.

Regarding notifications: you can actually tell it to generate
wakeups every 'n' events instead of at page boundaries. (See:
attr.wakeup_events).

Regarding locked pags: there is an extra per user mlock limit for
perf counters found in sysctl_perf_counter_mlock so it can be
configured independently. Furthermore, the accounting is done on a
per struct user basis, not on a per task basis - which makes the use
of locked pages far less of an issue to distros.

Regarding using 4K for the sampling data header: we are making use
of that property. We recently extended the header to include more
fields and having it writable - so the 4K MMU separation between
data header and data pages very much paid off already - and it could
provide more advantages in the future.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: I.11 - Missing definitions for generic events
  2009-06-16 17:42 v2 of comments on Performance Counters for Linux (PCL) stephane eranian
                   ` (10 preceding siblings ...)
  2009-06-22 11:55 ` I.10 - Event buffer minimal useful size Ingo Molnar
@ 2009-06-22 11:56 ` Ingo Molnar
  2009-06-22 14:54   ` stephane eranian
  2009-06-22 11:57 ` II.1 - Fixed counters on Intel Ingo Molnar
                   ` (7 subsequent siblings)
  19 siblings, 1 reply; 68+ messages in thread
From: Ingo Molnar @ 2009-06-22 11:56 UTC (permalink / raw)
  To: eranian
  Cc: LKML, Andrew Morton, Thomas Gleixner, Robert Richter,
	Peter Zijlstra, Paul Mackerras, Andi Kleen, Maynard Johnson,
	Carl Love, Corey J Ashford, Philip Mucci, Dan Terpstra,
	perfmon2-devel

> 11/ Missing definitions for generic hardware events
>
> As soon as you define generic events, you need to provide a clear
> and precise definition at to what they measure. This is crucial to
> make them useful. I have not seen such a definition yet.

Do you mean their names aren't clear enough? :-)

Please suggest alternatives and/or clearer definitions.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: II.1 - Fixed counters on Intel
  2009-06-16 17:42 v2 of comments on Performance Counters for Linux (PCL) stephane eranian
                   ` (11 preceding siblings ...)
  2009-06-22 11:56 ` I.11 - Missing definitions for generic events Ingo Molnar
@ 2009-06-22 11:57 ` Ingo Molnar
  2009-06-22 14:27   ` stephane eranian
  2009-06-22 11:57 ` II.2 - Event knowledge missing Ingo Molnar
                   ` (6 subsequent siblings)
  19 siblings, 1 reply; 68+ messages in thread
From: Ingo Molnar @ 2009-06-22 11:57 UTC (permalink / raw)
  To: eranian
  Cc: LKML, Andrew Morton, Thomas Gleixner, Robert Richter,
	Peter Zijlstra, Paul Mackerras, Andi Kleen, Maynard Johnson,
	Carl Love, Corey J Ashford, Philip Mucci, Dan Terpstra,
	perfmon2-devel

> II/ X86 comments
>
>  1/ Fixed counters on Intel
>
> You cannot simply fall back to generic counters if you cannot find
> a fixed counter. There are model-specific bugs, for instance
> UNHALTED_REFERENCE_CYCLES (0x013c), does not measure the same
> thing on Nehalem when it is used in fixed counter 2 or a generic
> counter. The same is true on Core.

This could be handled via a model specific quirk, if the erratum is
serious enough.

> You cannot simply look at the event field code to determine
> whether this is an event supported by a fixed counters. You must
> look at the other fields such as edge, invert, cnt-mask. If those
> are present then you have to fall back to using a generic counter
> as fixed counters only support priv level filtering. As indicated
> above, though, programming UNHALTED_REFERENCE_CYCLES on a generic
> counter does not count the same thing, therefore you need to fail
> if filters other than priv levels are present on this event.

Agreed, we'll fix this.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: II.2 - Event knowledge missing
  2009-06-16 17:42 v2 of comments on Performance Counters for Linux (PCL) stephane eranian
                   ` (12 preceding siblings ...)
  2009-06-22 11:57 ` II.1 - Fixed counters on Intel Ingo Molnar
@ 2009-06-22 11:57 ` Ingo Molnar
  2009-06-23 13:18   ` stephane eranian
  2009-06-22 11:58 ` III.1 - Sampling period randomization Ingo Molnar
                   ` (5 subsequent siblings)
  19 siblings, 1 reply; 68+ messages in thread
From: Ingo Molnar @ 2009-06-22 11:57 UTC (permalink / raw)
  To: eranian
  Cc: LKML, Andrew Morton, Thomas Gleixner, Robert Richter,
	Peter Zijlstra, Paul Mackerras, Andi Kleen, Maynard Johnson,
	Carl Love, Corey J Ashford, Philip Mucci, Dan Terpstra,
	perfmon2-devel

> 2/ Event knowledge missing
>
> There are constraints on events in Intel processors. Different
> constraints do exist on AMD64 processors, especially with
> uncore-releated events.

You raise the issue of uncore events in IV.1, but let us reply here
primarily.

Un-core counters and events seem to be somewhat un-interesting to
us. (Patches from those who find them interesting are welcome of
course!)

So they werent in the first line of PMU features to cherry-pick
from.

The main problem with uncore events is that they are per physical
package, and hence tying a piece of physical metric exposed via them
to a particular workload is hard - unless full-system analysis is
performed. 'Task driven' metrics seem far more useful to performance
analysis (and those are the preferred analysis method of most
user-space developers), as they allow particularized sampling and
allow the tight connection between workload and metric.

If, despite our expecations, uncore events prove to be useful,
popular and required elements of performance analysis, they can be
supported in perfcounters via various levels:

 - a special raw ID range on x86, only to per CPU counters. The
   low-level implementation reserves the uncore PMCs, so overlapping
   allocation (and interaction between the cores via the MSRs) is
   not possible.

 - generic enumeration with full tooling support, time-sharing and
   the whole set of features. The low-level backend would time-share
   the resource between interested CPUs.

There is no limitation in the perfcounters design that somehow makes
uncore events harder to support. The uncore counters _themselves_
are limited to begin with - so rich features cannot be offered on
top of them.

> In your model, those need to be taken care of by the kernel.
> Should the kernel make the wrong decision, there would be no
> work-around for user tools. Take the example I outlined just above
> with Intel fixed counters.

Yes. Why should there be a work-around for user tools if the fix is
for the kernel? We don't try to work around random driver bugs in userspace
either, we simply fix the kernel.

> The current code-base does not have any constrained event support,
> therefore bogus counts may be returned depending on the event
> measured.

Then we'll need to grow some when we run into them :-)

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: III.1 - Sampling period randomization
  2009-06-16 17:42 v2 of comments on Performance Counters for Linux (PCL) stephane eranian
                   ` (13 preceding siblings ...)
  2009-06-22 11:57 ` II.2 - Event knowledge missing Ingo Molnar
@ 2009-06-22 11:58 ` Ingo Molnar
  2009-06-22 11:58 ` IV.1 - Support for model-specific uncore PMU Ingo Molnar
                   ` (4 subsequent siblings)
  19 siblings, 0 replies; 68+ messages in thread
From: Ingo Molnar @ 2009-06-22 11:58 UTC (permalink / raw)
  To: eranian
  Cc: LKML, Andrew Morton, Thomas Gleixner, Robert Richter,
	Peter Zijlstra, Paul Mackerras, Andi Kleen, Maynard Johnson,
	Carl Love, Corey J Ashford, Philip Mucci, Dan Terpstra,
	perfmon2-devel

> III/ Requests
>
> 1/ Sampling period randomization
>
> It is our experience (on Itanium, for instance), that for certain
> sampling measurements, it is beneficial to randomize the sampling
> period a bit. This is in particular the case when sampling on an
> event that happens very frequently and which is not related to
> timing, e.g., branch_instructions_retired. Randomization helps
> mitigate the bias. You do not need something sophisticated. But
> when you are using a kernel-level sampling buffer, you need to
> have the kernel randomize. Randomization needs to be supported per
> event.

This is on the todo list, it should be pretty straight forward to
implement. Basically add a u64 randomize:1 flag and a u64 rand_size,
and add some noise to hwc->sample_period in perf_counter_overflow(),
just like the .freq path does.

In fact the auto-freq sampling counters effectively randomize
already, as the path of stabilization/convergence is never exactly
the same. We could add a little bit of constant noise to the freq
counter and it would auto-correct automatically.

Since you seem to care abut this sub-feature, would you be
interested in sending a patch for that? The essential steps are:

 - Take a new bit off perf_counter_attr::__reserved_1 and name
   it attr.randomize.

 - Inject a few bits of trivial noise into the period calculated
   in kernel/perf_counter.c:perf_adjust_period(). The best place
   would be to add it right before this line:

        hwc->sample_period = sample_period;

   If you add a small relative fuzz of below 1% to sample_period
   then the code will auto-correct.

 - All the tools deal just fine with variable periods already, so
   there's no tooling updates needed, beyond adding a --randomize
   flag to tools/perf/builtin-record.c.

Let us know if you need any help with any of this!

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: IV.1 - Support for model-specific uncore PMU
  2009-06-16 17:42 v2 of comments on Performance Counters for Linux (PCL) stephane eranian
                   ` (14 preceding siblings ...)
  2009-06-22 11:58 ` III.1 - Sampling period randomization Ingo Molnar
@ 2009-06-22 11:58 ` Ingo Molnar
  2009-06-22 11:59 ` IV.2 - Features impacting all counters Ingo Molnar
                   ` (3 subsequent siblings)
  19 siblings, 0 replies; 68+ messages in thread
From: Ingo Molnar @ 2009-06-22 11:58 UTC (permalink / raw)
  To: eranian
  Cc: LKML, Andrew Morton, Thomas Gleixner, Robert Richter,
	Peter Zijlstra, Paul Mackerras, Andi Kleen, Maynard Johnson,
	Carl Love, Corey J Ashford, Philip Mucci, Dan Terpstra,
	perfmon2-devel

> IV/ Open questions
>
> 1/ Support for model-specific uncore PMU monitoring capabilities
>
> Recent processors have multiple PMUs. Typically one per core and
> but also one at the socket level, e.g., Intel Nehalem. It is
> expected that this API will provide access to these PMU as well.
>
> It seems like with the current API, raw events for those PMU would
> need a new architecture-specific type as the event encoding by
> itself may not be enough to disambiguate between a core and uncore
> PMU event.
>
> How are those events going to be supported?

(Please see our reply to II.2.)

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: IV.2 - Features impacting all counters
  2009-06-16 17:42 v2 of comments on Performance Counters for Linux (PCL) stephane eranian
                   ` (15 preceding siblings ...)
  2009-06-22 11:58 ` IV.1 - Support for model-specific uncore PMU Ingo Molnar
@ 2009-06-22 11:59 ` Ingo Molnar
  2009-06-22 12:00 ` IV.3 - AMD IBS Ingo Molnar
                   ` (2 subsequent siblings)
  19 siblings, 0 replies; 68+ messages in thread
From: Ingo Molnar @ 2009-06-22 11:59 UTC (permalink / raw)
  To: eranian
  Cc: LKML, Andrew Morton, Thomas Gleixner, Robert Richter,
	Peter Zijlstra, Paul Mackerras, Andi Kleen, Maynard Johnson,
	Carl Love, Corey J Ashford, Philip Mucci, Dan Terpstra,
	perfmon2-devel

> 2/ Features impacting all counters
>
> On some PMU models, e.g., Itanium, they are certain features which
> have an influence on all counters that are active. For instance,
> there is a way to restrict monitoring to a range of continuous
> code or data addresses using both some PMU registers and the debug
> registers.
>
> Given that the API exposes events (counters) as independent of
> each other, I wonder how range restriction could be implemented.

A solution is to make it a per-counter attribute and fail to 
schedule multiple counters at the same time when these constraints 
differ.

> Similarly, on Itanium, there are global behaviors. For instance,
> on counter overflow the entire PMU freezes all at once. That seems
> to be contradictory with the design of the API which creates the
> illusion of independence.
>
> What solutions do you propose?

We propose the same solution as last time: live with the small
imprecisions caused by such hardware limitations. We submit that
natural workload noise is probably far bigger than any such effect.
We could certainly list it as a platform limitation.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: IV.3 - AMD IBS
  2009-06-16 17:42 v2 of comments on Performance Counters for Linux (PCL) stephane eranian
                   ` (16 preceding siblings ...)
  2009-06-22 11:59 ` IV.2 - Features impacting all counters Ingo Molnar
@ 2009-06-22 12:00 ` Ingo Molnar
  2009-06-22 14:08   ` [perfmon2] " Rob Fowler
  2009-06-22 19:17   ` stephane eranian
  2009-06-22 12:00 ` IV.4 - Intel PEBS Ingo Molnar
  2009-06-22 12:01 ` IV.5 - Intel Last Branch Record (LBR) Ingo Molnar
  19 siblings, 2 replies; 68+ messages in thread
From: Ingo Molnar @ 2009-06-22 12:00 UTC (permalink / raw)
  To: eranian
  Cc: LKML, Andrew Morton, Thomas Gleixner, Robert Richter,
	Peter Zijlstra, Paul Mackerras, Andi Kleen, Maynard Johnson,
	Carl Love, Corey J Ashford, Philip Mucci, Dan Terpstra,
	perfmon2-devel

> 3/ AMD IBS
>
> How is AMD IBS going to be implemented?
>
> IBS has two separate sets of registers. One to capture fetch
> related data and another one to capture instruction execution
> data. For each, there is one config register but multiple data
> registers. In each mode, there is a specific sampling period and
> IBS can interrupt.
>
> It looks like you could define two pseudo events or event types
> and then define a new record_format and read_format. That formats
> would only be valid for an IBS event.
>
> Is that how you intend to support IBS?

That is indeed one of the ways we thought of, not really nice, but
then, IBS is really weird, what were those AMD engineers thinking
:-)

The point is - weird hardware gets expressed as a ... weird feature
under perfcounters too. Not all hardware weirdnesses can be
engineered away.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: IV.4 - Intel PEBS
  2009-06-16 17:42 v2 of comments on Performance Counters for Linux (PCL) stephane eranian
                   ` (17 preceding siblings ...)
  2009-06-22 12:00 ` IV.3 - AMD IBS Ingo Molnar
@ 2009-06-22 12:00 ` Ingo Molnar
  2009-06-22 12:16   ` Andi Kleen
  2009-06-22 12:01 ` IV.5 - Intel Last Branch Record (LBR) Ingo Molnar
  19 siblings, 1 reply; 68+ messages in thread
From: Ingo Molnar @ 2009-06-22 12:00 UTC (permalink / raw)
  To: eranian
  Cc: LKML, Andrew Morton, Thomas Gleixner, Robert Richter,
	Peter Zijlstra, Paul Mackerras, Andi Kleen, Maynard Johnson,
	Carl Love, Corey J Ashford, Philip Mucci, Dan Terpstra,
	perfmon2-devel

> 4/ Intel PEBS
>
> Since Netburst-based processors, Intel PMUs support a hardware
> sampling buffer mechanism called PEBS.
>
> PEBS really became useful with Nehalem.
>
> Not all events support PEBS. Up until Nehalem, only one counter
> supported PEBS (PMC0). The format of the hardware buffer has
> changed between Core and Nehalem. It is not yet architected, thus
> it can still evolve with future PMU models.
>
> On Nehalem, there is a new PEBS-based feature called Load Latency
> Filtering which captures where data cache misses occur (similar to
> Itanium D-EAR). Activating this feature requires setting a latency
> threshold hosted in a separate PMU MSR.
>
> On Nehalem, given that all 4 generic counters support PEBS, the
> sampling buffer may contain samples generated by any of the 4
> counters. The buffer includes a bitmask of registers to determine
> the source of the samples. Multiple bits may be set in the
> bitmask.
>
> How PEBS will be supported for this new API?

Note, the relevance of PEBS (or IBS) should not be over-stated: for
example it fundamentally cannot do precise call-chain recording (it
only records the RIP, not any of the return frames), which removes
from its utility. Another limitation is that only a few basic
hardware event types are supported by PEBS.

Having said that, PEBS is a hardware sampling feature that is
definitely saner than AMD's IBS. There's two immediate incremental
uses of it in perfcounters:

 - it makes flat sampling lower overhead by avoiding an NMI for all
   sample points.

 - it makes flat sampled data more precise. (I.e. it can avoid the
   1-2 instructions 'skidding' of a sample position, for a handful
   of PEBS-capable events.)

As such its primary support form would be 'transparent enablement':
i.e. on those (relatively few) events that are PEBS supported it
would be enabled automatically, and would result in more precise
(and possibly, cheaper) samples.

No separate APIs are needed really - the kernel can abstract it away
and can provide the user what the user wants: good and fast samples.

Regarding demultiplexing on Nehalem: PEBS goes into the DS (Data
Store), and indeed on Nehalem all PEBS counters 'mix' their PEBS
records in the same stream of data. One possible model to support
them is to set the PEBS threshold to one, and hence generate an
interrupt for each PEBS record. At offset 0x90 of the PEBS record we
have a snapshot of the global status register:

  0x90   IA32_PERF_GLOBAL_STATUS

Which tells us that relative to the previous PEBS record in the DS
which counter overflowed. If this were not reliable, we could still
poll all active counters for overflows and get a occasionally
imprecise but still statistically meaningful and precise
demultiplexing.

As to enabling PEBS with the (CPU-)global latency recording filters,
we can do this transparantly for every PEBS supported event, or can
mandate PEBS scheduling when a PEBS only feature like load latency
is requested.

This means that for most purposes PEBS will be transparant.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: IV.5 - Intel Last Branch Record (LBR)
  2009-06-16 17:42 v2 of comments on Performance Counters for Linux (PCL) stephane eranian
                   ` (18 preceding siblings ...)
  2009-06-22 12:00 ` IV.4 - Intel PEBS Ingo Molnar
@ 2009-06-22 12:01 ` Ingo Molnar
  2009-06-22 20:02   ` stephane eranian
  19 siblings, 1 reply; 68+ messages in thread
From: Ingo Molnar @ 2009-06-22 12:01 UTC (permalink / raw)
  To: eranian
  Cc: LKML, Andrew Morton, Thomas Gleixner, Robert Richter,
	Peter Zijlstra, Paul Mackerras, Andi Kleen, Maynard Johnson,
	Carl Love, Corey J Ashford, Philip Mucci, Dan Terpstra,
	perfmon2-devel

> 5/ Intel Last Branch Record (LBR)
>
> Intel processors since Netburst have a cyclic buffer hosted in
> registers which can record taken branches. Each taken branch is
> stored into a pair of LBR registers (source, destination). Up
> until Nehalem, there was not filtering capabilities for LBR. LBR
> is not an architected PMU feature.
>
> There is no counter associated with LBR. Nehalem has a LBR_SELECT
> MSR. However there are some constraints on it given it is shared
> by threads.
>
> LBR is only useful when sampling and therefore must be combined
> with a counter. LBR must also be configured to freeze on PMU
> interrupt.
>
> How is LBR going to be supported?

If there's interest then one sane way to support it would be to
expose it as a new sampling format (PERF_SAMPLE_*).

Regarding the constraints - if we choose to expose the branch-type
filtering capabilities of Nehalem, then that puts a constraint on
counter scheduling: two counters with conflicting constraints should
not be scheduled at once, but should be time-shared via the usual
mechanism.

The typical use-case would be to have no or compatible LBR filter
attributes between counters though - so having the conflicts is not
an issue as long as it works according to the usual model.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: IV.4 - Intel PEBS
  2009-06-22 12:00 ` IV.4 - Intel PEBS Ingo Molnar
@ 2009-06-22 12:16   ` Andi Kleen
  0 siblings, 0 replies; 68+ messages in thread
From: Andi Kleen @ 2009-06-22 12:16 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: eranian, LKML, Andrew Morton, Thomas Gleixner, Robert Richter,
	Peter Zijlstra, Paul Mackerras, Andi Kleen, Maynard Johnson,
	Carl Love, Corey J Ashford, Philip Mucci, Dan Terpstra,
	perfmon2-devel

On Mon, Jun 22, 2009 at 02:00:52PM +0200, Ingo Molnar wrote:
> Having said that, PEBS is a hardware sampling feature that is
> definitely saner than AMD's IBS. There's two immediate incremental
> uses of it in perfcounters:
> 
>  - it makes flat sampling lower overhead by avoiding an NMI for all
>    sample points.
> 
>  - it makes flat sampled data more precise. (I.e. it can avoid the
>    1-2 instructions 'skidding' of a sample position, for a handful

There are realistic examples where the non pebs "shadow" can be far more
than that, even giving systematic error and hiding complete basic blocks.

>    of PEBS-capable events.)

There are a more reasons, especially on Nehalem there are some useful things
you can only measure with PEBS. e.g. memory latency or address
histograms (although the later is quite complicated). Also it 
has a lot more PEBS capable events than older parts.

Long term the trend is likely that more and more advanced PMU features
will require PEBS.

> Regarding demultiplexing on Nehalem: PEBS goes into the DS (Data
> Store), and indeed on Nehalem all PEBS counters 'mix' their PEBS
> records in the same stream of data. One possible model to support
> them is to set the PEBS threshold to one, and hence generate an
> interrupt for each PEBS record. At offset 0x90 of the PEBS record we

Then you would need the NMIs again, the NMI avoidance in PEBS only
works with higher thresholds.

> As to enabling PEBS with the (CPU-)global latency recording filters,
> we can do this transparantly for every PEBS supported event, or can
> mandate PEBS scheduling when a PEBS only feature like load latency
> is requested.
> 
> This means that for most purposes PEBS will be transparant.

One disadvantage here is that you're giving away a lot of measuring 
overhead: interrupts are always much more costly than a PEBS event directly
written by the CPU.

But when you support batching multiple PEBS events I suspect the consumer
needs to be aware of the limitations,  e.g. no precise time stamps.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: I.5 - Mmaped count
  2009-06-22 11:52 ` I.5 - Mmaped count Ingo Molnar
@ 2009-06-22 12:25   ` stephane eranian
  2009-06-22 12:35     ` Peter Zijlstra
  2009-06-23  0:33     ` Paul Mackerras
  0 siblings, 2 replies; 68+ messages in thread
From: stephane eranian @ 2009-06-22 12:25 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, Andrew Morton, Thomas Gleixner, Robert Richter,
	Peter Zijlstra, Paul Mackerras, Andi Kleen, Maynard Johnson,
	Carl Love, Corey J Ashford, Philip Mucci, Dan Terpstra,
	perfmon2-devel

On Mon, Jun 22, 2009 at 1:52 PM, Ingo Molnar<mingo@elte.hu> wrote:
>> 5/ Mmaped count
>>
>> It is possible to read counts directly from user space for
>> self-monitoring threads. This leverages a HW capability present on
>> some processors. On X86, this is possible via RDPMC.
>>
>> The full 64-bit count is constructed by combining the hardware
>> value extracted with an assembly instruction and a base value made
>> available thru the mmap. There is an atomic generation count
>> available to deal with the race condition.
>>
>> I believe there is a problem with this approach given that the PMU
>> is shared and that events can be multiplexed. That means that even
>> though you are self-monitoring, events get replaced on the PMU.
>> The assembly instruction is unaware of that, it reads a register
>> not an event.
>>
>> On x86, assume event A is hosted in counter 0, thus you need
>> RDPMC(0) to extract the count. But then, the event is replaced by
>> another one which reuses counter 0. At the user level, you will
>> still use RDPMC(0) but it will read the HW value from a different
>> event and combine it with a base count from another one.
>>
>> To avoid this, you need to pin the event so it stays in the PMU at
>> all times. Now, here is something unclear to me. Pinning does not
>> mean stay in the SAME register, it means the event stays on the
>> PMU but it can possibly change register. To prevent that, I
>> believe you need to also set exclusive so that no other group can
>> be scheduled, and thus possibly use the same counter.
>>
>> Looks like this is the only way you can make this actually work.
>> Not setting pinned+exclusive, is another pitfall in which many
>> people will fall into.
>
>   do {
>     seq = pc->lock;
>
>     barrier()
>     if (pc->index) {
>       count = pmc_read(pc->index - 1);
>       count += pc->offset;
>     } else
>       goto regular_read;
>
>     barrier();
>   } while (pc->lock != seq);
>
> We don't see the hole you are referring to. The sequence lock
> ensures you get a consistent view.
>
Let's take an example, with two groups, one event in each group.
Both events scheduled on counter0, i.e,, rdpmc(0). The 2 groups
are multiplexed, one each tick. The user gets 2 file descriptors
and thus two mmap'ed pages.

Suppose the user wants to read, using the above loop, the value of the
event in the first group BUT it's the 2nd group  that is currently active
and loaded on counter0, i.e., rdpmc(0) returns the value of the 2nd event.

Unless you tell me that pc->index is marked invalid (0) when the
event is not scheduled. I don't see how you can avoid reading
the wrong value. I am assuming that is the event is not scheduled
lock remains constant.

Assuming the event is active when you enter the loop and you
read a value. How to get the timing information to scale the
count?

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: I.5 - Mmaped count
  2009-06-22 12:25   ` stephane eranian
@ 2009-06-22 12:35     ` Peter Zijlstra
  2009-06-22 12:54       ` stephane eranian
  2009-06-23  0:39       ` Paul Mackerras
  2009-06-23  0:33     ` Paul Mackerras
  1 sibling, 2 replies; 68+ messages in thread
From: Peter Zijlstra @ 2009-06-22 12:35 UTC (permalink / raw)
  To: eranian
  Cc: Ingo Molnar, LKML, Andrew Morton, Thomas Gleixner,
	Robert Richter, Paul Mackerras, Andi Kleen, Maynard Johnson,
	Carl Love, Corey J Ashford, Philip Mucci, Dan Terpstra,
	perfmon2-devel

On Mon, 2009-06-22 at 14:25 +0200, stephane eranian wrote:
> On Mon, Jun 22, 2009 at 1:52 PM, Ingo Molnar<mingo@elte.hu> wrote:
> >> 5/ Mmaped count
> >>
> >> It is possible to read counts directly from user space for
> >> self-monitoring threads. This leverages a HW capability present on
> >> some processors. On X86, this is possible via RDPMC.
> >>
> >> The full 64-bit count is constructed by combining the hardware
> >> value extracted with an assembly instruction and a base value made
> >> available thru the mmap. There is an atomic generation count
> >> available to deal with the race condition.
> >>
> >> I believe there is a problem with this approach given that the PMU
> >> is shared and that events can be multiplexed. That means that even
> >> though you are self-monitoring, events get replaced on the PMU.
> >> The assembly instruction is unaware of that, it reads a register
> >> not an event.
> >>
> >> On x86, assume event A is hosted in counter 0, thus you need
> >> RDPMC(0) to extract the count. But then, the event is replaced by
> >> another one which reuses counter 0. At the user level, you will
> >> still use RDPMC(0) but it will read the HW value from a different
> >> event and combine it with a base count from another one.
> >>
> >> To avoid this, you need to pin the event so it stays in the PMU at
> >> all times. Now, here is something unclear to me. Pinning does not
> >> mean stay in the SAME register, it means the event stays on the
> >> PMU but it can possibly change register. To prevent that, I
> >> believe you need to also set exclusive so that no other group can
> >> be scheduled, and thus possibly use the same counter.
> >>
> >> Looks like this is the only way you can make this actually work.
> >> Not setting pinned+exclusive, is another pitfall in which many
> >> people will fall into.
> >
> >   do {
> >     seq = pc->lock;
> >
> >     barrier()
> >     if (pc->index) {
> >       count = pmc_read(pc->index - 1);
> >       count += pc->offset;
> >     } else
> >       goto regular_read;
> >
> >     barrier();
> >   } while (pc->lock != seq);
> >
> > We don't see the hole you are referring to. The sequence lock
> > ensures you get a consistent view.
> >
> Let's take an example, with two groups, one event in each group.
> Both events scheduled on counter0, i.e,, rdpmc(0). The 2 groups
> are multiplexed, one each tick. The user gets 2 file descriptors
> and thus two mmap'ed pages.
> 
> Suppose the user wants to read, using the above loop, the value of the
> event in the first group BUT it's the 2nd group  that is currently active
> and loaded on counter0, i.e., rdpmc(0) returns the value of the 2nd event.
> 
> Unless you tell me that pc->index is marked invalid (0) when the
> event is not scheduled. I don't see how you can avoid reading
> the wrong value. I am assuming that is the event is not scheduled
> lock remains constant.

Indeed, pc->index == 0 means its not currently available.

> Assuming the event is active when you enter the loop and you
> read a value. How to get the timing information to scale the
> count?

I think we would have to add that do the data page,.. something like the
below?

Paulus?

---
Index: linux-2.6/include/linux/perf_counter.h
===================================================================
--- linux-2.6.orig/include/linux/perf_counter.h
+++ linux-2.6/include/linux/perf_counter.h
@@ -232,6 +232,10 @@ struct perf_counter_mmap_page {
 	__u32	lock;			/* seqlock for synchronization */
 	__u32	index;			/* hardware counter identifier */
 	__s64	offset;			/* add to hardware counter value */
+	__u64	total_time;		/* total time counter active */
+	__u64	running_time;		/* time counter on cpu */
+
+	__u64	__reserved[123];	/* align at 1k */
 
 	/*
 	 * Control data for the mmap() data buffer.
Index: linux-2.6/kernel/perf_counter.c
===================================================================
--- linux-2.6.orig/kernel/perf_counter.c
+++ linux-2.6/kernel/perf_counter.c
@@ -1782,6 +1782,12 @@ void perf_counter_update_userpage(struct
 	if (counter->state == PERF_COUNTER_STATE_ACTIVE)
 		userpg->offset -= atomic64_read(&counter->hw.prev_count);
 
+	userpg->total_time = counter->total_time_enabled +
+			atomic64_read(&counter->child_total_time_enabled);
+
+	userpg->running_time = counter->total_time_running +
+			atomic64_read(&counter->child_total_time_running);
+
 	barrier();
 	++userpg->lock;
 	preempt_enable();



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: I.5 - Mmaped count
  2009-06-22 12:35     ` Peter Zijlstra
@ 2009-06-22 12:54       ` stephane eranian
  2009-06-22 14:39         ` Peter Zijlstra
  2009-06-23  0:41         ` Paul Mackerras
  2009-06-23  0:39       ` Paul Mackerras
  1 sibling, 2 replies; 68+ messages in thread
From: stephane eranian @ 2009-06-22 12:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, LKML, Andrew Morton, Thomas Gleixner,
	Robert Richter, Paul Mackerras, Andi Kleen, Maynard Johnson,
	Carl Love, Corey J Ashford, Philip Mucci, Dan Terpstra,
	perfmon2-devel

On Mon, Jun 22, 2009 at 2:35 PM, Peter Zijlstra<a.p.zijlstra@chello.nl> wrote:
> On Mon, 2009-06-22 at 14:25 +0200, stephane eranian wrote:
>> On Mon, Jun 22, 2009 at 1:52 PM, Ingo Molnar<mingo@elte.hu> wrote:
>> >> 5/ Mmaped count
>> >>
>> >> It is possible to read counts directly from user space for
>> >> self-monitoring threads. This leverages a HW capability present on
>> >> some processors. On X86, this is possible via RDPMC.
>> >>
>> >> The full 64-bit count is constructed by combining the hardware
>> >> value extracted with an assembly instruction and a base value made
>> >> available thru the mmap. There is an atomic generation count
>> >> available to deal with the race condition.
>> >>
>> >> I believe there is a problem with this approach given that the PMU
>> >> is shared and that events can be multiplexed. That means that even
>> >> though you are self-monitoring, events get replaced on the PMU.
>> >> The assembly instruction is unaware of that, it reads a register
>> >> not an event.
>> >>
>> >> On x86, assume event A is hosted in counter 0, thus you need
>> >> RDPMC(0) to extract the count. But then, the event is replaced by
>> >> another one which reuses counter 0. At the user level, you will
>> >> still use RDPMC(0) but it will read the HW value from a different
>> >> event and combine it with a base count from another one.
>> >>
>> >> To avoid this, you need to pin the event so it stays in the PMU at
>> >> all times. Now, here is something unclear to me. Pinning does not
>> >> mean stay in the SAME register, it means the event stays on the
>> >> PMU but it can possibly change register. To prevent that, I
>> >> believe you need to also set exclusive so that no other group can
>> >> be scheduled, and thus possibly use the same counter.
>> >>
>> >> Looks like this is the only way you can make this actually work.
>> >> Not setting pinned+exclusive, is another pitfall in which many
>> >> people will fall into.
>> >
>> >   do {
>> >     seq = pc->lock;
>> >
>> >     barrier()
>> >     if (pc->index) {
>> >       count = pmc_read(pc->index - 1);
>> >       count += pc->offset;
>> >     } else
>> >       goto regular_read;
>> >
>> >     barrier();
>> >   } while (pc->lock != seq);
>> >
>> > We don't see the hole you are referring to. The sequence lock
>> > ensures you get a consistent view.
>> >
>> Let's take an example, with two groups, one event in each group.
>> Both events scheduled on counter0, i.e,, rdpmc(0). The 2 groups
>> are multiplexed, one each tick. The user gets 2 file descriptors
>> and thus two mmap'ed pages.
>>
>> Suppose the user wants to read, using the above loop, the value of the
>> event in the first group BUT it's the 2nd group  that is currently active
>> and loaded on counter0, i.e., rdpmc(0) returns the value of the 2nd event.
>>
>> Unless you tell me that pc->index is marked invalid (0) when the
>> event is not scheduled. I don't see how you can avoid reading
>> the wrong value. I am assuming that is the event is not scheduled
>> lock remains constant.
>
> Indeed, pc->index == 0 means its not currently available.

I don't see where you clear that field on x86.
Looks like it comes from hwc->idx. I suspect you need
to do something in x86_pmu_disable() to be symmetrical
with x86_pmu_enable().

I suspect something similar needs to be done on Power.


>
>> Assuming the event is active when you enter the loop and you
>> read a value. How to get the timing information to scale the
>> count?
>
> I think we would have to add that do the data page,.. something like the
> below?
>
Yes.



> ---
> Index: linux-2.6/include/linux/perf_counter.h
> ===================================================================
> --- linux-2.6.orig/include/linux/perf_counter.h
> +++ linux-2.6/include/linux/perf_counter.h
> @@ -232,6 +232,10 @@ struct perf_counter_mmap_page {
>        __u32   lock;                   /* seqlock for synchronization */
>        __u32   index;                  /* hardware counter identifier */
>        __s64   offset;                 /* add to hardware counter value */
> +       __u64   total_time;             /* total time counter active */
> +       __u64   running_time;           /* time counter on cpu */
> +
> +       __u64   __reserved[123];        /* align at 1k */
>
>        /*
>         * Control data for the mmap() data buffer.
> Index: linux-2.6/kernel/perf_counter.c
> ===================================================================
> --- linux-2.6.orig/kernel/perf_counter.c
> +++ linux-2.6/kernel/perf_counter.c
> @@ -1782,6 +1782,12 @@ void perf_counter_update_userpage(struct
>        if (counter->state == PERF_COUNTER_STATE_ACTIVE)
>                userpg->offset -= atomic64_read(&counter->hw.prev_count);
>
> +       userpg->total_time = counter->total_time_enabled +
> +                       atomic64_read(&counter->child_total_time_enabled);
> +
> +       userpg->running_time = counter->total_time_running +
> +                       atomic64_read(&counter->child_total_time_running);
> +
>        barrier();
>        ++userpg->lock;
>        preempt_enable();
>
>
>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: I.1 - System calls - ioctl
  2009-06-22 11:49 ` I.1 - System calls - ioctl Ingo Molnar
@ 2009-06-22 12:58   ` Christoph Hellwig
  2009-06-22 13:56     ` Ingo Molnar
  2009-07-13 10:53     ` Peter Zijlstra
  0 siblings, 2 replies; 68+ messages in thread
From: Christoph Hellwig @ 2009-06-22 12:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: eranian, LKML, Andrew Morton, Thomas Gleixner, Robert Richter,
	Peter Zijlstra, Paul Mackerras, Andi Kleen, Maynard Johnson,
	Carl Love, Corey J Ashford, Philip Mucci, Dan Terpstra,
	perfmon2-devel

On Mon, Jun 22, 2009 at 01:49:31PM +0200, Ingo Molnar wrote:
> > How do you justify your usage of ioctl() in this context?
> 
> We can certainly do a separate sys_perf_counter_ctrl() syscall - and
> we will do that if people think the extra syscall slot is worth it
> in this case.
> 
> The (mild) counter-argument so far was that the current ioctls are
> very simple over "IO" attributes of counters:
> 
>  - enable
>  - disable
>  - reset
>  - refresh
>  - set-period
> 
> So they could be considered 'IO controls' in the classic sense and
> act as a (mild) exception to the 'dont use ioctls' rule.
> 
> They are not some weird tacked-on syscall functionality - they
> modify the IO properties of counters: on/off, value and rate. If
> they go beyond that we'll put it all into a separate syscall and
> deprecate the ioctl (which will have a relatively short half-time
> due to the tools being hosted in the kernel repo).
> 
> This could happen right now in fact, if people think it's worth it.

Yet another multiplexer doesn't buy as anything over ioctls unless it
adds more structure.  PERF_COUNTER_IOC_ENABLE/PERF_COUNTER_IOC_DISABLE/
PERF_COUNTER_IOC_RESET are calls without any argument, so it's kinda
impossible to add more structure.  perf_counter_refresh has an integer
argument, and perf_counter_period aswell (with a slightly more
complicated calling convention due to passing a pointer to the 64bit
integer).  I don't see how moving this to syscalls would improve
things.

But talking about syscalls the sys_perf_counter_open prototype is
really ugly - it uses either the pid or cpu argument which is a pretty
clear indicator it should actually be two sys calls.

Incomplete patch without touching the actuall wire-up below to
demonstrate it:


Index: linux-2.6/kernel/perf_counter.c
===================================================================
--- linux-2.6.orig/kernel/perf_counter.c	2009-06-22 14:43:35.323966162 +0200
+++ linux-2.6/kernel/perf_counter.c	2009-06-22 14:57:30.223807475 +0200
@@ -1396,41 +1396,14 @@ __perf_counter_init_context(struct perf_
 	ctx->task = task;
 }
 
-static struct perf_counter_context *find_get_context(pid_t pid, int cpu)
+static struct perf_counter_context *find_get_pid_context(pid_t pid)
 {
 	struct perf_counter_context *parent_ctx;
 	struct perf_counter_context *ctx;
-	struct perf_cpu_context *cpuctx;
 	struct task_struct *task;
 	unsigned long flags;
 	int err;
 
-	/*
-	 * If cpu is not a wildcard then this is a percpu counter:
-	 */
-	if (cpu != -1) {
-		/* Must be root to operate on a CPU counter: */
-		if (perf_paranoid_cpu() && !capable(CAP_SYS_ADMIN))
-			return ERR_PTR(-EACCES);
-
-		if (cpu < 0 || cpu > num_possible_cpus())
-			return ERR_PTR(-EINVAL);
-
-		/*
-		 * We could be clever and allow to attach a counter to an
-		 * offline CPU and activate it when the CPU comes up, but
-		 * that's for later.
-		 */
-		if (!cpu_isset(cpu, cpu_online_map))
-			return ERR_PTR(-ENODEV);
-
-		cpuctx = &per_cpu(perf_cpu_context, cpu);
-		ctx = &cpuctx->ctx;
-		get_ctx(ctx);
-
-		return ctx;
-	}
-
 	rcu_read_lock();
 	if (!pid)
 		task = current;
@@ -3727,6 +3700,16 @@ static int perf_copy_attr(struct perf_co
 	if (attr->read_format & ~(PERF_FORMAT_MAX-1))
 		return -EINVAL;
 
+	if (!attr->exclude_kernel) {
+		if (perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN))
+			return -EACCES;
+	}
+
+	if (attr->freq) {
+		if (attr->sample_freq > sysctl_perf_counter_sample_rate)
+			return -EINVAL;
+	}
+
 out:
 	return ret;
 
@@ -3736,52 +3719,16 @@ err_size:
 	goto out;
 }
 
-/**
- * sys_perf_counter_open - open a performance counter, associate it to a task/cpu
- *
- * @attr_uptr:	event type attributes for monitoring/sampling
- * @pid:		target pid
- * @cpu:		target cpu
- * @group_fd:		group leader counter fd
- */
-SYSCALL_DEFINE5(perf_counter_open,
-		struct perf_counter_attr __user *, attr_uptr,
-		pid_t, pid, int, cpu, int, group_fd, unsigned long, flags)
+static int do_perf_counter_open(struct perf_counter_attr *attr,
+		struct perf_counter_context *ctx, int cpu, int group_fd)
 {
 	struct perf_counter *counter, *group_leader;
-	struct perf_counter_attr attr;
-	struct perf_counter_context *ctx;
 	struct file *counter_file = NULL;
 	struct file *group_file = NULL;
 	int fput_needed = 0;
 	int fput_needed2 = 0;
 	int ret;
 
-	/* for future expandability... */
-	if (flags)
-		return -EINVAL;
-
-	ret = perf_copy_attr(attr_uptr, &attr);
-	if (ret)
-		return ret;
-
-	if (!attr.exclude_kernel) {
-		if (perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN))
-			return -EACCES;
-	}
-
-	if (attr.freq) {
-		if (attr.sample_freq > sysctl_perf_counter_sample_rate)
-			return -EINVAL;
-	}
-
-	/*
-	 * Get the target context (task or percpu):
-	 */
-	ctx = find_get_context(pid, cpu);
-	if (IS_ERR(ctx))
-		return PTR_ERR(ctx);
-
 	/*
 	 * Look up the group leader (we will attach this counter to it):
 	 */
@@ -3810,11 +3757,11 @@ SYSCALL_DEFINE5(perf_counter_open,
 		/*
 		 * Only a group leader can be exclusive or pinned
 		 */
-		if (attr.exclusive || attr.pinned)
+		if (attr->exclusive || attr->pinned)
 			goto err_put_context;
 	}
 
-	counter = perf_counter_alloc(&attr, cpu, ctx, group_leader,
+	counter = perf_counter_alloc(attr, cpu, ctx, group_leader,
 				     GFP_KERNEL);
 	ret = PTR_ERR(counter);
 	if (IS_ERR(counter))
@@ -3857,6 +3804,68 @@ err_put_context:
 	goto out_fput;
 }
 
+SYSCALL_DEFINE4(perf_counter_open_pid,
+		struct perf_counter_attr __user *, attr_uptr,
+		pid_t, pid, int, group_fd, unsigned long, flags)
+{
+	struct perf_counter_attr attr;
+	struct perf_counter_context *ctx;
+	int ret;
+
+	/* for future expandability... */
+	if (flags)
+		return -EINVAL;
+
+	ret = perf_copy_attr(attr_uptr, &attr);
+	if (ret)
+		return ret;
+
+	ctx = find_get_pid_context(pid);
+	if (IS_ERR(ctx))
+		return PTR_ERR(ctx);
+
+	return do_perf_counter_open(&attr, ctx, -1, group_fd);
+}
+
+SYSCALL_DEFINE4(perf_counter_open_cpu,
+		struct perf_counter_attr __user *, attr_uptr,
+		int, cpu, int, group_fd, unsigned long, flags)
+{
+	struct perf_counter_attr attr;
+	struct perf_counter_context *ctx;
+	struct perf_cpu_context *cpuctx;
+	int ret;
+
+	/* for future expandability... */
+	if (flags)
+		return -EINVAL;
+
+	ret = perf_copy_attr(attr_uptr, &attr);
+	if (ret)
+		return ret;
+
+	/* Must be root to operate on a CPU counter: */
+	if (perf_paranoid_cpu() && !capable(CAP_SYS_ADMIN))
+		return -EACCES;
+
+	if (cpu < 0 || cpu > num_possible_cpus())
+		return -EINVAL;
+
+	/*
+	 * We could be clever and allow to attach a counter to an
+	 * offline CPU and activate it when the CPU comes up, but
+	 * that's for later.
+	 */
+	if (!cpu_isset(cpu, cpu_online_map))
+		return -ENODEV;
+
+	cpuctx = &per_cpu(perf_cpu_context, cpu);
+	ctx = &cpuctx->ctx;
+	get_ctx(ctx);
+
+	return do_perf_counter_open(&attr, ctx, cpu, group_fd);
+}
+
 /*
  * inherit a counter from parent task to child task:
  */
@@ -4027,7 +4036,7 @@ void perf_counter_exit_task(struct task_
 	__perf_counter_task_sched_out(child_ctx);
 
 	/*
-	 * Take the context lock here so that if find_get_context is
+	 * Take the context lock here so that if find_get_pid_context is
 	 * reading child->perf_counter_ctxp, we wait until it has
 	 * incremented the context's refcount before we do put_ctx below.
 	 */

> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
---end quoted text---

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: I.1 - System calls - ioctl
  2009-06-22 12:58   ` Christoph Hellwig
@ 2009-06-22 13:56     ` Ingo Molnar
  2009-06-22 17:41       ` Arnd Bergmann
  2009-07-13 10:53     ` Peter Zijlstra
  1 sibling, 1 reply; 68+ messages in thread
From: Ingo Molnar @ 2009-06-22 13:56 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: eranian, LKML, Andrew Morton, Thomas Gleixner, Robert Richter,
	Peter Zijlstra, Paul Mackerras, Andi Kleen, Maynard Johnson,
	Carl Love, Corey J Ashford, Philip Mucci, Dan Terpstra,
	perfmon2-devel


* Christoph Hellwig <hch@infradead.org> wrote:

> On Mon, Jun 22, 2009 at 01:49:31PM +0200, Ingo Molnar wrote:
> > > How do you justify your usage of ioctl() in this context?
> > 
> > We can certainly do a separate sys_perf_counter_ctrl() syscall - 
> > and we will do that if people think the extra syscall slot is 
> > worth it in this case.
> > 
> > The (mild) counter-argument so far was that the current ioctls 
> > are very simple over "IO" attributes of counters:
> > 
> >  - enable
> >  - disable
> >  - reset
> >  - refresh
> >  - set-period
> > 
> > So they could be considered 'IO controls' in the classic sense 
> > and act as a (mild) exception to the 'dont use ioctls' rule.
> > 
> > They are not some weird tacked-on syscall functionality - they 
> > modify the IO properties of counters: on/off, value and rate. If 
> > they go beyond that we'll put it all into a separate syscall and 
> > deprecate the ioctl (which will have a relatively short 
> > half-time due to the tools being hosted in the kernel repo).
> > 
> > This could happen right now in fact, if people think it's worth it.
> 
> Yet another multiplexer doesn't buy as anything over ioctls unless 
> it adds more structure.  
> PERF_COUNTER_IOC_ENABLE/PERF_COUNTER_IOC_DISABLE/ 
> PERF_COUNTER_IOC_RESET are calls without any argument, so it's 
> kinda impossible to add more structure.  perf_counter_refresh has 
> an integer argument, and perf_counter_period aswell (with a 
> slightly more complicated calling convention due to passing a 
> pointer to the 64bit integer).  I don't see how moving this to 
> syscalls would improve things.

Yes - this is what kept us from moving it until now. But we are 
ready and willing to add a sys_perf_counter_chattr() syscall to 
change attributes. We are in the 'avoid ioctls' camp, but we are
not dogmatic about that. As you seem to agree this seems to be one 
of the narrow special cases where ioctls still make sense.

There _is_ another, more theoretical argument in favor of 
sys_perf_counter_chattr(): it is quite conceivable that as usage of 
perfcounters expands we want to change more and more attributes. So 
even though right now the ioctl just about manages to serve this 
role, it would be more future-proof to use sys_perf_counter_chattr() 
and deprecate the ioctl() straight away - to not even leave a 
_chance_ for some ioctl crap to seep into the API.

So ... we are on two minds about this, and if people dont mind a 
second syscall entry, we are glad to add it.

> But talking about syscalls the sys_perf_counter_open prototype is 
> really ugly - it uses either the pid or cpu argument which is a 
> pretty clear indicator it should actually be two sys calls.
> 
> Incomplete patch without touching the actuall wire-up below to 
> demonstrate it:

Dunno, not sure i agree here. 'CPU ID' is a pretty natural expansion 
of the usage of 'scope of counter'. The scope can be:

  - task-self
  - specific PID
  - specific PID with inheritance
  - specific CPU

It is not really 'multiplexing' completely unrelated things: a CPU 
is 'all PIDs running on a specific CPU' specifier. It is providing 
an expanded definition of 'target context to be monitored'. Just 
like signals have PID, group-PID and controlling-terminal type of 
extensions. We dont really have syscalls for each of those either.

Also note that the syscall does not have different meanings 
depending on whether it's a CPU counter or a specific-task counter 
or a task-and-all-child-tasks counter. So it is not the ugly kind of 
multiplexing that makes most ioctls such a jumbled mess.

If we were to unroll and expand _all_ such things in our current 
300+ syscalls in the kernel we'd have thousands of syscalls. Do we 
want that? Dunno. No strong feelings - but i dont think our current 
syscalls are unclean, and i dont think we should insist on a model 
that would have resulted in so many syscalls, were this enforced 
from the get go.

Furthermore, having a 'target ID' concept does make it harder to 
expand the range of targets that we might want to monitor. Do we 
want to expand the PID one with a PID-group notion perhaps? Or do we 
want to add IDs for specific VMs perhaps? It does not change the 
semantics, it only changes the (pretty orthogonal and isolated) 
'target context' field's meaning.

Hope this helps,

	Ingo

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [perfmon2] IV.3 - AMD IBS
  2009-06-22 12:00 ` IV.3 - AMD IBS Ingo Molnar
@ 2009-06-22 14:08   ` Rob Fowler
  2009-06-22 17:58     ` Maynard Johnson
  2009-06-23  6:19     ` Peter Zijlstra
  2009-06-22 19:17   ` stephane eranian
  1 sibling, 2 replies; 68+ messages in thread
From: Rob Fowler @ 2009-06-22 14:08 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: eranian, Peter Zijlstra, Philip Mucci, LKML, Andi Kleen,
	Paul Mackerras, Maynard Johnson, Andrew Morton, Thomas Gleixner,
	perfmon2-devel



Ingo Molnar wrote:
>> 3/ AMD IBS
>>
>> How is AMD IBS going to be implemented?
>>
>> IBS has two separate sets of registers. One to capture fetch
>> related data and another one to capture instruction execution
>> data. For each, there is one config register but multiple data
>> registers. In each mode, there is a specific sampling period and
>> IBS can interrupt.
>>
>> It looks like you could define two pseudo events or event types
>> and then define a new record_format and read_format. That formats
>> would only be valid for an IBS event.
>>
>> Is that how you intend to support IBS?
> 
> That is indeed one of the ways we thought of, not really nice, but
> then, IBS is really weird, what were those AMD engineers thinking
> :-)

Actually, IBS has roots in DEC's "ProfileMe" for Alpha EV67 and later
processors.   Those of us who used it there found it to be an extremely
powerful, low-overhead mechanism for directly collecting information about
how well the micro-architecture is performing.  In particular, it can tell
you, not only which instructions take a long time to traverse the pipe, but
it also tells you which instructions delay other instructions and by how much.
This is extremely valuable if you are either working on instruction scheduling
in a compiler, or are modifying a program to give the compiler the opportunity
to do a good job.

A core group of engineers who worked on Alpha went on to AMD.

An unfortunate problem with IBS on AMD is that good support isn't common in the "mainstream"
open source community.

Code Analyst from  AMD, also involving ex-DEC engineers, is
the only place where it is supported at all decently throughout the tool chain.
Last  time I looked, there was a tweaked version of oprofile underlying CA.
I haven't checked to see whether the tweaks have migrated back into the oprofile
trunk.


> 
> The point is - weird hardware gets expressed as a ... weird feature
> under perfcounters too. Not all hardware weirdnesses can be
> engineered away.
> 
> ------------------------------------------------------------------------------
> Are you an open source citizen? Join us for the Open Source Bridge conference!
> Portland, OR, June 17-19. Two days of sessions, one day of unconference: $250.
> Need another reason to go? 24-hour hacker lounge. Register today!
> http://ad.doubleclick.net/clk;215844324;13503038;v?http://opensourcebridge.org
> _______________________________________________
> perfmon2-devel mailing list
> perfmon2-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/perfmon2-devel

-- 
Robert J. Fowler
Chief Domain Scientist, HPC
Renaissance Computing Institute
The University of North Carolina at Chapel Hill
100 Europa Dr, Suite 540
Chapel Hill, NC 27517
V: 919.445.9670
F: 919 445.9669
rjf@renci.org

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: II.1 - Fixed counters on Intel
  2009-06-22 11:57 ` II.1 - Fixed counters on Intel Ingo Molnar
@ 2009-06-22 14:27   ` stephane eranian
  0 siblings, 0 replies; 68+ messages in thread
From: stephane eranian @ 2009-06-22 14:27 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, Andrew Morton, Thomas Gleixner, Robert Richter,
	Peter Zijlstra, Paul Mackerras, Andi Kleen, Maynard Johnson,
	Carl Love, Corey J Ashford, Philip Mucci, Dan Terpstra,
	perfmon2-devel

On Mon, Jun 22, 2009 at 1:57 PM, Ingo Molnar<mingo@elte.hu> wrote:
>> II/ X86 comments
>>
>>  1/ Fixed counters on Intel
>>
>> You cannot simply fall back to generic counters if you cannot find
>> a fixed counter. There are model-specific bugs, for instance
>> UNHALTED_REFERENCE_CYCLES (0x013c), does not measure the same
>> thing on Nehalem when it is used in fixed counter 2 or a generic
>> counter. The same is true on Core.
>
> This could be handled via a model specific quirk, if the erratum is
> serious enough.

Better demonstrated with an actual example on a 2.4GHz Quad: 10s noploop

$ pfmon -v --us-c
-eunhalted_core_cycles,unhalted_reference_cycles,cpu_clk_unhalted:bus
-u noploop 10
[FIXED_CTRL(pmc16)=0xaa0 pmi0=1 en0=0x0 pmi1=1 en1=0x2 pmi2=1 en2=0x2]
UNHALTED_CORE_CYCLES UNHALTED_REFERENCE_CYCLES
[PERFEVTSEL0(pmc0)=0x51013c event_sel=0x3c umask=0x1 os=0 usr=1 en=1
int=1 inv=0 edge=0 cnt_mask=0] CPU_CLK_UNHALTED
noploop for 10 seconds
23,833,577,042 UNHALTED_CORE_CYCLES
23,853,100,716 UNHALTED_REFERENCE_CYCLES
 2,650,345,853 CPU_CLK_UNHALTED:BUS

0x013c on fixed counter2 = unhalted_reference_cycles
0x013c on generic counter = cpu_clk_unhalted:bus

Difference is significant, isn't it?

And the question becomes: how do I express that I want the fixed counter version
of the event using the current PCL API, given both have the same encoding?


>
>> You cannot simply look at the event field code to determine
>> whether this is an event supported by a fixed counters. You must
>> look at the other fields such as edge, invert, cnt-mask. If those
>> are present then you have to fall back to using a generic counter
>> as fixed counters only support priv level filtering. As indicated
>> above, though, programming UNHALTED_REFERENCE_CYCLES on a generic
>> counter does not count the same thing, therefore you need to fail
>> if filters other than priv levels are present on this event.
>
> Agreed, we'll fix this.
>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: I.5 - Mmaped count
  2009-06-22 12:54       ` stephane eranian
@ 2009-06-22 14:39         ` Peter Zijlstra
  2009-06-23  0:41         ` Paul Mackerras
  1 sibling, 0 replies; 68+ messages in thread
From: Peter Zijlstra @ 2009-06-22 14:39 UTC (permalink / raw)
  To: eranian
  Cc: Ingo Molnar, LKML, Andrew Morton, Thomas Gleixner,
	Robert Richter, Paul Mackerras, Andi Kleen, Maynard Johnson,
	Carl Love, Corey J Ashford, Philip Mucci, Dan Terpstra,
	perfmon2-devel

On Mon, 2009-06-22 at 14:54 +0200, stephane eranian wrote:
> On Mon, Jun 22, 2009 at 2:35 PM, Peter Zijlstra<a.p.zijlstra@chello.nl> wrote:
> > On Mon, 2009-06-22 at 14:25 +0200, stephane eranian wrote:
> >> On Mon, Jun 22, 2009 at 1:52 PM, Ingo Molnar<mingo@elte.hu> wrote:
> >> >> 5/ Mmaped count
> >> >>
> >> >> It is possible to read counts directly from user space for
> >> >> self-monitoring threads. This leverages a HW capability present on
> >> >> some processors. On X86, this is possible via RDPMC.
> >> >>
> >> >> The full 64-bit count is constructed by combining the hardware
> >> >> value extracted with an assembly instruction and a base value made
> >> >> available thru the mmap. There is an atomic generation count
> >> >> available to deal with the race condition.
> >> >>
> >> >> I believe there is a problem with this approach given that the PMU
> >> >> is shared and that events can be multiplexed. That means that even
> >> >> though you are self-monitoring, events get replaced on the PMU.
> >> >> The assembly instruction is unaware of that, it reads a register
> >> >> not an event.
> >> >>
> >> >> On x86, assume event A is hosted in counter 0, thus you need
> >> >> RDPMC(0) to extract the count. But then, the event is replaced by
> >> >> another one which reuses counter 0. At the user level, you will
> >> >> still use RDPMC(0) but it will read the HW value from a different
> >> >> event and combine it with a base count from another one.
> >> >>
> >> >> To avoid this, you need to pin the event so it stays in the PMU at
> >> >> all times. Now, here is something unclear to me. Pinning does not
> >> >> mean stay in the SAME register, it means the event stays on the
> >> >> PMU but it can possibly change register. To prevent that, I
> >> >> believe you need to also set exclusive so that no other group can
> >> >> be scheduled, and thus possibly use the same counter.
> >> >>
> >> >> Looks like this is the only way you can make this actually work.
> >> >> Not setting pinned+exclusive, is another pitfall in which many
> >> >> people will fall into.
> >> >
> >> >   do {
> >> >     seq = pc->lock;
> >> >
> >> >     barrier()
> >> >     if (pc->index) {
> >> >       count = pmc_read(pc->index - 1);
> >> >       count += pc->offset;
> >> >     } else
> >> >       goto regular_read;
> >> >
> >> >     barrier();
> >> >   } while (pc->lock != seq);
> >> >
> >> > We don't see the hole you are referring to. The sequence lock
> >> > ensures you get a consistent view.
> >> >
> >> Let's take an example, with two groups, one event in each group.
> >> Both events scheduled on counter0, i.e,, rdpmc(0). The 2 groups
> >> are multiplexed, one each tick. The user gets 2 file descriptors
> >> and thus two mmap'ed pages.
> >>
> >> Suppose the user wants to read, using the above loop, the value of the
> >> event in the first group BUT it's the 2nd group  that is currently active
> >> and loaded on counter0, i.e., rdpmc(0) returns the value of the 2nd event.
> >>
> >> Unless you tell me that pc->index is marked invalid (0) when the
> >> event is not scheduled. I don't see how you can avoid reading
> >> the wrong value. I am assuming that is the event is not scheduled
> >> lock remains constant.
> >
> > Indeed, pc->index == 0 means its not currently available.
> 
> I don't see where you clear that field on x86.

x86 doesn't have this feature fully implemented yet.. its still on the
todo list. Paulus started this on power, so it should work there.

> Looks like it comes from hwc->idx. I suspect you need
> to do something in x86_pmu_disable() to be symmetrical
> with x86_pmu_enable().

Right.

> I suspect something similar needs to be done on Power.

It looks like the power disable method does indeed do this:

	if (counter->hw.idx) {
		write_pmc(counter->hw.idx, 0);
		counter->hw.idx = 0;
	}
	perf_counter_update_userpage(counter);


The below might suffice for x86.. but its not real nice.
Power already has that whole +1 thing in its ->idx field, x86 does not.
So I either munge x86 or add something like I did below..

Paul, any suggestions?

---
Index: linux-2.6/arch/x86/kernel/cpu/perf_counter.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/perf_counter.c
+++ linux-2.6/arch/x86/kernel/cpu/perf_counter.c
@@ -912,6 +912,8 @@ x86_perf_counter_set_period(struct perf_
 	err = checking_wrmsrl(hwc->counter_base + idx,
 			     (u64)(-left) & x86_pmu.counter_mask);
 
+	perf_counter_update_userpage(counter);
+
 	return ret;
 }
 
@@ -1041,6 +1043,8 @@ try_generic:
 	x86_perf_counter_set_period(counter, hwc, idx);
 	x86_pmu.enable(hwc, idx);
 
+	perf_counter_update_userpage(counter);
+
 	return 0;
 }
 
@@ -1133,6 +1137,8 @@ static void x86_pmu_disable(struct perf_
 	x86_perf_counter_update(counter, hwc, idx);
 	cpuc->counters[idx] = NULL;
 	clear_bit(idx, cpuc->used_mask);
+
+	perf_counter_update_userpage(counter);
 }
 
 /*
Index: linux-2.6/kernel/perf_counter.c
===================================================================
--- linux-2.6.orig/kernel/perf_counter.c
+++ linux-2.6/kernel/perf_counter.c
@@ -1753,6 +1753,14 @@ int perf_counter_task_disable(void)
 	return 0;
 }
 
+static int perf_counter_index(struct perf_counter *counter)
+{
+	if (counter->state != PERF_COUNTER_STATE_ACTIVE)
+		return 0;
+
+	return counter->hw->idx + 1 - PERF_COUNTER_INDEX_OFFSET;
+}
+
 /*
  * Callers need to ensure there can be no nesting of this function, otherwise
  * the seqlock logic goes bad. We can not serialize this because the arch
@@ -1777,7 +1785,7 @@ void perf_counter_update_userpage(struct
 	preempt_disable();
 	++userpg->lock;
 	barrier();
-	userpg->index = counter->hw.idx;
+	userpg->index = perf_counter_index(counter);
 	userpg->offset = atomic64_read(&counter->count);
 	if (counter->state == PERF_COUNTER_STATE_ACTIVE)
 		userpg->offset -= atomic64_read(&counter->hw.prev_count);
Index: linux-2.6/arch/powerpc/include/asm/perf_counter.h
===================================================================
--- linux-2.6.orig/arch/powerpc/include/asm/perf_counter.h
+++ linux-2.6/arch/powerpc/include/asm/perf_counter.h
@@ -61,6 +61,8 @@ struct pt_regs;
 extern unsigned long perf_misc_flags(struct pt_regs *regs);
 extern unsigned long perf_instruction_pointer(struct pt_regs *regs);
 
+#define PERF_COUNTER_INDEX_OFFSET	1
+
 /*
  * Only override the default definitions in include/linux/perf_counter.h
  * if we have hardware PMU support.
Index: linux-2.6/arch/x86/include/asm/perf_counter.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/perf_counter.h
+++ linux-2.6/arch/x86/include/asm/perf_counter.h
@@ -87,6 +87,9 @@ union cpuid10_edx {
 #ifdef CONFIG_PERF_COUNTERS
 extern void init_hw_perf_counters(void);
 extern void perf_counters_lapic_init(void);
+
+#define PERF_COUNTER_INDEX_OFFSET			0
+
 #else
 static inline void init_hw_perf_counters(void)		{ }
 static inline void perf_counters_lapic_init(void)	{ }



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: I.11 - Missing definitions for generic events
  2009-06-22 11:56 ` I.11 - Missing definitions for generic events Ingo Molnar
@ 2009-06-22 14:54   ` stephane eranian
  0 siblings, 0 replies; 68+ messages in thread
From: stephane eranian @ 2009-06-22 14:54 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, Andrew Morton, Thomas Gleixner, Robert Richter,
	Peter Zijlstra, Paul Mackerras, Andi Kleen, Maynard Johnson,
	Carl Love, Corey J Ashford, Philip Mucci, Dan Terpstra,
	perfmon2-devel

On Mon, Jun 22, 2009 at 1:56 PM, Ingo Molnar<mingo@elte.hu> wrote:
>> 11/ Missing definitions for generic hardware events
>>
>> As soon as you define generic events, you need to provide a clear
>> and precise definition at to what they measure. This is crucial to
>> make them useful. I have not seen such a definition yet.
>
> Do you mean their names aren't clear enough? :-)
>
No I'd like to see a defintion behind every name:

PERF_COUNT_HW_CPU_CYCLES: impacted by freq scaling or not?
PERF_COUNT_HW_INSTRUCTIONS
PERF_COUNT_HW_CACHE_REFERENCES: what cache level, data, code?
PERF_COUNT_HW_CACHE_MISSES: what cache level, data, code?
PERF_COUNT_HW_BRANCH_INSTRUCTIONS: taken, not-taken?
PERF_COUNT_HW_BRANCH_MISSES
PERF_COUNT_HW_BUS_CYCLES: what bus?

Take BUS_CYCLES, and based on my example on Core with UNHALTED_REFERENCE_CYCLE,
without a clear definition, it seems hard to understand if you need to
map it to 0x13c on a generic
counter or on fixed counter 2.

Having clearly spelled out definitions help port PCL to other
processors, it also helps user
understand which event they need to select. Users should not have to
dig through the
code to find the actual mapping for each PMU to understand what the
events actually
measure.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: I.1 - System calls - ioctl
  2009-06-22 13:56     ` Ingo Molnar
@ 2009-06-22 17:41       ` Arnd Bergmann
  0 siblings, 0 replies; 68+ messages in thread
From: Arnd Bergmann @ 2009-06-22 17:41 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Christoph Hellwig, eranian, LKML, Andrew Morton, Thomas Gleixner,
	Robert Richter, Peter Zijlstra, Paul Mackerras, Andi Kleen,
	Maynard Johnson, Carl Love, Corey J Ashford, Philip Mucci,
	Dan Terpstra, perfmon2-devel

On Monday 22 June 2009, Ingo Molnar wrote:
> There is another, more theoretical argument in favor of 
> sys_perf_counter_chattr(): it is quite conceivable that as usage of 
> perfcounters expands we want to change more and more attributes. So 
> even though right now the ioctl just about manages to serve this 
> role, it would be more future-proof to use sys_perf_counter_chattr() 
> and deprecate the ioctl() straight away - to not even leave a 
> chance for some ioctl crap to seep into the API.
> 
> So ... we are on two minds about this, and if people dont mind a 
> second syscall entry, we are glad to add it.

I think adding one or more system calls is definitely worth it
if that means getting rid of the ioctl interface here. While I
don't generally mind adding ioctl calls, I would much prefer to
restrict their use to device files, sockets and to the existing
cases for regular files.

Conceptually, ioctl is a different class of interface from the
'new system call' case, in a number of ways. For new subsystems
I would just never mix them by allowing ioctl on something that
was not returned by open() or socket().

	Arnd <><

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [perfmon2] IV.3 - AMD IBS
  2009-06-22 14:08   ` [perfmon2] " Rob Fowler
@ 2009-06-22 17:58     ` Maynard Johnson
  2009-06-23  6:19     ` Peter Zijlstra
  1 sibling, 0 replies; 68+ messages in thread
From: Maynard Johnson @ 2009-06-22 17:58 UTC (permalink / raw)
  To: Rob Fowler
  Cc: Andrew Morton, Andi Kleen, Peter Zijlstra, eranian, LKML,
	Ingo Molnar, Philip Mucci, Paul Mackerras, perfmon2-devel,
	Thomas Gleixner, Suravee.Suthikulpanit

Rob Fowler <rjf@renci.org> wrote on 06/22/2009 09:08:34 AM:

>
>
> Ingo Molnar wrote:
> >> 3/ AMD IBS
> >>
> >> How is AMD IBS going to be implemented?
> >>
> >> IBS has two separate sets of registers. One to capture fetch
> >> related data and another one to capture instruction execution
> >> data. For each, there is one config register but multiple data
> >> registers. In each mode, there is a specific sampling period and
> >> IBS can interrupt.
> >>
> >> It looks like you could define two pseudo events or event types
> >> and then define a new record_format and read_format. That formats
> >> would only be valid for an IBS event.
> >>
> >> Is that how you intend to support IBS?
> >
> > That is indeed one of the ways we thought of, not really nice, but
> > then, IBS is really weird, what were those AMD engineers thinking
> > :-)
>
> Actually, IBS has roots in DEC's "ProfileMe" for Alpha EV67 and later
> processors.   Those of us who used it there found it to be an extremely
> powerful, low-overhead mechanism for directly collecting information
about
> how well the micro-architecture is performing.  In particular, it can
tell
> you, not only which instructions take a long time to traverse the pipe,
but
> it also tells you which instructions delay other instructions and byhow
much.
> This is extremely valuable if you are either working on instruction
scheduling
> in a compiler, or are modifying a program to give the compiler the
opportunity
> to do a good job.
>
> A core group of engineers who worked on Alpha went on to AMD.
>
> An unfortunate problem with IBS on AMD is that good support isn't
> common in the "mainstream"
> open source community.
>
> Code Analyst from  AMD, also involving ex-DEC engineers, is
> the only place where it is supported at all decently throughout the
> tool chain.
> Last  time I looked, there was a tweaked version of oprofile underlying
CA.
> I haven't checked to see whether the tweaks have migrated back into
> the oprofile trunk.
Yes, IBS on AMD is now supported upstream in mainline, contributed by
Suravee Suthikulpanit.

-Maynard
>
>
> >
> > The point is - weird hardware gets expressed as a ... weird feature
> > under perfcounters too. Not all hardware weirdnesses can be
> > engineered away.
> >
> >
>
------------------------------------------------------------------------------

> > Are you an open source citizen? Join us for the Open Source Bridge
> conference!
> > Portland, OR, June 17-19. Two days of sessions, one day of
> unconference: $250.
> > Need another reason to go? 24-hour hacker lounge. Register today!
> > http://ad.doubleclick.net/clk;215844324;13503038;v?http:
> //opensourcebridge.org
> > _______________________________________________
> > perfmon2-devel mailing list
> > perfmon2-devel@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/perfmon2-devel
>
> --
> Robert J. Fowler
> Chief Domain Scientist, HPC
> Renaissance Computing Institute
> The University of North Carolina at Chapel Hill
> 100 Europa Dr, Suite 540
> Chapel Hill, NC 27517
> V: 919.445.9670
> F: 919 445.9669
> rjf@renci.org


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: IV.3 - AMD IBS
  2009-06-22 12:00 ` IV.3 - AMD IBS Ingo Molnar
  2009-06-22 14:08   ` [perfmon2] " Rob Fowler
@ 2009-06-22 19:17   ` stephane eranian
  1 sibling, 0 replies; 68+ messages in thread
From: stephane eranian @ 2009-06-22 19:17 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, Andrew Morton, Thomas Gleixner, Robert Richter,
	Peter Zijlstra, Paul Mackerras, Andi Kleen, Maynard Johnson,
	Carl Love, Corey J Ashford, Philip Mucci, Dan Terpstra,
	perfmon2-devel

On Mon, Jun 22, 2009 at 2:00 PM, Ingo Molnar<mingo@elte.hu> wrote:
>> 3/ AMD IBS
>>
>> How is AMD IBS going to be implemented?
>>
>> IBS has two separate sets of registers. One to capture fetch
>> related data and another one to capture instruction execution
>> data. For each, there is one config register but multiple data
>> registers. In each mode, there is a specific sampling period and
>> IBS can interrupt.
>>
>> It looks like you could define two pseudo events or event types
>> and then define a new record_format and read_format. That formats
>> would only be valid for an IBS event.
>>
>> Is that how you intend to support IBS?
>
> That is indeed one of the ways we thought of, not really nice, but
> then, IBS is really weird, what were those AMD engineers thinking
> :-)
>

What's your problem with IBS? You need to elaborate when you make
this kind of comments. Remember what we discussed back in December.
If hardware designers put that in, it's because they think it can deliver
value-add. It may be difficult to use, but it delivers interesting data.


> The point is - weird hardware gets expressed as a ... weird feature
> under perfcounters too. Not all hardware weirdnesses can be
> engineered away.
>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: I.2 - Grouping
  2009-06-22 11:50 ` I.2 - Grouping Ingo Molnar
@ 2009-06-22 19:45   ` stephane eranian
  2009-06-22 22:04     ` Corey Ashford
  2009-06-22 21:38   ` Corey Ashford
  2009-06-23  5:16   ` Paul Mackerras
  2 siblings, 1 reply; 68+ messages in thread
From: stephane eranian @ 2009-06-22 19:45 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, Andrew Morton, Thomas Gleixner, Robert Richter,
	Peter Zijlstra, Paul Mackerras, Andi Kleen, Maynard Johnson,
	Carl Love, Corey J Ashford, Philip Mucci, Dan Terpstra,
	perfmon2-devel

On Mon, Jun 22, 2009 at 1:50 PM, Ingo Molnar<mingo@elte.hu> wrote:
>> 2/ Grouping
>>
>> By design, an event can only be part of one group at a time.

As I read this again, another question came up. Is the statement
above also true for the group leader?


>> Events in a group are guaranteed to be active on the PMU at the
>> same time. That means a group cannot have more events than there
>> are available counters on the PMU. Tools may want to know the
>> number of counters available in order to group their events
>> accordingly, such that reliable ratios could be computed. It seems
>> the only way to know this is by trial and error. This is not
>> practical.
>
> Groups are there to support heavily constrained PMUs, and for them
> this is the only way, as there is no simple linear expression for
> how many counters one can load on the PMU.
>
But then, does that mean that users need to be aware of constraints
to form groups. Group are formed by users. I thought, the whole point
of the API was to hide that kind of hardware complexity.

Groups are needed when sampling, e.g., PERF_SAMPLE_GROUP.
For instance, I sample the IP and the values of the other events in
my group.  Grouping looks like the only way to sampling on Itanium
branch buffers (BTB), for instance. You program an event in a generic
counter which counts the number of entries recorded in the buffer.
Thus, to sample the BTB, you sample on this event, when it overflows
you grab the content of the BTB. Here, the event and the BTB are tied
together. You cannot count the event in one group, and have the BTB
in another one (BTB = 1 config reg + 32 data registers + 1 position reg).



> The ideal model to tooling is relatively independent PMU registers
> (counters) with little constraints - most modern CPUs meet that
> model.
>
I agree that would be really good. But that's not what we have.

> All the existing tooling (tools/perf/) operates on that model and
> this leads to easy programmability and flexible results. This model
> needs no grouping of counters.
>
> Could you please cite specific examples in terms of tools/perf/?
> What feature do you think needs to know more about constraints? What
> is the specific win in precision we could achieve via that?
>
See above example.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: IV.5 - Intel Last Branch Record (LBR)
  2009-06-22 12:01 ` IV.5 - Intel Last Branch Record (LBR) Ingo Molnar
@ 2009-06-22 20:02   ` stephane eranian
  0 siblings, 0 replies; 68+ messages in thread
From: stephane eranian @ 2009-06-22 20:02 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, Andrew Morton, Thomas Gleixner, Robert Richter,
	Peter Zijlstra, Paul Mackerras, Andi Kleen, Maynard Johnson,
	Carl Love, Corey J Ashford, Philip Mucci, Dan Terpstra,
	perfmon2-devel

On Mon, Jun 22, 2009 at 2:01 PM, Ingo Molnar<mingo@elte.hu> wrote:
>> 5/ Intel Last Branch Record (LBR)
>>
>> Intel processors since Netburst have a cyclic buffer hosted in
>> registers which can record taken branches. Each taken branch is
>> stored into a pair of LBR registers (source, destination). Up
>> until Nehalem, there was not filtering capabilities for LBR. LBR
>> is not an architected PMU feature.
>>
>> There is no counter associated with LBR. Nehalem has a LBR_SELECT
>> MSR. However there are some constraints on it given it is shared
>> by threads.
>>
>> LBR is only useful when sampling and therefore must be combined
>> with a counter. LBR must also be configured to freeze on PMU
>> interrupt.
>>
>> How is LBR going to be supported?
>
> If there's interest then one sane way to support it would be to
> expose it as a new sampling format (PERF_SAMPLE_*).
>
LBR is very important, it becomes useable with Nehalem where you
can filter on priv level. It is important for statistical basic block
profiling, for instance. Another important feature is its ability to freeze
on PMU interrupt.

LBR is also interesting because it yield a path to an event.

LBR on NHM (and others) is not that easy to handle because:
   - need to read-modify-write IA32_DEBUGCTL
   - LBR_TOS, the position pointer is read-only
   - LBR_SELECT to configure LBR is shared at the core-level on NHM

but it is very much worthwhile.

> Regarding the constraints - if we choose to expose the branch-type
> filtering capabilities of Nehalem, then that puts a constraint on
> counter scheduling: two counters with conflicting constraints should
> not be scheduled at once, but should be time-shared via the usual
> mechanism.
>
You need to expose the branch filtering in some way. The return
branch filter is useful for statistical call graph sampling, for instance.
If you can't disable the other types of branches, then the relevance
of the data drops.

> The typical use-case would be to have no or compatible LBR filter
> attributes between counters though - so having the conflicts is not
> an issue as long as it works according to the usual model.
>
Conflict arises when two events request different filter value. The conflict
happens in both per-thread and per-cpu mode when HT is on. In per-cpu
mode it can be controlled at the core level. But in per-thread mode, it
has to be managed globally as a thread may migrate. Multiplexing
as it is implemented manages groups attached to one thread or CPU.
Here it would have to look across a pair of CPUs (or threads) and
ensure that no two groups using LBR with conflicting filter selections
can be scheduled at the same time on the two hyper-threads of the
same core. I think it may be easier, for now, to say LBR_SELECT
is global, first come, first serve.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: I.2 - Grouping
  2009-06-22 11:50 ` I.2 - Grouping Ingo Molnar
  2009-06-22 19:45   ` stephane eranian
@ 2009-06-22 21:38   ` Corey Ashford
  2009-06-23  5:16   ` Paul Mackerras
  2 siblings, 0 replies; 68+ messages in thread
From: Corey Ashford @ 2009-06-22 21:38 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: eranian, LKML, Andrew Morton, Thomas Gleixner, Robert Richter,
	Peter Zijlstra, Paul Mackerras, Andi Kleen, Maynard Johnson,
	Carl Love, Corey J Ashford, Philip Mucci, Dan Terpstra,
	perfmon2-devel



Ingo Molnar wrote:
>> 2/ Grouping
>>
>> By design, an event can only be part of one group at a time.
>> Events in a group are guaranteed to be active on the PMU at the
>> same time. That means a group cannot have more events than there
>> are available counters on the PMU. Tools may want to know the
>> number of counters available in order to group their events
>> accordingly, such that reliable ratios could be computed. It seems
>> the only way to know this is by trial and error. This is not
>> practical.
> 
> Groups are there to support heavily constrained PMUs, and for them
> this is the only way, as there is no simple linear expression for
> how many counters one can load on the PMU.
> 
> The ideal model to tooling is relatively independent PMU registers
> (counters) with little constraints - most modern CPUs meet that
> model.
> 
> All the existing tooling (tools/perf/) operates on that model and
> this leads to easy programmability and flexible results. This model
> needs no grouping of counters.
> 
> Could you please cite specific examples in terms of tools/perf/?
> What feature do you think needs to know more about constraints? What
> is the specific win in precision we could achieve via that?


An example of this is that a user wants to monitor 10 events, and we have four 
counters to work with.  Let's assume there is some mapping of events to counters 
where you need only 3 groups to schedule the 10 events onto the PMU.  If you 
leave it to the kernel (and don't group the events from user space), depending 
on the kernel's fast event scheduling algorithm, it may take 6 groups to get all 
of the requested events counted.  This leads to lower counts in the counters, 
and more chance for the counters to miss event bursts, which leads to less 
accurate scaled results.

Currently the PAPI substrate for PCL does do this partitioning using a very dumb 
algorithm.  But it could be improved, particularly if there was some better way 
to get feedback from the kernel other than a "yes, these fit" or "no, these 
don't fit".  I'm not sure what that way would be, though.  Perhaps an ioctl that 
does a some sort of "dry scheduling" of events to groups in an optimal way. 
This call would not need to lock any resources, and just use the kernel's 
algorithm for event constraint checking.

To me, this is not a big issue, but some sort of better mechanism might be 
considered for a future update.

-- 
Regards,

- Corey

Corey Ashford
Software Engineer
IBM Linux Technology Center, Linux Toolchain
Beaverton, OR
503-578-3507
cjashfor@us.ibm.com


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: I.2 - Grouping
  2009-06-22 19:45   ` stephane eranian
@ 2009-06-22 22:04     ` Corey Ashford
  2009-06-23 17:51       ` stephane eranian
  0 siblings, 1 reply; 68+ messages in thread
From: Corey Ashford @ 2009-06-22 22:04 UTC (permalink / raw)
  To: eranian
  Cc: Ingo Molnar, LKML, Andrew Morton, Thomas Gleixner,
	Robert Richter, Peter Zijlstra, Paul Mackerras, Andi Kleen,
	Maynard Johnson, cel, Corey J Ashford, Philip Mucci,
	Dan Terpstra, perfmon2-devel

stephane eranian wrote:
> On Mon, Jun 22, 2009 at 1:50 PM, Ingo Molnar<mingo@elte.hu> wrote:
>>> 2/ Grouping
>>>
>>> By design, an event can only be part of one group at a time.
> 
> As I read this again, another question came up. Is the statement
> above also true for the group leader?
> 
> 
>>> Events in a group are guaranteed to be active on the PMU at the
>>> same time. That means a group cannot have more events than there
>>> are available counters on the PMU. Tools may want to know the
>>> number of counters available in order to group their events
>>> accordingly, such that reliable ratios could be computed. It seems
>>> the only way to know this is by trial and error. This is not
>>> practical.
>> Groups are there to support heavily constrained PMUs, and for them
>> this is the only way, as there is no simple linear expression for
>> how many counters one can load on the PMU.
>>
> But then, does that mean that users need to be aware of constraints
> to form groups. Group are formed by users. I thought, the whole point
> of the API was to hide that kind of hardware complexity.
> 
> Groups are needed when sampling, e.g., PERF_SAMPLE_GROUP.
> For instance, I sample the IP and the values of the other events in
> my group.  Grouping looks like the only way to sampling on Itanium
> branch buffers (BTB), for instance. You program an event in a generic
> counter which counts the number of entries recorded in the buffer.
> Thus, to sample the BTB, you sample on this event, when it overflows
> you grab the content of the BTB. Here, the event and the BTB are tied
> together. You cannot count the event in one group, and have the BTB
> in another one (BTB = 1 config reg + 32 data registers + 1 position reg).
> 
> 
>

With getting these posts from several different sources, I missed Stephane's reply.

I can see the issue is deeper and more subtle than I had imagined.

Stephane, if you were to just place into groups those events which must be 
correlated, and just leave all of the others not grouped, wouldn't this solve 
the problem?  The kernel would be free to schedule the other events how and when 
it can, but would guarantee that your correlated events are on the PMU 
simultaneously.

-- 
Regards,

- Corey

Corey Ashford
Software Engineer
IBM Linux Technology Center, Linux Toolchain
cjashfor@us.ibm.com


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: I.5 - Mmaped count
  2009-06-22 12:25   ` stephane eranian
  2009-06-22 12:35     ` Peter Zijlstra
@ 2009-06-23  0:33     ` Paul Mackerras
  1 sibling, 0 replies; 68+ messages in thread
From: Paul Mackerras @ 2009-06-23  0:33 UTC (permalink / raw)
  To: eranian
  Cc: Ingo Molnar, LKML, Andrew Morton, Thomas Gleixner,
	Robert Richter, Peter Zijlstra, Andi Kleen, Maynard Johnson,
	Carl Love, Corey J Ashford, Philip Mucci, Dan Terpstra,
	perfmon2-devel

stephane eranian writes:

> Unless you tell me that pc->index is marked invalid (0) when the
> event is not scheduled. I don't see how you can avoid reading
> the wrong value. I am assuming that is the event is not scheduled
> lock remains constant.

That's what happens; when pc->index == 0, the event is not scheduled
and the current count is in pc->offset.

> Assuming the event is active when you enter the loop and you
> read a value. How to get the timing information to scale the
> count?

At present you'd have to do a read().  It wouldn't be hard to add
fields to mmapped page to enable the user program to compute
up-to-date timing information from the TSC (x86) or timebase (powerpc)
value.  We'll do that.

Paul.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: I.5 - Mmaped count
  2009-06-22 12:35     ` Peter Zijlstra
  2009-06-22 12:54       ` stephane eranian
@ 2009-06-23  0:39       ` Paul Mackerras
  2009-06-23  6:13         ` Peter Zijlstra
  2009-06-23  7:40         ` stephane eranian
  1 sibling, 2 replies; 68+ messages in thread
From: Paul Mackerras @ 2009-06-23  0:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: eranian, Ingo Molnar, LKML, Andrew Morton, Thomas Gleixner,
	Robert Richter, Andi Kleen, Maynard Johnson, Carl Love,
	Corey J Ashford, Philip Mucci, Dan Terpstra, perfmon2-devel

Peter Zijlstra writes:

> I think we would have to add that do the data page,.. something like the
> below?
> 
> Paulus?
> 
> ---
> Index: linux-2.6/include/linux/perf_counter.h
> ===================================================================
> --- linux-2.6.orig/include/linux/perf_counter.h
> +++ linux-2.6/include/linux/perf_counter.h
> @@ -232,6 +232,10 @@ struct perf_counter_mmap_page {
>  	__u32	lock;			/* seqlock for synchronization */
>  	__u32	index;			/* hardware counter identifier */
>  	__s64	offset;			/* add to hardware counter value */
> +	__u64	total_time;		/* total time counter active */
> +	__u64	running_time;		/* time counter on cpu */
> +
> +	__u64	__reserved[123];	/* align at 1k */
>  
>  	/*
>  	 * Control data for the mmap() data buffer.
> Index: linux-2.6/kernel/perf_counter.c
> ===================================================================
> --- linux-2.6.orig/kernel/perf_counter.c
> +++ linux-2.6/kernel/perf_counter.c
> @@ -1782,6 +1782,12 @@ void perf_counter_update_userpage(struct
>  	if (counter->state == PERF_COUNTER_STATE_ACTIVE)
>  		userpg->offset -= atomic64_read(&counter->hw.prev_count);
>  
> +	userpg->total_time = counter->total_time_enabled +
> +			atomic64_read(&counter->child_total_time_enabled);
> +
> +	userpg->running_time = counter->total_time_running +
> +			atomic64_read(&counter->child_total_time_running);

Hmmm, when the counter is running, what you want is not so much the
total time so far as a way to compute the total time so far from the
current TSC/timebase value.  So we would need to export tstamp_enabled
and tstamp_running plus a scale/offset for converting the TSC/timebase
value to nanoseconds consistent with ctx->time.  On powerpc that's
pretty straightforward because the timebases, but on x86 I gather the
offset and maybe also the scale would need to be per-cpu (which is OK,
because all the values in the mmapped page are only useful on one
specific CPU).

How would we compute the scale and offset on x86, given the current
TSC value and ctx->time?

Paul.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: I.5 - Mmaped count
  2009-06-22 12:54       ` stephane eranian
  2009-06-22 14:39         ` Peter Zijlstra
@ 2009-06-23  0:41         ` Paul Mackerras
  1 sibling, 0 replies; 68+ messages in thread
From: Paul Mackerras @ 2009-06-23  0:41 UTC (permalink / raw)
  To: eranian
  Cc: Peter Zijlstra, Ingo Molnar, LKML, Andrew Morton,
	Thomas Gleixner, Robert Richter, Andi Kleen, Maynard Johnson,
	Carl Love, Corey J Ashford, Philip Mucci, Dan Terpstra,
	perfmon2-devel

stephane eranian writes:

> I don't see where you clear that field on x86.
> Looks like it comes from hwc->idx. I suspect you need
> to do something in x86_pmu_disable() to be symmetrical
> with x86_pmu_enable().
> 
> I suspect something similar needs to be done on Power.

power_pmu_disable() already calls perf_counter_update_userpage,
so I think we're OK there.

Paul.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: I.2 - Grouping
  2009-06-22 11:50 ` I.2 - Grouping Ingo Molnar
  2009-06-22 19:45   ` stephane eranian
  2009-06-22 21:38   ` Corey Ashford
@ 2009-06-23  5:16   ` Paul Mackerras
  2009-06-23  7:36     ` stephane eranian
  2 siblings, 1 reply; 68+ messages in thread
From: Paul Mackerras @ 2009-06-23  5:16 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: eranian, LKML, Andrew Morton, Thomas Gleixner, Robert Richter,
	Peter Zijlstra, Andi Kleen, Maynard Johnson, Carl Love,
	Corey J Ashford, Philip Mucci, Dan Terpstra, perfmon2-devel

Ingo Molnar writes:

> > 2/ Grouping
> >
> > By design, an event can only be part of one group at a time.

To clarify this statement of Stephane's, a _counter_ can only be in
one group.  You can have multiple counters counting the same _event_
and those counters can (obviously) be in different groups.

> > Events in a group are guaranteed to be active on the PMU at the
> > same time. That means a group cannot have more events than there
> > are available counters on the PMU. Tools may want to know the
> > number of counters available in order to group their events
> > accordingly, such that reliable ratios could be computed. It seems
> > the only way to know this is by trial and error. This is not
> > practical.
> 
> Groups are there to support heavily constrained PMUs, and for them
> this is the only way, as there is no simple linear expression for
> how many counters one can load on the PMU.

That's not the only reason for having groups, or even the main reason
IMO.  The main reason for having groups is to provide a way to ask the
scheduler to ensure that two or more counters are always scheduled
together, so that you can meaningfully do arithmetic operations on the
counter values that would be sensitive to the statistical noise
introduced by the scheduling, such as ratios and differences.

In other words, grouping is there because we don't guarantee to have
all counters scheduled onto the PMU whenever possible.  Heavily
constrained PMUs increase the need for scheduling, but even if
counters are completely orthogonal there are only a fixed number of
them so we still need to schedule counters at some point.

Paul.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: I.5 - Mmaped count
  2009-06-23  0:39       ` Paul Mackerras
@ 2009-06-23  6:13         ` Peter Zijlstra
  2009-06-23  7:40         ` stephane eranian
  1 sibling, 0 replies; 68+ messages in thread
From: Peter Zijlstra @ 2009-06-23  6:13 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: eranian, Ingo Molnar, LKML, Andrew Morton, Thomas Gleixner,
	Robert Richter, Andi Kleen, Maynard Johnson, Carl Love,
	Corey J Ashford, Philip Mucci, Dan Terpstra, perfmon2-devel

On Tue, 2009-06-23 at 10:39 +1000, Paul Mackerras wrote:
> Peter Zijlstra writes:
> 
> > I think we would have to add that do the data page,.. something like the
> > below?
> > 
> > Paulus?
> > 
> > ---
> > Index: linux-2.6/include/linux/perf_counter.h
> > ===================================================================
> > --- linux-2.6.orig/include/linux/perf_counter.h
> > +++ linux-2.6/include/linux/perf_counter.h
> > @@ -232,6 +232,10 @@ struct perf_counter_mmap_page {
> >  	__u32	lock;			/* seqlock for synchronization */
> >  	__u32	index;			/* hardware counter identifier */
> >  	__s64	offset;			/* add to hardware counter value */
> > +	__u64	total_time;		/* total time counter active */
> > +	__u64	running_time;		/* time counter on cpu */
> > +
> > +	__u64	__reserved[123];	/* align at 1k */
> >  
> >  	/*
> >  	 * Control data for the mmap() data buffer.
> > Index: linux-2.6/kernel/perf_counter.c
> > ===================================================================
> > --- linux-2.6.orig/kernel/perf_counter.c
> > +++ linux-2.6/kernel/perf_counter.c
> > @@ -1782,6 +1782,12 @@ void perf_counter_update_userpage(struct
> >  	if (counter->state == PERF_COUNTER_STATE_ACTIVE)
> >  		userpg->offset -= atomic64_read(&counter->hw.prev_count);
> >  
> > +	userpg->total_time = counter->total_time_enabled +
> > +			atomic64_read(&counter->child_total_time_enabled);
> > +
> > +	userpg->running_time = counter->total_time_running +
> > +			atomic64_read(&counter->child_total_time_running);
> 
> Hmmm, when the counter is running, what you want is not so much the
> total time so far as a way to compute the total time so far from the
> current TSC/timebase value.  So we would need to export tstamp_enabled
> and tstamp_running plus a scale/offset for converting the TSC/timebase
> value to nanoseconds consistent with ctx->time.  On powerpc that's
> pretty straightforward because the timebases, but on x86 I gather the
> offset and maybe also the scale would need to be per-cpu (which is OK,
> because all the values in the mmapped page are only useful on one
> specific CPU).
> 
> How would we compute the scale and offset on x86, given the current
> TSC value and ctx->time?

With pain and suffering ;-)

The userpage would have to provide a multiplier and offset, and we'd
have to register a cpufreq notifier hook and iterate all active counters
and update these mult,offset bits when the cpu freq changes.

An alternative could be to simply ensure we update these timestamps at
least once per the RR interval (tick), that way the times are more or
less recent and could still be used for scaling purposes.

The most important data in these timestamps is their ratio, not their
absolute value, therefore if we keep the ratio statistically significant
we're good enough.


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [perfmon2] IV.3 - AMD IBS
  2009-06-22 14:08   ` [perfmon2] " Rob Fowler
  2009-06-22 17:58     ` Maynard Johnson
@ 2009-06-23  6:19     ` Peter Zijlstra
  2009-06-23  8:19       ` stephane eranian
  2009-06-23 14:40       ` Rob Fowler
  1 sibling, 2 replies; 68+ messages in thread
From: Peter Zijlstra @ 2009-06-23  6:19 UTC (permalink / raw)
  To: Rob Fowler
  Cc: Ingo Molnar, eranian, Philip Mucci, LKML, Andi Kleen,
	Paul Mackerras, Maynard Johnson, Andrew Morton, Thomas Gleixner,
	perfmon2-devel

On Mon, 2009-06-22 at 10:08 -0400, Rob Fowler wrote:
> Ingo Molnar wrote:
> >> 3/ AMD IBS
> >>
> >> How is AMD IBS going to be implemented?
> >>
> >> IBS has two separate sets of registers. One to capture fetch
> >> related data and another one to capture instruction execution
> >> data. For each, there is one config register but multiple data
> >> registers. In each mode, there is a specific sampling period and
> >> IBS can interrupt.
> >>
> >> It looks like you could define two pseudo events or event types
> >> and then define a new record_format and read_format. That formats
> >> would only be valid for an IBS event.
> >>
> >> Is that how you intend to support IBS?
> > 
> > That is indeed one of the ways we thought of, not really nice, but
> > then, IBS is really weird, what were those AMD engineers thinking
> > :-)
> 
> Actually, IBS has roots in DEC's "ProfileMe" for Alpha EV67 and later
> processors.   Those of us who used it there found it to be an extremely
> powerful, low-overhead mechanism for directly collecting information about
> how well the micro-architecture is performing.  In particular, it can tell
> you, not only which instructions take a long time to traverse the pipe, but
> it also tells you which instructions delay other instructions and by how much.
> This is extremely valuable if you are either working on instruction scheduling
> in a compiler, or are modifying a program to give the compiler the opportunity
> to do a good job.
> 
> A core group of engineers who worked on Alpha went on to AMD.
> 
> An unfortunate problem with IBS on AMD is that good support isn't common in the "mainstream"
> open source community.

The 'problem' I have with IBS is that its basically a cycle counter
coupled with a pretty arbitrary number of output dimensions separated
into two groups, ops and fetches.

This is a very weird configuration in that it has a miss-match with the
traditional one value per counter thing.

The most natural way to support IBS would be to have a special sampling
cycle counter and use that as group lead and add non sampling siblings
to that group to get individual elements.

This is however quite cumbersome.

One thing to consider when building an IBS interface is its future
extensibility. In which fashion would IBS be extended?, additional
output dimensions or something else all-together?


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: I.2 - Grouping
  2009-06-23  5:16   ` Paul Mackerras
@ 2009-06-23  7:36     ` stephane eranian
  2009-06-23  8:26       ` Paul Mackerras
  0 siblings, 1 reply; 68+ messages in thread
From: stephane eranian @ 2009-06-23  7:36 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Ingo Molnar, LKML, Andrew Morton, Thomas Gleixner,
	Robert Richter, Peter Zijlstra, Andi Kleen, Maynard Johnson,
	Carl Love, Corey J Ashford, Philip Mucci, Dan Terpstra,
	perfmon2-devel

On Tue, Jun 23, 2009 at 7:16 AM, Paul Mackerras<paulus@samba.org> wrote:
> Ingo Molnar writes:
>
>> > 2/ Grouping
>> >
>> > By design, an event can only be part of one group at a time.
>
> To clarify this statement of Stephane's, a _counter_ can only be in
> one group.  You can have multiple counters counting the same _event_
> and those counters can (obviously) be in different groups.
>
Okay.

What happens if I do:
   fd0 = perf_counter_open(&hwc1, getpid(), -1, -1, 0);
   fd1 = perf_counter_open(&hwc2, getpid(), -1, fd0, 0);

And then:
   fd2 = perf_counter_open(&hwc2, getpid(), -1, fd1, 0);

>> > Events in a group are guaranteed to be active on the PMU at the
>> > same time. That means a group cannot have more events than there
>> > are available counters on the PMU. Tools may want to know the
>> > number of counters available in order to group their events
>> > accordingly, such that reliable ratios could be computed. It seems
>> > the only way to know this is by trial and error. This is not
>> > practical.
>>
>> Groups are there to support heavily constrained PMUs, and for them
>> this is the only way, as there is no simple linear expression for
>> how many counters one can load on the PMU.
>
> That's not the only reason for having groups, or even the main reason
> IMO.  The main reason for having groups is to provide a way to ask the
> scheduler to ensure that two or more counters are always scheduled
> together, so that you can meaningfully do arithmetic operations on the
> counter values that would be sensitive to the statistical noise
> introduced by the scheduling, such as ratios and differences.
>
> In other words, grouping is there because we don't guarantee to have
> all counters scheduled onto the PMU whenever possible.  Heavily
> constrained PMUs increase the need for scheduling, but even if
> counters are completely orthogonal there are only a fixed number of
> them so we still need to schedule counters at some point.
>
I agree completely.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: I.5 - Mmaped count
  2009-06-23  0:39       ` Paul Mackerras
  2009-06-23  6:13         ` Peter Zijlstra
@ 2009-06-23  7:40         ` stephane eranian
  1 sibling, 0 replies; 68+ messages in thread
From: stephane eranian @ 2009-06-23  7:40 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Peter Zijlstra, Ingo Molnar, LKML, Andrew Morton,
	Thomas Gleixner, Robert Richter, Andi Kleen, Maynard Johnson,
	Carl Love, Corey J Ashford, Philip Mucci, Dan Terpstra,
	perfmon2-devel

Paul,

On Tue, Jun 23, 2009 at 2:39 AM, Paul Mackerras<paulus@samba.org> wrote:
>
> Hmmm, when the counter is running, what you want is not so much the
> total time so far as a way to compute the total time so far from the
> current TSC/timebase value.  So we would need to export tstamp_enabled
> and tstamp_running plus a scale/offset for converting the TSC/timebase
> value to nanoseconds consistent with ctx->time.  On powerpc that's
> pretty straightforward because the timebases, but on x86 I gather the
> offset and maybe also the scale would need to be per-cpu (which is OK,
> because all the values in the mmapped page are only useful on one
> specific CPU).
>
I think you should make it such that reading via mmap and read() are
equivalent, one is just lower overhead than the other. Otherwise it would
make it more difficult for tools in case of multiplexing where you could
fallback to read() and there you would not get the same information.


> How would we compute the scale and offset on x86, given the current
> TSC value and ctx->time?
>
> Paul.
>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [perfmon2] IV.3 - AMD IBS
  2009-06-23  6:19     ` Peter Zijlstra
@ 2009-06-23  8:19       ` stephane eranian
  2009-06-23 14:05         ` Ingo Molnar
  2009-06-23 14:40       ` Rob Fowler
  1 sibling, 1 reply; 68+ messages in thread
From: stephane eranian @ 2009-06-23  8:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rob Fowler, Ingo Molnar, Philip Mucci, LKML, Andi Kleen,
	Paul Mackerras, Maynard Johnson, Andrew Morton, Thomas Gleixner,
	perfmon2-devel

On Tue, Jun 23, 2009 at 8:19 AM, Peter Zijlstra<a.p.zijlstra@chello.nl> wrote:
>
> The 'problem' I have with IBS is that its basically a cycle counter
> coupled with a pretty arbitrary number of output dimensions separated
> into two groups, ops and fetches.
>
Well, that's your view. Mine is different.

You have 2 independent cycle counters (one for fetch, one for op),
each is coupled with a value. it is just that the value does not fit
into 64 bits. the cycle count is not hosted in a generic counter but
in its own register. The captured data for fetch is constructed with
IBSFETCHCTL, IBSFETCHLINAD, IBSFETCHPHYSAD. The 3
registers are tied together. The values they contain represent the
same fetch event. Same thing with IBS op. I don't see the problem
because your API is able to cope with variable length output data.
The sampling buffer certainly can. Of course, internally you'd have
to manage it in a special way, but you already do this for fixed
counters, don't you?

> This is a very weird configuration in that it has a miss-match with the
> traditional one value per counter thing.
>
This is not the universal model. I can give lots of examples on Itanium
where you have one config register and multiple data registers
to capture the event: branch trace buffer (1 config, 33 data), Data EAR
(cache/TLB miss sampling, 1 config 3 data),Instruction-EAR
(cache/TLB miss sampling, 1 config, 2 data).

> The most natural way to support IBS would be to have a special sampling
> cycle counter and use that as group lead and add non sampling siblings
> to that group to get individual elements.
>
As discussed in my message, I think the way to support IBS is to create two
pseudo-events (like your perf_hw_event_ids), one for fetch and one for op
(because they could be measured simultaneously). The sample_period field
would be used to express the IBS*CTL maxcnt, subject to the verification
that the bottom 4 bits must be 0. And then, you add  two new sampling formats
PERF_SAMPLE_IBSFETCH, PERF_SAMPLE_IBSOP. Those would only work
with IBS pseudo events. Once you have the randomize option in perf_counter_attr,
you could even enable IBSFETCH randomization.

What is wrong with this approach?

Another question is: how do you present the values contained in the IBS data
registers:
   1 - leave it as raw (tool parses the raw register values)
   2 - decode it in the kernel and expose your own format

With 1/, you'd pick up automatically new fields if AMD adds some.
With 2/, you'd have to change your format if AMD change theirs.



> This is however quite cumbersome.
>
> One thing to consider when building an IBS interface is its future
> extensibility. In which fashion would IBS be extended?, additional
> output dimensions or something else all-together?
>
I don't know but it would be nice to provide better filtering capabilities
but for that, they can use some of the reserved bits they have, no
need to add more data registers.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: I.2 - Grouping
  2009-06-23  7:36     ` stephane eranian
@ 2009-06-23  8:26       ` Paul Mackerras
  2009-06-23  8:30         ` stephane eranian
  0 siblings, 1 reply; 68+ messages in thread
From: Paul Mackerras @ 2009-06-23  8:26 UTC (permalink / raw)
  To: eranian
  Cc: Ingo Molnar, LKML, Andrew Morton, Thomas Gleixner,
	Robert Richter, Peter Zijlstra, Andi Kleen, Maynard Johnson,
	Carl Love, Corey J Ashford, Philip Mucci, Dan Terpstra,
	perfmon2-devel

stephane eranian writes:

> What happens if I do:
>    fd0 = perf_counter_open(&hwc1, getpid(), -1, -1, 0);
>    fd1 = perf_counter_open(&hwc2, getpid(), -1, fd0, 0);
> 
> And then:
>    fd2 = perf_counter_open(&hwc2, getpid(), -1, fd1, 0);

That perf_counter_open call will fail with an EINVAL error because fd1
is not a group leader.

Paul.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: I.2 - Grouping
  2009-06-23  8:26       ` Paul Mackerras
@ 2009-06-23  8:30         ` stephane eranian
  2009-06-23 16:24           ` Corey Ashford
  0 siblings, 1 reply; 68+ messages in thread
From: stephane eranian @ 2009-06-23  8:30 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Ingo Molnar, LKML, Andrew Morton, Thomas Gleixner,
	Robert Richter, Peter Zijlstra, Andi Kleen, Maynard Johnson,
	Carl Love, Corey J Ashford, Philip Mucci, Dan Terpstra,
	perfmon2-devel

On Tue, Jun 23, 2009 at 10:26 AM, Paul Mackerras<paulus@samba.org> wrote:
> stephane eranian writes:
>
>> What happens if I do:
>>    fd0 = perf_counter_open(&hwc1, getpid(), -1, -1, 0);
>>    fd1 = perf_counter_open(&hwc2, getpid(), -1, fd0, 0);
>>
>> And then:
>>    fd2 = perf_counter_open(&hwc2, getpid(), -1, fd1, 0);
>
> That perf_counter_open call will fail with an EINVAL error because fd1
> is not a group leader.
>
Okay. Thanks.

But that leads me to another question related to grouping
which was hinted for by Peter's comment about IBS.

Can you have multiple sampling events in a group?

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: II.2 - Event knowledge missing
  2009-06-22 11:57 ` II.2 - Event knowledge missing Ingo Molnar
@ 2009-06-23 13:18   ` stephane eranian
  0 siblings, 0 replies; 68+ messages in thread
From: stephane eranian @ 2009-06-23 13:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, Andrew Morton, Thomas Gleixner, Robert Richter,
	Peter Zijlstra, Paul Mackerras, Andi Kleen, Maynard Johnson,
	Carl Love, Corey J Ashford, Philip Mucci, Dan Terpstra,
	perfmon2-devel

On Mon, Jun 22, 2009 at 1:57 PM, Ingo Molnar<mingo@elte.hu> wrote:
>> 2/ Event knowledge missing
>>
>> There are constraints on events in Intel processors. Different
>> constraints do exist on AMD64 processors, especially with
>> uncore-releated events.
>
> You raise the issue of uncore events in IV.1, but let us reply here
> primarily.
>
> Un-core counters and events seem to be somewhat un-interesting to
> us. (Patches from those who find them interesting are welcome of
> course!)
>
That is you opinion but not mine. I believe uncore is useful though
it is harder to manage than core PMU. I know that because I have
implemented the support for Nehalem. But going back to our discussion
from December, if it's there it's because it provides some value-add,
why would the hardware designers have bothered otherwise?

It is true that if you've only read the uncore description in Volume 3b, it
is not clear what this can actually do. Therefore, I recommend you take
a look at section B.2.5 of the Intel optimization manual:

               http://www.intel.com/Assets/PDF/manual/248966.pdf

It shows a bunch of interesting metrics one can collect using uncore.
Metrics which you cannot get any other way. Some people do care
about those, otherwise they would not be explained.

>
> The main problem with uncore events is that they are per physical
> package, and hence tying a piece of physical metric exposed via them
> to a particular workload is hard - unless full-system analysis is
> performed. 'Task driven' metrics seem far more useful to performance
> analysis (and those are the preferred analysis method of most
> user-space developers), as they allow particularized sampling and
> allow the tight connection between workload and metric.
>
That is the nature of the beast. There is not much you can do about
this. But this is still useful especially if you have a symmetrical
workload like many scientific applications have.

Note that uncore also exist on AMD64, though, not as clearly separated.
Some events collect at the package level, yet they are using core PMU
counters. And those come with restrictions as well see Section 3.12,
description of PERFCTL, in the BKDG for Family 10h.

> If, despite our expecations, uncore events prove to be useful,
> popular and required elements of performance analysis, they can be
> supported in perfcounters via various levels:
>
>  - a special raw ID range on x86, only to per CPU counters. The
>   low-level implementation reserves the uncore PMCs, so overlapping
>   allocation (and interaction between the cores via the MSRs) is
>   not possible.
>
I agree this is for CPU counters only, not per-thread. It could be any
core in the package. In fact, multiple per CPU "sessions" could
co-exist in the same package. But there is one difficulty with allowing
this, though. The uncore does not interrupt directly. You need to
designate which core(s) it will interrupt via a bitmask. It could interrupt
ALL CPUs in the package at once (which is another interesting usage
model of uncore). So I believe the choice is between 1 CPU and
all CPUs.

Uncore events have no constraints, except for the single fixed counter
event (UNC_CLK_UNHALTED). Thus, you could still use your
event model and overcommit the uncore and multiplex groups on it.
You could reject events in a group once you reach 8 (max number of
counters). I don't see the difference there. The only issue is with managing
the interrupt.



>  - generic enumeration with full tooling support, time-sharing and
>   the whole set of features. The low-level backend would time-share
>   the resource between interested CPUs.
>
> There is no limitation in the perfcounters design that somehow makes
> uncore events harder to support. The uncore counters _themselves_
> are limited to begin with - so rich features cannot be offered on
> top of them.
>
I would say they are limited. This is what you can do from where they
are sourced from.


>
>> The current code-base does not have any constrained event support,
>> therefore bogus counts may be returned depending on the event
>> measured.
>
> Then we'll need to grow some when we run into them :-)

FYI, here is the list of constrained events for Intel Core.
Counter [0] means generic counter0, [1] means generic counter1.
If you do not put these events in the right counter, they do
not count what they are supposed to, and do so silently.

Name     : FP_COMP_OPS_EXE
Code     : 0x10
Counters : [ 0 ]
Desc     : Floating point computational micro-ops executed
PEBS     : No

Name     : FP_ASSIST
Code     : 0x11
Counters : [ 1 ]
Desc     : Floating point assists
PEBS     : No

Name     : MUL
Code     : 0x12
Counters : [ 1 ]
Desc     : Multiply operations executed
PEBS     : No

Name     : DIV
Code     : 0x13
Counters : [ 1 ]
Desc     : Divide operations executed
PEBS     : No

Name     : CYCLES_DIV_BUSY
Code     : 0x14
Counters : [ 0 ]
Desc     : Cycles the divider is busy
PEBS     : No

Name     : IDLE_DURING_DIV
Code     : 0x18
Counters : [ 0 ]
Desc     : Cycles the divider is busy and all other execution units are idle
PEBS     : No

Name     : DELAYED_BYPASS
Code     : 0x19
Counters : [ 1 ]
Desc     : Delayed bypass
Umask-00 : 0x00 : [FP] : Delayed bypass to FP operation
Umask-01 : 0x01 : [SIMD] : Delayed bypass to SIMD operation
Umask-02 : 0x02 : [LOAD] : Delayed bypass to load operation
PEBS     : No

Name     : MEM_LOAD_RETIRED
Code     : 0xcb
Counters : [ 0 ]
Desc     : Retired loads that miss the L1 data cache
Umask-00 : 0x01 : [L1D_MISS] : Retired loads that miss the L1 data
cache (precise event)
Umask-01 : 0x02 : [L1D_LINE_MISS] : L1 data cache line missed by
retired loads (precise event)
Umask-02 : 0x04 : [L2_MISS] : Retired loads that miss the L2 cache
(precise event)
Umask-03 : 0x08 : [L2_LINE_MISS] : L2 cache line missed by retired
loads (precise event)
Umask-04 : 0x10 : [DTLB_MISS] : Retired loads that miss the DTLB (precise event)
PEBS     : [L1D_MISS] [L1D_LINE_MISS] [L2_MISS] [L2_LINE_MISS] [DTLB_MISS]

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [perfmon2] IV.3 - AMD IBS
  2009-06-23  8:19       ` stephane eranian
@ 2009-06-23 14:05         ` Ingo Molnar
  2009-06-23 14:25           ` stephane eranian
  0 siblings, 1 reply; 68+ messages in thread
From: Ingo Molnar @ 2009-06-23 14:05 UTC (permalink / raw)
  To: eranian
  Cc: Peter Zijlstra, Rob Fowler, Philip Mucci, LKML, Andi Kleen,
	Paul Mackerras, Maynard Johnson, Andrew Morton, Thomas Gleixner,
	perfmon2-devel


* stephane eranian <eranian@googlemail.com> wrote:

> > The most natural way to support IBS would be to have a special 
> > sampling cycle counter and use that as group lead and add non 
> > sampling siblings to that group to get individual elements.
> >
> As discussed in my message, I think the way to support IBS is to 
> create two pseudo-events (like your perf_hw_event_ids), one for 
> fetch and one for op (because they could be measured 
> simultaneously). The sample_period field would be used to express 
> the IBS*CTL maxcnt, subject to the verification that the bottom 4 
> bits must be 0. And then, you add two new sampling formats 
> PERF_SAMPLE_IBSFETCH, PERF_SAMPLE_IBSOP. Those would only work 
> with IBS pseudo events. Once you have the randomize option in 
> perf_counter_attr, you could even enable IBSFETCH randomization.

I'd suggest to start smaller, and first express the 'precise' nature 
of IBS transparently, by simply mapping it to one of the generic 
events. (cycles and instructions both appears to be possible)

No extra sampling, no extra events - just a transparent side channel 
implementation for the specific case of PERF_COUNT_HW_CPU_CYCLES. (A 
bit like the fixed-purpose counters are done on the Intel side - a 
special-case - but none of the generic code knows about it.)

This gives us immediate results with less code, and also gives us 
the platform to see how IBS is structured, what kind of general 
problems/quirks it has, and how popular its precision is, etc. We 
can always add extra sampling formats on top of that (i'm not 
opposed to that), to expose more and more of IBS.

The same can be done on the PEBS side as well.

Would you be interested in pursuing this?

	Ingo

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [perfmon2] IV.3 - AMD IBS
  2009-06-23 14:05         ` Ingo Molnar
@ 2009-06-23 14:25           ` stephane eranian
  2009-06-23 14:55             ` Ingo Molnar
  0 siblings, 1 reply; 68+ messages in thread
From: stephane eranian @ 2009-06-23 14:25 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Rob Fowler, Philip Mucci, LKML, Andi Kleen,
	Paul Mackerras, Maynard Johnson, Andrew Morton, Thomas Gleixner,
	perfmon2-devel

On Tue, Jun 23, 2009 at 4:05 PM, Ingo Molnar<mingo@elte.hu> wrote:
>
> * stephane eranian <eranian@googlemail.com> wrote:
>
>> > The most natural way to support IBS would be to have a special
>> > sampling cycle counter and use that as group lead and add non
>> > sampling siblings to that group to get individual elements.
>> >
>> As discussed in my message, I think the way to support IBS is to
>> create two pseudo-events (like your perf_hw_event_ids), one for
>> fetch and one for op (because they could be measured
>> simultaneously). The sample_period field would be used to express
>> the IBS*CTL maxcnt, subject to the verification that the bottom 4
>> bits must be 0. And then, you add two new sampling formats
>> PERF_SAMPLE_IBSFETCH, PERF_SAMPLE_IBSOP. Those would only work
>> with IBS pseudo events. Once you have the randomize option in
>> perf_counter_attr, you could even enable IBSFETCH randomization.
>
> I'd suggest to start smaller, and first express the 'precise' nature
> of IBS transparently, by simply mapping it to one of the generic
> events. (cycles and instructions both appears to be possible)
>
IBS is precise by nature. It does not work like PEBS. It tags an instruction
and then collects info about it. When it retires, IBS freezes and triggers an
interrupt like a regular counter interrupt. Except this time, you don't care
about the interrupted IP, you use the instruction address in the IBS data
register, it is guaranteed to correspond to the tagged instruction.

The sampling period expresses the delay before picking the instruction
to tag. And as I said before, it is only 20 bits and the bottom 4 bits must
be zero (as they cannot be encoded).



> No extra sampling, no extra events - just a transparent side channel
> implementation for the specific case of PERF_COUNT_HW_CPU_CYCLES. (A
> bit like the fixed-purpose counters are done on the Intel side - a
> special-case - but none of the generic code knows about it.)
>

> This gives us immediate results with less code, and also gives us
> the platform to see how IBS is structured, what kind of general
> problems/quirks it has, and how popular its precision is, etc. We
> can always add extra sampling formats on top of that (i'm not
> opposed to that), to expose more and more of IBS.
>
> The same can be done on the PEBS side as well.
>
> Would you be interested in pursuing this?
>
>        Ingo
>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [perfmon2] IV.3 - AMD IBS
  2009-06-23  6:19     ` Peter Zijlstra
  2009-06-23  8:19       ` stephane eranian
@ 2009-06-23 14:40       ` Rob Fowler
  1 sibling, 0 replies; 68+ messages in thread
From: Rob Fowler @ 2009-06-23 14:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, eranian, Philip Mucci, LKML, Andi Kleen,
	Paul Mackerras, Maynard Johnson, Andrew Morton, Thomas Gleixner,
	perfmon2-devel

I'm up to my neck in other stuff, so this will be short.

Yes, IBS is a very different model of performance measurement that
doesn't fit well with the traditional model.  It does do what the
HW engineers need for understanding multi unit out of order processors, though.

The separation of the fetch and op monitoring is an artifact of the separation
and de-coupling of the front and back end pipelines.  The front end IBS events
deal with stuff that happens in fetching:  TLB and cache misses, mis-predictions, etc.
The back end IBS events deal with computing and data fetch/store.

There are two "conventional" counters involved: the tag-to-retire and completion-to-retire
counts.  These can be accumulated, histogrammed, etc just like any conventional event, though
you need to add the counter contents to the accumulator rather than just increment.

The rest of the bits are predicates that can be used to filter the events into bins.
With n bits, you might need 2^n bins to accumulate all possibilities.
Sections 5 and 6 of the AMD software optimization guide provide some useful boolean expressions
for defining meaningful derived events.

In an old version of Rice HPCToolkit (now disappeared from the web) we
had a tool called xprof that processed DEC DCPI/ProfileMe binary files to
produce profiles with ~20 derived events that we thought would be useful.  The cost
of collecting all of this didn't vary by the amount we collected, so you would select
the ones you wanted to view at analysis time, not at execute time.  There was also a
mechanism for specifying other events. Nathan Tallent can provide details.

The Linear and Physical Address registers are an opportunity for someone to build data profiling
tools, or a combined instructions and data tool.

The critical thing is for the kernel, driver, and library builders to not do something that will
stand in the way of this.

Peter Zijlstra wrote:
> On Mon, 2009-06-22 at 10:08 -0400, Rob Fowler wrote:
>> Ingo Molnar wrote:
>>>> 3/ AMD IBS
>>>>
>>>> How is AMD IBS going to be implemented?
>>>>
>>>> IBS has two separate sets of registers. One to capture fetch
>>>> related data and another one to capture instruction execution
>>>> data. For each, there is one config register but multiple data
>>>> registers. In each mode, there is a specific sampling period and
>>>> IBS can interrupt.
>>>>
>>>> It looks like you could define two pseudo events or event types
>>>> and then define a new record_format and read_format. That formats
>>>> would only be valid for an IBS event.
>>>>
>>>> Is that how you intend to support IBS?
>>> That is indeed one of the ways we thought of, not really nice, but
>>> then, IBS is really weird, what were those AMD engineers thinking
>>> :-)
>> Actually, IBS has roots in DEC's "ProfileMe" for Alpha EV67 and later
>> processors.   Those of us who used it there found it to be an extremely
>> powerful, low-overhead mechanism for directly collecting information about
>> how well the micro-architecture is performing.  In particular, it can tell
>> you, not only which instructions take a long time to traverse the pipe, but
>> it also tells you which instructions delay other instructions and by how much.
>> This is extremely valuable if you are either working on instruction scheduling
>> in a compiler, or are modifying a program to give the compiler the opportunity
>> to do a good job.
>>
>> A core group of engineers who worked on Alpha went on to AMD.
>>
>> An unfortunate problem with IBS on AMD is that good support isn't common in the "mainstream"
>> open source community.
> 
> The 'problem' I have with IBS is that its basically a cycle counter
> coupled with a pretty arbitrary number of output dimensions separated
> into two groups, ops and fetches.
> 
> This is a very weird configuration in that it has a miss-match with the
> traditional one value per counter thing.
> 
> The most natural way to support IBS would be to have a special sampling
> cycle counter and use that as group lead and add non sampling siblings
> to that group to get individual elements.
> 
> This is however quite cumbersome.
> 
> One thing to consider when building an IBS interface is its future
> extensibility. In which fashion would IBS be extended?, additional
> output dimensions or something else all-together?

-- 
Robert J. Fowler
Chief Domain Scientist, HPC
Renaissance Computing Institute
The University of North Carolina at Chapel Hill
100 Europa Dr, Suite 540
Chapel Hill, NC 27517
V: 919.445.9670
F: 919 445.9669
rjf@renci.org

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [perfmon2] IV.3 - AMD IBS
  2009-06-23 14:25           ` stephane eranian
@ 2009-06-23 14:55             ` Ingo Molnar
  0 siblings, 0 replies; 68+ messages in thread
From: Ingo Molnar @ 2009-06-23 14:55 UTC (permalink / raw)
  To: eranian
  Cc: Peter Zijlstra, Rob Fowler, Philip Mucci, LKML, Andi Kleen,
	Paul Mackerras, Maynard Johnson, Andrew Morton, Thomas Gleixner,
	perfmon2-devel


* stephane eranian <eranian@googlemail.com> wrote:

> On Tue, Jun 23, 2009 at 4:05 PM, Ingo Molnar<mingo@elte.hu> wrote:
> >
> > * stephane eranian <eranian@googlemail.com> wrote:
> >
> >> > The most natural way to support IBS would be to have a special
> >> > sampling cycle counter and use that as group lead and add non
> >> > sampling siblings to that group to get individual elements.
> >> >
> >> As discussed in my message, I think the way to support IBS is to
> >> create two pseudo-events (like your perf_hw_event_ids), one for
> >> fetch and one for op (because they could be measured
> >> simultaneously). The sample_period field would be used to express
> >> the IBS*CTL maxcnt, subject to the verification that the bottom 4
> >> bits must be 0. And then, you add two new sampling formats
> >> PERF_SAMPLE_IBSFETCH, PERF_SAMPLE_IBSOP. Those would only work
> >> with IBS pseudo events. Once you have the randomize option in
> >> perf_counter_attr, you could even enable IBSFETCH randomization.
> >
> > I'd suggest to start smaller, and first express the 'precise' 
> > nature of IBS transparently, by simply mapping it to one of the 
> > generic events. (cycles and instructions both appears to be 
> > possible)
>
> IBS is precise by nature.

(yes. Did you understand my comments above as saying the opposite?)

> [...] It does not work like PEBS. It tags an instruction and then 
> collects info about it. When it retires, IBS freezes and triggers 
> an interrupt like a regular counter interrupt. Except this time, 
> you don't care about the interrupted IP, you use the instruction 
> address in the IBS data register, it is guaranteed to correspond 
> to the tagged instruction.
> 
> The sampling period expresses the delay before picking the 
> instruction to tag. And as I said before, it is only 20 bits and 
> the bottom 4 bits must be zero (as they cannot be encoded).

The 20 bits delay is in cycles, right? So this in itself still lends 
itself to be transparently provided as a PERF_COUNT_HW_CPU_CYCLES 
counter.

	Ingo

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: I.2 - Grouping
  2009-06-23  8:30         ` stephane eranian
@ 2009-06-23 16:24           ` Corey Ashford
  0 siblings, 0 replies; 68+ messages in thread
From: Corey Ashford @ 2009-06-23 16:24 UTC (permalink / raw)
  To: eranian
  Cc: Paul Mackerras, Ingo Molnar, LKML, Andrew Morton,
	Thomas Gleixner, Robert Richter, Peter Zijlstra, Andi Kleen,
	Maynard Johnson, Carl Love, Corey J Ashford, Philip Mucci,
	Dan Terpstra, perfmon2-devel

stephane eranian wrote:
> On Tue, Jun 23, 2009 at 10:26 AM, Paul Mackerras<paulus@samba.org> wrote:
>> stephane eranian writes:
>>
>>> What happens if I do:
>>>    fd0 = perf_counter_open(&hwc1, getpid(), -1, -1, 0);
>>>    fd1 = perf_counter_open(&hwc2, getpid(), -1, fd0, 0);
>>>
>>> And then:
>>>    fd2 = perf_counter_open(&hwc2, getpid(), -1, fd1, 0);
>> That perf_counter_open call will fail with an EINVAL error because fd1
>> is not a group leader.
>>
> Okay. Thanks.
> 
> But that leads me to another question related to grouping
> which was hinted for by Peter's comment about IBS.
> 
> Can you have multiple sampling events in a group?

There's a PAPI ctest which exercises this - profile_two_events - and it 
does work with the PCL substrate.

- Corey



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: I.2 - Grouping
  2009-06-22 22:04     ` Corey Ashford
@ 2009-06-23 17:51       ` stephane eranian
  0 siblings, 0 replies; 68+ messages in thread
From: stephane eranian @ 2009-06-23 17:51 UTC (permalink / raw)
  To: Corey Ashford
  Cc: Ingo Molnar, LKML, Andrew Morton, Thomas Gleixner,
	Robert Richter, Peter Zijlstra, Paul Mackerras, Andi Kleen,
	Maynard Johnson, cel, Corey J Ashford, Philip Mucci,
	Dan Terpstra, perfmon2-devel

Corey,

On Tue, Jun 23, 2009 at 12:04 AM, Corey
Ashford<cjashfor@linux.vnet.ibm.com> wrote:
> stephane eranian wrote:
>>
>> On Mon, Jun 22, 2009 at 1:50 PM, Ingo Molnar<mingo@elte.hu> wrote:
>>>>
>>>> 2/ Grouping
>>>>
>>>> By design, an event can only be part of one group at a time.
>>
>> As I read this again, another question came up. Is the statement
>> above also true for the group leader?
>>
>>
>>>> Events in a group are guaranteed to be active on the PMU at the
>>>> same time. That means a group cannot have more events than there
>>>> are available counters on the PMU. Tools may want to know the
>>>> number of counters available in order to group their events
>>>> accordingly, such that reliable ratios could be computed. It seems
>>>> the only way to know this is by trial and error. This is not
>>>> practical.
>>>
>>> Groups are there to support heavily constrained PMUs, and for them
>>> this is the only way, as there is no simple linear expression for
>>> how many counters one can load on the PMU.
>>>
>> But then, does that mean that users need to be aware of constraints
>> to form groups. Group are formed by users. I thought, the whole point
>> of the API was to hide that kind of hardware complexity.
>>
>> Groups are needed when sampling, e.g., PERF_SAMPLE_GROUP.
>> For instance, I sample the IP and the values of the other events in
>> my group.  Grouping looks like the only way to sampling on Itanium
>> branch buffers (BTB), for instance. You program an event in a generic
>> counter which counts the number of entries recorded in the buffer.
>> Thus, to sample the BTB, you sample on this event, when it overflows
>> you grab the content of the BTB. Here, the event and the BTB are tied
>> together. You cannot count the event in one group, and have the BTB
>> in another one (BTB = 1 config reg + 32 data registers + 1 position reg).
>>
>
> Stephane, if you were to just place into groups those events which must be
> correlated, and just leave all of the others not grouped, wouldn't this
> solve the problem?  The kernel would be free to schedule the other events
> how and when it can, but would guarantee that your correlated events are on
> the PMU simultaneously.
>
It would work.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: I.1 - System calls - ioctl
  2009-06-22 12:58   ` Christoph Hellwig
  2009-06-22 13:56     ` Ingo Molnar
@ 2009-07-13 10:53     ` Peter Zijlstra
  2009-07-13 17:30       ` [perfmon2] " Arnd Bergmann
                         ` (2 more replies)
  1 sibling, 3 replies; 68+ messages in thread
From: Peter Zijlstra @ 2009-07-13 10:53 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ingo Molnar, eranian, LKML, Andrew Morton, Thomas Gleixner,
	Robert Richter, Paul Mackerras, Andi Kleen, Maynard Johnson,
	Carl Love, Corey J Ashford, Philip Mucci, Dan Terpstra,
	perfmon2-devel

On Mon, 2009-06-22 at 08:58 -0400, Christoph Hellwig wrote:
> But talking about syscalls the sys_perf_counter_open prototype is
> really ugly - it uses either the pid or cpu argument which is a pretty
> clear indicator it should actually be two sys calls.

Would something like the below be any better?

It would allow us to later add something like PERF_TARGET_SOCKET and
things like that.

(utterly untested)

---
 include/linux/perf_counter.h |   12 ++++++++++++
 include/linux/syscalls.h     |    3 ++-
 kernel/perf_counter.c        |   24 +++++++++++++++++++++++-
 tools/perf/builtin-record.c  |   11 ++++++++++-
 tools/perf/builtin-stat.c    |   12 ++++++++++--
 tools/perf/builtin-top.c     |   11 ++++++++++-
 tools/perf/perf.h            |    7 +++----
 7 files changed, 70 insertions(+), 10 deletions(-)

diff --git a/include/linux/perf_counter.h b/include/linux/perf_counter.h
index 5e970c7..f7dd22e 100644
--- a/include/linux/perf_counter.h
+++ b/include/linux/perf_counter.h
@@ -189,6 +189,18 @@ struct perf_counter_attr {
 	__u64			__reserved_3;
 };
 
+enum perf_counter_target_ids {
+	PERF_TARGET_PID		= 0,
+	PERF_TARGET_CPU		= 1,
+
+	PERF_TARGET_MAX		/* non-ABI */
+};
+
+struct perf_counter_target {
+	__u32			id;
+	__u64			val;
+};
+
 /*
  * Ioctls that can be done on a perf counter fd:
  */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 80de700..670edad 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -760,5 +760,6 @@ int kernel_execve(const char *filename, char *const argv[], char *const envp[]);
 
 asmlinkage long sys_perf_counter_open(
 		struct perf_counter_attr __user *attr_uptr,
-		pid_t pid, int cpu, int group_fd, unsigned long flags);
+		struct perf_counter_target __user *target,
+		int group_fd, unsigned long flags);
 #endif
diff --git a/kernel/perf_counter.c b/kernel/perf_counter.c
index fa20a9d..205f0b6 100644
--- a/kernel/perf_counter.c
+++ b/kernel/perf_counter.c
@@ -3969,15 +3969,18 @@ err_size:
  */
 SYSCALL_DEFINE5(perf_counter_open,
 		struct perf_counter_attr __user *, attr_uptr,
-		pid_t, pid, int, cpu, int, group_fd, unsigned long, flags)
+		struct perf_counter_target __user *, target_uptr,
+		int, group_fd, unsigned long, flags)
 {
 	struct perf_counter *counter, *group_leader;
 	struct perf_counter_attr attr;
+	struct perf_counter_target target;
 	struct perf_counter_context *ctx;
 	struct file *counter_file = NULL;
 	struct file *group_file = NULL;
 	int fput_needed = 0;
 	int fput_needed2 = 0;
+	int pid, cpu;
 	int ret;
 
 	/* for future expandability... */
@@ -3998,6 +4001,25 @@ SYSCALL_DEFINE5(perf_counter_open,
 			return -EINVAL;
 	}
 
+	ret = copy_from_user(&target, target_uptr, sizeof(target));
+	if (ret)
+		return ret;
+
+	switch (target.id) {
+	case PERF_TARGET_PID:
+		pid = target.val;
+		cpu = -1;
+		break;
+
+	case PERF_TARGET_CPU:
+		cpu = target.val;
+		pid = -1;
+		break;
+
+	default:
+		return -EINVAL;
+	}
+
 	/*
 	 * Get the target context (task or percpu):
 	 */
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 4ef78a5..b172d30 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -374,6 +374,7 @@ static struct perf_header_attr *get_header_attr(struct perf_counter_attr *a, int
 static void create_counter(int counter, int cpu, pid_t pid)
 {
 	struct perf_counter_attr *attr = attrs + counter;
+	struct perf_counter_target target;
 	struct perf_header_attr *h_attr;
 	int track = !counter; /* only the first counter needs these */
 	struct {
@@ -409,8 +410,16 @@ static void create_counter(int counter, int cpu, pid_t pid)
 	attr->inherit		= (cpu < 0) && inherit;
 	attr->disabled		= 1;
 
+	if (cpu < 0) {
+		target.id = PERF_TARGET_PID;
+		target.val = pid;
+	} else {
+		target.id = PERF_TARGET_CPU;
+		target.val = cpu;
+	}
+
 try_again:
-	fd[nr_cpu][counter] = sys_perf_counter_open(attr, pid, cpu, group_fd, 0);
+	fd[nr_cpu][counter] = sys_perf_counter_open(attr, &target, group_fd, 0);
 
 	if (fd[nr_cpu][counter] < 0) {
 		int err = errno;
diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 27921a8..342e642 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -106,6 +106,7 @@ static u64			runtime_cycles_noise;
 static void create_perf_stat_counter(int counter, int pid)
 {
 	struct perf_counter_attr *attr = attrs + counter;
+	struct perf_counter_target target;
 
 	if (scale)
 		attr->read_format = PERF_FORMAT_TOTAL_TIME_ENABLED |
@@ -114,8 +115,12 @@ static void create_perf_stat_counter(int counter, int pid)
 	if (system_wide) {
 		unsigned int cpu;
 
+		target.id = PERF_TARGET_CPU;
+
 		for (cpu = 0; cpu < nr_cpus; cpu++) {
-			fd[cpu][counter] = sys_perf_counter_open(attr, -1, cpu, -1, 0);
+			target.val = cpu;
+
+			fd[cpu][counter] = sys_perf_counter_open(attr, &target, -1, 0);
 			if (fd[cpu][counter] < 0 && verbose)
 				fprintf(stderr, ERR_PERF_OPEN, counter,
 					fd[cpu][counter], strerror(errno));
@@ -125,7 +130,10 @@ static void create_perf_stat_counter(int counter, int pid)
 		attr->disabled	     = 1;
 		attr->enable_on_exec = 1;
 
-		fd[0][counter] = sys_perf_counter_open(attr, pid, -1, -1, 0);
+		target.id = PERF_TARGET_PID;
+		target.val = pid;
+
+		fd[0][counter] = sys_perf_counter_open(attr, &target, -1, 0);
 		if (fd[0][counter] < 0 && verbose)
 			fprintf(stderr, ERR_PERF_OPEN, counter,
 				fd[0][counter], strerror(errno));
diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index 95d5c0a..facc870 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -548,6 +548,7 @@ int group_fd;
 
 static void start_counter(int i, int counter)
 {
+	struct perf_counter_target target;
 	struct perf_counter_attr *attr;
 	unsigned int cpu;
 
@@ -560,8 +561,16 @@ static void start_counter(int i, int counter)
 	attr->sample_type	= PERF_SAMPLE_IP | PERF_SAMPLE_TID;
 	attr->freq		= freq;
 
+	if (cpu < 0) {
+		target.id = PERF_TARGET_PID;
+		target.val = target_pid;
+	} else {
+		target.id = PERF_TARGET_CPU;
+		target.val = cpu;
+	}
+
 try_again:
-	fd[i][counter] = sys_perf_counter_open(attr, target_pid, cpu, group_fd, 0);
+	fd[i][counter] = sys_perf_counter_open(attr, &target, group_fd, 0);
 
 	if (fd[i][counter] < 0) {
 		int err = errno;
diff --git a/tools/perf/perf.h b/tools/perf/perf.h
index e5148e2..3d663d7 100644
--- a/tools/perf/perf.h
+++ b/tools/perf/perf.h
@@ -85,12 +85,11 @@ static inline unsigned long long rdclock(void)
 
 static inline int
 sys_perf_counter_open(struct perf_counter_attr *attr,
-		      pid_t pid, int cpu, int group_fd,
-		      unsigned long flags)
+		      struct perf_counter_target *target,
+		      int group_fd, unsigned long flags)
 {
 	attr->size = sizeof(*attr);
-	return syscall(__NR_perf_counter_open, attr, pid, cpu,
-		       group_fd, flags);
+	return syscall(__NR_perf_counter_open, attr, target, group_fd, flags);
 }
 
 #define MAX_COUNTERS			256



^ permalink raw reply related	[flat|nested] 68+ messages in thread

* Re: [perfmon2] I.1 - System calls - ioctl
  2009-07-13 10:53     ` Peter Zijlstra
@ 2009-07-13 17:30       ` Arnd Bergmann
  2009-07-13 17:34         ` Peter Zijlstra
  2009-07-14 13:51       ` Christoph Hellwig
  2009-07-30 13:58       ` stephane eranian
  2 siblings, 1 reply; 68+ messages in thread
From: Arnd Bergmann @ 2009-07-13 17:30 UTC (permalink / raw)
  To: perfmon2-devel
  Cc: Peter Zijlstra, Christoph Hellwig, Andrew Morton, eranian,
	Philip Mucci, LKML, Andi Kleen, Paul Mackerras, Maynard Johnson,
	Ingo Molnar, Thomas Gleixner

On Monday 13 July 2009, Peter Zijlstra wrote:
> On Mon, 2009-06-22 at 08:58 -0400, Christoph Hellwig wrote:
> > But talking about syscalls the sys_perf_counter_open prototype is
> > really ugly - it uses either the pid or cpu argument which is a pretty
> > clear indicator it should actually be two sys calls.
> 
> Would something like the below be any better?
> 
> It would allow us to later add something like PERF_TARGET_SOCKET and
> things like that.

I don't think it helps on the ugliness side. You basically make the
two arguments a union, but instead of adding another flag and directly
passing a union, you also add interface complexity.

A strong indication for the complexity is that you got it wrong ;-) :

> +struct perf_counter_target {
> +	__u32			id;
> +	__u64			val;
> +};

This structure is not compatible between 32 and 64 bit user space on x86,
because everything except i386 adds implicit padding between id and val.

Other than that, making it extensible sounds reasonable. How about just
using a '__u64 *target' and a bit in the 'flags' argument?

	Arnd <><

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [perfmon2] I.1 - System calls - ioctl
  2009-07-13 17:30       ` [perfmon2] " Arnd Bergmann
@ 2009-07-13 17:34         ` Peter Zijlstra
  2009-07-13 17:53           ` Arnd Bergmann
  0 siblings, 1 reply; 68+ messages in thread
From: Peter Zijlstra @ 2009-07-13 17:34 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: perfmon2-devel, Christoph Hellwig, Andrew Morton, eranian,
	Philip Mucci, LKML, Andi Kleen, Paul Mackerras, Maynard Johnson,
	Ingo Molnar, Thomas Gleixner

On Mon, 2009-07-13 at 19:30 +0200, Arnd Bergmann wrote:
> On Monday 13 July 2009, Peter Zijlstra wrote:
> > On Mon, 2009-06-22 at 08:58 -0400, Christoph Hellwig wrote:
> > > But talking about syscalls the sys_perf_counter_open prototype is
> > > really ugly - it uses either the pid or cpu argument which is a pretty
> > > clear indicator it should actually be two sys calls.
> > 
> > Would something like the below be any better?
> > 
> > It would allow us to later add something like PERF_TARGET_SOCKET and
> > things like that.
> 
> I don't think it helps on the ugliness side. You basically make the
> two arguments a union, but instead of adding another flag and directly
> passing a union, you also add interface complexity.
> 
> A strong indication for the complexity is that you got it wrong ;-) :
> 
> > +struct perf_counter_target {
> > +	__u32			id;
> > +	__u64			val;
> > +};
> 
> This structure is not compatible between 32 and 64 bit user space on x86,
> because everything except i386 adds implicit padding between id and val.

Humm, __u64 doesn't have natural alignment? That would break more than
just this I think -- it sure surprises me.

> Other than that, making it extensible sounds reasonable. How about just
> using a '__u64 *target' and a bit in the 'flags' argument?

Would there still be a point in having it a pointer in that case?, but
yeah, that might work too?


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [perfmon2] I.1 - System calls - ioctl
  2009-07-13 17:34         ` Peter Zijlstra
@ 2009-07-13 17:53           ` Arnd Bergmann
  0 siblings, 0 replies; 68+ messages in thread
From: Arnd Bergmann @ 2009-07-13 17:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: perfmon2-devel, Christoph Hellwig, Andrew Morton, eranian,
	Philip Mucci, LKML, Andi Kleen, Paul Mackerras, Maynard Johnson,
	Ingo Molnar, Thomas Gleixner

On Monday 13 July 2009, Peter Zijlstra wrote:
> On Mon, 2009-07-13 at 19:30 +0200, Arnd Bergmann wrote:
> > > +struct perf_counter_target {
> > > +	__u32			id;
> > > +	__u64			val;
> > > +};
> > 
> > This structure is not compatible between 32 and 64 bit user space on x86,
> > because everything except i386 adds implicit padding between id and val.
> 
> Humm, __u64 doesn't have natural alignment? That would break more than
> just this I think -- it sure surprises me.

Yes, nobody expects this, so it is a frequent source of bugs in the ABI.
Look for compat_u64 and __packed in the definition of compat ioctl and
syscall interfaces for how we had to work around this elsewhere.

> > Other than that, making it extensible sounds reasonable. How about just
> > using a '__u64 *target' and a bit in the 'flags' argument?
> 
> Would there still be a point in having it a pointer in that case?, but
> yeah, that might work too?

passing u64 bit arguments directly to system calls is a bit complicated,
because some 32 bit architectures can only pass them in certain
register pairs, see the trouble we go through for llseek, sync_file_range
or preadv.

If you can directly pass an 'unsigned long' instead, that would work fine
though.

	Arnd <><

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: I.1 - System calls - ioctl
  2009-07-13 10:53     ` Peter Zijlstra
  2009-07-13 17:30       ` [perfmon2] " Arnd Bergmann
@ 2009-07-14 13:51       ` Christoph Hellwig
  2009-07-30 13:58       ` stephane eranian
  2 siblings, 0 replies; 68+ messages in thread
From: Christoph Hellwig @ 2009-07-14 13:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Christoph Hellwig, Ingo Molnar, eranian, LKML, Andrew Morton,
	Thomas Gleixner, Robert Richter, Paul Mackerras, Andi Kleen,
	Maynard Johnson, Carl Love, Corey J Ashford, Philip Mucci,
	Dan Terpstra, perfmon2-devel

On Mon, Jul 13, 2009 at 12:53:13PM +0200, Peter Zijlstra wrote:
> On Mon, 2009-06-22 at 08:58 -0400, Christoph Hellwig wrote:
> > But talking about syscalls the sys_perf_counter_open prototype is
> > really ugly - it uses either the pid or cpu argument which is a pretty
> > clear indicator it should actually be two sys calls.
> 
> Would something like the below be any better?
> 
> It would allow us to later add something like PERF_TARGET_SOCKET and
> things like that.
> 
> (utterly untested)

And makes the code a couple of magnitudes more ugly..  If you really
add a completely new targer adding a new system call wouldn't be a big
issue.  Altough a socket seems like a pretty logical flag extension to
the cpu syscall (or sub-syscall currently), as it would still be
specified as trace the whole cpu socket for this given cpuid.


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: I.1 - System calls - ioctl
  2009-07-13 10:53     ` Peter Zijlstra
  2009-07-13 17:30       ` [perfmon2] " Arnd Bergmann
  2009-07-14 13:51       ` Christoph Hellwig
@ 2009-07-30 13:58       ` stephane eranian
  2009-07-30 14:13         ` Peter Zijlstra
  2 siblings, 1 reply; 68+ messages in thread
From: stephane eranian @ 2009-07-30 13:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Christoph Hellwig, Ingo Molnar, LKML, Andrew Morton,
	Thomas Gleixner, Robert Richter, Paul Mackerras, Andi Kleen,
	Maynard Johnson, Carl Love, Corey J Ashford, Philip Mucci,
	Dan Terpstra, perfmon2-devel

On Mon, Jul 13, 2009 at 12:53 PM, Peter Zijlstra<a.p.zijlstra@chello.nl> wrote:
> On Mon, 2009-06-22 at 08:58 -0400, Christoph Hellwig wrote:
>> But talking about syscalls the sys_perf_counter_open prototype is
>> really ugly - it uses either the pid or cpu argument which is a pretty
>> clear indicator it should actually be two sys calls.
>
> Would something like the below be any better?
>
> It would allow us to later add something like PERF_TARGET_SOCKET and
> things like that.

One thing for sure, you need to provision for future targets, SOCKET
is an obvious
one. But given that your goal is to have a generic API, not just for
CPU PMU, then you
need to be able to add new targets, e.g., socket, chipset, GPU. The
current model with
pid and cpu is too limited, relying on flags to add more parameters
for a target is not pretty.

Given that an event can only be attached to a single target at a time, it seems
you could either do:
   - encapsulate target type + target into a struct and pass that.
This is your proposal here.
   - add a generic int target parameter and pass the target type in flags
   - provide one syscall per target type (seems to be Hellwig's)

Your scheme makes it possible to pass 64-bit target id. Not clear if
this is really needed.
It adds yet another data structure to your API.

>
> (utterly untested)
>
> ---
>  include/linux/perf_counter.h |   12 ++++++++++++
>  include/linux/syscalls.h     |    3 ++-
>  kernel/perf_counter.c        |   24 +++++++++++++++++++++++-
>  tools/perf/builtin-record.c  |   11 ++++++++++-
>  tools/perf/builtin-stat.c    |   12 ++++++++++--
>  tools/perf/builtin-top.c     |   11 ++++++++++-
>  tools/perf/perf.h            |    7 +++----
>  7 files changed, 70 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/perf_counter.h b/include/linux/perf_counter.h
> index 5e970c7..f7dd22e 100644
> --- a/include/linux/perf_counter.h
> +++ b/include/linux/perf_counter.h
> @@ -189,6 +189,18 @@ struct perf_counter_attr {
>        __u64                   __reserved_3;
>  };
>
> +enum perf_counter_target_ids {
> +       PERF_TARGET_PID         = 0,
> +       PERF_TARGET_CPU         = 1,
> +
> +       PERF_TARGET_MAX         /* non-ABI */
> +};
> +
> +struct perf_counter_target {
> +       __u32                   id;
> +       __u64                   val;
> +};
> +
>  /*
>  * Ioctls that can be done on a perf counter fd:
>  */
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 80de700..670edad 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -760,5 +760,6 @@ int kernel_execve(const char *filename, char *const argv[], char *const envp[]);
>
>  asmlinkage long sys_perf_counter_open(
>                struct perf_counter_attr __user *attr_uptr,
> -               pid_t pid, int cpu, int group_fd, unsigned long flags);
> +               struct perf_counter_target __user *target,
> +               int group_fd, unsigned long flags);
>  #endif
> diff --git a/kernel/perf_counter.c b/kernel/perf_counter.c
> index fa20a9d..205f0b6 100644
> --- a/kernel/perf_counter.c
> +++ b/kernel/perf_counter.c
> @@ -3969,15 +3969,18 @@ err_size:
>  */
>  SYSCALL_DEFINE5(perf_counter_open,
>                struct perf_counter_attr __user *, attr_uptr,
> -               pid_t, pid, int, cpu, int, group_fd, unsigned long, flags)
> +               struct perf_counter_target __user *, target_uptr,
> +               int, group_fd, unsigned long, flags)
>  {
>        struct perf_counter *counter, *group_leader;
>        struct perf_counter_attr attr;
> +       struct perf_counter_target target;
>        struct perf_counter_context *ctx;
>        struct file *counter_file = NULL;
>        struct file *group_file = NULL;
>        int fput_needed = 0;
>        int fput_needed2 = 0;
> +       int pid, cpu;
>        int ret;
>
>        /* for future expandability... */
> @@ -3998,6 +4001,25 @@ SYSCALL_DEFINE5(perf_counter_open,
>                        return -EINVAL;
>        }
>
> +       ret = copy_from_user(&target, target_uptr, sizeof(target));
> +       if (ret)
> +               return ret;
> +
> +       switch (target.id) {
> +       case PERF_TARGET_PID:
> +               pid = target.val;
> +               cpu = -1;
> +               break;
> +
> +       case PERF_TARGET_CPU:
> +               cpu = target.val;
> +               pid = -1;
> +               break;
> +
> +       default:
> +               return -EINVAL;
> +       }
> +
>        /*
>         * Get the target context (task or percpu):
>         */
> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> index 4ef78a5..b172d30 100644
> --- a/tools/perf/builtin-record.c
> +++ b/tools/perf/builtin-record.c
> @@ -374,6 +374,7 @@ static struct perf_header_attr *get_header_attr(struct perf_counter_attr *a, int
>  static void create_counter(int counter, int cpu, pid_t pid)
>  {
>        struct perf_counter_attr *attr = attrs + counter;
> +       struct perf_counter_target target;
>        struct perf_header_attr *h_attr;
>        int track = !counter; /* only the first counter needs these */
>        struct {
> @@ -409,8 +410,16 @@ static void create_counter(int counter, int cpu, pid_t pid)
>        attr->inherit           = (cpu < 0) && inherit;
>        attr->disabled          = 1;
>
> +       if (cpu < 0) {
> +               target.id = PERF_TARGET_PID;
> +               target.val = pid;
> +       } else {
> +               target.id = PERF_TARGET_CPU;
> +               target.val = cpu;
> +       }
> +
>  try_again:
> -       fd[nr_cpu][counter] = sys_perf_counter_open(attr, pid, cpu, group_fd, 0);
> +       fd[nr_cpu][counter] = sys_perf_counter_open(attr, &target, group_fd, 0);
>
>        if (fd[nr_cpu][counter] < 0) {
>                int err = errno;
> diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
> index 27921a8..342e642 100644
> --- a/tools/perf/builtin-stat.c
> +++ b/tools/perf/builtin-stat.c
> @@ -106,6 +106,7 @@ static u64                  runtime_cycles_noise;
>  static void create_perf_stat_counter(int counter, int pid)
>  {
>        struct perf_counter_attr *attr = attrs + counter;
> +       struct perf_counter_target target;
>
>        if (scale)
>                attr->read_format = PERF_FORMAT_TOTAL_TIME_ENABLED |
> @@ -114,8 +115,12 @@ static void create_perf_stat_counter(int counter, int pid)
>        if (system_wide) {
>                unsigned int cpu;
>
> +               target.id = PERF_TARGET_CPU;
> +
>                for (cpu = 0; cpu < nr_cpus; cpu++) {
> -                       fd[cpu][counter] = sys_perf_counter_open(attr, -1, cpu, -1, 0);
> +                       target.val = cpu;
> +
> +                       fd[cpu][counter] = sys_perf_counter_open(attr, &target, -1, 0);
>                        if (fd[cpu][counter] < 0 && verbose)
>                                fprintf(stderr, ERR_PERF_OPEN, counter,
>                                        fd[cpu][counter], strerror(errno));
> @@ -125,7 +130,10 @@ static void create_perf_stat_counter(int counter, int pid)
>                attr->disabled       = 1;
>                attr->enable_on_exec = 1;
>
> -               fd[0][counter] = sys_perf_counter_open(attr, pid, -1, -1, 0);
> +               target.id = PERF_TARGET_PID;
> +               target.val = pid;
> +
> +               fd[0][counter] = sys_perf_counter_open(attr, &target, -1, 0);
>                if (fd[0][counter] < 0 && verbose)
>                        fprintf(stderr, ERR_PERF_OPEN, counter,
>                                fd[0][counter], strerror(errno));
> diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
> index 95d5c0a..facc870 100644
> --- a/tools/perf/builtin-top.c
> +++ b/tools/perf/builtin-top.c
> @@ -548,6 +548,7 @@ int group_fd;
>
>  static void start_counter(int i, int counter)
>  {
> +       struct perf_counter_target target;
>        struct perf_counter_attr *attr;
>        unsigned int cpu;
>
> @@ -560,8 +561,16 @@ static void start_counter(int i, int counter)
>        attr->sample_type       = PERF_SAMPLE_IP | PERF_SAMPLE_TID;
>        attr->freq              = freq;
>
> +       if (cpu < 0) {
> +               target.id = PERF_TARGET_PID;
> +               target.val = target_pid;
> +       } else {
> +               target.id = PERF_TARGET_CPU;
> +               target.val = cpu;
> +       }
> +
>  try_again:
> -       fd[i][counter] = sys_perf_counter_open(attr, target_pid, cpu, group_fd, 0);
> +       fd[i][counter] = sys_perf_counter_open(attr, &target, group_fd, 0);
>
>        if (fd[i][counter] < 0) {
>                int err = errno;
> diff --git a/tools/perf/perf.h b/tools/perf/perf.h
> index e5148e2..3d663d7 100644
> --- a/tools/perf/perf.h
> +++ b/tools/perf/perf.h
> @@ -85,12 +85,11 @@ static inline unsigned long long rdclock(void)
>
>  static inline int
>  sys_perf_counter_open(struct perf_counter_attr *attr,
> -                     pid_t pid, int cpu, int group_fd,
> -                     unsigned long flags)
> +                     struct perf_counter_target *target,
> +                     int group_fd, unsigned long flags)
>  {
>        attr->size = sizeof(*attr);
> -       return syscall(__NR_perf_counter_open, attr, pid, cpu,
> -                      group_fd, flags);
> +       return syscall(__NR_perf_counter_open, attr, target, group_fd, flags);
>  }
>
>  #define MAX_COUNTERS                   256
>
>
>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: I.1 - System calls - ioctl
  2009-07-30 13:58       ` stephane eranian
@ 2009-07-30 14:13         ` Peter Zijlstra
  2009-07-30 16:17           ` stephane eranian
  0 siblings, 1 reply; 68+ messages in thread
From: Peter Zijlstra @ 2009-07-30 14:13 UTC (permalink / raw)
  To: eranian
  Cc: Christoph Hellwig, Ingo Molnar, LKML, Andrew Morton,
	Thomas Gleixner, Robert Richter, Paul Mackerras, Andi Kleen,
	Maynard Johnson, Carl Love, Corey J Ashford, Philip Mucci,
	Dan Terpstra, perfmon2-devel

On Thu, 2009-07-30 at 15:58 +0200, stephane eranian wrote:
> On Mon, Jul 13, 2009 at 12:53 PM, Peter Zijlstra<a.p.zijlstra@chello.nl> wrote:
> > On Mon, 2009-06-22 at 08:58 -0400, Christoph Hellwig wrote:
> >> But talking about syscalls the sys_perf_counter_open prototype is
> >> really ugly - it uses either the pid or cpu argument which is a pretty
> >> clear indicator it should actually be two sys calls.
> >
> > Would something like the below be any better?
> >
> > It would allow us to later add something like PERF_TARGET_SOCKET and
> > things like that.
> 
> One thing for sure, you need to provision for future targets, SOCKET
> is an obvious
> one. But given that your goal is to have a generic API, not just for
> CPU PMU, then you
> need to be able to add new targets, e.g., socket, chipset, GPU. The
> current model with
> pid and cpu is too limited, relying on flags to add more parameters
> for a target is not pretty.
> 
> Given that an event can only be attached to a single target at a time, it seems
> you could either do:
>    - encapsulate target type + target into a struct and pass that.
> This is your proposal here.

*nod*

>    - add a generic int target parameter and pass the target type in flags

This would mean reserving a number of bits in the flags field for this
target id. We can do that, I figure 8 should do.. (640kb should be
enough comes to mind though :-).

>    - provide one syscall per target type (seems to be Hellwig's)
> 
> Your scheme makes it possible to pass 64-bit target id. Not clear if
> this is really needed.

Yeah, not sure either, pids, fds and similar are all 32bit iirc. We do
have 64bit resource ids but that's for things like inodes, and I'm not
sure it'd make sense to attach a counter to an inode :-)



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: I.1 - System calls - ioctl
  2009-07-30 14:13         ` Peter Zijlstra
@ 2009-07-30 16:17           ` stephane eranian
  2009-07-30 16:40             ` Arnd Bergmann
  0 siblings, 1 reply; 68+ messages in thread
From: stephane eranian @ 2009-07-30 16:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Christoph Hellwig, Ingo Molnar, LKML, Andrew Morton,
	Thomas Gleixner, Robert Richter, Paul Mackerras, Andi Kleen,
	Maynard Johnson, Carl Love, Corey J Ashford, Philip Mucci,
	Dan Terpstra, perfmon2-devel

On Thu, Jul 30, 2009 at 4:13 PM, Peter Zijlstra<a.p.zijlstra@chello.nl> wrote:
> On Thu, 2009-07-30 at 15:58 +0200, stephane eranian wrote:
>> On Mon, Jul 13, 2009 at 12:53 PM, Peter Zijlstra<a.p.zijlstra@chello.nl> wrote:
>> > On Mon, 2009-06-22 at 08:58 -0400, Christoph Hellwig wrote:
>> >> But talking about syscalls the sys_perf_counter_open prototype is
>> >> really ugly - it uses either the pid or cpu argument which is a pretty
>> >> clear indicator it should actually be two sys calls.
>> >
>> > Would something like the below be any better?
>> >
>> > It would allow us to later add something like PERF_TARGET_SOCKET and
>> > things like that.
>>
>> One thing for sure, you need to provision for future targets, SOCKET
>> is an obvious
>> one. But given that your goal is to have a generic API, not just for
>> CPU PMU, then you
>> need to be able to add new targets, e.g., socket, chipset, GPU. The
>> current model with
>> pid and cpu is too limited, relying on flags to add more parameters
>> for a target is not pretty.
>>
>> Given that an event can only be attached to a single target at a time, it seems
>> you could either do:
>>    - encapsulate target type + target into a struct and pass that.
>> This is your proposal here.
>
> *nod*
>
>>    - add a generic int target parameter and pass the target type in flags
>
> This would mean reserving a number of bits in the flags field for this
> target id. We can do that, I figure 8 should do.. (640kb should be
> enough comes to mind though :-).
>
I meant:

long sys_perf_counter_open(
       struct perf_counter_attr *attr,
       int target,
       int group_fd,
       unsigned long flags);

If you reserve 8 bits in flags, that gives you 256 types of targets,
given that you can only measure an event on one target.

The target "name" would be passed in target.

per-thread: (self-monitoring)
          sys_perf_counter_open(&attr, getpid(), 0, 0);

cpu-wide: (monitored CPU1)
         sys_perf_counter_open(&attr, 1, 0, PERF_TARGET_CPU);

socket-wide: (socket 2)
         sys_perf_counter_open(&attr, 2, 0, PERF_TARGET_SOCKET);

I also suspect you can get by with fewer than 8 bits for the type of target.

Because in this scheme, part of flags would not be treated as bitmask
but rather bitfield, it may be better to just separate this target bitfield
like in:

long sys_perf_counter_open(
       struct perf_counter_attr *attr,
       enum perf_target_type  target_type,
       int target_id,
       int group_fd,
       unsigned long flags);

Which is what you had, except without the struct.

Then, it boils down to whether expressing a target id in 32 bits is enough.
Obviously, 64-bit would be the safest but then I understand this causes issues
on 32-bit systems.


>>    - provide one syscall per target type (seems to be Hellwig's)
>>
>> Your scheme makes it possible to pass 64-bit target id. Not clear if
>> this is really needed.
>
> Yeah, not sure either, pids, fds and similar are all 32bit iirc. We do
> have 64bit resource ids but that's for things like inodes, and I'm not
> sure it'd make sense to attach a counter to an inode :-)
>
>
>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: I.1 - System calls - ioctl
  2009-07-30 16:17           ` stephane eranian
@ 2009-07-30 16:40             ` Arnd Bergmann
  2009-07-30 16:53               ` stephane eranian
  0 siblings, 1 reply; 68+ messages in thread
From: Arnd Bergmann @ 2009-07-30 16:40 UTC (permalink / raw)
  To: eranian
  Cc: Peter Zijlstra, Christoph Hellwig, Ingo Molnar, LKML,
	Andrew Morton, Thomas Gleixner, Robert Richter, Paul Mackerras,
	Andi Kleen, Maynard Johnson, Carl Love, Corey J Ashford,
	Philip Mucci, Dan Terpstra, perfmon2-devel

On Thursday 30 July 2009, stephane eranian wrote:
> long sys_perf_counter_open(
>        struct perf_counter_attr *attr,
>        enum perf_target_type  target_type,
>        int target_id,
>        int group_fd,
>        unsigned long flags);
> 
> Which is what you had, except without the struct.
> 
> Then, it boils down to whether expressing a target id in 32 bits is enough.
> Obviously, 64-bit would be the safest but then I understand this causes issues
> on 32-bit systems.

Just make it an unsigned long then, that still covers all cases
where you only need the 64-bit type on 64-bit systems.

	Arnd <><

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: I.1 - System calls - ioctl
  2009-07-30 16:40             ` Arnd Bergmann
@ 2009-07-30 16:53               ` stephane eranian
  2009-07-30 17:20                 ` Arnd Bergmann
  0 siblings, 1 reply; 68+ messages in thread
From: stephane eranian @ 2009-07-30 16:53 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Peter Zijlstra, Christoph Hellwig, Ingo Molnar, LKML,
	Andrew Morton, Thomas Gleixner, Robert Richter, Paul Mackerras,
	Andi Kleen, Maynard Johnson, Carl Love, Corey J Ashford,
	Philip Mucci, Dan Terpstra, perfmon2-devel

On Thu, Jul 30, 2009 at 6:40 PM, Arnd Bergmann<arnd@arndb.de> wrote:
> On Thursday 30 July 2009, stephane eranian wrote:
>> long sys_perf_counter_open(
>>        struct perf_counter_attr *attr,
>>        enum perf_target_type  target_type,
>>        int target_id,
>>        int group_fd,
>>        unsigned long flags);
>>
>> Which is what you had, except without the struct.
>>
>> Then, it boils down to whether expressing a target id in 32 bits is enough.
>> Obviously, 64-bit would be the safest but then I understand this causes issues
>> on 32-bit systems.
>
> Just make it an unsigned long then, that still covers all cases
> where you only need the 64-bit type on 64-bit systems.
>
But that won't always work in the case of a 32-bit monitoring tool
running on top of
a 64-bit OS. Imagine the target id is indeed 64-bit, e.g., inode
number (as suggested
by Peter). It's not because you are a 32-bit tool than you cannot name
a monitoring
resource in a 64-bit OS.



>        Arnd <><
>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: I.1 - System calls - ioctl
  2009-07-30 16:53               ` stephane eranian
@ 2009-07-30 17:20                 ` Arnd Bergmann
  2009-08-03 14:22                   ` Peter Zijlstra
  0 siblings, 1 reply; 68+ messages in thread
From: Arnd Bergmann @ 2009-07-30 17:20 UTC (permalink / raw)
  To: eranian
  Cc: Peter Zijlstra, Christoph Hellwig, Ingo Molnar, LKML,
	Andrew Morton, Thomas Gleixner, Robert Richter, Paul Mackerras,
	Andi Kleen, Maynard Johnson, Carl Love, Corey J Ashford,
	Philip Mucci, Dan Terpstra, perfmon2-devel

On Thursday 30 July 2009, stephane eranian wrote:
> But that won't always work in the case of a 32-bit monitoring tool
> running on top of
> a 64-bit OS. Imagine the target id is indeed 64-bit, e.g., inode
> number (as suggested
> by Peter). It's not because you are a 32-bit tool than you cannot name
> a monitoring
> resource in a 64-bit OS.

Right, there are obviously things that you cannot address with 
a 'long', but there are potentially other things that you could
that you cannot address with an 'int', e.g. an opaque user
token (representing a user pointer) that you can get back in
the sample data.

In the worst case, you could still redefine the argument as a
transparent union to a long and pointer in the future if you
use a 'long' now. AFAICT, there are no advantages of using
an 'int' instead of a 'long', but there are disadvantages of
using a 'long long'.

	Arnd <><

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: I.1 - System calls - ioctl
  2009-07-30 17:20                 ` Arnd Bergmann
@ 2009-08-03 14:22                   ` Peter Zijlstra
  0 siblings, 0 replies; 68+ messages in thread
From: Peter Zijlstra @ 2009-08-03 14:22 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: eranian, Christoph Hellwig, Ingo Molnar, LKML, Andrew Morton,
	Thomas Gleixner, Robert Richter, Paul Mackerras, Andi Kleen,
	Maynard Johnson, Carl Love, Corey J Ashford, Philip Mucci,
	Dan Terpstra, perfmon2-devel

On Thu, 2009-07-30 at 19:20 +0200, Arnd Bergmann wrote:
> On Thursday 30 July 2009, stephane eranian wrote:
> > But that won't always work in the case of a 32-bit monitoring tool
> > running on top of
> > a 64-bit OS. Imagine the target id is indeed 64-bit, e.g., inode
> > number (as suggested
> > by Peter). It's not because you are a 32-bit tool than you cannot name
> > a monitoring
> > resource in a 64-bit OS.
> 
> Right, there are obviously things that you cannot address with 
> a 'long', but there are potentially other things that you could
> that you cannot address with an 'int', e.g. an opaque user
> token (representing a user pointer) that you can get back in
> the sample data.
> 
> In the worst case, you could still redefine the argument as a
> transparent union to a long and pointer in the future if you
> use a 'long' now. AFAICT, there are no advantages of using
> an 'int' instead of a 'long', but there are disadvantages of
> using a 'long long'.

OK, so time is running out on this thing. Ingo, Paulus what would you
prefer?


^ permalink raw reply	[flat|nested] 68+ messages in thread

end of thread, other threads:[~2009-08-03 14:22 UTC | newest]

Thread overview: 68+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-06-16 17:42 v2 of comments on Performance Counters for Linux (PCL) stephane eranian
2009-06-22 11:48 ` Ingo Molnar
2009-06-22 11:49 ` I.1 - System calls - ioctl Ingo Molnar
2009-06-22 12:58   ` Christoph Hellwig
2009-06-22 13:56     ` Ingo Molnar
2009-06-22 17:41       ` Arnd Bergmann
2009-07-13 10:53     ` Peter Zijlstra
2009-07-13 17:30       ` [perfmon2] " Arnd Bergmann
2009-07-13 17:34         ` Peter Zijlstra
2009-07-13 17:53           ` Arnd Bergmann
2009-07-14 13:51       ` Christoph Hellwig
2009-07-30 13:58       ` stephane eranian
2009-07-30 14:13         ` Peter Zijlstra
2009-07-30 16:17           ` stephane eranian
2009-07-30 16:40             ` Arnd Bergmann
2009-07-30 16:53               ` stephane eranian
2009-07-30 17:20                 ` Arnd Bergmann
2009-08-03 14:22                   ` Peter Zijlstra
2009-06-22 11:50 ` I.2 - Grouping Ingo Molnar
2009-06-22 19:45   ` stephane eranian
2009-06-22 22:04     ` Corey Ashford
2009-06-23 17:51       ` stephane eranian
2009-06-22 21:38   ` Corey Ashford
2009-06-23  5:16   ` Paul Mackerras
2009-06-23  7:36     ` stephane eranian
2009-06-23  8:26       ` Paul Mackerras
2009-06-23  8:30         ` stephane eranian
2009-06-23 16:24           ` Corey Ashford
2009-06-22 11:51 ` I.3 - Multiplexing and system-wide Ingo Molnar
2009-06-22 11:51 ` I.4 - Controlling group multiplexing Ingo Molnar
2009-06-22 11:52 ` I.5 - Mmaped count Ingo Molnar
2009-06-22 12:25   ` stephane eranian
2009-06-22 12:35     ` Peter Zijlstra
2009-06-22 12:54       ` stephane eranian
2009-06-22 14:39         ` Peter Zijlstra
2009-06-23  0:41         ` Paul Mackerras
2009-06-23  0:39       ` Paul Mackerras
2009-06-23  6:13         ` Peter Zijlstra
2009-06-23  7:40         ` stephane eranian
2009-06-23  0:33     ` Paul Mackerras
2009-06-22 11:53 ` I.6 - Group scheduling Ingo Molnar
2009-06-22 11:54 ` I.7 - Group validity checking Ingo Molnar
2009-06-22 11:54 ` I.8 - Generalized cache events Ingo Molnar
2009-06-22 11:55 ` I.9 - Group reading Ingo Molnar
2009-06-22 11:55 ` I.10 - Event buffer minimal useful size Ingo Molnar
2009-06-22 11:56 ` I.11 - Missing definitions for generic events Ingo Molnar
2009-06-22 14:54   ` stephane eranian
2009-06-22 11:57 ` II.1 - Fixed counters on Intel Ingo Molnar
2009-06-22 14:27   ` stephane eranian
2009-06-22 11:57 ` II.2 - Event knowledge missing Ingo Molnar
2009-06-23 13:18   ` stephane eranian
2009-06-22 11:58 ` III.1 - Sampling period randomization Ingo Molnar
2009-06-22 11:58 ` IV.1 - Support for model-specific uncore PMU Ingo Molnar
2009-06-22 11:59 ` IV.2 - Features impacting all counters Ingo Molnar
2009-06-22 12:00 ` IV.3 - AMD IBS Ingo Molnar
2009-06-22 14:08   ` [perfmon2] " Rob Fowler
2009-06-22 17:58     ` Maynard Johnson
2009-06-23  6:19     ` Peter Zijlstra
2009-06-23  8:19       ` stephane eranian
2009-06-23 14:05         ` Ingo Molnar
2009-06-23 14:25           ` stephane eranian
2009-06-23 14:55             ` Ingo Molnar
2009-06-23 14:40       ` Rob Fowler
2009-06-22 19:17   ` stephane eranian
2009-06-22 12:00 ` IV.4 - Intel PEBS Ingo Molnar
2009-06-22 12:16   ` Andi Kleen
2009-06-22 12:01 ` IV.5 - Intel Last Branch Record (LBR) Ingo Molnar
2009-06-22 20:02   ` stephane eranian

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.