* [patch] Performance Counters for Linux, v3
@ 2008-12-11 15:52 Ingo Molnar
From: Ingo Molnar @ 2008-12-11 15:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Andrew Morton, Stephane Eranian, Eric Dumazet,
	Robert Richter, Arjan van de Veen, Peter Anvin, Peter Zijlstra,
	Paul Mackerras, David S. Miller


This is v3 of our performance counters subsystem implementation. It can 
be accessed at:

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git perfcounters/core

(or via http://people.redhat.com/mingo/tip.git/README )

We've made a number of bigger enhancements in the -v3 release:

 - The introduction of new "software" performance counters:
   PERF_COUNT_CPU_CLOCK and PERF_COUNT_TASK_CLOCK. (With page-fault,
   context-switch and block-read event counters planned as well.)

   These sw-counters, besides being useful to applications and being nice
   generalizations of the performance counter concept, are also helpful 
   in porting performance counters to new architectures: the software 
   counters will work fine without any PMU. Applications can thus 
   standardize on the availability of _some_ performance counters on all 
   Linux systems all the time, regardless of current PMU support status.

 - The introduction of counter groups: counters can now be grouped up
   when created. Such counter groups are scheduled atomically and can 
   have their events taken with precise (and atomic) multi-dimensional 
   timestamps as well.

   The counter groups are a natural extension of the existing single
   counters; they still act as individual counters as well.

   [ It's a bit like task or tty groups - loosely coupled counters with a
     strong self-identity. The grouping can be arbitrary - there can be
     multiple counter groups per task - mixed with single counters as
     well. The concept works for CPU/systemwide counters as well. ]

 - The addition of a low-level counter hw driver framework that allows
   asymmetric counter implementations. The sw counters now use this
   facility.

 - The syscall API has been streamlined significantly - see further below
   for details. The event type has been widened to 64 bits for powerpc's
   needs, and a few reserved bits have been introduced.

 - The ability to turn all counters of a task on/off via a single system
   call. This is useful to applications that self-profile and/or want to
   do runtime filtering of which functions to profile. (There's also a
   "hw_event.disabled" bit in the API to create counters in a disabled
   state straight away - useful to powerpc for example - this code is not
   fully complete yet. It's the next entry on our TODO list :-)
   A small usage sketch of this on/off facility follows right after this
   list.

 - [ lots of other updates, fixes and cleanups. ]
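
A note on the task-wide on/off call above: it is implemented as a prctl()
interface (see the "add prctl interface to disable/enable counters" entry
in the shortlog and the include/linux/prctl.h hunk below). A rough,
illustrative userspace sketch of how a self-profiling application might
use it - the PR_* values are taken from that hunk, everything else here
is made up for illustration only:

	#include <sys/prctl.h>

	#ifndef PR_TASK_PERF_COUNTERS_DISABLE
	# define PR_TASK_PERF_COUNTERS_DISABLE	31	/* from the prctl.h hunk below */
	# define PR_TASK_PERF_COUNTERS_ENABLE	32
	#endif

	/*
	 * Count only while fn() runs: switch all of the current task's
	 * counters on before the region of interest, and off again after it.
	 */
	static void profile_region(void (*fn)(void))
	{
		prctl(PR_TASK_PERF_COUNTERS_ENABLE, 0, 0, 0, 0);
		fn();
		prctl(PR_TASK_PERF_COUNTERS_DISABLE, 0, 0, 0, 0);
	}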

New KernelTop features:

     http://redhat.com/~mingo/perfcounters/kerneltop.c

 - The ability to count multiple event sources at once and combine
   them into the same histogram. For example, to create a
   cache-misses versus cache-references histogram, just append event ids
   like this:

     $ ./kerneltop -e 3 -c 5000 -e 2

   To get output like this:

------------------------------------------------------------------------------
 KernelTop:    1601 irqs/sec  [NMI, cache-misses/cache-refs],  (all, 16 CPUs)
------------------------------------------------------------------------------

             weight         RIP          kernel function
             ______   ________________   _______________

              85.00 - ffffffff804fc96d : ip_local_deliver
              30.50 - ffffffff804cedfa : skb_copy_and_csum_dev
              27.11 - ffffffff804ceeb7 : skb_push
              27.00 - ffffffff805106a8 : tcp_established_options
              20.35 - ffffffff804e5675 : eth_type_trans
              19.00 - ffffffff8028a4e8 : zone_statistics
              18.40 - ffffffff804d9256 : dst_release
              18.07 - ffffffff804fc1cc : ip_rcv_finish
              16.00 - ffffffff8050022b : __ip_local_out
              15.69 - ffffffff804fc774 : ip_local_deliver_finish
              14.41 - ffffffff804cfc87 : skb_release_head_state
              14.00 - ffffffff804cbdf0 : sock_alloc_send_skb
              10.00 - ffffffff8027d788 : find_get_page
               9.71 - ffffffff8050084f : ip_queue_xmit
               8.00 - ffffffff802217d5 : read_hpet
               6.50 - ffffffff8050d999 : tcp_prune_queue
               3.59 - ffffffff80503209 : __inet_lookup_established
               2.16 - ffffffff802861ec : put_page
               2.00 - ffffffff80222554 : physflat_send_IPI_mask

 - the -g 1 option to put all counters into a counter group. (See the
   example right after this list.)

 - one-shot profiling

 - various other updates.

 - NOTE: pick up the latest version of kerneltop.c if you want to try out
   the v3 kernel side.
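
To illustrate the -g 1 option above: counting cycles and instructions as
one atomically scheduled group could look like this (the exact option
syntax is whatever the latest kerneltop.c accepts - this is only an
illustration):

     $ ./kerneltop -e 0 -e 1 -g 1 -c 100000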

See "kerneltop --help" for all the options:

KernelTop Options (up to 4 event types can be specified):

 -e EID    --event_id=EID     # event type ID                     [default:  0]
                                   0: CPU cycles
                                   1: instructions
                                   2: cache accesses
                                   3: cache misses
                                   4: branch instructions
                                   5: branch prediction misses
                                 < 0: raw CPU events

 -c CNT    --count=CNT        # event period to sample

 -C CPU    --cpu=CPU          # CPU (-1 for all)                  [default: -1]
 -p PID    --pid=PID          # PID of sampled task (-1 for all)  [default: -1]

 -d delay  --delay=<seconds>  # sampling/display delay            [default:  2]
 -x path   --vmlinux=<path>   # the vmlinux binary, for -s use:
 -s symbol --symbol=<symbol>  # function to be showed annotated one-shot

The new syscall API looks like this: there is a single system call that
creates counters, and normal VFS operations are used after that to operate
on the counters. The API details:

/*
 * Generalized performance counter event types, used by the hw_event.type
 * parameter of the sys_perf_counter_open() syscall:
 */
enum hw_event_types {
	/*
	 * Common hardware events, generalized by the kernel:
	 */
	PERF_COUNT_CYCLES		=  0,
	PERF_COUNT_INSTRUCTIONS		=  1,
	PERF_COUNT_CACHE_REFERENCES	=  2,
	PERF_COUNT_CACHE_MISSES		=  3,
	PERF_COUNT_BRANCH_INSTRUCTIONS	=  4,
	PERF_COUNT_BRANCH_MISSES	=  5,

	/*
	 * Special "software" counters provided by the kernel, even if
	 * the hardware does not support performance counters. These
	 * counters measure various physical and sw events of the
	 * kernel (and allow the profiling of them as well):
	 */
	PERF_COUNT_CPU_CLOCK		= -1,
	PERF_COUNT_TASK_CLOCK		= -2,
	/*
	 * Future software events:
	 */
	/* PERF_COUNT_PAGE_FAULTS	= -3,
	   PERF_COUNT_CONTEXT_SWITCHES	= -4, */
};

/*
 * IRQ-notification data record type:
 */
enum perf_counter_record_type {
	PERF_RECORD_SIMPLE		=  0,
	PERF_RECORD_IRQ			=  1,
	PERF_RECORD_GROUP		=  2,
};

/*
 * Hardware event to monitor via a performance monitoring counter:
 */
struct perf_counter_hw_event {
	s64			type;

	u64			irq_period;
	u32			record_type;

	u32			disabled     :  1, /* off by default */
				nmi	     :  1, /* NMI sampling   */
				raw	     :  1, /* raw event type */
				__reserved_1 : 29;

	u64			__reserved_2;
};

  asmlinkage int
  sys_perf_counter_open(struct perf_counter_hw_event *hw_event_uptr __user,
                        pid_t pid, int cpu, int group_fd);
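
For illustration, here is a minimal userspace sketch (not part of the
patch) that opens and reads a single cycle counter via the new syscall.
Glibc has no wrapper yet, so a raw syscall() is used; the syscall number
and the structure layout are simply copied from the patch (295 is the
x86-64 number):

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Userspace mirror of the ABI structure above: */
struct perf_counter_hw_event {
	int64_t		type;
	uint64_t	irq_period;
	uint32_t	record_type;
	uint32_t	disabled     :  1,
			nmi	     :  1,
			raw	     :  1,
			__reserved_1 : 29;
	uint64_t	__reserved_2;
};

#define PERF_COUNT_CYCLES		  0
#define __NR_perf_counter_open		295	/* x86-64 value, per the patch */

int main(void)
{
	struct perf_counter_hw_event hw_event;
	uint64_t count;
	int fd;

	memset(&hw_event, 0, sizeof(hw_event));
	hw_event.type = PERF_COUNT_CYCLES;	/* generalized hw event */
	/* irq_period 0: non-blocking, record_type 0: PERF_RECORD_SIMPLE */

	/*
	 * pid 0: attach to the current task; cpu -1: follow the task on
	 * all CPUs; group_fd -1: standalone counter (passing an existing
	 * counter fd here would group the new counter with it).
	 */
	fd = syscall(__NR_perf_counter_open, &hw_event, 0, -1, -1);
	if (fd < 0)
		return 1;

	/* ... the workload to be measured runs here ... */

	if (read(fd, &count, sizeof(count)) == sizeof(count))
		printf("cycles: %llu\n", (unsigned long long)count);

	close(fd);
	return 0;
}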


Thanks,

	Ingo, Thomas

------------------>
Ingo Molnar (16):
      performance counters: documentation
      performance counters: x86 support
      x86, perfcounters: read out MSR_CORE_PERF_GLOBAL_STATUS with counters disabled
      perfcounters: select ANON_INODES
      perfcounters, x86: simplify disable/enable of counters
      perfcounters, x86: clean up debug code
      perfcounters: consolidate global-disable codepaths
      perf counters: restructure the API
      perf counters: add support for group counters
      perf counters: group counter, fixes
      perf counters: hw driver API
      perf counters: implement PERF_COUNT_CPU_CLOCK
      perf counters: consolidate hw_perf save/restore APIs
      perf counters: implement PERF_COUNT_TASK_CLOCK
      perf counters: add prctl interface to disable/enable counters
      perf counters: clean up state transitions

Thomas Gleixner (4):
      performance counters: core code
      perf counters: protect them against CSTATE transitions
      perf counters: clean up 'raw' type API
      perf counters: expand use of counter->event


 Documentation/perf-counters.txt                |  104 ++
 arch/x86/Kconfig                               |    1 +
 arch/x86/ia32/ia32entry.S                      |    3 +-
 arch/x86/include/asm/hardirq_32.h              |    1 +
 arch/x86/include/asm/hw_irq.h                  |    2 +
 arch/x86/include/asm/intel_arch_perfmon.h      |   34 +-
 arch/x86/include/asm/irq_vectors.h             |    5 +
 arch/x86/include/asm/mach-default/entry_arch.h |    5 +
 arch/x86/include/asm/pda.h                     |    1 +
 arch/x86/include/asm/thread_info.h             |    4 +-
 arch/x86/include/asm/unistd_32.h               |    1 +
 arch/x86/include/asm/unistd_64.h               |    3 +-
 arch/x86/kernel/apic.c                         |    2 +
 arch/x86/kernel/cpu/Makefile                   |   12 +-
 arch/x86/kernel/cpu/common.c                   |    2 +
 arch/x86/kernel/cpu/perf_counter.c             |  563 +++++++++++
 arch/x86/kernel/entry_64.S                     |    5 +
 arch/x86/kernel/irq.c                          |    5 +
 arch/x86/kernel/irqinit_32.c                   |    3 +
 arch/x86/kernel/irqinit_64.c                   |    5 +
 arch/x86/kernel/signal.c                       |    7 +-
 arch/x86/kernel/syscall_table_32.S             |    1 +
 drivers/acpi/processor_idle.c                  |    8 +
 drivers/char/sysrq.c                           |    2 +
 include/linux/perf_counter.h                   |  244 +++++
 include/linux/prctl.h                          |    3 +
 include/linux/sched.h                          |    9 +
 include/linux/syscalls.h                       |    8 +
 init/Kconfig                                   |   30 +
 kernel/Makefile                                |    1 +
 kernel/fork.c                                  |    1 +
 kernel/perf_counter.c                          | 1266 ++++++++++++++++++++++++
 kernel/sched.c                                 |   24 +
 kernel/sys.c                                   |    7 +
 kernel/sys_ni.c                                |    3 +
 35 files changed, 2354 insertions(+), 21 deletions(-)
 create mode 100644 Documentation/perf-counters.txt
 create mode 100644 arch/x86/kernel/cpu/perf_counter.c
 create mode 100644 include/linux/perf_counter.h
 create mode 100644 kernel/perf_counter.c

diff --git a/Documentation/perf-counters.txt b/Documentation/perf-counters.txt
new file mode 100644
index 0000000..19033a0
--- /dev/null
+++ b/Documentation/perf-counters.txt
@@ -0,0 +1,104 @@
+
+Performance Counters for Linux
+------------------------------
+
+Performance counters are special hardware registers available on most modern
+CPUs. These registers count the number of certain types of hw events: such
+as instructions executed, cachemisses suffered, or branches mis-predicted -
+without slowing down the kernel or applications. These registers can also
+trigger interrupts when a threshold number of events have passed - and can
+thus be used to profile the code that runs on that CPU.
+
+The Linux Performance Counter subsystem provides an abstraction of these
+hardware capabilities. It provides per task and per CPU counters, and
+it provides event capabilities on top of those.
+
+Performance counters are accessed via special file descriptors.
+There's one file descriptor per virtual counter used.
+
+The special file descriptor is opened via the perf_counter_open()
+system call:
+
+ int
+ perf_counter_open(u32 hw_event_type,
+                   u32 hw_event_period,
+                   u32 record_type,
+                   pid_t pid,
+                   int cpu);
+
+The syscall returns the new fd. The fd can be used via the normal
+VFS system calls: read() can be used to read the counter, fcntl()
+can be used to set the blocking mode, etc.
+
+Multiple counters can be kept open at a time, and the counters
+can be poll()ed.
+
+When creating a new counter fd, 'hw_event_type' is one of:
+
+ enum hw_event_types {
+	PERF_COUNT_CYCLES,
+	PERF_COUNT_INSTRUCTIONS,
+	PERF_COUNT_CACHE_REFERENCES,
+	PERF_COUNT_CACHE_MISSES,
+	PERF_COUNT_BRANCH_INSTRUCTIONS,
+	PERF_COUNT_BRANCH_MISSES,
+ };
+
+These are standardized types of events that work uniformly on all CPUs
+that implement Performance Counters support under Linux. If a CPU is
+not able to count branch-misses, then the system call will return
+-EINVAL.
+
+[ Note: more hw_event_types are supported as well, but they are CPU
+  specific and are enumerated via /sys on a per CPU basis. Raw hw event
+  types can be passed in as negative numbers. For example, to count
+  "External bus cycles while bus lock signal asserted" events on Intel
+  Core CPUs, pass in a -0x4064 event type value. ]
+
+The parameter 'hw_event_period' is the number of events before waking up
+a read() that is blocked on a counter fd. Zero value means a non-blocking
+counter.
+
+'record_type' is the type of data that a read() will provide for the
+counter, and it can be one of:
+
+  enum perf_record_type {
+	PERF_RECORD_SIMPLE,
+	PERF_RECORD_IRQ,
+  };
+
+a "simple" counter is one that counts hardware events and allows
+them to be read out into a u64 count value. (read() returns 8 on
+a successful read of a simple counter.)
+
+An "irq" counter is one that will also provide an IRQ context information:
+the IP of the interrupted context. In this case read() will return
+the 8-byte counter value, plus the Instruction Pointer address of the
+interrupted context.
+
+The 'pid' parameter allows the counter to be specific to a task:
+
+ pid == 0: if the pid parameter is zero, the counter is attached to the
+ current task.
+
+ pid > 0: the counter is attached to a specific task (if the current task
+ has sufficient privilege to do so)
+
+ pid < 0: all tasks are counted (per cpu counters)
+
+The 'cpu' parameter allows a counter to be made specific to a full
+CPU:
+
+ cpu >= 0: the counter is restricted to a specific CPU
+ cpu == -1: the counter counts on all CPUs
+
+Note: the combination of 'pid == -1' and 'cpu == -1' is not valid.
+
+A 'pid > 0' and 'cpu == -1' counter is a per task counter that counts
+events of that task and 'follows' that task to whatever CPU the task
+gets scheduled to. Per task counters can be created by any user, for
+their own tasks.
+
+A 'pid == -1' and 'cpu == x' counter is a per CPU counter that counts
+all events on CPU-x. Per CPU counters need CAP_SYS_ADMIN privilege.
+
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index d4d4cb7..f2fdc18 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -643,6 +643,7 @@ config X86_UP_IOAPIC
 config X86_LOCAL_APIC
 	def_bool y
 	depends on X86_64 || (X86_32 && (X86_UP_APIC || (SMP && !X86_VOYAGER) || X86_GENERICARCH))
+	select HAVE_PERF_COUNTERS
 
 config X86_IO_APIC
 	def_bool y
diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 256b00b..3c14ed0 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -823,7 +823,8 @@ ia32_sys_call_table:
 	.quad compat_sys_signalfd4
 	.quad sys_eventfd2
 	.quad sys_epoll_create1
-	.quad sys_dup3			/* 330 */
+	.quad sys_dup3				/* 330 */
 	.quad sys_pipe2
 	.quad sys_inotify_init1
+	.quad sys_perf_counter_open
 ia32_syscall_end:
diff --git a/arch/x86/include/asm/hardirq_32.h b/arch/x86/include/asm/hardirq_32.h
index 5ca135e..b3e475d 100644
--- a/arch/x86/include/asm/hardirq_32.h
+++ b/arch/x86/include/asm/hardirq_32.h
@@ -9,6 +9,7 @@ typedef struct {
 	unsigned long idle_timestamp;
 	unsigned int __nmi_count;	/* arch dependent */
 	unsigned int apic_timer_irqs;	/* arch dependent */
+	unsigned int apic_perf_irqs;	/* arch dependent */
 	unsigned int irq0_irqs;
 	unsigned int irq_resched_count;
 	unsigned int irq_call_count;
diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h
index 8de644b..aa93e53 100644
--- a/arch/x86/include/asm/hw_irq.h
+++ b/arch/x86/include/asm/hw_irq.h
@@ -30,6 +30,8 @@
 /* Interrupt handlers registered during init_IRQ */
 extern void apic_timer_interrupt(void);
 extern void error_interrupt(void);
+extern void perf_counter_interrupt(void);
+
 extern void spurious_interrupt(void);
 extern void thermal_interrupt(void);
 extern void reschedule_interrupt(void);
diff --git a/arch/x86/include/asm/intel_arch_perfmon.h b/arch/x86/include/asm/intel_arch_perfmon.h
index fa0fd06..71598a9 100644
--- a/arch/x86/include/asm/intel_arch_perfmon.h
+++ b/arch/x86/include/asm/intel_arch_perfmon.h
@@ -1,22 +1,24 @@
 #ifndef _ASM_X86_INTEL_ARCH_PERFMON_H
 #define _ASM_X86_INTEL_ARCH_PERFMON_H
 
-#define MSR_ARCH_PERFMON_PERFCTR0		0xc1
-#define MSR_ARCH_PERFMON_PERFCTR1		0xc2
+#define MSR_ARCH_PERFMON_PERFCTR0			      0xc1
+#define MSR_ARCH_PERFMON_PERFCTR1			      0xc2
 
-#define MSR_ARCH_PERFMON_EVENTSEL0		0x186
-#define MSR_ARCH_PERFMON_EVENTSEL1		0x187
+#define MSR_ARCH_PERFMON_EVENTSEL0			     0x186
+#define MSR_ARCH_PERFMON_EVENTSEL1			     0x187
 
-#define ARCH_PERFMON_EVENTSEL0_ENABLE	(1 << 22)
-#define ARCH_PERFMON_EVENTSEL_INT	(1 << 20)
-#define ARCH_PERFMON_EVENTSEL_OS	(1 << 17)
-#define ARCH_PERFMON_EVENTSEL_USR	(1 << 16)
+#define ARCH_PERFMON_EVENTSEL0_ENABLE			  (1 << 22)
+#define ARCH_PERFMON_EVENTSEL_INT			  (1 << 20)
+#define ARCH_PERFMON_EVENTSEL_OS			  (1 << 17)
+#define ARCH_PERFMON_EVENTSEL_USR			  (1 << 16)
 
-#define ARCH_PERFMON_UNHALTED_CORE_CYCLES_SEL	(0x3c)
-#define ARCH_PERFMON_UNHALTED_CORE_CYCLES_UMASK	(0x00 << 8)
-#define ARCH_PERFMON_UNHALTED_CORE_CYCLES_INDEX (0)
+#define ARCH_PERFMON_UNHALTED_CORE_CYCLES_SEL		      0x3c
+#define ARCH_PERFMON_UNHALTED_CORE_CYCLES_UMASK		(0x00 << 8)
+#define ARCH_PERFMON_UNHALTED_CORE_CYCLES_INDEX 		 0
 #define ARCH_PERFMON_UNHALTED_CORE_CYCLES_PRESENT \
-	(1 << (ARCH_PERFMON_UNHALTED_CORE_CYCLES_INDEX))
+		(1 << (ARCH_PERFMON_UNHALTED_CORE_CYCLES_INDEX))
+
+#define ARCH_PERFMON_BRANCH_MISSES_RETIRED			 6
 
 union cpuid10_eax {
 	struct {
@@ -28,4 +30,12 @@ union cpuid10_eax {
 	unsigned int full;
 };
 
+#ifdef CONFIG_PERF_COUNTERS
+extern void init_hw_perf_counters(void);
+extern void perf_counters_lapic_init(int nmi);
+#else
+static inline void init_hw_perf_counters(void)		{ }
+static inline void perf_counters_lapic_init(int nmi)	{ }
+#endif
+
 #endif /* _ASM_X86_INTEL_ARCH_PERFMON_H */
diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
index 0005adb..b8d277f 100644
--- a/arch/x86/include/asm/irq_vectors.h
+++ b/arch/x86/include/asm/irq_vectors.h
@@ -87,6 +87,11 @@
 #define LOCAL_TIMER_VECTOR	0xef
 
 /*
+ * Performance monitoring interrupt vector:
+ */
+#define LOCAL_PERF_VECTOR	0xee
+
+/*
  * First APIC vector available to drivers: (vectors 0x30-0xee) we
  * start at 0x31(0x41) to spread out vectors evenly between priority
  * levels. (0x80 is the syscall vector)
diff --git a/arch/x86/include/asm/mach-default/entry_arch.h b/arch/x86/include/asm/mach-default/entry_arch.h
index 6b1add8..ad31e5d 100644
--- a/arch/x86/include/asm/mach-default/entry_arch.h
+++ b/arch/x86/include/asm/mach-default/entry_arch.h
@@ -25,10 +25,15 @@ BUILD_INTERRUPT(irq_move_cleanup_interrupt,IRQ_MOVE_CLEANUP_VECTOR)
  * a much simpler SMP time architecture:
  */
 #ifdef CONFIG_X86_LOCAL_APIC
+
 BUILD_INTERRUPT(apic_timer_interrupt,LOCAL_TIMER_VECTOR)
 BUILD_INTERRUPT(error_interrupt,ERROR_APIC_VECTOR)
 BUILD_INTERRUPT(spurious_interrupt,SPURIOUS_APIC_VECTOR)
 
+#ifdef CONFIG_PERF_COUNTERS
+BUILD_INTERRUPT(perf_counter_interrupt, LOCAL_PERF_VECTOR)
+#endif
+
 #ifdef CONFIG_X86_MCE_P4THERMAL
 BUILD_INTERRUPT(thermal_interrupt,THERMAL_APIC_VECTOR)
 #endif
diff --git a/arch/x86/include/asm/pda.h b/arch/x86/include/asm/pda.h
index 2fbfff8..90a8d9d 100644
--- a/arch/x86/include/asm/pda.h
+++ b/arch/x86/include/asm/pda.h
@@ -30,6 +30,7 @@ struct x8664_pda {
 	short isidle;
 	struct mm_struct *active_mm;
 	unsigned apic_timer_irqs;
+	unsigned apic_perf_irqs;
 	unsigned irq0_irqs;
 	unsigned irq_resched_count;
 	unsigned irq_call_count;
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index e44d379..810bf26 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -80,6 +80,7 @@ struct thread_info {
 #define TIF_SYSCALL_AUDIT	7	/* syscall auditing active */
 #define TIF_SECCOMP		8	/* secure computing */
 #define TIF_MCE_NOTIFY		10	/* notify userspace of an MCE */
+#define TIF_PERF_COUNTERS	11	/* notify perf counter work */
 #define TIF_NOTSC		16	/* TSC is not accessible in userland */
 #define TIF_IA32		17	/* 32bit process */
 #define TIF_FORK		18	/* ret_from_fork */
@@ -103,6 +104,7 @@ struct thread_info {
 #define _TIF_SYSCALL_AUDIT	(1 << TIF_SYSCALL_AUDIT)
 #define _TIF_SECCOMP		(1 << TIF_SECCOMP)
 #define _TIF_MCE_NOTIFY		(1 << TIF_MCE_NOTIFY)
+#define _TIF_PERF_COUNTERS	(1 << TIF_PERF_COUNTERS)
 #define _TIF_NOTSC		(1 << TIF_NOTSC)
 #define _TIF_IA32		(1 << TIF_IA32)
 #define _TIF_FORK		(1 << TIF_FORK)
@@ -135,7 +137,7 @@ struct thread_info {
 
 /* Only used for 64 bit */
 #define _TIF_DO_NOTIFY_MASK						\
-	(_TIF_SIGPENDING|_TIF_MCE_NOTIFY|_TIF_NOTIFY_RESUME)
+	(_TIF_SIGPENDING|_TIF_MCE_NOTIFY|_TIF_PERF_COUNTERS|_TIF_NOTIFY_RESUME)
 
 /* flags to check in __switch_to() */
 #define _TIF_WORK_CTXSW							\
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index f2bba78..7e47658 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -338,6 +338,7 @@
 #define __NR_dup3		330
 #define __NR_pipe2		331
 #define __NR_inotify_init1	332
+#define __NR_perf_counter_open	333
 
 #ifdef __KERNEL__
 
diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h
index d2e415e..53025fe 100644
--- a/arch/x86/include/asm/unistd_64.h
+++ b/arch/x86/include/asm/unistd_64.h
@@ -653,7 +653,8 @@ __SYSCALL(__NR_dup3, sys_dup3)
 __SYSCALL(__NR_pipe2, sys_pipe2)
 #define __NR_inotify_init1			294
 __SYSCALL(__NR_inotify_init1, sys_inotify_init1)
-
+#define __NR_perf_counter_open		295
+__SYSCALL(__NR_perf_counter_open, sys_perf_counter_open)
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/kernel/apic.c b/arch/x86/kernel/apic.c
index 16f9487..8ab8c18 100644
--- a/arch/x86/kernel/apic.c
+++ b/arch/x86/kernel/apic.c
@@ -31,6 +31,7 @@
 #include <linux/dmi.h>
 #include <linux/dmar.h>
 
+#include <asm/intel_arch_perfmon.h>
 #include <asm/atomic.h>
 #include <asm/smp.h>
 #include <asm/mtrr.h>
@@ -1147,6 +1148,7 @@ void __cpuinit setup_local_APIC(void)
 		apic_write(APIC_ESR, 0);
 	}
 #endif
+	perf_counters_lapic_init(0);
 
 	preempt_disable();
 
diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 82ec607..89e5336 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -1,5 +1,5 @@
 #
-# Makefile for x86-compatible CPU details and quirks
+# Makefile for x86-compatible CPU details, features and quirks
 #
 
 obj-y			:= intel_cacheinfo.o addon_cpuid_features.o
@@ -16,11 +16,13 @@ obj-$(CONFIG_CPU_SUP_CENTAUR_64)	+= centaur_64.o
 obj-$(CONFIG_CPU_SUP_TRANSMETA_32)	+= transmeta.o
 obj-$(CONFIG_CPU_SUP_UMC_32)		+= umc.o
 
-obj-$(CONFIG_X86_MCE)	+= mcheck/
-obj-$(CONFIG_MTRR)	+= mtrr/
-obj-$(CONFIG_CPU_FREQ)	+= cpufreq/
+obj-$(CONFIG_PERF_COUNTERS)		+= perf_counter.o
 
-obj-$(CONFIG_X86_LOCAL_APIC) += perfctr-watchdog.o
+obj-$(CONFIG_X86_MCE)			+= mcheck/
+obj-$(CONFIG_MTRR)			+= mtrr/
+obj-$(CONFIG_CPU_FREQ)			+= cpufreq/
+
+obj-$(CONFIG_X86_LOCAL_APIC)		+= perfctr-watchdog.o
 
 quiet_cmd_mkcapflags = MKCAP   $@
       cmd_mkcapflags = $(PERL) $(srctree)/$(src)/mkcapflags.pl $< $@
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index b9c9ea0..4461011 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -17,6 +17,7 @@
 #include <asm/mmu_context.h>
 #include <asm/mtrr.h>
 #include <asm/mce.h>
+#include <asm/intel_arch_perfmon.h>
 #include <asm/pat.h>
 #include <asm/asm.h>
 #include <asm/numa.h>
@@ -750,6 +751,7 @@ void __init identify_boot_cpu(void)
 #else
 	vgetcpu_set_mode();
 #endif
+	init_hw_perf_counters();
 }
 
 void __cpuinit identify_secondary_cpu(struct cpuinfo_x86 *c)
diff --git a/arch/x86/kernel/cpu/perf_counter.c b/arch/x86/kernel/cpu/perf_counter.c
new file mode 100644
index 0000000..4854cca
--- /dev/null
+++ b/arch/x86/kernel/cpu/perf_counter.c
@@ -0,0 +1,563 @@
+/*
+ * Performance counter x86 architecture code
+ *
+ *  Copyright(C) 2008 Thomas Gleixner <tglx@linutronix.de>
+ *  Copyright(C) 2008 Red Hat, Inc., Ingo Molnar
+ *
+ *  For licencing details see kernel-base/COPYING
+ */
+
+#include <linux/perf_counter.h>
+#include <linux/capability.h>
+#include <linux/notifier.h>
+#include <linux/hardirq.h>
+#include <linux/kprobes.h>
+#include <linux/module.h>
+#include <linux/kdebug.h>
+#include <linux/sched.h>
+
+#include <asm/intel_arch_perfmon.h>
+#include <asm/apic.h>
+
+static bool perf_counters_initialized __read_mostly;
+
+/*
+ * Number of (generic) HW counters:
+ */
+static int nr_hw_counters __read_mostly;
+static u32 perf_counter_mask __read_mostly;
+
+/* No support for fixed function counters yet */
+
+#define MAX_HW_COUNTERS		8
+
+struct cpu_hw_counters {
+	struct perf_counter	*counters[MAX_HW_COUNTERS];
+	unsigned long		used[BITS_TO_LONGS(MAX_HW_COUNTERS)];
+};
+
+/*
+ * Intel PerfMon v3. Used on Core2 and later.
+ */
+static DEFINE_PER_CPU(struct cpu_hw_counters, cpu_hw_counters);
+
+const int intel_perfmon_event_map[] =
+{
+  [PERF_COUNT_CYCLES]			= 0x003c,
+  [PERF_COUNT_INSTRUCTIONS]		= 0x00c0,
+  [PERF_COUNT_CACHE_REFERENCES]		= 0x4f2e,
+  [PERF_COUNT_CACHE_MISSES]		= 0x412e,
+  [PERF_COUNT_BRANCH_INSTRUCTIONS]	= 0x00c4,
+  [PERF_COUNT_BRANCH_MISSES]		= 0x00c5,
+};
+
+const int max_intel_perfmon_events = ARRAY_SIZE(intel_perfmon_event_map);
+
+/*
+ * Setup the hardware configuration for a given hw_event_type
+ */
+static int __hw_perf_counter_init(struct perf_counter *counter)
+{
+	struct perf_counter_hw_event *hw_event = &counter->hw_event;
+	struct hw_perf_counter *hwc = &counter->hw;
+
+	if (unlikely(!perf_counters_initialized))
+		return -EINVAL;
+
+	/*
+	 * Count user events, and generate PMC IRQs:
+	 * (keep 'enabled' bit clear for now)
+	 */
+	hwc->config = ARCH_PERFMON_EVENTSEL_USR | ARCH_PERFMON_EVENTSEL_INT;
+
+	/*
+	 * If privileged enough, count OS events too, and allow
+	 * NMI events as well:
+	 */
+	hwc->nmi = 0;
+	if (capable(CAP_SYS_ADMIN)) {
+		hwc->config |= ARCH_PERFMON_EVENTSEL_OS;
+		if (hw_event->nmi)
+			hwc->nmi = 1;
+	}
+
+	hwc->config_base	= MSR_ARCH_PERFMON_EVENTSEL0;
+	hwc->counter_base	= MSR_ARCH_PERFMON_PERFCTR0;
+
+	hwc->irq_period		= hw_event->irq_period;
+	/*
+	 * Intel PMCs cannot be accessed sanely above 32 bit width,
+	 * so we install an artificial 1<<31 period regardless of
+	 * the generic counter period:
+	 */
+	if (!hwc->irq_period)
+		hwc->irq_period = 0x7FFFFFFF;
+
+	hwc->next_count	= -(s32)hwc->irq_period;
+
+	/*
+	 * Raw event type provide the config in the event structure
+	 */
+	if (hw_event->raw) {
+		hwc->config |= hw_event->type;
+	} else {
+		if (hw_event->type >= max_intel_perfmon_events)
+			return -EINVAL;
+		/*
+		 * The generic map:
+		 */
+		hwc->config |= intel_perfmon_event_map[hw_event->type];
+	}
+	counter->wakeup_pending = 0;
+
+	return 0;
+}
+
+void hw_perf_enable_all(void)
+{
+	wrmsr(MSR_CORE_PERF_GLOBAL_CTRL, perf_counter_mask, 0);
+}
+
+void hw_perf_restore(u64 ctrl)
+{
+	wrmsr(MSR_CORE_PERF_GLOBAL_CTRL, ctrl, 0);
+}
+EXPORT_SYMBOL_GPL(hw_perf_restore);
+
+u64 hw_perf_save_disable(void)
+{
+	u64 ctrl;
+
+	rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, ctrl);
+	wrmsr(MSR_CORE_PERF_GLOBAL_CTRL, 0, 0);
+	return ctrl;
+}
+EXPORT_SYMBOL_GPL(hw_perf_save_disable);
+
+static inline void
+__x86_perf_counter_disable(struct hw_perf_counter *hwc, unsigned int idx)
+{
+	wrmsr(hwc->config_base + idx, hwc->config, 0);
+}
+
+static DEFINE_PER_CPU(u64, prev_next_count[MAX_HW_COUNTERS]);
+
+static void __hw_perf_counter_set_period(struct hw_perf_counter *hwc, int idx)
+{
+	per_cpu(prev_next_count[idx], smp_processor_id()) = hwc->next_count;
+
+	wrmsr(hwc->counter_base + idx, hwc->next_count, 0);
+}
+
+static void __x86_perf_counter_enable(struct hw_perf_counter *hwc, int idx)
+{
+	wrmsr(hwc->config_base + idx,
+	      hwc->config | ARCH_PERFMON_EVENTSEL0_ENABLE, 0);
+}
+
+static void x86_perf_counter_enable(struct perf_counter *counter)
+{
+	struct cpu_hw_counters *cpuc = &__get_cpu_var(cpu_hw_counters);
+	struct hw_perf_counter *hwc = &counter->hw;
+	int idx = hwc->idx;
+
+	/* Try to get the previous counter again */
+	if (test_and_set_bit(idx, cpuc->used)) {
+		idx = find_first_zero_bit(cpuc->used, nr_hw_counters);
+		set_bit(idx, cpuc->used);
+		hwc->idx = idx;
+	}
+
+	perf_counters_lapic_init(hwc->nmi);
+
+	__x86_perf_counter_disable(hwc, idx);
+
+	cpuc->counters[idx] = counter;
+
+	__hw_perf_counter_set_period(hwc, idx);
+	__x86_perf_counter_enable(hwc, idx);
+}
+
+static void __hw_perf_save_counter(struct perf_counter *counter,
+				   struct hw_perf_counter *hwc, int idx)
+{
+	s64 raw = -1;
+	s64 delta;
+
+	/*
+	 * Get the raw hw counter value:
+	 */
+	rdmsrl(hwc->counter_base + idx, raw);
+
+	/*
+	 * Rebase it to zero (it started counting at -irq_period),
+	 * to see the delta since ->prev_count:
+	 */
+	delta = (s64)hwc->irq_period + (s64)(s32)raw;
+
+	atomic64_counter_set(counter, hwc->prev_count + delta);
+
+	/*
+	 * Adjust the ->prev_count offset - if we went beyond
+	 * irq_period of units, then we got an IRQ and the counter
+	 * was set back to -irq_period:
+	 */
+	while (delta >= (s64)hwc->irq_period) {
+		hwc->prev_count += hwc->irq_period;
+		delta -= (s64)hwc->irq_period;
+	}
+
+	/*
+	 * Calculate the next raw counter value we'll write into
+	 * the counter at the next sched-in time:
+	 */
+	delta -= (s64)hwc->irq_period;
+
+	hwc->next_count = (s32)delta;
+}
+
+void perf_counter_print_debug(void)
+{
+	u64 ctrl, status, overflow, pmc_ctrl, pmc_count, next_count;
+	int cpu, idx;
+
+	if (!nr_hw_counters)
+		return;
+
+	local_irq_disable();
+
+	cpu = smp_processor_id();
+
+	rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, ctrl);
+	rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, status);
+	rdmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, overflow);
+
+	printk(KERN_INFO "\n");
+	printk(KERN_INFO "CPU#%d: ctrl:       %016llx\n", cpu, ctrl);
+	printk(KERN_INFO "CPU#%d: status:     %016llx\n", cpu, status);
+	printk(KERN_INFO "CPU#%d: overflow:   %016llx\n", cpu, overflow);
+
+	for (idx = 0; idx < nr_hw_counters; idx++) {
+		rdmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + idx, pmc_ctrl);
+		rdmsrl(MSR_ARCH_PERFMON_PERFCTR0  + idx, pmc_count);
+
+		next_count = per_cpu(prev_next_count[idx], cpu);
+
+		printk(KERN_INFO "CPU#%d: PMC%d ctrl:  %016llx\n",
+			cpu, idx, pmc_ctrl);
+		printk(KERN_INFO "CPU#%d: PMC%d count: %016llx\n",
+			cpu, idx, pmc_count);
+		printk(KERN_INFO "CPU#%d: PMC%d next:  %016llx\n",
+			cpu, idx, next_count);
+	}
+	local_irq_enable();
+}
+
+static void x86_perf_counter_disable(struct perf_counter *counter)
+{
+	struct cpu_hw_counters *cpuc = &__get_cpu_var(cpu_hw_counters);
+	struct hw_perf_counter *hwc = &counter->hw;
+	unsigned int idx = hwc->idx;
+
+	__x86_perf_counter_disable(hwc, idx);
+
+	clear_bit(idx, cpuc->used);
+	cpuc->counters[idx] = NULL;
+	__hw_perf_save_counter(counter, hwc, idx);
+}
+
+static void x86_perf_counter_read(struct perf_counter *counter)
+{
+	struct hw_perf_counter *hwc = &counter->hw;
+	unsigned long addr = hwc->counter_base + hwc->idx;
+	s64 offs, val = -1LL;
+	s32 val32;
+
+	/* Careful: NMI might modify the counter offset */
+	do {
+		offs = hwc->prev_count;
+		rdmsrl(addr, val);
+	} while (offs != hwc->prev_count);
+
+	val32 = (s32) val;
+	val = (s64)hwc->irq_period + (s64)val32;
+	atomic64_counter_set(counter, hwc->prev_count + val);
+}
+
+static void perf_store_irq_data(struct perf_counter *counter, u64 data)
+{
+	struct perf_data *irqdata = counter->irqdata;
+
+	if (irqdata->len > PERF_DATA_BUFLEN - sizeof(u64)) {
+		irqdata->overrun++;
+	} else {
+		u64 *p = (u64 *) &irqdata->data[irqdata->len];
+
+		*p = data;
+		irqdata->len += sizeof(u64);
+	}
+}
+
+/*
+ * NMI-safe enable method:
+ */
+static void perf_save_and_restart(struct perf_counter *counter)
+{
+	struct hw_perf_counter *hwc = &counter->hw;
+	int idx = hwc->idx;
+	u64 pmc_ctrl;
+
+	rdmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + idx, pmc_ctrl);
+
+	__hw_perf_save_counter(counter, hwc, idx);
+	__hw_perf_counter_set_period(hwc, idx);
+
+	if (pmc_ctrl & ARCH_PERFMON_EVENTSEL0_ENABLE)
+		__x86_perf_counter_enable(hwc, idx);
+}
+
+static void
+perf_handle_group(struct perf_counter *sibling, u64 *status, u64 *overflown)
+{
+	struct perf_counter *counter, *group_leader = sibling->group_leader;
+	int bit;
+
+	/*
+	 * Store the counter's own timestamp first:
+	 */
+	perf_store_irq_data(sibling, sibling->hw_event.type);
+	perf_store_irq_data(sibling, atomic64_counter_read(sibling));
+
+	/*
+	 * Then store sibling timestamps (if any):
+	 */
+	list_for_each_entry(counter, &group_leader->sibling_list, list_entry) {
+		if (counter->state != PERF_COUNTER_STATE_ACTIVE) {
+			/*
+			 * When counter was not in the overflow mask, we have to
+			 * read it from hardware. We read it as well, when it
+			 * has not been read yet and clear the bit in the
+			 * status mask.
+			 */
+			bit = counter->hw.idx;
+			if (!test_bit(bit, (unsigned long *) overflown) ||
+			    test_bit(bit, (unsigned long *) status)) {
+				clear_bit(bit, (unsigned long *) status);
+				perf_save_and_restart(counter);
+			}
+		}
+		perf_store_irq_data(sibling, counter->hw_event.type);
+		perf_store_irq_data(sibling, atomic64_counter_read(counter));
+	}
+}
+
+/*
+ * This handler is triggered by the local APIC, so the APIC IRQ handling
+ * rules apply:
+ */
+static void __smp_perf_counter_interrupt(struct pt_regs *regs, int nmi)
+{
+	int bit, cpu = smp_processor_id();
+	u64 ack, status, saved_global;
+	struct cpu_hw_counters *cpuc;
+
+	rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, saved_global);
+
+	/* Disable counters globally */
+	wrmsr(MSR_CORE_PERF_GLOBAL_CTRL, 0, 0);
+	ack_APIC_irq();
+
+	cpuc = &per_cpu(cpu_hw_counters, cpu);
+
+	rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, status);
+	if (!status)
+		goto out;
+
+again:
+	ack = status;
+	for_each_bit(bit, (unsigned long *) &status, nr_hw_counters) {
+		struct perf_counter *counter = cpuc->counters[bit];
+
+		clear_bit(bit, (unsigned long *) &status);
+		if (!counter)
+			continue;
+
+		perf_save_and_restart(counter);
+
+		switch (counter->hw_event.record_type) {
+		case PERF_RECORD_SIMPLE:
+			continue;
+		case PERF_RECORD_IRQ:
+			perf_store_irq_data(counter, instruction_pointer(regs));
+			break;
+		case PERF_RECORD_GROUP:
+			perf_handle_group(counter, &status, &ack);
+			break;
+		}
+		/*
+		 * From NMI context we cannot call into the scheduler to
+		 * do a task wakeup - but we mark these counters as
+		 * wakeup_pending and initiate a wakeup callback:
+		 */
+		if (nmi) {
+			counter->wakeup_pending = 1;
+			set_tsk_thread_flag(current, TIF_PERF_COUNTERS);
+		} else {
+			wake_up(&counter->waitq);
+		}
+	}
+
+	wrmsr(MSR_CORE_PERF_GLOBAL_OVF_CTRL, ack, 0);
+
+	/*
+	 * Repeat if there is more work to be done:
+	 */
+	rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, status);
+	if (status)
+		goto again;
+out:
+	/*
+	 * Restore - do not reenable when global enable is off:
+	 */
+	wrmsr(MSR_CORE_PERF_GLOBAL_CTRL, saved_global, 0);
+}
+
+void smp_perf_counter_interrupt(struct pt_regs *regs)
+{
+	irq_enter();
+#ifdef CONFIG_X86_64
+	add_pda(apic_perf_irqs, 1);
+#else
+	per_cpu(irq_stat, smp_processor_id()).apic_perf_irqs++;
+#endif
+	apic_write(APIC_LVTPC, LOCAL_PERF_VECTOR);
+	__smp_perf_counter_interrupt(regs, 0);
+
+	irq_exit();
+}
+
+/*
+ * This handler is triggered by NMI contexts:
+ */
+void perf_counter_notify(struct pt_regs *regs)
+{
+	struct cpu_hw_counters *cpuc;
+	unsigned long flags;
+	int bit, cpu;
+
+	local_irq_save(flags);
+	cpu = smp_processor_id();
+	cpuc = &per_cpu(cpu_hw_counters, cpu);
+
+	for_each_bit(bit, cpuc->used, nr_hw_counters) {
+		struct perf_counter *counter = cpuc->counters[bit];
+
+		if (!counter)
+			continue;
+
+		if (counter->wakeup_pending) {
+			counter->wakeup_pending = 0;
+			wake_up(&counter->waitq);
+		}
+	}
+
+	local_irq_restore(flags);
+}
+
+void __cpuinit perf_counters_lapic_init(int nmi)
+{
+	u32 apic_val;
+
+	if (!perf_counters_initialized)
+		return;
+	/*
+	 * Enable the performance counter vector in the APIC LVT:
+	 */
+	apic_val = apic_read(APIC_LVTERR);
+
+	apic_write(APIC_LVTERR, apic_val | APIC_LVT_MASKED);
+	if (nmi)
+		apic_write(APIC_LVTPC, APIC_DM_NMI);
+	else
+		apic_write(APIC_LVTPC, LOCAL_PERF_VECTOR);
+	apic_write(APIC_LVTERR, apic_val);
+}
+
+static int __kprobes
+perf_counter_nmi_handler(struct notifier_block *self,
+			 unsigned long cmd, void *__args)
+{
+	struct die_args *args = __args;
+	struct pt_regs *regs;
+
+	if (likely(cmd != DIE_NMI_IPI))
+		return NOTIFY_DONE;
+
+	regs = args->regs;
+
+	apic_write(APIC_LVTPC, APIC_DM_NMI);
+	__smp_perf_counter_interrupt(regs, 1);
+
+	return NOTIFY_STOP;
+}
+
+static __read_mostly struct notifier_block perf_counter_nmi_notifier = {
+	.notifier_call		= perf_counter_nmi_handler
+};
+
+void __init init_hw_perf_counters(void)
+{
+	union cpuid10_eax eax;
+	unsigned int unused;
+	unsigned int ebx;
+
+	if (!cpu_has(&boot_cpu_data, X86_FEATURE_ARCH_PERFMON))
+		return;
+
+	/*
+	 * Check whether the Architectural PerfMon supports
+	 * Branch Misses Retired Event or not.
+	 */
+	cpuid(10, &(eax.full), &ebx, &unused, &unused);
+	if (eax.split.mask_length <= ARCH_PERFMON_BRANCH_MISSES_RETIRED)
+		return;
+
+	printk(KERN_INFO "Intel Performance Monitoring support detected.\n");
+
+	printk(KERN_INFO "... version:      %d\n", eax.split.version_id);
+	printk(KERN_INFO "... num_counters: %d\n", eax.split.num_counters);
+	nr_hw_counters = eax.split.num_counters;
+	if (nr_hw_counters > MAX_HW_COUNTERS) {
+		nr_hw_counters = MAX_HW_COUNTERS;
+		WARN(1, KERN_ERR "hw perf counters %d > max(%d), clipping!",
+			nr_hw_counters, MAX_HW_COUNTERS);
+	}
+	perf_counter_mask = (1 << nr_hw_counters) - 1;
+	perf_max_counters = nr_hw_counters;
+
+	printk(KERN_INFO "... bit_width:    %d\n", eax.split.bit_width);
+	printk(KERN_INFO "... mask_length:  %d\n", eax.split.mask_length);
+
+	perf_counters_lapic_init(0);
+	register_die_notifier(&perf_counter_nmi_notifier);
+
+	perf_counters_initialized = true;
+}
+
+static const struct hw_perf_counter_ops x86_perf_counter_ops = {
+	.hw_perf_counter_enable		= x86_perf_counter_enable,
+	.hw_perf_counter_disable	= x86_perf_counter_disable,
+	.hw_perf_counter_read		= x86_perf_counter_read,
+};
+
+const struct hw_perf_counter_ops *
+hw_perf_counter_init(struct perf_counter *counter)
+{
+	int err;
+
+	err = __hw_perf_counter_init(counter);
+	if (err)
+		return NULL;
+
+	return &x86_perf_counter_ops;
+}
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 3194636..fc013cf 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -984,6 +984,11 @@ apicinterrupt ERROR_APIC_VECTOR \
 apicinterrupt SPURIOUS_APIC_VECTOR \
 	spurious_interrupt smp_spurious_interrupt
 
+#ifdef CONFIG_PERF_COUNTERS
+apicinterrupt LOCAL_PERF_VECTOR \
+	perf_counter_interrupt smp_perf_counter_interrupt
+#endif
+
 /*
  * Exception entry points.
  */
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index d1d4dc5..d92bc71 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -56,6 +56,10 @@ static int show_other_interrupts(struct seq_file *p)
 	for_each_online_cpu(j)
 		seq_printf(p, "%10u ", irq_stats(j)->apic_timer_irqs);
 	seq_printf(p, "  Local timer interrupts\n");
+	seq_printf(p, "CNT: ");
+	for_each_online_cpu(j)
+		seq_printf(p, "%10u ", irq_stats(j)->apic_perf_irqs);
+	seq_printf(p, "  Performance counter interrupts\n");
 #endif
 #ifdef CONFIG_SMP
 	seq_printf(p, "RES: ");
@@ -160,6 +164,7 @@ u64 arch_irq_stat_cpu(unsigned int cpu)
 
 #ifdef CONFIG_X86_LOCAL_APIC
 	sum += irq_stats(cpu)->apic_timer_irqs;
+	sum += irq_stats(cpu)->apic_perf_irqs;
 #endif
 #ifdef CONFIG_SMP
 	sum += irq_stats(cpu)->irq_resched_count;
diff --git a/arch/x86/kernel/irqinit_32.c b/arch/x86/kernel/irqinit_32.c
index 607db63..6a33b5e 100644
--- a/arch/x86/kernel/irqinit_32.c
+++ b/arch/x86/kernel/irqinit_32.c
@@ -160,6 +160,9 @@ void __init native_init_IRQ(void)
 	/* IPI vectors for APIC spurious and error interrupts */
 	alloc_intr_gate(SPURIOUS_APIC_VECTOR, spurious_interrupt);
 	alloc_intr_gate(ERROR_APIC_VECTOR, error_interrupt);
+# ifdef CONFIG_PERF_COUNTERS
+	alloc_intr_gate(LOCAL_PERF_VECTOR, perf_counter_interrupt);
+# endif
 #endif
 
 #if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_X86_MCE_P4THERMAL)
diff --git a/arch/x86/kernel/irqinit_64.c b/arch/x86/kernel/irqinit_64.c
index 8670b3c..91d785c 100644
--- a/arch/x86/kernel/irqinit_64.c
+++ b/arch/x86/kernel/irqinit_64.c
@@ -138,6 +138,11 @@ static void __init apic_intr_init(void)
 	/* IPI vectors for APIC spurious and error interrupts */
 	alloc_intr_gate(SPURIOUS_APIC_VECTOR, spurious_interrupt);
 	alloc_intr_gate(ERROR_APIC_VECTOR, error_interrupt);
+
+	/* Performance monitoring interrupt: */
+#ifdef CONFIG_PERF_COUNTERS
+	alloc_intr_gate(LOCAL_PERF_VECTOR, perf_counter_interrupt);
+#endif
 }
 
 void __init native_init_IRQ(void)
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index b1cc6da..dee553c 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -6,7 +6,7 @@
  *  2000-06-20  Pentium III FXSR, SSE support by Gareth Hughes
  *  2000-2002   x86-64 support by Andi Kleen
  */
-
+#include <linux/perf_counter.h>
 #include <linux/sched.h>
 #include <linux/mm.h>
 #include <linux/smp.h>
@@ -891,6 +891,11 @@ do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
 		tracehook_notify_resume(regs);
 	}
 
+	if (thread_info_flags & _TIF_PERF_COUNTERS) {
+		clear_thread_flag(TIF_PERF_COUNTERS);
+		perf_counter_notify(regs);
+	}
+
 #ifdef CONFIG_X86_32
 	clear_thread_flag(TIF_IRET);
 #endif /* CONFIG_X86_32 */
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index d44395f..496726d 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -332,3 +332,4 @@ ENTRY(sys_call_table)
 	.long sys_dup3			/* 330 */
 	.long sys_pipe2
 	.long sys_inotify_init1
+	.long sys_perf_counter_open
diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index 5f8d746..a3e66a3 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -270,8 +270,11 @@ static atomic_t c3_cpu_count;
 /* Common C-state entry for C2, C3, .. */
 static void acpi_cstate_enter(struct acpi_processor_cx *cstate)
 {
+	u64 perf_flags;
+
 	/* Don't trace irqs off for idle */
 	stop_critical_timings();
+	perf_flags = hw_perf_save_disable();
 	if (cstate->entry_method == ACPI_CSTATE_FFH) {
 		/* Call into architectural FFH based C-state */
 		acpi_processor_ffh_cstate_enter(cstate);
@@ -284,6 +287,7 @@ static void acpi_cstate_enter(struct acpi_processor_cx *cstate)
 		   gets asserted in time to freeze execution properly. */
 		unused = inl(acpi_gbl_FADT.xpm_timer_block.address);
 	}
+	hw_perf_restore(perf_flags);
 	start_critical_timings();
 }
 #endif /* !CONFIG_CPU_IDLE */
@@ -1425,8 +1429,11 @@ static inline void acpi_idle_update_bm_rld(struct acpi_processor *pr,
  */
 static inline void acpi_idle_do_entry(struct acpi_processor_cx *cx)
 {
+	u64 pctrl;
+
 	/* Don't trace irqs off for idle */
 	stop_critical_timings();
+	pctrl = hw_perf_save_disable();
 	if (cx->entry_method == ACPI_CSTATE_FFH) {
 		/* Call into architectural FFH based C-state */
 		acpi_processor_ffh_cstate_enter(cx);
@@ -1441,6 +1448,7 @@ static inline void acpi_idle_do_entry(struct acpi_processor_cx *cx)
 		   gets asserted in time to freeze execution properly. */
 		unused = inl(acpi_gbl_FADT.xpm_timer_block.address);
 	}
+	hw_perf_restore(pctrl);
 	start_critical_timings();
 }
 
diff --git a/drivers/char/sysrq.c b/drivers/char/sysrq.c
index ce0d9da..52146c2 100644
--- a/drivers/char/sysrq.c
+++ b/drivers/char/sysrq.c
@@ -25,6 +25,7 @@
 #include <linux/kbd_kern.h>
 #include <linux/proc_fs.h>
 #include <linux/quotaops.h>
+#include <linux/perf_counter.h>
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/suspend.h>
@@ -244,6 +245,7 @@ static void sysrq_handle_showregs(int key, struct tty_struct *tty)
 	struct pt_regs *regs = get_irq_regs();
 	if (regs)
 		show_regs(regs);
+	perf_counter_print_debug();
 }
 static struct sysrq_key_op sysrq_showregs_op = {
 	.handler	= sysrq_handle_showregs,
diff --git a/include/linux/perf_counter.h b/include/linux/perf_counter.h
new file mode 100644
index 0000000..8cb095f
--- /dev/null
+++ b/include/linux/perf_counter.h
@@ -0,0 +1,244 @@
+/*
+ *  Performance counters:
+ *
+ *   Copyright(C) 2008, Thomas Gleixner <tglx@linutronix.de>
+ *   Copyright(C) 2008, Red Hat, Inc., Ingo Molnar
+ *
+ *  Data type definitions, declarations, prototypes.
+ *
+ *  Started by: Thomas Gleixner and Ingo Molnar
+ *
+ *  For licencing details see kernel-base/COPYING
+ */
+#ifndef _LINUX_PERF_COUNTER_H
+#define _LINUX_PERF_COUNTER_H
+
+#include <asm/atomic.h>
+
+#include <linux/list.h>
+#include <linux/mutex.h>
+#include <linux/rculist.h>
+#include <linux/rcupdate.h>
+#include <linux/spinlock.h>
+
+struct task_struct;
+
+/*
+ * User-space ABI bits:
+ */
+
+/*
+ * Generalized performance counter event types, used by the hw_event.type
+ * parameter of the sys_perf_counter_open() syscall:
+ */
+enum hw_event_types {
+	/*
+	 * Common hardware events, generalized by the kernel:
+	 */
+	PERF_COUNT_CYCLES		=  0,
+	PERF_COUNT_INSTRUCTIONS		=  1,
+	PERF_COUNT_CACHE_REFERENCES	=  2,
+	PERF_COUNT_CACHE_MISSES		=  3,
+	PERF_COUNT_BRANCH_INSTRUCTIONS	=  4,
+	PERF_COUNT_BRANCH_MISSES	=  5,
+
+	/*
+	 * Special "software" counters provided by the kernel, even if
+	 * the hardware does not support performance counters. These
+	 * counters measure various physical and sw events of the
+	 * kernel (and allow the profiling of them as well):
+	 */
+	PERF_COUNT_CPU_CLOCK		= -1,
+	PERF_COUNT_TASK_CLOCK		= -2,
+	/*
+	 * Future software events:
+	 */
+	/* PERF_COUNT_PAGE_FAULTS	= -3,
+	   PERF_COUNT_CONTEXT_SWITCHES	= -4, */
+};
+
+/*
+ * IRQ-notification data record type:
+ */
+enum perf_counter_record_type {
+	PERF_RECORD_SIMPLE		=  0,
+	PERF_RECORD_IRQ			=  1,
+	PERF_RECORD_GROUP		=  2,
+};
+
+/*
+ * Hardware event to monitor via a performance monitoring counter:
+ */
+struct perf_counter_hw_event {
+	s64			type;
+
+	u64			irq_period;
+	u32			record_type;
+
+	u32			disabled     :  1, /* off by default */
+				nmi	     :  1, /* NMI sampling   */
+				raw	     :  1, /* raw event type */
+				__reserved_1 : 29;
+
+	u64			__reserved_2;
+};
+
+/*
+ * Kernel-internal data types:
+ */
+
+/**
+ * struct hw_perf_counter - performance counter hardware details:
+ */
+struct hw_perf_counter {
+	u64				config;
+	unsigned long			config_base;
+	unsigned long			counter_base;
+	int				nmi;
+	unsigned int			idx;
+	u64				prev_count;
+	u64				irq_period;
+	s32				next_count;
+};
+
+/*
+ * Hardcoded buffer length limit for now, for IRQ-fed events:
+ */
+#define PERF_DATA_BUFLEN		2048
+
+/**
+ * struct perf_data - performance counter IRQ data sampling ...
+ */
+struct perf_data {
+	int				len;
+	int				rd_idx;
+	int				overrun;
+	u8				data[PERF_DATA_BUFLEN];
+};
+
+struct perf_counter;
+
+/**
+ * struct hw_perf_counter_ops - performance counter hw ops
+ */
+struct hw_perf_counter_ops {
+	void (*hw_perf_counter_enable)	(struct perf_counter *counter);
+	void (*hw_perf_counter_disable)	(struct perf_counter *counter);
+	void (*hw_perf_counter_read)	(struct perf_counter *counter);
+};
+
+/**
+ * enum perf_counter_active_state - the states of a counter
+ */
+enum perf_counter_active_state {
+	PERF_COUNTER_STATE_OFF		= -1,
+	PERF_COUNTER_STATE_INACTIVE	=  0,
+	PERF_COUNTER_STATE_ACTIVE	=  1,
+};
+
+/**
+ * struct perf_counter - performance counter kernel representation:
+ */
+struct perf_counter {
+	struct list_head		list_entry;
+	struct list_head		sibling_list;
+	struct perf_counter		*group_leader;
+	const struct hw_perf_counter_ops *hw_ops;
+
+	enum perf_counter_active_state	state;
+#if BITS_PER_LONG == 64
+	atomic64_t			count;
+#else
+	atomic_t			count32[2];
+#endif
+	struct perf_counter_hw_event	hw_event;
+	struct hw_perf_counter		hw;
+
+	struct perf_counter_context	*ctx;
+	struct task_struct		*task;
+
+	/*
+	 * Protect attach/detach:
+	 */
+	struct mutex			mutex;
+
+	int				oncpu;
+	int				cpu;
+
+	/* read() / irq related data */
+	wait_queue_head_t		waitq;
+	/* optional: for NMIs */
+	int				wakeup_pending;
+	struct perf_data		*irqdata;
+	struct perf_data		*usrdata;
+	struct perf_data		data[2];
+};
+
+/**
+ * struct perf_counter_context - counter context structure
+ *
+ * Used as a container for task counters and CPU counters as well:
+ */
+struct perf_counter_context {
+#ifdef CONFIG_PERF_COUNTERS
+	/*
+	 * Protect the list of counters:
+	 */
+	spinlock_t		lock;
+
+	struct list_head	counter_list;
+	int			nr_counters;
+	int			nr_active;
+	struct task_struct	*task;
+#endif
+};
+
+/**
+ * struct perf_counter_cpu_context - per cpu counter context structure
+ */
+struct perf_cpu_context {
+	struct perf_counter_context	ctx;
+	struct perf_counter_context	*task_ctx;
+	int				active_oncpu;
+	int				max_pertask;
+};
+
+/*
+ * Set by architecture code:
+ */
+extern int perf_max_counters;
+
+#ifdef CONFIG_PERF_COUNTERS
+extern const struct hw_perf_counter_ops *
+hw_perf_counter_init(struct perf_counter *counter);
+
+extern void perf_counter_task_sched_in(struct task_struct *task, int cpu);
+extern void perf_counter_task_sched_out(struct task_struct *task, int cpu);
+extern void perf_counter_task_tick(struct task_struct *task, int cpu);
+extern void perf_counter_init_task(struct task_struct *task);
+extern void perf_counter_notify(struct pt_regs *regs);
+extern void perf_counter_print_debug(void);
+extern u64 hw_perf_save_disable(void);
+extern void hw_perf_restore(u64 ctrl);
+extern void atomic64_counter_set(struct perf_counter *counter, u64 val64);
+extern u64 atomic64_counter_read(struct perf_counter *counter);
+extern int perf_counter_task_disable(void);
+extern int perf_counter_task_enable(void);
+
+#else
+static inline void
+perf_counter_task_sched_in(struct task_struct *task, int cpu)		{ }
+static inline void
+perf_counter_task_sched_out(struct task_struct *task, int cpu)		{ }
+static inline void
+perf_counter_task_tick(struct task_struct *task, int cpu)		{ }
+static inline void perf_counter_init_task(struct task_struct *task)	{ }
+static inline void perf_counter_notify(struct pt_regs *regs)		{ }
+static inline void perf_counter_print_debug(void)			{ }
+static inline void hw_perf_restore(u64 ctrl)			{ }
+static inline u64 hw_perf_save_disable(void)		      { return 0; }
+static inline int perf_counter_task_disable(void)	{ return -EINVAL; }
+static inline int perf_counter_task_enable(void)	{ return -EINVAL; }
+#endif
+
+#endif /* _LINUX_PERF_COUNTER_H */
diff --git a/include/linux/prctl.h b/include/linux/prctl.h
index 48d887e..b00df4c 100644
--- a/include/linux/prctl.h
+++ b/include/linux/prctl.h
@@ -85,4 +85,7 @@
 #define PR_SET_TIMERSLACK 29
 #define PR_GET_TIMERSLACK 30
 
+#define PR_TASK_PERF_COUNTERS_DISABLE		31
+#define PR_TASK_PERF_COUNTERS_ENABLE		32
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 55e30d1..4c53027 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -71,6 +71,7 @@ struct sched_param {
 #include <linux/fs_struct.h>
 #include <linux/compiler.h>
 #include <linux/completion.h>
+#include <linux/perf_counter.h>
 #include <linux/pid.h>
 #include <linux/percpu.h>
 #include <linux/topology.h>
@@ -1326,6 +1327,7 @@ struct task_struct {
 	struct list_head pi_state_list;
 	struct futex_pi_state *pi_state_cache;
 #endif
+	struct perf_counter_context perf_counter_ctx;
 #ifdef CONFIG_NUMA
 	struct mempolicy *mempolicy;
 	short il_next;
@@ -2285,6 +2287,13 @@ static inline void inc_syscw(struct task_struct *tsk)
 #define TASK_SIZE_OF(tsk)	TASK_SIZE
 #endif
 
+/*
+ * Call the function if the target task is executing on a CPU right now:
+ */
+extern void task_oncpu_function_call(struct task_struct *p,
+				     void (*func) (void *info), void *info);
+
+
 #ifdef CONFIG_MM_OWNER
 extern void mm_update_next_owner(struct mm_struct *mm);
 extern void mm_init_owner(struct mm_struct *mm, struct task_struct *p);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 04fb47b..a549678 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -54,6 +54,7 @@ struct compat_stat;
 struct compat_timeval;
 struct robust_list_head;
 struct getcpu_cache;
+struct perf_counter_hw_event;
 
 #include <linux/types.h>
 #include <linux/aio_abi.h>
@@ -624,4 +625,11 @@ asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
 
 int kernel_execve(const char *filename, char *const argv[], char *const envp[]);
 
+
+asmlinkage int sys_perf_counter_open(
+
+	struct perf_counter_hw_event	*hw_event_uptr		__user,
+	pid_t				pid,
+	int				cpu,
+	int				group_fd);
 #endif
diff --git a/init/Kconfig b/init/Kconfig
index f763762..7d147a3 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -732,6 +732,36 @@ config AIO
           by some high performance threaded applications. Disabling
           this option saves about 7k.
 
+config HAVE_PERF_COUNTERS
+	bool
+
+menu "Performance Counters"
+
+config PERF_COUNTERS
+	bool "Kernel Performance Counters"
+	depends on HAVE_PERF_COUNTERS
+	default y
+	select ANON_INODES
+	help
+	  Enable kernel support for performance counter hardware.
+
+	  Performance counters are special hardware registers available
+	  on most modern CPUs. These registers count the number of certain
+	  types of hw events: such as instructions executed, cachemisses
+	  suffered, or branches mis-predicted - without slowing down the
+	  kernel or applications. These registers can also trigger interrupts
+	  when a threshold number of events have passed - and can thus be
+	  used to profile the code that runs on that CPU.
+
+	  The Linux Performance Counter subsystem provides an abstraction of
+	  these hardware capabilities, available via a system call. It
+	  provides per task and per CPU counters, and it provides event
+	  capabilities on top of those.
+
+	  Say Y if unsure.
+
+endmenu
+
 config VM_EVENT_COUNTERS
 	default y
 	bool "Enable VM event counters for /proc/vmstat" if EMBEDDED
diff --git a/kernel/Makefile b/kernel/Makefile
index 19fad00..1f184a1 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -89,6 +89,7 @@ obj-$(CONFIG_HAVE_GENERIC_DMA_COHERENT) += dma-coherent.o
 obj-$(CONFIG_FUNCTION_TRACER) += trace/
 obj-$(CONFIG_TRACING) += trace/
 obj-$(CONFIG_SMP) += sched_cpupri.o
+obj-$(CONFIG_PERF_COUNTERS) += perf_counter.o
 
 ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y)
 # According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is
diff --git a/kernel/fork.c b/kernel/fork.c
index 2a372a0..441fadf 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -975,6 +975,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 		goto fork_out;
 
 	rt_mutex_init_task(p);
+	perf_counter_init_task(p);
 
 #ifdef CONFIG_PROVE_LOCKING
 	DEBUG_LOCKS_WARN_ON(!p->hardirqs_enabled);
diff --git a/kernel/perf_counter.c b/kernel/perf_counter.c
new file mode 100644
index 0000000..559130b
--- /dev/null
+++ b/kernel/perf_counter.c
@@ -0,0 +1,1266 @@
+/*
+ * Performance counter core code
+ *
+ *  Copyright(C) 2008 Thomas Gleixner <tglx@linutronix.de>
+ *  Copyright(C) 2008 Red Hat, Inc., Ingo Molnar
+ *
+ *  For licencing details see kernel-base/COPYING
+ */
+
+#include <linux/fs.h>
+#include <linux/cpu.h>
+#include <linux/smp.h>
+#include <linux/file.h>
+#include <linux/poll.h>
+#include <linux/sysfs.h>
+#include <linux/ptrace.h>
+#include <linux/percpu.h>
+#include <linux/uaccess.h>
+#include <linux/syscalls.h>
+#include <linux/anon_inodes.h>
+#include <linux/perf_counter.h>
+
+/*
+ * Each CPU has a list of per CPU counters:
+ */
+DEFINE_PER_CPU(struct perf_cpu_context, perf_cpu_context);
+
+int perf_max_counters __read_mostly;
+static int perf_reserved_percpu __read_mostly;
+static int perf_overcommit __read_mostly = 1;
+
+/*
+ * Mutex for (sysadmin-configurable) counter reservations:
+ */
+static DEFINE_MUTEX(perf_resource_mutex);
+
+/*
+ * Architecture provided APIs - weak aliases:
+ */
+extern __weak const struct hw_perf_counter_ops *
+hw_perf_counter_init(struct perf_counter *counter)
+{
+	return ERR_PTR(-EINVAL);
+}
+
+u64 __weak hw_perf_save_disable(void)		{ return 0; }
+void __weak hw_perf_restore(u64 ctrl)	{ }
+void __weak hw_perf_counter_setup(void)		{ }
+
+#if BITS_PER_LONG == 64
+
+/*
+ * Read the cached counter value in counter->count, safe against cross
+ * CPU / NMI modifications. 64-bit version - no complications.
+ */
+static inline u64 perf_counter_read_safe(struct perf_counter *counter)
+{
+	return (u64) atomic64_read(&counter->count);
+}
+
+void atomic64_counter_set(struct perf_counter *counter, u64 val)
+{
+	atomic64_set(&counter->count, val);
+}
+
+u64 atomic64_counter_read(struct perf_counter *counter)
+{
+	return atomic64_read(&counter->count);
+}
+
+#else
+
+/*
+ * Read the cached counter value in counter->count, safe against cross
+ * CPU / NMI modifications. 32-bit version.
+ */
+static u64 perf_counter_read_safe(struct perf_counter *counter)
+{
+	u32 cntl, cnth;
+
+	local_irq_disable();
+	do {
+		cnth = atomic_read(&counter->count32[1]);
+		cntl = atomic_read(&counter->count32[0]);
+	} while (cnth != atomic_read(&counter->count32[1]));
+
+	local_irq_enable();
+
+	return cntl | ((u64) cnth) << 32;
+}
+
+void atomic64_counter_set(struct perf_counter *counter, u64 val64)
+{
+	u32 *val32 = (void *)&val64;
+
+	atomic_set(counter->count32 + 0, *(val32 + 0));
+	atomic_set(counter->count32 + 1, *(val32 + 1));
+}
+
+u64 atomic64_counter_read(struct perf_counter *counter)
+{
+	return atomic_read(counter->count32 + 0) |
+		(u64) atomic_read(counter->count32 + 1) << 32;
+}
+
+#endif
+
+static void
+list_add_counter(struct perf_counter *counter, struct perf_counter_context *ctx)
+{
+	struct perf_counter *group_leader = counter->group_leader;
+
+	/*
+	 * Depending on whether it is a standalone or sibling counter,
+	 * add it straight to the context's counter list, or to the group
+	 * leader's sibling list:
+	 */
+	if (group_leader == counter)
+		list_add_tail(&counter->list_entry, &ctx->counter_list);
+	else
+		list_add_tail(&counter->list_entry, &group_leader->sibling_list);
+}
+
+static void
+list_del_counter(struct perf_counter *counter, struct perf_counter_context *ctx)
+{
+	struct perf_counter *sibling, *tmp;
+
+	list_del_init(&counter->list_entry);
+
+	/*
+	 * If this was a group counter with sibling counters then
+	 * upgrade the siblings to singleton counters by adding them
+	 * to the context list directly:
+	 */
+	list_for_each_entry_safe(sibling, tmp,
+				 &counter->sibling_list, list_entry) {
+
+		list_del_init(&sibling->list_entry);
+		list_add_tail(&sibling->list_entry, &ctx->counter_list);
+		WARN_ON_ONCE(!sibling->group_leader);
+		WARN_ON_ONCE(sibling->group_leader == sibling);
+		sibling->group_leader = sibling;
+	}
+}
+
+/*
+ * Cross CPU call to remove a performance counter
+ *
+ * We disable the counter on the hardware level first. After that we
+ * remove it from the context list.
+ */
+static void __perf_counter_remove_from_context(void *info)
+{
+	struct perf_cpu_context *cpuctx = &__get_cpu_var(perf_cpu_context);
+	struct perf_counter *counter = info;
+	struct perf_counter_context *ctx = counter->ctx;
+	u64 perf_flags;
+
+	/*
+	 * If this is a task context, we need to check whether it is
+	 * the current task context of this cpu. If not it has been
+	 * scheduled out before the smp call arrived.
+	 */
+	if (ctx->task && cpuctx->task_ctx != ctx)
+		return;
+
+	spin_lock(&ctx->lock);
+
+	if (counter->state == PERF_COUNTER_STATE_ACTIVE) {
+		counter->hw_ops->hw_perf_counter_disable(counter);
+		counter->state = PERF_COUNTER_STATE_INACTIVE;
+		ctx->nr_active--;
+		cpuctx->active_oncpu--;
+		counter->task = NULL;
+	}
+	ctx->nr_counters--;
+
+	/*
+	 * Protect the list operation against NMI by disabling the
+	 * counters on a global level. NOP for non NMI based counters.
+	 */
+	perf_flags = hw_perf_save_disable();
+	list_del_counter(counter, ctx);
+	hw_perf_restore(perf_flags);
+
+	if (!ctx->task) {
+		/*
+		 * Allow more per task counters with respect to the
+		 * reservation:
+		 */
+		cpuctx->max_pertask =
+			min(perf_max_counters - ctx->nr_counters,
+			    perf_max_counters - perf_reserved_percpu);
+	}
+
+	spin_unlock(&ctx->lock);
+}
+
+
+/*
+ * Remove the counter from a task's (or a CPU's) list of counters.
+ *
+ * Must be called with counter->mutex held.
+ *
+ * CPU counters are removed with a smp call. For task counters we only
+ * call when the task is on a CPU.
+ */
+static void perf_counter_remove_from_context(struct perf_counter *counter)
+{
+	struct perf_counter_context *ctx = counter->ctx;
+	struct task_struct *task = ctx->task;
+
+	if (!task) {
+		/*
+		 * Per cpu counters are removed via an smp call and
+		 * the removal is always successful.
+		 */
+		smp_call_function_single(counter->cpu,
+					 __perf_counter_remove_from_context,
+					 counter, 1);
+		return;
+	}
+
+retry:
+	task_oncpu_function_call(task, __perf_counter_remove_from_context,
+				 counter);
+
+	spin_lock_irq(&ctx->lock);
+	/*
+	 * If the context is active we need to retry the smp call.
+	 */
+	if (ctx->nr_active && !list_empty(&counter->list_entry)) {
+		spin_unlock_irq(&ctx->lock);
+		goto retry;
+	}
+
+	/*
+	 * The lock prevents this context from being scheduled in, so we
+	 * can remove the counter safely if the call above did not
+	 * succeed.
+	 */
+	if (!list_empty(&counter->list_entry)) {
+		ctx->nr_counters--;
+		list_del_counter(counter, ctx);
+		counter->task = NULL;
+	}
+	spin_unlock_irq(&ctx->lock);
+}
+
+/*
+ * Cross CPU call to install and enable a performance counter
+ */
+static void __perf_install_in_context(void *info)
+{
+	struct perf_cpu_context *cpuctx = &__get_cpu_var(perf_cpu_context);
+	struct perf_counter *counter = info;
+	struct perf_counter_context *ctx = counter->ctx;
+	int cpu = smp_processor_id();
+	u64 perf_flags;
+
+	/*
+	 * If this is a task context, we need to check whether it is
+	 * the current task context of this cpu. If not it has been
+	 * scheduled out before the smp call arrived.
+	 */
+	if (ctx->task && cpuctx->task_ctx != ctx)
+		return;
+
+	spin_lock(&ctx->lock);
+
+	/*
+	 * Protect the list operation against NMI by disabling the
+	 * counters on a global level. NOP for non NMI based counters.
+	 */
+	perf_flags = hw_perf_save_disable();
+	list_add_counter(counter, ctx);
+	hw_perf_restore(perf_flags);
+
+	ctx->nr_counters++;
+
+	if (cpuctx->active_oncpu < perf_max_counters) {
+		counter->hw_ops->hw_perf_counter_enable(counter);
+		counter->state = PERF_COUNTER_STATE_ACTIVE;
+		counter->oncpu = cpu;
+		ctx->nr_active++;
+		cpuctx->active_oncpu++;
+	}
+
+	if (!ctx->task && cpuctx->max_pertask)
+		cpuctx->max_pertask--;
+
+	spin_unlock(&ctx->lock);
+}
+
+/*
+ * Attach a performance counter to a context
+ *
+ * First we add the counter to the list with the hardware enable bit
+ * in counter->hw_config cleared.
+ *
+ * If the counter is attached to a task which is on a CPU we use a smp
+ * call to enable it in the task context. The task might have been
+ * scheduled away, but we check this in the smp call again.
+ */
+static void
+perf_install_in_context(struct perf_counter_context *ctx,
+			struct perf_counter *counter,
+			int cpu)
+{
+	struct task_struct *task = ctx->task;
+
+	counter->ctx = ctx;
+	if (!task) {
+		/*
+		 * Per cpu counters are installed via an smp call and
+		 * the install is always successful.
+		 */
+		smp_call_function_single(cpu, __perf_install_in_context,
+					 counter, 1);
+		return;
+	}
+
+	counter->task = task;
+retry:
+	task_oncpu_function_call(task, __perf_install_in_context,
+				 counter);
+
+	spin_lock_irq(&ctx->lock);
+	/*
+	 * If the context is active we need to retry the smp call.
+	 */
+	if (ctx->nr_active && list_empty(&counter->list_entry)) {
+		spin_unlock_irq(&ctx->lock);
+		goto retry;
+	}
+
+	/*
+	 * The lock prevents this context from being scheduled in, so we
+	 * can add the counter safely if the call above did not
+	 * succeed.
+	 */
+	if (list_empty(&counter->list_entry)) {
+		list_add_counter(counter, ctx);
+		ctx->nr_counters++;
+	}
+	spin_unlock_irq(&ctx->lock);
+}
+
+static void
+counter_sched_out(struct perf_counter *counter,
+		  struct perf_cpu_context *cpuctx,
+		  struct perf_counter_context *ctx)
+{
+	if (counter->state != PERF_COUNTER_STATE_ACTIVE)
+		return;
+
+	counter->hw_ops->hw_perf_counter_disable(counter);
+	counter->state = PERF_COUNTER_STATE_INACTIVE;
+	counter->oncpu = -1;
+
+	cpuctx->active_oncpu--;
+	ctx->nr_active--;
+}
+
+static void
+group_sched_out(struct perf_counter *group_counter,
+		struct perf_cpu_context *cpuctx,
+		struct perf_counter_context *ctx)
+{
+	struct perf_counter *counter;
+
+	counter_sched_out(group_counter, cpuctx, ctx);
+
+	/*
+	 * Schedule out siblings (if any):
+	 */
+	list_for_each_entry(counter, &group_counter->sibling_list, list_entry)
+		counter_sched_out(counter, cpuctx, ctx);
+}
+
+/*
+ * Called from scheduler to remove the counters of the current task,
+ * with interrupts disabled.
+ *
+ * We stop each counter and update the counter value in counter->count.
+ *
+ * This does not protect us against NMI, but hw_perf_counter_disable()
+ * sets the disabled bit in the control field of counter _before_
+ * accessing the counter control register. If an NMI hits, then it will
+ * not restart the counter.
+ */
+void perf_counter_task_sched_out(struct task_struct *task, int cpu)
+{
+	struct perf_cpu_context *cpuctx = &per_cpu(perf_cpu_context, cpu);
+	struct perf_counter_context *ctx = &task->perf_counter_ctx;
+	struct perf_counter *counter;
+
+	if (likely(!cpuctx->task_ctx))
+		return;
+
+	spin_lock(&ctx->lock);
+	if (ctx->nr_active) {
+		list_for_each_entry(counter, &ctx->counter_list, list_entry)
+			group_sched_out(counter, cpuctx, ctx);
+	}
+	spin_unlock(&ctx->lock);
+	cpuctx->task_ctx = NULL;
+}
+
+static void
+counter_sched_in(struct perf_counter *counter,
+		 struct perf_cpu_context *cpuctx,
+		 struct perf_counter_context *ctx,
+		 int cpu)
+{
+	if (counter->state == PERF_COUNTER_STATE_OFF)
+		return;
+
+	counter->hw_ops->hw_perf_counter_enable(counter);
+	counter->state = PERF_COUNTER_STATE_ACTIVE;
+	counter->oncpu = cpu;	/* TODO: put 'cpu' into cpuctx->cpu */
+
+	cpuctx->active_oncpu++;
+	ctx->nr_active++;
+}
+
+static void
+group_sched_in(struct perf_counter *group_counter,
+	       struct perf_cpu_context *cpuctx,
+	       struct perf_counter_context *ctx,
+	       int cpu)
+{
+	struct perf_counter *counter;
+
+	counter_sched_in(group_counter, cpuctx, ctx, cpu);
+
+	/*
+	 * Schedule in siblings as one group (if any):
+	 */
+	list_for_each_entry(counter, &group_counter->sibling_list, list_entry)
+		counter_sched_in(counter, cpuctx, ctx, cpu);
+}
+
+/*
+ * Called from scheduler to add the counters of the current task
+ * with interrupts disabled.
+ *
+ * We restore the counter value and then enable it.
+ *
+ * This does not protect us against NMI, but hw_perf_counter_enable()
+ * sets the enabled bit in the control field of counter _before_
+ * accessing the counter control register. If an NMI hits, then it will
+ * keep the counter running.
+ */
+void perf_counter_task_sched_in(struct task_struct *task, int cpu)
+{
+	struct perf_cpu_context *cpuctx = &per_cpu(perf_cpu_context, cpu);
+	struct perf_counter_context *ctx = &task->perf_counter_ctx;
+	struct perf_counter *counter;
+
+	if (likely(!ctx->nr_counters))
+		return;
+
+	spin_lock(&ctx->lock);
+	list_for_each_entry(counter, &ctx->counter_list, list_entry) {
+		if (ctx->nr_active == cpuctx->max_pertask)
+			break;
+
+		/*
+		 * Listen to the 'cpu' scheduling filter constraint
+		 * of counters:
+		 */
+		if (counter->cpu != -1 && counter->cpu != cpu)
+			continue;
+
+		group_sched_in(counter, cpuctx, ctx, cpu);
+	}
+	spin_unlock(&ctx->lock);
+
+	cpuctx->task_ctx = ctx;
+}
+
+int perf_counter_task_disable(void)
+{
+	struct task_struct *curr = current;
+	struct perf_counter_context *ctx = &curr->perf_counter_ctx;
+	struct perf_counter *counter;
+	u64 perf_flags;
+	int cpu;
+
+	if (likely(!ctx->nr_counters))
+		return 0;
+
+	local_irq_disable();
+	cpu = smp_processor_id();
+
+	perf_counter_task_sched_out(curr, cpu);
+
+	spin_lock(&ctx->lock);
+
+	/*
+	 * Disable all the counters:
+	 */
+	perf_flags = hw_perf_save_disable();
+
+	list_for_each_entry(counter, &ctx->counter_list, list_entry) {
+		WARN_ON_ONCE(counter->state == PERF_COUNTER_STATE_ACTIVE);
+		counter->state = PERF_COUNTER_STATE_OFF;
+	}
+	hw_perf_restore(perf_flags);
+
+	spin_unlock(&ctx->lock);
+
+	local_irq_enable();
+
+	return 0;
+}
+
+int perf_counter_task_enable(void)
+{
+	struct task_struct *curr = current;
+	struct perf_counter_context *ctx = &curr->perf_counter_ctx;
+	struct perf_counter *counter;
+	u64 perf_flags;
+	int cpu;
+
+	if (likely(!ctx->nr_counters))
+		return 0;
+
+	local_irq_disable();
+	cpu = smp_processor_id();
+
+	spin_lock(&ctx->lock);
+
+	/*
+	 * Enable all the counters:
+	 */
+	perf_flags = hw_perf_save_disable();
+
+	list_for_each_entry(counter, &ctx->counter_list, list_entry) {
+		if (counter->state != PERF_COUNTER_STATE_OFF)
+			continue;
+		counter->state = PERF_COUNTER_STATE_INACTIVE;
+	}
+	hw_perf_restore(perf_flags);
+
+	spin_unlock(&ctx->lock);
+
+	perf_counter_task_sched_in(curr, cpu);
+
+	local_irq_enable();
+
+	return 0;
+}
+
+void perf_counter_task_tick(struct task_struct *curr, int cpu)
+{
+	struct perf_counter_context *ctx = &curr->perf_counter_ctx;
+	struct perf_counter *counter;
+	u64 perf_flags;
+
+	if (likely(!ctx->nr_counters))
+		return;
+
+	perf_counter_task_sched_out(curr, cpu);
+
+	spin_lock(&ctx->lock);
+
+	/*
+	 * Rotate the first entry last (works just fine for group counters too):
+	 */
+	perf_flags = hw_perf_save_disable();
+	list_for_each_entry(counter, &ctx->counter_list, list_entry) {
+		list_del(&counter->list_entry);
+		list_add_tail(&counter->list_entry, &ctx->counter_list);
+		break;
+	}
+	hw_perf_restore(perf_flags);
+
+	spin_unlock(&ctx->lock);
+
+	perf_counter_task_sched_in(curr, cpu);
+}
+
+/*
+ * Initialize the perf_counter context in a task_struct:
+ */
+static void
+__perf_counter_init_context(struct perf_counter_context *ctx,
+			    struct task_struct *task)
+{
+	spin_lock_init(&ctx->lock);
+	INIT_LIST_HEAD(&ctx->counter_list);
+	ctx->nr_counters	= 0;
+	ctx->task		= task;
+}
+
+/*
+ * Initialize the perf_counter context in task_struct
+ */
+void perf_counter_init_task(struct task_struct *task)
+{
+	__perf_counter_init_context(&task->perf_counter_ctx, task);
+}
+
+/*
+ * Cross CPU call to read the hardware counter
+ */
+static void __hw_perf_counter_read(void *info)
+{
+	struct perf_counter *counter = info;
+
+	counter->hw_ops->hw_perf_counter_read(counter);
+}
+
+static u64 perf_counter_read(struct perf_counter *counter)
+{
+	/*
+	 * If counter is enabled and currently active on a CPU, update the
+	 * value in the counter structure:
+	 */
+	if (counter->state == PERF_COUNTER_STATE_ACTIVE) {
+		smp_call_function_single(counter->oncpu,
+					 __hw_perf_counter_read, counter, 1);
+	}
+
+	return perf_counter_read_safe(counter);
+}
+
+/*
+ * Cross CPU call to switch performance data pointers
+ */
+static void __perf_switch_irq_data(void *info)
+{
+	struct perf_cpu_context *cpuctx = &__get_cpu_var(perf_cpu_context);
+	struct perf_counter *counter = info;
+	struct perf_counter_context *ctx = counter->ctx;
+	struct perf_data *oldirqdata = counter->irqdata;
+
+	/*
+	 * If this is a task context, we need to check whether it is
+	 * the current task context of this cpu. If not it has been
+	 * scheduled out before the smp call arrived.
+	 */
+	if (ctx->task) {
+		if (cpuctx->task_ctx != ctx)
+			return;
+		spin_lock(&ctx->lock);
+	}
+
+	/* Change the pointer in an NMI-safe way */
+	atomic_long_set((atomic_long_t *)&counter->irqdata,
+			(unsigned long) counter->usrdata);
+	counter->usrdata = oldirqdata;
+
+	if (ctx->task)
+		spin_unlock(&ctx->lock);
+}
+
+static struct perf_data *perf_switch_irq_data(struct perf_counter *counter)
+{
+	struct perf_counter_context *ctx = counter->ctx;
+	struct perf_data *oldirqdata = counter->irqdata;
+	struct task_struct *task = ctx->task;
+
+	if (!task) {
+		smp_call_function_single(counter->cpu,
+					 __perf_switch_irq_data,
+					 counter, 1);
+		return counter->usrdata;
+	}
+
+retry:
+	spin_lock_irq(&ctx->lock);
+	if (counter->state != PERF_COUNTER_STATE_ACTIVE) {
+		counter->irqdata = counter->usrdata;
+		counter->usrdata = oldirqdata;
+		spin_unlock_irq(&ctx->lock);
+		return oldirqdata;
+	}
+	spin_unlock_irq(&ctx->lock);
+	task_oncpu_function_call(task, __perf_switch_irq_data, counter);
+	/* Might have failed, because task was scheduled out */
+	if (counter->irqdata == oldirqdata)
+		goto retry;
+
+	return counter->usrdata;
+}
+
+static void put_context(struct perf_counter_context *ctx)
+{
+	if (ctx->task)
+		put_task_struct(ctx->task);
+}
+
+static struct perf_counter_context *find_get_context(pid_t pid, int cpu)
+{
+	struct perf_cpu_context *cpuctx;
+	struct perf_counter_context *ctx;
+	struct task_struct *task;
+
+	/*
+	 * If cpu is not a wildcard then this is a percpu counter:
+	 */
+	if (cpu != -1) {
+		/* Must be root to operate on a CPU counter: */
+		if (!capable(CAP_SYS_ADMIN))
+			return ERR_PTR(-EACCES);
+
+		if (cpu < 0 || cpu > num_possible_cpus())
+			return ERR_PTR(-EINVAL);
+
+		/*
+		 * We could be clever and allow attaching a counter to an
+		 * offline CPU and activate it when the CPU comes up, but
+		 * that's for later.
+		 */
+		if (!cpu_isset(cpu, cpu_online_map))
+			return ERR_PTR(-ENODEV);
+
+		cpuctx = &per_cpu(perf_cpu_context, cpu);
+		ctx = &cpuctx->ctx;
+
+		WARN_ON_ONCE(ctx->task);
+		return ctx;
+	}
+
+	rcu_read_lock();
+	if (!pid)
+		task = current;
+	else
+		task = find_task_by_vpid(pid);
+	if (task)
+		get_task_struct(task);
+	rcu_read_unlock();
+
+	if (!task)
+		return ERR_PTR(-ESRCH);
+
+	ctx = &task->perf_counter_ctx;
+	ctx->task = task;
+
+	/* Reuse ptrace permission checks for now. */
+	if (!ptrace_may_access(task, PTRACE_MODE_READ)) {
+		put_context(ctx);
+		return ERR_PTR(-EACCES);
+	}
+
+	return ctx;
+}
+
+/*
+ * Called when the last reference to the file is gone.
+ */
+static int perf_release(struct inode *inode, struct file *file)
+{
+	struct perf_counter *counter = file->private_data;
+	struct perf_counter_context *ctx = counter->ctx;
+
+	file->private_data = NULL;
+
+	mutex_lock(&counter->mutex);
+
+	perf_counter_remove_from_context(counter);
+	put_context(ctx);
+
+	mutex_unlock(&counter->mutex);
+
+	kfree(counter);
+
+	return 0;
+}
+
+/*
+ * Read the performance counter - simple non-blocking version for now
+ */
+static ssize_t
+perf_read_hw(struct perf_counter *counter, char __user *buf, size_t count)
+{
+	u64 cntval;
+
+	if (count != sizeof(cntval))
+		return -EINVAL;
+
+	mutex_lock(&counter->mutex);
+	cntval = perf_counter_read(counter);
+	mutex_unlock(&counter->mutex);
+
+	return put_user(cntval, (u64 __user *) buf) ? -EFAULT : sizeof(cntval);
+}
+
+static ssize_t
+perf_copy_usrdata(struct perf_data *usrdata, char __user *buf, size_t count)
+{
+	if (!usrdata->len)
+		return 0;
+
+	count = min(count, (size_t)usrdata->len);
+	if (copy_to_user(buf, usrdata->data + usrdata->rd_idx, count))
+		return -EFAULT;
+
+	/* Adjust the counters */
+	usrdata->len -= count;
+	if (!usrdata->len)
+		usrdata->rd_idx = 0;
+	else
+		usrdata->rd_idx += count;
+
+	return count;
+}
+
+static ssize_t
+perf_read_irq_data(struct perf_counter	*counter,
+		   char __user		*buf,
+		   size_t		count,
+		   int			nonblocking)
+{
+	struct perf_data *irqdata, *usrdata;
+	DECLARE_WAITQUEUE(wait, current);
+	ssize_t res;
+
+	irqdata = counter->irqdata;
+	usrdata = counter->usrdata;
+
+	if (usrdata->len + irqdata->len >= count)
+		goto read_pending;
+
+	if (nonblocking)
+		return -EAGAIN;
+
+	spin_lock_irq(&counter->waitq.lock);
+	__add_wait_queue(&counter->waitq, &wait);
+	for (;;) {
+		set_current_state(TASK_INTERRUPTIBLE);
+		if (usrdata->len + irqdata->len >= count)
+			break;
+
+		if (signal_pending(current))
+			break;
+
+		spin_unlock_irq(&counter->waitq.lock);
+		schedule();
+		spin_lock_irq(&counter->waitq.lock);
+	}
+	__remove_wait_queue(&counter->waitq, &wait);
+	__set_current_state(TASK_RUNNING);
+	spin_unlock_irq(&counter->waitq.lock);
+
+	if (usrdata->len + irqdata->len < count)
+		return -ERESTARTSYS;
+read_pending:
+	mutex_lock(&counter->mutex);
+
+	/* Drain pending data first: */
+	res = perf_copy_usrdata(usrdata, buf, count);
+	if (res < 0 || res == count)
+		goto out;
+
+	/* Switch irq buffer: */
+	usrdata = perf_switch_irq_data(counter);
+	if (perf_copy_usrdata(usrdata, buf + res, count - res) < 0) {
+		if (!res)
+			res = -EFAULT;
+	} else {
+		res = count;
+	}
+out:
+	mutex_unlock(&counter->mutex);
+
+	return res;
+}
+
+static ssize_t
+perf_read(struct file *file, char __user *buf, size_t count, loff_t *ppos)
+{
+	struct perf_counter *counter = file->private_data;
+
+	switch (counter->hw_event.record_type) {
+	case PERF_RECORD_SIMPLE:
+		return perf_read_hw(counter, buf, count);
+
+	case PERF_RECORD_IRQ:
+	case PERF_RECORD_GROUP:
+		return perf_read_irq_data(counter, buf, count,
+					  file->f_flags & O_NONBLOCK);
+	}
+	return -EINVAL;
+}
+
+static unsigned int perf_poll(struct file *file, poll_table *wait)
+{
+	struct perf_counter *counter = file->private_data;
+	unsigned int events = 0;
+	unsigned long flags;
+
+	poll_wait(file, &counter->waitq, wait);
+
+	spin_lock_irqsave(&counter->waitq.lock, flags);
+	if (counter->usrdata->len || counter->irqdata->len)
+		events |= POLLIN;
+	spin_unlock_irqrestore(&counter->waitq.lock, flags);
+
+	return events;
+}
+
+static const struct file_operations perf_fops = {
+	.release		= perf_release,
+	.read			= perf_read,
+	.poll			= perf_poll,
+};
+
+static void cpu_clock_perf_counter_enable(struct perf_counter *counter)
+{
+}
+
+static void cpu_clock_perf_counter_disable(struct perf_counter *counter)
+{
+}
+
+static void cpu_clock_perf_counter_read(struct perf_counter *counter)
+{
+	int cpu = raw_smp_processor_id();
+
+	atomic64_counter_set(counter, cpu_clock(cpu));
+}
+
+static const struct hw_perf_counter_ops perf_ops_cpu_clock = {
+	.hw_perf_counter_enable		= cpu_clock_perf_counter_enable,
+	.hw_perf_counter_disable	= cpu_clock_perf_counter_disable,
+	.hw_perf_counter_read		= cpu_clock_perf_counter_read,
+};
+
+static void task_clock_perf_counter_enable(struct perf_counter *counter)
+{
+}
+
+static void task_clock_perf_counter_disable(struct perf_counter *counter)
+{
+}
+
+static void task_clock_perf_counter_read(struct perf_counter *counter)
+{
+	atomic64_counter_set(counter, current->se.sum_exec_runtime);
+}
+
+static const struct hw_perf_counter_ops perf_ops_task_clock = {
+	.hw_perf_counter_enable		= task_clock_perf_counter_enable,
+	.hw_perf_counter_disable	= task_clock_perf_counter_disable,
+	.hw_perf_counter_read		= task_clock_perf_counter_read,
+};
+
+static const struct hw_perf_counter_ops *
+sw_perf_counter_init(struct perf_counter *counter)
+{
+	const struct hw_perf_counter_ops *hw_ops = NULL;
+
+	switch (counter->hw_event.type) {
+	case PERF_COUNT_CPU_CLOCK:
+		hw_ops = &perf_ops_cpu_clock;
+		break;
+	case PERF_COUNT_TASK_CLOCK:
+		hw_ops = &perf_ops_task_clock;
+		break;
+	default:
+		break;
+	}
+	return hw_ops;
+}
+
+/*
+ * Allocate and initialize a counter structure
+ */
+static struct perf_counter *
+perf_counter_alloc(struct perf_counter_hw_event *hw_event,
+		   int cpu,
+		   struct perf_counter *group_leader)
+{
+	const struct hw_perf_counter_ops *hw_ops;
+	struct perf_counter *counter;
+
+	counter = kzalloc(sizeof(*counter), GFP_KERNEL);
+	if (!counter)
+		return NULL;
+
+	/*
+	 * Single counters are their own group leaders, with an
+	 * empty sibling list:
+	 */
+	if (!group_leader)
+		group_leader = counter;
+
+	mutex_init(&counter->mutex);
+	INIT_LIST_HEAD(&counter->list_entry);
+	INIT_LIST_HEAD(&counter->sibling_list);
+	init_waitqueue_head(&counter->waitq);
+
+	counter->irqdata		= &counter->data[0];
+	counter->usrdata		= &counter->data[1];
+	counter->cpu			= cpu;
+	counter->hw_event		= *hw_event;
+	counter->wakeup_pending		= 0;
+	counter->group_leader		= group_leader;
+	counter->hw_ops			= NULL;
+
+	hw_ops = NULL;
+	if (!hw_event->raw && hw_event->type < 0)
+		hw_ops = sw_perf_counter_init(counter);
+	if (!hw_ops)
+		hw_ops = hw_perf_counter_init(counter);
+
+	if (!hw_ops) {
+		kfree(counter);
+		return NULL;
+	}
+	counter->hw_ops = hw_ops;
+
+	return counter;
+}
+
+/**
+ * sys_perf_counter_open - open a performance counter, associate it to a task/cpu
+ *
+ * @hw_event_uptr:	event type attributes for monitoring/sampling
+ * @pid:		target pid
+ * @cpu:		target cpu
+ * @group_fd:		group leader counter fd
+ */
+asmlinkage int
+sys_perf_counter_open(struct perf_counter_hw_event *hw_event_uptr __user,
+		      pid_t pid, int cpu, int group_fd)
+{
+	struct perf_counter *counter, *group_leader;
+	struct perf_counter_hw_event hw_event;
+	struct perf_counter_context *ctx;
+	struct file *group_file = NULL;
+	int fput_needed = 0;
+	int ret;
+
+	if (copy_from_user(&hw_event, hw_event_uptr, sizeof(hw_event)) != 0)
+		return -EFAULT;
+
+	/*
+	 * Get the target context (task or percpu):
+	 */
+	ctx = find_get_context(pid, cpu);
+	if (IS_ERR(ctx))
+		return PTR_ERR(ctx);
+
+	/*
+	 * Look up the group leader (we will attach this counter to it):
+	 */
+	group_leader = NULL;
+	if (group_fd != -1) {
+		ret = -EINVAL;
+		group_file = fget_light(group_fd, &fput_needed);
+		if (!group_file)
+			goto err_put_context;
+		if (group_file->f_op != &perf_fops)
+			goto err_put_context;
+
+		group_leader = group_file->private_data;
+		/*
+		 * Do not allow a recursive hierarchy (this new sibling
+		 * becoming part of another group-sibling):
+		 */
+		if (group_leader->group_leader != group_leader)
+			goto err_put_context;
+		/*
+		 * Do not allow to attach to a group in a different
+		 * task or CPU context:
+		 */
+		if (group_leader->ctx != ctx)
+			goto err_put_context;
+	}
+
+	ret = -EINVAL;
+	counter = perf_counter_alloc(&hw_event, cpu, group_leader);
+	if (!counter)
+		goto err_put_context;
+
+	perf_install_in_context(ctx, counter, cpu);
+
+	ret = anon_inode_getfd("[perf_counter]", &perf_fops, counter, 0);
+	if (ret < 0)
+		goto err_remove_free_put_context;
+
+out_fput:
+	fput_light(group_file, fput_needed);
+
+	return ret;
+
+err_remove_free_put_context:
+	mutex_lock(&counter->mutex);
+	perf_counter_remove_from_context(counter);
+	mutex_unlock(&counter->mutex);
+	kfree(counter);
+
+err_put_context:
+	put_context(ctx);
+
+	goto out_fput;
+}
+
+static void __cpuinit perf_counter_init_cpu(int cpu)
+{
+	struct perf_cpu_context *cpuctx;
+
+	cpuctx = &per_cpu(perf_cpu_context, cpu);
+	__perf_counter_init_context(&cpuctx->ctx, NULL);
+
+	mutex_lock(&perf_resource_mutex);
+	cpuctx->max_pertask = perf_max_counters - perf_reserved_percpu;
+	mutex_unlock(&perf_resource_mutex);
+
+	hw_perf_counter_setup();
+}
+
+#ifdef CONFIG_HOTPLUG_CPU
+static void __perf_counter_exit_cpu(void *info)
+{
+	struct perf_cpu_context *cpuctx = &__get_cpu_var(perf_cpu_context);
+	struct perf_counter_context *ctx = &cpuctx->ctx;
+	struct perf_counter *counter, *tmp;
+
+	list_for_each_entry_safe(counter, tmp, &ctx->counter_list, list_entry)
+		__perf_counter_remove_from_context(counter);
+}
+
+static void perf_counter_exit_cpu(int cpu)
+{
+	smp_call_function_single(cpu, __perf_counter_exit_cpu, NULL, 1);
+}
+#else
+static inline void perf_counter_exit_cpu(int cpu) { }
+#endif
+
+static int __cpuinit
+perf_cpu_notify(struct notifier_block *self, unsigned long action, void *hcpu)
+{
+	unsigned int cpu = (long)hcpu;
+
+	switch (action) {
+
+	case CPU_UP_PREPARE:
+	case CPU_UP_PREPARE_FROZEN:
+		perf_counter_init_cpu(cpu);
+		break;
+
+	case CPU_DOWN_PREPARE:
+	case CPU_DOWN_PREPARE_FROZEN:
+		perf_counter_exit_cpu(cpu);
+		break;
+
+	default:
+		break;
+	}
+
+	return NOTIFY_OK;
+}
+
+static struct notifier_block __cpuinitdata perf_cpu_nb = {
+	.notifier_call		= perf_cpu_notify,
+};
+
+static int __init perf_counter_init(void)
+{
+	perf_cpu_notify(&perf_cpu_nb, (unsigned long)CPU_UP_PREPARE,
+			(void *)(long)smp_processor_id());
+	register_cpu_notifier(&perf_cpu_nb);
+
+	return 0;
+}
+early_initcall(perf_counter_init);
+
+static ssize_t perf_show_reserve_percpu(struct sysdev_class *class, char *buf)
+{
+	return sprintf(buf, "%d\n", perf_reserved_percpu);
+}
+
+static ssize_t
+perf_set_reserve_percpu(struct sysdev_class *class,
+			const char *buf,
+			size_t count)
+{
+	struct perf_cpu_context *cpuctx;
+	unsigned long val;
+	int err, cpu, mpt;
+
+	err = strict_strtoul(buf, 10, &val);
+	if (err)
+		return err;
+	if (val > perf_max_counters)
+		return -EINVAL;
+
+	mutex_lock(&perf_resource_mutex);
+	perf_reserved_percpu = val;
+	for_each_online_cpu(cpu) {
+		cpuctx = &per_cpu(perf_cpu_context, cpu);
+		spin_lock_irq(&cpuctx->ctx.lock);
+		mpt = min(perf_max_counters - cpuctx->ctx.nr_counters,
+			  perf_max_counters - perf_reserved_percpu);
+		cpuctx->max_pertask = mpt;
+		spin_unlock_irq(&cpuctx->ctx.lock);
+	}
+	mutex_unlock(&perf_resource_mutex);
+
+	return count;
+}
+
+static ssize_t perf_show_overcommit(struct sysdev_class *class, char *buf)
+{
+	return sprintf(buf, "%d\n", perf_overcommit);
+}
+
+static ssize_t
+perf_set_overcommit(struct sysdev_class *class, const char *buf, size_t count)
+{
+	unsigned long val;
+	int err;
+
+	err = strict_strtoul(buf, 10, &val);
+	if (err)
+		return err;
+	if (val > 1)
+		return -EINVAL;
+
+	mutex_lock(&perf_resource_mutex);
+	perf_overcommit = val;
+	mutex_unlock(&perf_resource_mutex);
+
+	return count;
+}
+
+static SYSDEV_CLASS_ATTR(
+				reserve_percpu,
+				0644,
+				perf_show_reserve_percpu,
+				perf_set_reserve_percpu
+			);
+
+static SYSDEV_CLASS_ATTR(
+				overcommit,
+				0644,
+				perf_show_overcommit,
+				perf_set_overcommit
+			);
+
+static struct attribute *perfclass_attrs[] = {
+	&attr_reserve_percpu.attr,
+	&attr_overcommit.attr,
+	NULL
+};
+
+static struct attribute_group perfclass_attr_group = {
+	.attrs			= perfclass_attrs,
+	.name			= "perf_counters",
+};
+
+static int __init perf_counter_sysfs_init(void)
+{
+	return sysfs_create_group(&cpu_sysdev_class.kset.kobj,
+				  &perfclass_attr_group);
+}
+device_initcall(perf_counter_sysfs_init);
+
diff --git a/kernel/sched.c b/kernel/sched.c
index b7480fb..254d56d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2212,6 +2212,27 @@ static int sched_balance_self(int cpu, int flag)
 
 #endif /* CONFIG_SMP */
 
+/**
+ * task_oncpu_function_call - call a function on the cpu on which a task runs
+ * @p:		the task to evaluate
+ * @func:	the function to be called
+ * @info:	the function call argument
+ *
+ * Calls the function @func when the task is currently running. This might
+ * be on the current CPU, in which case the function is called directly.
+ */
+void task_oncpu_function_call(struct task_struct *p,
+			      void (*func) (void *info), void *info)
+{
+	int cpu;
+
+	preempt_disable();
+	cpu = task_cpu(p);
+	if (task_curr(p))
+		smp_call_function_single(cpu, func, info, 1);
+	preempt_enable();
+}
+
 /***
  * try_to_wake_up - wake up a thread
  * @p: the to-be-woken-up thread
@@ -2534,6 +2555,7 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
 		    struct task_struct *next)
 {
 	fire_sched_out_preempt_notifiers(prev, next);
+	perf_counter_task_sched_out(prev, cpu_of(rq));
 	prepare_lock_switch(rq, next);
 	prepare_arch_switch(next);
 }
@@ -2574,6 +2596,7 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
 	 */
 	prev_state = prev->state;
 	finish_arch_switch(prev);
+	perf_counter_task_sched_in(current, cpu_of(rq));
 	finish_lock_switch(rq, prev);
 #ifdef CONFIG_SMP
 	if (current->sched_class->post_schedule)
@@ -4296,6 +4319,7 @@ void scheduler_tick(void)
 	rq->idle_at_tick = idle_cpu(cpu);
 	trigger_load_balance(rq, cpu);
 #endif
+	perf_counter_task_tick(curr, cpu);
 }
 
 #if defined(CONFIG_PREEMPT) && (defined(CONFIG_DEBUG_PREEMPT) || \
diff --git a/kernel/sys.c b/kernel/sys.c
index 31deba8..0f66633 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -14,6 +14,7 @@
 #include <linux/prctl.h>
 #include <linux/highuid.h>
 #include <linux/fs.h>
+#include <linux/perf_counter.h>
 #include <linux/resource.h>
 #include <linux/kernel.h>
 #include <linux/kexec.h>
@@ -1716,6 +1717,12 @@ asmlinkage long sys_prctl(int option, unsigned long arg2, unsigned long arg3,
 		case PR_SET_TSC:
 			error = SET_TSC_CTL(arg2);
 			break;
+		case PR_TASK_PERF_COUNTERS_DISABLE:
+			error = perf_counter_task_disable();
+			break;
+		case PR_TASK_PERF_COUNTERS_ENABLE:
+			error = perf_counter_task_enable();
+			break;
 		case PR_GET_TIMERSLACK:
 			error = current->timer_slack_ns;
 			break;
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index e14a232..4be8bbc 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -174,3 +174,6 @@ cond_syscall(compat_sys_timerfd_settime);
 cond_syscall(compat_sys_timerfd_gettime);
 cond_syscall(sys_eventfd);
 cond_syscall(sys_eventfd2);
+
+/* performance counters: */
+cond_syscall(sys_perf_counter_open);
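
As a rough illustration of how the user-space side of the above fits
together - the new syscall, a read() of the counter fd, and the prctl()
on/off switch - here is a minimal, hypothetical sketch. It assumes the
architecture part of the series wires up __NR_perf_counter_open, and that
the patched <linux/perf_counter.h> and <linux/prctl.h> headers export the
structures and PR_TASK_PERF_COUNTERS_* values added above; all error
handling is omitted:

  #include <stdio.h>
  #include <stdint.h>
  #include <unistd.h>
  #include <sys/prctl.h>		/* PR_TASK_PERF_COUNTERS_* assumed to come
					   in via the patched <linux/prctl.h> */
  #include <sys/syscall.h>
  #include <linux/perf_counter.h>

  /* hypothetical thin wrapper around the new syscall: */
  static int perf_counter_open(struct perf_counter_hw_event *hw_event,
			       pid_t pid, int cpu, int group_fd)
  {
	  return syscall(__NR_perf_counter_open, hw_event, pid, cpu, group_fd);
  }

  int main(void)
  {
	  struct perf_counter_hw_event hw_event = {
		  .type		= PERF_COUNT_INSTRUCTIONS,
		  .record_type	= PERF_RECORD_SIMPLE,
	  };
	  uint64_t count;
	  int fd;

	  /* counter attached to the current task, on whatever CPU it runs: */
	  fd = perf_counter_open(&hw_event, 0, -1, -1);

	  prctl(PR_TASK_PERF_COUNTERS_DISABLE);	/* exclude setup code     */
	  /* ... uninteresting setup ... */
	  prctl(PR_TASK_PERF_COUNTERS_ENABLE);	/* measure only this part */
	  /* ... workload ... */

	  read(fd, &count, sizeof(count));	/* total instruction count */
	  printf("instructions: %llu\n", (unsigned long long)count);

	  close(fd);
	  return 0;
  }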

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [patch] Performance Counters for Linux, v3
  2008-12-11 15:52 [patch] Performance Counters for Linux, v3 Ingo Molnar
@ 2008-12-11 18:02 ` Vince Weaver
  2008-12-12  8:25   ` Peter Zijlstra
  2008-12-11 18:35 ` Andrew Morton
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 52+ messages in thread
From: Vince Weaver @ 2008-12-11 18:02 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Thomas Gleixner, Andrew Morton, Stephane Eranian,
	Eric Dumazet, Robert Richter, Arjan van de Veen, Peter Anvin,
	Peter Zijlstra, Paul Mackerras, David S. Miller


Can someone tell me which performance counter implementation is likely to 
get merged into the Kernel?

I have at least 60 machines that I do regular performance counter work on. 
They involve Pentium Pro, Pentium II, 32-bit Athlon, 64-bit Athlon, 
Pentium 4, Pentium D, Core, Core2, Atom, MIPS R12k, Niagara T1, 
and PPC/Playstation 3.

Perfmon3 works for all of those 60 machines.  This new proposal works on
2 out of the 60.

Who is going to add support for all of those machines?  I've spent a lot 
of developer time getting perfmon going for all of those configurations. 
But why should I help out with this new inferior proposal?  It could all 
be another waste of time.

So I'd like someone to commit to some performance monitoring architecture. 
Otherwise we're going to waste thousands of hours of developer time around 
the world.  It's all pointless.

Also, my primary method of using counters is total aggregate count for a 
single user-space process.  So I use perfmon's pfmon tool to run an entire 
long-running program, gathering full stats only at the very end.  pfmon can 
do this with pretty much zero overhead (I have lots of data and a few 
publications using this method).  Can this new infrastructure do this?  I 
find the documentation/tools support to be very incomplete.

One comment on the patch.


> +	/*
> +	 * Common hardware events, generalized by the kernel:
> +	 */
> +	PERF_COUNT_CYCLES		=  0,
> +	PERF_COUNT_INSTRUCTIONS		=  1,
> +	PERF_COUNT_CACHE_REFERENCES	=  2,
> +	PERF_COUNT_CACHE_MISSES		=  3,
> +	PERF_COUNT_BRANCH_INSTRUCTIONS	=  4,
> +	PERF_COUNT_BRANCH_MISSES	=  5,

Many machines do not support these counts.  For example, Niagara T1 does 
not have a CYCLES count.  And good luck if you think you can easily come 
up with something meaningful for the various kind of CACHE_MISSES on the 
Pentium 4.  Also, the Pentium D has various flavors of retired instruction 
count with slightly different semantics.  This kind of abstraction should 
be done in userspace.

Vince

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch] Performance Counters for Linux, v3
  2008-12-11 15:52 [patch] Performance Counters for Linux, v3 Ingo Molnar
  2008-12-11 18:02 ` Vince Weaver
@ 2008-12-11 18:35 ` Andrew Morton
  2008-12-12  6:22   ` Ingo Molnar
  2008-12-11 19:11 ` Tony Luck
  2008-12-14 14:51 ` Performance counter API review was " Andi Kleen
  3 siblings, 1 reply; 52+ messages in thread
From: Andrew Morton @ 2008-12-11 18:35 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, tglx, eranian, dada1, robert.richter, arjan, hpa,
	a.p.zijlstra, paulus, davem

On Thu, 11 Dec 2008 16:52:30 +0100
Ingo Molnar <mingo@elte.hu> wrote:

> To: linux-kernel@vger.kernel.org
> Cc: Thomas Gleixner <tglx@linutronix.de>, Andrew Morton <akpm@linux-foundation.org>, Stephane Eranian <eranian@googlemail.com>, Eric Dumazet <dada1@cosmosbay.com>, Robert Richter <robert.richter@amd.com>, Arjan van de Veen <arjan@infradead.org>, Peter Anvin <hpa@zytor.com>, Peter Zijlstra <a.p.zijlstra@chello.nl>, Paul Mackerras <paulus@samba.org>, "David S. Miller" <davem@davemloft.net>

Please copy perfctr-devel@lists.sourceforge.net on all this.  That is where
the real-world people who use these facilities on a regular basis hang out.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch] Performance Counters for Linux, v3
  2008-12-11 15:52 [patch] Performance Counters for Linux, v3 Ingo Molnar
  2008-12-11 18:02 ` Vince Weaver
  2008-12-11 18:35 ` Andrew Morton
@ 2008-12-11 19:11 ` Tony Luck
  2008-12-11 19:34   ` Ingo Molnar
  2008-12-14 14:51 ` Performance counter API review was " Andi Kleen
  3 siblings, 1 reply; 52+ messages in thread
From: Tony Luck @ 2008-12-11 19:11 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Thomas Gleixner, Andrew Morton, Stephane Eranian,
	Eric Dumazet, Robert Richter, Arjan van de Veen, Peter Anvin,
	Peter Zijlstra, Paul Mackerras, David S. Miller

>        /*
>         * Special "software" counters provided by the kernel, even if
>         * the hardware does not support performance counters. These
>         * counters measure various physical and sw events of the
>         * kernel (and allow the profiling of them as well):
>         */
>        PERF_COUNT_CPU_CLOCK            = -1,
>        PERF_COUNT_TASK_CLOCK           = -2,
>        /*
>         * Future software events:
>         */
>        /* PERF_COUNT_PAGE_FAULTS       = -3,
>           PERF_COUNT_CONTEXT_SWITCHES  = -4, */

  ...
> +[ Note: more hw_event_types are supported as well, but they are CPU
> +  specific and are enumerated via /sys on a per CPU basis. Raw hw event
> +  types can be passed in as negative numbers. For example, to count
> +  "External bus cycles while bus lock signal asserted" events on Intel
> +  Core CPUs, pass in a -0x4064 event type value. ]

It looks like you have an overlap here.  You are using some negative numbers
to denote your special software events, but also as "raw" hardware events.
What if these conflict?

-Tony

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch] Performance Counters for Linux, v3
  2008-12-11 19:11 ` Tony Luck
@ 2008-12-11 19:34   ` Ingo Molnar
  2008-12-12  8:29     ` Peter Zijlstra
  0 siblings, 1 reply; 52+ messages in thread
From: Ingo Molnar @ 2008-12-11 19:34 UTC (permalink / raw)
  To: Tony Luck
  Cc: linux-kernel, Thomas Gleixner, Andrew Morton, Stephane Eranian,
	Eric Dumazet, Robert Richter, Arjan van de Veen, Peter Anvin,
	Peter Zijlstra, Paul Mackerras, David S. Miller


* Tony Luck <tony.luck@intel.com> wrote:

> >        /*
> >         * Special "software" counters provided by the kernel, even if
> >         * the hardware does not support performance counters. These
> >         * counters measure various physical and sw events of the
> >         * kernel (and allow the profiling of them as well):
> >         */
> >        PERF_COUNT_CPU_CLOCK            = -1,
> >        PERF_COUNT_TASK_CLOCK           = -2,
> >        /*
> >         * Future software events:
> >         */
> >        /* PERF_COUNT_PAGE_FAULTS       = -3,
> >           PERF_COUNT_CONTEXT_SWITCHES  = -4, */
> 
>   ...
> > +[ Note: more hw_event_types are supported as well, but they are CPU
> > +  specific and are enumerated via /sys on a per CPU basis. Raw hw event
> > +  types can be passed in as negative numbers. For example, to count
> > +  "External bus cycles while bus lock signal asserted" events on Intel
> > +  Core CPUs, pass in a -0x4064 event type value. ]
> 
> It looks like you have an overlap here.  You are using some negative 
> numbers to denote your special software events, but also as "raw" 
> hardware events. What if these conflict?

that's an old comment, not a bug in the code - thx for pointing it out, i 
just fixed the comments - see the commit below.

Raw events are now done without using up negative numbers, they are done 
via:

 struct perf_counter_hw_event {
        s64                     type;

        u64                     irq_period;
        u32                     record_type;

        u32                     disabled     :  1, /* off by default */
                                nmi          :  1, /* NMI sampling   */
                                raw          :  1, /* raw event type */
                                __reserved_1 : 29;

        u64                     __reserved_2;
 };

if the hw_event.raw bit is set to 1, then the hw_event.type is fully 
'raw'. The default is for raw to be 0. So negative numbers can be used 
for sw events, positive numbers for hw events. Both can be extended 
gradually, without arbitrary limits introduced.
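
For instance, to count the "External bus cycles while bus lock signal
asserted" event (raw code 0x4064) on an Intel Core CPU, the call would look
roughly like this - assuming a thin perf_counter_open() wrapper around the
syscall, as a sketch only:

   struct perf_counter_hw_event hw_event = {
	   .type		= 0x4064,	/* CPU-specific raw event code       */
	   .raw			= 1,		/* interpret .type as a raw hw event */
	   .record_type		= PERF_RECORD_SIMPLE,
   };
   int fd;

   fd = perf_counter_open(&hw_event, 0 /* current task */, -1 /* any cpu */, -1 /* no group */);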

	Ingo

------------------------->
>From 447557ac7ce120306b4a31d6003faef39cb1bf14 Mon Sep 17 00:00:00 2001
From: Ingo Molnar <mingo@elte.hu>
Date: Thu, 11 Dec 2008 20:40:18 +0100
Subject: [PATCH] perf counters: update docs

Impact: update docs

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 Documentation/perf-counters.txt |  107 +++++++++++++++++++++++++++------------
 1 files changed, 75 insertions(+), 32 deletions(-)

diff --git a/Documentation/perf-counters.txt b/Documentation/perf-counters.txt
index 19033a0..fddd321 100644
--- a/Documentation/perf-counters.txt
+++ b/Documentation/perf-counters.txt
@@ -10,8 +10,8 @@ trigger interrupts when a threshold number of events have passed - and can
 thus be used to profile the code that runs on that CPU.
 
 The Linux Performance Counter subsystem provides an abstraction of these
-hardware capabilities. It provides per task and per CPU counters, and
-it provides event capabilities on top of those.
+hardware capabilities. It provides per task and per CPU counters, counter
+groups, and it provides event capabilities on top of those.
 
 Performance counters are accessed via special file descriptors.
 There's one file descriptor per virtual counter used.
@@ -19,12 +19,8 @@ There's one file descriptor per virtual counter used.
 The special file descriptor is opened via the perf_counter_open()
 system call:
 
- int
- perf_counter_open(u32 hw_event_type,
-                   u32 hw_event_period,
-                   u32 record_type,
-                   pid_t pid,
-                   int cpu);
+   int sys_perf_counter_open(struct perf_counter_hw_event *hw_event_uptr,
+			     pid_t pid, int cpu, int group_fd);
 
 The syscall returns the new fd. The fd can be used via the normal
 VFS system calls: read() can be used to read the counter, fcntl()
@@ -33,39 +29,78 @@ can be used to set the blocking mode, etc.
 Multiple counters can be kept open at a time, and the counters
 can be poll()ed.
 
-When creating a new counter fd, 'hw_event_type' is one of:
-
- enum hw_event_types {
-	PERF_COUNT_CYCLES,
-	PERF_COUNT_INSTRUCTIONS,
-	PERF_COUNT_CACHE_REFERENCES,
-	PERF_COUNT_CACHE_MISSES,
-	PERF_COUNT_BRANCH_INSTRUCTIONS,
-	PERF_COUNT_BRANCH_MISSES,
- };
+When creating a new counter fd, 'perf_counter_hw_event' is:
+
+/*
+ * Hardware event to monitor via a performance monitoring counter:
+ */
+struct perf_counter_hw_event {
+	s64			type;
+
+	u64			irq_period;
+	u32			record_type;
+
+	u32			disabled     :  1, /* off by default */
+				nmi	     :  1, /* NMI sampling   */
+				raw	     :  1, /* raw event type */
+				__reserved_1 : 29;
+
+	u64			__reserved_2;
+};
+
+/*
+ * Generalized performance counter event types, used by the hw_event.type
+ * parameter of the sys_perf_counter_open() syscall:
+ */
+enum hw_event_types {
+	/*
+	 * Common hardware events, generalized by the kernel:
+	 */
+	PERF_COUNT_CYCLES		=  0,
+	PERF_COUNT_INSTRUCTIONS		=  1,
+	PERF_COUNT_CACHE_REFERENCES	=  2,
+	PERF_COUNT_CACHE_MISSES		=  3,
+	PERF_COUNT_BRANCH_INSTRUCTIONS	=  4,
+	PERF_COUNT_BRANCH_MISSES	=  5,
+
+	/*
+	 * Special "software" counters provided by the kernel, even if
+	 * the hardware does not support performance counters. These
+	 * counters measure various physical and sw events of the
+	 * kernel (and allow the profiling of them as well):
+	 */
+	PERF_COUNT_CPU_CLOCK		= -1,
+	PERF_COUNT_TASK_CLOCK		= -2,
+	/*
+	 * Future software events:
+	 */
+	/* PERF_COUNT_PAGE_FAULTS	= -3,
+	   PERF_COUNT_CONTEXT_SWITCHES	= -4, */
+};
 
 These are standardized types of events that work uniformly on all CPUs
 that implements Performance Counters support under Linux. If a CPU is
 not able to count branch-misses, then the system call will return
 -EINVAL.
 
-[ Note: more hw_event_types are supported as well, but they are CPU
-  specific and are enumerated via /sys on a per CPU basis. Raw hw event
-  types can be passed in as negative numbers. For example, to count
-  "External bus cycles while bus lock signal asserted" events on Intel
-  Core CPUs, pass in a -0x4064 event type value. ]
-
-The parameter 'hw_event_period' is the number of events before waking up
-a read() that is blocked on a counter fd. Zero value means a non-blocking
-counter.
+More hw_event_types are supported as well, but they are CPU
+specific and are enumerated via /sys on a per CPU basis. Raw hw event
+types can be passed in under hw_event.type if hw_event.raw is 1.
+For example, to count "External bus cycles while bus lock signal asserted"
+events on Intel Core CPUs, pass in a 0x4064 event type value and set
+hw_event.raw to 1.
 
 'record_type' is the type of data that a read() will provide for the
 counter, and it can be one of:
 
-  enum perf_record_type {
-	PERF_RECORD_SIMPLE,
-	PERF_RECORD_IRQ,
-  };
+/*
+ * IRQ-notification data record type:
+ */
+enum perf_counter_record_type {
+	PERF_RECORD_SIMPLE		=  0,
+	PERF_RECORD_IRQ			=  1,
+	PERF_RECORD_GROUP		=  2,
+};
 
 a "simple" counter is one that counts hardware events and allows
 them to be read out into a u64 count value. (read() returns 8 on
@@ -76,6 +111,10 @@ the IP of the interrupted context. In this case read() will return
 the 8-byte counter value, plus the Instruction Pointer address of the
 interrupted context.
 
+The parameter 'hw_event_period' is the number of events before waking up
+a read() that is blocked on a counter fd. Zero value means a non-blocking
+counter.
+
 The 'pid' parameter allows the counter to be specific to a task:
 
  pid == 0: if the pid parameter is zero, the counter is attached to the
@@ -92,7 +131,7 @@ CPU:
  cpu >= 0: the counter is restricted to a specific CPU
  cpu == -1: the counter counts on all CPUs
 
-Note: the combination of 'pid == -1' and 'cpu == -1' is not valid.
+(Note: the combination of 'pid == -1' and 'cpu == -1' is not valid.)
 
 A 'pid > 0' and 'cpu == -1' counter is a per task counter that counts
 events of that task and 'follows' that task to whatever CPU the task
@@ -102,3 +141,7 @@ their own tasks.
 A 'pid == -1' and 'cpu == x' counter is a per CPU counter that counts
 all events on CPU-x. Per CPU counters need CAP_SYS_ADMIN privilege.
 
+Group counters are created by passing in a group_fd of another counter.
+Groups are scheduled at once and can be used with PERF_RECORD_GROUP
+to record multi-dimensional timestamps.
+

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [patch] Performance Counters for Linux, v3
  2008-12-11 18:35 ` Andrew Morton
@ 2008-12-12  6:22   ` Ingo Molnar
  0 siblings, 0 replies; 52+ messages in thread
From: Ingo Molnar @ 2008-12-12  6:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, tglx, eranian, dada1, robert.richter, arjan, hpa,
	a.p.zijlstra, paulus, davem


* Andrew Morton <akpm@linux-foundation.org> wrote:

> On Thu, 11 Dec 2008 16:52:30 +0100
> Ingo Molnar <mingo@elte.hu> wrote:
> 
> > To: linux-kernel@vger.kernel.org
> > Cc: Thomas Gleixner <tglx@linutronix.de>, Andrew Morton <akpm@linux-foundation.org>, Stephane Eranian <eranian@googlemail.com>, Eric Dumazet <dada1@cosmosbay.com>, Robert Richter <robert.richter@amd.com>, Arjan van de Veen <arjan@infradead.org>, Peter Anvin <hpa@zytor.com>, Peter Zijlstra <a.p.zijlstra@chello.nl>, Paul Mackerras <paulus@samba.org>, "David S. Miller" <davem@davemloft.net>
> 
> Please copy perfctr-devel@lists.sourceforge.net on all this.  That is 
> where the real-world people who use these facilities on a regular basis 
> hang out.

Sure, we'll do that for v4.

The reason we kept posting this to lkml initially was because there is a 
visible detachment of this community from kernel developers. And that is 
at least in part because this stuff has never been made interesting 
enough to kernel developers. I don't remember a _single_ perfmon-generated 
profile (be that user-space or kernel-space) in my mailbox before - and 
optimizing the kernel is supposed to be one of the most important aspects 
of performance tuning.

That's why we concentrate on making this useful and interesting to kernel 
developers too via KernelTop, that's why we made the BTS/[PEBS] hardware 
tracer available via an ftrace plugin, etc.

Furthermore, kernel developers tend to be quite good at co-designing, 
influencing [and flaming ;-) ] such APIs at the early prototype stages, 
so the main early technical feedback we were looking for on the kernel 
side structure was lkml. But the wider community is not ignored either, 
of course - with v4 it might be useful already for wider circulation.

	Ingo

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch] Performance Counters for Linux, v3
  2008-12-11 18:02 ` Vince Weaver
@ 2008-12-12  8:25   ` Peter Zijlstra
  2008-12-12  8:35     ` stephane eranian
                       ` (3 more replies)
  0 siblings, 4 replies; 52+ messages in thread
From: Peter Zijlstra @ 2008-12-12  8:25 UTC (permalink / raw)
  To: Vince Weaver
  Cc: Ingo Molnar, linux-kernel, Thomas Gleixner, Andrew Morton,
	Stephane Eranian, Eric Dumazet, Robert Richter,
	Arjan van de Veen, Peter Anvin, Paul Mackerras, David S. Miller

On Thu, 2008-12-11 at 13:02 -0500, Vince Weaver wrote:

> I have at least 60 machines that I do regular performance counter work on. 
> They involve Pentium Pro, Pentium II, 32-bit Athlon, 64-bit Athlon, 
> Pentium 4, Pentium D, Core, Core2, Atom, MIPS R12k, Niagara T1, 
> and PPC/Playstation 3.

Good.

> Perfmon3 works for all of those 60 machines.  This new proposal works on
> 2 out of the 60.

s/works/is implemented/

> Who is going to add support for all of those machines?  I've spent a lot 
> of developer time getting perfmon going for all of those configurations. 
> But why should I help out with this new inferior proposal?  It could all 
> be another waste of time.

So much for constructive criticism.. have you tried taking the design to
its limits? If so, where do you see problems?

I read the above as: I invested a lot of time in something of dubious
status (an out-of-tree patch), and now expect it to be merged because I
have invested in it.

> Also, my primary method of using counters is total aggregate count for a 
> single user-space process. 

Process, as in single thread, or multi-threaded? I'll assume
single-thread.

> Can this new infrastructure do this? 

Yes, afaict it can.

You can group counters in v3; a read-out of such a group will be
atomic and provide vectored output that contains all the data
in one stream.
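
In the simplest case - one counter following a task for its whole runtime,
read once at the end, pfmon-style - that would look roughly like the sketch
below. (perf_counter_open() is assumed to be a thin wrapper around the new
syscall; error handling and the small window between fork() and counter
creation are ignored here.)

  #include <stdio.h>
  #include <stdint.h>
  #include <unistd.h>
  #include <sys/wait.h>
  #include <sys/syscall.h>
  #include <linux/perf_counter.h>

  static int perf_counter_open(struct perf_counter_hw_event *hw_event,
			       pid_t pid, int cpu, int group_fd)
  {
	  return syscall(__NR_perf_counter_open, hw_event, pid, cpu, group_fd);
  }

  int main(int argc, char **argv)
  {
	  struct perf_counter_hw_event hw_event = {
		  .type		= PERF_COUNT_INSTRUCTIONS,
		  .record_type	= PERF_RECORD_SIMPLE,
	  };
	  uint64_t total;
	  int fd, status;
	  pid_t child;

	  child = fork();
	  if (!child) {
		  execvp(argv[1], &argv[1]);	/* the measured program */
		  _exit(127);
	  }

	  /* per task counter - follows the child to whatever CPU it runs on: */
	  fd = perf_counter_open(&hw_event, child, -1 /* any cpu */, -1 /* no group */);

	  waitpid(child, &status, 0);
	  read(fd, &total, sizeof(total));	/* aggregate count for the task */
	  printf("instructions: %llu\n", (unsigned long long)total);

	  return 0;
  }

Opening further counters with group_fd set to the first counter's fd would
put them in the same group, so they get scheduled onto the PMU together.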

> I find the documentation/tools support to be very incomplete.

Gosh, what does one expect from something that is hardly a week old..

> One comment on the patch.
>
> > +	/*
> > +	 * Common hardware events, generalized by the kernel:
> > +	 */
> > +	PERF_COUNT_CYCLES		=  0,
> > +	PERF_COUNT_INSTRUCTIONS		=  1,
> > +	PERF_COUNT_CACHE_REFERENCES	=  2,
> > +	PERF_COUNT_CACHE_MISSES		=  3,
> > +	PERF_COUNT_BRANCH_INSTRUCTIONS	=  4,
> > +	PERF_COUNT_BRANCH_MISSES	=  5,
> 
> Many machines do not support these counts.  For example, Niagara T1 does 
> not have a CYCLES count.  And good luck if you think you can easily come 
> up with something meaningful for the various kind of CACHE_MISSES on the 
> Pentium 4.  Also, the Pentium D has various flavors of retired instruction 
> count with slightly different semantics.  This kind of abstraction should 
> be done in userspace.

I'll argue to disagree, sure such events might not be supported by any
particular hardware implementation - but the fact that PAPI gives a list
of 'common' events means that they are, well, common. So unifying them
between those archs that do implement them seems like a sane choice, no?

For those archs that do not support it, it will just fail to open. No
harm done.

The proposal allows for you to specify raw hardware events, so you can
just totally ignore this part of the abstraction.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch] Performance Counters for Linux, v3
  2008-12-11 19:34   ` Ingo Molnar
@ 2008-12-12  8:29     ` Peter Zijlstra
  2008-12-12  8:54       ` Ingo Molnar
  2008-12-12 13:42       ` Andi Kleen
  0 siblings, 2 replies; 52+ messages in thread
From: Peter Zijlstra @ 2008-12-12  8:29 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Tony Luck, linux-kernel, Thomas Gleixner, Andrew Morton,
	Stephane Eranian, Eric Dumazet, Robert Richter,
	Arjan van de Veen, Peter Anvin, Paul Mackerras, David S. Miller

On Thu, 2008-12-11 at 20:34 +0100, Ingo Molnar wrote:

>  struct perf_counter_hw_event {
>         s64                     type;
> 
>         u64                     irq_period;
>         u32                     record_type;
> 
>         u32                     disabled     :  1, /* off by default */
>                                 nmi          :  1, /* NMI sampling   */
>                                 raw          :  1, /* raw event type */
>                                 __reserved_1 : 29;
> 
>         u64                     __reserved_2;
>  };
> 
> if the hw_event.raw bit is set to 1, then the hw_event.type is fully 
> 'raw'. The default is for raw to be 0. So negative numbers can be used 
> for sw events, positive numbers for hw events. Both can be extended 
> gradually, without arbitrary limits introduced.

On that, I still don't think it's a good idea to use bitfields in an ABI.
The C std is just not strict enough on them, and I guess that is the
reason this would be the first such usage.
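
For comparison, a bitfield-free variant - just a sketch of the alternative,
not something the current patch does - would carry those bits in an explicit
flags word with fixed masks, so the bit positions are pinned down by the ABI
rather than by the compiler's bitfield layout:

 #define PERF_HW_EVENT_DISABLED	(1ULL << 0)	/* off by default */
 #define PERF_HW_EVENT_NMI	(1ULL << 1)	/* NMI sampling   */
 #define PERF_HW_EVENT_RAW	(1ULL << 2)	/* raw event type */

 struct perf_counter_hw_event {
         s64                     type;

         u64                     irq_period;
         u32                     record_type;
         u32                     __reserved_1;

         u64                     flags;		/* PERF_HW_EVENT_* bits */

         u64                     __reserved_2;
 };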


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch] Performance Counters for Linux, v3
  2008-12-12  8:25   ` Peter Zijlstra
@ 2008-12-12  8:35     ` stephane eranian
  2008-12-12  8:51       ` Peter Zijlstra
  2008-12-12  8:59     ` stephane eranian
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 52+ messages in thread
From: stephane eranian @ 2008-12-12  8:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vince Weaver, Ingo Molnar, linux-kernel, Thomas Gleixner,
	Andrew Morton, Eric Dumazet, Robert Richter, Arjan van de Veen,
	Peter Anvin, Paul Mackerras, David S. Miller

Peter,

On Fri, Dec 12, 2008 at 9:25 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>> > +   /*
>> > +    * Common hardware events, generalized by the kernel:
>> > +    */
>> > +   PERF_COUNT_CYCLES               =  0,
>> > +   PERF_COUNT_INSTRUCTIONS         =  1,
>> > +   PERF_COUNT_CACHE_REFERENCES     =  2,
>> > +   PERF_COUNT_CACHE_MISSES         =  3,
>> > +   PERF_COUNT_BRANCH_INSTRUCTIONS  =  4,
>> > +   PERF_COUNT_BRANCH_MISSES        =  5,
>>
>> Many machines do not support these counts.  For example, Niagara T1 does
>> not have a CYCLES count.  And good luck if you think you can easily come
>> up with something meaningful for the various kind of CACHE_MISSES on the
>> Pentium 4.  Also, the Pentium D has various flavors of retired instruction
>> count with slightly different semantics.  This kind of abstraction should
>> be done in userspace.
>
> I'll argue to disagree, sure such events might not be supported by any
> particular hardware implementation - but the fact that PAPI gives a list
> of 'common' events means that they are, well, common. So unifying them
> between those archs that do implement them seems like a sane choice, no?
>
> For those archs that do not support it, it will just fail to open. No
> harm done.
>
> The proposal allows for you to specify raw hardware events, so you can
> just totally ignore this part of the abstraction.
>
I believe the cache related events do not belong in here. There is no definition
for them. You don't know what cache miss level, what kind of access. You cannot
do this even on Intel Core processors.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch] Performance Counters for Linux, v3
  2008-12-12  8:35     ` stephane eranian
@ 2008-12-12  8:51       ` Peter Zijlstra
  2008-12-12  9:00         ` Peter Zijlstra
  0 siblings, 1 reply; 52+ messages in thread
From: Peter Zijlstra @ 2008-12-12  8:51 UTC (permalink / raw)
  To: eranian
  Cc: Vince Weaver, Ingo Molnar, linux-kernel, Thomas Gleixner,
	Andrew Morton, Eric Dumazet, Robert Richter, Arjan van de Veen,
	Peter Anvin, Paul Mackerras, David S. Miller

On Fri, 2008-12-12 at 09:35 +0100, stephane eranian wrote:
> Peter,
> 
> On Fri, Dec 12, 2008 at 9:25 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> >> > +   /*
> >> > +    * Common hardware events, generalized by the kernel:
> >> > +    */
> >> > +   PERF_COUNT_CYCLES               =  0,
> >> > +   PERF_COUNT_INSTRUCTIONS         =  1,
> >> > +   PERF_COUNT_CACHE_REFERENCES     =  2,
> >> > +   PERF_COUNT_CACHE_MISSES         =  3,
> >> > +   PERF_COUNT_BRANCH_INSTRUCTIONS  =  4,
> >> > +   PERF_COUNT_BRANCH_MISSES        =  5,
> >>
> >> Many machines do not support these counts.  For example, Niagara T1 does
> >> not have a CYCLES count.  And good luck if you think you can easily come
> >> up with something meaningful for the various kind of CACHE_MISSES on the
> >> Pentium 4.  Also, the Pentium D has various flavors of retired instruction
> >> count with slightly different semantics.  This kind of abstraction should
> >> be done in userspace.
> >
> > I'll argue to disagree, sure such events might not be supported by any
> > particular hardware implementation - but the fact that PAPI gives a list
> > of 'common' events means that they are, well, common. So unifying them
> > between those archs that do implement them seems like a sane choice, no?
> >
> > For those archs that do not support it, it will just fail to open. No
> > harm done.
> >
> > The proposal allows for you to specify raw hardware events, so you can
> > just totally ignore this part of the abstraction.
> >
> I believe the cache related events do not belong in here. There is no definition
> for them. You don't know what cache miss level, what kind of access. You cannot
> do this even on Intel Core processors.

I might agree with that, perhaps we should model this on the common list
PAPI specifies?


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch] Performance Counters for Linux, v3
  2008-12-12  8:29     ` Peter Zijlstra
@ 2008-12-12  8:54       ` Ingo Molnar
  2008-12-12 13:42       ` Andi Kleen
  1 sibling, 0 replies; 52+ messages in thread
From: Ingo Molnar @ 2008-12-12  8:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tony Luck, linux-kernel, Thomas Gleixner, Andrew Morton,
	Stephane Eranian, Eric Dumazet, Robert Richter,
	Arjan van de Veen, Peter Anvin, Paul Mackerras, David S. Miller


* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Thu, 2008-12-11 at 20:34 +0100, Ingo Molnar wrote:
> 
> >  struct perf_counter_hw_event {
> >         s64                     type;
> > 
> >         u64                     irq_period;
> >         u32                     record_type;
> > 
> >         u32                     disabled     :  1, /* off by default */
> >                                 nmi          :  1, /* NMI sampling   */
> >                                 raw          :  1, /* raw event type */
> >                                 __reserved_1 : 29;
> > 
> >         u64                     __reserved_2;
> >  };
> > 
> > if the hw_event.raw bit is set to 1, then the hw_event.type is fully 
> > 'raw'. The default is for raw to be 0. So negative numbers can be used 
> > for sw events, positive numbers for hw events. Both can be extended 
> > gradually, without arbitrary limits introduced.
> 
> On that, I still don't think its a good idea to use bitfields in an 
> ABI. The C std is just not strict enough on them, and I guess that is 
> the reason this would be the first such usage.

I don't feel strongly about this; we could certainly change it.

But these are system calls which have per platform bit order anyway - is 
it really an issue? I'd agree that it would be bad for any sort of 
persistent or otherwise cross-platform data such as filesystems, network 
protocol bits, etc.

We use bitfields in a couple of system call ABIs already, for example in 
PPP:

if_ppp.h-/* For PPPIOCGL2TPSTATS */
if_ppp.h-struct pppol2tp_ioc_stats {
if_ppp.h-       __u16           tunnel_id;      /* redundant */
if_ppp.h-       __u16           session_id;     /* if zero, get tunnel stats */
if_ppp.h:       __u32           using_ipsec:1;  /* valid only for session_id == 

	Ingo

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch] Performance Counters for Linux, v3
  2008-12-12  8:25   ` Peter Zijlstra
  2008-12-12  8:35     ` stephane eranian
@ 2008-12-12  8:59     ` stephane eranian
  2008-12-12  9:23       ` Peter Zijlstra
  2008-12-12 17:03     ` Samuel Thibault
  2008-12-12 18:18     ` Vince Weaver
  3 siblings, 1 reply; 52+ messages in thread
From: stephane eranian @ 2008-12-12  8:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vince Weaver, Ingo Molnar, linux-kernel, Thomas Gleixner,
	Andrew Morton, Eric Dumazet, Robert Richter, Arjan van de Veen,
	Peter Anvin, Paul Mackerras, David S. Miller

Peter,

On Fri, Dec 12, 2008 at 9:25 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Thu, 2008-12-11 at 13:02 -0500, Vince Weaver wrote:
>

>> Perfmon3 works for all of those 60 machines.  This new proposal works on a
>> 2 out of the 60.
>
> s/works/is implemented/
>
>> Who is going to add support for all of those machines?  I've spent a lot
>> of developer time getting prefmon going for all of those configurations.
>> But why should I help out with this new inferior proposal?  It could all
>> be another waste of time.
>
> So much for constructive critisism.. have you tried taking the design to
> its limits, if so, where do you see problems?
>
People have pointed out problems, but you keep forgetting to answer them.

For instance, people have pointed out that your design necessarily implies
pulling the event tables for all PMU models out there into the kernel. This
is not just data, it is also the complex algorithms that assign events to
counters. The constraints between events can be very tricky to solve. If
you get this wrong, it leads to silent errors, and that is really bad.
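
To make that concrete, the per-event data such an assignment algorithm has
to carry looks roughly like the sketch below - the field names and values
are made up, purely for illustration:

  /* sketch of a per-event counter-constraint entry (made-up names/values) */
  struct event_constraint {
          unsigned int    event_code;     /* raw event select value        */
          unsigned int    counter_mask;   /* counters that may host this
                                           * event, e.g. 0x3 = counter 0
                                           * or 1 only                     */
  };

  static const struct event_constraint example_constraints[] = {
          { 0x12, 0x3 },  /* only schedulable on counters 0-1 */
          { 0x3c, 0xf },  /* can go on any of counters 0-3    */
  };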

Looking at Intel Core, Nehalem, or AMD64 does not reflect how complex this
really is. Paul pointed out earlier the complexity on Power. I can relate
to the complexity on Itanium (I implemented all the code in the user-level
libpfm for them).  Read the Itanium PMU description and I hope you'll
understand.

Event constraints are not going away anytime soon, quite the contrary.

Furthermore, event tables are not always correct. In fact, they are always
bogus. Event semantics vary between steppings. New events show up, others
get removed. Constraints are discovered later on.

If you have all of that in the kernel, it means you'll have to generate a
kernel patch each time. Even if that can be encapsulated into a kernel
module, you will still have problems.

Furthermore, Linux commercial distribution release cycles do not align
well with new processor releases. I can boot my RHEL5 kernel on a Nehalem
system and it would be nice not to have to wait for a new kernel update to
get the full Nehalem PMU event table, so I can program more than the basic
6 architected events of Intel X86.

I know the argument about the fact that you'll have a patch within 24h on
kernel.org. The problem is that no end-user runs a kernel.org kernel,
nobody. Changing the kernel is not an option for many end-users; it may
even require re-certification for many customers.

I believe many people would like to see how you plan on addressing those issues.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch] Performance Counters for Linux, v3
  2008-12-12  8:51       ` Peter Zijlstra
@ 2008-12-12  9:00         ` Peter Zijlstra
  2008-12-12  9:07           ` Ingo Molnar
  0 siblings, 1 reply; 52+ messages in thread
From: Peter Zijlstra @ 2008-12-12  9:00 UTC (permalink / raw)
  To: eranian
  Cc: Vince Weaver, Ingo Molnar, linux-kernel, Thomas Gleixner,
	Andrew Morton, Eric Dumazet, Robert Richter, Arjan van de Veen,
	Peter Anvin, Paul Mackerras, David S. Miller

On Fri, 2008-12-12 at 09:51 +0100, Peter Zijlstra wrote:
> On Fri, 2008-12-12 at 09:35 +0100, stephane eranian wrote:
> > Peter,
> > 
> > On Fri, Dec 12, 2008 at 9:25 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > >> > +   /*
> > >> > +    * Common hardware events, generalized by the kernel:
> > >> > +    */
> > >> > +   PERF_COUNT_CYCLES               =  0,
> > >> > +   PERF_COUNT_INSTRUCTIONS         =  1,
> > >> > +   PERF_COUNT_CACHE_REFERENCES     =  2,
> > >> > +   PERF_COUNT_CACHE_MISSES         =  3,
> > >> > +   PERF_COUNT_BRANCH_INSTRUCTIONS  =  4,
> > >> > +   PERF_COUNT_BRANCH_MISSES        =  5,
> > >>
> > >> Many machines do not support these counts.  For example, Niagara T1 does
> > >> not have a CYCLES count.  And good luck if you think you can easily come
> > >> up with something meaningful for the various kind of CACHE_MISSES on the
> > >> Pentium 4.  Also, the Pentium D has various flavors of retired instruction
> > >> count with slightly different semantics.  This kind of abstraction should
> > >> be done in userspace.
> > >
> > > I'll argue to disagree, sure such events might not be supported by any
> > > particular hardware implementation - but the fact that PAPI gives a list
> > > of 'common' events means that they are, well, common. So unifying them
> > > between those archs that do implement them seems like a sane choice, no?
> > >
> > > For those archs that do not support it, it will just fail to open. No
> > > harm done.
> > >
> > > The proposal allows for you to specify raw hardware events, so you can
> > > just totally ignore this part of the abstraction.
> > >
> > I believe the cache related events do not belong in here. There is no definition
> > for them. You don't know what cache miss level, what kind of access. You cannot
> > do this even on Intel Core processors.
> 
> I might agree with that, perhaps we should model this to the common list
> PAPI specifies?

http://icl.cs.utk.edu/projects/papi/files/html_man3/papi_presets.html

Has a lot of cache events.

And I can see the use of a set without the L[123] in there, which would
signify either all or the lack of more specific knowledge. Like with
PAPI its perfectly fine to not support these common events on a
particular hardware platform.



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch] Performance Counters for Linux, v3
  2008-12-12  9:00         ` Peter Zijlstra
@ 2008-12-12  9:07           ` Ingo Molnar
  0 siblings, 0 replies; 52+ messages in thread
From: Ingo Molnar @ 2008-12-12  9:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: eranian, Vince Weaver, linux-kernel, Thomas Gleixner,
	Andrew Morton, Eric Dumazet, Robert Richter, Arjan van de Veen,
	Peter Anvin, Paul Mackerras, David S. Miller


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Fri, 2008-12-12 at 09:51 +0100, Peter Zijlstra wrote:
> > On Fri, 2008-12-12 at 09:35 +0100, stephane eranian wrote:
> > > Peter,
> > > 
> > > On Fri, Dec 12, 2008 at 9:25 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > > >> > +   /*
> > > >> > +    * Common hardware events, generalized by the kernel:
> > > >> > +    */
> > > >> > +   PERF_COUNT_CYCLES               =  0,
> > > >> > +   PERF_COUNT_INSTRUCTIONS         =  1,
> > > >> > +   PERF_COUNT_CACHE_REFERENCES     =  2,
> > > >> > +   PERF_COUNT_CACHE_MISSES         =  3,
> > > >> > +   PERF_COUNT_BRANCH_INSTRUCTIONS  =  4,
> > > >> > +   PERF_COUNT_BRANCH_MISSES        =  5,
> > > >>
> > > >> Many machines do not support these counts.  For example, Niagara T1 does
> > > >> not have a CYCLES count.  And good luck if you think you can easily come
> > > >> up with something meaningful for the various kind of CACHE_MISSES on the
> > > >> Pentium 4.  Also, the Pentium D has various flavors of retired instruction
> > > >> count with slightly different semantics.  This kind of abstraction should
> > > >> be done in userspace.
> > > >
> > > > I'll argue to disagree, sure such events might not be supported by any
> > > > particular hardware implementation - but the fact that PAPI gives a list
> > > > of 'common' events means that they are, well, common. So unifying them
> > > > between those archs that do implement them seems like a sane choice, no?
> > > >
> > > > For those archs that do not support it, it will just fail to open. No
> > > > harm done.
> > > >
> > > > The proposal allows for you to specify raw hardware events, so you can
> > > > just totally ignore this part of the abstraction.
> > > >
> > > I believe the cache related events do not belong in here. There is no definition
> > > for them. You don't know what cache miss level, what kind of access. You cannot
> > > do this even on Intel Core processors.
> > 
> > I might agree with that, perhaps we should model this to the common list
> > PAPI specifies?
> 
> http://icl.cs.utk.edu/projects/papi/files/html_man3/papi_presets.html
> 
> Has a lot of cache events.
> 
> And I can see the use of a set without the L[123] in there, which would 
> signify either all or the lack of more specific knowledge. Like with 
> PAPI its perfectly fine to not support these common events on a 
> particular hardware platform.

yes, exactly.

A PAPI wrapper on top of this code might even opt to never use any of the 
generic types, because it can be well aware of all the CPU types and 
their exact event mappings to raw types, and can use those directly.

Different apps like KernelTop might opt to utilize the generic types.

A kernel is all about providing intelligent, generalized access to hw 
resources.

	Ingo

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch] Performance Counters for Linux, v3
  2008-12-12  8:59     ` stephane eranian
@ 2008-12-12  9:23       ` Peter Zijlstra
  2008-12-12 10:21         ` Robert Richter
                           ` (2 more replies)
  0 siblings, 3 replies; 52+ messages in thread
From: Peter Zijlstra @ 2008-12-12  9:23 UTC (permalink / raw)
  To: eranian
  Cc: Vince Weaver, Ingo Molnar, linux-kernel, Thomas Gleixner,
	Andrew Morton, Eric Dumazet, Robert Richter, Arjan van de Veen,
	Peter Anvin, Paul Mackerras, David S. Miller

On Fri, 2008-12-12 at 09:59 +0100, stephane eranian wrote:
> Peter,
> 
> On Fri, Dec 12, 2008 at 9:25 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > On Thu, 2008-12-11 at 13:02 -0500, Vince Weaver wrote:
> >
> 
> >> Perfmon3 works for all of those 60 machines.  This new proposal works on a
> >> 2 out of the 60.
> >
> > s/works/is implemented/
> >
> >> Who is going to add support for all of those machines?  I've spent a lot
> >> of developer time getting prefmon going for all of those configurations.
> >> But why should I help out with this new inferior proposal?  It could all
> >> be another waste of time.
> >
> > So much for constructive critisism.. have you tried taking the design to
> > its limits, if so, where do you see problems?
> >
> People have pointed out problems, but you keep forgetting to answer them.

I thought some of that (and surely more to follow) has been
incorporated.

> For instance, people have pointed out that your design necessarily implies
> pulling into the kernel the event table for all PMU models out there. This
> is not just data, this is also complex algorithms to assign events to counters.
> The constraints between events can be very tricky to solve. If you get this
> wrong, this leads to silent errors, and that is really bad.

(well, it's not my design - I'm just trying to see how far we can push it
out of sheer curiosity)

This has to be done anyway, and getting it wrong in userspace is just as
bad, no?

The _ONLY_ technical argument I've seen to do this in userspace is that
these tables and text segments are unswappable in-kernel - which doesn't
count too heavily in my book.

> Looking at Intel Core, Nehalem, or AMD64 does not reflect the reality of
> the complexity of this. Paul pointed out earlier the complexity on Power.
> I can relate to the complexity on Itanium (I implemented all the code in
> the user level libpfm for them).  Read the Itanium PMU description and I
> hope you'll understand.

Again, I appreciate the fact that multi-dimensional constraint solving
isn't easy. But any which way we turn this thing, it still needs to be
done.

> Events constraints are not going away anytime soon, quite the contrary.
> 
> Furthermore, event tables are not always correct. In fact, they are
> always bogus.
> Event semantics varies between steppings. New events shows up, others
> get removed.
> Constraints are discovered later on.
> 
> If you have all of that in the kernel, it means you'll have to
> generate a kernel patch each
> time. Even if that can be encapsulated into a kernel module, you will
> still have problems.

How is updating a kernel module (esp one that only contains constraint
tables) more difficult than upgrading a user-space library? That just
doesn't make sense.

> Furthermore, Linux commercial distribution release cycles do not
> align well with new processor
> releases. I can boot my RHEL5 kernel on a Nehalem system and it would
> be  nice not to have to
> wait for a new kernel update to get the full Nehalem PMU event table,
> so I can program more than
> the basic 6 architected events of Intel X86.

Talking with my community hat on, that is an artificial problem created
by distributions, tell them to fix it.

All it requires is a new kernel module that describes the new chip,
surely that can be shipped as easily as a new library.

> I know the argument about the fact that you'll have a patch with 24h
> on kernel.org. The problem
> is that no end-user runs a kernel.org kernel, nobody. Changing the
> kernel is not an option for
> many end-users, it may even require re-certifications for many customers.
> 
> I believe many people would like to see how you plan on addressing those issues.

You're talking to LKML here - we don't care about stuff older than -git
(well, only a little, but not much more beyond n-1).

What we do care about is technical arguments, and last time I checked,
hardware resource scheduling was an OS level job.

But if the PMU control is critical to the enterprise deployment of
$customer, then he would have to re-certify on the library update too.

If its only development phase stuff, then the deployment machines won't
even load the module so there'd be no problem anyway.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch] Performance Counters for Linux, v3
  2008-12-12  9:23       ` Peter Zijlstra
@ 2008-12-12 10:21         ` Robert Richter
  2008-12-12 10:59           ` Christoph Hellwig
  2008-12-12 16:45         ` Chris Friesen
  2008-12-12 17:42         ` stephane eranian
  2 siblings, 1 reply; 52+ messages in thread
From: Robert Richter @ 2008-12-12 10:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: eranian, Vince Weaver, Ingo Molnar, linux-kernel,
	Thomas Gleixner, Andrew Morton, Eric Dumazet, Arjan van de Veen,
	Peter Anvin, Paul Mackerras, David S. Miller

On 12.12.08 10:23:54, Peter Zijlstra wrote:
> On Fri, 2008-12-12 at 09:59 +0100, stephane eranian wrote:
> > For instance, people have pointed out that your design necessarily implies
> > pulling into the kernel the event table for all PMU models out there. This
> > is not just data, this is also complex algorithms to assign events to counters.
> > The constraints between events can be very tricky to solve. If you get this
> > wrong, this leads to silent errors, and that is really bad.
> 
> (well, its not my design - I'm just trying to see how far we can push it
> out of sheer curiosity)
> 
> This has to be done anyway, and getting it wrong in userspace is just as
> bad no? 
> 
> The _ONLY_ technical argument I've seen to do this in userspace is that
> these tables and text segments are unswappable in-kernel - which doesn't
> count too heavily in my book.

But there are also no arguments against implementing it in userspace.

> > Looking at Intel Core, Nehalem, or AMD64 does not reflect the reality of
> > the complexity of this. Paul pointed out earlier the complexity on Power.
> > I can relate to the complexity on Itanium (I implemented all the code in
> > the user level libpfm for them).  Read the Itanium PMU description and I
> > hope you'll understand.
> 
> Again, I appreciate the fact that multi-dimensional constraint solving
> isn't easy. But any which way we turn this thing, it still needs to be
> done.

I agree with Stephane. There are already many different PMU descriptions
depending on family, model and stepping, and with *every* new CPU revision
you will get one more update. Implementing this in the kernel would require
kernel updates where otherwise no changes would be necessary.

If you look at current PMU implementations, there are tons of description
files and code you don't want to have in the kernel.

Also, a profiling tool that needs a certain PMU feature would then depend
on its kernel implementation. (Actually, it is impossible to have 100%
implementation coverage.) If the PMU could be programmed from userspace,
the tool could provide the feature itself.

> > Events constraints are not going away anytime soon, quite the contrary.
> > 
> > Furthermore, event tables are not always correct. In fact, they are
> > always bogus.
> > Event semantics varies between steppings. New events shows up, others
> > get removed.
> > Constraints are discovered later on.
> > 
> > If you have all of that in the kernel, it means you'll have to
> > generate a kernel patch each
> > time. Even if that can be encapsulated into a kernel module, you will
> > still have problems.
> 
> How is updating a kernel module (esp one that only contains constraint
> tables) more difficult than upgrading a user-space library? That just
> doesn't make sense.

At least this would require a kernel with modules enabled.

> > Furthermore, Linux commercial distribution release cycles do not
> > align well with new processor
> > releases. I can boot my RHEL5 kernel on a Nehalem system and it would
> > be  nice not to have to
> > wait for a new kernel update to get the full Nehalem PMU event table,
> > so I can program more than
> > the basic 6 architected events of Intel X86.
> 
> Talking with my community hat on, that is an artificial problem created
> by distributions, tell them to fix it.

It does not make sense to close our eyes to reality. There are systems
where it is not possible to update the kernel frequently. Probably you
have one running yourself.

-Robert

-- 
Advanced Micro Devices, Inc.
Operating System Research Center
email: robert.richter@amd.com


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch] Performance Counters for Linux, v3
  2008-12-12 10:21         ` Robert Richter
@ 2008-12-12 10:59           ` Christoph Hellwig
  2008-12-12 11:35             ` Robert Richter
  0 siblings, 1 reply; 52+ messages in thread
From: Christoph Hellwig @ 2008-12-12 10:59 UTC (permalink / raw)
  To: Robert Richter
  Cc: Peter Zijlstra, eranian, Vince Weaver, Ingo Molnar, linux-kernel,
	Thomas Gleixner, Andrew Morton, Eric Dumazet, Arjan van de Veen,
	Peter Anvin, Paul Mackerras, David S. Miller

On Fri, Dec 12, 2008 at 11:21:11AM +0100, Robert Richter wrote:
> I agree with Stephane. There are already many different PMU
> descriptions depending on family, model and steppping and with *every*
> new cpu revision you will get one more update. Implementing this in
> the kernel would require kernel updates where otherwise no changes
> would be necessary.

Please stop the bullshit.  You have to update _something_.  It makes a
lot of sense to update the thing you need to update anyway for new
hardware support, and not some piece of junk library like libperfmon.

> > Talking with my community hat on, that is an artificial problem created
> > by distributions, tell them to fix it.
> 
> It does not make sense to close the eyes to reality. There are systems
> where it is not possible to update the kernel frequently. Probably you
> have one running yourself.

Of course it is.  And on many of my systems it's much easier to update a
kernel than a library.  A kernel I can build myself; for libraries I'm
more or less reliant on the distro or on hacking fugly rpm or debian
packaging bits.

Having HW support in the kernel is a lot easier than in weird libraries.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch] Performance Counters for Linux, v3
  2008-12-12 10:59           ` Christoph Hellwig
@ 2008-12-12 11:35             ` Robert Richter
  0 siblings, 0 replies; 52+ messages in thread
From: Robert Richter @ 2008-12-12 11:35 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Peter Zijlstra, eranian, Vince Weaver, Ingo Molnar, linux-kernel,
	Thomas Gleixner, Andrew Morton, Eric Dumazet, Arjan van de Veen,
	Peter Anvin, Paul Mackerras, David S. Miller

On 12.12.08 05:59:38, Christoph Hellwig wrote:
> On Fri, Dec 12, 2008 at 11:21:11AM +0100, Robert Richter wrote:
> > I agree with Stephane. There are already many different PMU
> > descriptions depending on family, model and steppping and with *every*
> > new cpu revision you will get one more update. Implementing this in
> > the kernel would require kernel updates where otherwise no changes
> > would be necessary.
> 
> Please stop the Bullshit.  You have to update _something_.  It makes a
> lot of sense to update the thing you need to udpate anyway for new
> hardware support, and not some piece of junk library like libperfmon.

New hardware does not always mean implementing new hardware support.
Sometimes it is sufficient to simply program the same registers in another
way. Why change the kernel for this?

-Robert

-- 
Advanced Micro Devices, Inc.
Operating System Research Center
email: robert.richter@amd.com


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch] Performance Counters for Linux, v3
  2008-12-12  8:29     ` Peter Zijlstra
  2008-12-12  8:54       ` Ingo Molnar
@ 2008-12-12 13:42       ` Andi Kleen
  1 sibling, 0 replies; 52+ messages in thread
From: Andi Kleen @ 2008-12-12 13:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Tony Luck, linux-kernel, Thomas Gleixner,
	Andrew Morton, Stephane Eranian, Eric Dumazet, Robert Richter,
	Arjan van de Veen, Peter Anvin, Paul Mackerras, David S. Miller

Peter Zijlstra <a.p.zijlstra@chello.nl> writes:
> On that, I still don't think its a good idea to use bitfields in an ABI.
> The C std is just not strict enough on them,

If you constrain yourself to a single architecture, in practice the C
bitfield conventions are quite good: e.g. on Linux/x86 it is "everyone
implements what gcc does" (and on linux/ppc "what ppc gcc does").
And the syscall ABI is certainly restricted to one architecture.

-Andi

-- 
ak@linux.intel.com

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch] Performance Counters for Linux, v3
  2008-12-12  9:23       ` Peter Zijlstra
  2008-12-12 10:21         ` Robert Richter
@ 2008-12-12 16:45         ` Chris Friesen
  2008-12-12 17:42         ` stephane eranian
  2 siblings, 0 replies; 52+ messages in thread
From: Chris Friesen @ 2008-12-12 16:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: eranian, Vince Weaver, Ingo Molnar, linux-kernel,
	Thomas Gleixner, Andrew Morton, Eric Dumazet, Robert Richter,
	Arjan van de Veen, Peter Anvin, Paul Mackerras, David S. Miller

Peter Zijlstra wrote:
> On Fri, 2008-12-12 at 09:59 +0100, stephane eranian wrote:

>>Furthermore, Linux commercial distribution release cycles do not
>>align well with new processor
>>releases. I can boot my RHEL5 kernel on a Nehalem system and it would
>>be  nice not to have to
>>wait for a new kernel update to get the full Nehalem PMU event table,
>>so I can program more than
>>the basic 6 architected events of Intel X86.
> 
> 
> Talking with my community hat on, that is an artificial problem created
> by distributions, tell them to fix it.
> 
> All it requires is a new kernel module that describes the new chip,
> surely that can be shipped as easily as a new library.

I have to confess that I haven't had a chance to look at the code.  Is 
the current proposal set up in such a way as to support loading a module 
and having the new description picked up automatically?


>>Changing the
>>kernel is not an option for
>>many end-users, it may even require re-certifications for many customers.

> What we do care about is technical arguments, and last time I checked,
> hardware resource scheduling was an OS level job.

Here I agree.

> But if the PMU control is critical to the enterprise deployment of
> $customer, then he would have to re-certify on the library update too.

It may not have any basis in fact, but in practice it seems like kernel 
changes are considered more risky than userspace changes.

As you say though, it's not likely that most production systems would be 
running performance monitoring code, so this may only be an issue for 
development machines.


Chris

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch] Performance Counters for Linux, v3
  2008-12-12  8:25   ` Peter Zijlstra
  2008-12-12  8:35     ` stephane eranian
  2008-12-12  8:59     ` stephane eranian
@ 2008-12-12 17:03     ` Samuel Thibault
  2008-12-12 17:11       ` Peter Zijlstra
  2008-12-12 18:18     ` Vince Weaver
  3 siblings, 1 reply; 52+ messages in thread
From: Samuel Thibault @ 2008-12-12 17:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vince Weaver, Ingo Molnar, linux-kernel, Thomas Gleixner,
	Andrew Morton, Stephane Eranian, Eric Dumazet, Robert Richter,
	Arjan van de Veen, Peter Anvin, Paul Mackerras, David S. Miller

Peter Zijlstra, on Fri 12 Dec 2008 09:25:45 +0100, wrote:
> On Thu, 2008-12-11 at 13:02 -0500, Vince Weaver wrote:
> > Also, my primary method of using counters is total aggregate count for a 
> > single user-space process. 
> 
> Process, as in single thread, or multi-threaded? I'll assume
> single-thread.

BTW, just to make sure it is taken into account (I haven't followed the
thread up to here, I just saw a "pid_t" somewhere that alarmed me): for
our uses, we _do_ need per-kernel-thread counters.

Samuel

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch] Performance Counters for Linux, v3
  2008-12-12 17:03     ` Samuel Thibault
@ 2008-12-12 17:11       ` Peter Zijlstra
  0 siblings, 0 replies; 52+ messages in thread
From: Peter Zijlstra @ 2008-12-12 17:11 UTC (permalink / raw)
  To: Samuel Thibault
  Cc: Vince Weaver, Ingo Molnar, linux-kernel, Thomas Gleixner,
	Andrew Morton, Stephane Eranian, Eric Dumazet, Robert Richter,
	Arjan van de Veen, Peter Anvin, Paul Mackerras, David S. Miller

On Fri, 2008-12-12 at 18:03 +0100, Samuel Thibault wrote:
> Peter Zijlstra, on Fri 12 Dec 2008 09:25:45 +0100, wrote:
> > On Thu, 2008-12-11 at 13:02 -0500, Vince Weaver wrote:
> > > Also, my primary method of using counters is total aggregate count for a 
> > > single user-space process. 
> > 
> > Process, as in single thread, or multi-threaded? I'll assume
> > single-thread.
> 
> BTW, just to make sure it is taken into account (I haven't followed the
> thread up to here, just saw a "pid_t" somwhere that alarmed me): for our
> uses, we _do_ need per-kernelthread counters.

Yes, counters are per task - not sure on the exact interface thingy
though - I guess it should be tid_t, but glibc does something a bit weird
there.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch] Performance Counters for Linux, v3
  2008-12-12  9:23       ` Peter Zijlstra
  2008-12-12 10:21         ` Robert Richter
  2008-12-12 16:45         ` Chris Friesen
@ 2008-12-12 17:42         ` stephane eranian
  2008-12-12 18:01           ` stephane eranian
  2008-12-13 11:17           ` Peter Zijlstra
  2 siblings, 2 replies; 52+ messages in thread
From: stephane eranian @ 2008-12-12 17:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vince Weaver, Ingo Molnar, linux-kernel, Thomas Gleixner,
	Andrew Morton, Eric Dumazet, Robert Richter, Arjan van de Veen,
	Peter Anvin, Paul Mackerras, David S. Miller

Peter,

On Fri, Dec 12, 2008 at 10:23 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>> For instance, people have pointed out that your design necessarily implies
>> pulling into the kernel the event table for all PMU models out there. This
>> is not just data, this is also complex algorithms to assign events to counters.
>> The constraints between events can be very tricky to solve. If you get this
>> wrong, this leads to silent errors, and that is really bad.
>
> (well, its not my design - I'm just trying to see how far we can push it
> out of sheer curiosity)
>
> This has to be done anyway, and getting it wrong in userspace is just as
> bad no?
>
Not as bad. If a library is bad, then just don't use the library. In fact,
I know tools which do not even need a library. What is important is that
there is a way to avoid the problem. If the kernel controls this, then
there is no way out.

To stay in your world, look at the Pentium 4 (Netburst) PMU description,
and you'll see that things are already very complicated there.


> The _ONLY_ technical argument I've seen to do this in userspace is that
> these tables and text segments are unswappable in-kernel - which doesn't
> count too heavily in my book.
>
>> Looking at Intel Core, Nehalem, or AMD64 does not reflect the reality of
>> the complexity of this. Paul pointed out earlier the complexity on Power.
>> I can relate to the complexity on Itanium (I implemented all the code in
>> the user level libpfm for them).  Read the Itanium PMU description and I
>> hope you'll understand.
>
> Again, I appreciate the fact that multi-dimensional constraint solving
> isn't easy. But any which way we turn this thing, it still needs to be
> done.
>

Yes, but you have lots of ways of doing this at the user level. For all I
know, you could even hardcode the (register, value) pairs in your tool if
you know what you are doing. And don't discount the fact that advanced
tools know what they are doing very precisely.

>> Events constraints are not going away anytime soon, quite the contrary.
>>
>> Furthermore, event tables are not always correct. In fact, they are
>> always bogus.
>> Event semantics varies between steppings. New events shows up, others
>> get removed.
>> Constraints are discovered later on.
>>
>> If you have all of that in the kernel, it means you'll have to
>> generate a kernel patch each
>> time. Even if that can be encapsulated into a kernel module, you will
>> still have problems.
>
> How is updating a kernel module (esp one that only contains constraint
> tables) more difficult than upgrading a user-space library? That just
> doesn't make sense.
>
Go ask end-users what they think of that?

You don't even need a library. All of this could be integrated into the tool.
New processor, just go download the updated version of the tool.
No kernel changes.

>> Furthermore, Linux commercial distribution release cycles do not
>> align well with new processor
>> releases. I can boot my RHEL5 kernel on a Nehalem system and it would
>> be  nice not to have to
>> wait for a new kernel update to get the full Nehalem PMU event table,
>> so I can program more than
>> the basic 6 architected events of Intel X86.
>
> Talking with my community hat on, that is an artificial problem created
> by distributions, tell them to fix it.
>
> All it requires is a new kernel module that describes the new chip,
> surely that can be shipped as easily as a new library.
>

No, because you need tons of versions of that module based on kernel
versions. People do not recompile kernel modules.

>> I know the argument about the fact that you'll have a patch with 24h
>> on kernel.org. The problem
>> is that no end-user runs a kernel.org kernel, nobody. Changing the
>> kernel is not an option for
>> many end-users, it may even require re-certifications for many customers.
>>
>> I believe many people would like to see how you plan on addressing those issues.
>
> You're talking to LKML here - we don't care about stuff older than -git
> (well, only a little, but not much more beyond n-1).
>
That is why you don't always understand the issues of users, unfortunately.

> What we do care about is technical arguments, and last time I checked,
> hardware resource scheduling was an OS level job.
>
Yes, if you get it wrong, applications are screwed.

> But if the PMU control is critical to the enterprise deployment of
> $customer, then he would have to re-certify on the library update too.
>
No, they just download a new version of the tool.

> If its only development phase stuff, then the deployment machines won't
> even load the module so there'd be no problem anyway.
>
This is not just development stuff anymore.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch] Performance Counters for Linux, v3
  2008-12-12 17:42         ` stephane eranian
@ 2008-12-12 18:01           ` stephane eranian
  2008-12-12 19:45             ` Chris Friesen
  2008-12-14 23:13             ` Ingo Molnar
  2008-12-13 11:17           ` Peter Zijlstra
  1 sibling, 2 replies; 52+ messages in thread
From: stephane eranian @ 2008-12-12 18:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vince Weaver, Ingo Molnar, linux-kernel, Thomas Gleixner,
	Andrew Morton, Eric Dumazet, Robert Richter, Arjan van de Veen,
	Peter Anvin, Paul Mackerras, David S. Miller

Hi,

Given the level of abstraction you are using for the API, and given your
argument that the kernel can do the HW resource scheduling better than
anybody else, what happens in the following test case:

   - 2-way system (cpu0, cpu1)

   - on cpu0, two processes P1, P2, each self-monitoring and counting event E1.
     Event E1 can only be measured on counter C1.

   - on cpu1, there is a cpu-wide session, monitoring event E1, thus using C1

   - the scheduler decides to migrate P1 onto CPU1. You now have a
     conflict on C1.

How is this managed?

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch] Performance Counters for Linux, v3
  2008-12-12  8:25   ` Peter Zijlstra
                       ` (2 preceding siblings ...)
  2008-12-12 17:03     ` Samuel Thibault
@ 2008-12-12 18:18     ` Vince Weaver
  3 siblings, 0 replies; 52+ messages in thread
From: Vince Weaver @ 2008-12-12 18:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Thomas Gleixner, Andrew Morton,
	Stephane Eranian, Eric Dumazet, Robert Richter,
	Arjan van de Veen, Peter Anvin, Paul Mackerras, David S. Miller


On Fri, 12 Dec 2008, Peter Zijlstra wrote:
> On Thu, 2008-12-11 at 13:02 -0500, Vince Weaver wrote:
>
>> Perfmon3 works for all of those 60 machines.  This new proposal works on a
>> 2 out of the 60.
>
> s/works/is implemented/

Once you "implement" the new solution for all the machines I listed, it's 
going to be just as bad, if not worse, than current perfmon3.

> So much for constructive critisism.. have you tried taking the design to
> its limits, if so, where do you see problems?

I have a currently working solution in perfmon3.
I need a pretty strong reason to abandon that.

> I read the above as: I invested a lot of time in something of dubious
> statue (out of tree patch), and now expect it to be merged because I
> have invested in it.

perfmon has been around for years.  It's even been in the kernel (in 
Itanium form) for years.  The perfmon patchset has been posted numerous 
times for review to the linux-kernel list.  It's not like perfmon was some 
sort of secret project sprung on the world at the last minute.

I know the way the Linux kernel development works.  If some other 
performance monitoring implementation does get merged, I will cope and 
move on.  I'm just trying to help avoid a costly mistake.

>> Also, my primary method of using counters is total aggregate count for a
>> single user-space process.
>
> Process, as in single thread, or multi-threaded? I'll assume
> single-thread.

No.  Multi-threaded too.

> I'll argue to disagree, sure such events might not be supported by any
> particular hardware implementation - but the fact that PAPI gives a list
> of 'common' events means that they are, well, common. So unifying them
> between those archs that do implement them seems like a sane choice, no?

No.

I do not use PAPI.  PAPI only supports a small subset of counters.

What is needed is a tool for accessing _all_ performance counters on 
various machines.

What is _not_ needed is pushing PAPI into kernel space.

> The proposal allows for you to specify raw hardware events, so you can
> just totally ignore this part of the abstraction.

If you can do raw events, then that's enough.  There's no need to put some 
sort of abstraction level into the kernel.  That way lies madness if 
you've ever looked at any code that tries to do it.

As others have suggested, check out the P4 PMU documentation.

Vince

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch] Performance Counters for Linux, v3
  2008-12-12 18:01           ` stephane eranian
@ 2008-12-12 19:45             ` Chris Friesen
  2008-12-15 14:50               ` stephane eranian
  2008-12-14 23:13             ` Ingo Molnar
  1 sibling, 1 reply; 52+ messages in thread
From: Chris Friesen @ 2008-12-12 19:45 UTC (permalink / raw)
  To: eranian
  Cc: Peter Zijlstra, Vince Weaver, Ingo Molnar, linux-kernel,
	Thomas Gleixner, Andrew Morton, Eric Dumazet, Robert Richter,
	Arjan van de Veen, Peter Anvin, Paul Mackerras, David S. Miller

stephane eranian wrote:

> What happens in the following test case:
> 
>    - 2-way system (cpu0, cpu1)
> 
>    - on cpu0, two processes P1, P2, each self-monitoring and counting event E1.
>      Event E1 can only be measured on counter C1.
> 
>    - on cpu1, there is a cpu-wide session, monitoring event E1, thus using C1
> 
>    - the scheduler decides to migrate P1 onto CPU1. You now have a
> conflict on C1.
> 
> How is this managed?

Prevent the load balancer from moving P1 onto cpu1?

Chris

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch] Performance Counters for Linux, v3
  2008-12-12 17:42         ` stephane eranian
  2008-12-12 18:01           ` stephane eranian
@ 2008-12-13 11:17           ` Peter Zijlstra
  2008-12-13 13:48             ` Henrique de Moraes Holschuh
                               ` (2 more replies)
  1 sibling, 3 replies; 52+ messages in thread
From: Peter Zijlstra @ 2008-12-13 11:17 UTC (permalink / raw)
  To: eranian
  Cc: Vince Weaver, Ingo Molnar, linux-kernel, Thomas Gleixner,
	Andrew Morton, Eric Dumazet, Robert Richter, Arjan van de Veen,
	Peter Anvin, Paul Mackerras, David S. Miller

On Fri, 2008-12-12 at 18:42 +0100, stephane eranian wrote:
> In fact, I know tools which do not even need a library. 

By your own account, the problem solved by libperfmon is a hard problem
(and I fully understand that).

Now you say there is software out there that doesn't use libperfmon,
that means they'll have to duplicate that functionality.

And only commercial software has a clear gain by wastefully duplicating
that effort. This means there is an active commercial interest to not
make perfmon the best technical solution there is, which is contrary to
the very thing Linux is about.

What is worse, you defend that:

> Go ask end-users what they think of that?
> 
> You don't even need a library. All of this could be integrated into the tool.
> New processor, just go download the updated version of the tool.

No! What people want is their problem fixed - no matter how. That is one
of the powers of FOSS: you can fix your problems in any way suitable.

Would it not be much better if those folks duped into using a binary
only product only had to upgrade their FOSS kernel, instead of possibly
forking over more $$$ for an upgrade?

You have just irrevocably proven to me this needs to go into the kernel,
as the design of perfmon is little more than a GPL circumvention device
- independent of whether you are aware of that or not.

For that I hereby fully NAK perfmon

Nacked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>




^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch] Performance Counters for Linux, v3
  2008-12-13 11:17           ` Peter Zijlstra
@ 2008-12-13 13:48             ` Henrique de Moraes Holschuh
  2008-12-13 17:44             ` stephane eranian
  2008-12-14  1:02             ` Paul Mackerras
  2 siblings, 0 replies; 52+ messages in thread
From: Henrique de Moraes Holschuh @ 2008-12-13 13:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: eranian, Vince Weaver, Ingo Molnar, linux-kernel,
	Thomas Gleixner, Andrew Morton, Eric Dumazet, Robert Richter,
	Arjan van de Veen, Peter Anvin, Paul Mackerras, David S. Miller

On Sat, 13 Dec 2008, Peter Zijlstra wrote:
> You have just irrevocably proven to me this needs to go into the kernel,
> as the design of perfmon is little more than a GPL circumvention device
> - independent of whether you are aware of that or not.

As long as it uses some sort of "module plugin" approach, perhaps coupled to
the firmware loader system to avoid wasting a ton of space with tables for
processors other than the one you need...  you could just move all of the
hardware-related parts of perfmon lib into the kernel.

That would close the doors to non-GPL badness.

-- 
  "One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie." -- The Silicon Valley Tarot
  Henrique Holschuh

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch] Performance Counters for Linux, v3
  2008-12-13 11:17           ` Peter Zijlstra
  2008-12-13 13:48             ` Henrique de Moraes Holschuh
@ 2008-12-13 17:44             ` stephane eranian
  2008-12-14  1:02             ` Paul Mackerras
  2 siblings, 0 replies; 52+ messages in thread
From: stephane eranian @ 2008-12-13 17:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vince Weaver, Ingo Molnar, linux-kernel, Thomas Gleixner,
	Andrew Morton, Eric Dumazet, Robert Richter, Arjan van de Veen,
	Peter Anvin, Paul Mackerras, David S. Miller, Papi

Peter,

I don't think you understand what libpfm actually does and therefore
you rush to the wrong conclusion.

At its core, libpfm does NOT know anything about the perfmon kernel API.

I think you missed that, unfortunately.

It is a helper library which helps tool writers solve the event -> code
-> counter assignment problem. That's it. It does not make any perfmon
syscall at ALL to do that. The proof is that people have been using it on
Windows; I can also use it on MacOS.

Looking at your proposal, you think you won't need such a library and that
the kernel is going to do all this for you. Let's go back to your
kerneltop program:

KernelTop Options (up to 4 event types can be specified):

 -e EID    --event_id=EID     # event type ID                     [default:  0]
                                  0: CPU cycles
                                  1: instructions
                                  2: cache accesses
                                  3: cache misses
                                  4: branch instructions
                                  5: branch prediction misses
                                < 0: raw CPU events

Looks like I can do:

$ kerneltop --event_id=-0x510088

You think users are going to come up with 0x510088 out of the blue?

I want to say:

$ kerneltop --event_id=BR_INST_EXEC --plm=user

Where do you think they are going to get that from?

The kernel or a helper user library?

Do not denigrate other people's software without understanding what it does.
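
For illustration, what such a user-level helper carries is roughly a
name -> raw code mapping like the sketch below; the structure is made up,
and the code is just the one from the kerneltop example above:

  /* sketch of an event-name -> raw-code table, as a helper library or a
   * tool would carry it (illustrative only) */
  struct event_def {
          const char      *name;
          long long        raw_code;      /* value passed via --event_id */
  };

  static const struct event_def events[] = {
          { "BR_INST_EXEC", 0x510088 },   /* the code used above */
          /* ... one entry per event of the target PMU ... */
  };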


On Sat, Dec 13, 2008 at 12:17 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Fri, 2008-12-12 at 18:42 +0100, stephane eranian wrote:
>> In fact, I know tools which do not even need a library.
>
> By your own saying, the problem solved by libperfmon is a hard problem
> (and I fully understand that).
>
> Now you say there is software out there that doesn't use libperfmon,
> that means they'll have to duplicate that functionality.
>
> And only commercial software has a clear gain by wastefully duplicating
> that effort. This means there is an active commercial interest to not
> make perfmon the best technical solution there is, which is contrary to
> the very thing Linux is about.
>
> What is worse, you defend that:
>
>> Go ask end-users what they think of that?
>>
>> You don't even need a library. All of this could be integrated into the tool.
>> New processor, just go download the updated version of the tool.
>
> No! what people want is their problem fixed - no matter how. That is one
> of the powers of FOSS, you can fix your problems in any way suitable.
>
> Would it not be much better if those folks duped into using a binary
> only product only had to upgrade their FOSS kernel, instead of possibly
> forking over more $$$ for an upgrade?
>
> You have just irrevocably proven to me this needs to go into the kernel,
> as the design of perfmon is little more than a GPL circumvention device
> - independent of whether you are aware of that or not.
>
> For that I hereby fully NAK perfmon
>
> Nacked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
>
>
>
>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch] Performance Counters for Linux, v3
  2008-12-13 11:17           ` Peter Zijlstra
  2008-12-13 13:48             ` Henrique de Moraes Holschuh
  2008-12-13 17:44             ` stephane eranian
@ 2008-12-14  1:02             ` Paul Mackerras
  2008-12-14 22:37               ` Ingo Molnar
  2 siblings, 1 reply; 52+ messages in thread
From: Paul Mackerras @ 2008-12-14  1:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: eranian, Vince Weaver, Ingo Molnar, linux-kernel,
	Thomas Gleixner, Andrew Morton, Eric Dumazet, Robert Richter,
	Arjan van de Veen, Peter Anvin, David S. Miller

Peter Zijlstra writes:

> On Fri, 2008-12-12 at 18:42 +0100, stephane eranian wrote:
> > In fact, I know tools which do not even need a library. 
> 
> By your own saying, the problem solved by libperfmon is a hard problem
> (and I fully understand that).
> 
> Now you say there is software out there that doesn't use libperfmon,
> that means they'll have to duplicate that functionality.
> 
> And only commercial software has a clear gain by wastefully duplicating
> that effort. This means there is an active commercial interest to not
> make perfmon the best technical solution there is, which is contrary to
> the very thing Linux is about.
> 
> What is worse, you defend that:
> 
> > Go ask end-users what they think of that?
> > 
> > You don't even need a library. All of this could be integrated into the tool.
> > New processor, just go download the updated version of the tool.
> 
> No! what people want is their problem fixed - no matter how. That is one
> of the powers of FOSS, you can fix your problems in any way suitable.
> 
> Would it not be much better if those folks duped into using a binary
> only product only had to upgrade their FOSS kernel, instead of possibly
> forking over more $$$ for an upgrade?
> 
> You have just irrevocably proven to me this needs to go into the kernel,
> as the design of perfmon is little more than a GPL circumvention device
> - independent of whether you are aware of that or not.

I'm sorry, but that is a pretty silly argument.

By that logic, the kernel module loader should include an in-kernel
copy of gcc and binutils, and the fact that it doesn't proves that the
module loader is little more than a GPL circumvention device -
independent of whether you are aware of that or not.  8-)

Paul.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Performance counter API review was [patch] Performance Counters for Linux, v3
  2008-12-11 15:52 [patch] Performance Counters for Linux, v3 Ingo Molnar
                   ` (2 preceding siblings ...)
  2008-12-11 19:11 ` Tony Luck
@ 2008-12-14 14:51 ` Andi Kleen
  2009-02-02 20:03   ` Corey Ashford
  3 siblings, 1 reply; 52+ messages in thread
From: Andi Kleen @ 2008-12-14 14:51 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Thomas Gleixner, Andrew Morton, Stephane Eranian,
	Eric Dumazet, Robert Richter, Arjan van de Veen, Peter Anvin,
	Peter Zijlstra, Paul Mackerras, David S. Miller

Ingo Molnar <mingo@elte.hu> writes:

Here are some comments from my (mostly x86) perspective on the interface.
I'm focusing on the interface only, not the code.

- There was a lot of discussion about counter assignment. But an event
actually needs much more meta data than just the counter assignments.
For example here's an event-set out of the upcoming Core i7 oprofile
events file:

event:0xC3 counters:0,1,2,3 um:machine_clears minimum:6000 name:machine_clears : Counts the cycles machine clear is asserted. 

and the associated sub unit masks:

name:machine_clears type:bitmask default:0x01
        0x01 cycles Counts the cycles machine clear is asserted
        0x02 mem_order Counts the number of machine clears due to memory order conflicts
        0x04 smc Counts the number of times that a program writes to a code section
        0x10 fusion_assist Counts the number of macro-fusion assists 


As you can see there is a lot of meta data in there, and to my knowledge
none of it is really optional. For example, without the name and the
description it's pretty much impossible to use the event (in fact even
with the description it is often hard enough to figure out what it means).
I think every non-trivial perfctr user front end will need a way to query
name and description. Where should they be stored?

Then the minimum overflow period is needed (see below).

Counter assignment is needed as discussed earlier: there are some events
that can only go to specific counters, and then there are complications
like fixed event counters and uncore events in separate registers.

Then there is the concept of unit masks, which define the sub-events.
Right now it is not clear how unit masks are specified alongside the
single event number. Unit masks are also complicated because they are
sometimes masks (you can OR them together) and sometimes enumerations (you
can't). To make good use of them the software needs to know the
difference.

So these all need to be somewhere. I assume the right place is not the
kernel. I don't think it would be a good idea to duplicate all of this in
every application. So some user space library is needed anyway.
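
As a sketch, the meta data listed above maps to something like the
following user-space record - the field names are made up, the values come
from the machine_clears example:

  /* rough sketch of a user-space event description record */
  struct unit_mask_desc {
          unsigned int     value;         /* 0x01, 0x02, 0x04, 0x10      */
          const char      *name;          /* "cycles", "mem_order", ...  */
          const char      *desc;
  };

  struct event_desc {
          unsigned int     event;         /* 0xC3                        */
          unsigned int     counter_mask;  /* counters:0,1,2,3 -> 0x0f    */
          unsigned int     min_period;    /* minimum:6000                */
          int              um_is_bitmask; /* bitmask vs. enumeration     */
          const char      *name;          /* "machine_clears"            */
          const char      *desc;
          const struct unit_mask_desc *unit_masks;
  };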

- All the event meta data should ideally be stored in a single place,
otherwise there is a risk of it getting out of sync. Events are relatively
often updated (even during a CPU's life-cycle, when an event is found to be
buggy), so a smooth upgrade procedure is crucial.

- There doesn't seem to be a way to enforce minimum overflow periods.
It's also pretty easy to hang a system by programming too short an
overflow period for a commonly encountered event. For example, if you
program a counter to trigger an NMI every hundred cycles then the system
will not do much useful work anymore.

This might even be a security hazard because the interface is available
to non-root. Solving that one would actually argue for putting at least
some knowledge into the kernel, or for always enforcing a minimum safe
period.

The minimum safe period has the problem that it might break some useful
tracing setups on low frequency events, where it can be quite useful to
sample on each event. But on a common event that's a really bad idea. So
it probably needs per-event information.

Hard problem. oprofile avoids it by only allowing root to configure events.

[btw i'm not sure perfmon3 has solved that one either]
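
A kernel-side clamp along the lines discussed above might look like the
sketch below - the constant, the helper name and the per-event minimum are
assumptions, not something the posted patch does:

  #include <stdint.h>

  /* sketch of a minimum overflow-period clamp (made-up names/values) */
  #define PERF_MIN_IRQ_PERIOD     1000ULL

  static uint64_t clamp_irq_period(uint64_t requested, uint64_t event_min)
  {
          /* use a per-event minimum if known, else a global floor */
          uint64_t min = event_min ? event_min : PERF_MIN_IRQ_PERIOD;

          return requested < min ? min : requested;
  }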

- Split of event into event and unit mask
On x86, events consist of an event number and a unit mask (which can
sometimes be an enumeration, not a mask). It's unclear right now how the
unit mask is specified in the perfctr structure. While both could be
encoded in the type field, that would be clumsy, requiring special macros.
So likely it needs a separate field.

- PEBS/Debug Store

Intel/x86 has support for letting the CPU directly log events into a memory
ring buffer with some additional information like register contents.  At
first look this could be supported with additional record types. One issue
there is that the record layout is not architectural and varies between
CPUs. Getting a nice general API out of that might be tricky. Would each
new CPU need a new record type?

Processing PEBS records is also moderately performance critical
(and they can be quite big), so it would be a good idea to have some way
to process them without copying.

Another issue is that you need to specify the buffer size/overflow threshold
somewhere. Right now there is no way in the API to do that (and the
existing syscall already has quite a lot of arguments). So PEBS would
likely need a new syscall?

- Additional bits. x86 has some more flag bits in the perfctr
registers, like edge triggering or counter inversion. Right now there
doesn't seem to be any way to specify those in the syscall. There are
some events (especially when multiple events are counted together)
which can only be counted by setting those bits, so this likely needs
to be controllable by the application.

I suppose adding new fields to perf_counter_hw_event would be possible.
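
For what it's worth, those bits map naturally onto a few more bit-fields;
a sketch (names invented here, mirroring the x86 PERFEVTSEL fields), which
would also cover the ring-level and AnyThread bits mentioned further down:

#include <stdint.h>

/* sketch of additional per-event control bits */
struct hw_event_flags_sketch {
        uint32_t        usr   : 1;      /* count in ring 3 */
        uint32_t        os    : 1;      /* count in ring 0 */
        uint32_t        edge  : 1;      /* edge detect */
        uint32_t        inv   : 1;      /* invert the counter-mask comparison */
        uint32_t        any   : 1;      /* count both SMT threads (arch perfmon v3) */
        uint32_t        cmask : 8;      /* counter mask / threshold */
};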

- It's unclear to me why the API has a special NMI mode. To me it looks
like NMIs, if implemented, should simply be the default:
if you have NMI events, why would you ever not use them?
The only exception I can think of would be if the system is known
to have NMI problems in the BIOS, like some ThinkPads. In that case
it shouldn't be per-syscall/user controlled though, but some global
root-only knob (ideally set automatically).

- Global tracing. Right now there seem to be two modes: per task and
per CPU. But a common variant is global tracing of all CPUs. While this
could in theory be done right now by attaching to each CPU (sketched
below), this has the problem that it doesn't interact very well with CPU
hot plug. The application would need to poll for additional/lost
CPUs somehow and then re-attach to them (or detach). This would
likely be quite clumsy and slow. It would be better if the kernel
supported that directly.

An alternative here is to do nothing and keep oprofile for that job
(which it doesn't do that badly).
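
To make the clumsiness concrete: with the current interface a tool would
have to do roughly the following (a sketch using the v3 syscall wrapper
from kerneltop.c; error handling omitted, and it is exactly the hot plug
part that has no good answer):

        /* sketch: system-wide counting by attaching one counter per CPU */
        int fd[MAX_NR_CPUS];
        int nr_cpus = sysconf(_SC_NPROCESSORS_ONLN);
        int cpu;

        for (cpu = 0; cpu < nr_cpus; cpu++)
                fd[cpu] = perf_counter_open(&hw_event, -1 /* pid */, cpu, -1 /* group_fd */);

        /* ... and redo this whole dance whenever a CPU comes or goes */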

- Ring 3 vs ring 0.
x86 supports counting only user space or only kernel space. Right 
now there is no way to specify that in the syscall interface.
I suppose adding a new field to perf_counter_hw_event would be possible.

- SMT support
Sometimes you want to count events caused by both SMT siblings.
For example, this is useful when measuring a multi-threaded
application that uses both threads, where you want to see the
shared cache events of both.
In arch perfmon v3 there is a new perfctr "AnyThread" bit 
that controls this.  It needs to be exposed.

- In general the SMT and shared resource semantics seem to be a
bit unclear at the moment. Some clarification of that would be good.
What happens when the resource is not available? What are
the reservation semantics?

- Uncore monitoring
Nehalem has some additional performance counters in the Uncore
which count specific uncore events.  They have slightly different
semantics and additional registers (like an opcode filter).
It's unclear how they would be programmed in this API.

Also the shared resource problem applies. An uncore is shared
by multiple cores/threads on a socket, so neither a CPU number nor
a pid is particularly useful to address it.

- RDPMC self monitoring
x86 supports reading performance counters from user space
using the RDPMC instruction. I find that rather useful
as a replacement for RDTSC because it allows counting
real cycles using one of the fixed performance counters.

One problem is that it needs to be explicitly enabled and also
controlled, because it always exposes information from
all performance counters (which could be an information
leak). So ideally it needs to cooperate with the kernel,
allow setting up suitable counters for its own use, and also
make sure that counters do not leak information on context
switch. There should be some way in the API to specify that.
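
As an illustration of why this is attractive, the user space side is tiny
(a sketch; it assumes the kernel has set CR4.PCE, and on CPUs with fixed
counters index 0x40000001 selects the unhalted core cycles counter):

#include <stdint.h>

/* read a performance counter directly from user space; ecx selects the
 * counter, with bit 30 set it addresses the fixed-function counters */
static inline uint64_t rdpmc(uint32_t counter)
{
        uint32_t lo, hi;

        asm volatile("rdpmc" : "=a" (lo), "=d" (hi) : "c" (counter));
        return lo | ((uint64_t)hi << 32);
}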

-Andi

-- 
ak@linux.intel.com

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch] Performance Counters for Linux, v3
  2008-12-14  1:02             ` Paul Mackerras
@ 2008-12-14 22:37               ` Ingo Molnar
  2008-12-15  0:50                 ` Paul Mackerras
  0 siblings, 1 reply; 52+ messages in thread
From: Ingo Molnar @ 2008-12-14 22:37 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Peter Zijlstra, eranian, Vince Weaver, linux-kernel,
	Thomas Gleixner, Andrew Morton, Eric Dumazet, Robert Richter,
	Arjan van de Veen, Peter Anvin, David S. Miller


* Paul Mackerras <paulus@samba.org> wrote:

> Peter Zijlstra writes:
> 
> > On Fri, 2008-12-12 at 18:42 +0100, stephane eranian wrote:
> > > In fact, I know tools which do not even need a library. 
> > 
> > By your own saying, the problem solved by libperfmon is a hard problem
> > (and I fully understand that).
> > 
> > Now you say there is software out there that doesn't use libperfmon,
> > that means they'll have to duplicate that functionality.
> > 
> > And only commercial software has a clear gain by wastefully duplicating
> > that effort. This means there is an active commercial interest to not
> > make perfmon the best technical solution there is, which is contrary to
> > the very thing Linux is about.
> > 
> > What is worse, you defend that:
> > 
> > > Go ask end-users what they think of that?
> > > 
> > > You don't even need a library. All of this could be integrated into the tool.
> > > New processor, just go download the updated version of the tool.
> > 
> > No! what people want is their problem fixed - no matter how. That is one
> > of the powers of FOSS, you can fix your problems in any way suitable.
> > 
> > Would it not be much better if those folks duped into using a binary
> > only product only had to upgrade their FOSS kernel, instead of possibly
> > forking over more $$$ for an upgrade?
> > 
> > You have just irrevocably proven to me this needs to go into the kernel,
> > as the design of perfmon is little more than a GPL circumvention device
> > - independent of whether you are aware of that or not.
> 
> I'm sorry, but that is a pretty silly argument.
> 
> By that logic, the kernel module loader should include an in-kernel copy 
> of gcc and binutils, and the fact that it doesn't proves that the module 
> loader is little more than a GPL circumvention device - independent of 
> whether you are aware of that or not.  8-)

i'm not sure how your example applies: the kernel module loader is not an 
application that needs to be updated to new versions of syscalls. Nor is 
it a needless duplication of infrastructure - it runs in a completely 
different protection domain - just to name one of the key differences.

Applications going to complex raw syscalls and avoiding a neutral hw 
infrastructure library that implements a non-trivial job is quite typical 
for FOSS-library-shy bin-only apps. The "you cannot infringe what you do 
not link to at all" kind of defensive thinking.

	Ingo

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch] Performance Counters for Linux, v3
  2008-12-12 18:01           ` stephane eranian
  2008-12-12 19:45             ` Chris Friesen
@ 2008-12-14 23:13             ` Ingo Molnar
  2008-12-15  0:37               ` Paul Mackerras
                                 ` (2 more replies)
  1 sibling, 3 replies; 52+ messages in thread
From: Ingo Molnar @ 2008-12-14 23:13 UTC (permalink / raw)
  To: eranian
  Cc: Peter Zijlstra, Vince Weaver, linux-kernel, Thomas Gleixner,
	Andrew Morton, Eric Dumazet, Robert Richter, Arjan van de Veen,
	Peter Anvin, Paul Mackerras, David S. Miller


* stephane eranian <eranian@googlemail.com> wrote:

> Hi,
> 
> Given the level of abstractions you are using for the API, and given 
> your argument that the kernel can do the HW resource scheduling better 
> than anybody else.
> 
> What happens in the following test case:
> 
>    - 2-way system (cpu0, cpu1)
> 
>    - on cpu0, two processes P1, P2, each self-monitoring and counting event E1.
>      Event E1 can only be measured on counter C1.
> 
>    - on cpu1, there is a cpu-wide session, monitoring event E1, thus using C1
> 
>    - the scheduler decides to migrate P1 onto CPU1. You now have a
>      conflict on C1.
> 
> How is this managed?

If there's a single unit of sharable resource [such as an event counter, 
or a physical CPU], then there's just three main possibilities: either 
user 1 gets it all, or user 2 gets it all, or they share it.

We've implemented the essence of these variants, with sharing the resource 
being the sane default, and with the sysadmin also having a configuration 
vector to reserve the resource to himself permanently. (There could be 
more variations of this.)

What is your point?

	Ingo

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch] Performance Counters for Linux, v3
  2008-12-14 23:13             ` Ingo Molnar
@ 2008-12-15  0:37               ` Paul Mackerras
  2008-12-15 12:58                 ` stephane eranian
  2008-12-15 14:42                 ` stephane eranian
  2008-12-15 20:58               ` stephane eranian
  2008-12-15 22:53               ` Paul Mackerras
  2 siblings, 2 replies; 52+ messages in thread
From: Paul Mackerras @ 2008-12-15  0:37 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: eranian, Peter Zijlstra, Vince Weaver, linux-kernel,
	Thomas Gleixner, Andrew Morton, Eric Dumazet, Robert Richter,
	Arjan van de Veen, Peter Anvin, David S. Miller

Ingo Molnar writes:

> * stephane eranian <eranian@googlemail.com> wrote:
> 
> > Hi,
> > 
> > Given the level of abstractions you are using for the API, and given 
> > your argument that the kernel can do the HW resource scheduling better 
> > than anybody else.
> > 
> > What happens in the following test case:
> > 
> >    - 2-way system (cpu0, cpu1)
> > 
> >    - on cpu0, two processes P1, P2, each self-monitoring and counting event E1.
> >      Event E1 can only be measured on counter C1.
> > 
> >    - on cpu1, there is a cpu-wide session, monitoring event E1, thus using C1
> > 
> >    - the scheduler decides to migrate P1 onto CPU1. You now have a
> >      conflict on C1.
> > 
> > How is this managed?
> 
> If there's a single unit of sharable resource [such as an event counter, 
> or a physical CPU], then there's just three main possibilities: either 
> user 1 gets it all, or user 2 gets it all, or they share it.
> 
> We've implemented the essence of these variants, with sharing the resource 
> being the sane default, and with the sysadmin also having a configuration 
> vector to reserve the resource to himself permanently. (There could be 
> more variations of this.)
> 
> What is your point?

Note that Stephane said *counting* event E1.

One of the important things about counting (as opposed to sampling) is
that it matters whether or not the event is being counted the whole
time or only part of the time.  Thus it puts constraints on counter
scheduling and reporting that don't apply for sampling.

In other words, if I'm counting an event, I want it to be counted all
the time (i.e. whenever the task is executing, for a per-task counter,
or continuously for a per-cpu counter).  If that causes conflicts and
the kernel decides not to count the event for part of the time, that
is very much second-best, and I absolutely need to know that that
happened, and also when the kernel started and stopped counting the
event (so I can scale the result to get some idea what the result
would have been if it had been counted the whole time).
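
(Concretely, the scaling needed is nothing fancier than

	estimated_count = raw_count * time_counting_was_requested / time_actually_counted

computed in user space, which is only possible if the kernel reports both
times alongside the raw count.)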

Now, I haven't digested V4 yet, so you might have already implemented
something like that.  Have you? :)

Paul.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch] Performance Counters for Linux, v3
  2008-12-14 22:37               ` Ingo Molnar
@ 2008-12-15  0:50                 ` Paul Mackerras
  2008-12-15 13:02                   ` stephane eranian
  0 siblings, 1 reply; 52+ messages in thread
From: Paul Mackerras @ 2008-12-15  0:50 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, eranian, Vince Weaver, linux-kernel,
	Thomas Gleixner, Andrew Morton, Eric Dumazet, Robert Richter,
	Arjan van de Veen, Peter Anvin, David S. Miller

Ingo Molnar writes:

> * Paul Mackerras <paulus@samba.org> wrote:
> 
> > Peter Zijlstra writes:
> > 
> > > On Fri, 2008-12-12 at 18:42 +0100, stephane eranian wrote:
> > > > In fact, I know tools which do not even need a library. 
> > > 
> > > By your own saying, the problem solved by libperfmon is a hard problem
> > > (and I fully understand that).
> > > 
> > > Now you say there is software out there that doesn't use libperfmon,
> > > that means they'll have to duplicate that functionality.
> > > 
> > > And only commercial software has a clear gain by wastefully duplicating
> > > that effort. This means there is an active commercial interest to not
> > > make perfmon the best technical solution there is, which is contrary to
> > > the very thing Linux is about.
> > > 
> > > What is worse, you defend that:
> > > 
> > > > Go ask end-users what they think of that?
> > > > 
> > > > You don't even need a library. All of this could be integrated into the tool.
> > > > New processor, just go download the updated version of the tool.
> > > 
> > > No! what people want is their problem fixed - no matter how. That is one
> > > of the powers of FOSS, you can fix your problems in any way suitable.
> > > 
> > > Would it not be much better if those folks duped into using a binary
> > > only product only had to upgrade their FOSS kernel, instead of possibly
> > > forking over more $$$ for an upgrade?
> > > 
> > > You have just irrevocably proven to me this needs to go into the kernel,
> > > as the design of perfmon is little more than a GPL circumvention device
> > > - independent of whether you are aware of that or not.
> > 
> > I'm sorry, but that is a pretty silly argument.
> > 
> > By that logic, the kernel module loader should include an in-kernel copy 
> > of gcc and binutils, and the fact that it doesn't proves that the module 
> > loader is little more than a GPL circumvention device - independent of 
> > whether you are aware of that or not.  8-)
> 
> i'm not sure how your example applies: the kernel module loader is not an 
> application that needs to be updated to new versions of syscalls. Nor is 
> it a needless duplication of infrastructure - it runs in a completely 
> different protection domain - just to name one of the key differences.

Peter's argument was in essence that since using perfmon3 involves some
userspace computation that can be done by proprietary software instead
of a GPL'd library (libpfm), that makes perfmon3 a GPL-circumvention
device.

I was trying to point out that that argument is silly by applying it
to the kernel module loader.  There the userspace component is gcc and
binutils, and the computation they do can be done alternatively by
proprietary software such as icc or xlc.  That of itself doesn't make
the module loader a GPL-circumvention device (though it may be for
other reasons).

And if the argument is silly in that case (which it is), it is even
more silly in the case of perfmon3, where what is being computed and
passed to the kernel is just a few register values, not instructions.

> Applications going to complex raw syscalls and avoiding a neutral hw 
> infrastructure library that implements a non-trivial job is quite typical 
> for FOSS-library-shy bin-only apps. The "you cannot infringe what you do 
> not link to at all" kind of defensive thinking.

FOSS is about freedom - we don't force anyone to use our code.  If
someone wants to use their own code instead of glibc or libpfm on the
user-space side of the syscall interface, that's fine.

Paul.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch] Performance Counters for Linux, v3
  2008-12-15  0:37               ` Paul Mackerras
@ 2008-12-15 12:58                 ` stephane eranian
  2008-12-15 14:42                 ` stephane eranian
  1 sibling, 0 replies; 52+ messages in thread
From: stephane eranian @ 2008-12-15 12:58 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Ingo Molnar, Peter Zijlstra, Vince Weaver, linux-kernel,
	Thomas Gleixner, Andrew Morton, Eric Dumazet, Robert Richter,
	Arjan van de Veen, Peter Anvin, David S. Miller

Hi,

On Mon, Dec 15, 2008 at 1:37 AM, Paul Mackerras <paulus@samba.org> wrote:
> Ingo Molnar writes:
>
>> * stephane eranian <eranian@googlemail.com> wrote:
>>
>> > Hi,
>> >
>> > Given the level of abstractions you are using for the API, and given
>> > your argument that the kernel can do the HW resource scheduling better
>> > than anybody else.
>> >
>> > What happens in the following test case:
>> >
>> >    - 2-way system (cpu0, cpu1)
>> >
>> >    - on cpu0, two processes P1, P2, each self-monitoring and counting event E1.
>> >      Event E1 can only be measured on counter C1.
>> >
>> >    - on cpu1, there is a cpu-wide session, monitoring event E1, thus using C1
>> >
>> >    - the scheduler decides to migrate P1 onto CPU1. You now have a
>> >      conflict on C1.
>> >
>> > How is this managed?
>>
>> If there's a single unit of sharable resource [such as an event counter,
>> or a physical CPU], then there's just three main possibilities: either
>> user 1 gets it all, or user 2 gets it all, or they share it.
>>
>> We've implemented the essence of these variants, with sharing the resource
>> being the sane default, and with the sysadmin also having a configuration
>> vector to reserve the resource to himself permanently. (There could be
>> more variations of this.)
>>
>> What is your point?
>>
Could you explain what you mean by sharing here?

Are you talking about time multiplexing the counter?

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch] Performance Counters for Linux, v3
  2008-12-15  0:50                 ` Paul Mackerras
@ 2008-12-15 13:02                   ` stephane eranian
  0 siblings, 0 replies; 52+ messages in thread
From: stephane eranian @ 2008-12-15 13:02 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Ingo Molnar, Peter Zijlstra, Vince Weaver, linux-kernel,
	Thomas Gleixner, Andrew Morton, Eric Dumazet, Robert Richter,
	Arjan van de Veen, Peter Anvin, David S. Miller

Hi,

On Mon, Dec 15, 2008 at 1:50 AM, Paul Mackerras <paulus@samba.org> wrote:

> FOSS is about freedom - we don't force anyone to use our code.  If
> someone wants to use their own code instead of glibc or libpfm on the
> user-space side of the syscall interface, that's fine.
>
Exactly right!

That was exactly my point when I said, you are free to not use libpfm
in your tool.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch] Performance Counters for Linux, v3
  2008-12-15  0:37               ` Paul Mackerras
  2008-12-15 12:58                 ` stephane eranian
@ 2008-12-15 14:42                 ` stephane eranian
  1 sibling, 0 replies; 52+ messages in thread
From: stephane eranian @ 2008-12-15 14:42 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Ingo Molnar, Peter Zijlstra, Vince Weaver, linux-kernel,
	Thomas Gleixner, Andrew Morton, Eric Dumazet, Robert Richter,
	Arjan van de Veen, Peter Anvin, David S. Miller

Hi,

On Mon, Dec 15, 2008 at 1:37 AM, Paul Mackerras <paulus@samba.org> wrote:
> Ingo Molnar writes:
>
>> * stephane eranian <eranian@googlemail.com> wrote:
>>
>> > Hi,
>> >
>> > Given the level of abstractions you are using for the API, and given
>> > your argument that the kernel can do the HW resource scheduling better
>> > than anybody else.
>> >
>> > What happens in the following test case:
>> >
>> >    - 2-way system (cpu0, cpu1)
>> >
>> >    - on cpu0, two processes P1, P2, each self-monitoring and counting event E1.
>> >      Event E1 can only be measured on counter C1.
>> >
>> >    - on cpu1, there is a cpu-wide session, monitoring event E1, thus using C1
>> >
>> >    - the scheduler decides to migrate P1 onto CPU1. You now have a
>> >      conflict on C1.
>> >
>> > How is this managed?
>>
>> If there's a single unit of sharable resource [such as an event counter,
>> or a physical CPU], then there's just three main possibilities: either
>> user 1 gets it all, or user 2 gets it all, or they share it.
>>
>> We've implemented the essence of these variants, with sharing the resource
>> being the sane default, and with the sysadmin also having a configuration
>> vector to reserve the resource to himself permanently. (There could be
>> more variations of this.)
>>
>> What is your point?
>
> Note that Stephane said *counting* event E1.
>
> One of the important things about counting (as opposed to sampling) is
> that it matters whether or not the event is being counted the whole
> time or only part of the time.  Thus it puts constraints on counter
> scheduling and reporting that don't apply for sampling.
>
Paul is right.

> In other words, if I'm counting an event, I want it to be counted all
> the time (i.e. whenever the task is executing, for a per-task counter,
> or continuously for a per-cpu counter).  If that causes conflicts and
> the kernel decides not to count the event for part of the time, that
> is very much second-best, and I absolutely need to know that that
> happened, and also when the kernel started and stopped counting the
> event (so I can scale the result to get some idea what the result
> would have been if it had been counted the whole time).
>
That is very true.

You cannot multiplex events onto counters without applications knowing.
They need to know how long each 'set' has been active. This is needed
to scale the results. This is especially true for cpu-wide measurements.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch] Performance Counters for Linux, v3
  2008-12-12 19:45             ` Chris Friesen
@ 2008-12-15 14:50               ` stephane eranian
  2008-12-15 22:32                 ` Chris Friesen
  0 siblings, 1 reply; 52+ messages in thread
From: stephane eranian @ 2008-12-15 14:50 UTC (permalink / raw)
  To: Chris Friesen
  Cc: Peter Zijlstra, Vince Weaver, Ingo Molnar, linux-kernel,
	Thomas Gleixner, Andrew Morton, Eric Dumazet, Robert Richter,
	Arjan van de Veen, Peter Anvin, Paul Mackerras, David S. Miller

On Fri, Dec 12, 2008 at 8:45 PM, Chris Friesen <cfriesen@nortel.com> wrote:
> stephane eranian wrote:
>
>> What happens in the following test case:
>>
>>   - 2-way system (cpu0, cpu1)
>>
>>   - on cpu0, two processes P1, P2, each self-monitoring and counting event
>> E1.
>>     Event E1 can only be measured on counter C1.
>>
>>   - on cpu1, there is a cpu-wide session, monitoring event E1, thus using
>> C1
>>
>>   - the scheduler decides to migrate P1 onto CPU1. You now have a
>> conflict on C1.
>>
>> How is this managed?
>
> Prevent the load balancer from moving P1 onto cpu1?
>
You don't want to do that.

There was a reason why the scheduler decided to move the task.
Now, because of monitoring you would change the behavior of the task
and scheduler.
Monitoring should be unintrusive. You want the task/scheduler to
behave as if no monitoring
was present otherwise what is it you are actually measuring?

Changing or forcing the affinity because of monitoring is also a bad
idea, for the same reason.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch] Performance Counters for Linux, v3
  2008-12-14 23:13             ` Ingo Molnar
  2008-12-15  0:37               ` Paul Mackerras
@ 2008-12-15 20:58               ` stephane eranian
  2008-12-15 22:53               ` Paul Mackerras
  2 siblings, 0 replies; 52+ messages in thread
From: stephane eranian @ 2008-12-15 20:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Vince Weaver, linux-kernel, Thomas Gleixner,
	Andrew Morton, Eric Dumazet, Robert Richter, Arjan van de Veen,
	Peter Anvin, Paul Mackerras, David S. Miller

Hi,

On Mon, Dec 15, 2008 at 12:13 AM, Ingo Molnar <mingo@elte.hu> wrote:
> We've implemented the essence of these variants, with sharing the resource
> being the sane default, and with the sysadmin also having a configuration
> vector to reserve the resource to himself permanently. (There could be
> more variations of this.)
>
Reading the v4 code, it does not appear the sysadmin can specify which
resource to reserve. The current code reserves a number of counters.
This is problematic with hardware where not all counters can measure
everything, or when not all PMU registers are counters.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch] Performance Counters for Linux, v3
  2008-12-15 14:50               ` stephane eranian
@ 2008-12-15 22:32                 ` Chris Friesen
  2008-12-17  7:45                   ` stephane eranian
  0 siblings, 1 reply; 52+ messages in thread
From: Chris Friesen @ 2008-12-15 22:32 UTC (permalink / raw)
  To: eranian
  Cc: Peter Zijlstra, Vince Weaver, Ingo Molnar, linux-kernel,
	Thomas Gleixner, Andrew Morton, Eric Dumazet, Robert Richter,
	Arjan van de Veen, Peter Anvin, Paul Mackerras, David S. Miller

stephane eranian wrote:
> On Fri, Dec 12, 2008 at 8:45 PM, Chris Friesen <cfriesen@nortel.com> wrote:
> 
>>stephane eranian wrote:
>>
>>
>>>What happens in the following test case:
>>>
>>>  - 2-way system (cpu0, cpu1)
>>>
>>>  - on cpu0, two processes P1, P2, each self-monitoring and counting event
>>>E1.
>>>    Event E1 can only be measured on counter C1.
>>>
>>>  - on cpu1, there is a cpu-wide session, monitoring event E1, thus using
>>>C1
>>>
>>>  - the scheduler decides to migrate P1 onto CPU1. You now have a
>>>conflict on C1.
>>>
>>>How is this managed?
>>
>>Prevent the load balancer from moving P1 onto cpu1?
>>
> 
> You don't want to do that.
> 
> There was a reason why the scheduler decided to move the task.
> Now, because of monitoring you would change the behavior of the task
> and scheduler.
> Monitoring should be unintrusive. You want the task/scheduler to
> behave as if no monitoring
> was present otherwise what is it you are actually measuring?

In a scenario where the system physically cannot gather the desired data 
without influencing the behaviour of the program, I see two options:

1) limit the behaviour of the system to ensure that we can gather the 
performance monitoring data as specified

2) limit the performance monitoring to minimize any influence on the 
program, and report the fact that performance monitoring was limited.

You've indicated that you don't want option 1, so I assume that you 
prefer option 2.  In the above scenario, how would _you_ handle it?


Chris


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch] Performance Counters for Linux, v3
  2008-12-14 23:13             ` Ingo Molnar
  2008-12-15  0:37               ` Paul Mackerras
  2008-12-15 20:58               ` stephane eranian
@ 2008-12-15 22:53               ` Paul Mackerras
  2 siblings, 0 replies; 52+ messages in thread
From: Paul Mackerras @ 2008-12-15 22:53 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: eranian, Peter Zijlstra, Vince Weaver, linux-kernel,
	Thomas Gleixner, Andrew Morton, Eric Dumazet, Robert Richter,
	Arjan van de Veen, Peter Anvin, David S. Miller

Ingo Molnar writes:

> If there's a single unit of sharable resource [such as an event counter, 
> or a physical CPU], then there's just three main possibilities: either 
> user 1 gets it all, or user 2 gets it all, or they share it.
> 
> We've implemented the essence of these variants, with sharing the resource 
> being the sane default, and with the sysadmin also having a configuration 
> vector to reserve the resource to himself permanently. (There could be 
> more variations of this.)

Thinking about this a bit more, it seems to me that there is an
unstated assumption that dealing with performance counters is mostly a
scheduling problem - that the hardware resource of a fixed number of
performance counters can be virtualized to provide a larger number of
software counters in much the same way that a fixed number of physical
cpus are virtualized to support a larger number of tasks.

Put another way, your assumption seems to be that software counters
can be transparently time-multiplexed onto the physical counters,
without affecting the end results.  In other words, you assume that
time-multiplexing is a reasonable way to implement sharing of hardware
performance counters, and that users shouldn't have to know or care
that their counters are being time-multiplexed.  Is that an accurate
statement of your belief?

If it is (and the code you've posted seems to indicate that it is)
then you are going to have unhappy users, because counting part of the
time is not at all the same thing as counting all the time.  As just
one example, imagine that the period over which you are counting is
shorter than the counter timeslice period (for example because the
executable you are measuring doesn't run for very long).  If you have
N software counters but only M < N hardware counters, then only the
first M software counters will report anything useful, and the
remaining N - M will report zero!

Sampling, as opposed to counting, may be more tolerant of
time-multiplexing of counters, particularly for long-running programs,
but even there time-multiplexing will affect the results and users
need to know about it.

It seems to me that this assumption is pretty deeply rooted in the
design of your performance counter subsystem, and I'm not sure at this
point what is the best way to fix it.

Paul.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch] Performance Counters for Linux, v3
  2008-12-15 22:32                 ` Chris Friesen
@ 2008-12-17  7:45                   ` stephane eranian
  0 siblings, 0 replies; 52+ messages in thread
From: stephane eranian @ 2008-12-17  7:45 UTC (permalink / raw)
  To: Chris Friesen
  Cc: Peter Zijlstra, Vince Weaver, Ingo Molnar, linux-kernel,
	Thomas Gleixner, Andrew Morton, Eric Dumazet, Robert Richter,
	Arjan van de Veen, Peter Anvin, Paul Mackerras, David S. Miller

On Mon, Dec 15, 2008 at 11:32 PM, Chris Friesen <cfriesen@nortel.com> wrote:
> stephane eranian wrote:
>>
>> On Fri, Dec 12, 2008 at 8:45 PM, Chris Friesen <cfriesen@nortel.com>
>> wrote:
>>
>>> stephane eranian wrote:
>>>
>>>
>>>> What happens in the following test case:
>>>>
>>>>  - 2-way system (cpu0, cpu1)
>>>>
>>>>  - on cpu0, two processes P1, P2, each self-monitoring and counting
>>>> event
>>>> E1.
>>>>   Event E1 can only be measured on counter C1.
>>>>
>>>>  - on cpu1, there is a cpu-wide session, monitoring event E1, thus using
>>>> C1
>>>>
>>>>  - the scheduler decides to migrate P1 onto CPU1. You now have a
>>>> conflict on C1.
>>>>
>>>> How is this managed?
>>>
>>> Prevent the load balancer from moving P1 onto cpu1?
>>>
>>
>> You don't want to do that.
>>
>> There was a reason why the scheduler decided to move the task.
>> Now, because of monitoring you would change the behavior of the task
>> and scheduler.
>> Monitoring should be unintrusive. You want the task/scheduler to
>> behave as if no monitoring
>> was present otherwise what is it you are actually measuring?
>
> In a scenario where the system physically cannot gather the desired data
> without influencing the behaviour of the program, I see two options:
>
> 1) limit the behaviour of the system to ensure that we can gather the
> performance monitoring data as specified
>
> 2) limit the performance monitoring to minimize any influence on the
> program, and report the fact that performance monitoring was limited.
>
> You've indicated that you don't want option 1, so I assume that you prefer
> option 2.  In the above scenario, how would _you_ handle it?
>
That's right, you have to fail monitoring.

In this particular example, it is okay for per-thread sessions to each use C1.
Any cpu-wide session trying to access C1 should fail. Vice versa, if a
cpu-wide session is using C1, then no per-thread session can access it.

Things can get even more complicated than that, even for per-thread sessions.
Some PMU registers may be shared per core, e.g., on Nehalem or Pentium 4. Thus,
if HT is enabled, you also have to fail per-thread sessions, as only one
can grab the resource globally.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Performance counter API review was [patch] Performance Counters for Linux, v3
  2008-12-14 14:51 ` Performance counter API review was " Andi Kleen
@ 2009-02-02 20:03   ` Corey Ashford
  2009-02-02 20:33     ` Peter Zijlstra
  0 siblings, 1 reply; 52+ messages in thread
From: Corey Ashford @ 2009-02-02 20:03 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ingo Molnar, linux-kernel, Thomas Gleixner, Andrew Morton,
	Stephane Eranian, Eric Dumazet, Robert Richter,
	Arjan van de Veen, Peter Anvin, Peter Zijlstra, Paul Mackerras,
	David S. Miller, Maynard Johnson, carll

Andi Kleen wrote:
[snip]
> - Global tracing. Right now there seem to be two modi: per task and
> per CPU. But a common variant is global tracing of all CPUs. While this
> could be in theory done right now by attaching to each CPU
> this has the problem that it doesn't interact very well with CPU
> hot plug. The application would need to poll for additional/lost
> CPUs somehow and then re-attach to them (or detach). This would
> likely be quite clumsy and slow. It would be better if the kernel supported 
> that better.
> 
> Or alternative here is to do nothing and keep oprofile for that job
> (which it doesn't do that badly)
> 

This issue is of particular interest to us, from the IBM Power toolchain 
perspective.

Ingo, do you think it would be feasible to add an ability to open a 
single file descriptor that could give global counting (and sampling) on 
all CPU's?  I realize this would entail creating a context per cpu in 
the kernel.

How to present the count data back to user space is another issue.  For 
example, do you sum the counts of a particular event type across all 
CPUs or do you keep them separate, and have the user space app read them 
up per-cpu  (perhaps not knowing exactly which cpu they come from)?

I realize that perfmon doesn't have this ability either, it's currently 
per-cpu as well for global counting.

But it seems as long as you are going so far as providing a thread 
inheritance feature (which I assume uses a summing approach for 
providing counts back to user space), that this "pan-cpu" counting 
feature might not be too difficult to implement.  It sure would simplify 
the life of user space apps, as Andi said.

-- 
Regards,

- Corey

Corey Ashford
Software Engineer
IBM Linux Technology Center, Linux Toolchain
Beaverton, OR
503-578-3507
cjashfor@us.ibm.com


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Performance counter API review was [patch] Performance Counters for Linux, v3
  2009-02-02 20:03   ` Corey Ashford
@ 2009-02-02 20:33     ` Peter Zijlstra
  2009-02-03 16:53       ` Maynard Johnson
  2009-02-04  2:18       ` Paul Mackerras
  0 siblings, 2 replies; 52+ messages in thread
From: Peter Zijlstra @ 2009-02-02 20:33 UTC (permalink / raw)
  To: Corey Ashford
  Cc: Andi Kleen, Ingo Molnar, linux-kernel, Thomas Gleixner,
	Andrew Morton, Stephane Eranian, Eric Dumazet, Robert Richter,
	Arjan van de Veen, Peter Anvin, Paul Mackerras, David S. Miller,
	Maynard Johnson, carll

On Mon, 2009-02-02 at 12:03 -0800, Corey Ashford wrote:
> Andi Kleen wrote:
> [snip]
> > - Global tracing. Right now there seem to be two modi: per task and
> > per CPU. But a common variant is global tracing of all CPUs. While this
> > could be in theory done right now by attaching to each CPU
> > this has the problem that it doesn't interact very well with CPU
> > hot plug. The application would need to poll for additional/lost
> > CPUs somehow and then re-attach to them (or detach). This would
> > likely be quite clumsy and slow. It would be better if the kernel supported 
> > that better.
> > 
> > Or alternative here is to do nothing and keep oprofile for that job
> > (which it doesn't do that badly)
> > 
> 
> This issue is of particular interest to us, from the IBM Power toolchain 
> perspective.
> 
> Ingo, do you think it would be feasible to add an ability to open a 
> single file descriptor that could give global counting (and sampling) on 
> all CPU's?  I realize this would entail creating a context per cpu in 
> the kernel.
> 
> How to present the count data back to user space is another issue.  For 
> example, do you sum the counts of a particular event type across all 
> CPUs or do you keep them separate, and have the user space app read them 
> up per-cpu  (perhaps not knowing exactly which cpu they come from)?
> 
> I realize that perfmon doesn't have this ability either, it's currently 
> per-cpu as well for global counting.
> 
> But it seems as long as you are going so far as providing a thread 
> inheritance feature (which I assume uses a summing approach for 
> providing counts back to user space), that this "pan-cpu" counting 
> feature might not be too difficult to implement.  It sure would simplify 
> the life of user space apps, as Andi said.

Doing a single fd for all cpus is going to suck chunks because its going
to be a global serialization point.

Also, why would you be profiling while doing a hotplug? Both cpu
profiling, and hotplug, are administrator operations, just don't do
that.

The inheritance thing will also suffer this issue, if you're going to do
reads of your fds at any other point than at the end -- it will have to
walk the whole inheritance tree and sum all the values (or propagate
interrupts up the tree). Which sounds rather expensive.




^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Performance counter API review was [patch] Performance Counters for Linux, v3
  2009-02-02 20:33     ` Peter Zijlstra
@ 2009-02-03 16:53       ` Maynard Johnson
  2009-02-04  2:18       ` Paul Mackerras
  1 sibling, 0 replies; 52+ messages in thread
From: Maynard Johnson @ 2009-02-03 16:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Andi Kleen, Arjan van de Veen, Carl Love,
	Corey Ashford, Eric Dumazet, David S. Miller, Stephane Eranian,
	Peter Anvin, linux-kernel, Ingo Molnar, Paul Mackerras,
	Robert Richter, Thomas Gleixner

Peter Zijlstra <a.p.zijlstra@chello.nl> wrote on 02/02/2009 02:33:01 PM:

> On Mon, 2009-02-02 at 12:03 -0800, Corey Ashford wrote:
> > Andi Kleen wrote:
> > [snip]
> > > - Global tracing. Right now there seem to be two modi: per task and
> > > per CPU. But a common variant is global tracing of all CPUs. While this
> > > could be in theory done right now by attaching to each CPU
> > > this has the problem that it doesn't interact very well with CPU
> > > hot plug. The application would need to poll for additional/lost
> > > CPUs somehow and then re-attach to them (or detach). This would
> > > likely be quite clumsy and slow. It would be better if the kernel supported
> > > that better.
> > >
> > > Or alternative here is to do nothing and keep oprofile for that job
> > > (which it doesn't do that badly)
> > >
> >
> > This issue is of particular interest to us, from the IBM Power toolchain
> > perspective.
> >
> > Ingo, do you think it would be feasible to add an ability to open a
> > single file descriptor that could give global counting (and sampling) on
> > all CPU's?  I realize this would entail creating a context per cpu in
> > the kernel.
> >
> > How to present the count data back to user space is another issue.  For
> > example, do you sum the counts of a particular event type across all
> > CPUs or do you keep them separate, and have the user space app read them
> > up per-cpu  (perhaps not knowing exactly which cpu they come from)?
> >
> > I realize that perfmon doesn't have this ability either, it's currently
> > per-cpu as well for global counting.
> >
> > But it seems as long as you are going so far as providing a thread
> > inheritance feature (which I assume uses a summing approach for
> > providing counts back to user space), that this "pan-cpu" counting
> > feature might not be too difficult to implement.  It sure would simplify
> > the life of user space apps, as Andi said.
>
> Doing a single fd for all cpus is going to suck chunks because its going
> to be a global serialization point.
Right, a single fd is probably not the way to go, since some users are
going to want to see per-cpu counts.  The user tool can do the accumulation
for global counts.  However, expecting the user tool to manage the opening
of per-cpu fds is less than ideal for several reasons, as has already been
stated by others.

I suggest allowing cpu=-1 and pid=-1 to be passed on the perf_counter_open
call (which should require root authority for security reasons -- as does
cpu=<cpu#> and pid=-1).  With such a capability, the OProfile kernel driver
code could be re-written on top of PCL instead of continuing to maintain so
much processor-specific code (which no doubt would duplicate a lot of
processor-specific PCL code).  And, of course, this capability could be
used by other performance tools, as well.
>
> Also, why would you be profiling while doing a hotplug? Both cpu
> profiling, and hotplug, are administrator operations, just don't do
> that.
Surely you jest.  Profiling, an administrator operation?  Maybe on your
laptop.

-Maynard
>
> The inheritance thing will also suffer this issue, if you're going to do
> reads of your fds at any other point than at the end -- it will have to
> walk the whole inheritance tree and sum all the values (or propagate
> interrupts up the tree). Which sounds rather expensive.
>
>
>


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Performance counter API review was [patch] Performance Counters for Linux, v3
  2009-02-02 20:33     ` Peter Zijlstra
  2009-02-03 16:53       ` Maynard Johnson
@ 2009-02-04  2:18       ` Paul Mackerras
  2009-02-04  2:32         ` Nathan Lynch
  2009-02-04  8:45         ` Peter Zijlstra
  1 sibling, 2 replies; 52+ messages in thread
From: Paul Mackerras @ 2009-02-04  2:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Corey Ashford, Andi Kleen, Ingo Molnar, linux-kernel,
	Thomas Gleixner, Andrew Morton, Stephane Eranian, Eric Dumazet,
	Robert Richter, Arjan van de Veen, Peter Anvin, David S. Miller,
	Maynard Johnson, carll

Peter Zijlstra writes:

> Doing a single fd for all cpus is going to suck chunks because its going
> to be a global serialization point.

If we need statistics for the system as a whole (and we do), then the
serialization is going to happen somewhere - the only question is
whether it's in the kernel or in userspace.  I don't see that it needs
to be a _global_ serialization point in either case.  Given that the
kernel has facilities like smp_call_function() available that
userspace doesn't, I think it will end up cleaner to do it in the
kernel.

That's actually a bit independent of whether it should be accessed via
one fd or multiple fds.  One alternative might be to use something
analogous to the counter group concept we have (i.e. multiple fd's,
but have a way to logically join them together).

By the way, how does userspace get to know about cpus being added or
removed?  Is there a better way than continually reading
/sys/devices/system/cpu/online?

> Also, why would you be profiling while doing a hotplug? Both cpu
> profiling, and hotplug, are administrator operations, just don't do
> that.

Performance counters are also used for counting, which by definition
is something that takes place over a period of time, possibly quite a
long time.  It would be annoying to have to stop counting and start a
new count every time we need to plug or unplug a cpu.

> The inheritance thing will also suffer this issue, if you're going to do
> reads of your fds at any other point than at the end -- it will have to
> walk the whole inheritance tree and sum all the values (or propagate
> interrupts up the tree). Which sounds rather expensive.

I'm planning to make that operation (summing over all children) be
something that userspace can request via an ioctl, so userspace gets
to decide when and how often it's worth the expense of doing it.

Paul.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Performance counter API review was [patch] Performance Counters for Linux, v3
  2009-02-04  2:18       ` Paul Mackerras
@ 2009-02-04  2:32         ` Nathan Lynch
  2009-02-04  8:45         ` Peter Zijlstra
  1 sibling, 0 replies; 52+ messages in thread
From: Nathan Lynch @ 2009-02-04  2:32 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Peter Zijlstra, Corey Ashford, Andi Kleen, Ingo Molnar,
	linux-kernel, Thomas Gleixner, Andrew Morton, Stephane Eranian,
	Eric Dumazet, Robert Richter, Arjan van de Veen, Peter Anvin,
	David S. Miller, Maynard Johnson, carll

Paul Mackerras wrote:
>
> By the way, how does userspace get to know about cpus being added or
> removed?  Is there a better way than continually reading
> /sys/devices/system/cpu/online?

The kernel generates uevents for cpu online and offline operations;
you can see them with udevmonitor.  Not sure how you get those events
to an arbitrary application, though.  Alternatively you can set an
inotify watch on e.g. /sys/devices/system/cpu/cpu1/online (at least it
seems to work here).
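
A minimal sketch of the inotify variant, assuming sysfs delivers the
events as described above:

#include <stdio.h>
#include <unistd.h>
#include <sys/inotify.h>

int main(void)
{
        char buf[4096];
        int fd = inotify_init();
        int wd = inotify_add_watch(fd, "/sys/devices/system/cpu/cpu1/online",
                                   IN_MODIFY);

        if (fd < 0 || wd < 0) {
                perror("inotify");
                return 1;
        }
        /* blocks until the attribute is written, e.g. on offline/online */
        read(fd, buf, sizeof(buf));
        printf("cpu1 online state changed\n");
        return 0;
}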

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Performance counter API review was [patch] Performance Counters for Linux, v3
  2009-02-04  2:18       ` Paul Mackerras
  2009-02-04  2:32         ` Nathan Lynch
@ 2009-02-04  8:45         ` Peter Zijlstra
  2009-02-04 10:47           ` Paul Mackerras
  1 sibling, 1 reply; 52+ messages in thread
From: Peter Zijlstra @ 2009-02-04  8:45 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Corey Ashford, Andi Kleen, Ingo Molnar, linux-kernel,
	Thomas Gleixner, Andrew Morton, Stephane Eranian, Eric Dumazet,
	Robert Richter, Arjan van de Veen, Peter Anvin, David S. Miller,
	Maynard Johnson, carll

On Wed, 2009-02-04 at 13:18 +1100, Paul Mackerras wrote:
> Peter Zijlstra writes:
> 
> > Doing a single fd for all cpus is going to suck chunks because its going
> > to be a global serialization point.
> 
> If we need statistics for the system as a whole (and we do), then the
> serialization is going to happen somewhere - the only question is
> whether it's in the kernel or in userspace.  I don't see that it needs
> to be a _global_ serialization point in either case.  Given that the
> kernel has facilities like smp_call_function() available that
> userspace doesn't, I think it will end up cleaner to do it in the
> kernel.

How is smp_call_function() going to help here? You still need to pull
all that data through that one FD. That's a cacheline bounce fest.

Why not collect all this data with per-cpu threads and post-process in
user-space. The processing might even be capable of doing per-cpu
filtering, reducing the amount of data that needs to be merged.

No way that's better done in the kernel.

> That's actually a bit independent of whether it should be accessed via
> one fd or multiple fds.  One alternative might be to use something
> analogous to the counter group concept we have (i.e. multiple fd's,
> but have a way to logically join them together).

Why would you ever want to do that for? Per-cpu channels is a good
thing.

> > Also, why would you be profiling while doing a hotplug? Both cpu
> > profiling, and hotplug, are administrator operations, just don't do
> > that.
> 
> Performance counters are also used for counting, which by definition
> is something that takes place over a period of time, possibly quite a
> long time.  It would be annoying to have to stop counting and start a
> new count every time we need to plug or unplug a cpu.

Well, you need to at least stop/start the cpu to be hot-(un)plugged, no
way around that.

> > The inheritance thing will also suffer this issue, if you're going to do
> > reads of your fds at any other point than at the end -- it will have to
> > walk the whole inheritance tree and sum all the values (or propagate
> > interrupts up the tree). Which sounds rather expensive.
> 
> I'm planning to make that operation (summing over all children) be
> something that userspace can request via an ioctl, so userspace gets
> to decide when and how often it's worth the expense of doing it.

Userspace already has that control, you don't have to read the counter
before you get SIGCHLD.

I'm not seeing how an ioctl will help here, or did you mean a toggle
between:
  - collect the full hierarchy
  - read the currently collected data and don't bother with the
    active kids

Which might be useful.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Performance counter API review was [patch] Performance Counters for Linux, v3
  2009-02-04  8:45         ` Peter Zijlstra
@ 2009-02-04 10:47           ` Paul Mackerras
  2009-02-04 10:51             ` Peter Zijlstra
  0 siblings, 1 reply; 52+ messages in thread
From: Paul Mackerras @ 2009-02-04 10:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Corey Ashford, Andi Kleen, Ingo Molnar, linux-kernel,
	Thomas Gleixner, Andrew Morton, Stephane Eranian, Eric Dumazet,
	Robert Richter, Arjan van de Veen, Peter Anvin, David S. Miller,
	Maynard Johnson, carll

Peter Zijlstra writes:

> How is smp_call_function() going to help here? You still need to pull
> all that data through that one FD. That's a cacheline bounce fest.

Well, let's put this into perspective.  We would be collecting 8 bytes
of data from each CPU.  Hardly a "cacheline bounce fest". :)

> Why not collect all this data with per-cpu threads and post-process in
> user-space. The processing might even be capable of doing per-cpu
> filtering, reducing the amount of data that needs to be merged.
> 
> No way that's better done in the kernel.

Not quite sure why you think there's an enormous volume of data to be
managed...

> > > Also, why would you be profiling while doing a hotplug? Both cpu
> > > profiling, and hotplug, are administrator operations, just don't do
> > > that.
> > 
> > Performance counters are also used for counting, which by definition
> > is something that takes place over a period of time, possibly quite a
> > long time.  It would be annoying to have to stop counting and start a
> > new count every time we need to plug or unplug a cpu.
> 
> Well, you need to at least stop/start the cpu to be hot-(un)plugged, no
> way around that.

It might be worth having the kernel do that automatically, given that
the perfcounters code already has a hotplug notifier routine.
However, I don't think this point is worth debating until we have a
more concrete proposal.

> > I'm planning to make that operation (summing over all children) be
> > something that userspace can request via an ioctl, so userspace gets
> > to decide when and how often it's worth the expense of doing it.
> 
> Userspace already has that control, you don't have to read the counter
> before you get SIGCHLD.
> 
> I'm not seeing how an ioctl will help here, or did you mean a toggle
> between:
>   - collect the full hierarchy
>   - read the currently collected data and don't bother with the
>     active kids

No, I meant an operation that syncs up all the child counters to the
parent so that a subsequent read of the counter immediately afterwards
will get a full total (just by reading the parent counter).  But it
could be implemented as a toggle instead.

Paul.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Performance counter API review was [patch] Performance Counters for Linux, v3
  2009-02-04 10:47           ` Paul Mackerras
@ 2009-02-04 10:51             ` Peter Zijlstra
  0 siblings, 0 replies; 52+ messages in thread
From: Peter Zijlstra @ 2009-02-04 10:51 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Corey Ashford, Andi Kleen, Ingo Molnar, linux-kernel,
	Thomas Gleixner, Andrew Morton, Stephane Eranian, Eric Dumazet,
	Robert Richter, Arjan van de Veen, Peter Anvin, David S. Miller,
	Maynard Johnson, carll

On Wed, 2009-02-04 at 21:47 +1100, Paul Mackerras wrote:
> Peter Zijlstra writes:
> 
> > How is smp_call_function() going to help here? You still need to pull
> > all that data through that one FD. That's a cacheline bounce fest.
> 
> Well, let's put this into perspective.  We would be collecting 8 bytes
> of data from each CPU.  Hardly a "cacheline bounce fest". :)

Ah, I was thinking more of the event triggered profiling, like NMI time,
cachemiss or pagefault profiling.

In those cases you'd get a continuous stream of data for each cpu, at
possibly quite high speeds.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch] Performance Counters for Linux, v3
@ 2008-12-11 22:05 William Cohen
  0 siblings, 0 replies; 52+ messages in thread
From: William Cohen @ 2008-12-11 22:05 UTC (permalink / raw)
  To: Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 542 bytes --]

I was taking a look at the proposed performance monitoring and kerneltop.c. I 
noticed that  http://redhat.com/~mingo/perfcounters/kerneltop.c doesn't work 
with the v3 version. I didn't see a more recent version available, so I made
some modifications to allow it to work with the v3 kernel (attached).
However, I assume somewhere there is an updated version of kerneltop.c.

The Documentation/perf-counters.txt doesn't describe how the group_fd is used. 
I found that -1 is used to indicate that a counter is not connected to any other fd.

-Will

[-- Attachment #2: v3.diff --]
[-- Type: text/x-patch, Size: 2420 bytes --]

--- kerneltop.c.old	2008-12-11 15:34:58.000000000 -0500
+++ kerneltop.c	2008-12-11 16:06:28.000000000 -0500
@@ -62,15 +62,31 @@
 # define __NR_perf_counter_open 333
 #endif
 
+/*
+ * Hardware event to monitor via a performance monitoring counter:
+ */
+struct perf_counter_hw_event {
+	int64_t			type;
+
+	u_int64_t		irq_period;
+	u_int32_t		record_type;
+
+	u_int32_t		disabled     :  1, /* off by default */
+				nmi	     :  1, /* NMI sampling   */
+				raw	     :  1, /* raw event type */
+				__reserved_1 : 29;
+
+	u_int64_t		__reserved_2;
+};
+
 int
-perf_counter_open(int		hw_event_type,
-                  unsigned int	hw_event_period,
-                  unsigned int	record_type,
+perf_counter_open(struct perf_counter_hw_event *hw_event_uptr,
                   pid_t		pid,
-                  int		cpu)
+                  int		cpu,
+		  int		group_fd)
 {
-	return syscall(__NR_perf_counter_open, hw_event_type, hw_event_period,
-			record_type, pid, cpu);
+	return syscall(__NR_perf_counter_open, hw_event_uptr,
+		       pid, cpu, group_fd);
 }
 
 enum hw_event_types {
@@ -82,10 +98,6 @@
 	PERF_COUNT_BRANCH_MISSES,
 	PERF_COUNT_MAX,
 
-	/*
-	 * If this bit is set in the type, then trigger NMI sampling:
-	 */
-	PERF_COUNT_NMI			= (1 << 30),
 };
 
 const char *event_types [] = {
@@ -616,14 +628,14 @@
 {
 	struct pollfd event_array[MAX_NR_CPUS][MAX_COUNTERS];
 	int fd[MAX_NR_CPUS][MAX_COUNTERS];
-	unsigned int nmi_flag = 0;
-	unsigned int flags, cpu;
+	unsigned int cpu;
 	int i, counter;
 	uint64_t ip;
 	ssize_t res;
 #if USE_POLL
 	int ret;
 #endif
+	struct perf_counter_hw_event hw_event;
 
 	process_options(argc, argv);
 
@@ -633,18 +645,18 @@
 
 	assert(nr_cpus <= MAX_NR_CPUS);
 
-	if (nmi)
-		nmi_flag |= PERF_COUNT_NMI;
-
 	for (i = 0; i < nr_cpus; i++) {
 		for (counter = 0; counter < nr_counters; counter++) {
-			flags	= event_id[counter] | nmi_flag;
-
 			cpu	= profile_cpu;
 			if (tid == -1 && profile_cpu == -1)
 				cpu = i;
-
-			fd[i][counter] = perf_counter_open(flags, event_count[counter], 1, tid, cpu);
+			hw_event.type = event_id[counter];
+			hw_event.irq_period = event_count[counter];
+			hw_event.record_type = 1;
+			hw_event.nmi = nmi ? 1 : 0;
+				
+			fd[i][counter] = perf_counter_open(&hw_event, tid,
+							   cpu, -1 );
 			if (fd[i][counter] < 0) {
 				printf("kerneltop error: syscall returned with %d (%s)\n",
 					fd[i][counter], strerror(-fd[i][counter]));

^ permalink raw reply	[flat|nested] 52+ messages in thread

Thread overview: 52+ messages
2008-12-11 15:52 [patch] Performance Counters for Linux, v3 Ingo Molnar
2008-12-11 18:02 ` Vince Weaver
2008-12-12  8:25   ` Peter Zijlstra
2008-12-12  8:35     ` stephane eranian
2008-12-12  8:51       ` Peter Zijlstra
2008-12-12  9:00         ` Peter Zijlstra
2008-12-12  9:07           ` Ingo Molnar
2008-12-12  8:59     ` stephane eranian
2008-12-12  9:23       ` Peter Zijlstra
2008-12-12 10:21         ` Robert Richter
2008-12-12 10:59           ` Christoph Hellwig
2008-12-12 11:35             ` Robert Richter
2008-12-12 16:45         ` Chris Friesen
2008-12-12 17:42         ` stephane eranian
2008-12-12 18:01           ` stephane eranian
2008-12-12 19:45             ` Chris Friesen
2008-12-15 14:50               ` stephane eranian
2008-12-15 22:32                 ` Chris Friesen
2008-12-17  7:45                   ` stephane eranian
2008-12-14 23:13             ` Ingo Molnar
2008-12-15  0:37               ` Paul Mackerras
2008-12-15 12:58                 ` stephane eranian
2008-12-15 14:42                 ` stephane eranian
2008-12-15 20:58               ` stephane eranian
2008-12-15 22:53               ` Paul Mackerras
2008-12-13 11:17           ` Peter Zijlstra
2008-12-13 13:48             ` Henrique de Moraes Holschuh
2008-12-13 17:44             ` stephane eranian
2008-12-14  1:02             ` Paul Mackerras
2008-12-14 22:37               ` Ingo Molnar
2008-12-15  0:50                 ` Paul Mackerras
2008-12-15 13:02                   ` stephane eranian
2008-12-12 17:03     ` Samuel Thibault
2008-12-12 17:11       ` Peter Zijlstra
2008-12-12 18:18     ` Vince Weaver
2008-12-11 18:35 ` Andrew Morton
2008-12-12  6:22   ` Ingo Molnar
2008-12-11 19:11 ` Tony Luck
2008-12-11 19:34   ` Ingo Molnar
2008-12-12  8:29     ` Peter Zijlstra
2008-12-12  8:54       ` Ingo Molnar
2008-12-12 13:42       ` Andi Kleen
2008-12-14 14:51 ` Performance counter API review was " Andi Kleen
2009-02-02 20:03   ` Corey Ashford
2009-02-02 20:33     ` Peter Zijlstra
2009-02-03 16:53       ` Maynard Johnson
2009-02-04  2:18       ` Paul Mackerras
2009-02-04  2:32         ` Nathan Lynch
2009-02-04  8:45         ` Peter Zijlstra
2009-02-04 10:47           ` Paul Mackerras
2009-02-04 10:51             ` Peter Zijlstra
2008-12-11 22:05 William Cohen
