All of lore.kernel.org
 help / color / mirror / Atom feed
* [patch 0/3] [Announcement] Performance Counters for Linux
@ 2008-12-04 23:44 Thomas Gleixner
  2008-12-04 23:44 ` [patch 1/3] performance counters: core code Thomas Gleixner
                   ` (7 more replies)
  0 siblings, 8 replies; 73+ messages in thread
From: Thomas Gleixner @ 2008-12-04 23:44 UTC (permalink / raw)
  To: LKML
  Cc: linux-arch, Andrew Morton, Ingo Molnar, Stephane Eranian,
	Eric Dumazet, Robert Richter, Arjan van de Veen, Peter Anvin,
	Peter Zijlstra, Steven Rostedt, David Miller, Paul Mackerras

Performance counters are special hardware registers available on most modern
CPUs. These register count the number of certain types of hw events: such
as instructions executed, cachemisses suffered, or branches mis-predicted,
without slowing down the kernel or applications. These registers can also
trigger interrupts when a threshold number of events have passed - and can
thus be used to profile the code that runs on that CPU.

We'd like to announce a brand new implementation of performance counter
support for Linux. It is a very simple and extensible design that has the
potential to implement the full range of features we would expect from such
a subsystem.

The Linux Performance Counter subsystem (implemented via the patches
posted in this announcement) provides an abstraction of performance counter
hardware capabilities. It provides per task and per CPU counters, and it
provides event capabilities on top of those.

The code is far from complete - but the basic approach is already there
and stable.

The biggest missing detail is lowlevel support for non-Intel CPUs and
older Intel CPUs - right now the code is implemented for Intel Core2
(and later) Intel CPUs that have the PERFMON CPU feature. (see below
a wider list of missing/upcoming features)

We are aware of the perfmon3 patchset that has been submitted to lkml
recently. Our patchset tries to achieve a similar end result, with
a fundamentally different (and we believe, superior :-) design:

 - The API is based on a single counter abstraction

 - Only one single new system call is needed: sys_perf_counter_open().
   All performance-counter operations are implemented via standard
   VFS APIs such as read() / fcntl() and poll().

 - User-space is not exposed to lowlevel details like contexts or
   arrays of counters. Opening and reading a basic counter is as simple
   as 2 lines of C code:

   void main(void)
   {
      u64 count;

      fd = perf_counter_open(3 /* PERF_COUNT_CACHE_MISSES */, 0, 0, 0, -1);
      ret = read(fd, &count, sizeof(count));
      if (ret == sizeof(count))
              printf("Current count: %Ld cachemisses!", count);
   }

 - Events, blocking/sleep are natural built-in properties of counters.

 - No interaction with ptrace: any task (with sufficient permissions) can
   monitor other tasks, without having to stop that task.

 - Mapping of counters to hw counters is not static - counters are
   scheduled dynamically on each CPU where a task runs.

 - There's a /sys based reservation facility that allows the allocation
   of a certain number of hw counters for guaranteed sysadmin access.

 - Generalized enumeration for common hw event types. Raw event codes
   can be passed to the API too - but the most common (and most useful)
   event codes are generalized into a hardware-independent registry
   of events:

    enum hw_event_types {
           PERF_COUNT_CYCLES,
           PERF_COUNT_INSTRUCTIONS,
           PERF_COUNT_CACHE_REFERENCES,
           PERF_COUNT_CACHE_MISSES,
           PERF_COUNT_BRANCH_INSTRUCTIONS,
           PERF_COUNT_BRANCH_MISSES,
    };

 - Simplified lowlevel/arch support. The x86 code for Intel CPUs (with
   the PERFMON CPU feature) is 340 lines of code that implements
   7 straightforward lowlevel API calls:

    int hw_perf_counter_init(struct perf_counter *counter, u32 hw_event_type);
    void hw_perf_counter_enable(struct perf_counter *counter);
    void hw_perf_counter_disable(struct perf_counter *counter);
    void hw_perf_counter_read(struct perf_counter *counter);
    void hw_perf_counter_enable_config(struct perf_counter *counter);
    void hw_perf_counter_disable_config(struct perf_counter *counter);
    void hw_perf_counter_setup(void);

   There's one kernel/perf_counter.c core file, and a single
   arch/x86/kernel/cpu/perf_counter.c architecture support file.

   The impact on the kernel tree is relatively moderate:

       27 files changed, 1641 insertions(+), 7 deletions(-)

TODO:

 - Non-Intel CPU support. Help is welcome :-)

 - Round-robin scheduling of counters, when there's more task counters
   than hw counters available.

 - Support for extended record types such as PEBS.

 - Support for NMI events in the x86 code (the core design is ready)

 - Make sure it works well with OProfile and the x86 NMI watchdog

Short documentation is available in Documentation/perf-counters.txt

Find below the source of a simple monitoring demo.

We'd like to seek the feedback of perfmon developers and architecture
maintainers - what do you think about this approach?

Comments, reports, suggestions, flames and other types of feedback
is more than welcome,

	Thomas, Ingo
---

/*
 * Performance counters monitoring test case
 */
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <unistd.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <getopt.h>
#include <fcntl.h>
#include <stdio.h>
#include <errno.h>

#define __user

#include "sys.h"

static int count = 10000;
static int eventid;
static int tid;
static char *debuginfo;

static void display_help(void)
{
	printf("monitor\n");
	printf("Usage:\n"
	       "monitor options threadid\n\n"
	       "-e EID   --eventid=EID  eventid\n"
	       "-c CNT   --count=CNT    event count on which IP is sampled\n"
	       "-d FILE  --debug=FILE   path to binary file with debug info\n");
	exit(0);
}

static void process_options (int argc, char *argv[])
{
	int error = 0;

	for (;;) {
		int option_index = 0;
		/** Options for getopt */
		static struct option long_options[] = {
			{"count", required_argument, NULL, 'c'},
			{"debug", required_argument, NULL, 'd'},
			{"eventid", required_argument, NULL, 'e'},
			{"help", no_argument, NULL, 'h'},
			{NULL, 0, NULL, 0}
		};
		int c = getopt_long(argc, argv, "c:d:e:",
				    long_options, &option_index);
		if (c == -1)
			break;
		switch (c) {
		case 'c': count = atoi(optarg); break;
		case 'd': debuginfo = strdup(optarg); break;
		case 'e': eventid = atoi(optarg); break;
		default: error = 1; break;
		}
	}
	if (error || optind == argc)
		display_help ();

	tid = atoi(argv[optind]);
}

int main(int argc, char *argv[])
{
	char str[256];
	uint64_t ip;
	ssize_t res;
	int fd;

	process_options(argc, argv);

	fd = perf_counter_open(eventid, count, 1, tid, -1);
	if (fd < 0) {
		perror("Create counter");
		exit(-1);
	}

	while (1) {
		res = read(fd, (char *) &ip, sizeof(ip));
		if (res != sizeof(ip)) {
			perror("Read counter");
			break;
		}

		if (!debuginfo) {
			printf("IP: 0x%016llx\n", (unsigned long long)ip);
		} else {
			sprintf(str, "addr2line -e %s 0x%llx\n", debuginfo,
				(unsigned long long)ip);
			system(str);
		}
	}

	close(fd);
	exit(0);
}




^ permalink raw reply	[flat|nested] 73+ messages in thread

* [patch 1/3] performance counters: core code
  2008-12-04 23:44 [patch 0/3] [Announcement] Performance Counters for Linux Thomas Gleixner
@ 2008-12-04 23:44 ` Thomas Gleixner
  2008-12-05 10:55   ` Paul Mackerras
  2008-12-04 23:44 ` [patch 2/3] performance counters: documentation Thomas Gleixner
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 73+ messages in thread
From: Thomas Gleixner @ 2008-12-04 23:44 UTC (permalink / raw)
  To: LKML
  Cc: linux-arch, Andrew Morton, Ingo Molnar, Stephane Eranian,
	Eric Dumazet, Robert Richter, Arjan van de Veen, Peter Anvin,
	Peter Zijlstra, Steven Rostedt, David Miller, Paul Mackerras

[-- Attachment #1: perf-counters-core.patch --]
[-- Type: text/plain, Size: 33213 bytes --]

Implement the core kernel bits of Performance Counters subsystem.

The Linux Performance Counter subsystem provides an abstraction of
performance counter hardware capabilities. It provides per task and per
CPU counters, and it provides event capabilities on top of those.

Performance counters are accessed via special file descriptors.
There's one file descriptor per virtual counter used.

The special file descriptor is opened via the perf_counter_open()
system call:

 int
 perf_counter_open(u32 hw_event_type,
                   u32 hw_event_period,
                   u32 record_type,
                   pid_t pid,
                   int cpu);

The syscall returns the new fd. The fd can be used via the normal
VFS system calls: read() can be used to read the counter, fcntl()
can be used to set the blocking mode, etc.

Multiple counters can be kept open at a time, and the counters
can be poll()ed.

See more details in Documentation/perf-counters.txt.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 include/linux/perf_counter.h |  152 +++++++
 include/linux/sched.h        |    9 
 include/linux/syscalls.h     |    6 
 init/Kconfig                 |   29 +
 kernel/Makefile              |    1 
 kernel/fork.c                |    1 
 kernel/perf_counter.c        |  918 +++++++++++++++++++++++++++++++++++++++++++
 kernel/sched.c               |   23 +
 kernel/sys_ni.c              |    3 
 9 files changed, 1142 insertions(+)

Index: linux/include/linux/perf_counter.h
===================================================================
--- /dev/null
+++ linux/include/linux/perf_counter.h
@@ -0,0 +1,152 @@
+/*
+ *  Performance counters:
+ *
+ *   Copyright(C) 2008, Thomas Gleixner <tglx@linutronix.de>
+ *   Copyright(C) 2008, Red Hat, Inc., Ingo Molnar
+ *
+ *  Data type definitions, declarations, prototypes.
+ *
+ *  Started by: Thomas Gleixner and Ingo Molnar
+ *
+ *  For licencing details see kernel-base/COPYING
+ */
+#ifndef _LINUX_PERF_COUNTER_H
+#define _LINUX_PERF_COUNTER_H
+
+#include <asm/atomic.h>
+
+#include <linux/list.h>
+#include <linux/mutex.h>
+#include <linux/rculist.h>
+#include <linux/rcupdate.h>
+#include <linux/spinlock.h>
+
+struct task_struct;
+
+/*
+ * Generalized hardware event types, used by the hw_event_type parameter
+ * of the sys_perf_counter_open() syscall:
+ */
+enum hw_event_types {
+	PERF_COUNT_CYCLES,
+	PERF_COUNT_INSTRUCTIONS,
+	PERF_COUNT_CACHE_REFERENCES,
+	PERF_COUNT_CACHE_MISSES,
+	PERF_COUNT_BRANCH_INSTRUCTIONS,
+	PERF_COUNT_BRANCH_MISSES,
+};
+
+/*
+ * IRQ-notification data record type:
+ */
+enum perf_record_type {
+	PERF_RECORD_SIMPLE,
+	PERF_RECORD_IRQ,
+};
+
+/**
+ * struct hw_perf_counter - performance counter hardware details
+ */
+struct hw_perf_counter {
+	u64			config;
+	unsigned long		config_base;
+	unsigned long		counter_base;
+	unsigned int		idx;
+	u64			prev_count;
+	s32			next_count;
+	u64			irq_period;
+};
+
+/*
+ * Hardcoded buffer length limit for now, for IRQ-fed events:
+ */
+#define PERF_DATA_BUFLEN	512
+
+/**
+ * struct perf_data - performance counter IRQ data sampling ...
+ */
+struct perf_data {
+	int			len;
+	int			rd_idx;
+	int			overrun;
+	u8			data[PERF_DATA_BUFLEN];
+};
+
+/**
+ * struct perf_counter - performance counter kernel representation:
+ */
+struct perf_counter {
+	struct list_head		list;
+	int				active;
+#if BITS_PER_LONG == 64
+	atomic64_t			count;
+#else
+	atomic_t			count32[2];
+#endif
+	u64				__irq_period;
+
+	struct hw_perf_counter		hw;
+
+	struct perf_counter_context	*ctx;
+	struct task_struct		*task;
+
+	/*
+	 * Protect attach/detach:
+	 */
+	struct mutex			mutex;
+	struct rcu_head			rcu;
+
+	int				oncpu;
+	int				cpu;
+
+	s32				hw_event_type;
+	enum perf_record_type		record_type;
+
+	/* read() / irq related data */
+	wait_queue_head_t		waitq;
+	struct perf_data		*irqdata;
+	struct perf_data		*usrdata;
+	struct perf_data		data[2];
+};
+
+/**
+ * struct perf_counter_context - counter context structure
+ *
+ * Used as a container for task counters and CPU counters as well:
+ */
+struct perf_counter_context {
+#ifdef CONFIG_PERF_COUNTERS
+	/*
+	 * Protect the list of counters:
+	 */
+	spinlock_t		lock;
+	struct list_head	counters;
+	int			nr_counters;
+	int			nr_active;
+	struct task_struct	*task;
+#endif
+};
+
+/**
+ * struct perf_counter_cpu_context - per cpu counter context structure
+ */
+struct perf_cpu_context {
+	struct perf_counter_context	ctx;
+	struct perf_counter_context	*task_ctx;
+	int				active_oncpu;
+	int				max_pertask;
+};
+
+#ifdef CONFIG_PERF_COUNTERS
+extern void perf_counter_task_sched_in(struct task_struct *task, int cpu);
+extern void perf_counter_task_sched_out(struct task_struct *task, int cpu);
+extern void perf_counter_init_task(struct task_struct *task);
+#else
+static inline void
+perf_counter_task_sched_in(struct task_struct *task, int cpu)		{ }
+static inline void
+perf_counter_task_sched_out(struct task_struct *task, int cpu)		{ }
+static inline void perf_counter_init_task(struct task_struct *task)	{ }
+#endif
+
+#endif /* _LINUX_PERF_COUNTER_H */
Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -71,6 +71,7 @@ struct sched_param {
 #include <linux/fs_struct.h>
 #include <linux/compiler.h>
 #include <linux/completion.h>
+#include <linux/perf_counter.h>
 #include <linux/pid.h>
 #include <linux/percpu.h>
 #include <linux/topology.h>
@@ -1326,6 +1327,7 @@ struct task_struct {
 	struct list_head pi_state_list;
 	struct futex_pi_state *pi_state_cache;
 #endif
+	struct perf_counter_context perf_counter_ctx;
 #ifdef CONFIG_NUMA
 	struct mempolicy *mempolicy;
 	short il_next;
@@ -2285,6 +2287,13 @@ static inline void inc_syscw(struct task
 #define TASK_SIZE_OF(tsk)	TASK_SIZE
 #endif
 
+/*
+ * Call the function if the target task is executing on a CPU right now:
+ */
+extern void task_oncpu_function_call(struct task_struct *p,
+				     void (*func) (void *info), void *info);
+
+
 #ifdef CONFIG_MM_OWNER
 extern void mm_update_next_owner(struct mm_struct *mm);
 extern void mm_init_owner(struct mm_struct *mm, struct task_struct *p);
Index: linux/include/linux/syscalls.h
===================================================================
--- linux.orig/include/linux/syscalls.h
+++ linux/include/linux/syscalls.h
@@ -624,4 +624,10 @@ asmlinkage long sys_fallocate(int fd, in
 
 int kernel_execve(const char *filename, char *const argv[], char *const envp[]);
 
+asmlinkage int
+sys_perf_counter_open(u32 hw_event_type,
+		      u32 hw_event_period,
+		      u32 record_type,
+		      pid_t pid,
+		      int cpu);
 #endif
Index: linux/init/Kconfig
===================================================================
--- linux.orig/init/Kconfig
+++ linux/init/Kconfig
@@ -732,6 +732,35 @@ config AIO
           by some high performance threaded applications. Disabling
           this option saves about 7k.
 
+config HAVE_PERF_COUNTERS
+	bool
+
+menu "Performance Counters"
+
+config PERF_COUNTERS
+	bool "Kernel Performance Counters"
+	depends on HAVE_PERF_COUNTERS
+	default y
+	help
+	  Enable kernel support for performance counter hardware.
+
+	  Performance counters are special hardware registers available
+	  on most modern CPUs. These registers count the number of certain
+	  types of hw events: such as instructions executed, cachemisses
+	  suffered, or branches mis-predicted - without slowing down the
+	  kernel or applications. These registers can also trigger interrupts
+	  when a threshold number of events have passed - and can thus be
+	  used to profile the code that runs on that CPU.
+
+	  The Linux Performance Counter subsystem provides an abstraction of
+	  these hardware capabilities, available via a system call. It
+	  provides per task and per CPU counters, and it provides event
+	  capabilities on top of those.
+
+	  Say Y if unsure.
+
+endmenu
+
 config VM_EVENT_COUNTERS
 	default y
 	bool "Enable VM event counters for /proc/vmstat" if EMBEDDED
Index: linux/kernel/Makefile
===================================================================
--- linux.orig/kernel/Makefile
+++ linux/kernel/Makefile
@@ -89,6 +89,7 @@ obj-$(CONFIG_HAVE_GENERIC_DMA_COHERENT) 
 obj-$(CONFIG_FUNCTION_TRACER) += trace/
 obj-$(CONFIG_TRACING) += trace/
 obj-$(CONFIG_SMP) += sched_cpupri.o
+obj-$(CONFIG_PERF_COUNTERS) += perf_counter.o
 
 ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y)
 # According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is
Index: linux/kernel/fork.c
===================================================================
--- linux.orig/kernel/fork.c
+++ linux/kernel/fork.c
@@ -975,6 +975,7 @@ static struct task_struct *copy_process(
 		goto fork_out;
 
 	rt_mutex_init_task(p);
+	perf_counter_init_task(p);
 
 #ifdef CONFIG_PROVE_LOCKING
 	DEBUG_LOCKS_WARN_ON(!p->hardirqs_enabled);
Index: linux/kernel/perf_counter.c
===================================================================
--- /dev/null
+++ linux/kernel/perf_counter.c
@@ -0,0 +1,918 @@
+/*
+ * Performance counter core code
+ *
+ *  Copyright(C) 2008 Thomas Gleixner <tglx@linutronix.de>
+ *  Copyright(C) 2008 Red Hat, Inc., Ingo Molnar
+ *
+ *  For licencing details see kernel-base/COPYING
+ */
+
+#include <linux/fs.h>
+#include <linux/cpu.h>
+#include <linux/smp.h>
+#include <linux/poll.h>
+#include <linux/sysfs.h>
+#include <linux/ptrace.h>
+#include <linux/percpu.h>
+#include <linux/uaccess.h>
+#include <linux/syscalls.h>
+#include <linux/anon_inodes.h>
+#include <linux/perf_counter.h>
+
+/*
+ * Each CPU has a list of per CPU counters:
+ */
+DEFINE_PER_CPU(struct perf_cpu_context, perf_cpu_context);
+
+static int perf_max_counters __read_mostly = 2;
+static int perf_reserved_percpu __read_mostly;
+static int perf_overcommit __read_mostly = 1;
+
+/*
+ * Mutex for (sysadmin-configurable) counter reservations:
+ */
+static DEFINE_MUTEX(perf_resource_mutex);
+
+/*
+ * Architecture provided APIs - weak aliases:
+ */
+
+int __weak hw_perf_counter_init(struct perf_counter *counter, u32 hw_event_type)
+{
+	return -EINVAL;
+}
+
+void __weak hw_perf_counter_enable(struct perf_counter *counter)	 { }
+void __weak hw_perf_counter_disable(struct perf_counter *counter)	 { }
+void __weak hw_perf_counter_read(struct perf_counter *counter)		 { }
+void __weak hw_perf_counter_enable_config(struct perf_counter *counter)	 { }
+void __weak hw_perf_counter_disable_config(struct perf_counter *counter) { }
+void __weak hw_perf_counter_setup(void) { }
+
+/*
+ * RCU callback to free a performance counter:
+ */
+static void perf_free_ctr_rcu(struct rcu_head *rhp)
+
+{
+	kfree(container_of(rhp, struct perf_counter, rcu));
+}
+
+#if BITS_PER_LONG == 64
+
+/*
+ * Read the cached counter in counter safe against cross CPU / NMI
+ * modifications. 64 bit version - no complications.
+ */
+static inline u64 perf_read_counter_safe(struct perf_counter *counter)
+{
+	return (u64) atomic64_read(&counter->count);
+}
+
+#else
+
+/*
+ * Read the cached counter in counter safe against cross CPU / NMI
+ * modifications. 32 bit version.
+ */
+static u64 perf_read_counter_safe(struct perf_counter *counter)
+{
+	u32 cntl, cnth;
+
+	local_irq_disable();
+	do {
+		cnth = atomic_read(&counter->count32[1]);
+		cntl = atomic_read(&counter->count32[0]);
+	} while (cnth != atomic_read(&counter->count32[1]));
+
+	local_irq_enable();
+
+	return cntl | ((u64) cnth) << 32;
+}
+
+#endif
+
+/*
+ * Cross CPU call to remove a performance counter
+ *
+ * We disable the counter on the hardware level first. After that we
+ * remove it from the context list.
+ */
+static void __perf_remove_from_context(void *info)
+{
+	struct perf_cpu_context *cpuctx = &__get_cpu_var(perf_cpu_context);
+	struct perf_counter *counter = info;
+	struct perf_counter_context *ctx = counter->ctx;
+
+	spin_lock(&cpuctx->ctx.lock);
+	/*
+	 * If this is a task context, we need to check whether it is
+	 * the current task context of this cpu. If not it has been
+	 * scheduled out before the smp call arrived.
+	 */
+	if (ctx->task) {
+		if (cpuctx->task_ctx != ctx)
+			goto out;
+		spin_lock(&ctx->lock);
+	}
+
+	if (counter->active) {
+		hw_perf_counter_disable_config(counter);
+		hw_perf_counter_disable(counter);
+		counter->active = 0;
+		ctx->nr_active--;
+		cpuctx->active_oncpu--;
+	}
+
+	ctx->nr_counters--;
+
+	list_del_rcu(&counter->list);
+
+	if (ctx->task) {
+		spin_unlock(&ctx->lock);
+	} else {
+		/*
+		 * Allow more per task counters with respect to the
+		 * reservation:
+		 */
+		cpuctx->max_pertask =
+			min(perf_max_counters - ctx->nr_counters,
+			    perf_max_counters - perf_reserved_percpu);
+	}
+out:
+	spin_unlock(&cpuctx->ctx.lock);
+}
+
+
+/*
+ * Remove the counter from a task's (or a CPU's) list of counters.
+ *
+ * Must be called with counter->mutex held.
+ *
+ * First we disable the counter enable bit in counter->hw_config. This
+ * ensures that a context switch on another CPU or a NMI on the local
+ * CPU does not enable the counter.
+ *
+ * CPU counters are removed with a smp call. For task counters we only
+ * call when the task is on a CPU.
+ */
+static void perf_remove_from_context(struct perf_counter *counter)
+{
+	struct perf_counter_context *ctx = counter->ctx;
+	struct task_struct *task = ctx->task;
+
+	hw_perf_counter_disable_config(counter);
+
+	if (!task) {
+		/*
+		 * Per cpu counters are removed via an smp call and
+		 * the removal is always sucessful.
+		 */
+		smp_call_function_single(counter->cpu,
+					 __perf_remove_from_context,
+					 counter, 1);
+		return;
+	}
+
+retry:
+	task_oncpu_function_call(task, __perf_remove_from_context,
+				 counter);
+	/*
+	 * We might have failed to deactivate it due to
+	 * task migration:
+	 */
+	spin_lock_irq(&ctx->lock);
+
+	/* FIXME: No generic function for this */
+	if (counter->list.prev != LIST_POISON2) {
+		if (counter->active) {
+			spin_unlock_irq(&ctx->lock);
+			goto retry;
+		}
+		ctx->nr_counters--;
+		list_del_rcu(&counter->list);
+	}
+	spin_unlock_irq(&ctx->lock);
+
+	counter->task = NULL;
+}
+
+/*
+ * Cross CPU call to install and enable a preformance counter
+ */
+static void __perf_install_in_context(void *info)
+{
+	struct perf_cpu_context *cpuctx = &__get_cpu_var(perf_cpu_context);
+	struct perf_counter *counter = info;
+	struct perf_counter_context *ctx = counter->ctx;
+	int cpu = smp_processor_id();
+
+	spin_lock(&cpuctx->ctx.lock);
+
+	/*
+	 * If this is a task context, we need to check whether it is
+	 * the current task context of this cpu. If not it has been
+	 * scheduled out before the smp call arrived.
+	 */
+	if (ctx->task) {
+		if (cpuctx->task_ctx != ctx)
+			goto out;
+		spin_lock(&ctx->lock);
+	}
+
+	if (cpuctx->active_oncpu < perf_max_counters) {
+		hw_perf_counter_enable_config(counter);
+		hw_perf_counter_enable(counter);
+		counter->active = 1;
+		counter->oncpu = cpu;
+		ctx->nr_active++;
+		cpuctx->active_oncpu++;
+	}
+
+	if (ctx->task) {
+		spin_unlock(&ctx->lock);
+	} else {
+		if (cpuctx->max_pertask)
+			cpuctx->max_pertask--;
+	}
+out:
+	spin_unlock(&cpuctx->ctx.lock);
+}
+
+/*
+ * Attach a performance counter to a context
+ *
+ * First we add the counter to the list with the hardware enable bit
+ * in counter->hw_config cleared.
+ *
+ * Now we enable the hardware enable bit in counter->hw_config and if
+ * the counter is attached to a task which is on a CPU we use a smp call
+ * to enable it in the task context. The task might have been scheduled
+ * away, but we check this in the smp call again.
+ */
+static void
+perf_install_in_context(struct perf_counter_context *ctx,
+			struct perf_counter *counter,
+			int cpu)
+{
+	struct task_struct *task = ctx->task;
+
+	if (task)
+		counter->task = task;
+
+	spin_lock_irq(&ctx->lock);
+	list_add_tail_rcu(&counter->list, &ctx->counters);
+	ctx->nr_counters++;
+	counter->ctx = ctx;
+	spin_unlock_irq(&ctx->lock);
+
+	hw_perf_counter_enable_config(counter);
+
+	if (task) {
+		task_oncpu_function_call(task, __perf_install_in_context,
+					 counter);
+	} else {
+		smp_call_function_single(cpu, __perf_install_in_context,
+					 counter, 1);
+	}
+}
+
+/*
+ * Called from scheduler to remove the counters of the current task,
+ * with interrupts disabled.
+ *
+ * We stop each counter and update the counter value in counter->count.
+ *
+ * We dont use the rcu list walk here as we are protected by the
+ * spinlock.
+ *
+ * This does not protect us against NMI, but hw_perf_counter_disable()
+ * sets the disabled bit in the control field of counter _before_
+ * accessing the counter control register. If a NMI hits, then it will
+ * not restart the counter.
+ */
+void perf_counter_task_sched_out(struct task_struct *task, int cpu)
+{
+	struct perf_cpu_context *cpuctx = &per_cpu(perf_cpu_context, cpu);
+	struct perf_counter_context *ctx = &task->perf_counter_ctx;
+	struct perf_counter *counter;
+
+	if (likely(!cpuctx->task_ctx))
+		return;
+
+	spin_lock(&cpuctx->ctx.lock);
+	spin_lock(&ctx->lock);
+	list_for_each_entry(counter, &ctx->counters, list) {
+		if (!ctx->nr_active)
+			break;
+		if (counter->active) {
+			hw_perf_counter_disable(counter);
+			counter->active = 0;
+			counter->oncpu = -1;
+			ctx->nr_active--;
+			cpuctx->active_oncpu--;
+		}
+	}
+	cpuctx->task_ctx = NULL;
+	spin_unlock(&ctx->lock);
+	spin_unlock(&cpuctx->ctx.lock);
+}
+
+/*
+ * Called from scheduler to add the counters of the current task
+ * with interrupts disabled.
+ *
+ * We restore the counter value and then enable it.
+ *
+ * We dont use the rcu list walk here as we are protected by the
+ * spinlock.
+ *
+ * This does not protect us against NMI, but hw_perf_counter_enable()
+ * sets the enabled bit in the control field of counter _before_
+ * accessing the counter control register. If a NMI hits, then it will
+ * keep the counter running.
+ */
+void perf_counter_task_sched_in(struct task_struct *task, int cpu)
+{
+	struct perf_cpu_context *cpuctx = &per_cpu(perf_cpu_context, cpu);
+	struct perf_counter_context *ctx = &task->perf_counter_ctx;
+	struct perf_counter *counter;
+
+	if (likely(!ctx->nr_counters))
+		return;
+
+	spin_lock(&cpuctx->ctx.lock);
+	spin_lock(&ctx->lock);
+	list_for_each_entry(counter, &ctx->counters, list) {
+		if (ctx->nr_active == cpuctx->max_pertask)
+			break;
+		if (counter->cpu != -1 && counter->cpu != cpu)
+			continue;
+		hw_perf_counter_enable(counter);
+		counter->active = 1;
+		counter->oncpu = cpu;
+		ctx->nr_active++;
+		cpuctx->active_oncpu++;
+	}
+	cpuctx->task_ctx = ctx;
+	spin_unlock(&ctx->lock);
+	spin_unlock(&cpuctx->ctx.lock);
+}
+
+/*
+ * Initialize the perf_counter context in task_struct
+ */
+void perf_counter_init_task(struct task_struct *task)
+{
+	struct perf_counter_context *ctx = &task->perf_counter_ctx;
+
+	spin_lock_init(&ctx->lock);
+	INIT_LIST_HEAD(&ctx->counters);
+	ctx->nr_counters = 0;
+	ctx->task = NULL;
+}
+
+/*
+ * Cross CPU call to read the hardware counter
+ */
+static void __hw_perf_counter_read(void *info)
+{
+	hw_perf_counter_read(info);
+}
+
+static u64 perf_read_counter(struct perf_counter *counter)
+{
+	/*
+	 * If counter is enabled and currently active on a CPU, update the
+	 * value in the counter structure:
+	 */
+	if (counter->active) {
+		smp_call_function_single(counter->oncpu,
+					 __hw_perf_counter_read, counter, 1);
+	}
+
+	return perf_read_counter_safe(counter);
+}
+
+/*
+ * Cross CPU call to switch performance data pointers
+ */
+static void __perf_switch_irq_data(void *info)
+{
+	struct perf_cpu_context *cpuctx = &__get_cpu_var(perf_cpu_context);
+	struct perf_counter *counter = info;
+	struct perf_counter_context *ctx = counter->ctx;
+	struct perf_data *oldirqdata = counter->irqdata;
+
+	spin_lock(&cpuctx->ctx.lock);
+	/*
+	 * If this is a task context, we need to check whether it is
+	 * the current task context of this cpu. If not it has been
+	 * scheduled out before the smp call arrived.
+	 */
+	if (ctx->task) {
+		if (cpuctx->task_ctx != ctx)
+			goto out;
+		spin_lock(&ctx->lock);
+	}
+
+	/* Change the pointer NMI safe */
+	atomic_long_set((atomic_long_t *)&counter->irqdata,
+			(unsigned long) counter->usrdata);
+	counter->usrdata = oldirqdata;
+
+	if (ctx->task)
+		spin_unlock(&ctx->lock);
+out:
+	spin_unlock(&cpuctx->ctx.lock);
+}
+
+static struct perf_data *perf_switch_irq_data(struct perf_counter *counter)
+{
+	struct perf_counter_context *ctx = counter->ctx;
+	struct perf_data *oldirqdata = counter->irqdata;
+	struct task_struct *task = ctx->task;
+
+	if (!task) {
+		smp_call_function_single(counter->cpu,
+					 __perf_switch_irq_data,
+					 counter, 1);
+		return counter->usrdata;
+	}
+
+retry:
+	spin_lock_irq(&ctx->lock);
+	if (!counter->active) {
+		counter->irqdata = counter->usrdata;
+		counter->usrdata = oldirqdata;
+		spin_unlock_irq(&ctx->lock);
+		return oldirqdata;
+	}
+	spin_unlock_irq(&ctx->lock);
+	task_oncpu_function_call(task, __perf_switch_irq_data, counter);
+	/* Might have failed, because task was scheduled out */
+	if (counter->irqdata == oldirqdata)
+		goto retry;
+
+	return counter->usrdata;
+}
+
+static void put_context(struct perf_counter_context *ctx)
+{
+	if (ctx->task) {
+		put_task_struct(ctx->task);
+		ctx->task = NULL;
+	}
+}
+
+static struct perf_counter_context *find_get_context(pid_t pid, int cpu)
+{
+	struct perf_cpu_context *cpuctx;
+	struct perf_counter_context *ctx;
+	struct task_struct *task;
+
+	/*
+	 * If cpu is not a wildcard then this is a percpu counter:
+	 */
+	if (cpu != -1) {
+		/* Must be root to operate on a CPU counter: */
+		if (!capable(CAP_SYS_ADMIN))
+			return ERR_PTR(-EACCES);
+
+		if (cpu < 0 || cpu > num_possible_cpus())
+			return ERR_PTR(-EINVAL);
+
+		/*
+		 * We could be clever and allow to attach a counter to an
+		 * offline CPU and activate it when the CPU comes up, but
+		 * that's for later.
+		 */
+		if (!cpu_isset(cpu, cpu_online_map))
+			return ERR_PTR(-ENODEV);
+
+		cpuctx = &per_cpu(perf_cpu_context, cpu);
+		ctx = &cpuctx->ctx;
+
+		WARN_ON_ONCE(ctx->task);
+		return ctx;
+	}
+
+	rcu_read_lock();
+	if (!pid)
+		task = current;
+	else
+		task = find_task_by_vpid(pid);
+	if (task)
+		get_task_struct(task);
+	rcu_read_unlock();
+
+	if (!task)
+		return ERR_PTR(-ESRCH);
+
+	ctx = &task->perf_counter_ctx;
+	ctx->task = task;
+
+	/* Reuse ptrace permission checks for now. */
+	if (!ptrace_may_access(task, PTRACE_MODE_READ)) {
+		put_context(ctx);
+		return ERR_PTR(-EACCES);
+	}
+
+	return ctx;
+}
+
+/*
+ * Called when the last reference to the file is gone.
+ */
+static int perf_release(struct inode *inode, struct file *file)
+{
+	struct perf_counter *counter = file->private_data;
+	struct perf_counter_context *ctx = counter->ctx;
+
+	file->private_data = NULL;
+
+	mutex_lock(&counter->mutex);
+
+	perf_remove_from_context(counter);
+	put_context(ctx);
+
+	mutex_unlock(&counter->mutex);
+
+	call_rcu(&counter->rcu, perf_free_ctr_rcu);
+
+	return 0;
+}
+
+/*
+ * Read the performance counter - simple non blocking version for now
+ */
+static ssize_t
+perf_read_hw(struct perf_counter *counter, char __user *buf, size_t count)
+{
+	u64 cntval;
+
+	if (count != sizeof(cntval))
+		return -EINVAL;
+
+	mutex_lock(&counter->mutex);
+	cntval = perf_read_counter(counter);
+	mutex_unlock(&counter->mutex);
+
+	return put_user(cntval, (u64 __user *) buf) ? -EFAULT : sizeof(cntval);
+}
+
+static ssize_t
+perf_copy_usrdata(struct perf_data *usrdata, char __user *buf, size_t count)
+{
+	if (!usrdata->len)
+		return 0;
+
+	count = min(count, (size_t)usrdata->len);
+	if (copy_to_user(buf, usrdata->data + usrdata->rd_idx, count))
+		return -EFAULT;
+
+	/* Adjust the counters */
+	usrdata->len -= count;
+	if (!usrdata->len)
+		usrdata->rd_idx = 0;
+	else
+		usrdata->rd_idx += count;
+
+	return count;
+}
+
+static ssize_t
+perf_read_irq_data(struct perf_counter	*counter,
+		   char __user		*buf,
+		   size_t		count,
+		   int			nonblocking)
+{
+	struct perf_data *irqdata, *usrdata;
+	DECLARE_WAITQUEUE(wait, current);
+	ssize_t res;
+
+	irqdata = counter->irqdata;
+	usrdata = counter->usrdata;
+
+	if (usrdata->len + irqdata->len >= count)
+		goto read_pending;
+
+	if (nonblocking)
+		return -EAGAIN;
+
+	spin_lock_irq(&counter->waitq.lock);
+	__add_wait_queue(&counter->waitq, &wait);
+	for (;;) {
+		set_current_state(TASK_INTERRUPTIBLE);
+		if (usrdata->len + irqdata->len >= count)
+			break;
+
+		if (signal_pending(current))
+			break;
+
+		spin_unlock_irq(&counter->waitq.lock);
+		schedule();
+		spin_lock_irq(&counter->waitq.lock);
+	}
+	__remove_wait_queue(&counter->waitq, &wait);
+	__set_current_state(TASK_RUNNING);
+	spin_unlock_irq(&counter->waitq.lock);
+
+	if (usrdata->len + irqdata->len < count)
+		return -ERESTARTSYS;
+read_pending:
+	mutex_lock(&counter->mutex);
+
+	/* Drain pending data first: */
+	res = perf_copy_usrdata(usrdata, buf, count);
+	if (res < 0 || res == count)
+		goto out;
+
+	/* Switch irq buffer: */
+	usrdata = perf_switch_irq_data(counter);
+	if (perf_copy_usrdata(usrdata, buf + res, count - res) < 0) {
+		if (!res)
+			res = -EFAULT;
+	} else {
+		res = count;
+	}
+out:
+	mutex_unlock(&counter->mutex);
+
+	return res;
+}
+
+static ssize_t
+perf_read(struct file *file, char __user *buf, size_t count, loff_t *ppos)
+{
+	struct perf_counter *counter = file->private_data;
+
+	switch (counter->record_type) {
+	case PERF_RECORD_SIMPLE:
+		return perf_read_hw(counter, buf, count);
+
+	case PERF_RECORD_IRQ:
+		return perf_read_irq_data(counter, buf, count,
+					  file->f_flags & O_NONBLOCK);
+	}
+	return -EINVAL;
+}
+
+static unsigned int perf_poll(struct file *file, poll_table *wait)
+{
+	struct perf_counter *counter = file->private_data;
+	unsigned int events = 0;
+	unsigned long flags;
+
+	poll_wait(file, &counter->waitq, wait);
+
+	spin_lock_irqsave(&counter->waitq.lock, flags);
+	if (counter->usrdata->len || counter->irqdata->len)
+		events |= POLLIN;
+	spin_unlock_irqrestore(&counter->waitq.lock, flags);
+
+	return events;
+}
+
+static const struct file_operations perf_fops = {
+	.release		= perf_release,
+	.read			= perf_read,
+	.poll			= perf_poll,
+};
+
+/*
+ * Allocate and initialize a counter structure
+ */
+static struct perf_counter *
+perf_counter_alloc(u32 hw_event_period, int cpu, u32 record_type)
+{
+	struct perf_counter *counter = kzalloc(sizeof(*counter), GFP_KERNEL);
+
+	if (!counter)
+		return NULL;
+
+	mutex_init(&counter->mutex);
+	INIT_LIST_HEAD(&counter->list);
+	init_waitqueue_head(&counter->waitq);
+
+	counter->irqdata	= &counter->data[0];
+	counter->usrdata	= &counter->data[1];
+	counter->cpu		= cpu;
+	counter->record_type	= record_type;
+	counter->__irq_period	= hw_event_period;
+
+	return counter;
+}
+
+/**
+ * sys_perf_task_open - open a performance counter associate it to a task
+ * @hw_event_type:	event type for monitoring/sampling...
+ * @pid:		target pid
+ */
+asmlinkage int
+sys_perf_counter_open(u32 hw_event_type,
+		      u32 hw_event_period,
+		      u32 record_type,
+		      pid_t pid,
+		      int cpu)
+{
+	struct perf_counter_context *ctx;
+	struct perf_counter *counter;
+	int ret;
+
+	ctx = find_get_context(pid, cpu);
+	if (IS_ERR(ctx))
+		return PTR_ERR(ctx);
+
+	ret = -ENOMEM;
+	counter = perf_counter_alloc(hw_event_period, cpu, record_type);
+	if (!counter)
+		goto err_put_context;
+
+	ret = hw_perf_counter_init(counter, hw_event_type);
+	if (ret)
+		goto err_free_put_context;
+
+	perf_install_in_context(ctx, counter, cpu);
+
+	ret = anon_inode_getfd("[perf_counter]", &perf_fops, counter, 0);
+	if (ret < 0)
+		goto err_remove_free_put_context;
+
+	return ret;
+
+err_remove_free_put_context:
+	mutex_lock(&counter->mutex);
+	perf_remove_from_context(counter);
+	mutex_unlock(&counter->mutex);
+
+err_free_put_context:
+	call_rcu(&counter->rcu, perf_free_ctr_rcu);
+
+err_put_context:
+	put_context(ctx);
+
+	return ret;
+}
+
+static void __cpuinit perf_init_cpu(int cpu)
+{
+	struct perf_cpu_context *ctx;
+
+	ctx = &per_cpu(perf_cpu_context, cpu);
+	spin_lock_init(&ctx->ctx.lock);
+	INIT_LIST_HEAD(&ctx->ctx.counters);
+
+	mutex_lock(&perf_resource_mutex);
+	ctx->max_pertask = perf_max_counters - perf_reserved_percpu;
+	mutex_unlock(&perf_resource_mutex);
+	hw_perf_counter_setup();
+}
+
+static void perf_exit_cpu(int cpu)
+{
+#ifdef CONFIG_HOTPLUG_CPU
+	struct perf_counter_context *ctx;
+	struct perf_counter *counter;
+
+	ctx = &per_cpu(perf_cpu_context, cpu).ctx;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(counter, &ctx->counters, list) {
+		mutex_lock(&counter->mutex);
+		perf_remove_from_context(counter);
+		mutex_unlock(&counter->mutex);
+	}
+	rcu_read_unlock();
+#endif
+}
+
+static int __cpuinit
+perf_cpu_notify(struct notifier_block *self, unsigned long action, void *hcpu)
+{
+	unsigned int cpu = (long)hcpu;
+
+	switch (action) {
+
+	case CPU_UP_PREPARE:
+	case CPU_UP_PREPARE_FROZEN:
+		perf_init_cpu(cpu);
+		break;
+
+	case CPU_DOWN_PREPARE:
+	case CPU_DOWN_PREPARE_FROZEN:
+		perf_exit_cpu(cpu);
+		break;
+
+	default:
+		break;
+	}
+
+	return NOTIFY_OK;
+}
+
+static struct notifier_block __cpuinitdata perf_cpu_nb = {
+	.notifier_call		= perf_cpu_notify,
+};
+
+static int __init perf_counter_init(void)
+{
+	perf_cpu_notify(&perf_cpu_nb, (unsigned long)CPU_UP_PREPARE,
+			(void *)(long)smp_processor_id());
+	register_cpu_notifier(&perf_cpu_nb);
+
+	return 0;
+}
+early_initcall(perf_counter_init);
+
+static ssize_t perf_show_reserve_percpu(struct sysdev_class *class, char *buf)
+{
+	return sprintf(buf, "%d\n", perf_reserved_percpu);
+}
+
+static ssize_t
+perf_set_reserve_percpu(struct sysdev_class *class,
+			const char *buf,
+			size_t count)
+{
+	struct perf_cpu_context *cpuctx;
+	unsigned long val;
+	int err, cpu, mpt;
+
+	err = strict_strtoul(buf, 10, &val);
+	if (err)
+		return err;
+	if (val > perf_max_counters)
+		return -EINVAL;
+
+	mutex_lock(&perf_resource_mutex);
+	perf_reserved_percpu = val;
+	for_each_online_cpu(cpu) {
+		cpuctx = &per_cpu(perf_cpu_context, cpu);
+		spin_lock_irq(&cpuctx->ctx.lock);
+		mpt = min(perf_max_counters - cpuctx->ctx.nr_counters,
+			  perf_max_counters - perf_reserved_percpu);
+		cpuctx->max_pertask = mpt;
+		spin_unlock_irq(&cpuctx->ctx.lock);
+	}
+	mutex_unlock(&perf_resource_mutex);
+
+	return count;
+}
+
+static ssize_t perf_show_overcommit(struct sysdev_class *class, char *buf)
+{
+	return sprintf(buf, "%d\n", perf_overcommit);
+}
+
+static ssize_t
+perf_set_overcommit(struct sysdev_class *class, const char *buf, size_t count)
+{
+	unsigned long val;
+	int err;
+
+	err = strict_strtoul(buf, 10, &val);
+	if (err)
+		return err;
+	if (val > 1)
+		return -EINVAL;
+
+	mutex_lock(&perf_resource_mutex);
+	perf_overcommit = val;
+	mutex_unlock(&perf_resource_mutex);
+
+	return count;
+}
+
+static SYSDEV_CLASS_ATTR(
+				reserve_percpu,
+				0644,
+				perf_show_reserve_percpu,
+				perf_set_reserve_percpu
+			);
+
+static SYSDEV_CLASS_ATTR(
+				overcommit,
+				0644,
+				perf_show_overcommit,
+				perf_set_overcommit
+			);
+
+static struct attribute *perfclass_attrs[] = {
+	&attr_reserve_percpu.attr,
+	&attr_overcommit.attr,
+	NULL
+};
+
+static struct attribute_group perfclass_attr_group = {
+	.attrs			= perfclass_attrs,
+	.name			= "perf_counters",
+};
+
+static int __init perf_counter_sysfs_init(void)
+{
+	return sysfs_create_group(&cpu_sysdev_class.kset.kobj,
+				  &perfclass_attr_group);
+}
+device_initcall(perf_counter_sysfs_init);
+
Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -2212,6 +2212,27 @@ static int sched_balance_self(int cpu, i
 
 #endif /* CONFIG_SMP */
 
+/**
+ * task_oncpu_function_call - call a function on the cpu on which a task runs
+ * @p:		the task to evaluate
+ * @func:	the function to be called
+ * @info:	the function call argument
+ *
+ * Calls the function @func when the task is currently running. This might
+ * be on the current CPU, which just calls the function directly
+ */
+void task_oncpu_function_call(struct task_struct *p,
+			      void (*func) (void *info), void *info)
+{
+	int cpu;
+
+	preempt_disable();
+	cpu = task_cpu(p);
+	if (task_curr(p))
+		smp_call_function_single(cpu, func, info, 1);
+	preempt_enable();
+}
+
 /***
  * try_to_wake_up - wake up a thread
  * @p: the to-be-woken-up thread
@@ -2534,6 +2555,7 @@ prepare_task_switch(struct rq *rq, struc
 		    struct task_struct *next)
 {
 	fire_sched_out_preempt_notifiers(prev, next);
+	perf_counter_task_sched_out(prev, cpu_of(rq));
 	prepare_lock_switch(rq, next);
 	prepare_arch_switch(next);
 }
@@ -2574,6 +2596,7 @@ static void finish_task_switch(struct rq
 	 */
 	prev_state = prev->state;
 	finish_arch_switch(prev);
+	perf_counter_task_sched_in(current, cpu_of(rq));
 	finish_lock_switch(rq, prev);
 #ifdef CONFIG_SMP
 	if (current->sched_class->post_schedule)
Index: linux/kernel/sys_ni.c
===================================================================
--- linux.orig/kernel/sys_ni.c
+++ linux/kernel/sys_ni.c
@@ -174,3 +174,6 @@ cond_syscall(compat_sys_timerfd_settime)
 cond_syscall(compat_sys_timerfd_gettime);
 cond_syscall(sys_eventfd);
 cond_syscall(sys_eventfd2);
+
+/* performance counters: */
+cond_syscall(sys_perf_counter_open);



^ permalink raw reply	[flat|nested] 73+ messages in thread

* [patch 2/3] performance counters: documentation
  2008-12-04 23:44 [patch 0/3] [Announcement] Performance Counters for Linux Thomas Gleixner
  2008-12-04 23:44 ` [patch 1/3] performance counters: core code Thomas Gleixner
@ 2008-12-04 23:44 ` Thomas Gleixner
  2008-12-05  0:33   ` Paul Mackerras
  2008-12-04 23:45 ` [patch 3/3] performance counters: x86 support Thomas Gleixner
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 73+ messages in thread
From: Thomas Gleixner @ 2008-12-04 23:44 UTC (permalink / raw)
  To: LKML
  Cc: linux-arch, Andrew Morton, Ingo Molnar, Stephane Eranian,
	Eric Dumazet, Robert Richter, Arjan van de Veen, Peter Anvin,
	Peter Zijlstra, Steven Rostedt, David Miller, Paul Mackerras

[-- Attachment #1: perf-counters-docs.patch --]
[-- Type: text/plain, Size: 4374 bytes --]

From: Ingo Molnar <mingo@elte.hu>

Add more documentation about performance counters.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 Documentation/perf-counters.txt |  104 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 104 insertions(+)

Index: linux/Documentation/perf-counters.txt
===================================================================
--- /dev/null
+++ linux/Documentation/perf-counters.txt
@@ -0,0 +1,104 @@
+
+Performance Counters for Linux
+------------------------------
+
+Performance counters are special hardware registers available on most modern
+CPUs. These registers count the number of certain types of hw events: such
+as instructions executed, cachemisses suffered, or branches mis-predicted -
+without slowing down the kernel or applications. These registers can also
+trigger interrupts when a threshold number of events have passed - and can
+thus be used to profile the code that runs on that CPU.
+
+The Linux Performance Counter subsystem provides an abstraction of these
+hardware capabilities. It provides per task and per CPU counters, and
+it provides event capabilities on top of those.
+
+Performance counters are accessed via special file descriptors.
+There's one file descriptor per virtual counter used.
+
+The special file descriptor is opened via the perf_counter_open()
+system call:
+
+ int
+ perf_counter_open(u32 hw_event_type,
+                   u32 hw_event_period,
+                   u32 record_type,
+                   pid_t pid,
+                   int cpu);
+
+The syscall returns the new fd. The fd can be used via the normal
+VFS system calls: read() can be used to read the counter, fcntl()
+can be used to set the blocking mode, etc.
+
+Multiple counters can be kept open at a time, and the counters
+can be poll()ed.
+
+When creating a new counter fd, 'hw_event_type' is one of:
+
+ enum hw_event_types {
+	PERF_COUNT_CYCLES,
+	PERF_COUNT_INSTRUCTIONS,
+	PERF_COUNT_CACHE_REFERENCES,
+	PERF_COUNT_CACHE_MISSES,
+	PERF_COUNT_BRANCH_INSTRUCTIONS,
+	PERF_COUNT_BRANCH_MISSES,
+ };
+
+These are standardized types of events that work uniformly on all CPUs
+that implements Performance Counters support under Linux. If a CPU is
+not able to count branch-misses, then the system call will return
+-EINVAL.
+
+[ Note: more hw_event_types are supported as well, but they are CPU
+  specific and are enumerated via /sys on a per CPU basis. Raw hw event
+  types can be passed in as negative numbers. For example, to count
+  "External bus cycles while bus lock signal asserted" events on Intel
+  Core CPUs, pass in a -0x4064 event type value. ]
+
+The parameter 'hw_event_period' is the number of events before waking up
+a read() that is blocked on a counter fd. Zero value means a non-blocking
+counter.
+
+'record_type' is the type of data that a read() will provide for the
+counter, and it can be one of:
+
+  enum perf_record_type {
+	PERF_RECORD_SIMPLE,
+	PERF_RECORD_IRQ,
+  };
+
+a "simple" counter is one that counts hardware events and allows
+them to be read out into a u64 count value. (read() returns 8 on
+a successful read of a simple counter.)
+
+An "irq" counter is one that will also provide an IRQ context information:
+the IP of the interrupted context. In this case read() will return
+the 8-byte counter value, plus the Instruction Pointer address of the
+interrupted context.
+
+The 'pid' parameter allows the counter to be specific to a task:
+
+ pid == 0: if the pid parameter is zero, the counter is attached to the
+ current task.
+
+ pid > 0: the counter is attached to a specific task (if the current task
+ has sufficient privilege to do so)
+
+ pid < 0: all tasks are counted (per cpu counters)
+
+The 'cpu' parameter allows a counter to be made specific to a full
+CPU:
+
+ cpu >= 0: the counter is restricted to a specific CPU
+ cpu == -1: the counter counts on all CPUs
+
+Note: the combination of 'pid == -1' and 'cpu == -1' is not valid.
+
+A 'pid > 0' and 'cpu == -1' counter is a per task counter that counts
+events of that task and 'follows' that task to whatever CPU the task
+gets schedule to. Per task counters can be created by any user, for
+their own tasks.
+
+A 'pid == -1' and 'cpu == x' counter is a per CPU counter that counts
+all events on CPU-x. Per CPU counters need CAP_SYS_ADMIN privilege.
+



^ permalink raw reply	[flat|nested] 73+ messages in thread

* [patch 3/3] performance counters: x86 support
  2008-12-04 23:44 [patch 0/3] [Announcement] Performance Counters for Linux Thomas Gleixner
  2008-12-04 23:44 ` [patch 1/3] performance counters: core code Thomas Gleixner
  2008-12-04 23:44 ` [patch 2/3] performance counters: documentation Thomas Gleixner
@ 2008-12-04 23:45 ` Thomas Gleixner
  2008-12-05  0:22 ` [patch 0/3] [Announcement] Performance Counters for Linux Paul Mackerras
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 73+ messages in thread
From: Thomas Gleixner @ 2008-12-04 23:45 UTC (permalink / raw)
  To: LKML
  Cc: linux-arch, Andrew Morton, Ingo Molnar, Stephane Eranian,
	Eric Dumazet, Robert Richter, Arjan van de Veen, Peter Anvin,
	Peter Zijlstra, Steven Rostedt, David Miller, Paul Mackerras

[-- Attachment #1: perf-counters-x86.patch --]
[-- Type: text/plain, Size: 21302 bytes --]

From: Ingo Molnar <mingo@elte.hu>

Implement performance counters for x86 Intel CPUs.

It's simplified right now: the PERFMON CPU feature is assumed,
which is available in Core2 and later Intel CPUs.

The design is flexible to be extended to more CPU types as well.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/Kconfig                               |    1 
 arch/x86/ia32/ia32entry.S                      |    3 
 arch/x86/include/asm/hardirq_32.h              |    1 
 arch/x86/include/asm/hw_irq.h                  |    2 
 arch/x86/include/asm/intel_arch_perfmon.h      |   10 
 arch/x86/include/asm/irq_vectors.h             |    5 
 arch/x86/include/asm/mach-default/entry_arch.h |    5 
 arch/x86/include/asm/pda.h                     |    1 
 arch/x86/include/asm/unistd_32.h               |    1 
 arch/x86/include/asm/unistd_64.h               |    3 
 arch/x86/kernel/apic.c                         |    2 
 arch/x86/kernel/cpu/Makefile                   |   12 
 arch/x86/kernel/cpu/common.c                   |    2 
 arch/x86/kernel/cpu/perf_counter.c             |  363 +++++++++++++++++++++++++
 arch/x86/kernel/entry_64.S                     |    6 
 arch/x86/kernel/irq.c                          |    5 
 arch/x86/kernel/irqinit_32.c                   |    3 
 arch/x86/kernel/irqinit_64.c                   |    5 
 arch/x86/kernel/syscall_table_32.S             |    1 
 19 files changed, 424 insertions(+), 7 deletions(-)

Index: linux/arch/x86/Kconfig
===================================================================
--- linux.orig/arch/x86/Kconfig
+++ linux/arch/x86/Kconfig
@@ -651,6 +651,7 @@ config X86_UP_IOAPIC
 config X86_LOCAL_APIC
 	def_bool y
 	depends on X86_64 || (X86_32 && (X86_UP_APIC || (SMP && !X86_VOYAGER) || X86_GENERICARCH))
+	select HAVE_PERF_COUNTERS
 
 config X86_IO_APIC
 	def_bool y
Index: linux/arch/x86/ia32/ia32entry.S
===================================================================
--- linux.orig/arch/x86/ia32/ia32entry.S
+++ linux/arch/x86/ia32/ia32entry.S
@@ -823,7 +823,8 @@ ia32_sys_call_table:
 	.quad compat_sys_signalfd4
 	.quad sys_eventfd2
 	.quad sys_epoll_create1
-	.quad sys_dup3			/* 330 */
+	.quad sys_dup3				/* 330 */
 	.quad sys_pipe2
 	.quad sys_inotify_init1
+	.quad sys_perf_counter_open
 ia32_syscall_end:
Index: linux/arch/x86/include/asm/hardirq_32.h
===================================================================
--- linux.orig/arch/x86/include/asm/hardirq_32.h
+++ linux/arch/x86/include/asm/hardirq_32.h
@@ -9,6 +9,7 @@ typedef struct {
 	unsigned long idle_timestamp;
 	unsigned int __nmi_count;	/* arch dependent */
 	unsigned int apic_timer_irqs;	/* arch dependent */
+	unsigned int apic_perf_irqs;	/* arch dependent */
 	unsigned int irq0_irqs;
 	unsigned int irq_resched_count;
 	unsigned int irq_call_count;
Index: linux/arch/x86/include/asm/hw_irq.h
===================================================================
--- linux.orig/arch/x86/include/asm/hw_irq.h
+++ linux/arch/x86/include/asm/hw_irq.h
@@ -30,6 +30,8 @@
 /* Interrupt handlers registered during init_IRQ */
 extern void apic_timer_interrupt(void);
 extern void error_interrupt(void);
+extern void perf_counter_interrupt(void);
+
 extern void spurious_interrupt(void);
 extern void thermal_interrupt(void);
 extern void reschedule_interrupt(void);
Index: linux/arch/x86/include/asm/intel_arch_perfmon.h
===================================================================
--- linux.orig/arch/x86/include/asm/intel_arch_perfmon.h
+++ linux/arch/x86/include/asm/intel_arch_perfmon.h
@@ -18,6 +18,8 @@
 #define ARCH_PERFMON_UNHALTED_CORE_CYCLES_PRESENT \
 	(1 << (ARCH_PERFMON_UNHALTED_CORE_CYCLES_INDEX))
 
+#define ARCH_PERFMON_BRANCH_MISSES_RETIRED	(6)
+
 union cpuid10_eax {
 	struct {
 		unsigned int version_id:8;
@@ -28,4 +30,12 @@ union cpuid10_eax {
 	unsigned int full;
 };
 
+#ifdef CONFIG_PERF_COUNTERS
+extern void init_hw_perf_counters(void);
+extern void perf_counters_lapic_init(void);
+#else
+static inline void init_hw_perf_counters(void)		{ }
+static inline void perf_counters_lapic_init(void)	{ }
+#endif
+
 #endif /* _ASM_X86_INTEL_ARCH_PERFMON_H */
Index: linux/arch/x86/include/asm/irq_vectors.h
===================================================================
--- linux.orig/arch/x86/include/asm/irq_vectors.h
+++ linux/arch/x86/include/asm/irq_vectors.h
@@ -87,6 +87,11 @@
 #define LOCAL_TIMER_VECTOR	0xef
 
 /*
+ * Performance monitoring interrupt vector:
+ */
+#define LOCAL_PERF_VECTOR	0xee
+
+/*
  * First APIC vector available to drivers: (vectors 0x30-0xee) we
  * start at 0x31(0x41) to spread out vectors evenly between priority
  * levels. (0x80 is the syscall vector)
Index: linux/arch/x86/include/asm/mach-default/entry_arch.h
===================================================================
--- linux.orig/arch/x86/include/asm/mach-default/entry_arch.h
+++ linux/arch/x86/include/asm/mach-default/entry_arch.h
@@ -25,10 +25,15 @@ BUILD_INTERRUPT(irq_move_cleanup_interru
  * a much simpler SMP time architecture:
  */
 #ifdef CONFIG_X86_LOCAL_APIC
+
 BUILD_INTERRUPT(apic_timer_interrupt,LOCAL_TIMER_VECTOR)
 BUILD_INTERRUPT(error_interrupt,ERROR_APIC_VECTOR)
 BUILD_INTERRUPT(spurious_interrupt,SPURIOUS_APIC_VECTOR)
 
+#ifdef CONFIG_PERF_COUNTERS
+BUILD_INTERRUPT(perf_counter_interrupt, LOCAL_PERF_VECTOR)
+#endif
+
 #ifdef CONFIG_X86_MCE_P4THERMAL
 BUILD_INTERRUPT(thermal_interrupt,THERMAL_APIC_VECTOR)
 #endif
Index: linux/arch/x86/include/asm/pda.h
===================================================================
--- linux.orig/arch/x86/include/asm/pda.h
+++ linux/arch/x86/include/asm/pda.h
@@ -30,6 +30,7 @@ struct x8664_pda {
 	short isidle;
 	struct mm_struct *active_mm;
 	unsigned apic_timer_irqs;
+	unsigned apic_perf_irqs;
 	unsigned irq0_irqs;
 	unsigned irq_resched_count;
 	unsigned irq_call_count;
Index: linux/arch/x86/include/asm/unistd_32.h
===================================================================
--- linux.orig/arch/x86/include/asm/unistd_32.h
+++ linux/arch/x86/include/asm/unistd_32.h
@@ -338,6 +338,7 @@
 #define __NR_dup3		330
 #define __NR_pipe2		331
 #define __NR_inotify_init1	332
+#define __NR_perf_counter_open	333
 
 #ifdef __KERNEL__
 
Index: linux/arch/x86/include/asm/unistd_64.h
===================================================================
--- linux.orig/arch/x86/include/asm/unistd_64.h
+++ linux/arch/x86/include/asm/unistd_64.h
@@ -653,7 +653,8 @@ __SYSCALL(__NR_dup3, sys_dup3)
 __SYSCALL(__NR_pipe2, sys_pipe2)
 #define __NR_inotify_init1			294
 __SYSCALL(__NR_inotify_init1, sys_inotify_init1)
-
+#define __NR_perf_counter_open		295
+__SYSCALL(__NR_perf_counter_open, sys_perf_counter_open)
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR
Index: linux/arch/x86/kernel/apic.c
===================================================================
--- linux.orig/arch/x86/kernel/apic.c
+++ linux/arch/x86/kernel/apic.c
@@ -31,6 +31,7 @@
 #include <linux/dmi.h>
 #include <linux/dmar.h>
 
+#include <asm/intel_arch_perfmon.h>
 #include <asm/atomic.h>
 #include <asm/smp.h>
 #include <asm/mtrr.h>
@@ -1147,6 +1148,7 @@ void __cpuinit setup_local_APIC(void)
 		apic_write(APIC_ESR, 0);
 	}
 #endif
+	perf_counters_lapic_init();
 
 	preempt_disable();
 
Index: linux/arch/x86/kernel/cpu/Makefile
===================================================================
--- linux.orig/arch/x86/kernel/cpu/Makefile
+++ linux/arch/x86/kernel/cpu/Makefile
@@ -1,5 +1,5 @@
 #
-# Makefile for x86-compatible CPU details and quirks
+# Makefile for x86-compatible CPU details, features and quirks
 #
 
 obj-y			:= intel_cacheinfo.o addon_cpuid_features.o
@@ -16,11 +16,13 @@ obj-$(CONFIG_CPU_SUP_CENTAUR_64)	+= cent
 obj-$(CONFIG_CPU_SUP_TRANSMETA_32)	+= transmeta.o
 obj-$(CONFIG_CPU_SUP_UMC_32)		+= umc.o
 
-obj-$(CONFIG_X86_MCE)	+= mcheck/
-obj-$(CONFIG_MTRR)	+= mtrr/
-obj-$(CONFIG_CPU_FREQ)	+= cpufreq/
+obj-$(CONFIG_PERF_COUNTERS)		+= perf_counter.o
 
-obj-$(CONFIG_X86_LOCAL_APIC) += perfctr-watchdog.o
+obj-$(CONFIG_X86_MCE)			+= mcheck/
+obj-$(CONFIG_MTRR)			+= mtrr/
+obj-$(CONFIG_CPU_FREQ)			+= cpufreq/
+
+obj-$(CONFIG_X86_LOCAL_APIC)		+= perfctr-watchdog.o
 
 quiet_cmd_mkcapflags = MKCAP   $@
       cmd_mkcapflags = $(PERL) $(srctree)/$(src)/mkcapflags.pl $< $@
Index: linux/arch/x86/kernel/cpu/common.c
===================================================================
--- linux.orig/arch/x86/kernel/cpu/common.c
+++ linux/arch/x86/kernel/cpu/common.c
@@ -17,6 +17,7 @@
 #include <asm/mmu_context.h>
 #include <asm/mtrr.h>
 #include <asm/mce.h>
+#include <asm/intel_arch_perfmon.h>
 #include <asm/pat.h>
 #include <asm/asm.h>
 #include <asm/numa.h>
@@ -750,6 +751,7 @@ void __init identify_boot_cpu(void)
 #else
 	vgetcpu_set_mode();
 #endif
+	init_hw_perf_counters();
 }
 
 void __cpuinit identify_secondary_cpu(struct cpuinfo_x86 *c)
Index: linux/arch/x86/kernel/cpu/perf_counter.c
===================================================================
--- /dev/null
+++ linux/arch/x86/kernel/cpu/perf_counter.c
@@ -0,0 +1,363 @@
+/*
+ * Performance counter x86 architecture code
+ *
+ *  Copyright(C) 2008 Thomas Gleixner <tglx@linutronix.de>
+ *  Copyright(C) 2008 Red Hat, Inc., Ingo Molnar
+ *
+ *  For licencing details see kernel-base/COPYING
+ */
+
+#include <linux/perf_counter.h>
+#include <linux/capability.h>
+#include <linux/hardirq.h>
+#include <linux/sched.h>
+
+#include <asm/intel_arch_perfmon.h>
+#include <asm/apic.h>
+
+static bool perf_counters_initialized __read_mostly;
+
+/*
+ * Number of (generic) HW counters:
+ */
+static int nr_perf_counters __read_mostly;
+
+/* No support for fixed function counters yet */
+
+#define MAX_COUNTERS		8
+
+struct used_counters {
+	struct perf_counter	*counters[MAX_COUNTERS];
+	unsigned long		used[BITS_TO_LONGS(MAX_COUNTERS)];
+};
+
+/* Read from cpuid ! */
+#define COUNTER_OVERFLOW	(1ULL << 40)
+
+/*
+ * Intel PerfMon v3. Used on Core2 and later.
+ */
+static DEFINE_PER_CPU(struct used_counters, used_counters);
+
+const int intel_perfmon_event_map[] =
+{
+  [PERF_COUNT_CYCLES]			= 0x003c,
+  [PERF_COUNT_INSTRUCTIONS]		= 0x00c0,
+  [PERF_COUNT_CACHE_REFERENCES]		= 0x4f2e,
+  [PERF_COUNT_CACHE_MISSES]		= 0x412e,
+  [PERF_COUNT_BRANCH_INSTRUCTIONS]	= 0x00c4,
+  [PERF_COUNT_BRANCH_MISSES]		= 0x00c5,
+};
+
+const int max_intel_perfmon_events = ARRAY_SIZE(intel_perfmon_event_map);
+
+/*
+ * Setup the hardware configuration for a given hw_event_type
+ */
+int hw_perf_counter_init(struct perf_counter *counter, s32 hw_event_type)
+{
+	struct hw_perf_counter *hwc = &counter->hw;
+
+	if (unlikely(!perf_counters_initialized))
+		return -EINVAL;
+
+	/*
+	 * Count user events, and generate PMC IRQs:
+	 * (keep 'enabled' bit clear for now)
+	 */
+	hwc->config = ARCH_PERFMON_EVENTSEL_USR | ARCH_PERFMON_EVENTSEL_INT;
+
+	/*
+	 * If privileged enough, count OS events too:
+	 */
+	if (capable(CAP_SYS_ADMIN))
+		hwc->config |= ARCH_PERFMON_EVENTSEL_OS;
+
+	hwc->config_base = MSR_ARCH_PERFMON_EVENTSEL0;
+	hwc->counter_base = MSR_ARCH_PERFMON_PERFCTR0;
+
+	hwc->irq_period = counter->__irq_period;
+	/*
+	 * Intel PMCs cannot be accessed sanely above 32 bit width,
+	 * so we install an artificial 1<<31 period regardless of
+	 * the generic counter period:
+	 */
+	if (!hwc->irq_period)
+		hwc->irq_period = 0x7FFFFFFF;
+
+	hwc->next_count = -((s32) hwc->irq_period);
+
+	/*
+	 * Negative event types mean raw encoded event+umask values:
+	 */
+	if (hw_event_type < 0) {
+		counter->hw_event_type = -hw_event_type;
+	} else {
+		if (hw_event_type >= max_intel_perfmon_events)
+			return -EINVAL;
+		/*
+		 * The generic map:
+		 */
+		counter->hw_event_type = intel_perfmon_event_map[hw_event_type];
+	}
+	hwc->config |= counter->hw_event_type;
+
+	return 0;
+}
+
+void hw_perf_counter_enable_config(struct perf_counter *counter)
+{
+	counter->hw.config |= ARCH_PERFMON_EVENTSEL0_ENABLE;
+}
+
+void hw_perf_counter_disable_config(struct perf_counter *counter)
+{
+	counter->hw.config &= ~ARCH_PERFMON_EVENTSEL0_ENABLE;
+}
+
+static void __hw_perf_counter_enable(struct hw_perf_counter *hwc, int idx)
+{
+	wrmsr(hwc->counter_base + idx, hwc->next_count, 0);
+	wrmsr(hwc->config_base + idx, hwc->config, 0);
+}
+
+void hw_perf_counter_enable(struct perf_counter *counter)
+{
+	struct used_counters *uc = &__get_cpu_var(used_counters);
+	struct hw_perf_counter *hwc = &counter->hw;
+	int idx = hwc->idx;
+
+	/* Try to get the previous counter again */
+	if (test_and_set_bit(idx, uc->used)) {
+		idx = find_first_zero_bit(uc->used, nr_perf_counters);
+		set_bit(idx, uc->used);
+		hwc->idx = idx;
+	}
+
+	perf_counters_lapic_init();
+
+	wrmsr(hwc->config_base + idx,
+	      hwc->config & ~ARCH_PERFMON_EVENTSEL0_ENABLE, 0);
+
+	uc->counters[idx] = counter;
+	__hw_perf_counter_enable(hwc, idx);
+}
+
+#ifdef CONFIG_X86_64
+static inline void atomic64_counter_set(struct perf_counter *counter, u64 val)
+{
+	atomic64_set(&counter->count, val);
+}
+#else
+/*
+ * Todo: add proper atomic64_t support to 32-bit x86:
+ */
+static inline void atomic64_counter_set(struct perf_counter *counter, u64 val64)
+{
+	u32 *val32 = (void *)&val64;
+
+	atomic_set(counter->count32 + 0, *(val32 + 0));
+	atomic_set(counter->count32 + 1, *(val32 + 1));
+}
+#endif
+
+static void __hw_perf_save_counter(struct perf_counter *counter,
+				   struct hw_perf_counter *hwc, int idx)
+{
+	s64 raw = -1;
+	s64 delta;
+	int err;
+
+	/*
+	 * Get the raw hw counter value:
+	 */
+	err = rdmsrl_safe(hwc->counter_base + idx, &raw);
+	WARN_ON_ONCE(err);
+
+	/*
+	 * Rebase it to zero (it started counting at -irq_period),
+	 * to see the delta since ->prev_count:
+	 */
+	delta = (s64)hwc->irq_period + (s64)(s32)raw;
+
+	atomic64_counter_set(counter, hwc->prev_count + delta);
+
+	/*
+	 * Adjust the ->prev_count offset - if we went beyond
+	 * irq_period of units, then we got an IRQ and the counter
+	 * was set back to -irq_period:
+	 */
+	while (delta > (s64)hwc->irq_period) {
+		hwc->prev_count += hwc->irq_period;
+		delta -= (s64)hwc->irq_period;
+	}
+
+	/*
+	 * Calculate the next raw counter value we'll write into
+	 * the counter at the next sched-in time:
+	 */
+	delta -= (s64)hwc->irq_period;
+	hwc->next_count = (s32)delta;
+}
+
+void hw_perf_counter_disable(struct perf_counter *counter)
+{
+	struct used_counters *uc = &__get_cpu_var(used_counters);
+	struct hw_perf_counter *hwc = &counter->hw;
+	unsigned int idx = hwc->idx;
+
+	wrmsr(hwc->config_base + idx,
+	      hwc->config & ~ARCH_PERFMON_EVENTSEL0_ENABLE, 0);
+
+	clear_bit(idx, uc->used);
+	uc->counters[idx] = NULL;
+	__hw_perf_save_counter(counter, hwc, idx);
+}
+
+void hw_perf_counter_read(struct perf_counter *counter)
+{
+	struct hw_perf_counter *hwc = &counter->hw;
+	unsigned long addr = hwc->counter_base + hwc->idx;
+	s64 offs, val = -1LL;
+	s32 val32;
+	int err;
+
+	/* Careful: NMI might modify the counter offset */
+	do {
+		offs = hwc->prev_count;
+		err = rdmsrl_safe(addr, &val);
+		WARN_ON_ONCE(err);
+	} while (offs != hwc->prev_count);
+
+	val32 = (s32) val;
+	val =  (s64)hwc->irq_period + (s64)val32;
+	atomic64_counter_set(counter, hwc->prev_count + val);
+}
+
+/*
+ * This handler is triggered by the local APIC, so the APIC IRQ handling
+ * rules apply:
+ */
+void smp_perf_counter_interrupt(struct pt_regs *regs)
+{
+	int bit, cpu = smp_processor_id();
+	struct used_counters *uc;
+	u64 status, *p;
+
+	ack_APIC_irq();
+
+	irq_enter();
+
+#ifdef CONFIG_X86_64
+	add_pda(apic_perf_irqs, 1);
+#else
+	per_cpu(irq_stat, cpu).apic_perf_irqs++;
+#endif
+
+	rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, status);
+	if (!status)
+		goto out;
+
+	uc = &per_cpu(used_counters, cpu);
+
+	for_each_bit(bit, (unsigned long *) &status, nr_perf_counters) {
+		struct perf_counter *counter = uc->counters[bit];
+		struct hw_perf_counter *hwc;
+		struct perf_data *irqdata;
+		int idx;
+
+		if (!counter)
+			continue;
+
+		hwc = &counter->hw;
+		idx = hwc->idx;
+
+		wrmsr(hwc->config_base + idx,
+		      hwc->config & ~ARCH_PERFMON_EVENTSEL0_ENABLE, 0);
+
+		__hw_perf_save_counter(counter, hwc, idx);
+		__hw_perf_counter_enable(hwc, idx);
+
+		if (counter->record_type != PERF_RECORD_IRQ)
+			continue;
+
+		irqdata = counter->irqdata;
+		if (irqdata->len > PERF_DATA_BUFLEN - sizeof(u64))
+			irqdata->overrun++;
+		else {
+			p = (u64 *) &irqdata->data[irqdata->len];
+			*p = instruction_pointer(regs);
+			irqdata->len += sizeof(u64);
+		}
+		wake_up(&counter->waitq);
+	}
+out:
+	/*
+	 * Clear the MASK field of the LAPIC's LVTPC.
+	 *
+	 * IA_SDM_Vol3A says:
+	 *
+	 * " (Pentium 4 and Intel Xeon processors.) When a performance
+	 *   monitoring counters interrupt is generated, the mask bit for
+	 *   its associated LVT entry is set. "
+	 *
+	 * So we need to unmask the LVT entry, otherwise future IRQs are
+	 * masked. Since this does not harm on other CPUs we do this
+	 * unconditionally:
+	 */
+	apic_write(APIC_LVTPC, LOCAL_PERF_VECTOR);
+
+	irq_exit();
+}
+
+void __cpuinit perf_counters_lapic_init(void)
+{
+	u32 apic_val;
+
+	if (!perf_counters_initialized)
+		return;
+	/*
+	 * Enable the performance counter vector in the APIC LVT:
+	 */
+	apic_val = apic_read(APIC_LVTERR);
+
+	apic_write(APIC_LVTERR, apic_val | APIC_LVT_MASKED);
+	apic_write(APIC_LVTPC, LOCAL_PERF_VECTOR);
+	apic_write(APIC_LVTERR, apic_val);
+}
+
+
+void __init init_hw_perf_counters(void)
+{
+	union cpuid10_eax eax;
+	unsigned int unused;
+	unsigned int ebx;
+
+	if (!cpu_has(&boot_cpu_data, X86_FEATURE_ARCH_PERFMON))
+		return;
+
+	/*
+	 * Check whether the Architectural PerfMon supports
+	 * Branch Misses Retired Event or not.
+	 */
+	cpuid(10, &(eax.full), &ebx, &unused, &unused);
+	if (eax.split.mask_length <= ARCH_PERFMON_BRANCH_MISSES_RETIRED)
+		return;
+
+	printk(KERN_INFO "Intel Performance Monitoring support detected.\n");
+
+	printk(KERN_INFO "... version:      %d\n", eax.split.version_id);
+	printk(KERN_INFO "... num_counters: %d\n", eax.split.num_counters);
+	nr_perf_counters = eax.split.num_counters;
+	if (nr_perf_counters > MAX_COUNTERS) {
+		nr_perf_counters = MAX_COUNTERS;
+		WARN(1, KERN_ERR "hw perf counters %d > max(%d), clipping!",
+			nr_perf_counters, MAX_COUNTERS);
+	}
+	printk(KERN_INFO "... bit_width:    %d\n", eax.split.bit_width);
+	printk(KERN_INFO "... mask_length:  %d\n", eax.split.mask_length);
+
+	perf_counters_lapic_init();
+
+	perf_counters_initialized = true;
+}
Index: linux/arch/x86/kernel/entry_64.S
===================================================================
--- linux.orig/arch/x86/kernel/entry_64.S
+++ linux/arch/x86/kernel/entry_64.S
@@ -869,6 +869,12 @@ END(error_interrupt)
 ENTRY(spurious_interrupt)
 	apicinterrupt SPURIOUS_APIC_VECTOR,smp_spurious_interrupt
 END(spurious_interrupt)
+
+#ifdef CONFIG_PERF_COUNTERS
+ENTRY(perf_counter_interrupt)
+	apicinterrupt LOCAL_PERF_VECTOR,smp_perf_counter_interrupt
+END(perf_counter_interrupt)
+#endif
 				
 /*
  * Exception entry points.
Index: linux/arch/x86/kernel/irq.c
===================================================================
--- linux.orig/arch/x86/kernel/irq.c
+++ linux/arch/x86/kernel/irq.c
@@ -56,6 +56,10 @@ static int show_other_interrupts(struct 
 	for_each_online_cpu(j)
 		seq_printf(p, "%10u ", irq_stats(j)->apic_timer_irqs);
 	seq_printf(p, "  Local timer interrupts\n");
+	seq_printf(p, "CNT: ");
+	for_each_online_cpu(j)
+		seq_printf(p, "%10u ", irq_stats(j)->apic_perf_irqs);
+	seq_printf(p, "  Performance counter interrupts\n");
 #endif
 #ifdef CONFIG_SMP
 	seq_printf(p, "RES: ");
@@ -160,6 +164,7 @@ u64 arch_irq_stat_cpu(unsigned int cpu)
 
 #ifdef CONFIG_X86_LOCAL_APIC
 	sum += irq_stats(cpu)->apic_timer_irqs;
+	sum += irq_stats(cpu)->apic_perf_irqs;
 #endif
 #ifdef CONFIG_SMP
 	sum += irq_stats(cpu)->irq_resched_count;
Index: linux/arch/x86/kernel/irqinit_32.c
===================================================================
--- linux.orig/arch/x86/kernel/irqinit_32.c
+++ linux/arch/x86/kernel/irqinit_32.c
@@ -160,6 +160,9 @@ void __init native_init_IRQ(void)
 	/* IPI vectors for APIC spurious and error interrupts */
 	alloc_intr_gate(SPURIOUS_APIC_VECTOR, spurious_interrupt);
 	alloc_intr_gate(ERROR_APIC_VECTOR, error_interrupt);
+# ifdef CONFIG_PERF_COUNTERS
+	alloc_intr_gate(LOCAL_PERF_VECTOR, perf_counter_interrupt);
+# endif
 #endif
 
 #if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_X86_MCE_P4THERMAL)
Index: linux/arch/x86/kernel/irqinit_64.c
===================================================================
--- linux.orig/arch/x86/kernel/irqinit_64.c
+++ linux/arch/x86/kernel/irqinit_64.c
@@ -204,6 +204,11 @@ static void __init apic_intr_init(void)
 	/* IPI vectors for APIC spurious and error interrupts */
 	alloc_intr_gate(SPURIOUS_APIC_VECTOR, spurious_interrupt);
 	alloc_intr_gate(ERROR_APIC_VECTOR, error_interrupt);
+
+	/* Performance monitoring interrupt: */
+#ifdef CONFIG_PERF_COUNTERS
+	alloc_intr_gate(LOCAL_PERF_VECTOR, perf_counter_interrupt);
+#endif
 }
 
 void __init native_init_IRQ(void)
Index: linux/arch/x86/kernel/syscall_table_32.S
===================================================================
--- linux.orig/arch/x86/kernel/syscall_table_32.S
+++ linux/arch/x86/kernel/syscall_table_32.S
@@ -332,3 +332,4 @@ ENTRY(sys_call_table)
 	.long sys_dup3			/* 330 */
 	.long sys_pipe2
 	.long sys_inotify_init1
+	.long sys_perf_counter_open



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 0/3] [Announcement] Performance Counters for Linux
  2008-12-04 23:44 [patch 0/3] [Announcement] Performance Counters for Linux Thomas Gleixner
                   ` (2 preceding siblings ...)
  2008-12-04 23:45 ` [patch 3/3] performance counters: x86 support Thomas Gleixner
@ 2008-12-05  0:22 ` Paul Mackerras
  2008-12-05  6:31   ` Ingo Molnar
  2008-12-05  0:22 ` H. Peter Anvin
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 73+ messages in thread
From: Paul Mackerras @ 2008-12-05  0:22 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, linux-arch, Andrew Morton, Ingo Molnar, Stephane Eranian,
	Eric Dumazet, Robert Richter, Arjan van de Veen, Peter Anvin,
	Peter Zijlstra, Steven Rostedt, David Miller

Thomas Gleixner writes:

> We'd like to announce a brand new implementation of performance counter
> support for Linux. It is a very simple and extensible design that has the
> potential to implement the full range of features we would expect from such
> a subsystem.

Looks like the sort of thing I was thinking about a year or so ago
when I was trying to come up with a simpler API than perfmon2.
However, it turned out that my design, and I believe yours too, can't
do some things that users really want to do with performance
counters.

One thing that this sort of thing can't do is to get values from
multiple counters that correlate with each other.  For instance, we
would often want to count, say, L2 cache misses and instructions
completed at the same time, and be able to read both counters at very
close to the same time, so that we can measure average L2 cache misses
per instruction completed, which is useful.

Another problem is that this abstraction provides no way to deal with
interrelationships between counters.  For example, on PowerPC it is
common to have a facility where one counter overflowing can cause all
the other counters to freeze.  I don't see this abstraction providing
any way to handle that.

It looks to me that your new API will be unworkable for real
performance measurement and tuning, just like mine ended up being. :)

Paul.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 0/3] [Announcement] Performance Counters for Linux
  2008-12-04 23:44 [patch 0/3] [Announcement] Performance Counters for Linux Thomas Gleixner
                   ` (3 preceding siblings ...)
  2008-12-05  0:22 ` [patch 0/3] [Announcement] Performance Counters for Linux Paul Mackerras
@ 2008-12-05  0:22 ` H. Peter Anvin
  2008-12-05  0:43   ` Paul Mackerras
  2008-12-05  1:12 ` David Miller
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 73+ messages in thread
From: H. Peter Anvin @ 2008-12-05  0:22 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, linux-arch, Andrew Morton, Ingo Molnar, Stephane Eranian,
	Eric Dumazet, Robert Richter, Arjan van de Veen, Peter Zijlstra,
	Steven Rostedt, David Miller, Paul Mackerras

Thomas Gleixner wrote:
> 
> We'd like to announce a brand new implementation of performance counter
> support for Linux. It is a very simple and extensible design that has the
> potential to implement the full range of features we would expect from such
> a subsystem.
> 

First of all, let me say I really like what I've seen so far.  The file
descriptor paradigm seems really elegant to me.

>  - Only one single new system call is needed: sys_perf_counter_open().
>    All performance-counter operations are implemented via standard
>    VFS APIs such as read() / fcntl() and poll().

As previously discussed, I think this should be a filesystem rather than
a system call.  There are a couple of advantages to doing it that way, IMO:

- Strings, rather than numbers, which means fewer constraints across
  architectures.
- The events available can be exported in the filesystem itself (via
  readdir) rather than via sysfs.
- Compatibility with existing tools, esp. non-C tools.

I'm thinking of something like:

/dev/perfctr/3/cache_misses/all/simple/300

i.e. /dev/perfctr/<cpu>/<event>/<pid>/<type>/<period>.  I am putting
<cpu> ahead of <event> in the hierarchy, so a readdir() on the <cpu>
directory can show the events available by name on that CPU.  Raw
hardware events can be accessed by something like
/dev/perfctr/<cpu>/0x4064/...

	-hpa

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 2/3] performance counters: documentation
  2008-12-04 23:44 ` [patch 2/3] performance counters: documentation Thomas Gleixner
@ 2008-12-05  0:33   ` Paul Mackerras
  2008-12-05  0:37     ` David Miller
  2008-12-05  2:33     ` Andi Kleen
  0 siblings, 2 replies; 73+ messages in thread
From: Paul Mackerras @ 2008-12-05  0:33 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, linux-arch, Andrew Morton, Ingo Molnar, Stephane Eranian,
	Eric Dumazet, Robert Richter, Arjan van de Veen, Peter Anvin,
	Peter Zijlstra, Steven Rostedt, David Miller

Thomas Gleixner writes:

> + enum hw_event_types {
> +	PERF_COUNT_CYCLES,
> +	PERF_COUNT_INSTRUCTIONS,
> +	PERF_COUNT_CACHE_REFERENCES,
> +	PERF_COUNT_CACHE_MISSES,
> +	PERF_COUNT_BRANCH_INSTRUCTIONS,
> +	PERF_COUNT_BRANCH_MISSES,
> + };
> +
> +These are standardized types of events that work uniformly on all CPUs
> +that implements Performance Counters support under Linux. If a CPU is
> +not able to count branch-misses, then the system call will return
> +-EINVAL.
> +
> +[ Note: more hw_event_types are supported as well, but they are CPU
> +  specific and are enumerated via /sys on a per CPU basis. Raw hw event
> +  types can be passed in as negative numbers. For example, to count
> +  "External bus cycles while bus lock signal asserted" events on Intel
> +  Core CPUs, pass in a -0x4064 event type value. ]

This is going to be a huge problem, at least on powerpc, because it
means that the kernel will have to know which events can be counted on
which counters and what values need to be put in the control registers
to select them.

The thing is that not all the counters count the same set of events,
or use the same select values when they can count the same events.
For example, on a MPC7450 cpu, you can count L2 cache misses in PMC5
or PMC6.  If you're counting them on PMC5 you need to put 19 into the
PCM5 event selector field in the MMCR1 register.  But if you're
counting them on PMC6 then you need to put 29 in the PMC6 event
selector field in MMCR1.

Since we don't get to say which counter to use in perf_counter_open,
we would have to pass an abstracted "L2 cache miss" event code and
have that map to 19 or 29 depending on which PMC register we get to
use.  But that means that the kernel then has to have the entire table
of countable events for every supported CPU model - something that
perfmon3 manages to keep out of the kernel.

The situation will be even worse with POWER5 and POWER6, where the
event selection logic is very complex, with multiple layers of
multiplexers.  I really really don't want the kernel to have to know
about all that.

Basically, what it boils down to is that treating performance monitor
counters as independent units is just not feasible, at least on
powerpc.  We really need to be able to deal with the full set of
counters as one thing.

Paul.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 2/3] performance counters: documentation
  2008-12-05  0:33   ` Paul Mackerras
@ 2008-12-05  0:37     ` David Miller
  2008-12-05  2:50       ` Arjan van de Ven
  2008-12-05  2:33     ` Andi Kleen
  1 sibling, 1 reply; 73+ messages in thread
From: David Miller @ 2008-12-05  0:37 UTC (permalink / raw)
  To: paulus
  Cc: tglx, linux-kernel, linux-arch, akpm, mingo, eranian, dada1,
	robert.richter, arjan, hpa, a.p.zijlstra, rostedt

From: Paul Mackerras <paulus@samba.org>
Date: Fri, 5 Dec 2008 11:33:31 +1100

> This is going to be a huge problem, at least on powerpc, because it
> means that the kernel will have to know which events can be counted on
> which counters and what values need to be put in the control registers
> to select them.

Sparc64 is the same.

> The situation will be even worse with POWER5 and POWER6, where the
> event selection logic is very complex, with multiple layers of
> multiplexers.  I really really don't want the kernel to have to know
> about all that.

Niagara2 has deep multiplexing and sub-event masking too.

I really appreciated how perfmon kept all of those details
in userspace.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 0/3] [Announcement] Performance Counters for Linux
  2008-12-05  0:22 ` H. Peter Anvin
@ 2008-12-05  0:43   ` Paul Mackerras
  0 siblings, 0 replies; 73+ messages in thread
From: Paul Mackerras @ 2008-12-05  0:43 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Thomas Gleixner, LKML, linux-arch, Andrew Morton, Ingo Molnar,
	Stephane Eranian, Eric Dumazet, Robert Richter,
	Arjan van de Veen, Peter Zijlstra, Steven Rostedt, David Miller

H. Peter Anvin writes:

> First of all, let me say I really like what I've seen so far.  The file
> descriptor paradigm seems really elegant to me.

I have to say, without intending any disrespect, that it looks to me
like it was designed by someone who hasn't actually ever done much
serious performance analysis or tuning using these hardware
facilities.  If I'm wrong about that, I'm willing to be corrected.

Paul.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 0/3] [Announcement] Performance Counters for Linux
  2008-12-04 23:44 [patch 0/3] [Announcement] Performance Counters for Linux Thomas Gleixner
                   ` (4 preceding siblings ...)
  2008-12-05  0:22 ` H. Peter Anvin
@ 2008-12-05  1:12 ` David Miller
  2008-12-05  6:10   ` Ingo Molnar
  2008-12-05  3:30 ` Andrew Morton
  2008-12-06  2:36 ` stephane eranian
  7 siblings, 1 reply; 73+ messages in thread
From: David Miller @ 2008-12-05  1:12 UTC (permalink / raw)
  To: tglx
  Cc: linux-kernel, linux-arch, akpm, mingo, eranian, dada1,
	robert.richter, arjan, hpa, a.p.zijlstra, rostedt, paulus

From: Thomas Gleixner <tglx@linutronix.de>
Date: Thu, 04 Dec 2008 23:44:39 -0000

>  - No interaction with ptrace: any task (with sufficient permissions) can
>    monitor other tasks, without having to stop that task.

This isn't going to work.

If you look at the things the perfmon libraries do, you do need to
stop the task.

Consider counter virtualization as the most direct example.  Perfmon
allows you to count 6 events even if you can only monitor 2 at a time
with your hardware.  It does this by periodically changing the counter
configuration during the run of the program(s) being inspected.  These
control register changes and counter captures have to be atomic or
else you'll get garbage or less accurate results.

There are entire families of cases where you want to perform a
sequence of operations on the control registers and counters if one of
them overflows.  And these operations must be atomic.  The only way
to ensure this is to stop the task, then let the library in the
monitoring task make the changes, and finally let the library
release that task.

The crux of the matter is, when a counter overflows, what you want to
do in response to that event is non-trivial and it must be performed
without letting the monitored task continue executing.  So you have to
stop the task, and unless you want tons of cpu specific knowledge and
counter virtualization support code in the kernel, we want userspace
telling the kernel how to program the registers.  And since we have
to stop the task, there is no benefit doing this work in the kernel
anyways.

If you don't like the NMI and IPI business on x86 in the perfmon
patches, suggest alternatives.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 2/3] performance counters: documentation
  2008-12-05  0:33   ` Paul Mackerras
  2008-12-05  0:37     ` David Miller
@ 2008-12-05  2:33     ` Andi Kleen
  1 sibling, 0 replies; 73+ messages in thread
From: Andi Kleen @ 2008-12-05  2:33 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Thomas Gleixner, LKML, linux-arch, Andrew Morton, Ingo Molnar,
	Stephane Eranian, Eric Dumazet, Robert Richter,
	Arjan van de Veen, Peter Anvin, Peter Zijlstra, Steven Rostedt,
	David Miller

Paul Mackerras <paulus@samba.org> writes:
>> +
>> +[ Note: more hw_event_types are supported as well, but they are CPU
>> +  specific and are enumerated via /sys on a per CPU basis. Raw hw event
>> +  types can be passed in as negative numbers. For example, to count
>> +  "External bus cycles while bus lock signal asserted" events on Intel
>> +  Core CPUs, pass in a -0x4064 event type value. ]
>
> This is going to be a huge problem, at least on powerpc, because it
> means that the kernel will have to know which events can be counted on
> which counters and what values need to be put in the control registers
> to select them.

P4 has similar problems, and to some extent there's also the same
problem on newer Intel CPUs (e.g. with fixed counters and if you
consider PEBS which has some special restrictions)

-Andi

-- 
ak@linux.intel.com

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 2/3] performance counters: documentation
  2008-12-05  0:37     ` David Miller
@ 2008-12-05  2:50       ` Arjan van de Ven
  2008-12-05  3:26         ` David Miller
  0 siblings, 1 reply; 73+ messages in thread
From: Arjan van de Ven @ 2008-12-05  2:50 UTC (permalink / raw)
  To: David Miller
  Cc: paulus, tglx, linux-kernel, linux-arch, akpm, mingo, eranian,
	dada1, robert.richter, hpa, a.p.zijlstra, rostedt

On Thu, 04 Dec 2008 16:37:41 -0800 (PST)
David Miller <davem@davemloft.net> wrote:

> From: Paul Mackerras <paulus@samba.org>
> Date: Fri, 5 Dec 2008 11:33:31 +1100
> 
> > This is going to be a huge problem, at least on powerpc, because it
> > means that the kernel will have to know which events can be counted
> > on which counters and what values need to be put in the control
> > registers to select them.
> 
> Sparc64 is the same.
> 
> > The situation will be even worse with POWER5 and POWER6, where the
> > event selection logic is very complex, with multiple layers of
> > multiplexers.  I really really don't want the kernel to have to know
> > about all that.
> 
> Niagara2 has deep multiplexing and sub-event masking too.
> 
> I really appreciated how perfmon kept all of those details
> in userspace.

I would like to respectfully disagree with this some. The kernel needs
to abstract hardware to some degree for userspace. The problem in this
case is that userspace can't really do a better job, in fact it can
only do a worse job since it lacks the coordination capability of
knowing it has full control of all the hardware registers. 
I am sure the corner cases you're talking about are nasty, I just don't
think they are less nasty when dealt with in userspace. Sure the kernel
might be simpler, but the system as a whole sure is not.



-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 2/3] performance counters: documentation
  2008-12-05  2:50       ` Arjan van de Ven
@ 2008-12-05  3:26         ` David Miller
  0 siblings, 0 replies; 73+ messages in thread
From: David Miller @ 2008-12-05  3:26 UTC (permalink / raw)
  To: arjan
  Cc: paulus, tglx, linux-kernel, linux-arch, akpm, mingo, eranian,
	dada1, robert.richter, hpa, a.p.zijlstra, rostedt

From: Arjan van de Ven <arjan@infradead.org>
Date: Thu, 4 Dec 2008 18:50:02 -0800

> I would like to respectfully disagree with this some. The kernel needs
> to abstract hardware to some degree for userspace. The problem in this
> case is that userspace can't really do a better job, in fact it can
> only do a worse job since it lacks the coordination capability of
> knowing it has full control of all the hardware registers. 

The perfmon context abstraction dealt with that.  Code using the
perfmon interfaces provided a set of counter and control register
values to the kernel.

The kernel merely loaded and unloaded them when a process (or group of
processes) ran.

The kernel is a validity checker, and that minimal stuff is exactly
what the perfmon kernel component implemented.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 0/3] [Announcement] Performance Counters for Linux
  2008-12-04 23:44 [patch 0/3] [Announcement] Performance Counters for Linux Thomas Gleixner
                   ` (5 preceding siblings ...)
  2008-12-05  1:12 ` David Miller
@ 2008-12-05  3:30 ` Andrew Morton
  2008-12-06  2:36 ` stephane eranian
  7 siblings, 0 replies; 73+ messages in thread
From: Andrew Morton @ 2008-12-05  3:30 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, linux-arch, Ingo Molnar, Stephane Eranian, Eric Dumazet,
	Robert Richter, Arjan van de Veen, Peter Anvin, Peter Zijlstra,
	Steven Rostedt, David Miller, Paul Mackerras, perfctr-devel

On Thu, 04 Dec 2008 23:44:39 -0000 Thomas Gleixner <tglx@linutronix.de> wrote:

> Performance counters are special hardware registers available on most modern
> CPUs. These register count the number of certain types of hw events: such
> as instructions executed, cachemisses suffered, or branches mis-predicted,
> without slowing down the kernel or applications. These registers can also
> trigger interrupts when a threshold number of events have passed - and can
> thus be used to profile the code that runs on that CPU.
> 
> We'd like to announce a brand new implementation of performance counter
> support for Linux. It is a very simple and extensible design that has the
> potential to implement the full range of features we would expect from such
> a subsystem.
> 
> The Linux Performance Counter subsystem (implemented via the patches
> posted in this announcement) provides an abstraction of performance counter
> hardware capabilities. It provides per task and per CPU counters, and it
> provides event capabilities on top of those.
> 
> The code is far from complete - but the basic approach is already there
> and stable.
> 
> The biggest missing detail is lowlevel support for non-Intel CPUs and
> older Intel CPUs - right now the code is implemented for Intel Core2
> (and later) Intel CPUs that have the PERFMON CPU feature. (see below
> a wider list of missing/upcoming features)
> 
> We are aware of the perfmon3 patchset that has been submitted to lkml
> recently. Our patchset tries to achieve a similar end result, with
> a fundamentally different (and we believe, superior :-) design:

There's also the perfctr patchset, which has been available for a long
time.

I believe that established users of this sort of capability often
access it via the supposed-to-be-cross-platform PAPI interface/library.

Please cc perfctr-devel@lists.sourceforge.net on emails related to this
work.


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 0/3] [Announcement] Performance Counters for Linux
  2008-12-05  1:12 ` David Miller
@ 2008-12-05  6:10   ` Ingo Molnar
  2008-12-05  7:50     ` David Miller
                       ` (2 more replies)
  0 siblings, 3 replies; 73+ messages in thread
From: Ingo Molnar @ 2008-12-05  6:10 UTC (permalink / raw)
  To: David Miller
  Cc: tglx, linux-kernel, linux-arch, akpm, eranian, dada1,
	robert.richter, arjan, hpa, a.p.zijlstra, rostedt, paulus


* David Miller <davem@davemloft.net> wrote:

> From: Thomas Gleixner <tglx@linutronix.de>
> Date: Thu, 04 Dec 2008 23:44:39 -0000
> 
> >  - No interaction with ptrace: any task (with sufficient permissions) can
> >    monitor other tasks, without having to stop that task.
> 
> This isn't going to work.
>
> If you look at the things the perfmon libraries do, you do need to stop 
> the task.
>
> Consider counter virtualization as the most direct example. [...]

Note that counter virtualization is not offered in the perfmon3 patchset 
that has been posted to lkml. (It is part of the much larger 'full' 
perfmon patchset which has not been submitted for integration)

Nevertheless we will offer counter virtualization in -v2 of our patchset 
and we mentioned it in the TODO list:

> >  - Round-robin scheduling of counters, when there's more task
> >    counters than hw counters available.

The 'target' task does not have to be stopped to offer counter 
virtualization (counter overcommit or counter scheduling) - or to offer 
any of the other performance counter features. Please let us know why it 
needs the task to be stopped - we asked about that on lkml in the perfmon 
thread and no technical answer was given, and couldnt find any such 
technical reason while implementing it ourselves.

Relying on ptrace machinery can be considered one of the bigger design 
mistakes of the permon3 patchset.

We pointed that out in review, and now we demonstrate it via this 
patchset that it can be done much cleaner and much simpler. (Please stay 
tuned for -v2 if you want to see the proof of the pudding.)

	Ingo

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 0/3] [Announcement] Performance Counters for Linux
  2008-12-05  0:22 ` [patch 0/3] [Announcement] Performance Counters for Linux Paul Mackerras
@ 2008-12-05  6:31   ` Ingo Molnar
  2008-12-05  7:02     ` Arjan van de Ven
                       ` (2 more replies)
  0 siblings, 3 replies; 73+ messages in thread
From: Ingo Molnar @ 2008-12-05  6:31 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Thomas Gleixner, LKML, linux-arch, Andrew Morton,
	Stephane Eranian, Eric Dumazet, Robert Richter,
	Arjan van de Veen, Peter Anvin, Peter Zijlstra, Steven Rostedt,
	David Miller


* Paul Mackerras <paulus@samba.org> wrote:

> Thomas Gleixner writes:
> 
> > We'd like to announce a brand new implementation of performance counter
> > support for Linux. It is a very simple and extensible design that has the
> > potential to implement the full range of features we would expect from such
> > a subsystem.
> 
> Looks like the sort of thing I was thinking about a year or so ago when 
> I was trying to come up with a simpler API than perfmon2. However, it 
> turned out that my design, and I believe yours too, can't do some 
> things that users really want to do with performance counters.
> 
> One thing that this sort of thing can't do is to get values from 
> multiple counters that correlate with each other.  For instance, we 
> would often want to count, say, L2 cache misses and instructions 
> completed at the same time, and be able to read both counters at very 
> close to the same time, so that we can measure average L2 cache misses 
> per instruction completed, which is useful.

This can be done in a very natural way with our abstraction, and the 
"hello.c" example happens to do exactly that:

  aldebaran:~/perf-counter-test> ./hello
  doing perf_counter_open() call:
  counter[0]... fd: 3.
  counter[1]... fd: 4.
  counter[0] delta: 10866 cycles
  counter[1] delta: 414 cycles
  counter[0] delta: 23640 cycles
  counter[1] delta: 3673 cycles
  counter[0] delta: 28225 cycles
  counter[1] delta: 3695 cycles

This counts cycles executed and instructions executed, and reads the two 
counters out at the same time.

I just modified it to measure the exact example you mentioned above - L2 
cache misses and instructions completed, sampled once every second:

  titan:~/perf-counter-test> ./hello
  doing perf_counter_open() call:

  counter[0] delta: 1 cachemisses
  counter[1] delta: 497 instructions

  counter[0] delta: 14 cachemisses
  counter[1] delta: 4303 instructions

  counter[0] delta: 6 cachemisses
  counter[1] delta: 3666 instructions

  counter[0] delta: 2 cachemisses
  counter[1] delta: 3641 instructions

  counter[0] delta: 1 cachemisses
  counter[1] delta: 3641 instructions

It's a matter of:

        fd1 = perf_counter_open(PERF_COUNT_CACHE_MISSES, 0, 0, 0, -1);
        fd2 = perf_counter_open(PERF_COUNT_INSTRUCTIONS, 0, 0, 0, -1);

So it's very much possible. (If i've missed something about your example 
then please let me know.)

> Another problem is that this abstraction provides no way to deal with 
> interrelationships between counters.  For example, on PowerPC it is 
> common to have a facility where one counter overflowing can cause all 
> the other counters to freeze.  I don't see this abstraction providing 
> any way to handle that.

We could add that facility if it makes sense - there's no reason why 
there couldnt be event interaction between counters - we just went for 
the most common event variants in v1.

Btw., i'm curious, why would we want to do that? It skews the results if 
the task continues executing and counters stop. To get the highest 
quality profiling output the counters should follow the true state of the 
task that is profiled - and events should be passed to the monitoring 
task asynchronously. The _events_ can contain precise coupled information 
- but the counters should continue.

What i'd do is what hello.c does: if you want to read out multiple 
counters at once, you can read them out at once.

(Again, please explain in more detail if i have missed something about 
your observation.)

	Ingo

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 0/3] [Announcement] Performance Counters for Linux
  2008-12-05  6:31   ` Ingo Molnar
@ 2008-12-05  7:02     ` Arjan van de Ven
  2008-12-05  7:52       ` David Miller
  2008-12-05  7:03     ` Ingo Molnar
  2008-12-05  7:54     ` Paul Mackerras
  2 siblings, 1 reply; 73+ messages in thread
From: Arjan van de Ven @ 2008-12-05  7:02 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Paul Mackerras, Thomas Gleixner, LKML, linux-arch, Andrew Morton,
	Stephane Eranian, Eric Dumazet, Robert Richter, Peter Anvin,
	Peter Zijlstra, Steven Rostedt, David Miller

On Fri, 5 Dec 2008 07:31:31 +0100
Ingo Molnar <mingo@elte.hu> wrote:

> Btw., i'm curious, why would we want to do that? It skews the results
> if the task continues executing and counters stop. To get the highest 
> quality profiling output the counters should follow the true state of
> the task that is profiled - and events should be passed to the
> monitoring task asynchronously. The _events_ can contain precise
> coupled information 
> - but the counters should continue.

btw stopping the task on counter overflow is an issue for things that
want to self profile, like JITs


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 0/3] [Announcement] Performance Counters for Linux
  2008-12-05  6:31   ` Ingo Molnar
  2008-12-05  7:02     ` Arjan van de Ven
@ 2008-12-05  7:03     ` Ingo Molnar
  2008-12-05  7:16       ` Peter Zijlstra
  2008-12-05  7:57       ` David Miller
  2008-12-05  7:54     ` Paul Mackerras
  2 siblings, 2 replies; 73+ messages in thread
From: Ingo Molnar @ 2008-12-05  7:03 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Thomas Gleixner, LKML, linux-arch, Andrew Morton,
	Stephane Eranian, Eric Dumazet, Robert Richter,
	Arjan van de Veen, Peter Anvin, Peter Zijlstra, Steven Rostedt,
	David Miller


* Ingo Molnar <mingo@elte.hu> wrote:

> This can be done in a very natural way with our abstraction, and the 
> "hello.c" example happens to do exactly that:

multiple people pointed out that we have not posted hello.c :-/

Here's a simple standalone example (full working code attached below):

int main(void)
{
	unsigned long long count1, count2;
	int fd1, fd2, ret;

	fd1 = perf_counter_open(PERF_COUNT_INSTRUCTIONS, 0, 0, 0, -1);
	assert(fd1 >= 0);
	fd2 = perf_counter_open(PERF_COUNT_CACHE_MISSES, 0, 0, 0, -1);
	assert(fd1 >= 0);

	for (;;) {
		ret = read(fd1, &count1, sizeof(count1));
		assert(ret == 8);
		ret = read(fd2, &count2, sizeof(count2));
		assert(ret == 8);

		printf("counter1 value: %Ld instructions\n", count1);
		printf("counter2 value: %Ld cachemisses\n",  count2);
		sleep(1);
	}
	return 0;
}


which gives this output (one readout per second):

  titan:~/perf-counter-test> ./simple 
  counter1 value: 0 instructions
  counter2 value: 0 cachemisses
  counter1 value: 23 instructions
  counter2 value: 0 cachemisses
  counter1 value: 2853 instructions
  counter2 value: 6 cachemisses
  counter1 value: 5736 instructions
  counter2 value: 7 cachemisses
  counter1 value: 8619 instructions
  counter2 value: 8 cachemisses
  counter1 value: 11502 instructions
  counter2 value: 8 cachemisses
  ^C

You need our patchset but then the code below will work just fine. No 
libraries, no context setup, nothing - just what is more interesting: the 
counter and profiling data.

	Ingo

----------------->
/*
 * Very simple performance counter testcase.
 */
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/uio.h>

#include <linux/unistd.h>

#include <assert.h>
#include <unistd.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <fcntl.h>

#ifdef __x86_64__
# define __NR_perf_counter_open	295
#endif

#ifdef __i386__
# define __NR_perf_counter_open 333
#endif

int
perf_counter_open(int		hw_event_type,
                  unsigned int	hw_event_period,
                  unsigned int	record_type,
                  pid_t		pid,
                  int		cpu)
{
	return syscall(__NR_perf_counter_open, hw_event_type, hw_event_period,
			record_type, pid, cpu);
}

enum hw_event_types {
	PERF_COUNT_CYCLES,
	PERF_COUNT_INSTRUCTIONS,
	PERF_COUNT_CACHE_REFERENCES,
	PERF_COUNT_CACHE_MISSES,
	PERF_COUNT_BRANCH_INSTRUCTIONS,
	PERF_COUNT_BRANCH_MISSES,
};

int main(void)
{
	unsigned long long count1, count2;
	int fd1, fd2, ret;

	fd1 = perf_counter_open(PERF_COUNT_INSTRUCTIONS, 0, 0, 0, -1);
	assert(fd1 >= 0);
	fd2 = perf_counter_open(PERF_COUNT_CACHE_MISSES, 0, 0, 0, -1);
	assert(fd1 >= 0);

	for (;;) {
		ret = read(fd1, &count1, sizeof(count1));
		assert(ret == 8);
		ret = read(fd2, &count2, sizeof(count2));
		assert(ret == 8);

		printf("counter1 value: %Ld instructions\n", count1);
		printf("counter2 value: %Ld cachemisses\n",  count2);
		sleep(1);
	}
	return 0;
}


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 0/3] [Announcement] Performance Counters for Linux
  2008-12-05  7:03     ` Ingo Molnar
@ 2008-12-05  7:16       ` Peter Zijlstra
  2008-12-05  7:57         ` Paul Mackerras
  2008-12-05  7:57       ` David Miller
  1 sibling, 1 reply; 73+ messages in thread
From: Peter Zijlstra @ 2008-12-05  7:16 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Paul Mackerras, Thomas Gleixner, LKML, linux-arch, Andrew Morton,
	Stephane Eranian, Eric Dumazet, Robert Richter,
	Arjan van de Veen, Peter Anvin, Steven Rostedt, David Miller

On Fri, 2008-12-05 at 08:03 +0100, Ingo Molnar wrote:

> int main(void)
> {
> 	unsigned long long count1, count2;
> 	int fd1, fd2, ret;
> 
> 	fd1 = perf_counter_open(PERF_COUNT_INSTRUCTIONS, 0, 0, 0, -1);
> 	assert(fd1 >= 0);
> 	fd2 = perf_counter_open(PERF_COUNT_CACHE_MISSES, 0, 0, 0, -1);
> 	assert(fd1 >= 0);
> 
> 	for (;;) {
> 		ret = read(fd1, &count1, sizeof(count1));
> 		assert(ret == 8);
> 		ret = read(fd2, &count2, sizeof(count2));
> 		assert(ret == 8);
> 
> 		printf("counter1 value: %Ld instructions\n", count1);
> 		printf("counter2 value: %Ld cachemisses\n",  count2);
> 		sleep(1);
> 	}
> 	return 0;
> }

So, while most people would not consider two consecutive read() ops to
be close or near the same time, due to preemption and such, that is
taken away by the fact that the counters are task local time based - so
preemption doesn't affect thing. Right?


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 0/3] [Announcement] Performance Counters for Linux
  2008-12-05  6:10   ` Ingo Molnar
@ 2008-12-05  7:50     ` David Miller
  2008-12-05  9:34     ` Paul Mackerras
  2008-12-05 10:05     ` Ingo Molnar
  2 siblings, 0 replies; 73+ messages in thread
From: David Miller @ 2008-12-05  7:50 UTC (permalink / raw)
  To: mingo
  Cc: tglx, linux-kernel, linux-arch, akpm, eranian, dada1,
	robert.richter, arjan, hpa, a.p.zijlstra, rostedt, paulus

From: Ingo Molnar <mingo@elte.hu>
Date: Fri, 5 Dec 2008 07:10:12 +0100

> * David Miller <davem@davemloft.net> wrote:
> 
> > From: Thomas Gleixner <tglx@linutronix.de>
> > Date: Thu, 04 Dec 2008 23:44:39 -0000
> > 
> > >  - No interaction with ptrace: any task (with sufficient permissions) can
> > >    monitor other tasks, without having to stop that task.
> > 
> > This isn't going to work.
> >
> > If you look at the things the perfmon libraries do, you do need to stop 
> > the task.
> >
> > Consider counter virtualization as the most direct example. [...]
> 
> Note that counter virtualization is not offered in the perfmon3 patchset 
> that has been posted to lkml. (It is part of the much larger 'full' 
> perfmon patchset which has not been submitted for integration)

I know, it was yanked out to make a merge more likely.

> Relying on ptrace machinery can be considered one of the bigger design 
> mistakes of the permon3 patchset.

I totally disagree.

> We pointed that out in review, and now we demonstrate it via this 
> patchset that it can be done much cleaner and much simpler. (Please stay 
> tuned for -v2 if you want to see the proof of the pudding.)

I hope it will provide enough for full PAPI library support, otherwise
it's useless for most of the world.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 0/3] [Announcement] Performance Counters for Linux
  2008-12-05  7:02     ` Arjan van de Ven
@ 2008-12-05  7:52       ` David Miller
  0 siblings, 0 replies; 73+ messages in thread
From: David Miller @ 2008-12-05  7:52 UTC (permalink / raw)
  To: arjan
  Cc: mingo, paulus, tglx, linux-kernel, linux-arch, akpm, eranian,
	dada1, robert.richter, hpa, a.p.zijlstra, rostedt

From: Arjan van de Ven <arjan@infradead.org>
Date: Thu, 4 Dec 2008 23:02:06 -0800

> On Fri, 5 Dec 2008 07:31:31 +0100
> Ingo Molnar <mingo@elte.hu> wrote:
> 
> > Btw., i'm curious, why would we want to do that? It skews the results
> > if the task continues executing and counters stop. To get the highest 
> > quality profiling output the counters should follow the true state of
> > the task that is profiled - and events should be passed to the
> > monitoring task asynchronously. The _events_ can contain precise
> > coupled information 
> > - but the counters should continue.
> 
> btw stopping the task on counter overflow is an issue for things that
> want to self profile, like JITs

They can fork off a thread to do this.

No blocking on couter overflow leads to inaccurate results.
This is a pretty fundamental aspect of perf counter usage.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 0/3] [Announcement] Performance Counters for Linux
  2008-12-05  6:31   ` Ingo Molnar
  2008-12-05  7:02     ` Arjan van de Ven
  2008-12-05  7:03     ` Ingo Molnar
@ 2008-12-05  7:54     ` Paul Mackerras
  2008-12-05  8:08       ` Ingo Molnar
  2 siblings, 1 reply; 73+ messages in thread
From: Paul Mackerras @ 2008-12-05  7:54 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thomas Gleixner, LKML, linux-arch, Andrew Morton,
	Stephane Eranian, Eric Dumazet, Robert Richter,
	Arjan van de Veen, Peter Anvin, Peter Zijlstra, Steven Rostedt,
	David Miller

Ingo Molnar writes:
> 
> * Paul Mackerras <paulus@samba.org> wrote:
[snip]
> > One thing that this sort of thing can't do is to get values from 
> > multiple counters that correlate with each other.  For instance, we 
> > would often want to count, say, L2 cache misses and instructions 
> > completed at the same time, and be able to read both counters at very 
> > close to the same time, so that we can measure average L2 cache misses 
> > per instruction completed, which is useful.
> 
> This can be done in a very natural way with our abstraction, and the 
> "hello.c" example happens to do exactly that:

Has hello.c been posted?  I can't find it in any of the posts from you
or Thomas.  Am I just being blind? :)

>   aldebaran:~/perf-counter-test> ./hello
>   doing perf_counter_open() call:
>   counter[0]... fd: 3.
>   counter[1]... fd: 4.
>   counter[0] delta: 10866 cycles
>   counter[1] delta: 414 cycles
>   counter[0] delta: 23640 cycles
>   counter[1] delta: 3673 cycles
>   counter[0] delta: 28225 cycles
>   counter[1] delta: 3695 cycles
> 
> This counts cycles executed and instructions executed, and reads the two 
> counters out at the same time.

Isn't it two separate read() calls to read the two counters?  If so,
the only way the two values are actually going to correspond to the
same point in time is if the task being monitored is stopped.  In
which case the monitoring task needs to use ptrace or something
similar in order to make sure that the monitored task is actually
stopped.

If the monitored task is not stopped, then the interval between the
two reads will be sufficient to render the results useless -
particularly since the monitoring task could get preempted for an
arbitrary length of time between the two reads.  But even if it
doesn't, the hundreds of cycles between the two reads will introduce
considerable imprecision in the results.

There really is value in being able to read all the counters you're
using in one system call.

Paul.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 0/3] [Announcement] Performance Counters for Linux
  2008-12-05  7:16       ` Peter Zijlstra
@ 2008-12-05  7:57         ` Paul Mackerras
  2008-12-05  8:03           ` Peter Zijlstra
  0 siblings, 1 reply; 73+ messages in thread
From: Paul Mackerras @ 2008-12-05  7:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, LKML, linux-arch, Andrew Morton,
	Stephane Eranian, Eric Dumazet, Robert Richter,
	Arjan van de Veen, Peter Anvin, Steven Rostedt, David Miller

Peter Zijlstra writes:

> So, while most people would not consider two consecutive read() ops to
> be close or near the same time, due to preemption and such, that is
> taken away by the fact that the counters are task local time based - so
> preemption doesn't affect thing. Right?

I'm sorry, I don't follow the argument here.  What do you mean by
"task local time based"?

Paul.


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 0/3] [Announcement] Performance Counters for Linux
  2008-12-05  7:03     ` Ingo Molnar
  2008-12-05  7:16       ` Peter Zijlstra
@ 2008-12-05  7:57       ` David Miller
  2008-12-05  8:18         ` Ingo Molnar
  1 sibling, 1 reply; 73+ messages in thread
From: David Miller @ 2008-12-05  7:57 UTC (permalink / raw)
  To: mingo
  Cc: paulus, tglx, linux-kernel, linux-arch, akpm, eranian, dada1,
	robert.richter, arjan, hpa, a.p.zijlstra, rostedt

From: Ingo Molnar <mingo@elte.hu>
Date: Fri, 5 Dec 2008 08:03:29 +0100

> 
> * Ingo Molnar <mingo@elte.hu> wrote:
> 
> > This can be done in a very natural way with our abstraction, and the 
> > "hello.c" example happens to do exactly that:
> 
> multiple people pointed out that we have not posted hello.c :-/

Because it's completely not providing the facility.  This is not how
people want to use the performance counters at all.

And it doesn't even do what Paulus said is necessary, he said:

--------------------
> One thing that this sort of thing can't do is to get values from 
> multiple counters that correlate with each other.  For instance, we 
> would often want to count, say, L2 cache misses and instructions 
> completed at the same time, and be able to read both counters at very 
> close to the same time, so that we can measure average L2 cache misses 
> per instruction completed, which is useful.
--------------------

And if you read one counter then read the other as seperate operations,
you get extra events in there as a side effect of going back into
userspace between the two reads.

Nobody wants that, it's inaccurate and if you're looking for if one
event happens at all it's not only inaccurate it's useless if the
reads trigger that counter event.  Also, correlation has other
meanings.

What people want is blocking on overflow events, and a monitoring task
or thread doing all of the tricky control register management and task
inspection.

I mean look at some of the test cases and sample programs in the PAPI
and perfmon2 librarys, that stuff is extremely cool and this proposal
cannot do half of that stuff correctly.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 0/3] [Announcement] Performance Counters for Linux
  2008-12-05  7:57         ` Paul Mackerras
@ 2008-12-05  8:03           ` Peter Zijlstra
  2008-12-05  8:07             ` David Miller
  2008-12-05  9:16             ` Paul Mackerras
  0 siblings, 2 replies; 73+ messages in thread
From: Peter Zijlstra @ 2008-12-05  8:03 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Ingo Molnar, Thomas Gleixner, LKML, linux-arch, Andrew Morton,
	Stephane Eranian, Eric Dumazet, Robert Richter,
	Arjan van de Veen, Peter Anvin, Steven Rostedt, David Miller

On Fri, 2008-12-05 at 18:57 +1100, Paul Mackerras wrote:
> Peter Zijlstra writes:
> 
> > So, while most people would not consider two consecutive read() ops to
> > be close or near the same time, due to preemption and such, that is
> > taken away by the fact that the counters are task local time based - so
> > preemption doesn't affect thing. Right?
> 
> I'm sorry, I don't follow the argument here.  What do you mean by
> "task local time based"?

time only flows when the task is running.



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 0/3] [Announcement] Performance Counters for Linux
  2008-12-05  8:03           ` Peter Zijlstra
@ 2008-12-05  8:07             ` David Miller
  2008-12-05  8:11               ` Ingo Molnar
  2008-12-05 15:00               ` Arjan van de Ven
  2008-12-05  9:16             ` Paul Mackerras
  1 sibling, 2 replies; 73+ messages in thread
From: David Miller @ 2008-12-05  8:07 UTC (permalink / raw)
  To: a.p.zijlstra
  Cc: paulus, mingo, tglx, linux-kernel, linux-arch, akpm, eranian,
	dada1, robert.richter, arjan, hpa, rostedt

From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Fri, 05 Dec 2008 09:03:36 +0100

> On Fri, 2008-12-05 at 18:57 +1100, Paul Mackerras wrote:
> > Peter Zijlstra writes:
> > 
> > > So, while most people would not consider two consecutive read() ops to
> > > be close or near the same time, due to preemption and such, that is
> > > taken away by the fact that the counters are task local time based - so
> > > preemption doesn't affect thing. Right?
> > 
> > I'm sorry, I don't follow the argument here.  What do you mean by
> > "task local time based"?
> 
> time only flows when the task is running.

These things aren't measuring time, or even just cycles, they
are measuring things like L2 cache misses, cpu cycles, and
other similar kinds of events.

So these counters are going to measure all of the damn crap
assosciated with doing the read() call as well as the real work
the task does.

That's not useful to people.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 0/3] [Announcement] Performance Counters for Linux
  2008-12-05  7:54     ` Paul Mackerras
@ 2008-12-05  8:08       ` Ingo Molnar
  2008-12-05  8:15         ` David Miller
  2008-12-05  9:10         ` Paul Mackerras
  0 siblings, 2 replies; 73+ messages in thread
From: Ingo Molnar @ 2008-12-05  8:08 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Thomas Gleixner, LKML, linux-arch, Andrew Morton,
	Stephane Eranian, Eric Dumazet, Robert Richter,
	Arjan van de Veen, Peter Anvin, Peter Zijlstra, Steven Rostedt,
	David Miller


* Paul Mackerras <paulus@samba.org> wrote:

> Ingo Molnar writes:
> > 
> > * Paul Mackerras <paulus@samba.org> wrote:
> [snip]
> > > One thing that this sort of thing can't do is to get values from 
> > > multiple counters that correlate with each other.  For instance, we 
> > > would often want to count, say, L2 cache misses and instructions 
> > > completed at the same time, and be able to read both counters at very 
> > > close to the same time, so that we can measure average L2 cache misses 
> > > per instruction completed, which is useful.
> > 
> > This can be done in a very natural way with our abstraction, and the 
> > "hello.c" example happens to do exactly that:
> 
> Has hello.c been posted?  I can't find it in any of the posts from you 
> or Thomas.  Am I just being blind? :)

Sorry, was late at night when we did the release - monitor.c was posted - 
and i just posted hello.c it half an hour ago :)

> >   aldebaran:~/perf-counter-test> ./hello
> >   doing perf_counter_open() call:
> >   counter[0]... fd: 3.
> >   counter[1]... fd: 4.
> >   counter[0] delta: 10866 cycles
> >   counter[1] delta: 414 cycles
> >   counter[0] delta: 23640 cycles
> >   counter[1] delta: 3673 cycles
> >   counter[0] delta: 28225 cycles
> >   counter[1] delta: 3695 cycles
> > 
> > This counts cycles executed and instructions executed, and reads the two 
> > counters out at the same time.
> 
> Isn't it two separate read() calls to read the two counters?  If so, 
> the only way the two values are actually going to correspond to the 
> same point in time is if the task being monitored is stopped.  In which 
> case the monitoring task needs to use ptrace or something similar in 
> order to make sure that the monitored task is actually stopped.

It doesnt matter in practice.

Also, look at our code: we buffer notification events and do not have to 
stop the thread for recording the context information.

Also, if you _do_ care about getting immediate readouts, the _monitoring_ 
task can be set to higher priority. (not that i'd advocate it in general: 
any task stopping or scheduling can destroy a workload's true behavior)

> If the monitored task is not stopped, then the interval between the two 
> reads will be sufficient to render the results useless - particularly 
> since the monitoring task could get preempted for an arbitrary length 
> of time between the two reads.  But even if it doesn't, the hundreds of 
> cycles between the two reads will introduce considerable imprecision in 
> the results.

Even if the two read()s are done apart, stopping a task is _far_ more 
intrusive to the event flow of a single application. Most workloads are 
multithreaded - so stopping a task causes another task to be scheduled 
in, which would not have occured were the profiling more transparent and 
less intrusive.

Furthermore, even for the special case of single task monitoring, a 
context-switch is more expensive than a system call.

Furthermore, in most of the practical cases there's very few events 
happening between two read()s. The interval of profiling versus the 
interval between two reads()s is a couple of orders of magnitude.

This 'task has to be stopped' aspect is a red herring that has no 
technical basis.

> There really is value in being able to read all the counters you're 
> using in one system call.

It's possible with our code too: what you are asking for is in essence a 
sys_read_fds() system call extension - a bit like readv(), but from a 
vector of separate fds.

Such kind of 'group system call facility' has been suggested several 
times in the past - but ... never got anywhere because system calls are 
cheap enough, it really does not count in practice.

It could be implemented, and note that because our code uses a proper 
Linux file descriptor abstraction, such a sys_read_fds() facility would 
help _other_ applications as well, not just performance counters.

But it brings complications: demultiplexing of error conditions on 
individual counters is a real pain with any compound abstraction. We very 
consciously went with the 'one fd, one object, one counter' design.

	Ingo

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 0/3] [Announcement] Performance Counters for Linux
  2008-12-05  8:07             ` David Miller
@ 2008-12-05  8:11               ` Ingo Molnar
  2008-12-05  8:17                 ` David Miller
  2008-12-05 15:00               ` Arjan van de Ven
  1 sibling, 1 reply; 73+ messages in thread
From: Ingo Molnar @ 2008-12-05  8:11 UTC (permalink / raw)
  To: David Miller
  Cc: a.p.zijlstra, paulus, tglx, linux-kernel, linux-arch, akpm,
	eranian, dada1, robert.richter, arjan, hpa, rostedt


* David Miller <davem@davemloft.net> wrote:

> From: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Date: Fri, 05 Dec 2008 09:03:36 +0100
> 
> > On Fri, 2008-12-05 at 18:57 +1100, Paul Mackerras wrote:
> > > Peter Zijlstra writes:
> > > 
> > > > So, while most people would not consider two consecutive read() ops to
> > > > be close or near the same time, due to preemption and such, that is
> > > > taken away by the fact that the counters are task local time based - so
> > > > preemption doesn't affect thing. Right?
> > > 
> > > I'm sorry, I don't follow the argument here.  What do you mean by
> > > "task local time based"?
> > 
> > time only flows when the task is running.
> 
> These things aren't measuring time, or even just cycles, they are 
> measuring things like L2 cache misses, cpu cycles, and other similar 
> kinds of events.
> 
> So these counters are going to measure all of the damn crap assosciated 
> with doing the read() call as well as the real work the task does.

that's wrong, look at the example we posted - see it pasted below.

When monitoring another task it does _not_ count the read() done in the 
monitoring task, it does _not_ include it in the event count. It is a 
fundamental property of our code to be as unintrusive as possible. It 
only measures the work done by that task.

( You _can_ measure your own overhead of course too, if you want to. It's 
  a natural special-case of our performance counter abstraction. )

	Ingo

---

/*
 * Performance counters monitoring test case
 */
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <unistd.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <getopt.h>
#include <fcntl.h>
#include <stdio.h>
#include <errno.h>

#define __user

#include "sys.h"

static int count = 10000;
static int eventid;
static int tid;
static char *debuginfo;

static void display_help(void)
{
	printf("monitor\n");
	printf("Usage:\n"
	       "monitor options threadid\n\n"
	       "-e EID   --eventid=EID  eventid\n"
	       "-c CNT   --count=CNT    event count on which IP is sampled\n"
	       "-d FILE  --debug=FILE   path to binary file with debug info\n");
	exit(0);
}

static void process_options (int argc, char *argv[])
{
	int error = 0;

	for (;;) {
		int option_index = 0;
		/** Options for getopt */
		static struct option long_options[] = {
			{"count", required_argument, NULL, 'c'},
			{"debug", required_argument, NULL, 'd'},
			{"eventid", required_argument, NULL, 'e'},
			{"help", no_argument, NULL, 'h'},
			{NULL, 0, NULL, 0}
		};
		int c = getopt_long(argc, argv, "c:d:e:",
				    long_options, &option_index);
		if (c == -1)
			break;
		switch (c) {
		case 'c': count = atoi(optarg); break;
		case 'd': debuginfo = strdup(optarg); break;
		case 'e': eventid = atoi(optarg); break;
		default: error = 1; break;
		}
	}
	if (error || optind == argc)
		display_help ();

	tid = atoi(argv[optind]);
}

int main(int argc, char *argv[])
{
	char str[256];
	uint64_t ip;
	ssize_t res;
	int fd;

	process_options(argc, argv);

	fd = perf_counter_open(eventid, count, 1, tid, -1);
	if (fd < 0) {
		perror("Create counter");
		exit(-1);
	}

	while (1) {
		res = read(fd, (char *) &ip, sizeof(ip));
		if (res != sizeof(ip)) {
			perror("Read counter");
			break;
		}

		if (!debuginfo) {
			printf("IP: 0x%016llx\n", (unsigned long long)ip);
		} else {
			sprintf(str, "addr2line -e %s 0x%llx\n", debuginfo,
				(unsigned long long)ip);
			system(str);
		}
	}

	close(fd);
	exit(0);
}



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 0/3] [Announcement] Performance Counters for Linux
  2008-12-05  8:08       ` Ingo Molnar
@ 2008-12-05  8:15         ` David Miller
  2008-12-05 13:25           ` Ingo Molnar
  2008-12-05  9:10         ` Paul Mackerras
  1 sibling, 1 reply; 73+ messages in thread
From: David Miller @ 2008-12-05  8:15 UTC (permalink / raw)
  To: mingo
  Cc: paulus, tglx, linux-kernel, linux-arch, akpm, eranian, dada1,
	robert.richter, arjan, hpa, a.p.zijlstra, rostedt

From: Ingo Molnar <mingo@elte.hu>
Date: Fri, 5 Dec 2008 09:08:13 +0100

> 
> * Paul Mackerras <paulus@samba.org> wrote:
> 
> > Ingo Molnar writes:
> > > 
> > > * Paul Mackerras <paulus@samba.org> wrote:
> > [snip]
> > > > One thing that this sort of thing can't do is to get values from 
> > > > multiple counters that correlate with each other.  For instance, we 
> > > > would often want to count, say, L2 cache misses and instructions 
> > > > completed at the same time, and be able to read both counters at very 
> > > > close to the same time, so that we can measure average L2 cache misses 
> > > > per instruction completed, which is useful.
> > > 
> > > This can be done in a very natural way with our abstraction, and the 
> > > "hello.c" example happens to do exactly that:
> > 
> > Has hello.c been posted?  I can't find it in any of the posts from you 
> > or Thomas.  Am I just being blind? :)
> 
> Sorry, was late at night when we did the release - monitor.c was posted - 
> and i just posted hello.c it half an hour ago :)
> 
> > >   aldebaran:~/perf-counter-test> ./hello
> > >   doing perf_counter_open() call:
> > >   counter[0]... fd: 3.
> > >   counter[1]... fd: 4.
> > >   counter[0] delta: 10866 cycles
> > >   counter[1] delta: 414 cycles
> > >   counter[0] delta: 23640 cycles
> > >   counter[1] delta: 3673 cycles
> > >   counter[0] delta: 28225 cycles
> > >   counter[1] delta: 3695 cycles
> > > 
> > > This counts cycles executed and instructions executed, and reads the two 
> > > counters out at the same time.
> > 
> > Isn't it two separate read() calls to read the two counters?  If so, 
> > the only way the two values are actually going to correspond to the 
> > same point in time is if the task being monitored is stopped.  In which 
> > case the monitoring task needs to use ptrace or something similar in 
> > order to make sure that the monitored task is actually stopped.
> 
> It doesnt matter in practice.

Yes it DOES!

If I want to know if a code block triggers event X or Y, and your read
call triggers one of those events, I can't figure out the answer to my
profiling problem.

That is completely fundamental to all of this.  And this is why this
proposal is a non-workable solution.


> Also, look at our code: we buffer notification events and do not have to 
> stop the thread for recording the context information.

But that's what monitoring libraries want, they want to stop the task
and inspect it.

Look at the PAPI library.  If you can't implement what that thing
provides, all the real users of profiling information can't use
this stuff.

> Even if the two read()s are done apart, stopping a task is _far_ more 
> intrusive to the event flow of a single application.

I really don't think you get the use case for these kinds of
facilities.

Once again I encourage you to look at the test programs, test cases,
and wonderful documentation provided with the PAPI and perfmon2
library bits.  That's how people want to use this stuff.  Ignore
at your own peril :-)


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 0/3] [Announcement] Performance Counters for Linux
  2008-12-05  8:11               ` Ingo Molnar
@ 2008-12-05  8:17                 ` David Miller
  2008-12-05  8:24                   ` Ingo Molnar
  0 siblings, 1 reply; 73+ messages in thread
From: David Miller @ 2008-12-05  8:17 UTC (permalink / raw)
  To: mingo
  Cc: a.p.zijlstra, paulus, tglx, linux-kernel, linux-arch, akpm,
	eranian, dada1, robert.richter, arjan, hpa, rostedt

From: Ingo Molnar <mingo@elte.hu>
Date: Fri, 5 Dec 2008 09:11:37 +0100

> 
> * David Miller <davem@davemloft.net> wrote:
> 
> > From: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > Date: Fri, 05 Dec 2008 09:03:36 +0100
> > 
> > > On Fri, 2008-12-05 at 18:57 +1100, Paul Mackerras wrote:
> > > > Peter Zijlstra writes:
> > > > 
> > > > > So, while most people would not consider two consecutive read() ops to
> > > > > be close or near the same time, due to preemption and such, that is
> > > > > taken away by the fact that the counters are task local time based - so
> > > > > preemption doesn't affect thing. Right?
> > > > 
> > > > I'm sorry, I don't follow the argument here.  What do you mean by
> > > > "task local time based"?
> > > 
> > > time only flows when the task is running.
> > 
> > These things aren't measuring time, or even just cycles, they are 
> > measuring things like L2 cache misses, cpu cycles, and other similar 
> > kinds of events.
> > 
> > So these counters are going to measure all of the damn crap assosciated 
> > with doing the read() call as well as the real work the task does.
> 
> that's wrong, look at the example we posted - see it pasted below.

It's still too simple to be useful.

There are so many aspects other than the immediate PC that monitoring
tasks want to inspect when a counter overflows.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 0/3] [Announcement] Performance Counters for Linux
  2008-12-05  7:57       ` David Miller
@ 2008-12-05  8:18         ` Ingo Molnar
  2008-12-05  8:20           ` David Miller
  0 siblings, 1 reply; 73+ messages in thread
From: Ingo Molnar @ 2008-12-05  8:18 UTC (permalink / raw)
  To: David Miller
  Cc: paulus, tglx, linux-kernel, linux-arch, akpm, eranian, dada1,
	robert.richter, arjan, hpa, a.p.zijlstra, rostedt


* David Miller <davem@davemloft.net> wrote:

> From: Ingo Molnar <mingo@elte.hu>
> Date: Fri, 5 Dec 2008 08:03:29 +0100
> 
> > 
> > * Ingo Molnar <mingo@elte.hu> wrote:
> > 
> > > This can be done in a very natural way with our abstraction, and the 
> > > "hello.c" example happens to do exactly that:
> > 
> > multiple people pointed out that we have not posted hello.c :-/
> 
> Because it's completely not providing the facility.  This is not how
> people want to use the performance counters at all.
> 
> And it doesn't even do what Paulus said is necessary, he said:
> 
> --------------------
> > One thing that this sort of thing can't do is to get values from 
> > multiple counters that correlate with each other.  For instance, we 
> > would often want to count, say, L2 cache misses and instructions 
> > completed at the same time, and be able to read both counters at very 
> > close to the same time, so that we can measure average L2 cache misses 
> > per instruction completed, which is useful.
> --------------------
> 
> And if you read one counter then read the other as seperate operations, 
> you get extra events in there as a side effect of going back into 
> userspace between the two reads.

that's wrong. If you _want_ to measure in a different context, with as 
little measurement impact as possible, you can do it with our code. The 
announcement provides the example for that.

For example, i just started this bash infinite loop:

  $ while :; do :; done &
  [1] 1877

  $ ./monitor -e 1 -c 1000000000 1877
  IP: 0x00000031a2e70d4b
  IP: 0x0000000000455f64
  IP: 0x00000031a2f028a0
  IP: 0x0000000000440692
  IP: 0x0000000000441b8e
  IP: 0x00000031a2e6f630
  IP: 0x0000000000446129
  IP: 0x00000031a2e6edbc
  IP: 0x0000000000443736
  IP: 0x0000000000441c80
  IP: 0x000000000043913a
  ^C

We get IP readouts every 1 billion instructions executed in that shell. 
That shell is never stopped or otherwise intruded - it's kept as an as 
pristine of an execution environment as possible.

Furthermore, the event readouts strictly only include event counts of the 
shell PID, _not_ of the monitor context's read() or other activities.

> Nobody wants that, [...]

Nobody wants that and we dont do it.

Really, you should take a more serious look at our code.

	Ingo

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 0/3] [Announcement] Performance Counters for Linux
  2008-12-05  8:18         ` Ingo Molnar
@ 2008-12-05  8:20           ` David Miller
  0 siblings, 0 replies; 73+ messages in thread
From: David Miller @ 2008-12-05  8:20 UTC (permalink / raw)
  To: mingo
  Cc: paulus, tglx, linux-kernel, linux-arch, akpm, eranian, dada1,
	robert.richter, arjan, hpa, a.p.zijlstra, rostedt

From: Ingo Molnar <mingo@elte.hu>
Date: Fri, 5 Dec 2008 09:18:38 +0100

> Really, you should take a more serious look at our code.

People don't want code, they want a usable port of the PAPI libraries
for profiling.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 0/3] [Announcement] Performance Counters for Linux
  2008-12-05  8:17                 ` David Miller
@ 2008-12-05  8:24                   ` Ingo Molnar
  2008-12-05  8:27                     ` David Miller
  0 siblings, 1 reply; 73+ messages in thread
From: Ingo Molnar @ 2008-12-05  8:24 UTC (permalink / raw)
  To: David Miller
  Cc: a.p.zijlstra, paulus, tglx, linux-kernel, linux-arch, akpm,
	eranian, dada1, robert.richter, arjan, hpa, rostedt


* David Miller <davem@davemloft.net> wrote:

> > > These things aren't measuring time, or even just cycles, they are 
> > > measuring things like L2 cache misses, cpu cycles, and other 
> > > similar kinds of events.
> > > 
> > > So these counters are going to measure all of the damn crap 
> > > assosciated with doing the read() call as well as the real work the 
> > > task does.
> > 
> > that's wrong, look at the example we posted - see it pasted below.
> 
> It's still too simple to be useful.
> 
> There are so many aspects other than the immediate PC that monitoring 
> tasks want to inspect when a counter overflows.

fully agreed.

While most of the flat profilers like oprofile will be happy with the PC 
alone, i do think we want a couple of extended notification types.

Right now we begun with the most trivial ones:

  enum perf_record_type {
          PERF_RECORD_SIMPLE,
          PERF_RECORD_IRQ,
  };

... but it would be natural to do a PERF_RECORD_GP_REGISTERS as well. 
Perhaps even a PERF_RECORD_STACKTRACE using the sysprof facilities, to do 
a hierarchic multi-dimension profile that sysprof does so nicely.

Note that the record type is an independent attribute of a counter. It 
can be set regardless of the even type - and it can be set independently 
for each counter. So you can have say 3 'simple' counters with no irqs 
plus one 'all registers' counter which generates an IRQ: and then you can 
read out the simple counters at the same type.

We could also perhaps do a PERF_RECORD_ALL: it represents a snapshot of 
all active counter values in the task. This is _far_ better than forcibly 
scheduling the monitored task.

	Ingo

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 0/3] [Announcement] Performance Counters for Linux
  2008-12-05  8:24                   ` Ingo Molnar
@ 2008-12-05  8:27                     ` David Miller
  2008-12-05  8:42                       ` Ingo Molnar
  0 siblings, 1 reply; 73+ messages in thread
From: David Miller @ 2008-12-05  8:27 UTC (permalink / raw)
  To: mingo
  Cc: a.p.zijlstra, paulus, tglx, linux-kernel, linux-arch, akpm,
	eranian, dada1, robert.richter, arjan, hpa, rostedt

From: Ingo Molnar <mingo@elte.hu>
Date: Fri, 5 Dec 2008 09:24:31 +0100

> Right now we begun with the most trivial ones:
> 
>   enum perf_record_type {
>           PERF_RECORD_SIMPLE,
>           PERF_RECORD_IRQ,
>   };
> 
> ... but it would be natural to do a PERF_RECORD_GP_REGISTERS as well. 
> Perhaps even a PERF_RECORD_STACKTRACE using the sysprof facilities, to do 
> a hierarchic multi-dimension profile that sysprof does so nicely.

Maybe even add something like PERF_RECORD_THE_MOON...

see how rediculious this is?

It's not your business in the kernel to decide what things are
useful.  The monitor can stop the task and inspect whatever
it wants with _existing_ facilities.  We need none of this stuff.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 0/3] [Announcement] Performance Counters for Linux
  2008-12-05  8:27                     ` David Miller
@ 2008-12-05  8:42                       ` Ingo Molnar
  2008-12-05  8:49                         ` David Miller
  2008-12-05 12:39                         ` Andi Kleen
  0 siblings, 2 replies; 73+ messages in thread
From: Ingo Molnar @ 2008-12-05  8:42 UTC (permalink / raw)
  To: David Miller
  Cc: a.p.zijlstra, paulus, tglx, linux-kernel, linux-arch, akpm,
	eranian, dada1, robert.richter, arjan, hpa, rostedt


* David Miller <davem@davemloft.net> wrote:

> From: Ingo Molnar <mingo@elte.hu>
> Date: Fri, 5 Dec 2008 09:24:31 +0100
> 
> > Right now we begun with the most trivial ones:
> > 
> >   enum perf_record_type {
> >           PERF_RECORD_SIMPLE,
> >           PERF_RECORD_IRQ,
> >   };
> > 
> > ... but it would be natural to do a PERF_RECORD_GP_REGISTERS as well. 
> > Perhaps even a PERF_RECORD_STACKTRACE using the sysprof facilities, to do 
> > a hierarchic multi-dimension profile that sysprof does so nicely.
> 
> Maybe even add something like PERF_RECORD_THE_MOON...
> 
> see how rediculious this is?

Note that more notification record types is actually where latest 
hardware is going: for example in Nehalem there's a PEBS notification 
record type that has cachemiss latency included in the record. I.e. we 
can get profiles with _cachemiss latency_ included (as measured from 
issuing the instruction to completion).

You cannot get that information out of any 'stop the task' interface ...

Stopping a task is way too intrusive, i dont know why you keep harping on 
it. Listen to the scheduler guys: it's a non-starter.

> It's not your business in the kernel to decide what things are useful.  
> The monitor can stop the task and inspect whatever it wants with 
> _existing_ facilities.  We need none of this stuff.

You try to ridicule our efforts, while you have not answered our 
technical arguments in substance.

Please let me repeat: it's a _fundamental_ thesis of performance 
instrumentation to not disturb the monitored context. Your insistence on 
_stopping_ the monitored task breaks that fundamental axiom!

Stopping a task destroys the characteristics of many, many workloads. To 
get a reasonable histogram out of a system a highlevel event count of 
thousands a second is desired (but hundreds of them are a minimum, to get 
any reasonable coverage).

But injecting even hundreds of artificialy task-stoppages will destroy 
the true behavior of many reference workloads we care about in Linux!

Stopping the task is a fundamental and obvious design failure of perfmon.

	Ingo

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 0/3] [Announcement] Performance Counters for Linux
  2008-12-05  8:42                       ` Ingo Molnar
@ 2008-12-05  8:49                         ` David Miller
  2008-12-05 12:13                           ` Ingo Molnar
  2008-12-05 12:39                         ` Andi Kleen
  1 sibling, 1 reply; 73+ messages in thread
From: David Miller @ 2008-12-05  8:49 UTC (permalink / raw)
  To: mingo
  Cc: a.p.zijlstra, paulus, tglx, linux-kernel, linux-arch, akpm,
	eranian, dada1, robert.richter, arjan, hpa, rostedt

From: Ingo Molnar <mingo@elte.hu>
Date: Fri, 5 Dec 2008 09:42:33 +0100

> Please let me repeat: it's a _fundamental_ thesis of performance 
> instrumentation to not disturb the monitored context. Your insistence on 
> _stopping_ the monitored task breaks that fundamental axiom!

This is only a problem if you make your measurement quantums too
small.

Furthermore, there are multiple registers and states to update
atomically when a perf counter overflows.  You're read/write thing
just doesn't cut it, especially for certain kinds of hardware.

It's really a utopian view of the world. :)


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 0/3] [Announcement] Performance Counters for Linux
  2008-12-05  8:08       ` Ingo Molnar
  2008-12-05  8:15         ` David Miller
@ 2008-12-05  9:10         ` Paul Mackerras
  2008-12-05 12:07           ` Ingo Molnar
  1 sibling, 1 reply; 73+ messages in thread
From: Paul Mackerras @ 2008-12-05  9:10 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thomas Gleixner, LKML, linux-arch, Andrew Morton,
	Stephane Eranian, Eric Dumazet, Robert Richter,
	Arjan van de Veen, Peter Anvin, Peter Zijlstra, Steven Rostedt,
	David Miller

Ingo Molnar writes:

> * Paul Mackerras <paulus@samba.org> wrote:
[...]
> > Isn't it two separate read() calls to read the two counters?  If so, 
> > the only way the two values are actually going to correspond to the 
> > same point in time is if the task being monitored is stopped.  In which 
> > case the monitoring task needs to use ptrace or something similar in 
> > order to make sure that the monitored task is actually stopped.
> 
> It doesnt matter in practice.

Can I ask - and this is a real question, I'm not being sarcastic - is
that statement made with substantial serious experience in performance
analysis behind it, or is it just an intuition?

I will happily admit that I am not a great expert on performance
analysis with years of experience.  But I have taken a bit of a look
at what people with that sort of experience do, and I don't think they
would agree with your "doesn't matter" statement.

> Such kind of 'group system call facility' has been suggested several 
> times in the past - but ... never got anywhere because system calls are 
> cheap enough, it really does not count in practice.
> 
> It could be implemented, and note that because our code uses a proper 
> Linux file descriptor abstraction, such a sys_read_fds() facility would 
> help _other_ applications as well, not just performance counters.
> 
> But it brings complications: demultiplexing of error conditions on 
> individual counters is a real pain with any compound abstraction. We very 
> consciously went with the 'one fd, one object, one counter' design.

And I think that is the fundamental flaw.  On the machines I am
familiar with, the performance counters as not separate things that
can individually and independently be assigned to count one thing or
another.

Rather, what the hardware provides is ONE performance monitor unit,
which the OS can context-switch between tasks.  The performance
monitor unit has several counters that can be assigned (within limits)
to count various aspects of the performance of the code being
executed.  That is why, for instance, if you ask for the counters to
be frozen when one of them overflows, they all get frozen at that
point.

And that's how the hardware is designed because that's how the people
that do performance analysis want to do their measurements.  This idea
of splitting things up into separate counters that look independent
but aren't is just going to cause unnecessary complications and
difficulties.

Paul.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 0/3] [Announcement] Performance Counters for Linux
  2008-12-05  8:03           ` Peter Zijlstra
  2008-12-05  8:07             ` David Miller
@ 2008-12-05  9:16             ` Paul Mackerras
  1 sibling, 0 replies; 73+ messages in thread
From: Paul Mackerras @ 2008-12-05  9:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, LKML, linux-arch, Andrew Morton,
	Stephane Eranian, Eric Dumazet, Robert Richter,
	Arjan van de Veen, Peter Anvin, Steven Rostedt, David Miller

Peter Zijlstra writes:

> On Fri, 2008-12-05 at 18:57 +1100, Paul Mackerras wrote:
> > Peter Zijlstra writes:
> > 
> > > So, while most people would not consider two consecutive read() ops to
> > > be close or near the same time, due to preemption and such, that is
> > > taken away by the fact that the counters are task local time based - so
> > > preemption doesn't affect thing. Right?
> > 
> > I'm sorry, I don't follow the argument here.  What do you mean by
> > "task local time based"?
> 
> time only flows when the task is running.

Right - but the monitored task is running while the monitoring task is
running.  So time is flowing for the monitored task between the two
reads done by the monitoring task, meaning that you can't actually
relate the two values you read with any precision.

Paul.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 0/3] [Announcement] Performance Counters for Linux
  2008-12-05  6:10   ` Ingo Molnar
  2008-12-05  7:50     ` David Miller
@ 2008-12-05  9:34     ` Paul Mackerras
  2008-12-05 10:41       ` Ingo Molnar
  2008-12-05 10:05     ` Ingo Molnar
  2 siblings, 1 reply; 73+ messages in thread
From: Paul Mackerras @ 2008-12-05  9:34 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David Miller, tglx, linux-kernel, linux-arch, akpm, eranian,
	dada1, robert.richter, arjan, hpa, a.p.zijlstra, rostedt

Ingo Molnar writes:

> The 'target' task does not have to be stopped to offer counter 
> virtualization (counter overcommit or counter scheduling) - or to offer 
> any of the other performance counter features. Please let us know why it 
> needs the task to be stopped - we asked about that on lkml in the perfmon 
> thread and no technical answer was given, and couldnt find any such 
> technical reason while implementing it ourselves.

I like this feature of your patchset, in fact, and the code looks
pretty clean (as I would expect :).  What I don't like (as I have
already said) is having to use an API that splits up the PMU into
pieces, plus the requirement that flows from that to have the kernel
know about the event selection logic on every CPU model we support.

One thing I haven't figured out yet is what happens if you have a
counter on a task and the task dies.  Can I still use the counter fd
after the task has died, and read out the total count?

Paul.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 0/3] [Announcement] Performance Counters for Linux
  2008-12-05  6:10   ` Ingo Molnar
  2008-12-05  7:50     ` David Miller
  2008-12-05  9:34     ` Paul Mackerras
@ 2008-12-05 10:05     ` Ingo Molnar
  2 siblings, 0 replies; 73+ messages in thread
From: Ingo Molnar @ 2008-12-05 10:05 UTC (permalink / raw)
  To: David Miller
  Cc: tglx, linux-kernel, linux-arch, akpm, eranian, dada1,
	robert.richter, arjan, hpa, a.p.zijlstra, rostedt, paulus


* Ingo Molnar <mingo@elte.hu> wrote:

> > >  - No interaction with ptrace: any task (with sufficient permissions) can
> > >    monitor other tasks, without having to stop that task.
> > 
> > This isn't going to work.
> >
> > If you look at the things the perfmon libraries do, you do need to stop 
> > the task.
> >
> > Consider counter virtualization as the most direct example. [...]
> 
> Note that counter virtualization is not offered in the perfmon3 patchset that has 
> been posted to lkml. (It is part of the much larger 'full' perfmon patchset which 
> has not been submitted for integration)
> 
> Nevertheless we will offer counter virtualization in -v2 of our patchset [...]

i've just implemented it. Running an (infinite-loop) hello.c with 6 counters on a 
CPU that has only two counters now gives the expected:

 counter[0 cycles              ]:           3368245084 , delta: 842019470 events
 counter[1 instructions        ]:           1384678210 , delta: 346108294 events
 counter[2 cache-refs          ]:                  659 , delta: 150 events
 counter[3 cache-misses        ]:                    0 
 counter[4 branch-instructions ]:            266919398 , delta: 66731508 events
 counter[5 branch-misses       ]:                 1201 , delta: 315 events

This will be in -v2.

	Ingo

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 0/3] [Announcement] Performance Counters for Linux
  2008-12-05  9:34     ` Paul Mackerras
@ 2008-12-05 10:41       ` Ingo Molnar
  0 siblings, 0 replies; 73+ messages in thread
From: Ingo Molnar @ 2008-12-05 10:41 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: David Miller, tglx, linux-kernel, linux-arch, akpm, eranian,
	dada1, robert.richter, arjan, hpa, a.p.zijlstra, rostedt


* Paul Mackerras <paulus@samba.org> wrote:

> Ingo Molnar writes:
> 
> > The 'target' task does not have to be stopped to offer counter 
> > virtualization (counter overcommit or counter scheduling) - or to offer 
> > any of the other performance counter features. Please let us know why it 
> > needs the task to be stopped - we asked about that on lkml in the perfmon 
> > thread and no technical answer was given, and couldnt find any such 
> > technical reason while implementing it ourselves.
> 
> I like this feature of your patchset, in fact, and the code looks 
> pretty clean (as I would expect :).  What I don't like (as I have 
> already said) is having to use an API that splits up the PMU into 
> pieces, plus the requirement that flows from that to have the kernel 
> know about the event selection logic on every CPU model we support.
> 
> One thing I haven't figured out yet is what happens if you have a 
> counter on a task and the task dies.  Can I still use the counter fd 
> after the task has died, and read out the total count?

yes, it will work just the way you'd expect it to work: the counter is 
attached to the fd of the monitoring task, so it does not go away. The 
counter simply stops counting but otherwise can be read even after the 
monitored task has exited.

We are also planning a natural 'the task has died' notification: a -EPIPE 
returned by read(), after the final count has been allowed to be read 
out. With blocking counters this will behave quite smoothly: instead of 
blocking indefinitely, we'd get back -EPIPE. Hm?

	Ingo

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 1/3] performance counters: core code
  2008-12-04 23:44 ` [patch 1/3] performance counters: core code Thomas Gleixner
@ 2008-12-05 10:55   ` Paul Mackerras
  2008-12-05 11:20     ` Ingo Molnar
  0 siblings, 1 reply; 73+ messages in thread
From: Paul Mackerras @ 2008-12-05 10:55 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, linux-arch, Andrew Morton, Ingo Molnar, Stephane Eranian,
	Eric Dumazet, Robert Richter, Arjan van de Veen, Peter Anvin,
	Peter Zijlstra, Steven Rostedt, David Miller

Thomas Gleixner writes:

> +static void
> +perf_install_in_context(struct perf_counter_context *ctx,
> +			struct perf_counter *counter,
> +			int cpu)
> +{
> +	struct task_struct *task = ctx->task;
> +

[...]

> +	if (task) {
> +		task_oncpu_function_call(task, __perf_install_in_context,
> +					 counter);
> +	} else {
> +		smp_call_function_single(cpu, __perf_install_in_context,
> +					 counter, 1);
> +	}

What happens if we send an IPI to the cpu where the task is running,
but by the time the IPI arrives, the task has been migrated to another
cpu and is now running there?  Do you chase after it and send another
IPI to its new cpu, or is there some reason why it can't migrate?
If it's the former, where is that code?  I haven't seen it so far (at
least, task_oncpu_function_call doesn't seem to do it).

Paul.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 1/3] performance counters: core code
  2008-12-05 10:55   ` Paul Mackerras
@ 2008-12-05 11:20     ` Ingo Molnar
  0 siblings, 0 replies; 73+ messages in thread
From: Ingo Molnar @ 2008-12-05 11:20 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Thomas Gleixner, LKML, linux-arch, Andrew Morton,
	Stephane Eranian, Eric Dumazet, Robert Richter,
	Arjan van de Veen, Peter Anvin, Peter Zijlstra, Steven Rostedt,
	David Miller


* Paul Mackerras <paulus@samba.org> wrote:

> Thomas Gleixner writes:
> 
> > +static void
> > +perf_install_in_context(struct perf_counter_context *ctx,
> > +			struct perf_counter *counter,
> > +			int cpu)
> > +{
> > +	struct task_struct *task = ctx->task;
> > +
> 
> [...]
> 
> > +	if (task) {
> > +		task_oncpu_function_call(task, __perf_install_in_context,
> > +					 counter);
> > +	} else {
> > +		smp_call_function_single(cpu, __perf_install_in_context,
> > +					 counter, 1);
> > +	}
> 
> What happens if we send an IPI to the cpu where the task is running, 
> but by the time the IPI arrives, the task has been migrated to another 
> cpu and is now running there?  Do you chase after it and send another 
> IPI to its new cpu, or is there some reason why it can't migrate? If 
> it's the former, where is that code?  I haven't seen it so far (at 
> least, task_oncpu_function_call doesn't seem to do it).

in that case the schedule-in method will install the perf counter 
automatically. No need to chase after it. The smp call is only for the 
case where the task does not schedule at all.

	Ingo

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 0/3] [Announcement] Performance Counters for Linux
  2008-12-05  9:10         ` Paul Mackerras
@ 2008-12-05 12:07           ` Ingo Molnar
  2008-12-06  0:05             ` Paul Mackerras
  0 siblings, 1 reply; 73+ messages in thread
From: Ingo Molnar @ 2008-12-05 12:07 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Thomas Gleixner, LKML, linux-arch, Andrew Morton,
	Stephane Eranian, Eric Dumazet, Robert Richter,
	Arjan van de Veen, Peter Anvin, Peter Zijlstra, Steven Rostedt,
	David Miller


* Paul Mackerras <paulus@samba.org> wrote:

> Ingo Molnar writes:
> 
> > * Paul Mackerras <paulus@samba.org> wrote:
> [...]
> > > Isn't it two separate read() calls to read the two counters?  If so, 
> > > the only way the two values are actually going to correspond to the 
> > > same point in time is if the task being monitored is stopped.  In which 
> > > case the monitoring task needs to use ptrace or something similar in 
> > > order to make sure that the monitored task is actually stopped.
> > 
> > It doesnt matter in practice.
> 
> Can I ask - and this is a real question, I'm not being sarcastic - is 
> that statement made with substantial serious experience in performance 
> analysis behind it, or is it just an intuition?
> 
> I will happily admit that I am not a great expert on performance 
> analysis with years of experience.  But I have taken a bit of a look at 
> what people with that sort of experience do, and I don't think they 
> would agree with your "doesn't matter" statement.

A stream of read()s possibly slightly being off is an order of magnitude 
smaller of an effect to precision. Look at the numbers: on the testbox i 
have a read() syscall takes 0.2 microseconds, while a context-switch 
takes 2 microseconds on the local CPU and about 5-10 microseconds 
cross-CPU (or more, if the cache pattern is unlucky/unaffine). That's 
10-25-50 times more expensive. You can do 9-24-49 reads and still be 
cheaper. Compound syscalls are almost never worth the complexity.

So as a scheduler person i cannot really take the perfmon "ptrace 
approach" seriously, and i explained that in great detail already. It 
clearly came from HPC workload quarters where tasks are persistent 
entities running alone on a single CPU that just use up CPU time there 
and dont interact with each other too much. That's a good and important 
profiling target for sure - but by no means the only workload target to 
design a core kernel facility for. It's an absolutely horrible approach 
for a number of more common workloads for sure.

> > Such kind of 'group system call facility' has been suggested several 
> > times in the past - but ... never got anywhere because system calls 
> > are cheap enough, it really does not count in practice.
> > 
> > It could be implemented, and note that because our code uses a proper 
> > Linux file descriptor abstraction, such a sys_read_fds() facility 
> > would help _other_ applications as well, not just performance 
> > counters.
> > 
> > But it brings complications: demultiplexing of error conditions on 
> > individual counters is a real pain with any compound abstraction. We 
> > very consciously went with the 'one fd, one object, one counter' 
> > design.
> 
> And I think that is the fundamental flaw.  On the machines I am 
> familiar with, the performance counters as not separate things that can 
> individually and independently be assigned to count one thing or 
> another.

Today we've implemented virtual counter scheduling in our to-be-v2 code:

   3 files changed, 36 insertions(+), 1 deletion(-)

hello.c gives:

 counter[0 cycles              ]:  10121258163 , delta:    844256826 events
 counter[1 instructions        ]:   4160893621 , delta:    347054666 events
 counter[2 cache-refs          ]:         2297 , delta:          179 events
 counter[3 cache-misses        ]:            3 , delta:            0 events
 counter[4 branch-instructions ]:    799422166 , delta:     66551572 events
 counter[5 branch-misses       ]:         7286 , delta:          775 events

All we need to get that array of information from 6 sw counters is a 
_single_ hardware counter. I'm not sure where you read "you must map sw 
counters to hw counters directly" or "hw counters must be independent of 
each other" into our design - it's not part of it, emphatically.

And i dont see your (fully correct!) statement above about counter 
constraints to be in any sort of conflict with what we are doing.

Intel hardware is just as constrained as powerpc hardware: there are 
counter inter-dependencies and many CPUs have just two performance 
counters. We very much took this into account while designing this code.

[ Obviously, you _can_ do higher quality profiling if you have more 
  hardware resources that help it. Nothing will change that fact. ]

> Rather, what the hardware provides is ONE performance monitor unit, 
> which the OS can context-switch between tasks.  The performance monitor 
> unit has several counters that can be assigned (within limits) to count 
> various aspects of the performance of the code being executed.  That is 
> why, for instance, if you ask for the counters to be frozen when one of 
> them overflows, they all get frozen at that point.

i dont see this as an issue at all - it's a feature of powerpc over x86 
that the core perfcounter code can support just fine. The overflow IRQ 
handler is arch specific. The overflow IRQ handler, if it triggers, 
updates the sw counters, creates any event records if needed, wakes up 
the monitor task if needed, and continues the task and performance 
measurement without having scheduled out. Demultiplexing of hw counters 
is arch-specific.

	Ingo

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 0/3] [Announcement] Performance Counters for Linux
  2008-12-05  8:49                         ` David Miller
@ 2008-12-05 12:13                           ` Ingo Molnar
  0 siblings, 0 replies; 73+ messages in thread
From: Ingo Molnar @ 2008-12-05 12:13 UTC (permalink / raw)
  To: David Miller
  Cc: a.p.zijlstra, paulus, tglx, linux-kernel, linux-arch, akpm,
	eranian, dada1, robert.richter, arjan, hpa, rostedt


* David Miller <davem@davemloft.net> wrote:

> From: Ingo Molnar <mingo@elte.hu>
> Date: Fri, 5 Dec 2008 09:42:33 +0100
> 
> > Please let me repeat: it's a _fundamental_ thesis of performance 
> > instrumentation to not disturb the monitored context. Your insistence 
> > on _stopping_ the monitored task breaks that fundamental axiom!
> 
> This is only a problem if you make your measurement quantums too small.

But if you make the measurement long enough - say we make it 100,000 
usecs, then 0.2 usecs of delay between two read()s is insignificant 
statistically, right? It's a 1:500,000 ratio.

Scheduling out a task and back is far more drastic of an effect than any 
new events in 0.2 usecs.

	Ingo

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 0/3] [Announcement] Performance Counters for Linux
  2008-12-05  8:42                       ` Ingo Molnar
  2008-12-05  8:49                         ` David Miller
@ 2008-12-05 12:39                         ` Andi Kleen
  2008-12-05 20:08                           ` David Miller
  1 sibling, 1 reply; 73+ messages in thread
From: Andi Kleen @ 2008-12-05 12:39 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David Miller, a.p.zijlstra, paulus, tglx, linux-kernel,
	linux-arch, akpm, eranian, dada1, robert.richter, arjan, hpa,
	rostedt

Ingo Molnar <mingo@elte.hu> writes:

> Note that more notification record types is actually where latest 
> hardware is going: for example in Nehalem there's a PEBS notification 
> record type that has cachemiss latency included in the record. I.e. we 
> can get profiles with _cachemiss latency_ included (as measured from 
> issuing the instruction to completion).

One problem is that you have to find out the correct RIP for that PEBS
cache miss you have to disassemble from the last basic block because
the IP in PEBS points to the next instruction. 

If such a thing is ever implemented it should be in user space
I think.

Also in general some of the more useful PEBS information requires
disassembling unfortunately. For example if you want a address
histogram you get the register contents, but you have to interpret the
code to compute the EA. While the kernel has a x86 interpreter now for
this I suspect doing it in kernel space would be quite complicated
and at least I would consider doing it in user space cleaner too.

Given these are more obscure features, but not being able to fit
them easily into your model from the start isn't a very promising sign
for the long term extensibility of the design.

-Andi

-- 
ak@linux.intel.com

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 0/3] [Announcement] Performance Counters for Linux
  2008-12-05  8:15         ` David Miller
@ 2008-12-05 13:25           ` Ingo Molnar
  0 siblings, 0 replies; 73+ messages in thread
From: Ingo Molnar @ 2008-12-05 13:25 UTC (permalink / raw)
  To: David Miller
  Cc: paulus, tglx, linux-kernel, linux-arch, akpm, eranian, dada1,
	robert.richter, arjan, hpa, a.p.zijlstra, rostedt


* David Miller <davem@davemloft.net> wrote:

> > > Isn't it two separate read() calls to read the two counters?  If 
> > > so, the only way the two values are actually going to correspond to 
> > > the same point in time is if the task being monitored is stopped.  
> > > In which case the monitoring task needs to use ptrace or something 
> > > similar in order to make sure that the monitored task is actually 
> > > stopped.
> > 
> > It doesnt matter in practice.
> 
> Yes it DOES!
> 
> If I want to know if a code block triggers event X or Y, and your read 
> call triggers one of those events, I can't figure out the answer to my 
> profiling problem.

( this misunderstanding of yours has been cleared up in a later mail: 
  reading a counter causes events in the monitoring context, not in the
  monitored context. )

> That is completely fundamental to all of this.  And this is why this 
> proposal is a non-workable solution.
> 
> 
> > Also, look at our code: we buffer notification events and do not have 
> > to stop the thread for recording the context information.
> 
> But that's what monitoring libraries want, they want to stop the task 
> and inspect it.
> 
> Look at the PAPI library.  If you can't implement what that thing 
> provides, all the real users of profiling information can't use this 
> stuff.

We have looked, and the PAPI library can be implemented on top of our 
system call as well - just like it was implemented on top of the perfctr 
driver and like it was implemented ontop of "perfmon-full".

PAPI is a relatively simple wrapper around OS level performance counter 
facilities. Both the high level counter abstraction 
(PAPI_start_counters() & friends) and the low level PAPI abstraction 
(PAPI event sets, PAPI_attach/detach) can be readily implemented via the 
use of our performance counter subsystem facilities. (In addition to all 
the facilities around PAPI event enumeration.)

PAPI has about 100 functions - if you think our design does not fit it 
for some fundamental reason then please point out exactly which 
functionality (which PAPI function call) cannot be done.

Perfmon needlessly complicated their design by exposing user-space to a 
'performance counter context' and other lowlevel details that should not 
and must not be handled at the ABI level. The PAPI interfaces do not 
force that design choice in any way. It's a plain unnecessary 
complication that permeates the whole perfmon code.

	Ingo

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 0/3] [Announcement] Performance Counters for Linux
  2008-12-05  8:07             ` David Miller
  2008-12-05  8:11               ` Ingo Molnar
@ 2008-12-05 15:00               ` Arjan van de Ven
  1 sibling, 0 replies; 73+ messages in thread
From: Arjan van de Ven @ 2008-12-05 15:00 UTC (permalink / raw)
  To: David Miller
  Cc: a.p.zijlstra, paulus, mingo, tglx, linux-kernel, linux-arch,
	akpm, eranian, dada1, robert.richter, hpa, rostedt

On Fri, 05 Dec 2008 00:07:16 -0800 (PST)
David Miller <davem@davemloft.net> wrote:

> These things aren't measuring time, or even just cycles, they
> are measuring things like L2 cache misses, cpu cycles, and
> other similar kinds of events.
> 
> So these counters are going to measure all of the damn crap
> assosciated with doing the read() call as well as the real work
> the task does.

as you said before, not if you do the read() from a thread that's
exempt from the profiling.

-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 0/3] [Announcement] Performance Counters for Linux
  2008-12-05 12:39                         ` Andi Kleen
@ 2008-12-05 20:08                           ` David Miller
  2008-12-10  3:48                             ` Paul Mundt
  0 siblings, 1 reply; 73+ messages in thread
From: David Miller @ 2008-12-05 20:08 UTC (permalink / raw)
  To: andi
  Cc: mingo, a.p.zijlstra, paulus, tglx, linux-kernel, linux-arch,
	akpm, eranian, dada1, robert.richter, arjan, hpa, rostedt

From: Andi Kleen <andi@firstfloor.org>
Date: Fri, 05 Dec 2008 13:39:43 +0100

> Given these are more obscure features, but not being able to fit
> them easily into your model from the start isn't a very promising sign
> for the long term extensibility of the design.

Another thing I'm interested in is if this new stuff will work with
performance counters that lack an overflow interrupt.

We have several chips that are like this, and perfmon supported that
on the kernel side, and also provided overflow emulation for such
hardware in userspace (where such complexity belongs).

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 0/3] [Announcement] Performance Counters for Linux
  2008-12-05 12:07           ` Ingo Molnar
@ 2008-12-06  0:05             ` Paul Mackerras
  2008-12-06  1:23               ` Mikael Pettersson
  2008-12-06 12:34               ` Peter Zijlstra
  0 siblings, 2 replies; 73+ messages in thread
From: Paul Mackerras @ 2008-12-06  0:05 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thomas Gleixner, LKML, linux-arch, Andrew Morton,
	Stephane Eranian, Eric Dumazet, Robert Richter,
	Arjan van de Veen, Peter Anvin, Peter Zijlstra, Steven Rostedt,
	David Miller

Ingo Molnar writes:

> A stream of read()s possibly slightly being off is an order of magnitude 
> smaller of an effect to precision. Look at the numbers: on the testbox i 
> have a read() syscall takes 0.2 microseconds, while a context-switch 
> takes 2 microseconds on the local CPU and about 5-10 microseconds 
> cross-CPU (or more, if the cache pattern is unlucky/unaffine). That's 
> 10-25-50 times more expensive. You can do 9-24-49 reads and still be 
> cheaper. Compound syscalls are almost never worth the complexity.

If we're on an SMP system and the monitored task is currently running,
then it's not just the read syscall, there's also the IPI, which will
add considerably to the cost.  So it's likely to be a thousand or more
cycles between the two counter reads, which is IMHO unacceptable.

Anyway, that's not my major problem with the API, just one of its
little annoyances...

> So as a scheduler person i cannot really take the perfmon "ptrace 
> approach" seriously, and i explained that in great detail already. It 
> clearly came from HPC workload quarters where tasks are persistent 
> entities running alone on a single CPU that just use up CPU time there 
> and dont interact with each other too much. That's a good and important 
> profiling target for sure - but by no means the only workload target to 
> design a core kernel facility for. It's an absolutely horrible approach 
> for a number of more common workloads for sure.

I never defended the use of ptrace, and it isn't IMO an essential part
of the perfmon API, just an aspect of its current implementation.

> > And I think that is the fundamental flaw.  On the machines I am 
> > familiar with, the performance counters as not separate things that can 
> > individually and independently be assigned to count one thing or 
> > another.
> 
> Today we've implemented virtual counter scheduling in our to-be-v2 code:
> 
>    3 files changed, 36 insertions(+), 1 deletion(-)
> 
> hello.c gives:
> 
>  counter[0 cycles              ]:  10121258163 , delta:    844256826 events
>  counter[1 instructions        ]:   4160893621 , delta:    347054666 events
>  counter[2 cache-refs          ]:         2297 , delta:          179 events
>  counter[3 cache-misses        ]:            3 , delta:            0 events
>  counter[4 branch-instructions ]:    799422166 , delta:     66551572 events
>  counter[5 branch-misses       ]:         7286 , delta:          775 events

And this tells me what?  I can't relate any of these measurements to
any others, because I don't know how many cycles or instructions or
milliseconds each of these counts relates to, and I don't know which
counts were taken at the same time as which other counts.

Your abstraction hides all the details of what is being counted with
which counter over what period of time, and that is absolutely crucial
information for any serious analysis of the numbers.

> All we need to get that array of information from 6 sw counters is a 
> _single_ hardware counter. I'm not sure where you read "you must map sw 
> counters to hw counters directly" or "hw counters must be independent of 
> each other" into our design - it's not part of it, emphatically.

I'm not sure those quoted statements are exactly what I said, but
whatever.

Your API has as its central abstraction the "counter".  I am saying
that that is the wrong abstraction.  The abstraction really needs to
be a set of counters that are all active over precisely the same
interval, so that their values can be meaningfully compared and
related to each other.

> And i dont see your (fully correct!) statement above about counter 
> constraints to be in any sort of conflict with what we are doing.
> 
> Intel hardware is just as constrained as powerpc hardware: there are 
> counter inter-dependencies and many CPUs have just two performance 
> counters. We very much took this into account while designing this code.

Well, here's my reasoning.

* Your perf_counter_open call takes the event type but doesn't have
  any way to select a particular hardware counter (deliberately, since
  your API is trying to present some common-denominator abstraction of
  the individual counters).

* On powerpc, the event selector value to count a particular event is
  different for each counter, and may even depend on what's being
  counted on other counters.

* That means that we can't meaningfully pass raw (negative) event
  selector values, since what any particular value means depends on
  which hardware counter we get to use, and we don't know that (and in
  fact it may change from time to time).

* In other words, the kernel will have to know the mapping from
  abstract event types to event selector values for each counter for
  each supported CPU type.

Now, the tables in perfmon's user-land libpfm that describe the
mapping from abstract events to event-selector values and the
constraints on what events can be counted together come to nearly
29,000 lines of code just for the IBM 64-bit powerpc processors.

Your API condemns us to adding all that bloat to the kernel, plus the
code to use those tables.

Furthermore, since your generic code doesn't know anything about the
constraints and thinks it can just add any counter to any task at any
time (subject only to a maximum number of counters in use), we'll
potentially have to work out event selector values at latency-critical
times such as context switches and interrupts.

> [ Obviously, you _can_ do higher quality profiling if you have more 
>   hardware resources that help it. Nothing will change that fact. ]
> 
> > Rather, what the hardware provides is ONE performance monitor unit, 
> > which the OS can context-switch between tasks.  The performance monitor 
> > unit has several counters that can be assigned (within limits) to count 
> > various aspects of the performance of the code being executed.  That is 
> > why, for instance, if you ask for the counters to be frozen when one of 
> > them overflows, they all get frozen at that point.
> 
> i dont see this as an issue at all - it's a feature of powerpc over x86 
> that the core perfcounter code can support just fine. The overflow IRQ 
> handler is arch specific. The overflow IRQ handler, if it triggers, 
> updates the sw counters, creates any event records if needed, wakes up 
> the monitor task if needed, and continues the task and performance 
> measurement without having scheduled out. Demultiplexing of hw counters 
> is arch-specific.

The ability to create event records in a ring buffer is certainly
nice.  I have no problem with that part of your proposal, particularly
if we can optionally record things like a timestamp, task registers,
stacktrace, etc. at the same time, as you have suggested.

My point is that the monitoring task wants to be able to control which
things get measured simultaneously.  The kernel shouldn't be deciding
how the set of software counters gets multiplexed onto the hardware
counters - the monitoring task needs to be able to control that in
order to get meaningful results.

There are three other problems that I see with your API (these are
probably fixable):

1. I don't see any way to control whether I'm counting events in user
   mode, kernel mode, hypervisor mode, or some combination.  That is
   needed for some types of performance analysis.

2. If I'm counting events for all tasks, I want to be able to exclude
   the idle task, optionally.  I don't see a way to do that.

3. If I have a counter in PERF_RECORD_IRQ mode, I have no way to read
   its actual value, which I would want to do (for instance, when some
   other counter overflows, or when the task exits).

Paul.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 0/3] [Announcement] Performance Counters for Linux
  2008-12-06  0:05             ` Paul Mackerras
@ 2008-12-06  1:23               ` Mikael Pettersson
  2008-12-06 12:34               ` Peter Zijlstra
  1 sibling, 0 replies; 73+ messages in thread
From: Mikael Pettersson @ 2008-12-06  1:23 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Ingo Molnar, Thomas Gleixner, LKML, linux-arch, Andrew Morton,
	Stephane Eranian, Eric Dumazet, Robert Richter,
	Arjan van de Veen, Peter Anvin, Peter Zijlstra, Steven Rostedt,
	David Miller

Paul Mackerras writes:
 > Furthermore, since your generic code doesn't know anything about the
 > constraints and thinks it can just add any counter to any task at any
 > time

This observation alone makes this proposal a non-starter.
Counters are not independent. Even on x86. Never have been.

If you want to fix something, here's one:
- Make the decision whether to schedule task t on processor p a
  function of what other set of tasks T are currently on processor p.

The issue is that some performance counter events aren't thread
local, e.g. Nehalem uncore stuff and similar HW crap in AMD
northbridge events and everything P4. So while one task t1
is running it's also reserving off-thread resources R, making those
resources unavailable for other tasks T.

(If you want a simpler metaphor, imagine a multi-threaded or multi-core
processor package having only a single floating-point unit. How would
you handle that in the scheduler? There are performance counter events
from both Intel and AMD that pose the same challenge.)

I "solved" that in perfctr for P4 by enforcing affinity constraints,
but surely the scheduler could be smarter?

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 0/3] [Announcement] Performance Counters for Linux
  2008-12-04 23:44 [patch 0/3] [Announcement] Performance Counters for Linux Thomas Gleixner
                   ` (6 preceding siblings ...)
  2008-12-05  3:30 ` Andrew Morton
@ 2008-12-06  2:36 ` stephane eranian
  2008-12-08  2:12     ` Dan Terpstra
  2008-12-10 16:27     ` Rob Fowler
  7 siblings, 2 replies; 73+ messages in thread
From: stephane eranian @ 2008-12-06  2:36 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, linux-arch, Andrew Morton, Ingo Molnar, Eric Dumazet,
	Robert Richter, Arjan van de Veen, Peter Anvin, Peter Zijlstra,
	Steven Rostedt, David Miller, Paul Mackerras, perfmon2-devel

Hello,

I have been reading all the threads after this unexpected announcement
of a competing proposal for an interface to access the performance counters.
I would like to respond to some of the things I have seen.

* ptrace: as Paul just pointed out, ptrace() is a limitation of the
  current perfmon implementation. This is not a limitation of the
  interface as has been insinuated earlier. In my mind, this does
  not justify starting from scratch. There is nothing that precludes
  removing ptrace and using the IPI to chase down the PMU state,
  like you are doing. And in fact I believe we can do it more efficiently
  because we would potentially collect multiple values in one IPI,
  something your API cannot allow because it is single event oriented.

* There is more to perfmon than what you have looked at on LKML. There
   is advanced sampling support with a kernel level buffer which is remapped
   to user space. So there is no such thing as a couple of ptrace() calls per
   sample. In fact, there is zero copy export to user space. In the
case of PEBS,
   there is even zero-copy from HW to user space.

* The proposed API exposes events as individual entities. To measure N
   events, you need N file descriptors. There is no coordination of actions
   between the various events. If you want to start/stop all events, it seems
   you have to close the  file descriptors and start over. That is not
how people
   use this, especially people doing self monitoring. They want to start/stop
   around critical loops or functions and they want this to be fast.

* To read N events you need N syscalls and potentially N IPIs. There
   is no guarantee of atomicity between the reads. The argument of raising
   the priority to prevent preemption is bogus and unrealistic. We want regular
   users to be able to measure their own applications without having to have
   special privileges. This is especially unpractical when you want to read from
   another thread. It is important to get a view of the counters that
is as consistent
   as possible and for that you want to read the registers are closely
as possible
   from each other.

* As mentioned by Paul, Corey, the API inevitably forces the kernel to
know about
  ALL the events and how they map onto counters. People who have been doing this
  in userland, and I am one of them, can tell you that this is a very
hard problem.
  Looking at it just on the Intel and AMD x86 is misleading. It is not
the number of
  events that matters, even it contributes to the kernel bloat, it is
managing the constraints
  between events (event A and B cannot be measured together, if event
A uses counter X
  then B cannot be measured on counter Y). Sometimes, the value of a
config register depends
  on which register you load it on. With the proposed API, all this
complexity would have to go in
  the kernel. I don't think it belongs here and it will leads to
maintenance problems, and longer
  delays to enable support of new hardware. The argument for doing
this was that it would
  facilitate writing tools. But all that complexity does not belong in
the tools but in a user library.
  This is what libpfm is designed for and it has worked nicely so far.
The role of the kernel
  is to control access to the PMU resource and to make sure incorrect
programming of the registers
  cannot crash the kernel. If you do this, then providing support for
new hardware is for the most part
  simply exposing the registers. Something which can even be
discovered automatically on newer
  processors, e.g., ones supporting Intel architectural perfmon.

* Tools usually manage monitoring as a session. There was criticism
   about the perfmon context abstraction and vectors. A context is  merely
   a synonym for session.  I believe having a file descriptor per session is
   a natural thing to have. Vectors are used to access multiple registers in
   one syscall. Vector have variable sizes, it depends on what you want to
   access. The size is not mandated by the number of registers of the
   underlying hardware.

* As mentioned by Paul, with certain PMUs, it is not possible to solve
  the event -> counter problem without having a global view
  of all the events. Your API being single-event oriented, it is not
  clear to me how this can be solved.

* It is not because you run a per thread session, that you should be
  limited to measuring at priv level 3.

* Modern PMU, including AMD Barcelona. Itanium2, expose more than
  counters. Any API than assumes PMU export only
  counters is going to be limited, e.g. Oprofile. Perfmon does not
  make that mistake, the interface does not know anything
  about counters nor sampling periods. It sees registers with values
  you can read or write. That has allowed us to support
  advanced features such as Itanium2 Opcode filter, Itanium2
  Code/Data range restrictions (hosted in debug regs), AMD
  Barcelona IBS which has no event associated with it, Itanium2
  BranchTraceBuffer, Intel Core 2 LBR, Intel Core i7 uncore PMU.
  Some of those features have no ties with counters, they do not even
  overflow (e.g., LBR). They must be used in combination with
  counters, e.g., LBRs. I don't think you will be able to do this
  with your API.

* With regards to sampling, advanced users have long been collecting
  more than just the IP. They want to collect the values of other
  PMU registers or even values of other non-PMU resources. With your
  API, it seems for every new need, you'd have to create a new
  perf_record_type, which translates into a kernel patch. This is not
  what people want. With perfmon, you have a choice of doing user
  level sampling (users gets notification for each sample) but you can
  also use a kernel sampling buffer. In that case, you can express
  what you want recorded in the buffer using simple bitmasks of PMU
  registers. There is no predefined set, no kernel patch.
  To make this even more flexible the buffer format is not part of the
  interface, you can define your own and record whatever you want
  in whatever format you want. All is provided by kernel modules. You
  want double-buffer, cyclic buffer, just add your kernel module. It
  seems this feature has been overlooked by LKML reviewers but it is
  really powerful.

* It is not clear to me how you would add a sampling buffer and
  remapping using your API given the number of file descriptors you will
  end up using and the fact that you do not have the notion of a session.

* When sampling, you want to freeze the counters on overflow to get an
  as consistent as possible view. There is no such guarantee in
  your API nor implementation. On some hardware platforms, e.g.,
  Itanium, you have no choice this is the behavior.

* Multiple counters can overflow at the same time and generate a
  single interrupt. With your approach, if two counters overflow
  simultaneously, then you need to enqueue two messages, yet only
  one SIGIO wil be generated, it seems. Wonder how that works when
  self-monitoring.


In summary, although the idea of simplifying tools by moving the
complexity elsewhere is legitimate, pushing it down to the kernel
is the wrong approach in my opinion, perfmon has avoided that as much
as possible for good reasons. We have shown , with libpfm,
that a large part of complexity can easily be encapsulated into a user
library. I also don't think the approach of managing events
independently of each others works for all processors. As pointed out
by others, there are other factors at stake and they may not
even be on the same core.

S. Eranian

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 0/3] [Announcement] Performance Counters for Linux
  2008-12-06  0:05             ` Paul Mackerras
  2008-12-06  1:23               ` Mikael Pettersson
@ 2008-12-06 12:34               ` Peter Zijlstra
  2008-12-07  5:15                 ` Paul Mackerras
  1 sibling, 1 reply; 73+ messages in thread
From: Peter Zijlstra @ 2008-12-06 12:34 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Ingo Molnar, Thomas Gleixner, LKML, linux-arch, Andrew Morton,
	Stephane Eranian, Eric Dumazet, Robert Richter,
	Arjan van de Veen, Peter Anvin, Steven Rostedt, David Miller

On Sat, 2008-12-06 at 11:05 +1100, Paul Mackerras wrote:
> Now, the tables in perfmon's user-land libpfm that describe the
> mapping from abstract events to event-selector values and the
> constraints on what events can be counted together come to nearly
> 29,000 lines of code just for the IBM 64-bit powerpc processors.
> 
> Your API condemns us to adding all that bloat to the kernel, plus the
> code to use those tables.

Since you need those tables and that code anyway, and in a solid
reliable way, what is the objection of carrying it in the kernel?

Furthermore, is there a good technical reason these cpus are so
complicated to use?


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 0/3] [Announcement] Performance Counters for Linux
  2008-12-06 12:34               ` Peter Zijlstra
@ 2008-12-07  5:15                 ` Paul Mackerras
  2008-12-08  7:18                   ` stephane eranian
  0 siblings, 1 reply; 73+ messages in thread
From: Paul Mackerras @ 2008-12-07  5:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, LKML, linux-arch, Andrew Morton,
	Stephane Eranian, Eric Dumazet, Robert Richter,
	Arjan van de Veen, Peter Anvin, Steven Rostedt, David Miller

Peter Zijlstra writes:

> On Sat, 2008-12-06 at 11:05 +1100, Paul Mackerras wrote:
> > Now, the tables in perfmon's user-land libpfm that describe the
> > mapping from abstract events to event-selector values and the
> > constraints on what events can be counted together come to nearly
> > 29,000 lines of code just for the IBM 64-bit powerpc processors.
> > 
> > Your API condemns us to adding all that bloat to the kernel, plus the
> > code to use those tables.
> 
> Since you need those tables and that code anyway, and in a solid
> reliable way, what is the objection of carrying it in the kernel?

Because it's about 320kB of unpageable kernel memory, and it doesn't
need to be in the kernel.

The fundamental problem with Ingo and Thomas's proposal is that the
abstraction is at the wrong level.  It makes individual counters the
central idea, when the central idea should be a set of counters that
all start and stop counting at the same times.  People doing
performance analysis want to be able to compare counts of different
events and get ratios, and you can't do that meaningfully if the
counts correspond to different stretches of code.

Once you make the abstraction a set of counters, then you also make it
possible to have a counter-set that is the whole PMU.  Then you don't
have to have the kernel knowing all the possible settings for the PMU;
it only needs to know the simple ones, and if you want to do something
more sophisticated, you can have userspace specifying the bits to
select the more sophisticated setting.

> Furthermore, is there a good technical reason these cpus are so
> complicated to use?

That question is a bit ambiguous.  If you mean, why did the hardware
designers make it so complex? then I don't really know, but it doesn't
matter because the CPU hardware is what it is.  At best I might be
able to influence future designs to be a bit simpler.

If you mean, could the software description of the hardware be
simpler? then maybe - I'm just reading up on the details of the
hardware, and it is pretty complex, with multiple layers of
multiplexers and event buses.

Paul.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* RE: [perfmon2] [patch 0/3] [Announcement] Performance Counters forLinux
  2008-12-06  2:36 ` stephane eranian
@ 2008-12-08  2:12     ` Dan Terpstra
  2008-12-10 16:27     ` Rob Fowler
  1 sibling, 0 replies; 73+ messages in thread
From: Dan Terpstra @ 2008-12-08  2:12 UTC (permalink / raw)
  To: eranian, 'Thomas Gleixner'
  Cc: linux-arch, 'Peter Zijlstra', 'David Miller',
	'LKML', 'Steven Rostedt', 'Eric Dumazet',
	'Paul Mackerras', 'Peter Anvin',
	'Andrew Morton', 'Ingo Molnar',
	'perfmon2-devel', 'Arjan van de Veen'

I'm reminded of the quote attributed to Einstein: "Make things as simple as
possible, but no simpler".
In that regard, it appears that Stephane's perfmon is closer to the mark
than this proposal.
If Stephane's observations below are even close to correct, it would make
PAPI's first-person event-set caliper model essentially useless. We must be
able to start and stop multiple counter values simultaneously and quickly to
infer any validity even for derived measurements as simple as
instructions-per-cycle.
dan terpstra
for the PAPI team


> -----Original Message-----
> From: stephane eranian [mailto:eranian@googlemail.com]
> Sent: Friday, December 05, 2008 9:37 PM
> To: Thomas Gleixner
> Cc: linux-arch@vger.kernel.org; Peter Zijlstra; David Miller; LKML; Steven
> Rostedt; Eric Dumazet; Paul Mackerras; Peter Anvin; Andrew Morton; Ingo
> Molnar; perfmon2-devel; Arjan van de Veen
> Subject: Re: [perfmon2] [patch 0/3] [Announcement] Performance Counters
> forLinux
> 
> Hello,
> 
> I have been reading all the threads after this unexpected announcement
> of a competing proposal for an interface to access the performance
> counters.
> I would like to respond to some of the things I have seen.
> 
> * ptrace: as Paul just pointed out, ptrace() is a limitation of the
>   current perfmon implementation. This is not a limitation of the
>   interface as has been insinuated earlier. In my mind, this does
>   not justify starting from scratch. There is nothing that precludes
>   removing ptrace and using the IPI to chase down the PMU state,
>   like you are doing. And in fact I believe we can do it more efficiently
>   because we would potentially collect multiple values in one IPI,
>   something your API cannot allow because it is single event oriented.
> 
> * There is more to perfmon than what you have looked at on LKML. There
>    is advanced sampling support with a kernel level buffer which is
> remapped
>    to user space. So there is no such thing as a couple of ptrace() calls
> per
>    sample. In fact, there is zero copy export to user space. In the
> case of PEBS,
>    there is even zero-copy from HW to user space.
> 
> * The proposed API exposes events as individual entities. To measure N
>    events, you need N file descriptors. There is no coordination of
> actions
>    between the various events. If you want to start/stop all events, it
> seems
>    you have to close the  file descriptors and start over. That is not
> how people
>    use this, especially people doing self monitoring. They want to
> start/stop
>    around critical loops or functions and they want this to be fast.
> 
> * To read N events you need N syscalls and potentially N IPIs. There
>    is no guarantee of atomicity between the reads. The argument of raising
>    the priority to prevent preemption is bogus and unrealistic. We want
> regular
>    users to be able to measure their own applications without having to
> have
>    special privileges. This is especially unpractical when you want to
> read from
>    another thread. It is important to get a view of the counters that
> is as consistent
>    as possible and for that you want to read the registers are closely
> as possible
>    from each other.
> 
> * As mentioned by Paul, Corey, the API inevitably forces the kernel to
> know about
>   ALL the events and how they map onto counters. People who have been
> doing this
>   in userland, and I am one of them, can tell you that this is a very
> hard problem.
>   Looking at it just on the Intel and AMD x86 is misleading. It is not
> the number of
>   events that matters, even it contributes to the kernel bloat, it is
> managing the constraints
>   between events (event A and B cannot be measured together, if event
> A uses counter X
>   then B cannot be measured on counter Y). Sometimes, the value of a
> config register depends
>   on which register you load it on. With the proposed API, all this
> complexity would have to go in
>   the kernel. I don't think it belongs here and it will leads to
> maintenance problems, and longer
>   delays to enable support of new hardware. The argument for doing
> this was that it would
>   facilitate writing tools. But all that complexity does not belong in
> the tools but in a user library.
>   This is what libpfm is designed for and it has worked nicely so far.
> The role of the kernel
>   is to control access to the PMU resource and to make sure incorrect
> programming of the registers
>   cannot crash the kernel. If you do this, then providing support for
> new hardware is for the most part
>   simply exposing the registers. Something which can even be
> discovered automatically on newer
>   processors, e.g., ones supporting Intel architectural perfmon.
> 
> * Tools usually manage monitoring as a session. There was criticism
>    about the perfmon context abstraction and vectors. A context is  merely
>    a synonym for session.  I believe having a file descriptor per session
> is
>    a natural thing to have. Vectors are used to access multiple registers
> in
>    one syscall. Vector have variable sizes, it depends on what you want to
>    access. The size is not mandated by the number of registers of the
>    underlying hardware.
> 
> * As mentioned by Paul, with certain PMUs, it is not possible to solve
>   the event -> counter problem without having a global view
>   of all the events. Your API being single-event oriented, it is not
>   clear to me how this can be solved.
> 
> * It is not because you run a per thread session, that you should be
>   limited to measuring at priv level 3.
> 
> * Modern PMU, including AMD Barcelona. Itanium2, expose more than
>   counters. Any API than assumes PMU export only
>   counters is going to be limited, e.g. Oprofile. Perfmon does not
>   make that mistake, the interface does not know anything
>   about counters nor sampling periods. It sees registers with values
>   you can read or write. That has allowed us to support
>   advanced features such as Itanium2 Opcode filter, Itanium2
>   Code/Data range restrictions (hosted in debug regs), AMD
>   Barcelona IBS which has no event associated with it, Itanium2
>   BranchTraceBuffer, Intel Core 2 LBR, Intel Core i7 uncore PMU.
>   Some of those features have no ties with counters, they do not even
>   overflow (e.g., LBR). They must be used in combination with
>   counters, e.g., LBRs. I don't think you will be able to do this
>   with your API.
> 
> * With regards to sampling, advanced users have long been collecting
>   more than just the IP. They want to collect the values of other
>   PMU registers or even values of other non-PMU resources. With your
>   API, it seems for every new need, you'd have to create a new
>   perf_record_type, which translates into a kernel patch. This is not
>   what people want. With perfmon, you have a choice of doing user
>   level sampling (users gets notification for each sample) but you can
>   also use a kernel sampling buffer. In that case, you can express
>   what you want recorded in the buffer using simple bitmasks of PMU
>   registers. There is no predefined set, no kernel patch.
>   To make this even more flexible the buffer format is not part of the
>   interface, you can define your own and record whatever you want
>   in whatever format you want. All is provided by kernel modules. You
>   want double-buffer, cyclic buffer, just add your kernel module. It
>   seems this feature has been overlooked by LKML reviewers but it is
>   really powerful.
> 
> * It is not clear to me how you would add a sampling buffer and
>   remapping using your API given the number of file descriptors you will
>   end up using and the fact that you do not have the notion of a session.
> 
> * When sampling, you want to freeze the counters on overflow to get an
>   as consistent as possible view. There is no such guarantee in
>   your API nor implementation. On some hardware platforms, e.g.,
>   Itanium, you have no choice this is the behavior.
> 
> * Multiple counters can overflow at the same time and generate a
>   single interrupt. With your approach, if two counters overflow
>   simultaneously, then you need to enqueue two messages, yet only
>   one SIGIO wil be generated, it seems. Wonder how that works when
>   self-monitoring.
> 
> 
> In summary, although the idea of simplifying tools by moving the
> complexity elsewhere is legitimate, pushing it down to the kernel
> is the wrong approach in my opinion, perfmon has avoided that as much
> as possible for good reasons. We have shown , with libpfm,
> that a large part of complexity can easily be encapsulated into a user
> library. I also don't think the approach of managing events
> independently of each others works for all processors. As pointed out
> by others, there are other factors at stake and they may not
> even be on the same core.
> 
> S. Eranian
> 
> --------------------------------------------------------------------------
> ----
> SF.Net email is Sponsored by MIX09, March 18-20, 2009 in Las Vegas,
> Nevada.
> The future of the web can't happen without you.  Join us at MIX09 to help
> pave the way to the Next Web now. Learn more and register at
> http://ad.doubleclick.net/clk;208669438;13503038;i?http://2009.visitmix.co
> m/
> _______________________________________________
> perfmon2-devel mailing list
> perfmon2-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/perfmon2-devel


^ permalink raw reply	[flat|nested] 73+ messages in thread

* RE: [perfmon2] [patch 0/3] [Announcement] Performance Counters forLinux
@ 2008-12-08  2:12     ` Dan Terpstra
  0 siblings, 0 replies; 73+ messages in thread
From: Dan Terpstra @ 2008-12-08  2:12 UTC (permalink / raw)
  To: eranian, 'Thomas Gleixner'
  Cc: linux-arch, 'Peter Zijlstra', 'David Miller',
	'LKML', 'Steven Rostedt', 'Eric Dumazet',
	'Paul Mackerras', 'Peter Anvin',
	'Andrew Morton', 'Ingo Molnar',
	'perfmon2-devel', 'Arjan van de Veen'

I'm reminded of the quote attributed to Einstein: "Make things as simple as
possible, but no simpler".
In that regard, it appears that Stephane's perfmon is closer to the mark
than this proposal.
If Stephane's observations below are even close to correct, it would make
PAPI's first-person event-set caliper model essentially useless. We must be
able to start and stop multiple counter values simultaneously and quickly to
infer any validity even for derived measurements as simple as
instructions-per-cycle.
dan terpstra
for the PAPI team


> -----Original Message-----
> From: stephane eranian [mailto:eranian@googlemail.com]
> Sent: Friday, December 05, 2008 9:37 PM
> To: Thomas Gleixner
> Cc: linux-arch@vger.kernel.org; Peter Zijlstra; David Miller; LKML; Steven
> Rostedt; Eric Dumazet; Paul Mackerras; Peter Anvin; Andrew Morton; Ingo
> Molnar; perfmon2-devel; Arjan van de Veen
> Subject: Re: [perfmon2] [patch 0/3] [Announcement] Performance Counters
> forLinux
> 
> Hello,
> 
> I have been reading all the threads after this unexpected announcement
> of a competing proposal for an interface to access the performance
> counters.
> I would like to respond to some of the things I have seen.
> 
> * ptrace: as Paul just pointed out, ptrace() is a limitation of the
>   current perfmon implementation. This is not a limitation of the
>   interface as has been insinuated earlier. In my mind, this does
>   not justify starting from scratch. There is nothing that precludes
>   removing ptrace and using the IPI to chase down the PMU state,
>   like you are doing. And in fact I believe we can do it more efficiently
>   because we would potentially collect multiple values in one IPI,
>   something your API cannot allow because it is single event oriented.
> 
> * There is more to perfmon than what you have looked at on LKML. There
>    is advanced sampling support with a kernel level buffer which is
> remapped
>    to user space. So there is no such thing as a couple of ptrace() calls
> per
>    sample. In fact, there is zero copy export to user space. In the
> case of PEBS,
>    there is even zero-copy from HW to user space.
> 
> * The proposed API exposes events as individual entities. To measure N
>    events, you need N file descriptors. There is no coordination of
> actions
>    between the various events. If you want to start/stop all events, it
> seems
>    you have to close the  file descriptors and start over. That is not
> how people
>    use this, especially people doing self monitoring. They want to
> start/stop
>    around critical loops or functions and they want this to be fast.
> 
> * To read N events you need N syscalls and potentially N IPIs. There
>    is no guarantee of atomicity between the reads. The argument of raising
>    the priority to prevent preemption is bogus and unrealistic. We want
> regular
>    users to be able to measure their own applications without having to
> have
>    special privileges. This is especially unpractical when you want to
> read from
>    another thread. It is important to get a view of the counters that
> is as consistent
>    as possible and for that you want to read the registers are closely
> as possible
>    from each other.
> 
> * As mentioned by Paul, Corey, the API inevitably forces the kernel to
> know about
>   ALL the events and how they map onto counters. People who have been
> doing this
>   in userland, and I am one of them, can tell you that this is a very
> hard problem.
>   Looking at it just on the Intel and AMD x86 is misleading. It is not
> the number of
>   events that matters, even it contributes to the kernel bloat, it is
> managing the constraints
>   between events (event A and B cannot be measured together, if event
> A uses counter X
>   then B cannot be measured on counter Y). Sometimes, the value of a
> config register depends
>   on which register you load it on. With the proposed API, all this
> complexity would have to go in
>   the kernel. I don't think it belongs here and it will leads to
> maintenance problems, and longer
>   delays to enable support of new hardware. The argument for doing
> this was that it would
>   facilitate writing tools. But all that complexity does not belong in
> the tools but in a user library.
>   This is what libpfm is designed for and it has worked nicely so far.
> The role of the kernel
>   is to control access to the PMU resource and to make sure incorrect
> programming of the registers
>   cannot crash the kernel. If you do this, then providing support for
> new hardware is for the most part
>   simply exposing the registers. Something which can even be
> discovered automatically on newer
>   processors, e.g., ones supporting Intel architectural perfmon.
> 
> * Tools usually manage monitoring as a session. There was criticism
>    about the perfmon context abstraction and vectors. A context is  merely
>    a synonym for session.  I believe having a file descriptor per session
> is
>    a natural thing to have. Vectors are used to access multiple registers
> in
>    one syscall. Vector have variable sizes, it depends on what you want to
>    access. The size is not mandated by the number of registers of the
>    underlying hardware.
> 
> * As mentioned by Paul, with certain PMUs, it is not possible to solve
>   the event -> counter problem without having a global view
>   of all the events. Your API being single-event oriented, it is not
>   clear to me how this can be solved.
> 
> * It is not because you run a per thread session, that you should be
>   limited to measuring at priv level 3.
> 
> * Modern PMU, including AMD Barcelona. Itanium2, expose more than
>   counters. Any API than assumes PMU export only
>   counters is going to be limited, e.g. Oprofile. Perfmon does not
>   make that mistake, the interface does not know anything
>   about counters nor sampling periods. It sees registers with values
>   you can read or write. That has allowed us to support
>   advanced features such as Itanium2 Opcode filter, Itanium2
>   Code/Data range restrictions (hosted in debug regs), AMD
>   Barcelona IBS which has no event associated with it, Itanium2
>   BranchTraceBuffer, Intel Core 2 LBR, Intel Core i7 uncore PMU.
>   Some of those features have no ties with counters, they do not even
>   overflow (e.g., LBR). They must be used in combination with
>   counters, e.g., LBRs. I don't think you will be able to do this
>   with your API.
> 
> * With regards to sampling, advanced users have long been collecting
>   more than just the IP. They want to collect the values of other
>   PMU registers or even values of other non-PMU resources. With your
>   API, it seems for every new need, you'd have to create a new
>   perf_record_type, which translates into a kernel patch. This is not
>   what people want. With perfmon, you have a choice of doing user
>   level sampling (users gets notification for each sample) but you can
>   also use a kernel sampling buffer. In that case, you can express
>   what you want recorded in the buffer using simple bitmasks of PMU
>   registers. There is no predefined set, no kernel patch.
>   To make this even more flexible the buffer format is not part of the
>   interface, you can define your own and record whatever you want
>   in whatever format you want. All is provided by kernel modules. You
>   want double-buffer, cyclic buffer, just add your kernel module. It
>   seems this feature has been overlooked by LKML reviewers but it is
>   really powerful.
> 
> * It is not clear to me how you would add a sampling buffer and
>   remapping using your API given the number of file descriptors you will
>   end up using and the fact that you do not have the notion of a session.
> 
> * When sampling, you want to freeze the counters on overflow to get an
>   as consistent as possible view. There is no such guarantee in
>   your API nor implementation. On some hardware platforms, e.g.,
>   Itanium, you have no choice this is the behavior.
> 
> * Multiple counters can overflow at the same time and generate a
>   single interrupt. With your approach, if two counters overflow
>   simultaneously, then you need to enqueue two messages, yet only
>   one SIGIO wil be generated, it seems. Wonder how that works when
>   self-monitoring.
> 
> 
> In summary, although the idea of simplifying tools by moving the
> complexity elsewhere is legitimate, pushing it down to the kernel
> is the wrong approach in my opinion, perfmon has avoided that as much
> as possible for good reasons. We have shown , with libpfm,
> that a large part of complexity can easily be encapsulated into a user
> library. I also don't think the approach of managing events
> independently of each others works for all processors. As pointed out
> by others, there are other factors at stake and they may not
> even be on the same core.
> 
> S. Eranian
> 
> --------------------------------------------------------------------------
> ----
> SF.Net email is Sponsored by MIX09, March 18-20, 2009 in Las Vegas,
> Nevada.
> The future of the web can't happen without you.  Join us at MIX09 to help
> pave the way to the Next Web now. Learn more and register at
> http://ad.doubleclick.net/clk;208669438;13503038;i?http://2009.visitmix.co
> m/
> _______________________________________________
> perfmon2-devel mailing list
> perfmon2-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/perfmon2-devel

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 0/3] [Announcement] Performance Counters for Linux
  2008-12-07  5:15                 ` Paul Mackerras
@ 2008-12-08  7:18                   ` stephane eranian
  2008-12-08 11:11                     ` Ingo Molnar
  0 siblings, 1 reply; 73+ messages in thread
From: stephane eranian @ 2008-12-08  7:18 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, LKML, linux-arch,
	Andrew Morton, Eric Dumazet, Robert Richter, Arjan van de Veen,
	Peter Anvin, Steven Rostedt, David Miller

Hi,

On Sun, Dec 7, 2008 at 6:15 AM, Paul Mackerras <paulus@samba.org> wrote:
> Peter Zijlstra writes:
>
>> On Sat, 2008-12-06 at 11:05 +1100, Paul Mackerras wrote:
>> > Now, the tables in perfmon's user-land libpfm that describe the
>> > mapping from abstract events to event-selector values and the
>> > constraints on what events can be counted together come to nearly
>> > 29,000 lines of code just for the IBM 64-bit powerpc processors.
>> >
>> > Your API condemns us to adding all that bloat to the kernel, plus the
>> > code to use those tables.
>>
>> Since you need those tables and that code anyway, and in a solid
>> reliable way, what is the objection of carrying it in the kernel?
>
> Because it's about 320kB of unpageable kernel memory, and it doesn't
> need to be in the kernel.
>

That inevitably pulls in large amounts of data, the event table for each PMU
model and the description of the constraints between events. New processors
have hundreds of events. Moreover, there is the complexity of the assignment
algorithm to map the events to counters such that they actually measure what
you've asked for. I described some of those constraints in my previous message.
They are not trivial and are oftentimes multi-dimensional. Getting the
algorithms
right is difficult. Event tables are also oftentimes incomplete or
bogus when first
published by HW vendors.

It does not make sense to have this kind of data + code in the kernel. It would
make developing them much more difficult. Maintenance would also be more
difficult. And clearly you don't want to have to re-run the assignment routine
each time you context switch.

The kernel is not the only place for rock-solid code. You can have solid/stable
code in libraries as well.

> The fundamental problem with Ingo and Thomas's proposal is that the
> abstraction is at the wrong level.  It makes individual counters the
> central idea, when the central idea should be a set of counters that
> all start and stop counting at the same times.  People doing
> performance analysis want to be able to compare counts of different
> events and get ratios, and you can't do that meaningfully if the
> counts correspond to different stretches of code.
>
> Once you make the abstraction a set of counters, then you also make it
> possible to have a counter-set that is the whole PMU.  Then you don't
> have to have the kernel knowing all the possible settings for the PMU;
> it only needs to know the simple ones, and if you want to do something
> more sophisticated, you can have userspace specifying the bits to
> select the more sophisticated setting.
>
>> Furthermore, is there a good technical reason these cpus are so
>> complicated to use?
>
> That question is a bit ambiguous.  If you mean, why did the hardware
> designers make it so complex? then I don't really know, but it doesn't
> matter because the CPU hardware is what it is.  At best I might be
> able to influence future designs to be a bit simpler.
>

Let me explain the HW complexity a bit. It's all a matter of tradeoffs.
I have regular discussions with the PMU design architects about this.
If you talk to them, then you understand the environment they have to
live in and you understand why those constraints are there. The key point
to understand is that the PMU is never critical to the chip. The chip can work
well without. The real-estate on the chip is always very tight. PMU is a 2nd
class citizen, thus low in the priority list. For certain PMU features
the tradeoff
is: do you want the feature with constraints on programming or no feature at
all. The common HW limitation is wires. For instance, I was once told: would you
rather have a PMU cache event that can be programmed on any counters but
with an increased cache latency for all accesses or a faster cache and
a constraint
on the event? The response is obvious.

I think you now understand why there are constraints and also why they
will never
go away, quite the contrary. I'd rather have a PMU with constraints than no PMU.
Hardware designers make a lot of efforts to give us what we have today already
and we should be thankful.

> If you mean, could the software description of the hardware be
> simpler? then maybe - I'm just reading up on the details of the
> hardware, and it is pretty complex, with multiple layers of
> multiplexers and event buses.
>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 0/3] [Announcement] Performance Counters for Linux
  2008-12-08  7:18                   ` stephane eranian
@ 2008-12-08 11:11                     ` Ingo Molnar
  2008-12-08 11:58                       ` David Miller
  2008-12-09  0:21                       ` stephane eranian
  0 siblings, 2 replies; 73+ messages in thread
From: Ingo Molnar @ 2008-12-08 11:11 UTC (permalink / raw)
  To: eranian
  Cc: Paul Mackerras, Peter Zijlstra, Thomas Gleixner, LKML,
	linux-arch, Andrew Morton, Eric Dumazet, Robert Richter,
	Arjan van de Veen, Peter Anvin, Steven Rostedt, David Miller


* stephane eranian <eranian@googlemail.com> wrote:

> Let me explain the HW complexity a bit. It's all a matter of tradeoffs. 
> I have regular discussions with the PMU design architects about this. 
> If you talk to them, then you understand the environment they have to 
> live in and you understand why those constraints are there. The key 
> point to understand is that the PMU is never critical to the chip. The 
> chip can work well without. The real-estate on the chip is always very 
> tight. PMU is a 2nd class citizen, thus low in the priority list. [...]

The chip designers i talk to with my scheduler maintainer hat on do point 
out that performance monitoring is (of course) in the critical path of 
any chip, and hence its overhead and impact on the gate count of various 
critical components of the CPU core and its impact on the power envelope 
must be kept very low.

Nevertheless, the same chip designers rely on performance counters on a 
daily basis to plan their next-gen chip. They very much want them to work 
fine, and they work hard on making them relevant and easy to use. Often 
the performance counters are the _only_ real cheap hands-on insight into 
the dynamic situation of a modern CPU core, even for hw designers.

And all the current hw trends show that it's not just talk but action as 
well: the Core2 PMCs are already much saner (less constrained) than the 
P4 ones, and now they even expanded on them: Nehalem / Core i7 doubled 
the number of generic PMCs from two to four.

So, contrary to your suggestion, chip designers very much care about 
performance counters and they are working very hard to make this stuff 
useful to us. [ Yes, there are constraints even with generic counters 
(for example you only want a single line towards a PMC register from 
divider units), but the number of cross-counter constraints and their 
relevance is decreasing, not increasing. ]

Anyway ... i think your reply highlights why the fundamental premise of 
your patchset is so wrong: i believe you have designed your code and APIs 
at the wrong level by (paradoxically) assuming in essence that 
performance counters do not matter in the general scheme of things. (!)

So you introduced limited, special-purpose but still quite complex APIs 
that tailored the ABIs to intricate low level details of PMUs. I see an 
explosion in complexity due to that incorrect design choice: too many 
syscalls, too broad interaction between core code and architecture code, 
and too little practical utility in the end.

We did what we believe to be the right thing: we gave performance 
counters the proper high-level abstraction they _deserve_, and we made 
performance counters a prime-time Linux citizen as well.

	Ingo

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 0/3] [Announcement] Performance Counters for Linux
  2008-12-08 11:11                     ` Ingo Molnar
@ 2008-12-08 11:58                       ` David Miller
  2008-12-09  0:21                       ` stephane eranian
  1 sibling, 0 replies; 73+ messages in thread
From: David Miller @ 2008-12-08 11:58 UTC (permalink / raw)
  To: mingo
  Cc: eranian, paulus, a.p.zijlstra, tglx, linux-kernel, linux-arch,
	akpm, dada1, robert.richter, arjan, hpa, rostedt

From: Ingo Molnar <mingo@elte.hu>
Date: Mon, 8 Dec 2008 12:11:53 +0100

> We did what we believe to be the right thing: we gave performance 
> counters the proper high-level abstraction they _deserve_, and we made 
> performance counters a prime-time Linux citizen as well.

Seperate counters that are read independently is fundamentally wrong,
no matter how many times you try to say it isn't.  In fact it has
been shown (repeatedly) that this abstraction is at the wrong level.

People want to correlate, and it's not possible to do that if the
counters are sampled seperately.

We also don't want half-megabyte PMU tables in the kernel, nor the
complex logic about how PMU counter X can configured when counter Y is
configured for event A.  All of that belongs in userspace.

We also want to support PMUs that do not generate an overflow
interrupt.

Really, all of the backlash these new patches have received is not
about how clean the abstraction is, but rather whether it can even
do the job properly.

And also, another part of the backlash is that the poor perfmon3
person was completely blindsided by this new stuff.  Which to be
honest was pretty unfair.  He might have had great ideas about
the requirements (even if you don't give a crap about his approach
to achieving those requirements) and thus could have helped avoid
the past few days of churn.


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 0/3] [Announcement] Performance Counters for Linux
  2008-12-08 11:11                     ` Ingo Molnar
  2008-12-08 11:58                       ` David Miller
@ 2008-12-09  0:21                       ` stephane eranian
  1 sibling, 0 replies; 73+ messages in thread
From: stephane eranian @ 2008-12-09  0:21 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Paul Mackerras, Peter Zijlstra, Thomas Gleixner, LKML,
	linux-arch, Andrew Morton, Eric Dumazet, Robert Richter,
	Arjan van de Veen, Peter Anvin, Steven Rostedt, David Miller

On Mon, Dec 8, 2008 at 12:11 PM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * stephane eranian <eranian@googlemail.com> wrote:
>
>> Let me explain the HW complexity a bit. It's all a matter of tradeoffs.
>> I have regular discussions with the PMU design architects about this.
>> If you talk to them, then you understand the environment they have to
>> live in and you understand why those constraints are there. The key
>> point to understand is that the PMU is never critical to the chip. The
>> chip can work well without. The real-estate on the chip is always very
>> tight. PMU is a 2nd class citizen, thus low in the priority list. [...]
>
> The chip designers i talk to with my scheduler maintainer hat on do point
> out that performance monitoring is (of course) in the critical path of
> any chip, and hence its overhead and impact on the gate count of various
> critical components of the CPU core and its impact on the power envelope
> must be kept very low.
>

You have a talent for turning people's argument into something else.

You dropped my example about the wire limitation. It was describing my point
about constraints and PMU as 2nd class citizen. I'd rather have a new
constrained
PMU feature that no new feature at all. You also seem to limit your
world to x86,
you have to look beyond like Itanium and Power, for instance.

I know quite well that the PMU is used for debugging internally and early on,
so don't lecture me on this! I have participated in the architectural design of
some.

> Nevertheless, the same chip designers rely on performance counters on a
> daily basis to plan their next-gen chip. They very much want them to work
> fine, and they work hard on making them relevant and easy to use. Often
> the performance counters are the _only_ real cheap hands-on insight into
> the dynamic situation of a modern CPU core, even for hw designers.
>
Like, I did not know that?

> And all the current hw trends show that it's not just talk but action as
> well: the Core2 PMCs are already much saner (less constrained) than the
> P4 ones, and now they even expanded on them: Nehalem / Core i7 doubled
> the number of generic PMCs from two to four.
>

You think I am not aware of that?I know that quite well because I talk to the
PMU architects on a regular basis trying to get them to add new features and
make the PMU easier to manage. And I make sure I broaden my horizon
beyond x86.

And yes, the PMU is becoming more and more critical and a true-value add.
That's good for end-users as long as the new features can be exposed.

> So, contrary to your suggestion, chip designers very much care about

You did not get my point, but I am not surprised...

> performance counters and they are working very hard to make this stuff
> useful to us. [ Yes, there are constraints even with generic counters
> (for example you only want a single line towards a PMC register from
> divider units), but the number of cross-counter constraints and their
> relevance is decreasing, not increasing. ]
>
> Anyway ... i think your reply highlights why the fundamental premise of
> your patchset is so wrong: i believe you have designed your code and APIs
> at the wrong level by (paradoxically) assuming in essence that
> performance counters do not matter in the general scheme of things. (!)
>
> So you introduced limited, special-purpose but still quite complex APIs

That's not a valid argument! Perfmon, unlike any other existing API, has
exposed all advanced features of all existing PMU models and across
multiple architectures.

> that tailored the ABIs to intricate low level details of PMUs. I see an
> explosion in complexity due to that incorrect design choice: too many

You current API does not offer access to any of the advanced features of
X86, like PEBS, IBS, LBR and others,  let alone on the other architectures.
So again your arguments are unfounded.

> syscalls, too broad interaction between core code and architecture code,
> and too little practical utility in the end.
>

I think the number of syscalls is irrelevant, that's not how I measure
the usefulness of an API.
What matters is the functionalities. Any performance monitoring API should have:
   - create a session
   - program the registers
   - start and stop on demand and has many times as you want
   - attach to a thread or CPU
   - read the register values
   - advanced support for event-based sampling

> We did what we believe to be the right thing: we gave performance
> counters the proper high-level abstraction they _deserve_, and we made
> performance counters a prime-time Linux citizen as well.
>
You have no validation to prove you chose the right level.

As if the perfmon project did not put the PMU on the forefront.
Who is going to buy that?

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 0/3] [Announcement] Performance Counters for Linux
  2008-12-05 20:08                           ` David Miller
@ 2008-12-10  3:48                             ` Paul Mundt
  2008-12-10  4:42                               ` Paul Mackerras
                                                 ` (2 more replies)
  0 siblings, 3 replies; 73+ messages in thread
From: Paul Mundt @ 2008-12-10  3:48 UTC (permalink / raw)
  To: David Miller
  Cc: andi, mingo, a.p.zijlstra, paulus, tglx, linux-kernel,
	linux-arch, akpm, eranian, dada1, robert.richter, arjan, hpa,
	rostedt

On Fri, Dec 05, 2008 at 12:08:14PM -0800, David Miller wrote:
> From: Andi Kleen <andi@firstfloor.org>
> Date: Fri, 05 Dec 2008 13:39:43 +0100
> 
> > Given these are more obscure features, but not being able to fit
> > them easily into your model from the start isn't a very promising sign
> > for the long term extensibility of the design.
> 
> Another thing I'm interested in is if this new stuff will work with
> performance counters that lack an overflow interrupt.
> 
> We have several chips that are like this, and perfmon supported that
> on the kernel side, and also provided overflow emulation for such
> hardware in userspace (where such complexity belongs).

There doesn't seem to have been any reply to this point unfortunately,
and it is something I am also wondering about.

The sh perf counters were not designed with overflowing in mind, they are
split in to a pair of 48-bit or 64-bit counters that simply keep running.
Any write simply clears the value and the counter starts over. They are
simply counters only, and generate no events whatsoever.

Oprofile has been a pretty bad fit for them, and while I'm slightly more
optimistic about perfmon, I'm rather less enthusiastic about yet another
peformance counter implementation that I am unable to make any use of. 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 0/3] [Announcement] Performance Counters for Linux
  2008-12-10  3:48                             ` Paul Mundt
@ 2008-12-10  4:42                               ` Paul Mackerras
  2008-12-10  8:43                               ` Mikael Pettersson
  2008-12-10 10:28                                 ` Andi Kleen
  2 siblings, 0 replies; 73+ messages in thread
From: Paul Mackerras @ 2008-12-10  4:42 UTC (permalink / raw)
  To: Paul Mundt
  Cc: David Miller, andi, mingo, a.p.zijlstra, tglx, linux-kernel,
	linux-arch, akpm, eranian, dada1, robert.richter, arjan, hpa,
	rostedt

Paul Mundt writes:

> On Fri, Dec 05, 2008 at 12:08:14PM -0800, David Miller wrote:
> > From: Andi Kleen <andi@firstfloor.org>
> > Date: Fri, 05 Dec 2008 13:39:43 +0100
> > 
> > > Given these are more obscure features, but not being able to fit
> > > them easily into your model from the start isn't a very promising sign
> > > for the long term extensibility of the design.
> > 
> > Another thing I'm interested in is if this new stuff will work with
> > performance counters that lack an overflow interrupt.
> > 
> > We have several chips that are like this, and perfmon supported that
> > on the kernel side, and also provided overflow emulation for such
> > hardware in userspace (where such complexity belongs).
> 
> There doesn't seem to have been any reply to this point unfortunately,
> and it is something I am also wondering about.
> 
> The sh perf counters were not designed with overflowing in mind, they are
> split in to a pair of 48-bit or 64-bit counters that simply keep running.
> Any write simply clears the value and the counter starts over. They are
> simply counters only, and generate no events whatsoever.
> 
> Oprofile has been a pretty bad fit for them, and while I'm slightly more
> optimistic about perfmon, I'm rather less enthusiastic about yet another
> peformance counter implementation that I am unable to make any use of. 

This is the sampling vs. counting distinction again, and it sounds
like these counters were designed for counting but not sampling.  If
Ingo and Thomas extend their infrastructure to provide good support
for counting as well as sampling, then you should hopefully be able to
use them for counting, at least.

On POWER6 we have a somewhat similar situation with two out of the six
available counters.  These two counters are fixed function (they
always count cycles and instructions completed) and don't generate
interrupts.  Furthermore, they are only 32 bits wide.  So I definitely
agree we need support for counters that don't interrupt.

Paul.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 0/3] [Announcement] Performance Counters for Linux
  2008-12-10  3:48                             ` Paul Mundt
  2008-12-10  4:42                               ` Paul Mackerras
@ 2008-12-10  8:43                               ` Mikael Pettersson
  2008-12-10 10:28                                 ` Andi Kleen
  2 siblings, 0 replies; 73+ messages in thread
From: Mikael Pettersson @ 2008-12-10  8:43 UTC (permalink / raw)
  To: Paul Mundt
  Cc: David Miller, andi, mingo, a.p.zijlstra, paulus, tglx,
	linux-kernel, linux-arch, akpm, eranian, dada1, robert.richter,
	arjan, hpa, rostedt

Paul Mundt writes:
 > The sh perf counters were not designed with overflowing in mind, they are
 > split in to a pair of 48-bit or 64-bit counters that simply keep running.
 > Any write simply clears the value and the counter starts over. They are
 > simply counters only, and generate no events whatsoever.
 > 
 > Oprofile has been a pretty bad fit for them, and while I'm slightly more
 > optimistic about perfmon, I'm rather less enthusiastic about yet another
 > peformance counter implementation that I am unable to make any use of. 

My 'perfctr' kernel extension has supported this type of hardware
since its beginning in 1999, simply because that's how much hardware
worked at the time. Typical CPUs in that category include Intel P5s,
Intel P6s where the local APIC isn't available (some don't have one
in HW, many have it disabled by BIOS), 1st gen AMD K7, VIA C3, early
UltraSPARCs (not supported by perfctr but could be), and many G3/G4
type 32-bit PowerPCs where HW errata make the PMU overflow interrupt
facility useless or dangerous.

Plain event counting over a group of counters is a convenient way of
computing metrics for isolated blocks of code, such as CPI, branch
misses / insn or clock, and such, so I often use that even on CPUs
that do support overflow interrupts.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 0/3] [Announcement] Performance Counters for Linux
  2008-12-10 10:28                                 ` Andi Kleen
  (?)
@ 2008-12-10 10:23                                 ` Paul Mundt
  2008-12-10 11:03                                     ` Andi Kleen
  -1 siblings, 1 reply; 73+ messages in thread
From: Paul Mundt @ 2008-12-10 10:23 UTC (permalink / raw)
  To: Andi Kleen
  Cc: David Miller, mingo, a.p.zijlstra, paulus, tglx, linux-kernel,
	linux-arch, akpm, eranian, dada1, robert.richter, arjan, hpa,
	rostedt

On Wed, Dec 10, 2008 at 11:28:19AM +0100, Andi Kleen wrote:
> > Oprofile has been a pretty bad fit for them, and while I'm slightly more
> 
> You could always use a extension of timer mode that reads them
> periodically? 
> 
This is what I do today, but it is not an ideal solution. It would be
nice if these sorts of use cases could be supported by newer frameworks
without every platform with similar requirements having to implement
workarounds hanging off of the timer IRQ.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 0/3] [Announcement] Performance Counters for Linux
  2008-12-10  3:48                             ` Paul Mundt
@ 2008-12-10 10:28                                 ` Andi Kleen
  2008-12-10  8:43                               ` Mikael Pettersson
  2008-12-10 10:28                                 ` Andi Kleen
  2 siblings, 0 replies; 73+ messages in thread
From: Andi Kleen @ 2008-12-10 10:28 UTC (permalink / raw)
  To: Paul Mundt, David Miller, andi, mingo, a.p.zijlstra, paulus,
	tglx, linux-kernel, linux-arch, akpm, eranian, dada1,
	robert.richter, arjan, hpa, rostedt

> Oprofile has been a pretty bad fit for them, and while I'm slightly more

You could always use a extension of timer mode that reads them
periodically? 

-Andi
-- 
ak@linux.intel.com

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 0/3] [Announcement] Performance Counters for Linux
@ 2008-12-10 10:28                                 ` Andi Kleen
  0 siblings, 0 replies; 73+ messages in thread
From: Andi Kleen @ 2008-12-10 10:28 UTC (permalink / raw)
  To: Paul Mundt, David Miller, andi, mingo, a.p.zijlstra, paulus,
	tglx, linux-kernel

> Oprofile has been a pretty bad fit for them, and while I'm slightly more

You could always use a extension of timer mode that reads them
periodically? 

-Andi
-- 
ak@linux.intel.com

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 0/3] [Announcement] Performance Counters for Linux
  2008-12-10 10:23                                 ` Paul Mundt
@ 2008-12-10 11:03                                     ` Andi Kleen
  0 siblings, 0 replies; 73+ messages in thread
From: Andi Kleen @ 2008-12-10 11:03 UTC (permalink / raw)
  To: Paul Mundt, Andi Kleen, David Miller, mingo, a.p.zijlstra,
	paulus, tglx, linux-kernel, linux-arch, akpm, eranian, dada1,
	robert.richter, arjan, hpa, rostedt

On Wed, Dec 10, 2008 at 07:23:36PM +0900, Paul Mundt wrote:
> On Wed, Dec 10, 2008 at 11:28:19AM +0100, Andi Kleen wrote:
> > > Oprofile has been a pretty bad fit for them, and while I'm slightly more
> > 
> > You could always use a extension of timer mode that reads them
> > periodically? 
> > 
> This is what I do today, but it is not an ideal solution. It would be
> nice if these sorts of use cases could be supported by newer frameworks
> without every platform with similar requirements having to implement
> workarounds hanging off of the timer IRQ.

But you shouldn't hang off the timer irq anyways, but better use a regular
timer or hr timer. This would give more regular sampling even with dyntick.
And doing such a timer is only a few lines of code, I'm not sure it would
buy you all that much to generalize it.

-Andi

-- 
ak@linux.intel.com

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 0/3] [Announcement] Performance Counters for Linux
@ 2008-12-10 11:03                                     ` Andi Kleen
  0 siblings, 0 replies; 73+ messages in thread
From: Andi Kleen @ 2008-12-10 11:03 UTC (permalink / raw)
  To: Paul Mundt, Andi Kleen, David Miller, mingo, a.p.zijlstra, paulus, tglx

On Wed, Dec 10, 2008 at 07:23:36PM +0900, Paul Mundt wrote:
> On Wed, Dec 10, 2008 at 11:28:19AM +0100, Andi Kleen wrote:
> > > Oprofile has been a pretty bad fit for them, and while I'm slightly more
> > 
> > You could always use a extension of timer mode that reads them
> > periodically? 
> > 
> This is what I do today, but it is not an ideal solution. It would be
> nice if these sorts of use cases could be supported by newer frameworks
> without every platform with similar requirements having to implement
> workarounds hanging off of the timer IRQ.

But you shouldn't hang off the timer irq anyways, but better use a regular
timer or hr timer. This would give more regular sampling even with dyntick.
And doing such a timer is only a few lines of code, I'm not sure it would
buy you all that much to generalize it.

-Andi

-- 
ak@linux.intel.com

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [perfmon2] [patch 0/3] [Announcement] Performance Counters for Linux
  2008-12-06  2:36 ` stephane eranian
@ 2008-12-10 16:27     ` Rob Fowler
  2008-12-10 16:27     ` Rob Fowler
  1 sibling, 0 replies; 73+ messages in thread
From: Rob Fowler @ 2008-12-10 16:27 UTC (permalink / raw)
  To: eranian
  Cc: Thomas Gleixner, linux-arch, Peter Zijlstra, David Miller, LKML,
	Steven Rostedt, Eric Dumazet, Paul Mackerras, Peter Anvin,
	Andrew Morton, Ingo Molnar, perfmon2-devel, Arjan van de Veen

My reaction is more from a downstream tool developer and end user perspective.

What I don't see in the new proposal is support for real end users of hardware
performance counter information.  There is a long-existing community that is using the
counters, including the hardware designers, driver writers, tool developers, and
performance tuning specialists working for both vendors and end customers.   Not
everyone is in the same camp, as each the hardware capabilities change from revision to
revision of the chips as features are added, architectures evolve, and implementations are
cleaned up.  System vendors have their own tools and developers (SpeedShop, Vtune, Tprof, Sun Studio
Code Analyst, etc). There are academic and open source efforts with long histories (PAPI,
oprofile, HPCToolkit (Rice, not IBM), etc). We've lived with proprietary drivers/APIs and with
a succession of open-source drivers (pci, perfctr, oprofile, perfmon).  (My apologies to
readers/developers whose favorite tool(s) I haven't mentioned.)  Out-and-out religious wars
have not erupted, but there are a lot of healthy disagreements. A significant part of this
community has been converging around Perfmon2/3, not because it is a thing of beauty, but
because it is a tool that exposes the full HPM capabilities (which are often ugly) in a useful
way for a community of tool developers and end users.

Before considering this new proposal seriously, I'd need to see it proven.  This means
that it needs to be developed, by the proposers, enough to be used seriously.  I've
got collaborators that measure compute resources in units of tens of TeraFLOP-years, so
my definition of "seriously" is that the HPM tool chain has to work with low overhead
on huge clusters of multi-core, multi-socket machines and it has to be able to provide
performance insights that will let us get even more performance out of applications
that already do pretty well.  Google and other large users have similar notions of "serious".

Here's my set of strawman requirements:

-- Can it support a *completely* functional PAPI?  There are a lot of tools (HPCToolkit,
    TAU, etc.) built on this layer.

-- Means to support IBS/EBS profiling and efficiently record execution contexts?  Can it
    support event-based call stack profiling?

-- Can it supplant or support oprofile by supporting the tools (Code Analyst, etc) that
    depend on it?

-- Kernel and daemon profiling capabilities?

-- Does it have sufficiently low overhead?  Six years ago DCPI/ProfileMe was capable of
    collecting around 5000 samples/second on a quad socket 1GHz Alpha EV67 system with
    about a 1.5% overhead.  That's the gold standard. Oprofile and pfmon are not far off
    that mark.

-- Does it even scale within one box?  My workhorse systems today are quad-socket Barcelonas.
    I'm reliably using multiple, cooperating (Some measure on-core, others measure off-core events.)
    instances of pfmon to collect profiles using  all 64 (4 per core x 16 cores) counters
    productively with low overhead.  Real soon now I will have similar expectations
    regarding multi-socket Nehalems where the resources will be 7 (heterogeneous) counters per
    core plus 8 "uncore" counters (I prefer "nest", Alex Mericas' terminology.) per socket.


Regards,
Rob


stephane eranian wrote:
> Hello,
> 
> I have been reading all the threads after this unexpected announcement
> of a competing proposal for an interface to access the performance counters.
> I would like to respond to some of the things I have seen.
> 

  <<<<<< Details of Stephane's comment's elided >>>>>>

> 
> In summary, although the idea of simplifying tools by moving the
> complexity elsewhere is legitimate, pushing it down to the kernel
> is the wrong approach in my opinion, perfmon has avoided that as much
> as possible for good reasons. We have shown , with libpfm,
> that a large part of complexity can easily be encapsulated into a user
> library. I also don't think the approach of managing events
> independently of each others works for all processors. As pointed out
> by others, there are other factors at stake and they may not
> even be on the same core.
> 
> S. Eranian
> 
> ------------------------------------------------------------------------------
> SF.Net email is Sponsored by MIX09, March 18-20, 2009 in Las Vegas, Nevada.
> The future of the web can't happen without you.  Join us at MIX09 to help
> pave the way to the Next Web now. Learn more and register at
> http://ad.doubleclick.net/clk;208669438;13503038;i?http://2009.visitmix.com/
> _______________________________________________
> perfmon2-devel mailing list
> perfmon2-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/perfmon2-devel

-- 
Robert J. Fowler
Chief Domain Scientist, HPC
Renaissance Computing Institute
The University of North Carolina at Chapel Hill
100 Europa Dr, Suite 540
Chapel Hill, NC 27517
V: 919.445.9670
F: 919 445.9669
rjf@renci.org

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 0/3] [Announcement] Performance Counters for Linux
@ 2008-12-10 16:27     ` Rob Fowler
  0 siblings, 0 replies; 73+ messages in thread
From: Rob Fowler @ 2008-12-10 16:27 UTC (permalink / raw)
  To: eranian
  Cc: linux-arch, Peter Zijlstra, Ingo Molnar, LKML, Steven Rostedt,
	Eric Dumazet, Andrew Morton, Paul Mackerras, Peter Anvin,
	Thomas Gleixner, David Miller, perfmon2-devel, Arjan van de Veen

My reaction is more from a downstream tool developer and end user perspective.

What I don't see in the new proposal is support for real end users of hardware
performance counter information.  There is a long-existing community that is using the
counters, including the hardware designers, driver writers, tool developers, and
performance tuning specialists working for both vendors and end customers.   Not
everyone is in the same camp, as each the hardware capabilities change from revision to
revision of the chips as features are added, architectures evolve, and implementations are
cleaned up.  System vendors have their own tools and developers (SpeedShop, Vtune, Tprof, Sun Studio
Code Analyst, etc). There are academic and open source efforts with long histories (PAPI,
oprofile, HPCToolkit (Rice, not IBM), etc). We've lived with proprietary drivers/APIs and with
a succession of open-source drivers (pci, perfctr, oprofile, perfmon).  (My apologies to
readers/developers whose favorite tool(s) I haven't mentioned.)  Out-and-out religious wars
have not erupted, but there are a lot of healthy disagreements. A significant part of this
community has been converging around Perfmon2/3, not because it is a thing of beauty, but
because it is a tool that exposes the full HPM capabilities (which are often ugly) in a useful
way for a community of tool developers and end users.

Before considering this new proposal seriously, I'd need to see it proven.  This means
that it needs to be developed, by the proposers, enough to be used seriously.  I've
got collaborators that measure compute resources in units of tens of TeraFLOP-years, so
my definition of "seriously" is that the HPM tool chain has to work with low overhead
on huge clusters of multi-core, multi-socket machines and it has to be able to provide
performance insights that will let us get even more performance out of applications
that already do pretty well.  Google and other large users have similar notions of "serious".

Here's my set of strawman requirements:

-- Can it support a *completely* functional PAPI?  There are a lot of tools (HPCToolkit,
    TAU, etc.) built on this layer.

-- Means to support IBS/EBS profiling and efficiently record execution contexts?  Can it
    support event-based call stack profiling?

-- Can it supplant or support oprofile by supporting the tools (Code Analyst, etc) that
    depend on it?

-- Kernel and daemon profiling capabilities?

-- Does it have sufficiently low overhead?  Six years ago DCPI/ProfileMe was capable of
    collecting around 5000 samples/second on a quad socket 1GHz Alpha EV67 system with
    about a 1.5% overhead.  That's the gold standard. Oprofile and pfmon are not far off
    that mark.

-- Does it even scale within one box?  My workhorse systems today are quad-socket Barcelonas.
    I'm reliably using multiple, cooperating (Some measure on-core, others measure off-core events.)
    instances of pfmon to collect profiles using  all 64 (4 per core x 16 cores) counters
    productively with low overhead.  Real soon now I will have similar expectations
    regarding multi-socket Nehalems where the resources will be 7 (heterogeneous) counters per
    core plus 8 "uncore" counters (I prefer "nest", Alex Mericas' terminology.) per socket.


Regards,
Rob


stephane eranian wrote:
> Hello,
> 
> I have been reading all the threads after this unexpected announcement
> of a competing proposal for an interface to access the performance counters.
> I would like to respond to some of the things I have seen.
> 

  <<<<<< Details of Stephane's comment's elided >>>>>>

> 
> In summary, although the idea of simplifying tools by moving the
> complexity elsewhere is legitimate, pushing it down to the kernel
> is the wrong approach in my opinion, perfmon has avoided that as much
> as possible for good reasons. We have shown , with libpfm,
> that a large part of complexity can easily be encapsulated into a user
> library. I also don't think the approach of managing events
> independently of each others works for all processors. As pointed out
> by others, there are other factors at stake and they may not
> even be on the same core.
> 
> S. Eranian
> 
> ------------------------------------------------------------------------------
> SF.Net email is Sponsored by MIX09, March 18-20, 2009 in Las Vegas, Nevada.
> The future of the web can't happen without you.  Join us at MIX09 to help
> pave the way to the Next Web now. Learn more and register at
> http://ad.doubleclick.net/clk;208669438;13503038;i?http://2009.visitmix.com/
> _______________________________________________
> perfmon2-devel mailing list
> perfmon2-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/perfmon2-devel

-- 
Robert J. Fowler
Chief Domain Scientist, HPC
Renaissance Computing Institute
The University of North Carolina at Chapel Hill
100 Europa Dr, Suite 540
Chapel Hill, NC 27517
V: 919.445.9670
F: 919 445.9669
rjf@renci.org

------------------------------------------------------------------------------
SF.Net email is Sponsored by MIX09, March 18-20, 2009 in Las Vegas, Nevada.
The future of the web can't happen without you.  Join us at MIX09 to help
pave the way to the Next Web now. Learn more and register at
http://ad.doubleclick.net/clk;208669438;13503038;i?http://2009.visitmix.com/

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 0/3] [Announcement] Performance Counters for Linux
  2008-12-10 16:27     ` Rob Fowler
@ 2008-12-10 17:11       ` Andi Kleen
  -1 siblings, 0 replies; 73+ messages in thread
From: Andi Kleen @ 2008-12-10 17:11 UTC (permalink / raw)
  To: Rob Fowler
  Cc: eranian, linux-arch, Peter Zijlstra, Ingo Molnar, LKML,
	Steven Rostedt, Eric Dumazet, Andrew Morton, Paul Mackerras,
	Peter Anvin, Thomas Gleixner, David Miller, perfmon2-devel,
	Arjan van de Veen

Rob Fowler <rjf@renci.org> writes:
>
> -- Can it supplant or support oprofile by supporting the tools (Code Analyst, etc) that
>     depend on it?

There's no need to supplant/support oprofile really because at least
short term oprofile will not go away.

-Andi

-- 
ak@linux.intel.com

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 0/3] [Announcement] Performance Counters for Linux
@ 2008-12-10 17:11       ` Andi Kleen
  0 siblings, 0 replies; 73+ messages in thread
From: Andi Kleen @ 2008-12-10 17:11 UTC (permalink / raw)
  To: Rob Fowler
  Cc: linux-arch, Thomas Gleixner, Peter Zijlstra, David Miller, LKML,
	Steven Rostedt, Eric Dumazet, Andrew Morton, Paul Mackerras,
	Peter Anvin, eranian, Ingo Molnar, perfmon2-devel,
	Arjan van de Veen

Rob Fowler <rjf@renci.org> writes:
>
> -- Can it supplant or support oprofile by supporting the tools (Code Analyst, etc) that
>     depend on it?

There's no need to supplant/support oprofile really because at least
short term oprofile will not go away.

-Andi

-- 
ak@linux.intel.com

------------------------------------------------------------------------------
SF.Net email is Sponsored by MIX09, March 18-20, 2009 in Las Vegas, Nevada.
The future of the web can't happen without you.  Join us at MIX09 to help
pave the way to the Next Web now. Learn more and register at
http://ad.doubleclick.net/clk;208669438;13503038;i?http://2009.visitmix.com/

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 0/3] [Announcement] Performance Counters for Linux
@ 2008-12-05 21:24 Corey Ashford
  0 siblings, 0 replies; 73+ messages in thread
From: Corey Ashford @ 2008-12-05 21:24 UTC (permalink / raw)
  To: linux-kernel

> * Ingo Molnar <mingo@elte.hu> wrote:
> 
>> > >  - No interaction with ptrace: any task (with sufficient permissions) can
>> > >    monitor other tasks, without having to stop that task.
>> > 
>> > This isn't going to work.
>> >
>> > If you look at the things the perfmon libraries do, you do need to stop 
>> > the task.
>> >
>> > Consider counter virtualization as the most direct example. [...]
>> 
>> Note that counter virtualization is not offered in the perfmon3 patchset that has 
>> been posted to lkml. (It is part of the much larger 'full' perfmon patchset which 
>> has not been submitted for integration)
>> 
>> Nevertheless we will offer counter virtualization in -v2 of our patchset [...]
> 
> i've just implemented it. Running an (infinite-loop) hello.c with 6 counters on a 
> CPU that has only two counters now gives the expected:
> 
>  counter[0 cycles              ]:           3368245084 , delta: 842019470 events
>  counter[1 instructions        ]:           1384678210 , delta: 346108294 events
>  counter[2 cache-refs          ]:                  659 , delta: 150 events
>  counter[3 cache-misses        ]:                    0 
>  counter[4 branch-instructions ]:            266919398 , delta: 66731508 events
>  counter[5 branch-misses       ]:                 1201 , delta: 315 events
> 
> This will be in -v2.
> 
> 	Ingo
>

When you use the term "virtualization" here, I think you mean "event set 
multiplexing" in perfmon terms.  When perfmon talks about 
virtualization, it's the virtualizing of a small counter (e.g. 32-bits) 
to a 64-bit counter via its overflow interrupt.  And 64-bit counter 
support is included in the perfmon3 posted to LKML.

One thing that PAPI needs is some control over which events are in each 
event "set", to use a perfmon term.  In particular, it needs to have a 
cycles counter in each set so that it can properly scale the event 
counts at the time it reads them up.

With your proposal:

* Would there be a way to force a particular event to be in every event 
set that is scheduled onto the processor?

* When monitoring program reads up the counts, how would it find the 
individual cycles count for each set?

* How would it know which other events were in the same set?

* Would it force the round robin scheduling to only a single event 
(paired with the cycles event) in each set?

* On what basis is the round robin scheduling performed?  Time?  Upon 
the overflow of an event counter?  If there is more than one option, how 
is it specified and tweaked? If time is one of the options, how does the 
caller specify the the round-robin switching rate?

These are all things that are supported in a very flexible way in 
perfmon3 (full).

Regards,

- Corey

Corey Ashford
Software Engineer
IBM Linux Technology Center, Linux Toolchain
Beaverton, OR
503-578-3507
cjashfor@us.ibm.com




^ permalink raw reply	[flat|nested] 73+ messages in thread

end of thread, other threads:[~2008-12-10 17:11 UTC | newest]

Thread overview: 73+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-12-04 23:44 [patch 0/3] [Announcement] Performance Counters for Linux Thomas Gleixner
2008-12-04 23:44 ` [patch 1/3] performance counters: core code Thomas Gleixner
2008-12-05 10:55   ` Paul Mackerras
2008-12-05 11:20     ` Ingo Molnar
2008-12-04 23:44 ` [patch 2/3] performance counters: documentation Thomas Gleixner
2008-12-05  0:33   ` Paul Mackerras
2008-12-05  0:37     ` David Miller
2008-12-05  2:50       ` Arjan van de Ven
2008-12-05  3:26         ` David Miller
2008-12-05  2:33     ` Andi Kleen
2008-12-04 23:45 ` [patch 3/3] performance counters: x86 support Thomas Gleixner
2008-12-05  0:22 ` [patch 0/3] [Announcement] Performance Counters for Linux Paul Mackerras
2008-12-05  6:31   ` Ingo Molnar
2008-12-05  7:02     ` Arjan van de Ven
2008-12-05  7:52       ` David Miller
2008-12-05  7:03     ` Ingo Molnar
2008-12-05  7:16       ` Peter Zijlstra
2008-12-05  7:57         ` Paul Mackerras
2008-12-05  8:03           ` Peter Zijlstra
2008-12-05  8:07             ` David Miller
2008-12-05  8:11               ` Ingo Molnar
2008-12-05  8:17                 ` David Miller
2008-12-05  8:24                   ` Ingo Molnar
2008-12-05  8:27                     ` David Miller
2008-12-05  8:42                       ` Ingo Molnar
2008-12-05  8:49                         ` David Miller
2008-12-05 12:13                           ` Ingo Molnar
2008-12-05 12:39                         ` Andi Kleen
2008-12-05 20:08                           ` David Miller
2008-12-10  3:48                             ` Paul Mundt
2008-12-10  4:42                               ` Paul Mackerras
2008-12-10  8:43                               ` Mikael Pettersson
2008-12-10 10:28                               ` Andi Kleen
2008-12-10 10:28                                 ` Andi Kleen
2008-12-10 10:23                                 ` Paul Mundt
2008-12-10 11:03                                   ` Andi Kleen
2008-12-10 11:03                                     ` Andi Kleen
2008-12-05 15:00               ` Arjan van de Ven
2008-12-05  9:16             ` Paul Mackerras
2008-12-05  7:57       ` David Miller
2008-12-05  8:18         ` Ingo Molnar
2008-12-05  8:20           ` David Miller
2008-12-05  7:54     ` Paul Mackerras
2008-12-05  8:08       ` Ingo Molnar
2008-12-05  8:15         ` David Miller
2008-12-05 13:25           ` Ingo Molnar
2008-12-05  9:10         ` Paul Mackerras
2008-12-05 12:07           ` Ingo Molnar
2008-12-06  0:05             ` Paul Mackerras
2008-12-06  1:23               ` Mikael Pettersson
2008-12-06 12:34               ` Peter Zijlstra
2008-12-07  5:15                 ` Paul Mackerras
2008-12-08  7:18                   ` stephane eranian
2008-12-08 11:11                     ` Ingo Molnar
2008-12-08 11:58                       ` David Miller
2008-12-09  0:21                       ` stephane eranian
2008-12-05  0:22 ` H. Peter Anvin
2008-12-05  0:43   ` Paul Mackerras
2008-12-05  1:12 ` David Miller
2008-12-05  6:10   ` Ingo Molnar
2008-12-05  7:50     ` David Miller
2008-12-05  9:34     ` Paul Mackerras
2008-12-05 10:41       ` Ingo Molnar
2008-12-05 10:05     ` Ingo Molnar
2008-12-05  3:30 ` Andrew Morton
2008-12-06  2:36 ` stephane eranian
2008-12-08  2:12   ` [perfmon2] [patch 0/3] [Announcement] Performance Counters forLinux Dan Terpstra
2008-12-08  2:12     ` Dan Terpstra
2008-12-10 16:27   ` [perfmon2] [patch 0/3] [Announcement] Performance Counters for Linux Rob Fowler
2008-12-10 16:27     ` Rob Fowler
2008-12-10 17:11     ` Andi Kleen
2008-12-10 17:11       ` Andi Kleen
2008-12-05 21:24 Corey Ashford

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.