linux-kernel.vger.kernel.org archive mirror
* [PATCH 9/16] 2.6.17-rc6 perfmon2 patch for review: kernel-level API support (kapi)
@ 2006-06-15  9:07 Stephane Eranian
  2006-06-16 13:50 ` Christoph Hellwig
  0 siblings, 1 reply; 10+ messages in thread
From: Stephane Eranian @ 2006-06-15  9:07 UTC
  To: linux-kernel; +Cc: eranian

This patch contains the kernel-level API support.




--- linux-2.6.17-rc6.orig/perfmon/perfmon_kapi.c	1969-12-31 16:00:00.000000000 -0800
+++ linux-2.6.17-rc6/perfmon/perfmon_kapi.c	2006-06-08 01:49:22.000000000 -0700
@@ -0,0 +1,458 @@
+/*
+ * perfmon_kapi.c: perfmon2 kernel level interface
+ *
+ * This file implements the perfmon2 interface which
+ * provides access to the hardware performance counters
+ * of the host processor.
+ *
+ * Copyright (c) 2006 Hewlett-Packard Development Company, L.P.
+ * Contributed by Stephane Eranian <eranian@hpl.hp.com>
+ *
+ * More information about perfmon available at:
+ * 	http://perfmon2.sf.net
+ *
+ * perfmon2 KAPI overview:
+ *  The goal is to allow kernel-level code to use the perfmon2
+ *  interface for both counting and sampling. It is not possible
+ *  to directly use the system calls because they expect parameters
+ *  from user level. The kernel-level interface is more restrictive
+ *  because of inherent kernel constraints. The limited interface
+ *  is composed of a set of functions implemented in this file. For
+ *  ease of use, they mimic the names of the user level interface, e.g.
+ *  pfmk_create_context()  is the equivalent of pfm_create_context().
+ *  The pfmk_ prefix is used on all calls. Those can be called from
+ *  kernel modules or core kernel files.
+ *
+ *  The kernel-level perfmon api (KAPI) does not use file descriptors
+ *  to identify a context. Instead an opaque (void *) descriptor is used.
+ *  It is returned by pfmk_create_context() and must be passed to all
+ *  subsequent pfmk_*() calls. The calls are:
+ *  	pfmk_create_context();
+ *  	pfmk_write_pmcs();
+ *  	pfmk_write_pmds();
+ *  	pfmk_read_pmds();
+ *  	pfmk_restart();
+ *  	pfmk_stop();
+ *  	pfmk_start();
+ *  	pfmk_load_context();
+ *  	pfmk_unload_context();
+ *  	pfmk_delete_evtsets();
+ *  	pfmk_create_evtsets();
+ *  	pfmk_getinfo_evtsets();
+ *  	pfmk_close();
+ *  	pfmk_read();
+ *
+ *  Unlike pfm_create_context(), the KAPI equivalent, pfmk_create_context()
+ *  does not trigger the PMU description module to be inserted automatically
+ *  (if known). That means that the call may fail if no PMU description module
+ *  is inserted in advance. This is a restriction to avoid deadlocks during
+ *  insmod.
+ *
+ *  When sampling, the kernel level sampling buffer base address is directly
+ *  returned by pfmk_create_context(). There is no re-mapping necessary.
+ *
+ * When sampling, the buffer overflow notification can generate a message.
+ * But given that there is no file descriptor, it is not possible to use a
+ * plain read() call. Instead the pfmk_read() function must be invoked. It
+ * returns one message at a time. The pfmk_read() function blocks
+ * when there is no message, unless the noblock parameter is set to 1.
+ * Because there is no file descriptor, it would be hard for a kernel thread
+ * to wait on an overflow notification message and something else. It would
+ * also be hard to get out, should the thread need to terminate. To avoid this
+ * problem, pfmk_create_context() requires that a completion structure be
+ * passed. It is used during pfmk_read() to wait on an event. But the completion
+ * is visible outside the perfmon context and can be used to signal other events
+ * as well. Upon return from pfmk_read() the caller must check the return value:
+ * if it is zero, no message was extracted and the reason for waking up is
+ * outside the scope of perfmon.
+ *
+ * perfmon2 KAPI known restrictions:
+ * 	- only system-wide contexts are supported
+ * 	- with a sampling buffer defined, it is not possible
+ * 	  to call pfmk_close() from an interrupt context
+ * 	  (e.g. from IPI handler)
+ */
+#include <linux/kernel.h>
+#include <linux/perfmon.h>
+#include <linux/module.h>
+#include <asm/uaccess.h>
+
+static int pfmk_get_smpl_arg(pfm_uuid_t uuid, void *addr, size_t size,
+		     struct pfm_smpl_fmt **fmt)
+{
+	struct pfm_smpl_fmt *f;
+	size_t sz;
+	int ret;
+
+	if (!pfm_use_smpl_fmt(uuid))
+		return 0;
+
+	/*
+	 * find fmt and increase refcount
+	 */
+	f = pfm_smpl_fmt_get(uuid);
+	if (f == NULL) {
+		PFM_DBG("buffer format not found");
+		return -EINVAL;
+	}
+
+	sz = f->fmt_arg_size;
+
+	/*
+	 * size == -1 is accepted for IA-64 backward compatibility
+	 */
+	ret = -EINVAL;
+	if (sz != size && size != (size_t)-1) {
+		PFM_DBG("invalid arg size %zu, format expects %zu",
+			size, sz);
+		goto error;
+	}
+	*fmt = f;
+	return 0;
+
+error:
+	pfm_smpl_fmt_put(f);
+	return ret;
+}
+
+/*
+ * req: pointer to context creation argument. ctx_flags must have
+ *      PFM_FL_SYSTEM_WIDE set.
+ *
+ * smpl_arg: optional sampling format option argument. NULL if unused
+ * smpl_size: size of optional sampling format argument. 0 if unused
+ * c       : pointer to completion structure. Call does not initialize the
+ * 	     struct (i.e. no init_completion). Completion used with pfmk_read()
+ * Return:
+ * desc    : pointer to opaque context descriptor. unique identifier for context
+ * buf     : pointer to base of sampling buffer. Pass NULL if unused
+ */
+int pfmk_create_context(struct pfarg_ctx *req, void *smpl_arg,
+			size_t smpl_size,
+			struct completion *c,
+			void **desc,
+			void **buf)
+{
+	struct pfm_context *new_ctx;
+	struct pfm_smpl_fmt *fmt = NULL;
+	int ret = -EFAULT;
+
+	if (desc == NULL)
+		return -EINVAL;
+
+	if (c == NULL)
+		return -EINVAL;
+
+	if ((req->ctx_flags & PFM_FL_SYSTEM_WIDE) == 0) {
+		PFM_DBG("kapi only supports system-wide context\n");
+		return -EINVAL;
+	}
+
+	ret = pfmk_get_smpl_arg(req->ctx_smpl_buf_id, smpl_arg, smpl_size, &fmt);
+	if (ret)
+		return ret;
+
+	ret = __pfm_create_context(req, fmt, smpl_arg, PFM_KAPI, c, &new_ctx);
+	if (!ret) {
+		*desc = new_ctx;
+		/*
+		 * return base of sampling buffer
+		 */
+		if (buf)
+			*buf = new_ctx->smpl_addr;
+	}
+	return ret;
+}
+EXPORT_SYMBOL(pfmk_create_context);
+
+int pfmk_write_pmcs(void *desc, struct pfarg_pmc *req, int count)
+{
+	struct pfm_context *ctx;
+	unsigned long flags;
+	int ret;
+
+	if (count < 0 || desc == NULL)
+		return -EINVAL;
+
+	ctx = desc;
+
+	spin_lock_irqsave(&ctx->lock, flags);
+
+	ret = pfm_check_task_state(ctx, PFM_CMD_STOPPED, &flags);
+	if (ret == 0)
+		ret = __pfm_write_pmcs(ctx, req, count);
+
+	spin_unlock_irqrestore(&ctx->lock, flags);
+
+	return ret;
+}
+EXPORT_SYMBOL(pfmk_write_pmcs);
+
+int pfmk_write_pmds(void *desc, struct pfarg_pmd *req, int count)
+{
+	struct pfm_context *ctx;
+	unsigned long flags;
+	int ret;
+
+	if (count < 0 || desc == NULL)
+		return -EINVAL;
+
+	ctx = desc;
+
+	spin_lock_irqsave(&ctx->lock, flags);
+
+	ret = pfm_check_task_state(ctx, PFM_CMD_STOPPED, &flags);
+	if (ret == 0)
+		ret = __pfm_write_pmds(ctx, req, count, 0);
+
+	spin_unlock_irqrestore(&ctx->lock, flags);
+
+	return ret;
+}
+EXPORT_SYMBOL(pfmk_write_pmds);
+
+int pfmk_read_pmds(void *desc, struct pfarg_pmd *req, int count)
+{
+	struct pfm_context *ctx;
+	unsigned long flags;
+	int ret;
+
+	if (count < 0 || desc == NULL)
+		return -EINVAL;
+
+	ctx = desc;
+
+	spin_lock_irqsave(&ctx->lock, flags);
+
+	ret = pfm_check_task_state(ctx, PFM_CMD_STOPPED, &flags);
+	if (ret == 0)
+		ret = __pfm_read_pmds(ctx, req, count);
+
+	spin_unlock_irqrestore(&ctx->lock, flags);
+
+	return ret;
+}
+EXPORT_SYMBOL(pfmk_read_pmds);
+
+int pfmk_restart(void *desc)
+{
+	struct pfm_context *ctx;
+	unsigned long flags;
+	int ret = 0;
+
+	ctx = desc;
+
+	spin_lock_irqsave(&ctx->lock, flags);
+
+	ret = pfm_check_task_state(ctx, 0, &flags);
+	if (ret == 0)
+		ret = __pfm_restart(ctx);
+
+	spin_unlock_irqrestore(&ctx->lock, flags);
+
+	return ret;
+}
+EXPORT_SYMBOL(pfmk_restart);
+
+
+int pfmk_stop(void *desc)
+{
+	struct pfm_context *ctx;
+	unsigned long flags;
+	int ret;
+
+	ctx = desc;
+
+	spin_lock_irqsave(&ctx->lock, flags);
+
+	ret = pfm_check_task_state(ctx, PFM_CMD_STOPPED, &flags);
+	if (ret == 0)
+		ret = __pfm_stop(ctx);
+
+	spin_unlock_irqrestore(&ctx->lock, flags);
+	return ret;
+}
+EXPORT_SYMBOL(pfmk_stop);
+
+int pfmk_start(void *desc, struct pfarg_start *req)
+{
+	struct pfm_context *ctx;
+	unsigned long flags;
+	int ret = 0;
+
+	if (desc == NULL)
+		return -EINVAL;
+	ctx = desc;
+
+	spin_lock_irqsave(&ctx->lock, flags);
+
+	ret = pfm_check_task_state(ctx, PFM_CMD_STOPPED, &flags);
+	if (ret == 0)
+		ret = __pfm_start(ctx, req);
+
+	spin_unlock_irqrestore(&ctx->lock, flags);
+
+	return ret;
+}
+EXPORT_SYMBOL(pfmk_start);
+
+int pfmk_load_context(void *desc, struct pfarg_load *req)
+{
+	struct pfm_context *ctx;
+	unsigned long flags;
+	int ret;
+
+	if (desc == NULL)
+		return -EINVAL;
+
+	ctx = desc;
+
+	spin_lock_irqsave(&ctx->lock, flags);
+
+	ret = pfm_check_task_state(ctx, PFM_CMD_STOPPED, &flags);
+	if (ret == 0)
+		ret = __pfm_load_context(ctx, req);
+
+	spin_unlock_irqrestore(&ctx->lock, flags);
+
+	return ret;
+}
+EXPORT_SYMBOL(pfmk_load_context);
+
+
+int pfmk_unload_context(void *desc)
+{
+	struct pfm_context *ctx;
+	unsigned long flags;
+	int ret = 0;
+
+	if (desc == NULL)
+		return -EINVAL;
+
+	ctx = desc;
+
+	spin_lock_irqsave(&ctx->lock, flags);
+
+	ret = pfm_check_task_state(ctx, PFM_CMD_STOPPED|PFM_CMD_UNLOAD, &flags);
+	if (ret == 0)
+		ret = __pfm_unload_context(ctx, 0);
+
+	spin_unlock_irqrestore(&ctx->lock, flags);
+
+	return ret;
+}
+EXPORT_SYMBOL(pfmk_unload_context);
+
+int pfmk_delete_evtsets(void *desc, struct pfarg_setinfo *req, int count)
+{
+	struct pfm_context *ctx;
+	unsigned long flags;
+	int ret;
+
+	if (count < 0 || desc == NULL)
+		return -EINVAL;
+
+	ctx = desc;
+
+	spin_lock_irqsave(&ctx->lock, flags);
+
+	ret = pfm_check_task_state(ctx, PFM_CMD_UNLOADED, &flags);
+	if (ret == 0)
+		ret = __pfm_delete_evtsets(ctx, req, count);
+
+	spin_unlock_irqrestore(&ctx->lock, flags);
+
+	return ret;
+}
+EXPORT_SYMBOL(pfmk_delete_evtsets);
+
+int pfmk_create_evtsets(void *desc, struct pfarg_setdesc  *req, int count)
+{
+	struct pfm_context *ctx;
+	unsigned long flags;
+	int ret;
+
+	if (count < 0 || desc == NULL)
+		return -EINVAL;
+
+	ctx = desc;
+
+	spin_lock_irqsave(&ctx->lock, flags);
+
+	ret = pfm_check_task_state(ctx, PFM_CMD_UNLOADED, &flags);
+	if (ret == 0)
+		ret = __pfm_create_evtsets(ctx, req, count);
+
+	spin_unlock_irqrestore(&ctx->lock, flags);
+
+	return ret;
+}
+EXPORT_SYMBOL(pfmk_create_evtsets);
+
+int pfmk_getinfo_evtsets(void *desc, struct pfarg_setinfo *req, int count)
+{
+	struct pfm_context *ctx;
+	unsigned long flags;
+	int ret;
+
+	if (count < 0 || desc == NULL)
+		return -EINVAL;
+
+	ctx = desc;
+
+	spin_lock_irqsave(&ctx->lock, flags);
+
+	ret = pfm_check_task_state(ctx, 0, &flags);
+	if (ret == 0)
+		ret = __pfm_getinfo_evtsets(ctx, req, count);
+
+	spin_unlock_irqrestore(&ctx->lock, flags);
+
+	return ret;
+}
+EXPORT_SYMBOL(pfmk_getinfo_evtsets);
+
+int pfmk_close(void *desc)
+{
+	struct pfm_context *ctx;
+
+	if (desc == NULL)
+		return -EINVAL;
+
+	ctx = desc;
+
+	return __pfm_close(ctx, NULL);
+}
+EXPORT_SYMBOL(pfmk_close);
+
+/*
+ * desc   : opaque context descriptor
+ * msg    : pointer to message structure
+ * sz     : size of message argument. Must be equal to sizeof(*msg)
+ * noblock: 1 means do not wait for messages. 0 means wait for completion
+ *          signal.
+ *
+ * Note on completion:
+ *	- completion structure can be shared with code outside the perfmon2
+ *	  core. This function will return with 0, if there was a completion
+ *	  signal but no messages to read.
+ *
+ * Return:
+ *    0           : no message extracted, but awakened
+ *    sizeof(*msg): one message extracted
+ *    -EAGAIN     : noblock=1 and nothing to read
+ *    -ERESTARTSYS: noblock=0, signal pending
+ */
+ssize_t pfmk_read(void *desc, union pfm_msg *msg, size_t sz, int noblock)
+{
+	struct pfm_context *ctx;
+
+	if (desc == NULL || msg == NULL || sz != sizeof(*msg))
+		return -EINVAL;
+
+	ctx = desc;
+
+	/* fill the caller's buffer directly; a local copy would be lost */
+	return __pfmk_read(ctx, msg, noblock);
+}
+EXPORT_SYMBOL(pfmk_read);
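
For reviewers, here is a minimal userspace sketch of how a caller might
interpret the pfmk_read() return values listed in the comment above
pfmk_read(). It is not part of the patch. The classify_read() helper and
the read_outcome names are hypothetical, and ERESTARTSYS is defined
locally (kernel-internal value 512) because userspace headers do not
export it.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Kernel-internal errno, not visible in userspace headers. */
#ifndef ERESTARTSYS
#define ERESTARTSYS 512
#endif

/* Hypothetical outcome names, for illustration only. */
enum read_outcome {
	READ_GOT_MSG,      /* one message was extracted */
	READ_OTHER_WAKEUP, /* completion signaled, but no message */
	READ_WOULD_BLOCK,  /* noblock=1 and nothing to read */
	READ_INTERRUPTED,  /* noblock=0 and a signal is pending */
	READ_ERROR         /* anything else, e.g. -EINVAL */
};

/*
 * Map a pfmk_read() return value onto an outcome.
 * msg_size stands for sizeof(union pfm_msg).
 */
enum read_outcome classify_read(ssize_t ret, size_t msg_size)
{
	if (ret == 0)
		return READ_OTHER_WAKEUP;
	if (ret > 0 && (size_t)ret == msg_size)
		return READ_GOT_MSG;
	if (ret == -EAGAIN)
		return READ_WOULD_BLOCK;
	if (ret == -ERESTARTSYS)
		return READ_INTERRUPTED;
	return READ_ERROR;
}
```

The READ_OTHER_WAKEUP case is the one that lets a kernel thread share the
completion with other events, e.g. a request to terminate.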


* Re: [PATCH 9/16] 2.6.17-rc6 perfmon2 patch for review: kernel-level API support (kapi)
  2006-06-15  9:07 [PATCH 9/16] 2.6.17-rc6 perfmon2 patch for review: kernel-level API support (kapi) Stephane Eranian
@ 2006-06-16 13:50 ` Christoph Hellwig
  2006-06-16 14:02   ` Stephane Eranian
  0 siblings, 1 reply; 10+ messages in thread
From: Christoph Hellwig @ 2006-06-16 13:50 UTC
  To: Stephane Eranian; +Cc: linux-kernel, eranian

On Thu, Jun 15, 2006 at 02:07:38AM -0700, Stephane Eranian wrote:
> This patch contains the kernel-level API support.

NACK.  No one should call this from kernel space.

and apparently nothing in your patchkit does either, so this is just dead code.



* Re: [PATCH 9/16] 2.6.17-rc6 perfmon2 patch for review: kernel-level API support (kapi)
  2006-06-16 13:50 ` Christoph Hellwig
@ 2006-06-16 14:02   ` Stephane Eranian
  2006-06-16 14:56     ` Christoph Hellwig
  2006-06-16 15:41     ` Frank Ch. Eigler
  0 siblings, 2 replies; 10+ messages in thread
From: Stephane Eranian @ 2006-06-16 14:02 UTC
  To: Christoph Hellwig, linux-kernel; +Cc: systemtap, wcohen, perfmon

Hi,

On Fri, Jun 16, 2006 at 02:50:14PM +0100, Christoph Hellwig wrote:
> On Thu, Jun 15, 2006 at 02:07:38AM -0700, Stephane Eranian wrote:
> > This patch contains the kernel-level API support.
> 
> NACK.  No one should call this from kernel space.
> 

Well, that's what I initially thought too but there is a need from the SystemTap
people and given the way they set things up, it is hard to do it from user level.

> and apparently nothing in your patchkit does either, so this is just dead code.

I have no immediate need myself, but I have received several requests for
this, systemtap being one of them.

-- 
-Stephane


* Re: [PATCH 9/16] 2.6.17-rc6 perfmon2 patch for review: kernel-level API support (kapi)
  2006-06-16 14:02   ` Stephane Eranian
@ 2006-06-16 14:56     ` Christoph Hellwig
  2006-06-17  0:15       ` Alan Cox
  2006-06-16 15:41     ` Frank Ch. Eigler
  1 sibling, 1 reply; 10+ messages in thread
From: Christoph Hellwig @ 2006-06-16 14:56 UTC
  To: Stephane Eranian
  Cc: Christoph Hellwig, linux-kernel, systemtap, wcohen, perfmon

On Fri, Jun 16, 2006 at 07:02:34AM -0700, Stephane Eranian wrote:
> Well, that's what I initially thought too but there is a need from the SystemTap
> people and given the way they set things up, it is hard to do it from user level.

Systemtap doesn't matter.  Please don't put in useless stuff for their
broken requirements - they're all clueless idiots.



* Re: [PATCH 9/16] 2.6.17-rc6 perfmon2 patch for review: kernel-level API support (kapi)
  2006-06-16 14:02   ` Stephane Eranian
  2006-06-16 14:56     ` Christoph Hellwig
@ 2006-06-16 15:41     ` Frank Ch. Eigler
  2006-06-16 15:45       ` Christoph Hellwig
  1 sibling, 1 reply; 10+ messages in thread
From: Frank Ch. Eigler @ 2006-06-16 15:41 UTC
  To: eranian; +Cc: Christoph Hellwig, linux-kernel, systemtap, wcohen, perfmon


Stephane Eranian <eranian@hpl.hp.com> writes:

> > > This patch contains the kernel-level API support.
> > NACK.  No one should call this from kernel space.
>
> Well, that's what I initially thought too but there is a need from
> the SystemTap people and given the way they set things up, it is
> hard to do it from user level. [...]

Whether one uses systemtap, raw kprobes, or some specialized
tracing/stats-collecting patch surely forthcoming, kernel-level APIs
would be needed to perform fine-grained kernel-scope measurements
using these counters.

- FChE


* Re: [PATCH 9/16] 2.6.17-rc6 perfmon2 patch for review: kernel-level API support (kapi)
  2006-06-16 15:41     ` Frank Ch. Eigler
@ 2006-06-16 15:45       ` Christoph Hellwig
  2006-06-16 16:18         ` Frank Ch. Eigler
  2006-06-22 12:12         ` [perfmon] " Stephane Eranian
  0 siblings, 2 replies; 10+ messages in thread
From: Christoph Hellwig @ 2006-06-16 15:45 UTC
  To: Frank Ch. Eigler
  Cc: eranian, Christoph Hellwig, linux-kernel, systemtap, wcohen, perfmon

On Fri, Jun 16, 2006 at 11:41:32AM -0400, Frank Ch. Eigler wrote:
> Whether one uses systemtap, raw kprobes, or some specialized
> tracing/stats-collecting patch surely forthcoming, kernel-level APIs
> would be needed to perform fine-grained kernel-scope measurements
> using these counters.

No, there's no need to add kernel bloat for performance monitoring.
This kind of stuff should absolutely be done from userspace.



* Re: [PATCH 9/16] 2.6.17-rc6 perfmon2 patch for review: kernel-level API support (kapi)
  2006-06-16 15:45       ` Christoph Hellwig
@ 2006-06-16 16:18         ` Frank Ch. Eigler
  2006-06-22 12:12         ` [perfmon] " Stephane Eranian
  1 sibling, 0 replies; 10+ messages in thread
From: Frank Ch. Eigler @ 2006-06-16 16:18 UTC
  To: Christoph Hellwig, eranian, linux-kernel, systemtap, wcohen, perfmon

Hi -

> > Whether one uses systemtap, raw kprobes, or some specialized
> > tracing/stats-collecting patch surely forthcoming, kernel-level APIs
> > would be needed to perform fine-grained kernel-scope measurements
> > using these counters.
> 
> No, there's no need to add kernel bloat for performance monitoring.
> This kind of stuff should absolutely be done from userspace.

Userspace measurements provide only large-grained quantities.  Can you
argue convincingly that there is never a need to measure focused
quantities such as cache behaviors of individual subsystems, branch
prediction statistics of a new algorithm?  That running system-level
benchmarks is the most efficient way for developers to assess their
changes?  That the scheduler would not benefit from access to HT
resource utilization statistics?  All these sorts of efforts seem
to require a kernel-side perfmon API.

- FChE


* Re: [PATCH 9/16] 2.6.17-rc6 perfmon2 patch for review: kernel-level API support (kapi)
  2006-06-16 14:56     ` Christoph Hellwig
@ 2006-06-17  0:15       ` Alan Cox
  0 siblings, 0 replies; 10+ messages in thread
From: Alan Cox @ 2006-06-17  0:15 UTC
  To: Christoph Hellwig
  Cc: Stephane Eranian, linux-kernel, systemtap, wcohen, perfmon

Ar Gwe, 2006-06-16 am 15:56 +0100, ysgrifennodd Christoph Hellwig:
> On Fri, Jun 16, 2006 at 07:02:34AM -0700, Stephane Eranian wrote:
> > Well, that's what I initially thought too but there is a need from the SystemTap
> > people and given the way they set things up, it is hard to do it from user level.
> 
> Systemtap doesn't matter.  Please don't put in useless stuff for their
> broken requirements - they're all clueless idiots.

Christoph, thank you for your detailed analytical analysis. The kernel
list would not be the same without your detailed, well explanation and
reasoned rational analyses

Alan


* Re: [perfmon] Re: [PATCH 9/16] 2.6.17-rc6 perfmon2 patch for review: kernel-level API support (kapi)
  2006-06-16 15:45       ` Christoph Hellwig
  2006-06-16 16:18         ` Frank Ch. Eigler
@ 2006-06-22 12:12         ` Stephane Eranian
  2006-06-22 17:00           ` William Cohen
  1 sibling, 1 reply; 10+ messages in thread
From: Stephane Eranian @ 2006-06-22 12:12 UTC
  To: Christoph Hellwig, Frank Ch. Eigler, linux-kernel, systemtap,
	wcohen, perfmon
  Cc: fche, linux-kernel, systemtap, wcohen, perfmon

Christoph,

On Fri, Jun 16, 2006 at 04:45:19PM +0100, Christoph Hellwig wrote:
> On Fri, Jun 16, 2006 at 11:41:32AM -0400, Frank Ch. Eigler wrote:
> > Whether one uses systemtap, raw kprobes, or some specialized
> > tracing/stats-collecting patch surely forthcoming, kernel-level APIs
> > would be needed to perform fine-grained kernel-scope measurements
> > using these counters.
> 
You do not need to be in the kernel to measure kernel level
execution. Monitoring is statistical by nature, this is not about capturing
execution traces. All PMU models have the capability to filter on privilege
levels so you can distinguish user from kernel.
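
To illustrate the privilege-level filtering, here is a minimal sketch
assuming the x86 architectural event-select register layout (event code
in bits 0-7, unit mask in bits 8-15, USR in bit 16, OS in bit 17, EN in
bit 22). The evtsel_encode() helper is hypothetical and is not part of
perfmon2. Other PMU models express the same idea with different encodings.

```c
#include <stdint.h>

#define EVTSEL_USR (1ULL << 16) /* count while at user level (CPL > 0) */
#define EVTSEL_OS  (1ULL << 17) /* count while at kernel level (CPL 0) */
#define EVTSEL_EN  (1ULL << 22) /* enable the counter */

/* Build an event-select value that counts at the requested levels. */
uint64_t evtsel_encode(uint8_t event, uint8_t umask, int user, int kernel)
{
	uint64_t val = (uint64_t)event | ((uint64_t)umask << 8) | EVTSEL_EN;

	if (user)
		val |= EVTSEL_USR;
	if (kernel)
		val |= EVTSEL_OS;
	return val;
}
```

For example, evtsel_encode(0x3c, 0x00, 0, 1) would select unhalted core
cycles counted at kernel level only, without any monitoring code running
inside the kernel paths being measured.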

To measure certain functions of the kernel, some PMU models provide a
way to restrict monitoring to a range of contiguous code addresses, e.g.
Itanium 2. 

The case of systemtap is different. I think they would like to start/stop
monitoring on certain systemtap events, e.g., a function is called, a
threshold is met. Start and stop would be triggered from a systemtap
callback which is implemented by a kernel module, if I understand
the architecture. In this scenario, the monitoring session would have
to be created and controlled from the kernel. One could envision an
architecture, where monitoring would be controlled from user level 
with systemtap making upcalls  but I do not think this is possible given
that the instrumentation points can be very low level.

Another usage for a kernel-level monitoring API that I know about is 
people who want to explore how to use the performance monitoring
(and profiles) to guide the scheduler. A thread profile can tell the cache
hit rates, stalls, bus bandwidth utilization, whether it uses flops and so on.
This could be useful to find the best placement for threads and avoid co-scheduling
threads that trash each other's micro-architectural state or saturate the memory bus.
In this scenario, one could envision a kernel thread controlling monitoring
and processing profiles for the scheduler. But, to concur with you Christoph,
I think this could be achieved from user level and the valuable information
may be passed to the scheduler via a specific system call for instance.
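
As a rough illustration of the kind of decision such a profile could
feed, here is a minimal sketch in plain C. Everything in it is
hypothetical: the thread_profile structure, the helper names and the
threshold policy are assumptions made for the example, not something
perfmon2 or the scheduler provides.

```c
#include <stdint.h>

/* Hypothetical per-thread profile built from counter samples. */
struct thread_profile {
	uint64_t cache_refs;
	uint64_t cache_misses;
};

/*
 * Miss rate in per-mille, using integer math only.
 * Assumes the counts are small enough that misses * 1000 does not overflow.
 */
unsigned miss_permille(const struct thread_profile *p)
{
	if (p->cache_refs == 0)
		return 0;
	return (unsigned)((p->cache_misses * 1000) / p->cache_refs);
}

/* Return 1 if the two threads may run side by side, 0 if both miss heavily. */
int may_coschedule(const struct thread_profile *a,
		   const struct thread_profile *b,
		   unsigned limit_permille)
{
	return !(miss_permille(a) > limit_permille &&
		 miss_permille(b) > limit_permille);
}
```

Nothing in this computation needs to run in the kernel, which is why the
profile could equally be processed at user level and handed to the
scheduler afterwards.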

> No, there's no need to add kernel bloat for performance monitoring.
> This kind of stuff should absolutely be done from userspace.

-- 
-Stephane


* Re: [perfmon] Re: [PATCH 9/16] 2.6.17-rc6 perfmon2 patch for review: kernel-level API support (kapi)
  2006-06-22 12:12         ` [perfmon] " Stephane Eranian
@ 2006-06-22 17:00           ` William Cohen
  0 siblings, 0 replies; 10+ messages in thread
From: William Cohen @ 2006-06-22 17:00 UTC
  To: eranian
  Cc: Christoph Hellwig, Frank Ch. Eigler, linux-kernel, systemtap, perfmon

Stephane Eranian wrote:
> Christoph,
> 
> On Fri, Jun 16, 2006 at 04:45:19PM +0100, Christoph Hellwig wrote:
> 
>>On Fri, Jun 16, 2006 at 11:41:32AM -0400, Frank Ch. Eigler wrote:
>>
>>>Whether one uses systemtap, raw kprobes, or some specialized
>>>tracing/stats-collecting patch surely forthcoming, kernel-level APIs
>>>would be needed to perform fine-grained kernel-scope measurements
>>>using these counters.
>>
> You do not need to be in the kernel to measure kernel level
> execution. Monitoring is statistical by nature, this is not about capturing
> execution traces. All PMU models have the capability to filter on privilege
> levels so you can distinguish user from kernel.
> 
> To measure certain functions of the kernel, some PMU models provide a
> way to restrict monitoring to a range of contiguous code addresses, e.g.
> Itanium 2. 

The filtering on privilege level is too coarse. For example, one may want
to start event counting on entry into a kernel function and stop when
exiting the function. The Itanium hardware is not ideal for this application:
the children functions may not be contiguous with the starting function.
Other kinds of predication based on state information, e.g. a particular
process or thread, could be very useful.
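
A minimal sketch of the entry/exit bracketing described above, written in
plain C so it can be tried outside the kernel. The span_meter structure
and the meter_enter()/meter_exit() helpers are hypothetical. In a real
probe handler the 'now' argument would come from an in-kernel read of the
counter.

```c
#include <stdint.h>

/* Hypothetical per-function meter fed with raw counter values. */
struct span_meter {
	uint64_t start;  /* counter value at outermost entry */
	uint64_t total;  /* events accumulated over all calls */
	unsigned depth;  /* nesting depth, to cope with recursion */
};

/* Called from the function-entry probe. */
void meter_enter(struct span_meter *m, uint64_t now)
{
	if (m->depth++ == 0)
		m->start = now;
}

/* Called from the function-return probe. */
void meter_exit(struct span_meter *m, uint64_t now)
{
	if (m->depth == 0)
		return; /* unbalanced exit, ignore */
	if (--m->depth == 0)
		m->total += now - m->start;
}
```

Because the measurement is bracketed in time rather than by address
range, events that occur in child functions are included no matter where
their code is located.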

> The case of systemtap is different. I think they would like to start/stop
> monitoring on certain systemtap events, e.g., a function is called, a
> threshold is met. Start and stop would be triggered from a systemtap
> callback which is implemented by a kernel module, if I understand
> the architecture. In this scenario, the monitoring session would have
> to be created and controlled from the kernel. One could envision an
> architecture, where monitoring would be controlled from user level 
> with systemtap making upcalls  but I do not think this is possible given
> that the instrumentation points can be very low level.
> 
> Another usage for a kernel-level monitoring API that I know about is 
> people who want to explore how to use the performance monitoring
> (and profiles) to guide the scheduler. A thread profile can tell the cache
> hit rates, stalls, bus bandwidth utilization, whether it uses flops and so on.
> This could be useful to find the best placement for threads and avoid co-scheduling
> threads that trash each other's micro-architectural state or saturate the memory bus.
> In this scenario, one could envision a kernel thread controlling monitoring
> and processing profiles for the scheduler. But, to concur with you Christoph,
> I think this could be achieved from user level and the valuable information
> may be passed to the scheduler via a specific system call for instance.

One probably could configure the performance monitoring hardware from 
userspace. However, for micro-measurement in the kernel it seems like 
the pmu reads in kernel space would still be required.

-Will
