linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH 0/3] Expose gpu counters via perf pmu driver
@ 2014-10-22 15:28 Robert Bragg
  2014-10-22 15:28 ` [RFC PATCH 1/3] perf: export perf_event_overflow Robert Bragg
                   ` (4 more replies)
  0 siblings, 5 replies; 16+ messages in thread
From: Robert Bragg @ 2014-10-22 15:28 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, Daniel Vetter, Chris Wilson, Rob Clark,
	Samuel Pitoiset, Ben Skeggs, Robert Bragg

Although I haven't seen any precedent for drivers using perf pmus to
expose device metrics, I've recently been experimenting with exposing
some of the performance counters of Intel Gen graphics hardware, to see
whether it makes sense to build on the perf infrastructure for our use
cases.

I've got a basic pmu driver working to expose our Observation
Architecture counters, and it seems like a fairly good fit. The main
caveat is that, to support the permission model we would like, I needed
to make some changes in events/core which I'd really appreciate some
feedback on...

In this case we're using the driver to support some performance
monitoring extensions in Mesa (AMD_performance_monitor +
INTEL_performance_query) and we don't want to require OpenGL clients to
run as root to be able to monitor a gpu context they own.

Our desired permission model seems consistent with perf's current model
whereby you would need privileges if you want to profile across all gpu
contexts but not need special permissions to profile your own context.

The awkward part is that it doesn't make sense for us to have userspace
open a perf event with a specific pid as the way to avoid needing root
permissions: a side effect of doing so is that the event will be
dynamically added/deleted so as to only monitor while that process is
scheduled on a cpu, which isn't really meaningful when we're monitoring
the gpu.

Conceptually I suppose we want to be able to open an event that's not
associated with any cpu or process, but to keep things simple and fit
with perf's current design, the pmu I have at the moment expects an
event to be opened for a specific cpu and an unspecified process.
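To make that concrete, the userspace side of opening such an event might be sketched as below. This is only an illustration: the pmu's dynamic type number would be read from sysfs at runtime (the `i915_oa` name matches the registration in patch 3), the config encoding is the I915_PERF_OA_* layout from patch 3, and none of this is a stable ABI yet.

```c
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdint.h>
#include <string.h>

/* Build an attr for a device event: raw samples only, so the event is
 * never scheduled in/out with any particular process. */
static void init_oa_attr(struct perf_event_attr *attr, uint32_t pmu_type,
			 uint64_t config)
{
	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);
	attr->type = pmu_type;		/* dynamic type, read from sysfs */
	attr->config = config;		/* report format, timer exponent, ctx */
	attr->sample_type = PERF_SAMPLE_RAW;
	attr->sample_period = 1;	/* real period comes from the exponent */
}

static int open_oa_event(uint32_t pmu_type, uint64_t config)
{
	struct perf_event_attr attr;

	init_oa_attr(&attr, pmu_type, config);
	/* pid = -1, cpu = 0: the event isn't associated with any process */
	return syscall(__NR_perf_event_open, &attr, -1, 0, -1, 0);
}
```

With the IS_DEVICE capability in place, that pid = -1, cpu = 0 combination no longer requires CAP_SYS_ADMIN for this pmu; the driver's own ownership check takes over instead.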

To bypass the cpu-centric permission checks, I added a
PERF_PMU_CAP_IS_DEVICE capability that a pmu driver can use to tell
events/core that it doesn't collect any cpu metrics, so events/core can
skip its usual checks and assume the driver will implement its own
checks as appropriate.

In addition I explicitly blacklist numerous attributes and
PERF_SAMPLE_ flags that I don't think make sense for a device pmu. This
could be handled in the pmu driver, but it seemed better to do in
events/core to avoid duplication in case we later have multiple device
pmus.

I'd be interested to hear whether it sounds reasonable to others for us
to expose gpu device metrics via a perf pmu, and whether adding the
PERF_PMU_CAP_IS_DEVICE flag as in the following patch would be
acceptable.

Patches:

[RFC PATCH 1/3] perf: export perf_event_overflow
[RFC PATCH 2/3] perf: Add PERF_PMU_CAP_IS_DEVICE flag
  The main change to core/events I'd really appreciate feedback on.

[RFC PATCH 3/3] i915: Expose PMU for Observation Architecture
  My current pmu driver, provided for context (work in progress). Early,
  high-level, feedback would be appreciated, though I think it could be
  good to focus on the core/events change first. I also plan to send
  this to the intel-gfx list for review.

  Essentially, this pmu allows us to configure the gpu to periodically
  write snapshots of performance counters (up to 64 32-bit counters per
  snapshot) into a circular buffer. It then uses a 200Hz hrtimer to
  forward those snapshots to userspace as perf samples, with the counter
  snapshots written by the gpu attached as 'raw' data.
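  On the userspace side those samples come out of the perf mmap ring as
  ordinary PERF_RECORD_SAMPLE records. As a rough sketch of walking a
  captured byte range (assuming only PERF_SAMPLE_RAW is set, so each
  record body is just a u32 size followed by the gpu's counter
  snapshot; a real consumer would use the kernel's struct
  perf_event_header from <linux/perf_event.h> and handle ring
  wrap-around):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Mirrors struct perf_event_header from <linux/perf_event.h> */
struct sample_header {
	uint32_t type;	/* PERF_RECORD_SAMPLE == 9 */
	uint16_t misc;
	uint16_t size;	/* total record size, including this header */
};

#define PERF_RECORD_SAMPLE_TYPE 9

/* Walk [buf, buf + len) and invoke cb on each raw payload; returns the
 * number of sample records seen. */
static int for_each_raw_sample(const uint8_t *buf, size_t len,
			       void (*cb)(const uint8_t *raw, uint32_t size))
{
	size_t offset = 0;
	int count = 0;

	while (offset + sizeof(struct sample_header) <= len) {
		const struct sample_header *hdr =
			(const struct sample_header *)(buf + offset);

		if (hdr->size == 0 || offset + hdr->size > len)
			break;	/* truncated or malformed record */

		if (hdr->type == PERF_RECORD_SAMPLE_TYPE) {
			uint32_t raw_size;

			memcpy(&raw_size, buf + offset + sizeof(*hdr),
			       sizeof(raw_size));
			if (cb)
				cb(buf + offset + sizeof(*hdr) +
				   sizeof(raw_size), raw_size);
			count++;
		}
		offset += hdr->size;
	}
	return count;
}
```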

If anyone is interested in more details about Haswell's gpu performance
counters, the PRM can be found here:

  https://01.org/linuxgraphics/sites/default/files/documentation/observability_performance_counters_haswell.pdf

To see how I'm currently using this from userspace, I have a couple of
intel-gpu-tools utilities, intel_oacounter_top_pmu and intel_gpu_trace_pmu:

  https://github.com/rib/intel-gpu-tools/commits/wip/rib/intel-i915-oa-pmu

And the current code I have to use this in Mesa is here:

  https://github.com/rib/mesa/commits/wip/rib/i915_oa_perf

Regards,
- Robert

 drivers/gpu/drm/i915/Makefile       |   1 +
 drivers/gpu/drm/i915/i915_dma.c     |   2 +
 drivers/gpu/drm/i915/i915_drv.h     |  33 ++
 drivers/gpu/drm/i915/i915_oa_perf.c | 675 ++++++++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/i915_reg.h     |  87 +++++
 include/linux/perf_event.h          |   1 +
 include/uapi/drm/i915_drm.h         |  21 ++
 kernel/events/core.c                |  40 ++-
 8 files changed, 854 insertions(+), 6 deletions(-)
 create mode 100644 drivers/gpu/drm/i915/i915_oa_perf.c

-- 
2.1.2


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [RFC PATCH 1/3] perf: export perf_event_overflow
  2014-10-22 15:28 [RFC PATCH 0/3] Expose gpu counters via perf pmu driver Robert Bragg
@ 2014-10-22 15:28 ` Robert Bragg
  2014-10-22 15:28 ` [RFC PATCH 2/3] perf: Add PERF_PMU_CAP_IS_DEVICE flag Robert Bragg
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 16+ messages in thread
From: Robert Bragg @ 2014-10-22 15:28 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, Daniel Vetter, Chris Wilson, Rob Clark,
	Samuel Pitoiset, Ben Skeggs, Robert Bragg

To support pmu drivers in loadable modules, such as the i915 driver,
export perf_event_overflow().

Signed-off-by: Robert Bragg <robert@sixbynine.org>
---
 kernel/events/core.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 1cf24b3..9449180 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5478,6 +5478,7 @@ int perf_event_overflow(struct perf_event *event,
 {
 	return __perf_event_overflow(event, 1, data, regs);
 }
+EXPORT_SYMBOL_GPL(perf_event_overflow);
 
 /*
  * Generic software event infrastructure
-- 
2.1.2



* [RFC PATCH 2/3] perf: Add PERF_PMU_CAP_IS_DEVICE flag
  2014-10-22 15:28 [RFC PATCH 0/3] Expose gpu counters via perf pmu driver Robert Bragg
  2014-10-22 15:28 ` [RFC PATCH 1/3] perf: export perf_event_overflow Robert Bragg
@ 2014-10-22 15:28 ` Robert Bragg
  2014-10-22 15:28 ` [RFC PATCH 3/3] i915: Expose PMU for Observation Architecture Robert Bragg
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 16+ messages in thread
From: Robert Bragg @ 2014-10-22 15:28 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, Daniel Vetter, Chris Wilson, Rob Clark,
	Samuel Pitoiset, Ben Skeggs, Robert Bragg

The PERF_PMU_CAP_IS_DEVICE flag gives pmu drivers a way to declare that
they only monitor device-specific metrics. Since such pmus don't
monitor any cpu metrics, perf should bypass its cpu-centric security
checks, as well as disallow cpu-centric attributes.

Signed-off-by: Robert Bragg <robert@sixbynine.org>
---
 include/linux/perf_event.h |  1 +
 kernel/events/core.c       | 39 +++++++++++++++++++++++++++++++++------
 2 files changed, 34 insertions(+), 6 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 707617a..e1e0153 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -170,6 +170,7 @@ struct perf_event;
  * pmu::capabilities flags
  */
 #define PERF_PMU_CAP_NO_INTERRUPT		0x01
+#define PERF_PMU_CAP_IS_DEVICE			0x02
 
 /**
  * struct pmu - generic performance monitoring unit
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9449180..3ddb157 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3131,7 +3131,8 @@ find_get_context(struct pmu *pmu, struct task_struct *task, int cpu)
 
 	if (!task) {
 		/* Must be root to operate on a CPU event: */
-		if (perf_paranoid_cpu() && !capable(CAP_SYS_ADMIN))
+		if (!(pmu->capabilities & PERF_PMU_CAP_IS_DEVICE) &&
+		    perf_paranoid_cpu() && !capable(CAP_SYS_ADMIN))
 			return ERR_PTR(-EACCES);
 
 		/*
@@ -7091,11 +7092,6 @@ SYSCALL_DEFINE5(perf_event_open,
 	if (err)
 		return err;
 
-	if (!attr.exclude_kernel) {
-		if (perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN))
-			return -EACCES;
-	}
-
 	if (attr.freq) {
 		if (attr.sample_freq > sysctl_perf_event_sample_rate)
 			return -EINVAL;
@@ -7154,6 +7150,37 @@ SYSCALL_DEFINE5(perf_event_open,
 		goto err_cpus;
 	}
 
+	if (event->pmu->capabilities & PERF_PMU_CAP_IS_DEVICE) {
+
+		/* Don't allow cpu centric attributes... */
+		if (event->attr.exclude_user ||
+		    event->attr.exclude_callchain_user ||
+		    event->attr.exclude_kernel ||
+		    event->attr.exclude_callchain_kernel ||
+		    event->attr.exclude_hv ||
+		    event->attr.exclude_idle ||
+		    event->attr.exclude_host ||
+		    event->attr.exclude_guest ||
+		    event->attr.mmap ||
+		    event->attr.comm ||
+		    event->attr.task)
+			return -EINVAL;
+
+		if (attr.sample_type &
+		    (PERF_SAMPLE_IP |
+		     PERF_SAMPLE_TID |
+		     PERF_SAMPLE_ADDR |
+		     PERF_SAMPLE_CALLCHAIN |
+		     PERF_SAMPLE_CPU |
+		     PERF_SAMPLE_BRANCH_STACK |
+		     PERF_SAMPLE_REGS_USER |
+		     PERF_SAMPLE_STACK_USER))
+			return -EINVAL;
+	} else if (!attr.exclude_kernel) {
+		if (perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN))
+			return -EACCES;
+	}
+
 	if (flags & PERF_FLAG_PID_CGROUP) {
 		err = perf_cgroup_connect(pid, event, &attr, group_leader);
 		if (err) {
-- 
2.1.2



* [RFC PATCH 3/3] i915: Expose PMU for Observation Architecture
  2014-10-22 15:28 [RFC PATCH 0/3] Expose gpu counters via perf pmu driver Robert Bragg
  2014-10-22 15:28 ` [RFC PATCH 1/3] perf: export perf_event_overflow Robert Bragg
  2014-10-22 15:28 ` [RFC PATCH 2/3] perf: Add PERF_PMU_CAP_IS_DEVICE flag Robert Bragg
@ 2014-10-22 15:28 ` Robert Bragg
  2014-10-23  7:47   ` Chris Wilson
  2014-10-23  5:58 ` [RFC PATCH 0/3] Expose gpu counters via perf pmu driver Ingo Molnar
  2014-10-30 19:08 ` Peter Zijlstra
  4 siblings, 1 reply; 16+ messages in thread
From: Robert Bragg @ 2014-10-22 15:28 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, Daniel Vetter, Chris Wilson, Rob Clark,
	Samuel Pitoiset, Ben Skeggs, Robert Bragg

Gen graphics hardware can be set up to periodically write snapshots of
performance counters into a circular buffer and this patch exposes that
capability to userspace via the perf interface.

Only Haswell is supported currently.

Signed-off-by: Robert Bragg <robert@sixbynine.org>
---
 drivers/gpu/drm/i915/Makefile       |   1 +
 drivers/gpu/drm/i915/i915_dma.c     |   2 +
 drivers/gpu/drm/i915/i915_drv.h     |  33 ++
 drivers/gpu/drm/i915/i915_oa_perf.c | 675 ++++++++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/i915_reg.h     |  87 +++++
 include/uapi/drm/i915_drm.h         |  21 ++
 6 files changed, 819 insertions(+)
 create mode 100644 drivers/gpu/drm/i915/i915_oa_perf.c

diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
index c1dd485..2ddd97d 100644
--- a/drivers/gpu/drm/i915/Makefile
+++ b/drivers/gpu/drm/i915/Makefile
@@ -14,6 +14,7 @@ i915-y := i915_drv.o \
 	  intel_pm.o
 i915-$(CONFIG_COMPAT)   += i915_ioc32.o
 i915-$(CONFIG_DEBUG_FS) += i915_debugfs.o
+i915-$(CONFIG_PERF_EVENTS) += i915_oa_perf.o
 
 # GEM code
 i915-y += i915_cmd_parser.o \
diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
index 3f676f9..ce1e1ea 100644
--- a/drivers/gpu/drm/i915/i915_dma.c
+++ b/drivers/gpu/drm/i915/i915_dma.c
@@ -1792,6 +1792,7 @@ int i915_driver_load(struct drm_device *dev, unsigned long flags)
 		intel_gpu_ips_init(dev_priv);
 
 	intel_init_runtime_pm(dev_priv);
+	i915_oa_pmu_register(dev);
 
 	return 0;
 
@@ -1839,6 +1840,7 @@ int i915_driver_unload(struct drm_device *dev)
 		return ret;
 	}
 
+	i915_oa_pmu_unregister(dev);
 	intel_fini_runtime_pm(dev_priv);
 
 	intel_gpu_ips_teardown();
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 6fbd316..1b2c557 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -45,6 +45,7 @@
 #include <linux/hashtable.h>
 #include <linux/intel-iommu.h>
 #include <linux/kref.h>
+#include <linux/perf_event.h>
 #include <linux/pm_qos.h>
 
 /* General customization:
@@ -1636,6 +1637,29 @@ struct drm_i915_private {
 	 */
 	struct workqueue_struct *dp_wq;
 
+#ifdef CONFIG_PERF_EVENTS
+	struct {
+	    struct pmu pmu;
+	    spinlock_t lock;
+	    struct hrtimer timer;
+	    struct pt_regs dummy_regs;
+
+	    struct perf_event *exclusive_event;
+	    struct intel_context *specific_ctx;
+
+	    struct {
+		struct kref refcount;
+		struct drm_i915_gem_object *obj;
+		u32 gtt_offset;
+		u8 *addr;
+		u32 head;
+		u32 tail;
+		int format;
+		int format_size;
+	    } oa_buffer;
+	} oa_pmu;
+#endif
+
 	/* Old dri1 support infrastructure, beware the dragons ya fools entering
 	 * here! */
 	struct i915_dri1_state dri1;
@@ -2688,6 +2712,15 @@ int i915_parse_cmds(struct intel_engine_cs *ring,
 		    u32 batch_start_offset,
 		    bool is_master);
 
+/* i915_oa_perf.c */
+#ifdef CONFIG_PERF_EVENTS
+extern void i915_oa_pmu_register(struct drm_device *dev);
+extern void i915_oa_pmu_unregister(struct drm_device *dev);
+#else
+static inline void i915_oa_pmu_register(struct drm_device *dev) {}
+static inline void i915_oa_pmu_unregister(struct drm_device *dev) {}
+#endif
+
 /* i915_suspend.c */
 extern int i915_save_state(struct drm_device *dev);
 extern int i915_restore_state(struct drm_device *dev);
diff --git a/drivers/gpu/drm/i915/i915_oa_perf.c b/drivers/gpu/drm/i915/i915_oa_perf.c
new file mode 100644
index 0000000..d86aaf0
--- /dev/null
+++ b/drivers/gpu/drm/i915/i915_oa_perf.c
@@ -0,0 +1,675 @@
+#include <linux/perf_event.h>
+#include <linux/sizes.h>
+
+#include "i915_drv.h"
+#include "intel_ringbuffer.h"
+
+/* Must be a power of two */
+#define OA_BUFFER_SIZE	     SZ_16M
+#define OA_TAKEN(tail, head) ((tail - head) & (OA_BUFFER_SIZE - 1))
+
+#define FREQUENCY 200
+#define PERIOD max_t(u64, 10000, NSEC_PER_SEC / FREQUENCY)
+
+static int hsw_perf_format_sizes[] = {
+	64,  /* A13_HSW */
+	128, /* A29_HSW */
+	128, /* A13_B8_C8_HSW */
+
+	/* XXX: If we were to disallow this format we could avoid needing to
+	 * handle snapshots being split in two when their size doesn't divide
+	 * evenly into the buffer size... */
+	192, /* A29_B8_C8_HSW */
+	64,  /* B4_C8_HSW */
+	256, /* A45_B8_C8_HSW */
+	128, /* B4_C8_A16_HSW */
+	64   /* C4_B8_HSW */
+};
+
+static void forward_one_oa_snapshot_to_event(struct drm_i915_private *dev_priv,
+					     u8 *snapshot,
+					     struct perf_event *event)
+{
+	struct perf_sample_data data;
+	int snapshot_size = dev_priv->oa_pmu.oa_buffer.format_size;
+	struct perf_raw_record raw;
+
+	perf_sample_data_init(&data, 0, event->hw.last_period);
+
+	/* XXX: It seems strange that kernel/events/core.c only initialises
+	 * data->type if event->attr.sample_id_all is set
+	 *
+	 * For now, we explicitly set this otherwise perf_event_overflow()
+	 * may reference an uninitialised sample_type and may not actually
+	 * forward our raw data.
+	 */
+	data.type = event->attr.sample_type;
+
+	/* Note: the 32 bit size + raw data must be 8 byte aligned.
+	 *
+	 * So that we don't have to first copy the data out of the
+	 * OABUFFER, we instead allow an overrun and forward the 32 bit
+	 * report id of the next snapshot...
+	 */
+	raw.size = snapshot_size + 4;
+	raw.data = snapshot;
+
+	data.raw = &raw;
+
+	perf_event_overflow(event, &data, &dev_priv->oa_pmu.dummy_regs);
+}
+
+static u32 forward_oa_snapshots(struct drm_i915_private *dev_priv,
+				u32 head,
+				u32 tail)
+{
+	struct perf_event *exclusive_event = dev_priv->oa_pmu.exclusive_event;
+	int snapshot_size = dev_priv->oa_pmu.oa_buffer.format_size;
+	u8 *oa_buf_base = dev_priv->oa_pmu.oa_buffer.addr;
+	u32 mask = (OA_BUFFER_SIZE - 1);
+	u8 scratch[snapshot_size + 4];
+	u8 *snapshot;
+	u32 taken;
+
+	head -= dev_priv->oa_pmu.oa_buffer.gtt_offset;
+	tail -= dev_priv->oa_pmu.oa_buffer.gtt_offset;
+
+	/* Note: the gpu doesn't wrap the tail according to the OA buffer size
+	 * so when we need to make sure our head/tail values are in-bounds we
+	 * use the above mask.
+	 */
+
+	while ((taken = OA_TAKEN(tail, head))) {
+		u32 before;
+
+		/* The tail increases in 64 byte increments, not in
+		 * format_size steps. */
+		if (taken < snapshot_size)
+			break;
+
+		/* As well as handling snapshots that are split in two we also
+		 * need to pad snapshots at the end of the oabuffer so that
+		 * forward_one_oa_snapshot_to_event() can safely overrun by 4
+		 * bytes for alignment. */
+		before = OA_BUFFER_SIZE - (head & mask);
+		if (before <= snapshot_size) {
+			u32 after = snapshot_size - before;
+
+			memcpy(scratch, oa_buf_base + (head & mask), before);
+			if (after)
+				memcpy(scratch + before, oa_buf_base, after);
+			snapshot = scratch;
+		} else
+			snapshot = oa_buf_base + (head & mask);
+
+		head += snapshot_size;
+
+		/* We currently only allow exclusive access to the counters
+		 * so we only have one event to forward to... */
+		if (exclusive_event->state == PERF_EVENT_STATE_ACTIVE)
+			forward_one_oa_snapshot_to_event(dev_priv, snapshot,
+							 exclusive_event);
+	}
+
+	return dev_priv->oa_pmu.oa_buffer.gtt_offset + head;
+}
+
+static void flush_oa_snapshots(struct drm_i915_private *dev_priv,
+			       bool force_wake)
+{
+	unsigned long flags;
+	u32 oastatus2;
+	u32 oastatus1;
+	u32 head;
+	u32 tail;
+
+	/* Can either flush via hrtimer callback or pmu methods/fops */
+	if (!force_wake) {
+
+		/* If the hrtimer triggers at the same time that we are
+		 * responding to a userspace initiated flush then we can
+		 * just bail out...
+		 *
+		 * FIXME: strictly this lock doesn't imply we are already
+		 * flushing though it shouldn't really be a problem to skip
+		 * the odd hrtimer flush anyway.
+		 */
+		if (!spin_trylock_irqsave(&dev_priv->oa_pmu.lock, flags))
+			return;
+	} else
+		spin_lock_irqsave(&dev_priv->oa_pmu.lock, flags);
+
+	WARN_ON(!dev_priv->oa_pmu.oa_buffer.addr);
+
+	oastatus2 = I915_READ(OASTATUS2);
+	oastatus1 = I915_READ(OASTATUS1);
+
+	head = oastatus2 & OASTATUS2_HEAD_MASK;
+	tail = oastatus1 & OASTATUS1_TAIL_MASK;
+
+	if (oastatus1 & (OASTATUS1_OABUFFER_OVERFLOW |
+			 OASTATUS1_REPORT_LOST)) {
+
+		/* XXX: How can we convey report-lost errors to userspace?  It
+		 * doesn't look like perf's _REPORT_LOST mechanism is
+		 * appropriate in this case; that's just for cases where we
+		 * run out of space for samples in the perf circular buffer.
+		 *
+		 * Maybe we can claim a special report-id and use that to
+		 * forward status flags?
+		 */
+		pr_debug("OA buffer read error: addr = %p, head = %u, offset = %u, tail = %u cnt o'flow = %d, buf o'flow = %d, rpt lost = %d\n",
+			 dev_priv->oa_pmu.oa_buffer.addr,
+			 head,
+			 head - dev_priv->oa_pmu.oa_buffer.gtt_offset,
+			 tail,
+			 oastatus1 & OASTATUS1_COUNTER_OVERFLOW ? 1 : 0,
+			 oastatus1 & OASTATUS1_OABUFFER_OVERFLOW ? 1 : 0,
+			 oastatus1 & OASTATUS1_REPORT_LOST ? 1 : 0);
+
+		I915_WRITE(OASTATUS1, oastatus1 &
+			   ~(OASTATUS1_OABUFFER_OVERFLOW |
+			     OASTATUS1_REPORT_LOST));
+	}
+
+	head = forward_oa_snapshots(dev_priv, head, tail);
+
+	I915_WRITE(OASTATUS2, (head & OASTATUS2_HEAD_MASK) | OASTATUS2_GGTT);
+
+	spin_unlock_irqrestore(&dev_priv->oa_pmu.lock, flags);
+}
+
+static void
+oa_buffer_free(struct kref *kref)
+{
+	struct drm_i915_private *i915 =
+		container_of(kref, typeof(*i915), oa_pmu.oa_buffer.refcount);
+
+	BUG_ON(!mutex_is_locked(&i915->dev->struct_mutex));
+
+	vunmap(i915->oa_pmu.oa_buffer.addr);
+	i915_gem_object_ggtt_unpin(i915->oa_pmu.oa_buffer.obj);
+	drm_gem_object_unreference(&i915->oa_pmu.oa_buffer.obj->base);
+
+	i915->oa_pmu.oa_buffer.obj = NULL;
+	i915->oa_pmu.oa_buffer.gtt_offset = 0;
+	i915->oa_pmu.oa_buffer.addr = NULL;
+}
+
+static inline void oa_buffer_reference(struct drm_i915_private *i915)
+{
+	kref_get(&i915->oa_pmu.oa_buffer.refcount);
+}
+
+static void oa_buffer_unreference(struct drm_i915_private *i915)
+{
+	WARN_ON(!i915->oa_pmu.oa_buffer.obj);
+
+	kref_put(&i915->oa_pmu.oa_buffer.refcount, oa_buffer_free);
+}
+
+static void i915_oa_event_destroy(struct perf_event *event)
+{
+	struct drm_i915_private *i915 =
+		container_of(event->pmu, typeof(*i915), oa_pmu.pmu);
+
+	WARN_ON(event->parent);
+
+	mutex_lock(&i915->dev->struct_mutex);
+
+	oa_buffer_unreference(i915);
+
+	if (i915->oa_pmu.specific_ctx) {
+		struct drm_i915_gem_object *obj;
+
+		obj = i915->oa_pmu.specific_ctx->legacy_hw_ctx.rcs_state;
+		if (i915_gem_obj_is_pinned(obj))
+			i915_gem_object_ggtt_unpin(obj);
+		i915->oa_pmu.specific_ctx = NULL;
+	}
+
+	BUG_ON(i915->oa_pmu.exclusive_event != event);
+	i915->oa_pmu.exclusive_event = NULL;
+
+	mutex_unlock(&i915->dev->struct_mutex);
+
+	gen6_gt_force_wake_put(i915, FORCEWAKE_ALL);
+}
+
+static void *vmap_oa_buffer(struct drm_i915_gem_object *obj)
+{
+	int i;
+	void *addr = NULL;
+	struct sg_page_iter sg_iter;
+	struct page **pages;
+
+	pages = drm_malloc_ab(obj->base.size >> PAGE_SHIFT, sizeof(*pages));
+	if (pages == NULL) {
+		DRM_DEBUG_DRIVER("Failed to get space for pages\n");
+		goto finish;
+	}
+
+	i = 0;
+	for_each_sg_page(obj->pages->sgl, &sg_iter, obj->pages->nents, 0) {
+		pages[i] = sg_page_iter_page(&sg_iter);
+		i++;
+	}
+
+	addr = vmap(pages, i, 0, PAGE_KERNEL);
+	if (addr == NULL) {
+		DRM_DEBUG_DRIVER("Failed to vmap pages\n");
+		goto finish;
+	}
+
+finish:
+	if (pages)
+		drm_free_large(pages);
+	return addr;
+}
+
+static int init_oa_buffer(struct perf_event *event)
+{
+	struct drm_i915_private *dev_priv =
+		container_of(event->pmu, typeof(*dev_priv), oa_pmu.pmu);
+	struct drm_i915_gem_object *bo;
+	int ret;
+
+	BUG_ON(!IS_HASWELL(dev_priv->dev));
+	BUG_ON(!mutex_is_locked(&dev_priv->dev->struct_mutex));
+	BUG_ON(dev_priv->oa_pmu.oa_buffer.obj);
+
+	kref_init(&dev_priv->oa_pmu.oa_buffer.refcount);
+
+	bo = i915_gem_alloc_object(dev_priv->dev, OA_BUFFER_SIZE);
+	if (bo == NULL) {
+		DRM_ERROR("Failed to allocate OA buffer\n");
+		ret = -ENOMEM;
+		goto err;
+	}
+	dev_priv->oa_pmu.oa_buffer.obj = bo;
+
+	ret = i915_gem_object_set_cache_level(bo, I915_CACHE_LLC);
+	if (ret)
+		goto err_unref;
+
+	/* PreHSW required 512K alignment, HSW requires 16M */
+	ret = i915_gem_obj_ggtt_pin(bo, SZ_16M, 0);
+	if (ret)
+		goto err_unref;
+
+	dev_priv->oa_pmu.oa_buffer.gtt_offset = i915_gem_obj_ggtt_offset(bo);
+	dev_priv->oa_pmu.oa_buffer.addr = vmap_oa_buffer(bo);
+
+	/* Pre-DevBDW: OABUFFER must be set with counters off,
+	 * before OASTATUS1, but after OASTATUS2 */
+	I915_WRITE(OASTATUS2, dev_priv->oa_pmu.oa_buffer.gtt_offset |
+		   OASTATUS2_GGTT); /* head */
+	I915_WRITE(GEN7_OABUFFER, dev_priv->oa_pmu.oa_buffer.gtt_offset);
+	I915_WRITE(OASTATUS1, dev_priv->oa_pmu.oa_buffer.gtt_offset |
+		   OASTATUS1_OABUFFER_SIZE_16M); /* tail */
+
+	DRM_DEBUG_DRIVER("OA Buffer initialized, gtt offset = 0x%x, vaddr = %p",
+			 dev_priv->oa_pmu.oa_buffer.gtt_offset,
+			 dev_priv->oa_pmu.oa_buffer.addr);
+
+	return 0;
+
+err_unref:
+	drm_gem_object_unreference_unlocked(&bo->base);
+err:
+	return ret;
+}
+
+static enum hrtimer_restart hrtimer_sample(struct hrtimer *hrtimer)
+{
+	struct drm_i915_private *i915 =
+		container_of(hrtimer, typeof(*i915), oa_pmu.timer);
+
+	flush_oa_snapshots(i915, false);
+
+	hrtimer_forward_now(hrtimer, ns_to_ktime(PERIOD));
+	return HRTIMER_RESTART;
+}
+
+static struct intel_context *
+lookup_context(struct drm_i915_private *dev_priv,
+	       struct file *user_filp,
+	       u32 ctx_user_handle)
+{
+	struct intel_context *ctx;
+
+	mutex_lock(&dev_priv->dev->struct_mutex);
+	list_for_each_entry(ctx, &dev_priv->context_list, link) {
+		struct drm_file *drm_file;
+
+		if (!ctx->file_priv)
+			continue;
+
+		drm_file = ctx->file_priv->file;
+
+		if (user_filp->private_data == drm_file &&
+		    ctx->user_handle == ctx_user_handle) {
+			mutex_unlock(&dev_priv->dev->struct_mutex);
+			return ctx;
+		}
+	}
+	mutex_unlock(&dev_priv->dev->struct_mutex);
+
+	return NULL;
+}
+
+static int i915_oa_event_init(struct perf_event *event)
+{
+	struct perf_event_context *ctx = event->ctx;
+	struct drm_i915_private *dev_priv =
+		container_of(event->pmu, typeof(*dev_priv), oa_pmu.pmu);
+	int ret = 0;
+
+	if (event->attr.type != event->pmu->type)
+		return -ENOENT;
+
+	/* When tracing a specific pid, events/core will enable/disable
+	 * the event only while that pid is running on a cpu, but that
+	 * doesn't really make sense here. */
+	if (ctx) {
+		if (ctx->task)
+			return -EINVAL;
+	}
+#if 0
+	else
+	    pr_err("Unexpected NULL perf_event_context\n");
+
+	 /* XXX: it looks like we get a NULL ctx, so check if setting
+	  * pmu->task_ctx_nr to perf_invalid_context in _pmu_register
+	  * implies events/core.c will also implicitly disallow
+	  * associating a perf_oa event with a task?
+	  */
+#endif
+
+	/* To avoid the complexity of having to accurately filter
+	 * counter snapshots and marshal to the appropriate client
+	 * we currently only allow exclusive access */
+	if (dev_priv->oa_pmu.oa_buffer.obj)
+		return -EBUSY;
+
+	/* TODO: improve cooperation with the cmd_parser which provides
+	 * another mechanism for enabling the OA counters. */
+	if (I915_READ(OACONTROL) & OACONTROL_ENABLE)
+		return -EBUSY;
+
+	/* Since we are limited to an exponential scale for
+	 * programming the OA sampling period we don't allow userspace
+	 * to pass a precise attr.sample_period. */
+	if (event->attr.freq ||
+	    (event->attr.sample_period != 0 &&
+	     event->attr.sample_period != 1))
+		return -EINVAL;
+
+	/* Instead of allowing userspace to configure the period via
+	 * attr.sample_period we instead accept an exponent whereby
+	 * the sample_period will be:
+	 *
+	 *   80ns * 2^(period_exponent + 1)
+	 *
+	 * Programming a period of 160 nanoseconds would not be very
+	 * polite, so higher frequencies are reserved for root.
+	 */
+	if (event->attr.sample_period) {
+		u64 period_exponent =
+			event->attr.config & I915_PERF_OA_TIMER_EXPONENT_MASK;
+		period_exponent >>= I915_PERF_OA_TIMER_EXPONENT_SHIFT;
+
+		if (period_exponent < 15 && !capable(CAP_SYS_ADMIN))
+			return -EACCES;
+	}
+
+	if (!IS_HASWELL(dev_priv->dev))
+		return -ENODEV;
+
+	/* We bypass the default perf core perf_paranoid_cpu() ||
+	 * CAP_SYS_ADMIN check by using the PERF_PMU_CAP_IS_DEVICE
+	 * flag and instead authenticate based on whether the current
+	 * pid owns the specified context, or require CAP_SYS_ADMIN
+	 * when collecting cross-context metrics.
+	 */
+	dev_priv->oa_pmu.specific_ctx = NULL;
+	if (event->attr.config & I915_PERF_OA_SINGLE_CONTEXT_ENABLE) {
+		u32 ctx_id = event->attr.config & I915_PERF_OA_CTX_ID_MASK;
+		unsigned int drm_fd = event->attr.config1;
+		struct fd fd = fdget(drm_fd);
+
+		if (fd.file) {
+			dev_priv->oa_pmu.specific_ctx =
+				lookup_context(dev_priv, fd.file, ctx_id);
+		}
+	}
+
+	if (!dev_priv->oa_pmu.specific_ctx && !capable(CAP_SYS_ADMIN))
+		return -EACCES;
+
+	mutex_lock(&dev_priv->dev->struct_mutex);
+
+	/* XXX: Not sure that this is really acceptable...
+	 *
+	 * i915_gem_context.c currently owns pinning/unpinning legacy
+	 * context buffers and although that code has a
+	 * get_context_alignment() func to handle a different
+	 * constraint for gen6 we are assuming it's fixed for gen7
+	 * here. Another option besides pinning here would be to
+	 * instead hook into context switching and update the
+	 * OACONTROL configuration on the fly.
+	 */
+	if (dev_priv->oa_pmu.specific_ctx) {
+		struct intel_context *ctx = dev_priv->oa_pmu.specific_ctx;
+		int ret;
+
+		ret = i915_gem_obj_ggtt_pin(ctx->legacy_hw_ctx.rcs_state,
+					    4096, 0);
+		if (ret) {
+			DRM_DEBUG_DRIVER("Couldn't pin %d\n", ret);
+			ret = -EBUSY;
+			goto err;
+		}
+	}
+
+	if (!dev_priv->oa_pmu.oa_buffer.obj)
+		ret = init_oa_buffer(event);
+	else
+		oa_buffer_reference(dev_priv);
+
+	if (ret)
+		goto err;
+
+	BUG_ON(dev_priv->oa_pmu.exclusive_event);
+	dev_priv->oa_pmu.exclusive_event = event;
+
+	event->destroy = i915_oa_event_destroy;
+
+	mutex_unlock(&dev_priv->dev->struct_mutex);
+
+	/* PRM - observability performance counters:
+	 *
+	 *   OACONTROL, performance counter enable, note:
+	 *
+	 *   "When this bit is set, in order to have coherent counts,
+	 *   RC6 power state and trunk clock gating must be disabled.
+	 *   This can be achieved by programming MMIO registers as
+	 *   0xA094=0 and 0xA090[31]=1"
+	 *
+	 *   0xA094 corresponds to GEN6_RC_STATE
+	 *   0xA090[31] corresponds to GEN6_RC_CONTROL, GEN6_RC_CTL_HW_ENABLE
+	 */
+	/* XXX: We should probably find a more refined way of disabling RC6
+	 * in cooperation with intel_pm.c.
+	 * TODO: Find a way to disable clock gating too
+	 */
+	gen6_gt_force_wake_get(dev_priv, FORCEWAKE_ALL);
+
+	return 0;
+
+err:
+	mutex_unlock(&dev_priv->dev->struct_mutex);
+
+	return ret;
+}
+
+static void i915_oa_event_start(struct perf_event *event, int flags)
+{
+	struct drm_i915_private *dev_priv =
+		container_of(event->pmu, typeof(*dev_priv), oa_pmu.pmu);
+	u64 report_format;
+	int snapshot_size;
+	unsigned long ctx_id;
+	u64 period_exponent;
+
+	/* PRM - observability performance counters:
+	 *
+	 *   OACONTROL, specific context enable:
+	 *
+	 *   "OA unit level clock gating must be ENABLED when using
+	 *   specific ContextID feature."
+	 *
+	 * Assuming we don't ever disable OA unit level clock gating
+	 * let's just assert that this condition is met...
+	 */
+	WARN_ONCE(I915_READ(GEN6_UCGCTL3) & GEN6_OACSUNIT_CLOCK_GATE_DISABLE,
+		  "disabled OA unit level clock gating will result in incorrect per-context OA counters");
+
+	/* XXX: On Haswell, when threshold disable mode is desired,
+	 * instead of setting the threshold enable to '0', we need to
+	 * program it to '1' and set OASTARTTRIG1 bits 15:0 to 0
+	 * (threshold value of 0)
+	 */
+	I915_WRITE(OASTARTTRIG6, (OASTARTTRIG6_B4_TO_B7_THRESHOLD_ENABLE |
+				  OASTARTTRIG6_B4_CUSTOM_EVENT_ENABLE));
+	I915_WRITE(OASTARTTRIG5, 0); /* threshold value */
+
+	I915_WRITE(OASTARTTRIG2, (OASTARTTRIG2_B0_TO_B3_THRESHOLD_ENABLE |
+				  OASTARTTRIG2_B0_CUSTOM_EVENT_ENABLE));
+	I915_WRITE(OASTARTTRIG1, 0); /* threshold value */
+
+	/* Setup B0 as the gpu clock counter... */
+	I915_WRITE(OACEC0_0, OACEC0_0_B0_COMPARE_GREATER_OR_EQUAL); /* to 0 */
+	I915_WRITE(OACEC0_1, 0xfffe); /* Select NOA[0] */
+
+	period_exponent = event->attr.config & I915_PERF_OA_TIMER_EXPONENT_MASK;
+	period_exponent >>= I915_PERF_OA_TIMER_EXPONENT_SHIFT;
+
+	if (dev_priv->oa_pmu.specific_ctx) {
+		struct intel_context *ctx = dev_priv->oa_pmu.specific_ctx;
+
+		ctx_id = i915_gem_obj_ggtt_offset(ctx->legacy_hw_ctx.rcs_state);
+	} else
+		ctx_id = 0;
+
+	report_format = event->attr.config & I915_PERF_OA_FORMAT_MASK;
+	report_format >>= I915_PERF_OA_FORMAT_SHIFT;
+	snapshot_size = hsw_perf_format_sizes[report_format];
+
+	I915_WRITE(OACONTROL,  0 |
+		   (ctx_id & OACONTROL_CTX_MASK) |
+		   period_exponent << OACONTROL_TIMER_PERIOD_SHIFT |
+		   (event->attr.sample_period ? OACONTROL_TIMER_ENABLE : 0) |
+		   report_format << OACONTROL_FORMAT_SHIFT|
+		   (ctx_id ? OACONTROL_PER_CTX_ENABLE : 0) |
+		   OACONTROL_ENABLE);
+
+	if (event->attr.sample_period) {
+		__hrtimer_start_range_ns(&dev_priv->oa_pmu.timer,
+					 ns_to_ktime(PERIOD), 0,
+					 HRTIMER_MODE_REL_PINNED, 0);
+	}
+
+	dev_priv->oa_pmu.oa_buffer.format = report_format;
+	dev_priv->oa_pmu.oa_buffer.format_size = snapshot_size;
+
+	event->hw.state = 0;
+}
+
+static void i915_oa_event_stop(struct perf_event *event, int flags)
+{
+	struct drm_i915_private *dev_priv =
+		container_of(event->pmu, typeof(*dev_priv), oa_pmu.pmu);
+
+	I915_WRITE(OACONTROL, I915_READ(OACONTROL) & ~OACONTROL_ENABLE);
+
+	if (event->attr.sample_period) {
+		hrtimer_cancel(&dev_priv->oa_pmu.timer);
+		flush_oa_snapshots(dev_priv, true);
+	}
+
+	event->hw.state = PERF_HES_STOPPED;
+}
+
+static int i915_oa_event_add(struct perf_event *event, int flags)
+{
+	if (flags & PERF_EF_START)
+		i915_oa_event_start(event, flags);
+
+	return 0;
+}
+
+static void i915_oa_event_del(struct perf_event *event, int flags)
+{
+	i915_oa_event_stop(event, flags);
+}
+
+static void i915_oa_event_read(struct perf_event *event)
+{
+	struct drm_i915_private *i915 =
+		container_of(event->pmu, typeof(*i915), oa_pmu.pmu);
+
+	/* We want userspace to be able to use a read() to explicitly
+	 * flush OA counter snapshots... */
+	if (event->attr.sample_period)
+		flush_oa_snapshots(i915, true);
+
+	/* XXX: What counter would be useful here? */
+	local64_set(&event->count, 0);
+}
+
+static int i915_oa_event_event_idx(struct perf_event *event)
+{
+	return 0;
+}
+
+void i915_oa_pmu_register(struct drm_device *dev)
+{
+	struct drm_i915_private *i915 = to_i915(dev);
+
+	/* We need to be careful about forwarding cpu metrics to
+	 * userspace considering that PERF_PMU_CAP_IS_DEVICE bypasses
+	 * the events/core security check that stops an unprivileged
+	 * process collecting metrics for other processes.
+	 */
+	i915->oa_pmu.dummy_regs = *task_pt_regs(current);
+
+	hrtimer_init(&i915->oa_pmu.timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	i915->oa_pmu.timer.function = hrtimer_sample;
+
+	spin_lock_init(&i915->oa_pmu.lock);
+
+	i915->oa_pmu.pmu.capabilities  = PERF_PMU_CAP_IS_DEVICE;
+	i915->oa_pmu.pmu.task_ctx_nr   = perf_invalid_context;
+	i915->oa_pmu.pmu.event_init    = i915_oa_event_init;
+	i915->oa_pmu.pmu.add	       = i915_oa_event_add;
+	i915->oa_pmu.pmu.del	       = i915_oa_event_del;
+	i915->oa_pmu.pmu.start	       = i915_oa_event_start;
+	i915->oa_pmu.pmu.stop	       = i915_oa_event_stop;
+	i915->oa_pmu.pmu.read	       = i915_oa_event_read;
+	i915->oa_pmu.pmu.event_idx     = i915_oa_event_event_idx;
+
+	if (perf_pmu_register(&i915->oa_pmu.pmu, "i915_oa", -1))
+		i915->oa_pmu.pmu.event_init = NULL;
+}
+
+void i915_oa_pmu_unregister(struct drm_device *dev)
+{
+	struct drm_i915_private *i915 = to_i915(dev);
+
+	if (i915->oa_pmu.pmu.event_init == NULL)
+		return;
+
+	perf_pmu_unregister(&i915->oa_pmu.pmu);
+	i915->oa_pmu.pmu.event_init = NULL;
+}
diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
index 203062e..1e7cfd4 100644
--- a/drivers/gpu/drm/i915/i915_reg.h
+++ b/drivers/gpu/drm/i915/i915_reg.h
@@ -457,6 +457,92 @@
 #define GEN7_3DPRIM_BASE_VERTEX         0x2440
 
 #define OACONTROL 0x2360
+#define  OACONTROL_CTX_MASK		0xFFFFF000
+#define  OACONTROL_TIMER_PERIOD_MASK	0x3F
+#define  OACONTROL_TIMER_PERIOD_SHIFT	6
+#define  OACONTROL_TIMER_ENABLE		(1<<5)
+#define  OACONTROL_FORMAT_A13_HSW	(0<<2)
+#define  OACONTROL_FORMAT_A29_HSW       (1<<2)
+#define  OACONTROL_FORMAT_A13_B8_C8_HSW (2<<2)
+#define  OACONTROL_FORMAT_A29_B8_C8_HSW (3<<2)
+#define  OACONTROL_FORMAT_B4_C8_HSW     (4<<2)
+#define  OACONTROL_FORMAT_A45_B8_C8_HSW (5<<2)
+#define  OACONTROL_FORMAT_B4_C8_A16_HSW (6<<2)
+#define  OACONTROL_FORMAT_C4_B8_HSW     (7<<2)
+#define  OACONTROL_FORMAT_SHIFT         2
+#define  OACONTROL_PER_CTX_ENABLE	(1<<1)
+#define  OACONTROL_ENABLE		(1<<0)
+
+#define OASTARTTRIG5 0x02720
+#define  OASTARTTRIG5_THRESHOLD_VALUE_MASK	0xffff
+
+#define OASTARTTRIG6 0x02724
+#define  OASTARTTRIG6_B4_CUSTOM_EVENT_ENABLE	(1<<28)
+#define  OASTARTTRIG6_B4_TO_B7_THRESHOLD_ENABLE (1<<23)
+
+#define OASTARTTRIG1 0x02710
+#define  OASTARTTRIG1_THRESHOLD_VALUE_MASK	0xffff
+
+#define OASTARTTRIG2 0x02714
+#define  OASTARTTRIG2_B0_CUSTOM_EVENT_ENABLE	(1<<28)
+#define  OASTARTTRIG2_B0_TO_B3_THRESHOLD_ENABLE (1<<23)
+
+#define OACEC0_0 0x2770
+#define  OACEC0_0_B0_COMPARE_ANY_EQUAL		0
+#define  OACEC0_0_B0_COMPARE_OR			0
+#define  OACEC0_0_B0_COMPARE_GREATER_THAN	1
+#define  OACEC0_0_B0_COMPARE_EQUAL		2
+#define  OACEC0_0_B0_COMPARE_GREATER_OR_EQUAL	3
+#define  OACEC0_0_B0_COMPARE_LESS_THAN		4
+#define  OACEC0_0_B0_COMPARE_NOT_EQUAL		5
+#define  OACEC0_0_B0_COMPARE_LESS_OR_EQUAL	6
+#define  OACEC0_0_B0_COMPARE_VALUE_MASK		0xffff
+#define  OACEC0_0_B0_COMPARE_VALUE_SHIFT	3
+
+#define OACEC0_1 0x2774
+#define  OACEC0_1_B0_NOA_SELECT_MASK		0xffff
+
+#define GEN7_OABUFFER 0x23B0 /* R/W */
+#define  GEN7_OABUFFER_OVERRUN_DISABLE	    (1<<3)
+#define  GEN7_OABUFFER_EDGE_TRIGGER	    (1<<2)
+#define  GEN7_OABUFFER_STOP_RESUME_ENABLE   (1<<1)
+#define  GEN7_OABUFFER_RESUME		    (1<<0)
+
+#define GEN8_OABUFFER 0x2B14 /* R/W */
+#define  GEN8_OABUFFER_SIZE_MASK	0x7
+#define  GEN8_OABUFFER_SIZE_128K	(0<<3)
+#define  GEN8_OABUFFER_SIZE_256K	(1<<3)
+#define  GEN8_OABUFFER_SIZE_512K	(2<<3)
+#define  GEN8_OABUFFER_SIZE_1M		(3<<3)
+#define  GEN8_OABUFFER_SIZE_2M		(4<<3)
+#define  GEN8_OABUFFER_SIZE_4M		(5<<3)
+#define  GEN8_OABUFFER_SIZE_8M		(6<<3)
+#define  GEN8_OABUFFER_SIZE_16M		(7<<3)
+#define  GEN8_OABUFFER_EDGE_TRIGGER	(1<<2)
+#define  GEN8_OABUFFER_OVERRUN_DISABLE  (1<<1)
+#define  GEN8_OABUFFER_MEM_SELECT_GGTT  (1<<0)
+
+#define OASTATUS1 0x2364
+#define  OASTATUS1_TAIL_MASK		0xffffffc0
+#define  OASTATUS1_OABUFFER_SIZE_128K	(0<<3)
+#define  OASTATUS1_OABUFFER_SIZE_256K	(1<<3)
+#define  OASTATUS1_OABUFFER_SIZE_512K	(2<<3)
+#define  OASTATUS1_OABUFFER_SIZE_1M	(3<<3)
+#define  OASTATUS1_OABUFFER_SIZE_2M	(4<<3)
+#define  OASTATUS1_OABUFFER_SIZE_4M	(5<<3)
+#define  OASTATUS1_OABUFFER_SIZE_8M	(6<<3)
+#define  OASTATUS1_OABUFFER_SIZE_16M	(7<<3)
+#define  OASTATUS1_COUNTER_OVERFLOW	(1<<2)
+#define  OASTATUS1_OABUFFER_OVERFLOW	(1<<1)
+#define  OASTATUS1_REPORT_LOST		(1<<0)
+
+
+#define OASTATUS2 0x2368
+#define  OASTATUS2_HEAD_MASK	0xffffffc0
+#define  OASTATUS2_GGTT		0x1
+
+#define GEN8_OAHEADPTR 0x2B0C
+#define GEN8_OATAILPTR 0x2B10
 
 #define _GEN7_PIPEA_DE_LOAD_SL	0x70068
 #define _GEN7_PIPEB_DE_LOAD_SL	0x71068
@@ -5551,6 +5637,7 @@ enum punit_power_well {
 # define GEN6_RCCUNIT_CLOCK_GATE_DISABLE		(1 << 11)
 
 #define GEN6_UCGCTL3				0x9408
+# define GEN6_OACSUNIT_CLOCK_GATE_DISABLE		(1 << 20)
 
 #define GEN7_UCGCTL4				0x940c
 #define  GEN7_L3BANK2X_CLOCK_GATE_DISABLE	(1<<25)
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index ff57f07..fd3b0cb 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -58,6 +58,27 @@
 #define I915_ERROR_UEVENT		"ERROR"
 #define I915_RESET_UEVENT		"RESET"
 
+/**
+ * DOC: perf events configuration exposed by i915 through /sys/bus/event_sources/drivers/i915_oa
+ * The masks below describe the layout of an event's attr.config fields.
+ */
+#define I915_PERF_OA_CTX_ID_MASK	    0xffffffff
+#define I915_PERF_OA_SINGLE_CONTEXT_ENABLE  (1ULL << 32)
+
+#define I915_PERF_OA_FORMAT_SHIFT	    33
+#define I915_PERF_OA_FORMAT_MASK	    (0x7ULL << 33)
+#define I915_PERF_OA_FORMAT_A13_HSW	    (0ULL << 33)
+#define I915_PERF_OA_FORMAT_A29_HSW	    (1ULL << 33)
+#define I915_PERF_OA_FORMAT_A13_B8_C8_HSW   (2ULL << 33)
+#define I915_PERF_OA_FORMAT_A29_B8_C8_HSW   (3ULL << 33)
+#define I915_PERF_OA_FORMAT_B4_C8_HSW	    (4ULL << 33)
+#define I915_PERF_OA_FORMAT_A45_B8_C8_HSW   (5ULL << 33)
+#define I915_PERF_OA_FORMAT_B4_C8_A16_HSW   (6ULL << 33)
+#define I915_PERF_OA_FORMAT_C4_B8_HSW	    (7ULL << 33)
+
+#define I915_PERF_OA_TIMER_EXPONENT_SHIFT   36
+#define I915_PERF_OA_TIMER_EXPONENT_MASK    (0x3fULL << 36)
+
 /* Each region is a minimum of 16k, and there are at most 255 of them.
  */
 #define I915_NR_TEX_REGIONS 255	/* table size 2k - maximum due to use
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 0/3] Expose gpu counters via perf pmu driver
  2014-10-22 15:28 [RFC PATCH 0/3] Expose gpu counters via perf pmu driver Robert Bragg
                   ` (2 preceding siblings ...)
  2014-10-22 15:28 ` [RFC PATCH 3/3] i915: Expose PMU for Observation Architecture Robert Bragg
@ 2014-10-23  5:58 ` Ingo Molnar
  2014-10-24 13:39   ` Robert Bragg
  2014-10-30 19:08 ` Peter Zijlstra
  4 siblings, 1 reply; 16+ messages in thread
From: Ingo Molnar @ 2014-10-23  5:58 UTC (permalink / raw)
  To: Robert Bragg
  Cc: linux-kernel, Peter Zijlstra, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, Daniel Vetter, Chris Wilson, Rob Clark,
	Samuel Pitoiset, Ben Skeggs


* Robert Bragg <robert@sixbynine.org> wrote:

> [...]
> 
> I'd be interested to hear whether it sounds reasonable to 
> others for us to expose gpu device metrics via a perf pmu and 
> whether adding the PERF_PMU_CAP_IS_DEVICE flag as in my 
> following patch could be acceptable.

I think it's perfectly reasonable, it's one of the interesting 
kernel features I hoped for years would be implemented for perf.

> [...]
> 
> In addition I also explicitly black list numerous attributes 
> and PERF_SAMPLE_ flags that I don't think make sense for a 
> device pmu. This could be handled in the pmu driver but it 
> seemed better to do in events/core, avoiding duplication in 
> case we later have multiple device pmus.

Btw., if the GPU is able to dump (part of) its current execution 
status, you could in theory even do instruction profiling and 
in-GPU profiling (symbol resolution, maybe even annotation, etc.) 
with close to standard perf tooling - which I think is currently 
mostly the domain of proprietary tools.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 3/3] i915: Expose PMU for Observation Architecture
  2014-10-22 15:28 ` [RFC PATCH 3/3] i915: Expose PMU for Observation Architecture Robert Bragg
@ 2014-10-23  7:47   ` Chris Wilson
  2014-10-24  2:33     ` Robert Bragg
  0 siblings, 1 reply; 16+ messages in thread
From: Chris Wilson @ 2014-10-23  7:47 UTC (permalink / raw)
  To: Robert Bragg
  Cc: linux-kernel, Peter Zijlstra, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, Daniel Vetter, Rob Clark,
	Samuel Pitoiset, Ben Skeggs

On Wed, Oct 22, 2014 at 04:28:51PM +0100, Robert Bragg wrote:
> +	/* XXX: Not sure that this is really acceptable...
> +	 *
> +	 * i915_gem_context.c currently owns pinning/unpinning legacy
> +	 * context buffers and although that code has a
> +	 * get_context_alignment() func to handle a different
> +	 * constraint for gen6 we are assuming it's fixed for gen7
> +	 * here. Another option besides pinning here would be to
> +	 * instead hook into context switching and update the
> +	 * OACONTROL configuration on the fly.
> +	 */
> +	if (dev_priv->oa_pmu.specific_ctx) {
> +		struct intel_context *ctx = dev_priv->oa_pmu.specific_ctx;
> +		int ret;
> +
> +		ret = i915_gem_obj_ggtt_pin(ctx->legacy_hw_ctx.rcs_state,
> +					    4096, 0);

Right, if you pin it here with a different alignment, when we try to pin
it with the required hw ctx alignment it will fail. Easiest way is to
record the ctx->legacy_hw_ctx.alignment and reuse that here.

> +		if (ret) {
> +			DRM_DEBUG_DRIVER("Couldn't pin %d\n", ret);
> +			ret = -EBUSY;

As an exercise, think of all the possible error values from pin() and
tell me why overriding that here is a bad, bad idea.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 3/3] i915: Expose PMU for Observation Architecture
  2014-10-23  7:47   ` Chris Wilson
@ 2014-10-24  2:33     ` Robert Bragg
  2014-10-24  6:56       ` Chris Wilson
  0 siblings, 1 reply; 16+ messages in thread
From: Robert Bragg @ 2014-10-24  2:33 UTC (permalink / raw)
  To: Chris Wilson
  Cc: linux-kernel, Peter Zijlstra, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, Daniel Vetter, Rob Clark,
	Samuel Pitoiset, Ben Skeggs

On Thu, Oct 23, 2014 at 8:47 AM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> On Wed, Oct 22, 2014 at 04:28:51PM +0100, Robert Bragg wrote:
>> +     /* XXX: Not sure that this is really acceptable...
>> +      *
>> +      * i915_gem_context.c currently owns pinning/unpinning legacy
>> +      * context buffers and although that code has a
>> +      * get_context_alignment() func to handle a different
>> +      * constraint for gen6 we are assuming it's fixed for gen7
>> +      * here. Another option besides pinning here would be to
>> +      * instead hook into context switching and update the
>> +      * OACONTROL configuration on the fly.
>> +      */
>> +     if (dev_priv->oa_pmu.specific_ctx) {
>> +             struct intel_context *ctx = dev_priv->oa_pmu.specific_ctx;
>> +             int ret;
>> +
>> +             ret = i915_gem_obj_ggtt_pin(ctx->legacy_hw_ctx.rcs_state,
>> +                                         4096, 0);
>
> Right if you pin it here with a different alignment, when we try to pin
> it with the required hw ctx alignment it will fail. Easiest way is to
> record the ctx->legacy_hw_ctx.alignment and reuse that here.

Ok I can look into that a bit more. I'm not currently sure I can assume the
ctx will have been pinned before, to be able to record the alignment.
Skimming i915_gem_context.c, it looks like we only pin the default context
on creation and a user could open a perf event before we first switch to that
context.

I wonder if it would be ok to expose an i915_get_context_alignment() api to
deal with this?

>
>> +             if (ret) {
>> +                     DRM_DEBUG_DRIVER("Couldn't pin %d\n", ret);
>> +                     ret = -EBUSY;
>
> As an exercise, think of all the possible error values from pin() and
> tell me why overriding that here is a bad, bad idea.

Hmm, I'm not quite sure why I decided to squash the error code there, it
looks pretty arbitrary. My take on your comment a.t.m is essentially that
some of the pin() errors don't really represent a busy state where it would
make sense for userspace to try again later; such as -ENODEV. Sorry if you
saw a very specific case that offended you :-) I have removed the override
locally.

Thanks for taking a look.

- Robert

> -Chris
>
> --
> Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 3/3] i915: Expose PMU for Observation Architecture
  2014-10-24  2:33     ` Robert Bragg
@ 2014-10-24  6:56       ` Chris Wilson
  0 siblings, 0 replies; 16+ messages in thread
From: Chris Wilson @ 2014-10-24  6:56 UTC (permalink / raw)
  To: Robert Bragg
  Cc: linux-kernel, Peter Zijlstra, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, Daniel Vetter, Rob Clark,
	Samuel Pitoiset, Ben Skeggs

On Fri, Oct 24, 2014 at 03:33:14AM +0100, Robert Bragg wrote:
> On Thu, Oct 23, 2014 at 8:47 AM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> > On Wed, Oct 22, 2014 at 04:28:51PM +0100, Robert Bragg wrote:
> >> +     /* XXX: Not sure that this is really acceptable...
> >> +      *
> >> +      * i915_gem_context.c currently owns pinning/unpinning legacy
> >> +      * context buffers and although that code has a
> >> +      * get_context_alignment() func to handle a different
> >> +      * constraint for gen6 we are assuming it's fixed for gen7
> >> +      * here. Another option besides pinning here would be to
> >> +      * instead hook into context switching and update the
> >> +      * OACONTROL configuration on the fly.
> >> +      */
> >> +     if (dev_priv->oa_pmu.specific_ctx) {
> >> +             struct intel_context *ctx = dev_priv->oa_pmu.specific_ctx;
> >> +             int ret;
> >> +
> >> +             ret = i915_gem_obj_ggtt_pin(ctx->legacy_hw_ctx.rcs_state,
> >> +                                         4096, 0);
> >
> > Right if you pin it here with a different alignment, when we try to  pin
> > it with the required hw ctx alignment it will fail. Easiest way is to
> > record the ctx->legacy_hw_ctx.alignment and reuse that here.
> 
> Ok I can look into that a bit more. I'm not currently sure I can assume the
> ctx will have been pinned before, to be able to record the alignment.
> Skimming i915_gem_context.c, it looks like we only pin the default context
> on creation and a user could open a perf event before we first switch to that
> context.
> 
> I wonder if it would be ok to expose an i915_get_context_alignment() api to
> deal with this?

I would either add intel_context_pin_state()/unpin_state() or expose
ctx->...state_alignment. Leaning towards the former so that we don't have
too many places mucking around inside ctx.

> >
> >> +             if (ret) {
> >> +                     DRM_DEBUG_DRIVER("Couldn't pin %d\n", ret);
> >> +                     ret = -EBUSY;
> >
> > As an exercise, think of all the possible error values from pin() and
> > tell me why overriding that here is a bad, bad idea.
> 
> Hmm, I'm not quite sure why I decided to squash the error code there, it
> looks pretty arbitrary. My take on your comment a.t.m is essentially that
> some of the pin() errors don't really represent a busy state where it would
> make sense for userspace to try again later; such as -ENODEV. Sorry if you
> saw a very specific case that offended you :-) I have removed the override
> locally.

Or EINTR/EAGAIN and try again immediately. ;)
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 0/3] Expose gpu counters via perf pmu driver
  2014-10-23  5:58 ` [RFC PATCH 0/3] Expose gpu counters via perf pmu driver Ingo Molnar
@ 2014-10-24 13:39   ` Robert Bragg
  0 siblings, 0 replies; 16+ messages in thread
From: Robert Bragg @ 2014-10-24 13:39 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Peter Zijlstra, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, Daniel Vetter, Chris Wilson, Rob Clark,
	Samuel Pitoiset, Ben Skeggs

On Thu, Oct 23, 2014 at 6:58 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Robert Bragg <robert@sixbynine.org> wrote:
>
>> [...]
>>
>> I'd be interested to hear whether it sounds reasonable to
>> others for us to expose gpu device metrics via a perf pmu and
>> whether adding the PERF_PMU_CAP_IS_DEVICE flag as in my
>> following patch could be acceptable.
>
> I think it's perfectly reasonable, it's one of the interesting
> kernel features I hoped for years would be implemented for perf.

Ok, that's good to hear, thanks.

>
>> [...]
>>
>> In addition I also explicitly black list numerous attributes
>> and PERF_SAMPLE_ flags that I don't think make sense for a
>> device pmu. This could be handled in the pmu driver but it
>> seemed better to do in events/core, avoiding duplication in
>> case we later have multiple device pmus.
>
> Btw., if the GPU is able to dump (part of) its current execution
> status, you could in theory even do instruction profiling and
> in-GPU profiling (symbol resolution, maybe even annotation, etc.)
> with close to standard perf tooling - which I think is currently
> mostly the domain of proprietary tools.

I'm not entirely sure, but there are certainly quite a few hw debug features
I haven't really explored yet that might lend themselves to supporting
something like this. At least there's a breakpoint mechanism that looks like
it might help for something like this, including providing a way to focus on
specific threads of interest which would be pretty important here to reduce
how much state would have to be periodically captured. I'd certainly be
interested to experiment with this later.

With respect to the OA counters I'm looking to expose first, I would note
that the periodic counter snapshots are written by a fixed-function unit,
so we only have a limited set of layouts to choose from.

Beyond the OA counters I can certainly see us wanting further pmus for other
metrics though. For reference Chris Wilson experimented with some related
ideas last year, here:

 http://cgit.freedesktop.org/~ickle/linux-2.6/commit/?h=perf&id=f32c19f4fb3a3bda92ab7bd0b9f95da14e81ca0a

Regards,
- Robert

>
> Thanks,
>
>         Ingo

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 0/3] Expose gpu counters via perf pmu driver
  2014-10-22 15:28 [RFC PATCH 0/3] Expose gpu counters via perf pmu driver Robert Bragg
                   ` (3 preceding siblings ...)
  2014-10-23  5:58 ` [RFC PATCH 0/3] Expose gpu counters via perf pmu driver Ingo Molnar
@ 2014-10-30 19:08 ` Peter Zijlstra
  2014-11-03 21:47   ` Robert Bragg
  4 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2014-10-30 19:08 UTC (permalink / raw)
  To: Robert Bragg
  Cc: linux-kernel, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, Daniel Vetter, Chris Wilson, Rob Clark,
	Samuel Pitoiset, Ben Skeggs

On Wed, Oct 22, 2014 at 04:28:48PM +0100, Robert Bragg wrote:
> Our desired permission model seems consistent with perf's current model
> whereby you would need privileges if you want to profile across all gpu
> contexts but not need special permissions to profile your own context.
> 
> The awkward part is that it doesn't make sense for us to have userspace
> open a perf event with a specific pid as the way to avoid needing root
> permissions because a side effect of doing this is that the events will
> be dynamically added/deleted so as to only monitor while that process is
> scheduled and that's not really meaningful when we're monitoring the
> gpu.

There is precedent in PERF_FLAG_PID_CGROUP to replace the pid argument
with a fd to your object.

And do I take it right that if you're able/allowed/etc.. to open/have
the fd to the GPU/DRM/DRI whatever context you have the right
credentials to also observe these counters?

> Conceptually I suppose we want to be able to open an event that's not
> associated with any cpu or process, but to keep things simple and fit
> with perf's current design, the pmu I have a.t.m expects an event to be
> opened for a specific cpu and unspecified process.

There are no actual scheduling ramifications right? Let me ponder this
for a little while more..

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 0/3] Expose gpu counters via perf pmu driver
  2014-10-30 19:08 ` Peter Zijlstra
@ 2014-11-03 21:47   ` Robert Bragg
  2014-11-05 12:33     ` Peter Zijlstra
  0 siblings, 1 reply; 16+ messages in thread
From: Robert Bragg @ 2014-11-03 21:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, Daniel Vetter, Chris Wilson, Rob Clark,
	Samuel Pitoiset, Ben Skeggs

On Thu, Oct 30, 2014 at 7:08 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, Oct 22, 2014 at 04:28:48PM +0100, Robert Bragg wrote:
>> Our desired permission model seems consistent with perf's current model
>> whereby you would need privileges if you want to profile across all gpu
>> contexts but not need special permissions to profile your own context.
>>
>> The awkward part is that it doesn't make sense for us to have userspace
>> open a perf event with a specific pid as the way to avoid needing root
>> permissions because a side effect of doing this is that the events will
>> be dynamically added/deleted so as to only monitor while that process is
>> scheduled and that's not really meaningful when we're monitoring the
>> gpu.
>
> There is precedent in PERF_FLAG_PID_CGROUP to replace the pid argument
> with a fd to your object.

Ah ok, interesting.

>
> And do I take it right that if you're able/allowed/etc.. to open/have
> the fd to the GPU/DRM/DRI whatever context you have the right
> credentials to also observe these counters?

Right, and in particular, since we want to allow OpenGL clients to be
able to profile their own gpu context without any special privileges,
my current pmu driver accepts a device file descriptor via config1 + a
context id via attr->config, both for checking credentials and
uniquely identifying which context should be profiled. (A single
client can open multiple contexts via one drm fd)

That said though, when running as root it is not currently a
requirement to pass any fd when configuring an event to profile across
all gpu contexts. I'm just mentioning this because although I think it
should be ok for us to use an fd to determine credentials and help
specify a gpu context, an fd might not be necessary for system wide
profiling cases.

>
>> Conceptually I suppose we want to be able to open an event that's not
>> associated with any cpu or process, but to keep things simple and fit
>> with perf's current design, the pmu I have a.t.m expects an event to be
>> opened for a specific cpu and unspecified process.
>
> There are no actual scheduling ramifications right? Let me ponder this
> for a little while more..

Ok, I can't say I'm familiar enough with the core perf infrastructure
to be entirely sure about this.

I recall looking at how some of the uncore perf drivers were working
and it looked like they had a similar issue where conceptually the pmu
doesn't belong to a specific cpu and so the id would internally get
mapped to some package state, shared by multiple cpus.

My understanding had been that being associated with a specific cpu
did have the side effect that most of the pmu methods for that event
would then be invoked on that cpu through inter-processor interrupts. At
one point that had seemed slightly problematic because there weren't
many places within my pmu driver where I could assume I was in process
context and could sleep. This was a problem with an earlier version
because the way I read registers had a slim chance of needing to sleep
waiting for the gpu to come out of RC6, but isn't a problem any more.

One thing that does come to mind here though is that I am overloading
pmu->read() as a mechanism for userspace to trigger a flush of all
counter snapshots currently in the gpu circular buffer to userspace as
perf events. Perhaps it would be best if that work (which might be
relatively costly at times) were done in the context of the process
issuing the flush(), instead of under an IPI (assuming that has some
effect on scheduler accounting).

Regards,
- Robert

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 0/3] Expose gpu counters via perf pmu driver
  2014-11-03 21:47   ` Robert Bragg
@ 2014-11-05 12:33     ` Peter Zijlstra
  2014-11-06  0:37       ` Robert Bragg
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2014-11-05 12:33 UTC (permalink / raw)
  To: Robert Bragg
  Cc: linux-kernel, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, Daniel Vetter, Chris Wilson, Rob Clark,
	Samuel Pitoiset, Ben Skeggs

On Mon, Nov 03, 2014 at 09:47:17PM +0000, Robert Bragg wrote:

> > And do I take it right that if you're able/allowed/etc.. to open/have
> > the fd to the GPU/DRM/DRI whatever context you have the right
> > credentials to also observe these counters?
> 
> Right and in particular since we want to allow OpenGL clients to be
> able the profile their own gpu context with out any special privileges
> my current pmu driver accepts a device file descriptor via config1 + a
> context id via attr->config, both for checking credentials and
> uniquely identifying which context should be profiled. (A single
> client can open multiple contexts via one drm fd)

Ah interesting. So we've got fd+context_id+event_id to identify any one
number provided by the GPU.

> That said though, when running as root it is not currently a
> requirement to pass any fd when configuring an event to profile across
> all gpu contexts. I'm just mentioning this because although I think it
> should be ok for us to use an fd to determine credentials and help
> specify a gpu context, an fd might not be necessary for system wide
> profiling cases.

Hmm, how does root know what context_id to provide? Are those exposed
somewhere? Is there also a root context, one that encompasses all
others?

> >> Conceptually I suppose we want to be able to open an event that's not
> >> associated with any cpu or process, but to keep things simple and fit
> >> with perf's current design, the pmu I have a.t.m expects an event to be
> >> opened for a specific cpu and unspecified process.
> >
> > There are no actual scheduling ramifications right? Let me ponder this
> > for a little while more..
> 
> Ok, I can't say I'm familiar enough with the core perf infrastructure
> to be entirely sure about this.

Yeah, so I don't think so. It's on the device, nothing the CPU/scheduler
does affects what the device does.

> I recall looking at how some of the uncore perf drivers were working
> and it looked like they had a similar issue where conceptually the pmu
> doesn't belong to a specific cpu and so the id would internally get
> mapped to some package state, shared by multiple cpus.

Yeah, we could try and map these devices to a cpu on their node -- PCI
devices are node local. But I'm not sure we need to start out by doing
that.

> My understanding had been that being associated with a specific cpu
> did have the side effect that most of the pmu methods for that event
> would then be invoked on that cpu through inter-processor interrupts. At
> one point that had seemed slightly problematic because there weren't
> many places within my pmu driver where I could assume I was in process
> context and could sleep. This was a problem with an earlier version
> because the way I read registers had a slim chance of needing to sleep
> waiting for the gpu to come out of RC6, but isn't a problem any more.

Right, so I suppose we could make a new global context for these
device-like things and avoid some of that song and dance. But we can do that
later.

> One thing that does come to mind here though is that I am overloading
> pmu->read() as a mechanism for userspace to trigger a flush of all
> counter snapshots currently in the gpu circular buffer to userspace as
> perf events. Perhaps it would be best if that work (which might be
> relatively costly at times) were done in the context of the process
> issuing the flush(), instead of under an IPI (assuming that has some
> effect on scheduler accounting).

Right, so given you tell the GPU to periodically dump these stats (per
context I presume), you can at a similar interval schedule whatever to
flush this and update the relevant event->count values and have a NO-OP
pmu::read() method.

If the GPU provides interrupts to notify you of new data or whatnot, you
can make that drive the thing.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 0/3] Expose gpu counters via perf pmu driver
  2014-11-05 12:33     ` Peter Zijlstra
@ 2014-11-06  0:37       ` Robert Bragg
  2014-11-10 11:13         ` Ingo Molnar
  0 siblings, 1 reply; 16+ messages in thread
From: Robert Bragg @ 2014-11-06  0:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, Daniel Vetter, Chris Wilson, Rob Clark,
	Samuel Pitoiset, Ben Skeggs

On Wed, Nov 5, 2014 at 12:33 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Nov 03, 2014 at 09:47:17PM +0000, Robert Bragg wrote:
>
>> > And do I take it right that if you're able/allowed/etc.. to open/have
>> > the fd to the GPU/DRM/DRI whatever context you have the right
>> > credentials to also observe these counters?
>>
>> Right and in particular since we want to allow OpenGL clients to be
>> able to profile their own gpu context without any special privileges
>> my current pmu driver accepts a device file descriptor via config1 + a
>> context id via attr->config, both for checking credentials and
>> uniquely identifying which context should be profiled. (A single
>> client can open multiple contexts via one drm fd)
>
> Ah interesting. So we've got fd+context_id+event_id to identify any one
> number provided by the GPU.

Roughly.

The fd represents the device we're interested in.

Since a single application can manage multiple unique gpu contexts for
submitting work we have the context_id to identify which one in
particular we want to collect metrics for.

The event_id here though really represents a set of counters that are
written out together in a hardware specific report layout.

On Haswell there are 8 different report layouts that basically trade
off how many counters to include, ranging from 13 to 61 32-bit counters
plus a 64-bit timestamp. I exposed this format choice in the event
configuration. It's notable that all of the counter values written in
one report are captured atomically with respect to the gpu clock.
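
Just to make that concrete, here is a sketch of what the largest of
those Haswell report formats could look like as a C struct - the field
names are purely illustrative, not taken from the PRM or the driver:

```c
#include <stdint.h>

/* Hypothetical layout for the largest Haswell OA report format: an
 * 8-byte report header, a 64-bit gpu timestamp and 61 32-bit
 * aggregating counters, all captured atomically with respect to the
 * gpu clock. Field names are made up for illustration. */
struct __attribute__((packed)) oa_report_61 {
	uint32_t report_id;     /* report header: id/reason */
	uint32_t gpu_ticks;     /* gpu clock counter */
	uint64_t timestamp;     /* 64-bit gpu timestamp */
	uint32_t a_counter[61]; /* aggregating (A) counters */
};

/* 8-byte header + 8-byte timestamp + 61 * 4 bytes = 260 bytes. */
_Static_assert(sizeof(struct oa_report_61) == 260,
	       "unexpected report size");
```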

Within the reports most of the counters are hard-wired and they are
referred to as Aggregating counters, including things like:

* number of cycles the render engine was busy for
* number of cycles the gpu was active
* number of cycles the gpu was stalled
(I'll just gloss over what distinguishes each of these states)
* number of active cycles spent running a vertex shader
* number of stalled cycles spent running a vertex shader
* number of vertex shader threads spawned
* number of active cycles spent running a pixel shader
* number of stalled cycles spent running a pixel shader
* number of pixel shader threads spawned
...

The values are aggregated across all of the gpu's execution units
(e.g. up to 40 units on Haswell).

Besides these aggregating counters the reports also include a gpu
clock counter which allows us to normalize these values into something
more intuitive for profiling.

There is a further small set of counters, referred to as B counters in
the public PRMs, that are also included in these reports. The hardware
has some configurability for these counters, but given the constraints
on configuring them the expectation would be to just allow userspace
to specify an enum for certain pre-defined configurations. (E.g. a
configuration that exposes a well defined set of B counters useful
for OpenGL profiling vs GPGPU profiling.)

I had considered uniquely identifying each of the A counters with
separate perf event ids, but I think the main reasons I decided
against that in the end are:

Since they are written atomically, the counters in a snapshot are all
related, and the analysis to derive useful values for benchmarking
typically needs to refer to multiple counters in a single snapshot at
a time. E.g. to report the "Average cycles per vertex shader thread"
we would need to divide the number of cycles spent running a vertex
shader by the number of vertex shader threads spawned. If we split the
counters up we'd then need to do work to correlate them again in
userspace.

My other concern was actually with memory bandwidth, considering that
it's possible to request the gpu to write out periodic snapshots at a
very high frequency (we can program a period as low as 160
nanoseconds). Pushing this to the limit (running as root + overriding
perf_event_max_sample_rate) can start to expose some interesting
details about how the gpu is working - though with notable observer
effects too. I was expecting memory bandwidth to be the limiting
factor for what resolution we can achieve this way, and splitting the
counters up looked like it would have quite a big impact, due to the
extra sample headers and the fact that the gpu timestamp would need to
be repeated with each counter. E.g. in the most extreme case, instead
of an 8-byte header + 61 counters * 4 bytes + an 8-byte timestamp
every 160ns ~= 1.6GB/s, each counter would need to be paired with a
gpu timestamp + header, so we could have 61 * (8 + 4 + 8) bytes
~= 7.6GB/s. To be fair though, if the counters were split up we
probably wouldn't often need a full set of 61 counters.

One last thing to mention here is that this first pmu driver that I
have written only relates to one very specific observation unit within
the gpu that happens to expose counters via reports/snapshots. There
are other interesting gpu counters I could imagine exposing through
separate pmu drivers too where the counters might simply be accessed
via mmio and for those cases I would imagine having a 1:1 mapping
between event-ids and counters.

>
>> That said though; when running as root it is not currently a
>> requirement to pass any fd when configuring an event to profile across
>> all gpu contexts. I'm just mentioning this because although I think it
>> should be ok for us to use an fd to determine credentials and help
>> specify a gpu context, an fd might not be necessary for system wide
>> profiling cases.
>
> Hmm, how does root know what context_id to provide? Are those exposed
> somewhere? Is there also a root context, one that encompasses all
> others?

No, it's just that the observation unit has two modes of operation;
either we can ask the unit to only aggregate counters for a specific
context_id or tell it to aggregate across all contexts.

>
>> >> Conceptually I suppose we want to be able to open an event that's not
>> >> associated with any cpu or process, but to keep things simple and fit
>> >> with perf's current design, the pmu I have a.t.m expects an event to be
>> >> opened for a specific cpu and unspecified process.
>> >
>> > There are no actual scheduling ramifications, right? Let me ponder
>> > this for a little while more...
>>
>> Ok, I can't say I'm familiar enough with the core perf infrastructure
>> to be entirely sure about this.
>
> Yeah, so I don't think so. Its on the device, nothing the CPU/scheduler
> does affects what the device does.
>
>> I recall looking at how some of the uncore perf drivers were working
>> and it looked like they had a similar issue where conceptually the pmu
>> doesn't belong to a specific cpu and so the id would internally get
>> mapped to some package state, shared by multiple cpus.
>
> Yeah, we could try and map these devices to a cpu on their node -- PCI
> devices are node local. But I'm not sure we need to start out by doing
> that.
>
>> My understanding had been that being associated with a specific cpu
>> did have the side effect that most of the pmu methods for that event
>> would then be invoked on that cpu through inter-processor interrupts. At
>> one point that had seemed slightly problematic because there weren't
>> many places within my pmu driver where I could assume I was in process
>> context and could sleep. This was a problem with an earlier version
>> because the way I read registers had a slim chance of needing to sleep
>> waiting for the gpu to come out of RC6, but isn't a problem any more.
>
> Right, so I suppose we could make a new global context for these
> device-like things and avoid some of that song and dance. But we can
> do that later.

Sure, at least for now it seems workable.

>
>> One thing that does come to mind here though is that I am overloading
>> pmu->read() as a mechanism for userspace to trigger a flush of all
>> counter snapshots currently in the gpu circular buffer to userspace as
>> perf events. Perhaps it would be best if that work (which might be
>> relatively costly at times) were done in the context of the process
>> issuing the flush(), instead of under an IPI (assuming that has some
>> effect on scheduler accounting).
>
> Right, so given you tell the GPU to periodically dump these stats (per
> context I presume), you can at a similar interval schedule whatever to
> flush this and update the relevant event->count values and have an NO-OP
> pmu::read() method.
>
> If the GPU provides interrupts to notify you of new data or whatnot, you
> can make that drive the thing.
>

Right, I'm already ensuring the events will be forwarded within a
finite time using a hrtimer, currently at 200Hz, but there are also
times when userspace wants to pull from the driver too.

The use case here is supporting the INTEL_performance_query OpenGL
extension, where an application that submits work to render on the
gpu can also start and stop performance queries around specific work
and then ask for the results. Given how the queries are delimited,
Mesa can determine when the work being queried has completed, and at
that point the application can request the results of the query.

In this model Mesa will have configured a perf event to deliver
periodic counter snapshots, but it only really cares about snapshots
that fall between the start and end of a query. For this use case the
periodic snapshots are just to detect counters wrapping, so the
period will be relatively long at ~50 milliseconds. At the end of a
query Mesa won't know whether any periodic snapshots fell between the
start and end, so it wants to explicitly flush at a point where it
knows any snapshots will be ready if there are any.

Alternatively I think I could arrange it so that Mesa relies on
knowing the driver will forward snapshots at 200Hz, and we could delay
informing the application that results are ready until we are certain
they must have been forwarded. I think the api could allow us to do
that (except for one awkward case where the application can demand a
synchronous response, where we'd potentially have to sleep). My
concern here is having to rely on a fixed and relatively high
frequency for forwarding events, which seems like it should be left
as an implementation detail that userspace shouldn't need to know.

I'm guessing it could also be good at some point for the hrtimer
frequency to be derived from the buffer size, report sizes and
sampling frequency instead of being fixed, but this could be difficult
to change if userspace needs to make assumptions about it, and it
could also increase the time userspace would have to wait before it
could be sure outstanding snapshots have been received.

Hopefully that explains why I'm overloading read() like this currently.

Regards
- Robert


* Re: [RFC PATCH 0/3] Expose gpu counters via perf pmu driver
  2014-11-06  0:37       ` Robert Bragg
@ 2014-11-10 11:13         ` Ingo Molnar
  2014-11-12 23:33           ` Robert Bragg
  0 siblings, 1 reply; 16+ messages in thread
From: Ingo Molnar @ 2014-11-10 11:13 UTC (permalink / raw)
  To: Robert Bragg
  Cc: Peter Zijlstra, linux-kernel, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, Daniel Vetter, Chris Wilson, Rob Clark,
	Samuel Pitoiset, Ben Skeggs


* Robert Bragg <robert@sixbynine.org> wrote:

> On Wed, Nov 5, 2014 at 12:33 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Mon, Nov 03, 2014 at 09:47:17PM +0000, Robert Bragg wrote:
> >
> >> > And do I take it right that if you're able/allowed/etc.. to open/have
> >> > the fd to the GPU/DRM/DRI whatever context you have the right
> >> > credentials to also observe these counters?
> >>
> >> Right, and in particular since we want to allow OpenGL clients to be
> >> able to profile their own gpu context without any special privileges,
> >> my current pmu driver accepts a device file descriptor via config1 + a
> >> context id via attr->config, both for checking credentials and
> >> uniquely identifying which context should be profiled. (A single
> >> client can open multiple contexts via one drm fd)
> >
> > Ah interesting. So we've got fd+context_id+event_id to identify any one
> > number provided by the GPU.
> 
> Roughly.
> 
> The fd represents the device we're interested in.
> 
> Since a single application can manage multiple unique gpu contexts for
> submitting work we have the context_id to identify which one in
> particular we want to collect metrics for.
> 
> The event_id here though really represents a set of counters that are
> written out together in a hardware specific report layout.
> 
> On Haswell there are 8 different report layouts that basically trade
> off how many counters to include from 13 to 61 32bit counters plus 1
> 64bit timestamp. I exposed this format choice in the event
> configuration. It's notable that all of the counter values written in
> one report are captured atomically with respect to the gpu clock.
> 
> Within the reports most of the counters are hard-wired and they are
> referred to as Aggregating counters, including things like:
> 
> * number of cycles the render engine was busy for
> * number of cycles the gpu was active
> * number of cycles the gpu was stalled
> (i'll just gloss over what distinguishes each of these states)
> * number of active cycles spent running a vertex shader
> * number of stalled cycles spent running a vertex shader
> * number of vertex shader threads spawned
> * number of active cycles spent running a pixel shader
> * number of stalled cycles spent running a pixel shader
> * number of pixel shader threads spawned
> ...

Just curious:

Beyond aggregated counts, do the GPU reports also allow sampling 
the PC of the vertex shader and pixel shader execution?

That would allow effective annotated disassembly of them and 
bottleneck analysis - much like 'perf annotate' and how you can 
drill into annotated assembly code in 'perf report' and 'perf 
top'.

Secondly, do you also have cache hit/miss counters (with sampling 
ability) for the various caches the GPU utilizes: such as the LLC 
it shares with the CPU, or GPU-specific caches (if any) such as 
the vertex cache? Most GPU shader performance problems relate to 
memory access patterns and the above aggregate counts only tell 
us the global picture.

Thirdly, if taken branch instructions block/stall non-taken 
threads within an execution unit (like it happens on other vector 
CPUs) then being able to measure/sample current effective thread 
concurrency within an execution unit is generally useful as well, 
to be able to analyze this major class of GPU/GPGPU performance 
problems.

> The values are aggregated across all of the gpu's execution 
> units (e.g. up to 40 units on Haswell)
> 
> Besides these aggregating counters the reports also include a 
> gpu clock counter which allows us to normalize these values 
> into something more intuitive for profiling.

Modern GPUs can also change their clock frequency depending on 
load - is the GPU clock normalized by the hardware to a known 
fixed frequency, or does it change as the GPU's clock changes?

> [...]
> 
> I had considered uniquely identifying each of the A counters 
> with separate perf event ids, but I think the main reasons I 
> decided against that in the end are:
> 
> Since they are written atomically the counters in a snapshot 
> are all related and the analysis to derive useful values for 
> benchmarking typically needs to refer to multiple counters in a 
> single snapshot at a time. E.g. to report the "Average cycles 
> per vertex shader thread" would need to measure the number of 
> cycles spent running a vertex shader / the number of vertex 
> shader threads spawned. If we split the counters up we'd then 
> need to do work to correlate them again in userspace.
> 
> My other concern was actually with memory bandwidth, 
> considering that it's possible to request the gpu to write out 
> periodic snapshots at a very high frequency (we can program a 
> period as low as 160 nanoseconds) and pushing this to the limit 
> (running as root + overriding perf_event_max_sample_rate) can 
> start to expose some interesting details about how the gpu is 
> working - though notable observer effects too. I was expecting 
> memory bandwidth to be the limiting factor for what resolution 
> we can achieve this way and splitting the counters up looked 
> like it would have quite a big impact, due to the extra sample 
> headers and that the gpu timestamp would need to be repeated 
> with each counter. e.g. in the most extreme case, instead of 
> 8byte header + 61 counters * 4 bytes + 8byte timestamp every 
> 160ns ~= 1.6GB/s, each counter would need to be paired with a 
> gpu timestamp + header so we could have 61 * (8 + 4 + 8)bytes 
> ~= 7.6GB/s. To be fair though it's likely that if the counters 
> were split up we probably wouldn't often need a full set of 61 
> counters.

If you really want to collect such high frequency data then you 
are probably right in trying to compress the report format as 
much as possible.

> One last thing to mention here is that this first pmu driver 
> that I have written only relates to one very specific 
> observation unit within the gpu that happens to expose counters 
> via reports/snapshots. There are other interesting gpu counters 
> I could imagine exposing through separate pmu drivers too where 
> the counters might simply be accessed via mmio and for those 
> cases I would imagine having a 1:1 mapping between event-ids 
> and counters.

I'd strongly suggest thinking about sampling as well, if the 
hardware exposes sample information: at least for profiling CPU 
loads the difference is like day and night, compared to 
aggregated counts and self-profiling.

> > [...]
> >
> > If the GPU provides interrupts to notify you of new data or 
> > whatnot, you can make that drive the thing.
> 
> Right, I'm already ensuring the events will be forwarded within 
> a finite time using a hrtimer, currently at 200Hz but there are 
> also times where userspace wants to pull at the driver too.
> 
> The use case here is supporting the INTEL_performance_query 
> OpenGL extension, where an application which can submit work to 
> render on the gpu and can also start and stop performance 
> queries around specific work and then ask for the results. 
> Given how the queries are delimited Mesa can determine when the 
> work being queried has completed and at that point the 
> application can request the results of the query.
> 
> In this model Mesa will have configured a perf event to deliver 
> periodic counter snapshots, but it only really cares about 
> snapshots that fall between the start and end of a query. For 
> this use case the periodic snapshots are just to detect 
> counters wrapping and so the period will be relatively low at 
> ~50milliseconds. At the end of a query Mesa won't know whether 
> there are any periodic snapshots that fell between the 
> start-end so it wants to explicitly flush at a point where it 
> knows any snapshots will be ready if there are any.
> 
> Alternatively I think I could arrange it so that Mesa relies on 
> knowing the driver will forward snapshots @ 200Hz and we could 
> delay informing the application that results are ready until we 
> are certain they must have been forwarded. I think the api 
> could allow us to do that (except for one awkward case where 
> the application can demand a synchronous response where we'd 
> potentially have to sleep) My concern here is having to rely on 
> a fixed and relatively high frequency for forwarding events 
> which seems like it should be left as an implementation detail 
> that userspace shouldn't need to know.

It's a very good idea to not expose such limitations to 
user-space - the GPU driver doing the necessary hrtimer polling 
to construct a proper count is a much higher quality solution.

The last thing you want to ask yourself when seeing some weird 
profiling result is 'did user-space properly poll the PMU or did 
we overflow??'. Instrumentation needs to be rock solid dependable 
and fast, in that order.

Thanks,

	Ingo


* Re: [RFC PATCH 0/3] Expose gpu counters via perf pmu driver
  2014-11-10 11:13         ` Ingo Molnar
@ 2014-11-12 23:33           ` Robert Bragg
  2014-11-16  9:27             ` Ingo Molnar
  0 siblings, 1 reply; 16+ messages in thread
From: Robert Bragg @ 2014-11-12 23:33 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, linux-kernel, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, Daniel Vetter, Chris Wilson, Rob Clark,
	Samuel Pitoiset, Ben Skeggs

On Mon, Nov 10, 2014 at 11:13 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Robert Bragg <robert@sixbynine.org> wrote:
>

<snip>

>> On Haswell there are 8 different report layouts that basically trade
>> off how many counters to include from 13 to 61 32bit counters plus 1
>> 64bit timestamp. I exposed this format choice in the event
>> configuration. It's notable that all of the counter values written in
>> one report are captured atomically with respect to the gpu clock.
>>
>> Within the reports most of the counters are hard-wired and they are
>> referred to as Aggregating counters, including things like:
>>
>> * number of cycles the render engine was busy for
>> * number of cycles the gpu was active
>> * number of cycles the gpu was stalled
>> (i'll just gloss over what distinguishes each of these states)
>> * number of active cycles spent running a vertex shader
>> * number of stalled cycles spent running a vertex shader
>> * number of vertex shader threads spawned
>> * number of active cycles spent running a pixel shader
>> * number of stalled cycles spent running a pixel shader
>> * number of pixel shader threads spawned
>> ...
>
> Just curious:
>
> Beyond aggregated counts, do the GPU reports also allow sampling
> the PC of the vertex shader and pixel shader execution?
>
> That would allow effective annotated disassembly of them and
> bottleneck analysis - much like 'perf annotate' and how you can
> drill into annotated assembly code in 'perf report' and 'perf
> top'.

No, I'm afraid these particular counter reports from the OA unit can't
give us access to EU instruction pointers or other EU registers, even
considering the set of configurable counters that can be exposed
besides the aggregate counters. These OA counters are more-or-less
just boolean event counters.

Your train of thought got me wondering, though, whether it would be
possible to sample instruction pointers of EU threads periodically, so
I spent a bit of time investigating how it could potentially be
implemented, out of curiosity. I found at least one possible approach,
but one thing that became apparent is that it wouldn't really be
possible to handle this neatly from the kernel; it would need tightly
coupled support from Mesa in userspace too...

Gen EUs have some support for exception handling where an exception
could be triggered periodically (not internally by the gpu, but rather
by the cpu) and the EUs made to run a given 'system routine' which
would be able to sample the instruction pointer of the interrupted
threads. One of the difficulties is that it wouldn't be possible for
the kernel to directly setup a system routine for profiling like this,
since the pointer for the routine is set via a STATE_SIP command that
requires a pointer relative to the 'instruction base pointer' which is
state that's really owned and setup by userspace drivers.

Incidentally, our current driver stack doesn't utilise system
routines for anything, so at least something like this wouldn't
conflict with an existing feature. Some experiments were done with
system routines by Ben Widawsky some years ago, with the aim of
using them for debugging as opposed to profiling, which means he
has some code knocking around (intel-gpu-tools/debugger) that could
make it possible to put together an experiment for this.

For now I'd like to continue with enabling access to the OA counters
via perf if possible, since that's much lower hanging fruit but should
still allow a decent range of profiling tools.  If I get a chance
though, I'm tempted to see if I can use Ben's code as a basis to
experiment with this idea.

>
> Secondly, do you also have cache hit/miss counters (with sampling
> ability) for the various caches the GPU utilizes: such as the LLC
> it shares with the CPU, or GPU-specific caches (if any) such as
> the vertex cache? Most GPU shader performance problems relate to
> memory access patterns and the above aggregate counts only tell
> us the global picture.

Right, we can expose some of these via OA counter reports, through the
configurable counters. E.g. we can get a counter for the number of L3
cache read/write transactions via the LLC which can be converted into
a throughput. There are also other interesting counters relating to
the texture samplers for example, that are a common bottleneck.

My initial i915_oa driver doesn't look at exposing those yet since we
still need to work through an approval process for some of the
details. My first interest was to start with creating a driver to
expose the features and counters we already have published public docs
for, which in turn let me send out this RFC sooner rather than later.

>
> Thirdly, if taken branch instructions block/stall non-taken
> threads within an execution unit (like it happens on other vector
> CPUs) then being able to measure/sample current effective thread
> concurrency within an execution unit is generally useful as well,
> to be able to analyze this major class of GPU/GPGPU performance
> problems.

Right, Gen EUs try to co-issue instructions from multiple threads at
the same time, so long as they aren't contending for the same units.

I'm not currently sure of a way to get insight into this for Haswell,
but for Broadwell we gain some more aggregate EU counters (actually
some of them become customisable) and then it's possible to count the
issuing of instructions for some of the sub-units that allow
co-issuing.

>
>> The values are aggregated across all of the gpu's execution
>> units (e.g. up to 40 units on Haswell)
>>
>> Besides these aggregating counters the reports also include a
>> gpu clock counter which allows us to normalize these values
>> into something more intuitive for profiling.
>
> Modern GPUs can also change their clock frequency depending on
> load - is the GPU clock normalized by the hardware to a known
> fixed frequency, or does it change as the GPU's clock changes?

Sadly on Haswell, while these OA counters are enabled we need to
disable RC6 and also render trunk clock gating, so this obviously has
an impact on profiling that needs to be taken into account.

On Broadwell I think we should be able to enable both though and in
that case the gpu will automatically write additional counter
snapshots when transitioning in and out of RC6 as well as when the
clock frequency changes.

<snip>

>
>> One last thing to mention here is that this first pmu driver
>> that I have written only relates to one very specific
>> observation unit within the gpu that happens to expose counters
>> via reports/snapshots. There are other interesting gpu counters
>> I could imagine exposing through separate pmu drivers too where
>> the counters might simply be accessed via mmio and for those
>> cases I would imagine having a 1:1 mapping between event-ids
>> and counters.
>
> I'd strong suggest thinking about sampling as well, if the
> hardware exposes sample information: at least for profiling CPU
> loads the difference is like day and night, compared to
> aggregated counts and self-profiling.

Here I was thinking of counters or data that can be sampled via mmio
using a hrtimer. E.g. the current gpu frequency or the energy usage.
I'm not currently aware of any capability for the gpu to, say, trigger
an interrupt after a threshold number of events (like clock cycles)
occurs, so I think we may generally be limited to a wall clock time
domain for sampling.

As above, I'll also keep in mind, experimenting with being able to
sample EU IPs at some point too.

>
>> > [...]
>> >
>> > If the GPU provides interrupts to notify you of new data or
>> > whatnot, you can make that drive the thing.
>>
>> Right, I'm already ensuring the events will be forwarded within
>> a finite time using a hrtimer, currently at 200Hz but there are
>> also times where userspace wants to pull at the driver too.
>>
>> The use case here is supporting the INTEL_performance_query
>> OpenGL extension, where an application which can submit work to
>> render on the gpu and can also start and stop performance
>> queries around specific work and then ask for the results.
>> Given how the queries are delimited Mesa can determine when the
>> work being queried has completed and at that point the
>> application can request the results of the query.
>>
>> In this model Mesa will have configured a perf event to deliver
>> periodic counter snapshots, but it only really cares about
>> snapshots that fall between the start and end of a query. For
>> this use case the periodic snapshots are just to detect
>> counters wrapping and so the period will be relatively low at
>> ~50milliseconds. At the end of a query Mesa won't know whether
>> there are any periodic snapshots that fell between the
>> start-end so it wants to explicitly flush at a point where it
>> knows any snapshots will be ready if there are any.
>>
>> Alternatively I think I could arrange it so that Mesa relies on
>> knowing the driver will forward snapshots @ 200Hz and we could
>> delay informing the application that results are ready until we
>> are certain they must have been forwarded. I think the api
>> could allow us to do that (except for one awkward case where
>> the application can demand a synchronous response where we'd
>> potentially have to sleep) My concern here is having to rely on
>> a fixed and relatively high frequency for forwarding events
>> which seems like it should be left as an implementation detail
>> that userspace shouldn't need to know.
>
> It's a very good idea to not expose such limitations to
> user-space - the GPU driver doing the necessary hrtimer polling
> to construct a proper count is a much higher quality solution.

That sounds preferable.

I'm open to suggestions for finding another way for userspace to
initiate a flush besides through read(), in case there's a concern
that this might set a bad precedent. For the i915_oa driver it seems
ok at the moment, since we don't currently report a useful counter
through read(), and for the main use case where we want the flushing
we expect that most of the time there won't be any significant cost
involved in flushing, since we'll be using a very short timer period.
Maybe this will bite us later though.

>
> The last thing you want to ask yourself when seeing some weird
> profiling result is 'did user-space properly poll the PMU or did
> we overflow??'. Instrumentation needs to be rock solid dependable
> and fast, in that order.

That sounds like good advice.

Thanks,
- Robert


* Re: [RFC PATCH 0/3] Expose gpu counters via perf pmu driver
  2014-11-12 23:33           ` Robert Bragg
@ 2014-11-16  9:27             ` Ingo Molnar
  0 siblings, 0 replies; 16+ messages in thread
From: Ingo Molnar @ 2014-11-16  9:27 UTC (permalink / raw)
  To: Robert Bragg
  Cc: Peter Zijlstra, linux-kernel, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, Daniel Vetter, Chris Wilson, Rob Clark,
	Samuel Pitoiset, Ben Skeggs


* Robert Bragg <robert@sixbynine.org> wrote:

> > I'd strong[ly] suggest thinking about sampling as well, if 
> > the hardware exposes sample information: at least for 
> > profiling CPU loads the difference is like day and night, 
> > compared to aggregated counts and self-profiling.
> 
> Here I was thinking of counters or data that can be sampled via 
> mmio using a hrtimer. E.g. the current gpu frequency or the 
> energy usage. I'm not currently aware of any capability for the 
> gpu to say trigger an interrupt after a threshold number of 
> events occurs (like clock cycles) so I think we may generally 
> be limited to a wall clock time domain for sampling.

In general hrtimer-driven polling gives pretty good profiling 
information as well - key is to be able to get a sample of EU 
thread execution state.

(Trigger thresholds and so can be useful as well, but are a 
second order concern in terms of profiling quality.)

> > It's a very good idea to not expose such limitations to 
> > user-space - the GPU driver doing the necessary hrtimer 
> > polling to construct a proper count is a much higher quality 
> > solution.
> 
> That sounds preferable.
> 
> I'm open to suggestions for finding another way for userspace 
> to initiate a flush besides through read() in case there's a 
> concern that might be set a bad precedent. For the i915_oa 
> driver it seems ok at the moment since we don't currently 
> report a useful counter through read() and for the main use 
> case where we want the flushing we expect that most of the time 
> there won't be any significant cost involved in flushing since 
> we'll be using a very low timer period. Maybe this will bite us 
> later though.

You could add an ioctl() as well - we are not religious about 
them; there are always things that are special enough to not 
warrant a generic syscall.

Anyway, aggregate counts alone are obviously very useful for 
analyzing GPU performance, so your initial approach looks 
perfectly acceptable to me already.

Thanks,

	Ingo


end of thread, other threads:[~2014-11-16  9:27 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-10-22 15:28 [RFC PATCH 0/3] Expose gpu counters via perf pmu driver Robert Bragg
2014-10-22 15:28 ` [RFC PATCH 1/3] perf: export perf_event_overflow Robert Bragg
2014-10-22 15:28 ` [RFC PATCH 2/3] perf: Add PERF_PMU_CAP_IS_DEVICE flag Robert Bragg
2014-10-22 15:28 ` [RFC PATCH 3/3] i915: Expose PMU for Observation Architecture Robert Bragg
2014-10-23  7:47   ` Chris Wilson
2014-10-24  2:33     ` Robert Bragg
2014-10-24  6:56       ` Chris Wilson
2014-10-23  5:58 ` [RFC PATCH 0/3] Expose gpu counters via perf pmu driver Ingo Molnar
2014-10-24 13:39   ` Robert Bragg
2014-10-30 19:08 ` Peter Zijlstra
2014-11-03 21:47   ` Robert Bragg
2014-11-05 12:33     ` Peter Zijlstra
2014-11-06  0:37       ` Robert Bragg
2014-11-10 11:13         ` Ingo Molnar
2014-11-12 23:33           ` Robert Bragg
2014-11-16  9:27             ` Ingo Molnar
